Statistical Computing and Graphics in R

# Statistical Models

Statistical Data Analysis

## The Poisson Regression in R

The Poisson regression model should be used when the dependent (response) variable is in the form of counts or values of the response variables following a Poisson distribution. In R, glm() function can be used to perform Poisson regression analysis.

The Poisson regression is used to analyze count data.

For the Poisson model, let us consider another built-in data set warpbreaks. This data set describes the effect of wool type (A or B) and tension (Low, Medium, and High) on the number of warp breaks per loom, where a loom corresponds to a fixed length of yarn.

head(warpbreaks)

The $breaks$ variable is considered a response variable since it contains the number of breaks (count of breaks). The $tension$ and $type$ variables are taken as predictor variables.

pois_mod <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

The output from the pois_mod object is

The glm() provides eight choices for a family with the following default link functions:

The detailed output (estimation and testing of parameters) can be obtained as

summary(pois_mod)

Example:

• A number of cargo ships were damaged by waves (McCullagh & Nelder, 1989).
• Number of deaths due to AIDs in Australia per quarter (3 month periods) from January 1983 – June 1986.
• A number of violent incidents were exhibited over a 6-month period by patients who had been treated in the ER of a psychiatric hospital (Gardner, Mulvey, & Shaw, 1995).
• Daily homicide counts in California (Grogger, 1990).
• Founding of daycare centers in Toronto (Baum & Oliver, 1992).
• Political party-switching among members of the US House of Representatives (King, 1988).
• Number of presidential appointments to the Supreme Court (King, 1987).
• A number of children in a classroom that a child lists as being their friend (unlimited nomination procedure, sociometric data).
• A number of hard disk failures during a year.
• Number of deaths due to SARs (Yu, Chan & Fung, 2006).
• A number of arrests resulted from 911 calls.
• A number of orders of protection were issued.

## Non-Linear Regression Model

In the least square method, the regression model is established in such a way that the sum of the squares of the vertical distances of different points (residuals) from the regression line is minimized. When the relationship between the variables is not linear (one has a non-linear regression model), one may

1. try to transform the data to linearize the relationship,
2. fit polynomial or complex spline model to the data, or
3. fit a non-linear regression to the data.

In the non-linear regression model, a function is specified by a set of parameters to fit the data. The non-linear least squares approach is used to estimate such parameters. In R, the nls() is used to approximate the non-linear function using a linear one and iteratively try to find the best parameter values.

Some frequently used non-linear regression models are listed in the Table below.

Let fit Michaelis-Menten non-linear function to the data given below.

x <- seq(1, 10, 1)
y <- c(3.7, 7.1, 11.9, 19, 27, 38.5, 51, 67.7, 85, 102)

nls_model <- nls(y ~ a * x/(1 + b * x), start = list(a = 1, b = 1))

summary(nls_model)
#### Output
Formula: y ~ a * x/(1 + b * x)

Parameters:
Estimate Std. Error t value Pr(>|t|)
a  4.107257   0.226711   18.12 8.85e-08 ***
b -0.060900   0.002708  -22.49 1.62e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.805 on 8 degrees of freedom

Number of iterations to convergence: 11
Achieved convergence tolerance: 6.354e-06

Let plot the non-linear predicted values from 10 data points of newly generated x-values

new.data <- data.frame(x = seq(min(x), max(x), len = 10))

plot(x, y)

lines(new.data$x, predict(nls_model, newdata = new.data) ) The sum of squared residuals and the confidence interval of the chosen values of the coefficient can be obtained by issuing the commands, sum(resid(nls_model)^2) # or print(sum(resid(nls_model)^2)) confint(nls_model) # or print(confint(nls_model)) Note that the formula for nls() does not use special coding in linear terms, factors, interactions, etc. The right-hand side in the expression of nls() computes the expected value to the left-hand side. The start argument contains the list of starting values of the parameter used in the expression and is varied by the algorithm. ## Logistic Regression Models in R In logistic regression models, the response variable ($y$) is of categorical (binary, dichotomous) values such as 1 or 0 (TRUE/ FALSE). It measures the probability of a binary response variable based on mathematical equation relating the values of response variable with the predictor(s). The built-in function glm() can be used to perform logistic regression analysis. The odds are used in logistic regression. If$p$is the probability of success, the odds of in favour of success are,$\frac{p}{q}=\frac{p}{1-p}$. Note that probability can be converted to odds and odds can also be converted to probability. However, unlike probability, odds can exceed 1. For example, if the probability of an event is 0.25, the odds in the favour of that event are$\frac{0.25}{0.75}=0.33$, but the odds against the event are$\frac{0.75}{0.25}=3$. In built-in dataset (“mtcars“), the column (am) describes the transmission mode (automatic or manual) which is of binary value (0 or 1). The logistic regression models between the column (response variable) am and other columns (regressors) namely hp, wt, and cyl can be performed as given are: ### Logistic regression with one dichotomous predictor logmodel1 <- glm(am ~ vs, family = "binomial") summary(logmodel1) ### Logistic regression with one continuous predictor logmodel2 <- glm(am ~ wt, family = "binomial") summary(logmodel2) ### Logistic regression with multiple predictors logmodel3 <- glm(am ~ cyl + hp + wt, family = "binomial") summary(logmodel3) plot(logmodel3) Note: in logistic regression model, both dichotomous and continuous variables can be used as predictor. In R, the coefficients returned by logistic regression are a logit, or the log of the odds. To convert logits to odds ratio exponentiate it and to convert logits to probability use$\frac{e^\beta}{1-e^\beta}$. For example, logmodel1 <- glm(am ~ vs, family = "binomial", data = mtcars) logit_coef <- logmodel1$coef

exp(logmodel1$coef) exp(logit_coef)/(1 + exp(logmodel1$coef))

## Generalized Linear Models (GLM) in R

The generalized linear models (GLM) can be used when the distribution of the response variable is non-normal or when the response variable is transformed into linearity. The GLMs are flexible extensions of linear models that are used to fit the regression models to non-Gaussian data.

The basic form of a Generalized linear model is
\begin{align*}
g(\mu_i) &= X_i’ \beta \\
&= \beta_0 + \sum\limits_{j=1}^p x_{ij} \beta_j
\end{align*}
where $\mu_i=E(U_i)$ is the expected value of the response variable $Y_i$ given the predictors, $g(\cdot)$ is a smooth and monotonic link function that connects $\mu_i$ to the predictors, $X_i’=(x_{i0}, x_{i1}, \cdots, x_{ip})$ is the known vector having $i$th observations with $x_{i0}=1$, and $\beta=(\beta_0, \beta_1, \cdots, \beta_p)’$ is the unknown vector of regression coefficients.

The glm() is a function that can be used to fit a generalized linear model, using the form

mod <- glm(formula, family = gaussian, data = data.frame)

The family argument is a description of the error distribution and link function to be used in the model.

The class of generalized linear models is specified by giving a symbolic description of the linear predictor and a description of the error distribution.

### Generalized Linear Model Example in R

Consider the “cars” dataset available in R.

data(cars)
head(cars)
attach(cars)
scatter.smooth(x=speed, y=dist, main = "Dist ~ Speed")

# Linear Model
lm(dist ~ speed, data = cars)
summary(lm(dist ~ speed, data = cars)

# Generalized Linear Model
glm(dist ~ speed, data=cars, family = "binomial")
plot(glm(dist ~ speed, data = cars))
summary(glm(dist ~ speed, data = cars))
Scroll to top