Statistical Models

Statistical Data Analysis

lm Function in R

Many generic functions are available for computing regression coefficients, testing coefficients, extracting residuals or predicted values, and so on. Therefore, a good grasp of the lm() function is necessary. Suppose we have performed the regression analysis using the lm() function, as done in the previous lesson.

mod <- lm(mpg ~ hp, data = mtcars)

The object returned by the lm() function has class “lm”. Objects of the “lm” class are stored with mode “list”.

class(mod)
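Because the fitted object is stored internally as a list, its storage mode can be checked as well (a small optional check):

mode(mod)   # returns "list"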

The names of the components of an “lm” object can be queried via

names(mod)

All the components of an “lm” object can be accessed directly. For example,

mod$rank
mod$coef   # or mod$coefficients

The following is a list of some generic functions for a fitted “lm” model.

Generic Function               Short Description
print()                        prints or displays the results in the R Console
summary()                      prints or displays regression coefficients, their standard errors, t-ratios, p-values, and significance
coef()                         extracts regression coefficients
residuals() or resid()         extracts residuals of the fitted model
fitted() or fitted.values()    extracts fitted values
anova()                        performs comparisons of nested models
predict()                      computes predicted values for new data
plot()                         draws diagnostic plots of the regression model
confint()                      computes confidence intervals for regression coefficients
deviance()                     computes the residual sum of squares
vcov()                         computes the estimated variance-covariance matrix
logLik()                       computes the log-likelihood
AIC(), BIC()                   compute information criteria
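A few of these extractor functions in action, applied to the mod object fitted above (output omitted here):

coef(mod)          # estimated intercept and slope
head( resid(mod) ) # first few residuals
deviance(mod)      # residual sum of squares
AIC(mod)           # Akaike information criterion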

It is better to save the object returned by the summary() function.

The summary() function returns an object of class “summary.lm”, and its components can be queried via

sum_mod <- summary(mod)
names(sum_mod)

names( summary(mod) )

The individual components of the summary object can be accessed as

sum_mod$residuals
sum_mod$r.squared
sum_mod$adj.r.squared
sum_mod$df
sum_mod$sigma
sum_mod$fstatistic
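The complete coefficient table (estimates, standard errors, t-values, and p-values) is stored in the coefficients component of the summary object:

sum_mod$coefficients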

Confidence intervals for the estimated coefficients can be computed as

confint(mod, level = 0.95)

Note that the level argument is optional when the confidence level is 95% (significance level of 5%), since 0.95 is the default.
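For instance, a 99% confidence interval, or an interval for the hp coefficient only, can be requested as

confint(mod, level = 0.99)
confint(mod, parm = "hp")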

The confidence interval for the mean response and the prediction interval for an individual response, at hp (regressor) values of 200 and 160, can be computed as

predict(mod, newdata=data.frame(hp = c(200, 160)), interval = "confidence" )
predict(mod, newdata=data.frame(hp = c(200, 160)), interval = "prediction" )

These intervals can also be used to compute and visualize confidence or prediction bands over a range of regressor values. For example,

x <- seq(50, 350, length.out = 32)

pred <- predict(mod, newdata = data.frame(hp = x), interval = "prediction")
plot(mpg ~ hp, data = mtcars)
lines(pred[, 1] ~ x, col = 1)  # fitted values
lines(pred[, 2] ~ x, col = 2)  # lower limit
lines(pred[, 3] ~ x, col = 2)  # upper limit

For diagnostic plots, the plot() function can be used; it produces four graphs:

  • residuals vs fitted values
  • QQ plot of standardized residuals
  • scale-location plot of fitted values against the square root of standardized residuals
  • standardized residuals vs leverage

To produce only the QQ plot, for example, use

plot(mod, which = 2)

The which argument selects which of the four diagnostic plots is produced.
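To display all four diagnostic plots together, the plotting region can be split into a 2-by-2 grid first, for example:

par(mfrow = c(2, 2))   # arrange plots in a 2 x 2 grid
plot(mod)
par(mfrow = c(1, 1))   # reset the plotting layout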

Simple Linear Regression Model

The linear regression model is typically estimated by the ordinary least squares (OLS) technique. The model in general form is

$$Y_i = x_i'\beta + \varepsilon_i, \quad\quad i=1,2,\cdots,n$$

In matrix notation

$$y=X\beta + \varepsilon,$$

where $y$ is a vector of order $n\times 1$ that contains the values of the dependent variable, and $X=(x_1,x_2,\cdots,x_n)'$ is the regressor matrix containing the $n$ observations. The $X$ matrix is also called the model matrix (its columns represent the regressors). The $\beta$ is a $p\times 1$ vector of regression coefficients, and $\varepsilon$ is a vector of order $n\times 1$ containing the error terms.

The regression coefficients $\beta$ can be estimated as

$$\hat{\beta}=(X'X)^{-1}X'y$$

The fitted values can be computed as

$$\hat{y}=X\hat{\beta}$$

The residuals are

$$\hat{\varepsilon} = y - \hat{y}$$

The residual sum of squares is

$$\hat{\varepsilon}'\hat{\varepsilon}$$
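These expressions can be verified directly in R with basic matrix operations; the following is a small sketch, assuming mpg from the mtcars data as the response and hp as the single regressor (as in the example below):

X <- cbind(1, mtcars$hp)                         # model matrix with a column of ones
y <- mtcars$mpg
beta_hat <- solve( t(X) %*% X ) %*% t(X) %*% y   # (X'X)^{-1} X'y
y_hat <- X %*% beta_hat                          # fitted values
e_hat <- y - y_hat                               # residuals
sum(e_hat^2)                                     # residual sum of squares
coef( lm(mpg ~ hp, data = mtcars) )              # matches beta_hat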

The R language has excellent facilities for fitting linear models. The basic function for fitting linear models by the least squares method is the lm() function. The model is specified using formula notation.

We will consider the mtcars dataset. Let $Y = mpg$ and $X = hp$; the simple linear regression model is

$$Y_i = \beta_1 + \beta_2\, hp_i + \varepsilon_i$$

where $\beta_1$ is the intercept and $\beta_2$ is the slope coefficient.

To fit this simple linear regression model in R, one can follow:

attach(mtcars)
mod <- lm(mpg ~ hp)
mod

The lm() function uses the formula mpg ~ hp, with the response variable on the left of the tilde (~) and the predictor on the right. It is better to supply the data argument to the lm() function. That is,

mod <- lm(mpg ~ hp, data = mtcars)

The lm() function returns an object of class “lm”, saved here in the variable mod (the name can be different). Printing the object produces a brief report. For hypothesis testing of the regression coefficients, the summary() function should be used. It provides more information about the fitted model, such as standard errors, t-values, and p-values for each coefficient. For example,

summary(mod)

One can fit a regression model without an intercept term if required.

lm(mpg ~ hp - 1, data = mtcars)

For a graphical representation of the model, one can use the plot() function to draw the scatter points and the abline() function to draw the regression line.

plot(hp, mpg)
abline(mod)

Note the order of variables in the plot() function: the first argument represents the predictor variable, while the second argument represents the response variable.

The abline() function adds a line to the graph according to the intercept and slope taken from the fitted model object mod, or provided manually.
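Equivalently, the estimated intercept and slope can be supplied to abline() by hand, for example:

abline(a = coef(mod)[1], b = coef(mod)[2])   # same line as abline(mod)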

One can change the style of the regression line using the lty argument. Similarly, the color of the regression line can be changed from black to some other color using the col argument. That is,

plot(hp, mpg)
abline(mod, lty = 2, col = "blue")

Note that one can identify individual observations on a graph using the identify() function. For example,

identify(hp, mpg)

Note that to identify a point, place the mouse pointer near the point and press the left mouse button; to exit from the identify procedure, press the right mouse button or the ESC key.


Statistical Models in R Language

The R language provides an interlocking suite of facilities that make fitting statistical models very simple. The output from statistical models in R is minimal, and one needs to ask for the details by calling extractor functions.

Defining Statistical Models; Formulae in R Language

The template for a statistical model is a linear regression model with independent, homoscedastic errors, that is
$latex y_i = \sum_{j=0}^p \beta_j x_{ij} + e_i, \quad e_i \sim NID(0, \sigma^2), \quad i=1,2,\dots, n$

In matrix form, the statistical model can be written as
$latex y=X\beta+e$,
where $latex y$ is the dependent (response) variable, and $latex X$ is the model matrix or design matrix (the matrix of regressors), with columns $latex x_0, x_1, \cdots, x_p$, the determining variables. Usually $latex x_0$ is a column of ones defining an intercept term in the statistical model.
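For a fitted model, the model (design) matrix, including the column of ones for the intercept, can be inspected directly; a small illustration using the mod object fitted earlier:

head( model.matrix(mod) )   # first rows: intercept column and hp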

Statistical Model Examples
Suppose $latex y, x, x_0, x_1, x_2, \cdots$ are numeric variables and $latex X$ is a matrix. The following are some examples that specify statistical models in R.

  • y ~ x    or   y ~ 1 + x
    Both formulae imply the same simple linear regression model of $latex y$ on $latex x$. The first formula has an implicit intercept term, and the second has an explicit intercept term.
  • y ~ 0 + x  or  y ~ -1 + x  or  y ~ x - 1
    All of these imply the same simple linear regression model of $latex y$ on $latex x$ through the origin, that is, without an intercept term.
  • log(y) ~ x1 + x2
    Implies multiple regression of the transformed variable $latex \log(y)$ on $latex x_1$ and $latex x_2$, with an implicit intercept term.
  • y ~ poly(x, 2)  or  y ~ 1 + x + I(x^2)
    Both imply a polynomial regression model of $latex y$ on $latex x$ of degree 2 (a second-degree polynomial); the second formula uses explicit powers as the basis.
  • y ~ X + poly(x, 2)
    Multiple regression of $latex y$ with a model matrix consisting of the matrix $latex X$ as well as polynomial terms in $latex x$ to degree 2.
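As a small illustration, a few of these formula styles can be fitted to the mtcars data (a sketch; the choice of the variables hp and wt is arbitrary):

lm(mpg ~ hp, data = mtcars)             # implicit intercept
lm(log(mpg) ~ hp + wt, data = mtcars)   # transformed response, two regressors
lm(mpg ~ poly(hp, 2), data = mtcars)    # degree-2 polynomial in hp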

Note that the operator ~ is used to define a model formula in the R language. The form of an ordinary linear regression model is
$latex response \sim op_1\, term_1\, op_2\, term_2\, op_3\, term_3\, \cdots$,

where

The response is a vector or matrix defining the response (dependent) variable(s).
$latex op_i$ is an operator, either + or -, implying the inclusion or exclusion of a term in the model; the first operator is optional.
$latex term_i$ is either a vector, a matrix, or 1. It may also be a factor or a formula expression consisting of factors, vectors, or matrices connected by formula operators.
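The + and - operators are also handy with the update() function, which refits a model after adding or dropping terms; a brief sketch using the mtcars variables:

fm <- lm(mpg ~ hp + wt, data = mtcars)
update(fm, . ~ . - wt)     # drop wt from the model
update(fm, . ~ . + disp)   # add disp to the model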
