
Statistical Models in R Language

R provides an interlocking suite of facilities that make fitting statistical models very simple. The default output from a fitted model is minimal; further details are obtained by calling extractor functions such as summary(), coef(), and residuals().
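
As a minimal sketch of this workflow, the following fits a simple linear regression to simulated data and then asks for details through extractor functions. The data, the variable names x and y, and the true coefficients are assumptions made for illustration only.

```r
# Simulated data; names and true coefficients are illustrative assumptions
set.seed(1)
x <- runif(50)
y <- 2 + 3 * x + rnorm(50, sd = 0.5)

fit <- lm(y ~ x)       # fit a simple linear regression of y on x

print(fit)             # minimal default output: the call and the coefficients

# Extractor functions return further details on request
coef(fit)              # estimated coefficients
head(residuals(fit))   # first few residuals
summary(fit)           # standard errors, t-tests, R-squared
anova(fit)             # analysis of variance table
```

Printing the fitted object shows only the call and the coefficient estimates; everything else has to be requested explicitly.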

Defining Statistical Models; Formulae in R Language

The template for a statistical model is a linear regression model with independent, homoscedastic errors, that is
y_i = \sum_{j=0}^p \beta_j x_{ij} + e_i, \quad e_i \sim NID(0, \sigma^2), \quad i=1,2,\dots, n

In matrix form, the statistical model can be written as
y=X\beta+e,
where y is the response (dependent) vector and X is the model matrix or design matrix (the matrix of regressors) with columns x_0, x_1, \cdots, x_p, the determining variables. Usually x_0 is a column of ones defining an intercept term in the model.
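
The design matrix can be inspected directly in R. The sketch below, using simulated data with assumed variable names x1 and x2, builds X with model.matrix() and checks that the fitted values of the corresponding lm() fit equal X multiplied by the estimated coefficient vector.

```r
# Simulated data; names and true coefficients are illustrative assumptions
set.seed(2)
n  <- 6
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = 0.1)

# Design matrix for the formula: columns (Intercept), x1, x2
X <- model.matrix(y ~ x1 + x2)
X                      # the first column, x_0, is a column of ones

fit      <- lm(y ~ x1 + x2)
beta_hat <- coef(fit)

# Fitted values are X %*% beta_hat
all.equal(as.vector(X %*% beta_hat), as.vector(fitted(fit)))
```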

Statistical Model Examples
Suppose y, x, x_0, x_1, x_2, \cdots are numeric variables and X is a matrix. The following examples specify statistical models in R.

  • y ~ x    or   y ~ 1 + x
    Both examples imply the same simple linear regression model of y on x. The first formula has an implicit intercept term and the second has an explicit intercept term.
  • y ~ 0 + x  or  y ~ -1 + x  or  y ~ x - 1
    All these imply the same simple linear regression model of y on x through the origin, that is, without an intercept term.
  • log(y) ~ x1 + x2
    Implies a multiple regression of the transformed variable log(y) on x_1 and x_2, with an implicit intercept term.
  • y ~ poly(x, 2)  or  y ~ 1 + x + I(x^2)
    Both imply a polynomial regression of y on x of degree 2. The first formula uses orthogonal polynomials and the second uses explicit powers as the basis.
  • y ~ X + poly(x, 2)
    Implies a multiple regression of y with a model matrix consisting of the matrix X as well as polynomial terms in x to degree 2.
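
Some of these equivalences can be checked numerically. The following sketch uses simulated data (the variables and coefficients are assumptions for illustration) to compare the implicit and explicit intercept forms, the forms without an intercept, and the two polynomial parameterisations.

```r
# Simulated data; names and true coefficients are illustrative assumptions
set.seed(3)
x <- seq(1, 10, length.out = 40)
y <- 1 + 0.5 * x - 0.1 * x^2 + rnorm(40, sd = 0.3)

# Implicit and explicit intercept give the same fit
all.equal(coef(lm(y ~ x)), coef(lm(y ~ 1 + x)))

# Regression through the origin: only a slope is estimated
names(coef(lm(y ~ 0 + x)))
names(coef(lm(y ~ x - 1)))

# Orthogonal and raw polynomial bases give different coefficients
# but the same fitted values
fit_orth <- lm(y ~ poly(x, 2))
fit_raw  <- lm(y ~ 1 + x + I(x^2))
all.equal(fitted(fit_orth), fitted(fit_raw))
```

Note that I() is needed in the raw form because ^ has a special meaning inside a formula; I(x^2) tells R to treat it as ordinary arithmetic.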

Note that the operator ~ is used to define a model formula in R. The general form of a formula for an ordinary linear regression model is
response ~ op_1 term_1 op_2 term_2 op_3 term_3 \cdots ,

where

  • response is a vector or matrix defining the response (dependent) variable(s).
  • op_i is an operator, either + or -, implying the inclusion or exclusion of a term in the model. The first operator, op_1, is optional.
  • term_i is either a vector or matrix expression, or 1, or a factor, or a formula expression consisting of factors, vectors or matrices connected by formula operators.
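
The effect of the + and - operators can be seen without fitting anything by passing a formula to terms(). This is a small sketch; the variable names x1, x2 and x3 are hypothetical and need not exist, because terms() only parses the formula.

```r
# Include x1, x2 and x3, then exclude x2 again
f <- y ~ x1 + x2 + x3 - x2
attr(terms(f), "term.labels")   # terms remaining in the model
attr(terms(f), "intercept")     # 1 means an intercept is included

# Excluding the constant term 1 removes the intercept
g <- y ~ x1 - 1
attr(terms(g), "intercept")     # 0 means no intercept
```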
