mctest: An R package for Detection of Collinearity among Regressors

The problem of multicollinearity plagues the numerical stability of regression estimates. It also causes some serious problem in validation and interpretation of the regression model. Consider the usual multiple linear regression model,

y = X \beta+u,

where y is an n\times 1 vector of observation on dependent variable, X is known design matrix of order \times p, having full-column rank p, \beta is p \times 1 vector of unknown parameters and u is an n\times 1 vector of random errors with mean zero and variance \sigma^2 I_n, where I_n is an identity matrix of order n.

Existence of linear dependence (relationship) between regressors can affect the regression model ability to estimate the model’s parameters (regression coefficients). Therefore, multicollinearity is lack of independence or the presence of interdependence signified by usually high inter-correlations within a set of regressors (predictors).

In case of sever multicollinearity (ill-conditioning) of X'X matrix, implausible signs, low t-ratios, high R-squared values, inflated standard errors, wider confidence intervals, very large condition number (CN) and non-significant and/or magnitude of regression coefficient estimates are some of possible issues.

There are many diagnostic methods are available to check the existence of collinearity among regressors, such as variance inflation Factor (VIF), values of pair-wise correlation among regressors, eigenvalues, CN, Farrar and Glauber tests, Theil’s measure, klein’s rule etc.

Our recently developed R package mctest computes several diagnostic measures to test the existence of collinearity among regressors. We classified these measures as individual collinearity diagnostic and overall collinearity diagnostics. Overall collinearity diagnostic include determinant of X'X matrix, red indicator, Farrar Chi-Square test, Theil indicator, CN, and sum of lambda inverse values. Individual collinearity diagnostics include VIF/ TOL, Farrar and Glaube Wi test, relationship between $R^2$ and F-test, corrected VIF (CVIF) and Klein’s rule.

How to use mctest package

You must have installed and load the mctest package to start with testing of collinearity among regressors. As an example, we used Hald data which is already bundled in mctest package.

mctest package have 4 functions namely, mctest(), omcdiag(), imcdiag() and mc.plot() functions. The mctest() function can be used to have overall and/or individual collinearity diagnostic. The mc.plot() can be used to draw graph of VIF and eigenvalues to have graphical judgement of among collinearity among regressors.

mctest illustrative Example
The argument of mctest is

mctest(x, y, type = c(“o”, “I”, “b”), na.rm = TRUE, Inter = TRUE, method = NULL, corr = FALSE, detr = 0.01, red = 0.5, theil = 0.5, cn = 30, vif = 10, tol = 0.1, conf = 0.95, cvif = 10, leamer = 0.1, all = all)

For detail of each argument see the mctest package documentation. Following are few commands that can be used get different collinearity diagnostics.

x<-Hald[ ,-1]  # X variables from Hald data
> y<-Hald[ ,1]   # y variable from Hald data
> mctest(x, y)   # default collinearity diagnostics
> mctest(x, y, type = “i”)  # individual collinearity diagnostics
> mctest(x, y, type = “o”) # overall collinearity diagnostics

Overall collinearity diagnostics

For overall collinearity diagnostics, eigenvalues and condition numbers are also produced either intercept term is included or not. The syntax of omcdiag() function is

omcdiag(x, y, na.rm = TRUE, Inter = True, detr = 0.01, red = 0.5, conf = 0.95, theil = 0.5, cn = 30, …)

Determinant of correlation matrix, Farrar test of Chi-square, Red indicator, sum of lambda inverse values, Theils’ indicator and CN.

> omcdiag(x, y, Inter=FALSE)
> omcdiag(x, y)[1]
> omcidag(x,y, detr=0.001, conf=0.99)

The output of last command (with threshold for determinant and confidence interval for Farrar and Glauber test).

mctest: overall collinearity diagnostics

Individual collinearity diagnostics

imcdiag(x, y, method = NULL, na.rm = TRUE, corr = FALSE, vif = 10, tol = 0.1, conf = 0.95, cvif = 10, leamer = 0.1, all = all)

The imcdiag() function detects the existence of multicollinearity due to certain X-variable. This includes VIF, TOL, Klein’s rule, CVIF, F&G test of Chi-square and F-test.

> imcdiag(x = x, y)
> imcdiag(x = x, y, corr = TRUE) # correlation matrix
> imcdiag(x = x, y, vif = 5, leamer = 0.05)   # with threshold of VIF and leamer method

mctest: individual collinearity diagnostics
> imcdiag(x = x, y, all = True)
> imcdiag(x = x, y, all = TRUE, vif = 5, leamer = 0.2, cvif = 5)

mctest: individual collinearity diagnostics

Graphical representation of VIF and Eigenvalues

> mc.plot(x, y, Inter = FALSE, vif = 10, ev = 0.01)
> mc.plot(x, y)
> mc.plot(x, y, vif = 5, ev =  0.2)

mctest: collinearity diagnostic measures

For further detail about collinearity diagnostic see

Statistical Models in R Language

R language provides an interlocking suite of facilities that make fitting statistical models very simple. The output from statistical models in R language is minimal and one needs to ask for the details by calling extractor functions.

Defining Statistical Models; Formulae in R Language

The template for a statistical model is a linear regression model with independent, heteroscedastic errors, that is
\sum_{j=0}^p \beta_j x_{ij}+ e_i, \quad e_i \sim NID(0, \sigma^2), \quad i=1,2,\dots, n, j=1,2,\cdots, p

In matrix form, statistical model can be written as
y=X\beta+e,
where the y is the dependent (response) variable, X is the model matrix or design matrix (matrix of regressors) and has columns x_0, x_1, \cdots, x_p, the determining variables with intercept term. Usually x_0 is a column of ones defining an intercept term in statistical model.

Statistical Model Examples
Suppose y, x, x_0, x_1, x_2, \cdots are numeric variables, X is a matrix. Following are some examples that specify statistical models in R.

  • y ~ x    or   y ~ 1 + x
    Both examples imply the same simple linear regression model of y on x. The first formulae has an implicit intercept term and the second formulae has an explicit intercept term.
  • y ~ 0 + x  or  y ~ -1 + x  or y ~ x – 1
    All these imply the same simple linear regression model of y on x through the origin, that is, without an intercept term.
  • log(y) ~ x1 + x2
    Imply multiple regression of the transformed variable, $latex(log(y)$ on x_1 and x_2 with an implicit intercept term.
  • y ~ poly(x , 2)  or  y ~ 1 + x + I(x, 2)
    Imply a polynomial regression model of $latex$ y on $ latex x$ of degree 2 (second degree polynomials) and the second formulae uses explicit powers as basis.
  • y~ X + poly(x, 2)
    Multiple regression y with model matrix consisting of the design matrix X as well as polynomial terms in x to degree 2.

Note that the operator ~ is used to define a model formula in R language. The form of an ordinary linear regression model is, response\,\, ~ \,\, op_1\,\, term_1\,\, op_2\,\, term_2\,\, op_3\,\, term_3\,\, \cdots ,

where

response is a vector or matrix defining the response (dependent) variable(s).
op_i is an operator, either + or -, implying the inclusion or exclusion of a term in the model. The + operator is optional.
term_i is either a matrix or vector or 1. It may be a factor or a formula expression consisting of factors, vectors or matrices connected by formula operators.

How to View Source Code of R Method/ Function?

Source Code of R Method

There are different ways to view the source code of an R method or function. It will help to know how function is working.

Internal Functions
If you want to see the source code of internal function (functions from base packages), just type the name of the function at R prompt such as;

> rowMeans
view R code of method

Functions or Methods from S3 Class System
For S3 classes, methods function can be used to list the methods for a particular generic function or class.

> methods(predict)

view R code S3 Class
Note that “Non-Visible functions are asterisked” means that the function is not exported from its package’s namespace.

One can still view its source code via the ::: function such as

> stats:::predict.lm

or by using getAnywhere() function, such as

> getAnywhere(predict.lm)

Note that the getAnywhere() function is useful as you don’t need to know from which package the function or method is came from.

Functions or Methods from S4 Class System
The S4 system is a newer method dispatch system and is an alternative to the S3 system. The package ‘Matrix’ is an example of S4 function.

> library(Matrix)
> chol2inv
view R code S4 Class
The output already offers a lot of information. The standardGeneric is an indicator of an S4 function. The method to see defined S4 methods is to use showMethods(chol2inv), that is;

>showMethods(chol2inv)
view R code S4 System
The getMethod can be used to see the source code of one of the methods, such as,
> getMethod (“chol2inv”, “diagonalMatrix”)
view R code S4 System
Functions that Calls Unexported Functions
In the case of unexported functions such as ts.union, .cbindts and .makeNamesTs from the stats namespace, one can view source code of these unexported functions using ::: operator or getAnywhere() function, for example;
> stats::: .makeNamesTs
> getAnywhere(.makeNamesTs)
view R code S4 System
View on Youtube

Greek letters in R plot label and title

Question: How one can include Greek letter (symbols) in R plot labels?
Answer: Greek letters or symbols can be included in titles and labels of graph using the expression command. Following are some examples

Note that in these example random data is generated from normal distribution. You can use your own data set to produce graphs that have symbols or Greek letters in their labels or titles.

Example 1:

> mycoef <- rnorm (1000)
> hist(mycoef, main = expression(beta) )

where beta in expression is Greek letter (symbol) of \beta. A histogram similar to following will be produced.

greek symbols in r plot-1

Example 2:

sample <- rnorm(mean=5, sd=1, n=100)
> hist(sample, main=expression( paste(“sampled values, “, mu, “=5, “, sigma, “=1” )))

where mu and sigma are symbols of \mu and \sigma respectively. Now histogram will look like

greek symbols in r plot-2Example 3:

curve(dnorm, from= -3, to=3, n=1000, main=”Normal Probability Density Function”)

will produce curve of Normal probability density function ranging from -3 to 3.

greek symbols in r plot-3To add normal density function formula, we need to use text and paste command, that is

> text(-2, 0.3, expression(f(x)== paste(frac(1, sqrt(2*pi* sigma^2 ) ), ” “, e^{frac(-(x-mu)^2, 2*sigma^2)})), cex=1.2)

Now the updated curve of Normal probability density function will be

greek symbols in r plot-4Example 4:

x <- dnorm( seq(-3, 3, 0.001))
> plot(seq(-3, 3, 0.001), cumsum(x)/sum(x), type=”l”, col=”blue”, xlab=”x”, main=”Normal Cumulative Distribution Function”)

The Normal Cumulative Distribution function will look like,

greek symbols in r plot-5To add formula, use text and paste command, that is

text(-1.5, 0.7, expression(phi(x)==paste(frac(1, sqrt(2*pi)), ” “, integral(e^(-t^2/2)*dt, -infinity, x))), cex=1.2)

The Curve of Normal Cumulative Distribution Function and its formula in plot will look like,

greek symbols in r plot-6

Import Data using read.table function

Question: How I can check my Working Directory so that I would be able to import my data in R.
Answer: To find working directory, the command getwd() can be used, that is

> getwd()

Question: How I can change working directory to my own path.
Answer: Use function setwd(), that is

> setwd(“d:/mydata”)
> setwd(“C:/Users/XYZ/Documents”)

Question: I have data set stored in text format (ASCII) that contain rectangular data. How I can read this data in tabular form. I have already set my working directory.
Answer: As data is already in a directory, which is set as working directory, use following command

> mydata <- read.table(“data.dat”)
> mydata <- read.table(“data.txt”)

mydata is named object that will have data from file “data.dat” or “data.txt” in data frame format. Each variable in data file will be named by default V1, V2, ….

Question: How this stored data can be to accessed?
Answer: To access the stored data, write data frame object name (“mydata”) with $ sign and name of the variable. That is,

mydata$V1
mydata$V2
mydata[“V1”]
mydata[,1]

Question: My data file has variables names in first row of the data file. In previous Question, variables names were V1, V2, V3, … How I can get actual names of the variable store in first row of data.dat file.
Answer: Instead of reading data file with default values of arguments, use

> read.table(“data.dat”, header = TRUE)

Question: I want to read a data file which is not store in working directory?
Answer: To access the data file which is not stored in working directory, provide complete path of the file, such as.

> read.table(“d:/data.dat” , header = TRUE)
> read.table(“d:/Rdata/data.txt” , header = TRUE)

Note that read.table() is used to read the data from external files that has a normally a special form:

  • The first line of the file should have a name for each variable in the data frame. However, if first row does not contains name of variable then header argument should not be set to FALSE.
  • Each additional line of the file has it first item a row label and the values for each variable.

In R it is strongly suggested that variables need to be held in data frame. For this purpose read.table() function can be used. For further details about read.table() function use,

help(read.table)

 

%d bloggers like this: