Performing Linear Regression in R
Regression builds a function of independent variables (also known as predictors, regressors, explanatory variables, or features) to predict a dependent variable (also called a response, target, or regressand). Here we will focus on performing linear regression in the R language.
Linear regression predicts the response as a linear function of the predictors, $$y=\beta_0+\beta_1x_1+\beta_2x_2+\cdots + \beta_kx_k,$$ where $x_1, x_2, \cdots, x_k$ are the predictors and $y$ is the response to be predicted.
Before performing the regression analysis, it is very helpful to compute the coefficient of correlation between the dependent and independent variables, and to draw a scatter diagram.
Performing Linear Regression in R
Load the mtcars data, and check its structure using the str() function.
str(mtcars)
If you have data stored in an external file, such as a CSV file, you can use the read.csv() function to load the data into R. To learn about importing data files in R, follow the link: Import Data files in R
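As a quick sketch of how read.csv() works (the file written here is created on the fly so the example is self-contained; in practice you would point read.csv() at your own file path):

```r
# Hypothetical example: write a small CSV file to a temporary location,
# then read it back with read.csv()
tmp <- tempfile(fileext = ".csv")
write.csv(mtcars, file = tmp, row.names = TRUE)

dat <- read.csv(tmp, row.names = 1)  # read.csv() returns a data frame
str(dat)
```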
Suppose we want to check the impact of weight (wt) on miles per gallon (mpg), test the significance of the regression coefficient, and examine other statistics to assess the goodness of our fitted model.
mod <- lm(mpg ~ wt, data = mtcars)
summary(mod)
Now look at the components of the results stored in mod.
names(mod)
Let us get the coefficients of the fitted regression model.
mod$coef
coef(mod)
To obtain confidence intervals for the estimated coefficients, one can use the confint() function.
confint(mod)
Fitted values from the regression model can be obtained by using the fitted() function.
mod$fitted
fitted(mod)
The residuals of the regression model can be obtained using the residuals() (or resid()) function.
mod$resid
resid(mod)
One can check the formula used to perform the simple/multiple regression. It tells you which variable is used as the response and which are the explanatory variables.
formula(mod)
To graphically visualize the relationship between variables or pairs of variables, one can use the plot() or pairs() functions. Let us draw the scatter diagram between the dependent variable mpg and the explanatory variable wt using the plot() function.
plot(mpg ~ wt, data = mtcars)
One can add the best-fitted line to the scatter plot. For this purpose, use abline() with an object of class lm, such as mod in this case.
abline(mod)
There are many other functions and R packages for fitting linear regression models in the R language. To learn more about the lm() function, click the link: lm() function in R
Probability Distributions in R
We often make probabilistic statements when working with statistical probability distributions. We typically want to know four things:
- The density (PDF) at a particular value,
- The cumulative distribution (CDF) at a particular value, that is, $P(X \le x)$,
- The quantile corresponding to a particular probability, and
- A random draw of values from a particular distribution.
R has plenty of functions for obtaining densities, cumulative probabilities, quantiles, and random numbers.
Consider a random variable $X$ which is $N(\mu = 2, \sigma^2 = 16)$. We want to:
1) Calculate the value of PDF at $x=3$ (that is, the height of the curve at $x=3$)
dnorm(x = 3, mean = 2, sd = sqrt(16))
dnorm(x = 3, mean = 2, sd = 4)
dnorm(x = 3, 2, 4)
2) Calculate the value of the CDF at $x=3$ (that is, $P(X\le 3)$)
pnorm(q = 3, mean = 2, sd = 4)
3) Calculate the quantile for probability 0.975
qnorm(p = 0.975, mean = 2, sd = 4)
4) Generate a random sample of size $n = 10$
rnorm(n = 10, mean = 2, sd = 4)
There are many probability distributions available in the R Language. I will list only a few.
Distribution | Density | Quantile | CDF | Random |
Binomial | dbinom() | qbinom() | pbinom() | rbinom() |
t | dt() | qt() | pt() | rt() |
Poisson | dpois() | qpois() | ppois() | rpois() |
F | df() | qf() | pf() | rf() |
Chi-Square | dchisq() | qchisq() | pchisq() | rchisq() |
Observe that a prefix (d, q, p, or r) is added to each distribution name.
Distribution | Distribution Name in R | Parameters |
Binomial | binom | n = number of trials, p = probability of success for one trial |
Geometric | geom | p = probability of success for one trial |
Poisson | pois | lambda = mean |
Beta | beta | shape1, shape2 |
Chi-Square | chisq | df = degrees of freedom |
F | f | df1, df2 = degrees of freedom |
Logistic | logis | location, scale |
Normal | norm | mean, sd |
Student's t | t | df = degrees of freedom |
Weibull | weibull | shape, scale |
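To illustrate the naming convention from the tables above, here are a few calls for the binomial, Poisson, and t distributions (the parameter values are chosen only for illustration):

```r
# A few illustrative calls following the d/q/p/r naming convention
dbinom(3, size = 10, prob = 0.5)  # P(X = 3) for Binomial(n = 10, p = 0.5)
pbinom(3, size = 10, prob = 0.5)  # P(X <= 3)
ppois(2, lambda = 4)              # P(X <= 2) for Poisson(lambda = 4)
qpois(0.5, lambda = 4)            # median of Poisson(lambda = 4)
pt(2.015, df = 5)                 # P(T <= 2.015) for t with 5 df
```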
Drawing the Density Function
The density function dnorm() can be used to draw the graph of a normal (or any other) distribution. Let us compare two normal distributions, both with mean = 20, one with sd = 6 and the other with sd = 3.
For this purpose, we need $x$-axis values covering roughly $\overline{x} \pm 3SD \Rightarrow 20 \pm 3\times 6$.
xaxis <- seq(0, 40, 0.5)
y1 <- dnorm(xaxis, 20, 6)
y2 <- dnorm(xaxis, 20, 3)
plot(xaxis, y2, type = "l", main = "Comparing two normal distributions", col = "blue")
points(xaxis, y1, type = "l", col = "red")
Finding Probabilities
# Left-tailed probability
pnorm(1.96)
# Area between two Z-scores
pnorm(1.96) - pnorm(-1.96)
Finding Right Tailed Probabilities
1 - pnorm(1.96)
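Equivalently, pnorm() has a lower.tail argument; setting it to FALSE returns the upper-tail probability directly:

```r
# Upper-tail probability P(Z > 1.96) without subtracting from 1
pnorm(1.96, lower.tail = FALSE)
```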
Solving a Real Problem
Suppose you took a standardized test that has a mean of 500 and a standard deviation of 100, and your score was 720. You are interested in your approximate percentile on this test.
To solve this problem, find the Z-score of 720 and then use pnorm() to find the percentile of your score.
zscore <- scale(x = 720, center = 500, scale = 100)
pnorm(2.2)
pnorm(zscore[1, 1])
pnorm(zscore[1])
pnorm(zscore[1, ])
Some Descriptive Statistics in R
Here, we will use the mtcars data to get descriptive statistics in R. You can use a dataset of your own choice.
Getting Dataset Information
Let us get some basic information about the data. It will help us understand the mode (type) of the variables in the dataset.
# attach the mtcars dataset
attach(mtcars)
# data structure
str(mtcars)
You will see that the dataset mtcars contains 32 observations and 11 variables.
It is also best to inspect the first and last rows of the dataset.
# for the first six rows
head(mtcars)
# for the last six rows
tail(mtcars)
Getting Numerical Information about Dataset and Variables
To get a quick overview of the dataset, the summary() function can also be used. We can apply the summary() function separately to each of the variables in the dataset.
summary(mtcars)
summary(mpg)
summary(gear)
Note that the summary() function provides the five-number summary (minimum, first quartile, median, third quartile, and maximum) along with the mean of the variable passed as its argument. Note the difference between the outputs of the following code.
summary(cyl)
summary(factor(cyl))
Note that when the data type of a variable is defined or changed, R will automatically choose appropriate descriptive statistics. If a categorical variable is defined as a factor, the summary() function will return a frequency table.
There are some other functions that can be used instead of the summary() function.
# average value
mean(mpg)
# median value
median(mpg)
# minimum value
min(mpg)
# maximum value
max(mpg)
# quantiles, percentiles, deciles (probs must be between 0 and 1)
quantile(mpg)
quantile(mpg, probs = c(0.10, 0.20, 0.30, 0.70, 0.90))
# variance and standard deviation
var(mpg)
sd(mpg)
# inter-quartile range
IQR(mpg)
# range
range(mpg)
Creating a Frequency Table
We can produce a frequency table and a relative frequency table for any categorical variable.
freq <- table(cyl)
freq
rf <- prop.table(freq)
barplot(freq)
barplot(rf)
pie(freq)
pie(rf)
Creating a Contingency Table (Cross-Tabulation)
A contingency table can be used to summarize the relationship between two categorical variables. The xtabs() or table() functions can be used to produce a cross-tabulation (contingency table).
xtabs(~ cyl + gear, data = mtcars)
table(cyl, gear)
Finding a Correlation between Variables
The cor() function can be used to find the degree of relationship between variables using Pearson's method (the default).
cor(mpg, wt)
However, if the variables are heavily skewed, the non-parametric Spearman's correlation can be used.
cor(mpg, wt, method = "spearman")
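The cor() function also accepts a whole data frame (or matrix) of numeric variables, returning all pairwise correlations at once. A small sketch using three mtcars variables:

```r
# Pairwise Pearson correlation matrix for a few numeric variables of mtcars
cm <- cor(mtcars[, c("mpg", "wt", "hp")])
round(cm, 2)
```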
A scatter plot can be drawn using the plot() function.
plot(mpg ~ wt)
Learn more about the plot() function: plot() function
The lm() Function in R
Many generic functions are available for computing regression coefficients, testing coefficients, computing residuals or predicted values, etc. Therefore, a good grasp of the lm() function is necessary. Suppose we have performed a regression analysis using the lm() function, as in the previous lesson.
mod <- lm(mpg ~ hp, data = mtcars)
The object returned by the lm() function has class "lm". Objects of the "lm" class have mode list.
class(mod)
The names of the components of an "lm" object can be queried via
names(mod)
All the components of an "lm" object can be accessed directly. For example,
mod$rank
mod$coef  # or mod$coefficients
The following is the list of some generic functions for the fitted “lm” model.
Generic Function | Short Description |
print() | prints or displays the results in the R console |
summary() | displays regression coefficients, their standard errors, t-ratios, p-values, and significance |
coef() | extracts regression coefficients |
residuals(), resid() | extract residuals of the fitted model |
fitted(), fitted.values() | extract fitted values |
anova() | performs comparisons of nested models |
predict() | computes predicted values for new data |
plot() | draws diagnostic plots of the regression model |
confint() | computes confidence intervals for regression coefficients |
deviance() | computes the residual sum of squares |
vcov() | computes the estimated variance-covariance matrix |
logLik() | computes the log-likelihood |
AIC(), BIC() | compute information criteria |
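A few of these generics in action on a fitted "lm" model (re-fitting mod here so the snippet stands on its own):

```r
# Apply several generic functions to a fitted "lm" object
mod <- lm(mpg ~ hp, data = mtcars)
deviance(mod)  # residual sum of squares
vcov(mod)      # variance-covariance matrix of the coefficients
logLik(mod)    # log-likelihood
AIC(mod)       # Akaike information criterion
anova(lm(mpg ~ 1, data = mtcars), mod)  # compare with the intercept-only model
```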
It is better to save the object returned by the summary() function.
The summary() function returns an object of class "summary.lm", whose components can be queried via
sum_mod <- summary(mod)
names(sum_mod)
names(summary(mod))
The components of the summary() object can be obtained as
sum_mod$residuals
sum_mod$r.squared
sum_mod$adj.r.squared
sum_mod$df
sum_mod$sigma
sum_mod$fstatistic
The confidence interval for estimated coefficients can be computed as
confint(mod, level = 0.95)
Note that the level argument is optional when the confidence level is 95% (significance level 5%).
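For any other confidence level, pass it explicitly via level; for example, 90% intervals (re-fitting mod so the snippet is self-contained):

```r
# 90% confidence intervals for the regression coefficients
mod <- lm(mpg ~ hp, data = mtcars)
confint(mod, level = 0.90)
```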
The intervals for the mean response and for an individual response at hp (regressor) values of 200 and 160 can be computed as
predict(mod, newdata = data.frame(hp = c(200, 160)), interval = "confidence")
predict(mod, newdata = data.frame(hp = c(200, 160)), interval = "prediction")
The prediction intervals can be used for computing and visualizing prediction bands. For example,
x <- seq(50, 350, length = 32)
pred <- predict(mod, newdata = data.frame(hp = x), interval = "prediction")
plot(mpg ~ hp, data = mtcars)
lines(pred[, 1] ~ x, col = 1)  # fitted values
lines(pred[, 2] ~ x, col = 2)  # lower limit
lines(pred[, 3] ~ x, col = 2)  # upper limit
For diagnostic plots, the plot() function can be used; it provides four graphs:
- residuals vs fitted values
- QQ plot of standardized residuals
- scale-location plot of fitted values against the square root of standardized residuals
- standardized residuals vs leverage
To draw, say, only the QQ plot, use
plot(mod, which = 2)
The which argument selects which of the four graphs is produced.
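To see all four diagnostic plots at once, a common base-graphics idiom is to split the plotting device into a 2-by-2 grid first (sketched here with mod re-fitted so the snippet is self-contained):

```r
# Draw all four diagnostic plots in a 2 x 2 grid
mod <- lm(mpg ~ hp, data = mtcars)
par(mfrow = c(2, 2))  # split the plotting device into a 2 x 2 grid
plot(mod)             # all four diagnostic plots
par(mfrow = c(1, 1))  # reset the layout
```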