## Performing Linear Regression in R

Regression builds a function of independent variables (also known as predictors, regressors, explanatory variables, or features) to predict a dependent variable (also called the response, target, or regressand). Here we will focus on performing linear regression in the R language.

Linear regression predicts the response with a linear function of the predictors, $$y=\beta_0+\beta_1x_1+\beta_2x_2+\cdots + \beta_kx_k,$$ where $x_1, x_2, \cdots, x_k$ are the predictors and $y$ is the response to predict.

Before performing the regression analysis, it is helpful to compute the coefficient of correlation between the dependent and independent variables, and also to draw a scatter diagram.
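As a quick preliminary check, one can do both in two lines; a minimal sketch using the built-in `mtcars` data (the same data used below):

```r
# correlation between miles per gallon and weight
cor(mtcars$mpg, mtcars$wt)

# scatter diagram of the two variables
plot(mpg ~ wt, data = mtcars)
```

A strong correlation (here, a large negative one) suggests a linear model is worth fitting.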

**Performing Linear Regression in R:**

Load the `mtcars` data, and check the data structure using `str()`.

```r
str(mtcars)
```

If you have data stored in an external file, such as a CSV file, you can use the `read.csv()` function to load the data into R. To learn about importing data files in R, follow the link: Import Data files in R.
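A self-contained sketch of `read.csv()` (a small CSV file is written first so the example runs anywhere; the temporary file path is only for illustration and is not part of the tutorial):

```r
# create a small CSV file for illustration, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(head(mtcars), csv_path, row.names = FALSE)

mydata <- read.csv(csv_path)  # header = TRUE is the default
str(mydata)
```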

Suppose we want to check the impact of weight (`wt`) on miles per gallon (`mpg`), test the significance of the regression coefficient, and examine other statistics to assess the goodness of our fitted model.

```r
mod <- lm(mpg ~ wt, data = mtcars)
summary(mod)
```

Now look at the components of the results stored in `mod`.

```r
names(mod)
```

Let us get the coefficients of the fitted regression model.

```r
mod$coef
coef(mod)
```

To obtain the confidence intervals of the estimated coefficients, one can use the `confint()` function.

```r
confint(mod)
```

Fitted values from the regression model can be obtained by using `fitted()`.

```r
mod$fitted
fitted(mod)
```

The residuals of the regression model can be obtained using the `residuals()` (or `resid()`) function.

```r
mod$resid
resid(mod)
```

One can check the formula used to perform the simple/multiple regression. It tells you which variable is used as the response and which are the explanatory variables.

```r
formula(mod)
```

To graphically visualize the relationship between variables or pairs of variables, one can use the `plot()` or `pairs()` functions. Let us draw the scatter diagram between the dependent variable `mpg` and the explanatory variable `wt` using the `plot()` function.

```r
plot(mpg ~ wt, data = mtcars)
```

One can add a best-fitted line to the scatter plot. For this purpose, use `abline()` with an object having the class `lm`, such as `mod` in this case.

```r
abline(mod)
```

There are many other functions and R packages for fitting linear regression models in R. To learn more about the `lm()` function, click the link: lm() function in R.
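The same `lm()` call extends directly to several predictors, matching the general model stated at the start of this section; a minimal sketch with two predictors from `mtcars` (the choice of `wt` and `hp` is only illustrative):

```r
# multiple regression: mpg modeled by weight and horsepower
mod2 <- lm(mpg ~ wt + hp, data = mtcars)
summary(mod2)
coef(mod2)  # beta0 (intercept), beta1 (wt), beta2 (hp)
```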

## Probability Distributions in R

We often make probabilistic statements when working with statistical probability distributions. We typically want to know four things:

- The density (PDF) at a particular value,
- The cumulative distribution (CDF) at a particular value,
- The quantile corresponding to a particular probability, and
- A random draw of values from a particular distribution.

R has plenty of functions for obtaining density, distribution, quantile, and random variables.

Consider a random variable $X$ which is $N(\mu = 2, \sigma^2 = 16)$. We want to:

1) Calculate the value of the PDF at $x=3$ (that is, the height of the curve at $x=3$)

```r
dnorm(x = 3, mean = 2, sd = sqrt(16))
dnorm(x = 3, mean = 2, sd = 4)
dnorm(x = 3, 2, 4)
```

2) Calculate the value of the CDF at $x=3$ (that is, $P(X\le 3)$)

```r
pnorm(q = 3, mean = 2, sd = 4)
```

3) Calculate the quantile for probability 0.975

```r
qnorm(p = 0.975, mean = 2, sd = 4)
```

4) Generate a random sample of size $n = 10$

```r
rnorm(n = 10, mean = 2, sd = 4)
```

There are many probability distributions available in the R Language. I will list only a few.

| Distribution | Density | Quantile | CDF | Random |
|---|---|---|---|---|
| Binomial | `dbinom()` | `qbinom()` | `pbinom()` | `rbinom()` |
| t | `dt()` | `qt()` | `pt()` | `rt()` |
| Poisson | `dpois()` | `qpois()` | `ppois()` | `rpois()` |
| F | `df()` | `qf()` | `pf()` | `rf()` |
| Chi-Square | `dchisq()` | `qchisq()` | `pchisq()` | `rchisq()` |

Observe that a prefix (d, q, p, or r) is added to the root name of each distribution.

| Distribution | Distribution Name in R | Parameters |
|---|---|---|
| Binomial | `binom` | size = number of trials, prob = probability of success on one trial |
| Geometric | `geom` | prob = probability of success on one trial |
| Poisson | `pois` | lambda = mean |
| Beta | `beta` | shape1, shape2 |
| Chi-Square | `chisq` | df = degrees of freedom |
| F | `f` | df1, df2 = degrees of freedom |
| Logistic | `logis` | location, scale |
| Normal | `norm` | mean, sd |
| Student's t | `t` | df = degrees of freedom |
| Weibull | `weibull` | shape, scale |
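The prefix scheme and the parameter names combine directly. A small sketch for the binomial distribution with 10 trials and success probability 0.5 (values chosen only for illustration):

```r
# P(X = 5) for X ~ Binomial(size = 10, prob = 0.5)
dbinom(5, size = 10, prob = 0.5)

# P(X <= 5)
pbinom(5, size = 10, prob = 0.5)

# smallest x with P(X <= x) >= 0.5 (the median)
qbinom(0.5, size = 10, prob = 0.5)

# three random draws
rbinom(3, size = 10, prob = 0.5)
```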

**Drawing the Density function**

The density function `dnorm()` can be used to draw the graph of the normal (or any other) distribution. Let us compare two normal distributions, both with mean = 20, one with sd = 6 and the other with sd = 3.

For this purpose, we need $x$-axis values covering roughly $\mu \pm 3SD \Rightarrow 20 \pm 3\times 6$.

```r
xaxis <- seq(0, 40, 0.5)
y1 <- dnorm(xaxis, 20, 6)
y2 <- dnorm(xaxis, 20, 3)
plot(xaxis, y2, type = "l", main = "Comparing two normal distributions", col = "blue")
points(xaxis, y1, type = "l", col = "red")
```

**Finding Probabilities**

```r
# left-tailed probability
pnorm(1.96)
# area between two Z-scores
pnorm(1.96) - pnorm(-1.96)
```

**Finding Right Tailed Probabilities**

```r
1 - pnorm(1.96)
```
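Equivalently, `pnorm()` can return the upper tail directly via its `lower.tail` argument, which avoids the subtraction:

```r
# right-tailed probability P(Z > 1.96)
pnorm(1.96, lower.tail = FALSE)
```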

**Solving a Real Problem**

Suppose you took a standardized test that has a mean of 500 and a standard deviation of 100, and you scored 720 marks. You are interested in your approximate percentile on this test.

To solve this problem, find the Z-score of 720 and then use `pnorm()` to find the percentile of your score.

```r
zscore <- scale(x = 720, center = 500, scale = 100)
pnorm(2.2)
pnorm(zscore[1, 1])
pnorm(zscore[1])
pnorm(zscore[1, ])
```

## Some Descriptive Statistics in R

Here, we will use the data `mtcars` to get descriptive statistics in R. You can use a dataset of your own choice.

**Getting Dataset Information**

Let us get some basic information about the data. It will help us understand the mode (type) of the variables in the dataset.

```r
# attach the mtcars dataset
attach(mtcars)
# data structure
str(mtcars)
```

You will see that the dataset `mtcars` contains 32 observations and 11 variables.

It is also best to inspect the first and last rows of the dataset.

```r
# for the first six rows
head(mtcars)
# for the last six rows
tail(mtcars)
```

**Getting Numerical Information about Dataset and Variables**

To get a quick overview of the dataset, the `summary()` function can also be used. We can use the `summary()` function separately for each of the variables in the dataset.

```r
summary(mtcars)
summary(mpg)
summary(gear)
```

Note that the `summary()` function provides the five-number summary (minimum, first quartile, median, third quartile, and maximum) together with the average value of the variable used as the argument. Note the difference between the output of the following code.

```r
summary(cyl)
summary(factor(cyl))
```

Note that if the data type of a variable is defined or changed, R will automatically choose appropriate descriptive statistics for it. If a categorical variable is defined as a factor, the `summary()` function will produce a frequency table.

There are some other functions that can be used instead of the `summary()` function.

```r
# average value
mean(mpg)
# median value
median(mpg)
# minimum value
min(mpg)
# maximum value
max(mpg)
# quartiles and other percentiles (probs must lie between 0 and 1)
quantile(mpg)
quantile(mpg, probs = c(0.1, 0.2, 0.3, 0.7, 0.9))
# variance and standard deviation
var(mpg)
sd(mpg)
# inter-quartile range
IQR(mpg)
# range
range(mpg)
```

**Creating a Frequency Table**

We can produce a frequency table and a relative frequency table for any categorical variable.

```r
freq <- table(cyl); freq
rf <- prop.table(freq)
barplot(freq)
barplot(rf)
pie(freq)
pie(rf)
```

**Creating a Contingency Table (Cross-Tabulation)**

A contingency table can be used to summarize the relationship between two categorical variables. The `xtabs()` or `table()` functions can be used to produce a cross-tabulation (contingency table).

```r
xtabs(~ cyl + gear, data = mtcars)
table(cyl, gear)
```

**Finding a Correlation between Variables**

The `cor()` function can be used to find the degree of relationship between variables using Pearson's method.

```r
cor(mpg, wt)
```

However, if variables are heavily skewed, the non-parametric method Spearman’s correlation can be used.

```r
cor(mpg, wt, method = "spearman")
```

A scatter plot can be drawn using the `plot()` function.

```r
plot(mpg ~ wt)
```

Learn more about the `plot()` function by following the link: plot() function.

## lm Function in R

Many generic functions are available for computing regression coefficients, testing coefficients, and computing residuals or predicted values, etc. Therefore, a good grasp of the `lm()` function is necessary. Suppose we have performed the regression analysis using the `lm()` function, as done in the previous lesson.

```r
mod <- lm(mpg ~ hp, data = mtcars)
```

The object returned by the `lm()` function has the class "lm", and its mode is a list.

```r
class(mod)
```

The names of the components of an "lm" object can be queried via

```r
names(mod)
```

All the components of the "lm" object can be accessed directly. For example,

```r
mod$rank
mod$coef  # or mod$coefficients
```

The following is the list of some generic functions for the fitted “lm” model.

| Generic Function | Short Description |
|---|---|
| `print()` | prints or displays the results in the R console |
| `summary()` | displays regression coefficients, their standard errors, t-ratios, p-values, and significance |
| `coef()` | extracts regression coefficients |
| `residuals()` or `resid()` | extracts residuals of the fitted model |
| `fitted()` or `fitted.values()` | extracts fitted values |
| `anova()` | performs comparisons of nested models |
| `predict()` | computes predicted values for new data |
| `plot()` | draws diagnostic plots of the regression model |
| `confint()` | computes confidence intervals for regression coefficients |
| `deviance()` | computes the residual sum of squares |
| `vcov()` | computes the estimated variance-covariance matrix |
| `logLik()` | computes the log-likelihood |
| `AIC()`, `BIC()` | compute information criteria |
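A few of these generics in action; a minimal sketch that refits the same model used in this lesson so the block is self-contained:

```r
mod <- lm(mpg ~ hp, data = mtcars)

deviance(mod)  # residual sum of squares
vcov(mod)      # variance-covariance matrix of the estimates
AIC(mod)       # Akaike information criterion
BIC(mod)       # Bayesian information criterion
```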

It is better to save the object returned by the `summary()` function.

The `summary()` function returns an object of class "`summary.lm`", and its components can be queried via

```r
sum_mod <- summary(mod)
names(sum_mod)
names(summary(mod))
```

The components of the `summary()` object can be obtained as

```r
sum_mod$residuals
sum_mod$r.squared
sum_mod$adj.r.squared
sum_mod$df
sum_mod$sigma
sum_mod$fstatistic
```

The confidence intervals for the estimated coefficients can be computed as

```r
confint(mod, level = 0.95)
```

Note that the `level` argument is optional when the confidence level is 95% (significance level 5%).
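For any other confidence level, pass it explicitly; for example, 99% intervals (which are wider than the default 95% ones):

```r
mod <- lm(mpg ~ hp, data = mtcars)
confint(mod, level = 0.99)  # wider than confint(mod)
```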

The intervals for the mean response and for an individual response at `hp` (regressor) values of 200 and 160 can be computed as

```r
# interval for the mean response
predict(mod, newdata = data.frame(hp = c(200, 160)), interval = "confidence")
# interval for an individual response
predict(mod, newdata = data.frame(hp = c(200, 160)), interval = "prediction")
```

The prediction intervals can be used for computing and visualizing prediction bands. For example,

```r
x <- seq(50, 350, length = 32)
# newdata must use the regressor's name (hp) as the column name
pred <- predict(mod, newdata = data.frame(hp = x), interval = "prediction")
plot(hp, mpg)
lines(pred[, 1] ~ x, col = 1)  # fitted values
lines(pred[, 2] ~ x, col = 2)  # lower limit
lines(pred[, 3] ~ x, col = 2)  # upper limit
```

For diagnostic plots, the `plot()` function can be used; it provides four graphs:

- residuals vs fitted values
- QQ plot of standardized residuals
- scale-location plot of fitted values against the square root of standardized residuals
- standardized residuals vs leverage

To draw, say, only the QQ plot, use

```r
plot(mod, which = 2)
```

The `which` argument selects which of the four graphs is produced.
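To see all four diagnostic plots on one device, a common idiom is to split the plotting region into a 2 x 2 grid first (this uses base R's `par()`; the model is refit here so the block is self-contained):

```r
mod <- lm(mpg ~ hp, data = mtcars)

op <- par(mfrow = c(2, 2))  # 2 x 2 grid of plots
plot(mod)                   # draws all four diagnostic plots
par(op)                     # restore the previous graphics settings
```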