Performing Linear Regression in R: A Quick Reference

Introduction to Performing Linear Regression in R

Regression builds a function of independent variables (also known as predictors, regressors, explanatory variables, or features) to predict a dependent variable (also called a response, target, or regressand). Here we focus on performing linear regression in the R language.

Linear regression predicts the response with a linear function of the predictors: $$y=\beta_0+\beta_1x_1+\beta_2x_2+\cdots + \beta_kx_k,$$ where $x_1, x_2, \cdots, x_k$ are the predictors, $y$ is the response to predict, and $\beta_0, \beta_1, \cdots, \beta_k$ are the regression coefficients to be estimated.

Before performing the regression analysis, it is very helpful to compute the coefficient of correlation between the dependent and independent variables and to draw a scatter diagram.
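As a rough sketch of this preliminary check, the following uses the built-in mtcars data with mpg as the response and wt as the predictor (the same pair analyzed later in this article). The axis labels are only illustrative.

```r
# Correlation between the response (mpg) and the predictor (wt)
r <- cor(mtcars$mpg, mtcars$wt)
print(r)

# Scatter diagram of the same pair of variables
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (wt)", ylab = "Miles per gallon (mpg)")
```

A correlation close to -1 or +1 together with a roughly straight-line pattern in the scatter diagram suggests that a linear model is a reasonable starting point.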

Performing Linear Regression in R

Load the mtcars data, and check the data structure using str().

str(mtcars)

If you have data stored in an external file such as a CSV, you can use the read.csv() function to load the data into R. To learn about importing data files in R, follow the link: Import Data files in R
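A minimal sketch of reading a CSV file is given below. To keep it self-contained, it first writes mtcars to a temporary file; the path csv_path and the object name cars are hypothetical and stand in for your own file and data.

```r
# Hypothetical file: write mtcars to a temporary CSV so the example can run anywhere
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = TRUE)

# Read the file back; row.names = 1 uses the first column (car names) as row names
cars <- read.csv(csv_path, row.names = 1)
str(cars)
```

With your own data, replace csv_path with the path to your file and drop the row.names argument if the file has no column of row labels.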

Suppose we want to check the impact of weight (wt) on miles per gallon (mpg), test the significance of the regression coefficient, and examine other statistics to assess the goodness of fit of our model.

mod <- lm(mpg ~ wt, data = mtcars)
summary(mod)
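The printed summary can also be queried programmatically. The sketch below pulls out the coefficient table, the p-value of the slope, and the R-squared values; the object names s, coef_table, p_wt, r_sq, and adj_r_sq are chosen only for this illustration.

```r
mod <- lm(mpg ~ wt, data = mtcars)
s <- summary(mod)

# Coefficient table: estimate, standard error, t value, and p-value
coef_table <- coef(s)
print(coef_table)

# p-value for the slope of wt (test of H0: slope = 0)
p_wt <- coef_table["wt", "Pr(>|t|)"]

# Goodness-of-fit measures
r_sq <- s$r.squared
adj_r_sq <- s$adj.r.squared
print(c(p_wt = p_wt, r_sq = r_sq, adj_r_sq = adj_r_sq))
```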

Now look at the names of the components stored in the mod object.

names(mod)

Getting Coefficients and Different Regression Statistics

Let us get the coefficients of the fitted regression model in R

mod$coef
coef(mod)

To obtain confidence intervals for the estimated coefficients, use the confint() function.

confint(mod)

Fitted values from the regression model can be obtained by using fitted()

mod$fitted
fitted(mod)
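Fitted values refer to the observations used to fit the model. For predictions at new predictor values, predict() can be used. The sketch below assumes three hypothetical weights stored in new_cars.

```r
mod <- lm(mpg ~ wt, data = mtcars)

# Hypothetical new observations; the column name must match the predictor (wt)
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))

# Predicted mpg with 95% confidence limits for the mean response
pred <- predict(mod, newdata = new_cars, interval = "confidence")
print(pred)
```

The result is a matrix with the columns fit, lwr, and upr, one row per new observation.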

The residuals of the regression model can be obtained using the resid() function (a shorter alias of residuals()).

mod$resid
resid(mod)

One can check the formula used to fit the simple or multiple regression model. It shows which variable is used as the response and which are used as explanatory variables.

formula(mod)
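For a multiple regression, more predictors are added on the right-hand side of the formula. The sketch below adds hp (horsepower) as a second predictor; this choice is only an assumption made for illustration.

```r
mod <- lm(mpg ~ wt, data = mtcars)

# Multiple regression: mpg explained by wt and hp
mod2 <- lm(mpg ~ wt + hp, data = mtcars)

formula(mod2)   # shows the response and the explanatory variables
coef(mod2)      # intercept plus one coefficient per predictor

# F-test comparing the simple model with the larger model
print(anova(mod, mod2))
```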

Graphical Representation of Relationship

To visualize the relationship between variables or pairs of variables, one can use the plot() or pairs() functions. Let us draw the scatter diagram between the dependent variable mpg and the explanatory variable wt using the plot() function.

plot(mpg ~ wt, data = mtcars)

One can add the best-fitting line to the scatter plot. For this purpose, use abline() with an object of class lm, such as mod in this case.

abline(mod)

There are many other functions and R packages for fitting linear regression models in the R language.

FAQs about Performing Linear Regression Models in R

  1. What is the use of the abline() function in R?
  2. How can a simple linear regression model be visualized in R?
  3. How can one obtain the fitted/predicted values of a simple linear regression model in R?
  4. Write a command that saves the residuals of an lm() model in a variable.
  5. State the step-by-step procedure for performing linear regression in R.

To learn more about the lm() function in R

https://itfeature.com

Probability Distributions in R: A Comprehensive Tutorial

This article discusses probability distributions in the R language.

We often make probabilistic statements when working with statistical probability distributions. We usually want to know four things:

  • The density (PDF) at a particular value,
  • The distribution (CDF) at a particular value,
  • The quantile value corresponding to a particular probability, and
  • A random draw of values from a particular distribution.

Probability Distributions in R Language

The R language has built-in functions for obtaining the density, the distribution function, quantiles, and random numbers for many probability distributions.

Consider a random variable $X$ which is $N(\mu = 2, \sigma^2 = 16)$. We want to:

1) Calculate the value of PDF at $x=3$ (that is, the height of the curve at $x=3$)

dnorm(x = 3, mean = 2, sd = sqrt(16) ) 

dnorm(x = 3, mean = 2, sd = 4) 
dnorm(x = 3, 2, 4)

2) Calculate the value of the CDF at $x=3$ (that is, $P(X\le 3)$)

pnorm(q = 3, mean = 2, sd = 4)

3) Calculate the quantile for probability 0.975

qnorm(p = 0.975, mean = 2, sd = 4)

4) Generate a random sample of size $n = 10$

rnorm(n = 10, mean = 2, sd = 4)
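The four functions are linked to each other, which can be checked numerically. The sketch below uses the same $N(\mu = 2, \sigma^2 = 16)$ variable; the seed 123 and the sample size 1000 are arbitrary choices made for this illustration.

```r
# CDF at x = 3, then invert it with the quantile function
p <- pnorm(3, mean = 2, sd = 4)
x_back <- qnorm(p, mean = 2, sd = 4)   # recovers the original value of x

# The CDF is the area under the density curve up to x = 3
area <- integrate(dnorm, lower = -Inf, upper = 3, mean = 2, sd = 4)$value

# Random draws: sample mean and sd should be close to 2 and 4
set.seed(123)   # arbitrary seed, only for reproducibility
x <- rnorm(1000, mean = 2, sd = 4)
print(c(p = p, x_back = x_back, area = area, mean = mean(x), sd = sd(x)))
```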

There are many probability distributions available in the R Language. I will list only a few.

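| Distribution | Density | Quantile | CDF | Random |
|---|---|---|---|---|
| Binomial | dbinom( ) | qbinom( ) | pbinom( ) | rbinom( ) |
| t | dt( ) | qt( ) | pt( ) | rt( ) |
| Poisson | dpois( ) | qpois( ) | ppois( ) | rpois( ) |
| F | df( ) | qf( ) | pf( ) | rf( ) |
| Chi-Square | dchisq( ) | qchisq( ) | pchisq( ) | rchisq( ) |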

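Observe that a prefix (d, q, p, or r) is added to the R name of each distribution: d for the density, q for the quantile function, p for the CDF, and r for random number generation.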
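As a small illustration of the four prefixes, consider a binomial variable with 10 trials and success probability 0.5. These values are assumptions chosen only for this example.

```r
# X ~ Binomial(size = 10, prob = 0.5)
d <- dbinom(3, size = 10, prob = 0.5)     # P(X = 3)  = 120/1024
p <- pbinom(3, size = 10, prob = 0.5)     # P(X <= 3) = 176/1024
q <- qbinom(0.5, size = 10, prob = 0.5)   # smallest x with P(X <= x) >= 0.5
r <- rbinom(5, size = 10, prob = 0.5)     # five random draws
print(list(d = d, p = p, q = q, r = r))
```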

| Distribution | Distribution Name in R | Parameters |
|---|---|---|
| Binomial | binom | size = number of trials, prob = probability of success for one trial |
| Geometric | geom | prob = probability of success for one trial |
| Poisson | pois | lambda = mean |
| Beta | beta | shape1, shape2 |
| Chi-Square | chisq | df = degrees of freedom |
| F | f | df1, df2 = degrees of freedom |
| Logistic | logis | location, scale |
| Normal | norm | mean, sd |
| Student's t | t | df = degrees of freedom |
| Weibull | weibull | shape, scale |
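The parameter names in the table are passed as named arguments. A short sketch with a few of the distributions follows; all numeric values are hypothetical and serve only to show the argument names.

```r
# Poisson with mean 3: P(X = 2)
p_pois <- dpois(2, lambda = 3)

# Chi-square with 1 degree of freedom: 95th percentile
q_chi <- qchisq(0.95, df = 1)

# F with (2, 10) degrees of freedom: P(F <= 4.10)
p_f <- pf(4.10, df1 = 2, df2 = 10)

# Student's t with 10 degrees of freedom: 97.5th percentile
t_crit <- qt(0.975, df = 10)

print(c(p_pois = p_pois, q_chi = q_chi, p_f = p_f, t_crit = t_crit))
```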

Drawing the Density Function

The density function dnorm() can be used to draw a graph of the normal distribution (the d function of any other distribution can be used in the same way). Let us compare two normal distributions, both with mean = 20, one with sd = 6 and the other with sd = 3.

For this purpose, we need $x$-axis values that cover the wider distribution, such as $\overline{x} \pm 3SD \Rightarrow 20 \pm 3\times 6$, that is, from 2 to 38.

xaxis <- seq(0, 40, 0.5)
y1 <- dnorm(xaxis, 20, 6)
y2 <- dnorm(xaxis, 20, 3)

plot(xaxis, y2, type = "l", main = "comparing two normal distributions", col = "blue")

lines(xaxis, y1, col = "red")

Finding Probabilities in R

For the normal distribution, probabilities can be computed in R using the pnorm() function.

#Left Tailed Probability
pnorm(1.96)

#Area between two Z-scores
pnorm(1.96) - pnorm(-1.96)

Finding Right-Tailed Probabilities

1 - pnorm(1.96)
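The same right-tailed probability can be requested directly through the lower.tail argument of pnorm(). A brief sketch, with object names chosen only for this example:

```r
# Right tail by subtraction and by asking for the upper tail directly
right1 <- 1 - pnorm(1.96)
right2 <- pnorm(1.96, lower.tail = FALSE)
print(all.equal(right1, right2))

# Two-tailed probability beyond |z| = 1.96
two_tail <- 2 * pnorm(1.96, lower.tail = FALSE)
print(two_tail)
```

Using lower.tail = FALSE avoids the subtraction, which is numerically safer for very small tail probabilities.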

Solving a Real Problem

Suppose you took a standardized test that has a mean of 500 and a standard deviation of 100, and you scored 720. You are interested in the approximate percentile of your score on this test.

To solve this problem, you have to find the Z-score of 720 and then use the pnorm( ) to find the percentile of your score.

# scale() returns a 1 x 1 matrix holding the z-score (720 - 500) / 100
zscore <- scale(x = 720, center = 500, scale = 100)

pnorm(2.2)
pnorm(zscore[1,1])
pnorm(zscore[1])
pnorm(zscore[1, ])
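The z-score can also be computed by hand, or skipped entirely by passing the mean and standard deviation to pnorm(). A minimal sketch, where z, p1, and p2 are names chosen for illustration:

```r
# z-score computed directly
z <- (720 - 500) / 100
p1 <- pnorm(z)

# Same probability without standardizing first
p2 <- pnorm(720, mean = 500, sd = 100)

# Approximate percentile of the score
print(round(100 * p1, 1))
```

Both approaches give the same probability, so a score of 720 lies at roughly the 98.6th percentile.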


Some Descriptive Statistics in R: A Comprehensive R Tutorial

Descriptive Statistics in R

There are numerous functions in the R language that are used to compute descriptive statistics. Here, we will use the mtcars data to get descriptive statistics in R. You can use a dataset of your own choice. To learn what descriptive statistics are, read the different posts from the Basic Statistics Section.

Getting Dataset Information in R

Before performing any descriptive or inferential statistics, it is better to get some basic information about the data. It will help to understand the mode (type) of variables in the datasets.

# attach the mtcars dataset
attach(mtcars)

# data structure
str(mtcars)

You will see the dataset mtcars contains 32 observations and 11 variables.
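The dimensions and variable types can also be checked directly. The sketch below additionally shows with() as one possible alternative to attach(); this is offered as an option, not as the approach this article uses.

```r
# Number of rows (observations) and columns (variables)
print(dim(mtcars))

# Class (type) of every variable in the data frame
var_types <- sapply(mtcars, class)
print(var_types)

# Evaluate an expression inside the data frame without attaching it
avg_mpg <- with(mtcars, mean(mpg))
print(avg_mpg)
```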

It is also best to inspect the first and last rows of the dataset.

# for the first six rows
head(mtcars)

# for the last six rows
tail(mtcars)

Getting Numerical Descriptive Statistics in R

To get a quick overview of the dataset, the summary( ) function can also be used. We can use the summary( ) function separately for each of the variables in the dataset.

summary(mtcars)
summary(mpg)
summary(gear)

Note that the summary( ) function provides the five-number summary (minimum, first quartile, median, third quartile, and maximum) and the mean of the variable used as the argument. Note the difference between the outputs of the following code.

summary(cyl)
summary( factor(cyl) )

Remember that the output depends on the data type of the variable: once the type is defined or changed, R automatically chooses appropriate descriptive statistics. If a categorical variable is defined as a factor, the summary( ) function returns a frequency table.
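One way to make this permanent is to convert the categorical columns to factors in a copy of the data. In the sketch below the copy is named cars (a hypothetical name), and the labels for am follow the mtcars documentation (0 = automatic, 1 = manual).

```r
# Work on a copy so the original mtcars stays unchanged
cars <- mtcars
cars$cyl <- factor(cars$cyl)
cars$am <- factor(cars$am, levels = c(0, 1),
                  labels = c("automatic", "manual"))

# Numeric column gets the five-number summary and mean; factors get counts
print(summary(cars[, c("mpg", "cyl", "am")]))
```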

Some other functions can be used instead of summary() function.

# average value
mean(mpg)
# median value
median(mpg)
# minimum value
min(mpg)
# maximum value
max(mpg)
# Quantiles, percentiles, deciles (probs must lie between 0 and 1)
quantile(mpg)
quantile(mpg, probs = c(0.10, 0.20, 0.30, 0.70, 0.90))
# variance and standard deviation
var(mpg)
sd(mpg)
# Inter-quartile range
IQR(mpg)
# Range
range(mpg)
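Several of these statistics can be computed for more than one variable at once with sapply(). The sketch below uses mpg, hp, and wt; the choice of columns and the name stats are assumptions made for illustration.

```r
# Subset of numeric variables to describe
vars <- mtcars[, c("mpg", "hp", "wt")]

# One column per variable, one row per statistic
stats <- sapply(vars, function(x) {
  c(mean = mean(x), sd = sd(x), median = median(x), IQR = IQR(x))
})
print(round(stats, 2))
```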

Creating a Frequency Table in R

We can produce a frequency table and a relative frequency table for any categorical variable.

freq <- table(cyl); freq
rf <- prop.table(freq)

barplot(freq)
barplot(rf)
pie(freq)
pie(rf)
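Frequencies and percentages can be shown side by side in a single table. A small sketch, where the names pct and freq_table are chosen only for this example:

```r
# Frequency and percentage of cars by number of cylinders
freq <- table(mtcars$cyl)
pct <- round(100 * prop.table(freq), 1)

freq_table <- cbind(Frequency = freq, Percent = pct)
print(freq_table)
```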

Creating a Contingency Table (Cross-Tabulation)

A contingency table summarizes the relationship between two categorical variables. The xtabs( ) or table( ) functions can be used to produce a cross-tabulation (contingency table).

xtabs(~cyl + gear, data = mtcars)
table(cyl, gear)
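Row and column totals, as well as row proportions, can be added to a contingency table. A minimal sketch, with tab as an illustrative name:

```r
# Cross-tabulation of cylinders by gears
tab <- xtabs(~ cyl + gear, data = mtcars)

# Add row and column totals
print(addmargins(tab))

# Proportions within each row (each row sums to 1)
row_prop <- prop.table(tab, margin = 1)
print(round(row_prop, 2))
```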

Finding a Correlation between Variables

The cor( ) function can be used to find the degree of relationship between variables using Pearson’s method.

cor(mpg, wt)

However, if variables are heavily skewed, the non-parametric method Spearman’s correlation can be used.

cor(mpg, wt, method = "spearman")
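To test whether a correlation differs significantly from zero, or to look at several variables at once, cor.test() and a correlation matrix may help. The sketch below adds hp as a third variable purely for illustration.

```r
# Test of H0: the Pearson correlation between mpg and wt is zero
ct <- cor.test(mtcars$mpg, mtcars$wt)
print(ct$estimate)
print(ct$p.value)

# Correlation matrix for several variables
cor_mat <- cor(mtcars[, c("mpg", "wt", "hp")])
print(round(cor_mat, 3))
```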

The scatter plot can be drawn using the plot( ) function.

plot(mpg ~ wt)

FAQs about Descriptive Statistics in R

  1. How can the data types of different variables/columns be checked in R?
  2. What is the use of the str() function in R?
  3. What is the use of the head() and tail() functions in R?
  4. How can numerical statistics of a variable or data set be obtained in R?
  5. What is the use of the summary() function in R?
  6. How can the summary() function be used to obtain descriptive statistics of a categorical variable in R?
  7. How can a frequency table be produced in R?
  8. What is the use of the xtabs() function in R?
  9. What is the use of the cor(), plot(), and pie() functions? Explain with the help of examples.
  10. What functions are used to compute the mean, median, standard deviation, variance, quantiles, and IQR?
