Some Descriptive Statistics in R: A Comprehensive R Tutorial

Descriptive Statistics in R

There are numerous functions in the R language that are used to compute descriptive statistics. Here, we will use the mtcars dataset to get descriptive statistics in R; you can use a dataset of your own choice. To learn what descriptive statistics are, read the posts in the Basic Statistics Section.

Getting Dataset Information in R

Before performing any descriptive or inferential statistics, it is better to get some basic information about the data. This helps in understanding the mode (type) of each variable in the dataset.

# attach the mtcars dataset
attach(mtcars)

# data structure
str(mtcars)

You will see that the mtcars dataset contains 32 observations and 11 variables.

It is also good practice to inspect the first and last rows of the dataset.

# for the first six rows
head(mtcars)

# for the last six rows
tail(mtcars)

Getting Numerical Descriptive Statistics in R

To get a quick overview of the dataset, the summary( ) function can be used. The summary( ) function can also be applied separately to each of the variables in the dataset.

summary(mtcars)
summary(mpg)
summary(gear)

Note that the summary( ) function provides the five-number summary (minimum, first quartile, median, third quartile, and maximum) together with the mean of the variable passed as the argument. Note the difference between the output of the following two calls.

summary(cyl)
summary( factor(cyl) )

Remember that if the datatype of a variable is defined or changed, R will automatically choose appropriate descriptive statistics. If a categorical variable is defined as a factor, the summary( ) function will produce a frequency table.

Some other functions can be used instead of the summary() function.

# average value
mean(mpg)
# median value
median(mpg)
# minimum value
min(mpg)
# maximum value
max(mpg)
# Quartiles, percentiles, deciles (probs must lie between 0 and 1)
quantile(mpg)
quantile(mpg, probs = c(0.10, 0.20, 0.30, 0.70, 0.90))
# variance and standard deviation
var(mpg)
sd(mpg)
# Inter-quartile range
IQR(mpg)
# Range
range(mpg)
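If several of these statistics are needed at once, they can be collected into a single named vector; a small sketch (the name stats_mpg is just illustrative):

```r
# compute several descriptive statistics of mpg in one call
stats_mpg <- c(mean   = mean(mtcars$mpg),
               median = median(mtcars$mpg),
               sd     = sd(mtcars$mpg),
               IQR    = IQR(mtcars$mpg))
round(stats_mpg, 2)
# mean ≈ 20.09, median 19.20, sd ≈ 6.03, IQR ≈ 7.38
```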

Creating a Frequency Table in R

We can produce a frequency table and a relative frequency table for any categorical variable.

freq <- table(cyl); freq
rf <- prop.table(freq)

barplot(freq)
barplot(rf)
pie(freq)
pie(rf)
Barplot and pie chart of the frequency table

Creating a Contingency Table (Cross-Tabulation)

A contingency table summarizes the relationship between two categorical variables. The xtabs( ) or table( ) functions can be used to produce a cross-tabulation (contingency table).

xtabs(~cyl + gear, data = mtcars)
table(cyl, gear)
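Row and column totals can be appended to a contingency table with the addmargins() function:

```r
# contingency table of cyl by gear with row and column totals
tab <- table(mtcars$cyl, mtcars$gear)
addmargins(tab)
```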

Finding a Correlation between Variables

The cor( ) function can be used to find the degree of relationship between variables using Pearson’s method.

cor(mpg, wt)

However, if the variables are heavily skewed, the non-parametric Spearman's correlation can be used.

cor(mpg, wt, method = "spearman")
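The cor() function also accepts a data frame of numeric columns and returns the full correlation matrix, which is convenient for inspecting several variables at once:

```r
# correlation matrix for a subset of numeric variables of mtcars
round(cor(mtcars[, c("mpg", "wt", "hp")]), 3)
# mpg and wt are strongly negatively correlated (about -0.87)
```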

A scatter plot can be drawn using the plot( ) function.

plot(mpg ~ wt)

FAQs about Descriptive Statistics in R

  1. How do you check the data types of different variables/columns in R?
  2. What is the use of the str() function in R?
  3. What is the use of the head() and tail() functions in R?
  4. How can numerical descriptive statistics of a variable or dataset be obtained in R?
  5. What is the use of the summary() function in R?
  6. How can the summary() function be used to compute descriptive statistics of a categorical variable in R?
  7. How do you produce a frequency table in R?
  8. What is the use of the xtabs() function in R?
  9. What is the use of the cor(), plot(), and pie() functions? Explain with the help of examples.
  10. What functions are used to compute the mean, median, standard deviation, variance, quantiles, and IQR?

Learn more about the plot( ) function: plot( ) function

Visit: Learn Basic Statistics

lm Function in R: A Comprehensive Guide

Introduction to lm Function in R

Many generic functions are available for computations on a fitted regression model, for example, testing the coefficients, computing the residuals, predicting values, etc. Therefore, a good grasp of the lm() function is necessary. It is assumed that you are familiar with performing regression analysis using the lm() function.

mod <- lm(mpg ~ hp, data = mtcars)

To learn about performing linear regression analysis using the lm() function, you can visit the article "Performing Linear Regression in R".

Objects of “lm” Class

The object returned by the lm() function has class "lm". Objects of the "lm" class have mode "list".

class(mod)

The names of the components of an "lm" object can be queried via

names(mod)

All the components of an "lm" object can be accessed directly. For example,

mod$rank

mod$coef   # or mod$coefficients

Generic Functions for the "lm" Model

The following is a list of some generic functions for a fitted "lm" model.

  • print(): prints or displays the results in the R console
  • summary(): displays the regression coefficients, their standard errors, t-ratios, p-values, and significance
  • coef(): extracts the regression coefficients
  • residuals() or resid(): extracts the residuals of the fitted model
  • fitted() or fitted.values(): extracts the fitted values
  • anova(): performs comparisons of nested models
  • predict(): computes predicted values for new data
  • plot(): draws diagnostic plots of the regression model
  • confint(): computes confidence intervals for the regression coefficients
  • deviance(): computes the residual sum of squares
  • vcov(): computes the estimated variance-covariance matrix of the coefficients
  • logLik(): computes the log-likelihood
  • AIC(), BIC(): compute information criteria
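For illustration, here are a few of these generics applied to a model fitted as in the earlier section:

```r
mod <- lm(mpg ~ hp, data = mtcars)

coef(mod)       # extract intercept and slope
confint(mod)    # 95% confidence intervals for the coefficients
deviance(mod)   # residual sum of squares
AIC(mod)        # Akaike information criterion
```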

It is better to save the object returned by the summary() function.

The summary() function returns an object of class "summary.lm", and its components can be queried via

sum_mod <- summary(mod)

names(sum_mod)
names( summary(mod) )

The components of the summary() object can be accessed as

sum_mod$residuals
sum_mod$r.squared
sum_mod$adj.r.squared
sum_mod$df
sum_mod$sigma
sum_mod$fstatistic
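These components can be cross-checked against their definitions; for instance, r.squared equals one minus the ratio of the residual sum of squares to the total sum of squares:

```r
mod <- lm(mpg ~ hp, data = mtcars)
sum_mod <- summary(mod)

# r.squared = 1 - RSS/TSS by definition
rss <- sum(resid(mod)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r2  <- 1 - rss / tss

all.equal(r2, sum_mod$r.squared)   # TRUE
```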

Computation and Visualization of Prediction and Confidence Interval

The confidence interval for estimated coefficients can be computed as

confint(mod, level = 0.95)

Note that the level argument is optional when the confidence level is 95% (significance level of 5%), as this is the default.

The confidence interval for the mean response and the prediction interval for an individual response, at hp (regressor) values of 200 and 160, can be computed as

predict(mod, newdata=data.frame(hp = c(200, 160)), interval = "confidence" )
predict(mod, newdata=data.frame(hp = c(200, 160)), interval = "prediction" )
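Note that the prediction interval for an individual response is always wider than the confidence interval for the mean response, since it also accounts for the error variance of a new observation. A quick check (variable names are illustrative):

```r
mod <- lm(mpg ~ hp, data = mtcars)
new <- data.frame(hp = 200)

conf_int <- predict(mod, newdata = new, interval = "confidence")
pred_int <- predict(mod, newdata = new, interval = "prediction")

# the prediction interval is wider than the confidence interval
(pred_int[, "upr"] - pred_int[, "lwr"]) > (conf_int[, "upr"] - conf_int[, "lwr"])   # TRUE
```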

The prediction intervals can be used for computing and visualizing confidence bands. For example,

x <- seq(50, 350, length = 32)
# the newdata column must be named after the regressor (hp)
pred <- predict(mod, newdata = data.frame(hp = x), interval = "prediction")

plot(hp, mpg)
lines(pred[,1] ~ x, col = 1) # fitted values
lines(pred[,2] ~ x, col = 2) # lower limit
lines(pred[,3] ~ x, col = 2) # upper limit
Visualization of prediction intervals and confidence band

Regression Diagnostics

For regression diagnostics, the plot() function can be used; it provides four graphs:

  • residuals vs fitted values
  • QQ plot of standardized residuals
  • scale-location plot of fitted values against the square root of standardized residuals
  • standardized residuals vs leverage
Diagnostic plots of the model fitted with the lm() function

To draw, say, only the QQ plot, use

plot(mod, which = 2)

The which argument selects which of the four graphs is produced.
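All four diagnostic graphs can be shown at once by setting a 2-by-2 plotting layout with par() before calling plot(); a small sketch:

```r
mod <- lm(mpg ~ hp, data = mtcars)

op <- par(mfrow = c(2, 2))   # 2 x 2 grid of panels
plot(mod)                    # draws all four diagnostic plots
par(op)                      # restore the previous layout
```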

FAQs about the lm() Function in R

  1. What is the use of the lm() function in R?
  2. What is the class of the object returned by lm(), and what are its components?
  3. Describe the generic functions for an object of class "lm".
  4. What are the important components of a "summary.lm" object?
  5. How can the components of a summary.lm() object be accessed?
  6. How can confidence and prediction intervals for linear models be visualized in R?
  7. How are regression diagnostics performed in the R language?
  8. What is the use of the confint(), fitted(), coef(), anova(), vcov(), deviance(), and residuals() generic functions?

Test Preparation MCQs

Learn R Programming

Simple Linear Regression Model

Introduction to Simple Linear Regression Model

The linear regression model is typically estimated by the ordinary least squares (OLS) technique. In general form, the model is

$$Y_i=x_i'\beta + \varepsilon_i, \quad\quad i=1,2,\cdots,n$$

In matrix notation

$$y=X\beta + \varepsilon,$$

where $y$ is a vector of order $n\times 1$ that contains the values of the dependent variable, and $X=(x_1,x_2,\cdots,x_n)'$ is the regressor matrix containing the $n$ observations. The $X$ matrix is also called the model matrix (its columns represent the regressors). The $\beta$ is a $p\times 1$ vector of regression coefficients, and $\varepsilon$ is a vector of order $n\times 1$ containing the error terms. To learn more about simple linear models, visit the link: Simple Linear Regression Models.

Estimating Regression Coefficients

The regression coefficients $\beta$ can be estimated as

$$\hat{\beta}=(X'X)^{-1}X'y$$

The fitted values can be computed as

$$\hat{y}=X\hat{\beta}$$

The residuals are

$$\hat{\varepsilon} = y - \hat{y}$$

The residual sum of squares is

$$\hat{\varepsilon}'\hat{\varepsilon}$$

The R language has excellent facilities for fitting linear models. The basic function for fitting linear models by the least squares method is the lm() function. The model is specified using formula notation.

We will consider the mtcars dataset. Let $Y = mpg$ and $X = hp$; the simple linear regression model is

$$Y_i = \beta_1 + \beta_2 hp_i + \varepsilon_i$$

where $\beta_1$ is the intercept and $\beta_2$ is the slope coefficient.

Fitting Simple Linear Regression Model in R

To fit this simple linear regression model in R, one can follow:

attach(mtcars)

mod <- lm(mpg ~ hp)
mod

The lm() function uses the formula mpg ~ hp, with the response variable on the left of the tilde (~) and the predictor on the right. It is better to supply the data argument to the lm() function. That is,

mod <- lm(mpg ~ hp, data = mtcars)

The lm() function returns an object of class "lm", saved here in the variable mod (any valid name can be used). Printing the object produces a brief report.
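As a cross-check, the matrix formula $\hat{\beta}=(X'X)^{-1}X'y$ from the previous section can be computed directly and compared with the coefficients reported by lm(); a sketch using solve() on the normal equations:

```r
# design matrix: a column of ones (intercept) and the hp regressor
X <- cbind(1, mtcars$hp)
y <- mtcars$mpg

# beta-hat = (X'X)^(-1) X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
drop(beta_hat)

# agrees with the coefficients from lm()
coef(lm(mpg ~ hp, data = mtcars))
# (Intercept) ≈ 30.099, hp ≈ -0.068
```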

Hypothesis Testing of Regression Coefficients

For hypothesis testing of the regression coefficients, the summary() function should be used. It provides more information about the fitted model, such as the standard errors, t-values, and p-values for each coefficient. For example,

summary(mod)

One can fit a regression model without an intercept term if required.

lm(mpg ~ hp -1, data = mtcars)

Graphical Representation of the Model

For the graphical representation of the model, one can use the plot() function to draw scatter points and the abline() function to draw the regression line.

plot(hp, mpg)
abline(mod)

Note the order of the variables in the plot() function: the first argument represents the predictor variable, while the second argument represents the response variable.

The abline() function draws a line on the graph according to the slope and intercept taken from the argument mod, or provided manually.

One can change the style of the regression line using the lty argument. Similarly, the color of the regression line can be changed from black using the col argument. That is,

plot(hp, mpg)
abline(mod, lty = 2, col = "blue")

Note that one can identify different observations on a graph using the identify() function. For example,

identify(hp, mpg)

Note: to identify a point, place the mouse pointer near the point and press the left mouse button; to exit the identify procedure, press the right mouse button or the Esc key.

FAQs about Simple Linear Regression in R

  1. What is a simple linear regression model? How can it be performed in the R language?
  2. How is the lm() function used to fit a simple linear regression model? Explain in detail.
  3. How can estimation and testing of the regression coefficients be performed in R?
  4. What is the use of the summary() function in R? Explain.
  5. How can visualization of regression models be performed in R?

Read more on Statistical models in R

MCQs in Statistics