Probability Distributions in R: A Comprehensive Tutorial

This article discusses how to work with probability distributions in the R language.

When working with statistical probability distributions, we often make probabilistic statements. We usually want to know four things:

  • The density (PDF) at a particular value,
  • The cumulative distribution (CDF) at a particular value,
  • The quantile value corresponding to a particular probability, and
  • A random draw of values from a particular distribution.

Probability Distributions in R Language

The R language provides functions for computing the density, cumulative distribution, and quantiles of many distributions, and for generating random numbers from them.

Consider a random variable $X \sim N(\mu = 2, \sigma^2 = 16)$, so that the standard deviation is $\sigma = 4$. We want to:

1) Calculate the value of PDF at $x=3$ (that is, the height of the curve at $x=3$)

# the following three calls are equivalent
dnorm(x = 3, mean = 2, sd = sqrt(16))
dnorm(x = 3, mean = 2, sd = 4)
dnorm(x = 3, 2, 4)

2) Calculate the value of the CDF at $x=3$ (that is, $P(X\le 3)$)

pnorm(q = 3, mean = 2, sd = 4)

3) Calculate the quantile for probability 0.975

qnorm(p = 0.975, mean = 2, sd = 4)

4) Generate a random sample of size $n = 10$

rnorm(n = 10, mean = 2, sd = 4)
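The four functions are linked: the quantile function inverts the CDF, and random draws become reproducible once a seed is fixed. A minimal sketch for the same $N(2, 16)$ variable follows; the seed value 123 is an arbitrary choice made for illustration and is not part of the original example.

```r
# CDF value at x = 3 for N(mean = 2, sd = 4)
p <- pnorm(q = 3, mean = 2, sd = 4)

# qnorm() inverts pnorm(): feeding the probability back returns x = 3
x_back <- qnorm(p = p, mean = 2, sd = 4)
x_back

# fixing the seed (123 is arbitrary) makes the random sample reproducible
set.seed(123)
s1 <- rnorm(n = 10, mean = 2, sd = 4)
set.seed(123)
s2 <- rnorm(n = 10, mean = 2, sd = 4)
identical(s1, s2)
```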

There are many probability distributions available in the R Language. I will list only a few.

Distribution   Density     Quantile    CDF         Random
Binomial       dbinom( )   qbinom( )   pbinom( )   rbinom( )
t              dt( )       qt( )       pt( )       rt( )
Poisson        dpois( )    qpois( )    ppois( )    rpois( )
F              df( )       qf( )       pf( )       rf( )
Chi-Square     dchisq( )   qchisq( )   pchisq( )   rchisq( )

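Observe that each function name is the distribution's name in R with a one-letter prefix: d for the density, p for the cumulative distribution, q for the quantile, and r for random generation.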

Distribution   Name in R   Parameters
Binomial       binom       size = number of trials, prob = probability of success for one trial
Geometric      geom        prob = probability of success for one trial
Poisson        pois        lambda = mean
Beta           beta        shape1, shape2
Chi-Square     chisq       df = degrees of freedom
F              f           df1, df2 = degrees of freedom
Logistic       logis       location, scale
Normal         norm        mean, sd
Student’s t    t           df = degrees of freedom
Weibull        weibull     shape, scale
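The same prefixes apply to every distribution listed here. As an illustration, the sketch below works with a binomial variable; the values of 10 trials and success probability 0.5 are assumed only for this example. Note that R's binomial functions name these two arguments size and prob.

```r
# P(X = 3) for X ~ Binomial(size = 10, prob = 0.5)
d3 <- dbinom(x = 3, size = 10, prob = 0.5)

# P(X <= 3) is the sum of the probabilities of 0, 1, 2, and 3 successes
p3 <- pbinom(q = 3, size = 10, prob = 0.5)
all.equal(p3, sum(dbinom(0:3, size = 10, prob = 0.5)))

# smallest value whose cumulative probability is at least 0.5
q50 <- qbinom(p = 0.5, size = 10, prob = 0.5)

# five random draws from a Poisson distribution with mean 4
rpois(n = 5, lambda = 4)
```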

Drawing the Density Function

The density function dnorm() can be used to draw the graph of a normal distribution (the d-function of any other distribution works the same way). Let us compare two normal distributions, both with mean = 20, one with sd = 6 and the other with sd = 3.

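For this purpose, we need $x$-axis values that cover about three standard deviations on either side of the mean for the wider curve: $\overline{x} \pm 3SD \Rightarrow 20 \pm 3\times 6$, that is, from 2 to 38. The sequence below runs from 0 to 40 to leave a small margin.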

xaxis <- seq(0, 40, 0.5)
y1 <- dnorm(xaxis, 20, 6)
y2 <- dnorm(xaxis, 20, 3)

# draw the narrower (taller) curve first so that both curves fit in the plot
plot(xaxis, y2, type = "l", main = "Comparing two normal distributions", col = "blue")

lines(xaxis, y1, col = "red")
[Figure: Comparing Normal Probability Distributions in R]
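As an alternative sketch, the same comparison can be drawn with curve(), which evaluates the density over a range without building the x-axis vector by hand, and legend() can label the two curves. The legend position, labels, and axis label are choices made for this illustration.

```r
# taller curve (sd = 3) first so that the y-axis range fits both curves
curve(dnorm(x, mean = 20, sd = 3), from = 0, to = 40,
      col = "blue", ylab = "density",
      main = "Comparing two normal distributions")

# add = TRUE overlays the second curve on the existing plot
curve(dnorm(x, mean = 20, sd = 6), from = 0, to = 40,
      col = "red", add = TRUE)

legend("topright", legend = c("sd = 3", "sd = 6"),
       col = c("blue", "red"), lty = 1)

# halving sd doubles the peak height, so the sd = 3 curve is twice as tall at the mean
peak_ratio <- dnorm(20, 20, 3) / dnorm(20, 20, 6)
peak_ratio
```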

Finding Probabilities in R

Probabilities for the normal distribution can be computed in R using the pnorm() function. When no mean or sd is given, pnorm() uses the standard normal distribution (mean = 0, sd = 1).

#Left Tailed Probability
pnorm(1.96)

#Area between two Z-scores
pnorm(1.96) - pnorm(-1.96)

Finding Right-Tailed Probabilities

1 - pnorm(1.96)
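The right-tail probability can also be requested directly through the lower.tail argument of pnorm(), and the same function works for a non-standard normal distribution once mean and sd are supplied. A short sketch; the interval from 0 to 4 for an $N(2, 16)$ variable is an assumed example:

```r
# right tail of the standard normal: same value as 1 - pnorm(1.96)
right <- pnorm(1.96, lower.tail = FALSE)
all.equal(right, 1 - pnorm(1.96))

# P(0 <= X <= 4) for X ~ N(mean = 2, sd = 4)
mid_area <- pnorm(4, mean = 2, sd = 4) - pnorm(0, mean = 2, sd = 4)

# the same area on the Z scale: 0 and 4 lie half an sd either side of the mean
all.equal(mid_area, pnorm(0.5) - pnorm(-0.5))
```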

Solving a Real Problem

Suppose you took a standardized test that has a mean of 500 and a standard deviation of 100, and you scored 720. You are interested in your approximate percentile on this test.

To solve this problem, find the Z-score of 720, $z = (720 - 500)/100 = 2.2$, and then use pnorm( ) to find the percentile of your score.

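# scale() returns a 1 x 1 matrix holding the Z-score (720 - 500) / 100 = 2.2
zscore <- scale(x = 720, center = 500, scale = 100)

# each of the following calls returns the same probability
pnorm(2.2)
pnorm(zscore[1,1])
pnorm(zscore[1])
pnorm(zscore[1, ])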
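Standardizing by hand is optional, because pnorm() accepts the mean and standard deviation directly. The following sketch checks that both routes agree:

```r
# Z-score computed by hand
z <- (720 - 500) / 100

# percentile from the Z-score and directly from the raw score
p_z   <- pnorm(z)
p_raw <- pnorm(720, mean = 500, sd = 100)
all.equal(p_z, p_raw)

# expressed as a percentage, rounded to one decimal place
round(100 * p_raw, 1)   # about 98.6
```

A score of 720 therefore sits at roughly the 98.6th percentile.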


Some Descriptive Statistics in R: A Comprehensive R Tutorial

Descriptive Statistics in R

There are numerous functions in the R language that are used to compute descriptive statistics. Here, we will use the built-in mtcars dataset to get descriptive statistics in R; you can use a dataset of your own choice. To learn what descriptive statistics are, read the posts in the Basic Statistics section.

Getting Dataset Information in R

Before performing any descriptive or inferential statistics, it is better to get some basic information about the data. This helps you understand the mode (type) of each variable in the dataset.

# attach the mtcars dataset so that its columns can be used by name
attach(mtcars)

# data structure
str(mtcars)

You will see the dataset mtcars contains 32 observations and 11 variables.

It is also best to inspect the first and last rows of the dataset.

# for the first six rows
head(mtcars)

# for the last six rows
tail(mtcars)

Getting Numerical Descriptive Statistics in R

The summary( ) function gives a quick overview of the whole dataset. It can also be applied separately to each variable in the dataset.

summary(mtcars)
summary(mpg)
summary(gear)

Note that for a numeric variable the summary( ) function provides the five-number summary (minimum, first quartile, median, third quartile, and maximum) together with the mean. Note the difference between the outputs of the following code.

summary(cyl)
summary( factor(cyl) )

Remember that R chooses the appropriate descriptive statistics according to the data type of a variable. If a categorical variable is defined as a factor, the summary( ) function returns a frequency table instead of a numeric summary.

Individual statistics can also be computed with dedicated functions instead of the summary() function.

# average value
mean(mpg)
# median value
median(mpg)
# minimum value
min(mpg)
# maximum value
max(mpg)
# Quartiles, deciles, and other percentiles (probs must lie between 0 and 1)
quantile(mpg)
quantile(mpg, probs = c(0.1, 0.2, 0.3, 0.7, 0.9))
# variance and standard deviation
var(mpg)
sd(mpg)
# Inter-quartile range
IQR(mpg)
# Range
range(mpg)
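Several of these statistics are related to one another. The sketch below verifies two such relations on mpg; it refers to the column as mtcars$mpg so that it runs even without attach().

```r
mpg_values <- mtcars$mpg

# first and third quartiles (probs are proportions between 0 and 1)
q <- quantile(mpg_values, probs = c(0.25, 0.75))

# the inter-quartile range is the distance between the two quartiles
iqr_by_hand <- unname(q[2] - q[1])
all.equal(iqr_by_hand, IQR(mpg_values))

# the standard deviation is the square root of the variance
all.equal(sd(mpg_values), sqrt(var(mpg_values)))

# range() returns the minimum and maximum; diff() gives their distance
diff(range(mpg_values))
```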

Creating a Frequency Table in R

We can produce a frequency table and a relative frequency table for any categorical variable.

freq <- table(cyl); freq
rf <- prop.table(freq)

barplot(freq)
barplot(rf)
pie(freq)
pie(rf)
[Figure: Bar plot and pie chart]
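The relative frequencies always sum to 1, and they can be turned into percentages for reporting. A minimal sketch, using mtcars$cyl directly so that it does not rely on attach():

```r
freq <- table(mtcars$cyl)       # counts per number of cylinders
rf   <- prop.table(freq)        # proportions
sum(rf)                         # proportions add up to 1

# percentages rounded to one decimal place
pct <- round(100 * rf, 1)
pct

# counts and percentages side by side
cbind(count = freq, percent = pct)
```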

Creating a Contingency Table (Cross-Tabulation)

A contingency table can be used to summarize the relationship between two categorical variables. The xtabs( ) or table( ) functions can be used to produce a cross-tabulation (contingency table).

xtabs(~cyl + gear, data = mtcars)
table(cyl, gear)
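Row and column totals, or row-wise proportions, are often wanted alongside the raw counts. A sketch using addmargins() and the margin argument of prop.table():

```r
tab <- xtabs(~ cyl + gear, data = mtcars)

# append row and column totals
addmargins(tab)

# proportions within each row (margin = 1), so every row sums to 1
row_prop <- prop.table(tab, margin = 1)
round(row_prop, 2)
rowSums(row_prop)
```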

Finding a Correlation between Variables

The cor( ) function can be used to find the degree of relationship between variables using Pearson’s method.

cor(mpg, wt)

However, if variables are heavily skewed, the non-parametric method Spearman’s correlation can be used.

cor(mpg, wt, method = "spearman")

A scatter plot can be drawn using the plot( ) function.

plot(mpg ~ wt)
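To judge whether a correlation differs from zero, cor.test() reports a test statistic, a p-value, and a confidence interval along with the estimate. A sketch for the same pair of variables, written with mtcars$ so that it does not depend on attach():

```r
# Pearson correlation with a significance test
res <- cor.test(mtcars$mpg, mtcars$wt)
res$estimate    # the same value as cor(mpg, wt)
res$p.value
res$conf.int    # 95% confidence interval by default

# correlation matrix for several variables at once
round(cor(mtcars[, c("mpg", "wt", "hp")]), 2)
```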

FAQs about Descriptive Statistics in R

  1. How to check the data types of different variables/columns in R?
  2. What is the use of str() function in R?
  3. What is the use of head() and tail() functions in R?
  4. How can numerical statistics of a variable or dataset be obtained in R?
  5. What is the use of summary() function in R?
  6. How can the summary() function be used to obtain descriptive statistics of a categorical variable in R?
  7. How to produce a frequency table in R?
  8. What is the use of xtabs() function in R?
  9. What is the use of the cor(), plot(), and pie() functions? Explain with the help of examples.
  10. What functions are used to compute the mean, median, standard deviation, variance, quantiles, and IQR?


lm Function in R: A Comprehensive Guide

Introduction to lm Function in R

Many generic functions are available for working with a fitted regression model, for example, for extracting and testing the coefficients, computing the residuals, and obtaining predicted values. Therefore, a good grasp of the lm() function and the object it returns is necessary. It is assumed that you already know how to perform a regression analysis using the lm function.

mod <- lm(mpg ~ hp, data = mtcars)

To learn about performing linear regression analysis using the lm function, you can visit the article “Performing Linear Regression in R”.

Objects of “lm” Class

The object returned by the lm() function has class “lm”, and its mode is list.

class(mod)

The names of the components stored in an “lm” object can be queried via

names(mod)

All the components of an “lm” object can be accessed directly. For example,

mod$rank

mod$coef   # or mod$coefficients
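Direct access with $ works, but the extractor functions are generally preferred because they do not depend on the internal layout of the object. A short sketch comparing the two; the model is refitted here so that the block runs on its own:

```r
mod <- lm(mpg ~ hp, data = mtcars)

# the object is a list underneath
is.list(mod)

# $coef relies on partial matching of the full name "coefficients"
identical(mod$coef, mod$coefficients)

# coef() returns the same named vector
identical(coef(mod), mod$coefficients)

# rank 2: one intercept plus one slope
mod$rank
```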

Generic Functions of “lm” model

The following is the list of some generic functions for the fitted “lm” model.

Generic Function                 Short Description
print()                          print or display the results in the R Console
summary()                        print or display regression coefficients, their standard errors, t-ratios, p-values, and significance
coef()                           extract regression coefficients
residuals() or resid()           extract residuals of the fitted model
fitted() or fitted.values()      extract fitted values
anova()                          perform comparisons of nested models
predict()                        compute predicted values for new data
plot()                           draw diagnostic plots of the regression model
confint()                        compute the confidence intervals for regression coefficients
deviance()                       compute the residual sum of squares
vcov()                           compute the estimated variance-covariance matrix
logLik()                         compute the log-likelihood
AIC(), BIC()                     compute information criteria
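A few of these generics applied to a fitted model, as a sketch. The model is refitted here so that the block runs on its own.

```r
mod <- lm(mpg ~ hp, data = mtcars)

coef(mod)              # intercept and slope
head(resid(mod), 3)    # first three residuals
head(fitted(mod), 3)   # first three fitted values

# deviance() of an lm fit is the residual sum of squares
rss <- deviance(mod)
all.equal(rss, sum(resid(mod)^2))

# observed values = fitted values + residuals
all.equal(unname(fitted(mod) + resid(mod)), mtcars$mpg)

# standard errors are the square roots of the diagonal of vcov()
sqrt(diag(vcov(mod)))

AIC(mod); BIC(mod)
```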

It is better to save the object returned by the summary() function so that its components can be reused.

The summary() function returns an object of class “summary.lm”, and the names of its components can be queried via

sum_mod <- summary(mod)

names(sum_mod)
names( summary(mod) )
[Figure: lm class objects]

The components of the summary object can be extracted as

sum_mod$residuals
sum_mod$r.squared
sum_mod$adj.r.squared
sum_mod$df
sum_mod$sigma
sum_mod$fstatistic
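The coefficient table is itself a component of the summary object, stored as a matrix. The sketch below extracts it and checks two identities that hold for a simple linear regression; the model is refitted so that the block runs on its own.

```r
mod <- lm(mpg ~ hp, data = mtcars)
sum_mod <- summary(mod)

# matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coef_tab <- sum_mod$coefficients
coef_tab["hp", "Pr(>|t|)"]    # p-value of the slope

# with one regressor, R-squared equals the squared correlation
all.equal(sum_mod$r.squared, cor(mtcars$mpg, mtcars$hp)^2)

# sigma is the residual standard error: sqrt(RSS / residual df)
all.equal(sum_mod$sigma, sqrt(deviance(mod) / df.residual(mod)))
```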

Computation and Visualization of Prediction and Confidence Interval

The confidence interval for estimated coefficients can be computed as

confint(mod, level = 0.95)

Note that the level argument is optional when the confidence level is 95% (a 5% significance level), because 0.95 is the default.

The confidence interval for the mean response and the prediction interval for an individual response, at hp (regressor) equal to 200 and 160, can be computed as

predict(mod, newdata=data.frame(hp = c(200, 160)), interval = "confidence" )
predict(mod, newdata=data.frame(hp = c(200, 160)), interval = "prediction" )
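Both calls return a matrix with columns fit, lwr, and upr. The prediction interval is wider than the confidence interval at the same hp, because it also accounts for the scatter of individual observations around the regression line. A sketch that compares the widths (the object names are chosen for this illustration):

```r
mod <- lm(mpg ~ hp, data = mtcars)
new_hp <- data.frame(hp = c(200, 160))

conf_int <- predict(mod, newdata = new_hp, interval = "confidence")
pred_int <- predict(mod, newdata = new_hp, interval = "prediction")

# the point predictions are identical
all.equal(conf_int[, "fit"], pred_int[, "fit"])

# interval widths: prediction is wider than confidence at each hp
conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]
pred_width > conf_width
```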

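The prediction intervals can be used for computing and visualizing prediction bands around the fitted line (use interval = "confidence" to draw confidence bands instead). For example,

x <- seq(50, 350, length.out = 32)
# the column in newdata must carry the regressor's name (hp)
pred <- predict(mod, newdata = data.frame(hp = x), interval = "prediction")

plot(mpg ~ hp, data = mtcars)
lines(pred[,1] ~ x, col = 1) # fitted values
lines(pred[,2] ~ x, col = 2) # lower limit
lines(pred[,3] ~ x, col = 2) # upper limit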
[Figure: Visualization of prediction intervals and confidence band]

Regression Diagnostics

For diagnostic plots, the plot() function can be used. By default, it provides four graphs:

  • residuals vs fitted values
  • normal Q-Q plot of standardized residuals
  • scale-location plot of the square root of the absolute standardized residuals against fitted values
  • standardized residuals vs leverage
[Figure: Diagnostic plots of the model from the lm function]

To draw only the Q-Q plot, for example, use

plot(mod, which = 2)

The which argument selects which of the diagnostic plots are drawn.
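When the plots are produced in one call, they are drawn one after another. To see them together, the plotting area can be split into a 2 × 2 grid with par(). A sketch, with the model refitted so that the block runs on its own:

```r
mod <- lm(mpg ~ hp, data = mtcars)

# split the device into 2 rows and 2 columns, remembering the old settings
old_par <- par(mfrow = c(2, 2))
plot(mod)            # the four default diagnostic plots
par(old_par)         # restore the previous layout

# which also accepts a vector, e.g. residuals vs fitted and the Q-Q plot
plot(mod, which = c(1, 2))
```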

FAQs about the lm() Function in R

  1. What is the use of lm() function in R?
  2. What is the class of the object returned by lm(), and what are the names of its components?
  3. Describe the generic functions for the object of class lm.
  4. What are the important components of a summary.lm object?
  5. How can the components of a summary.lm object be accessed?
  6. How can confidence and prediction intervals be visualized in R for linear models?
  7. How are regression diagnostics performed in the R language?
  8. What is the use of the confint(), fitted(), coef(), anova(), vcov(), deviance(), and residuals() generic functions?
