Exploring Data Distribution in R

Suppose we have univariate data and need to examine its distribution. A variety of tools and techniques are available for exploring univariate distributions. The simplest is to look at the numbers themselves: summary() and fivenum() give numerical summaries, while stem() prints a stem-and-leaf display of the values. This post covers the basics of exploring data distributions in the R language.

Five Number Summary and Stem and Leaf Plot

One can use numeric and visual tools in exploring data distribution. For example,

attach(faithful)
summary(eruptions)

## Output
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.600   2.163   4.000   3.488   4.454   5.100 

fivenum(eruptions)

## Output
 1.6000 2.1585 4.0000 4.4585 5.1000
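
The quartiles reported by summary() and fivenum() differ slightly (2.163 versus 2.1585 for the lower quartile). A minimal sketch of why, assuming default settings: summary() relies on quantile() with its default interpolation rule, whereas fivenum() returns Tukey's hinges. The variable names below are illustrative only.

```r
# Compare the two quartile definitions on the same data
x <- faithful$eruptions

q_default <- quantile(x, probs = c(0.25, 0.75))  # rule used by summary()
hinges    <- fivenum(x)[c(2, 4)]                 # Tukey's lower and upper hinges

q_default
hinges
```

For most data sets the two definitions agree closely, so either is fine for a first look at the distribution.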

stem(eruptions)
[Figure: stem-and-leaf display of eruptions]

Histogram and Density Plot

A stem-and-leaf display is much like a histogram, and R draws histograms with the hist() function. The boxplot() function can also be used to visualize the distribution of the data.
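
As a rough sketch of the boxplot() option, the call below draws a horizontal box plot of the eruption durations; the axis label and title are illustrative choices. Note that a box plot summarizes the data in five numbers, so it may hide features such as two separate modes that a histogram would reveal.

```r
# Horizontal box plot of eruption durations
boxplot(faithful$eruptions, horizontal = TRUE,
        xlab = "Eruption duration (minutes)",
        main = "Box plot of eruptions")

# The five numbers the box is drawn from:
# lower whisker, lower hinge, median, upper hinge, upper whisker
bstats <- boxplot.stats(faithful$eruptions)$stats
bstats
```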

hist(eruptions)

# make the bins smaller, and make a plot of density
hist(eruptions, seq(1.6, 5.2, 0.2), prob=TRUE)
lines(density(eruptions, bw=0.1))
rug(eruptions) # Show the actual data points
[Figure: histogram of eruptions with density curve and rug]

More elegant density plots can be made with density(), which produced the line added to the histogram above. The bandwidth bw was chosen by trial and error, because the default gives too much smoothing (it usually does for “interesting” densities). Better automated methods of bandwidth selection are also available; in the above example, bw = "SJ" gives good results.
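
A minimal sketch of the automated choice mentioned above, overlaying the hand-picked bandwidth and the "SJ" (Sheather-Jones) bandwidth on the same histogram. The line types and title are illustrative choices only.

```r
x <- faithful$eruptions

hist(x, breaks = seq(1.6, 5.2, 0.2), prob = TRUE,
     main = "Hand-picked bandwidth vs. bw = \"SJ\"")
lines(density(x, bw = 0.1), lty = 2)   # bandwidth chosen by trial and error

d_sj <- density(x, bw = "SJ")          # bandwidth chosen automatically
lines(d_sj)

d_sj$bw                                # the bandwidth that was selected
```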

Empirical Cumulative Distribution Function

One can also plot the empirical cumulative distribution function by using the function ecdf.

plot(ecdf(eruptions), do.points = FALSE, verticals = TRUE)
[Figure: empirical CDF of eruptions]

For the right-hand mode (eruptions of longer than 3 minutes), let us fit a normal distribution and overlay the fitted CDF.

long <- eruptions[eruptions > 3]   # eruptions longer than 3 minutes
plot(ecdf(long), do.points = FALSE, verticals = TRUE)
x <- seq(3, 5.4, 0.01)
lines(x, pnorm(x, mean = mean(long), sd = sd(long)), lty = 3)   # fitted normal CDF
[Figure: empirical CDF of long with fitted normal CDF]
par(pty = "s")
qqnorm(long)
qqline(long)
[Figure: normal Q-Q plot of long]

The quantile-quantile (Q-Q) plot of long shows a reasonable fit, but a shorter right tail than one would expect from a normal distribution. One can compare it with some simulated data from a t distribution.

x <- rt(250, df = 5)
qqnorm(x)
qqline(x)

This will usually show longer tails than expected for a normal distribution, because the sample is drawn at random from a t distribution.

[Figure: normal Q-Q plot of a simulated t sample with longer tails]

Normality Test in R

To test whether the data follow a normal distribution, one can use the Shapiro-Wilk normality test:

shapiro.test(eruptions)
## Output
		Shapiro-Wilk normality test

data:  eruptions
W = 0.84592, p-value = 9.036e-16

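The very small p-value rejects normality for eruptions, which is unsurprising for bimodal data. The Kolmogorov-Smirnov test, available through the ks.test() function, can also be used to check whether the data follow a normal distribution. Called with "pnorm" alone, as below, it compares the data with the standard normal distribution (mean 0, sd 1), which explains the very large D statistic; supply mean and sd arguments to test against a different normal distribution.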

ks.test(eruptions, "pnorm")

## Output
        Asymptotic one-sample Kolmogorov-Smirnov test

data:  eruptions
D = 0.94857, p-value < 2.2e-16
alternative hypothesis: two-sided

Warning message:
In ks.test.default(eruptions, "pnorm") :
  ties should not be present for the one-sample Kolmogorov-Smirnov test
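
As a hedged sketch, the test can instead compare the data with a normal distribution whose mean and standard deviation are taken from the sample. Because these parameters are estimated from the same data, the reported p-value is only approximate, so treat it as a rough guide. The name ks_fit is illustrative.

```r
x <- faithful$eruptions

# Compare against N(mean(x), sd(x)) rather than the standard normal.
# suppressWarnings() hides the warning about ties in the data.
ks_fit <- suppressWarnings(
  ks.test(x, "pnorm", mean = mean(x), sd = sd(x))
)
ks_fit
```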

By combining the above techniques, exploring data distribution helps in gaining valuable insights into the distribution of univariate data, identifying potential outliers, and assessing normality assumptions for further statistical analysis.

Binomial Random Numbers Generation in R

This tutorial shows how to generate Bernoulli or binomial random numbers in R, using the flip of a coin as the example. R can generate random numbers from many statistical probability distributions; the focus here is on the binomial distribution.

Binomial Random Numbers in R

In the Bernoulli distribution, an event either happens or does not: a coin flip, for example, has two outcomes, head or tail. For an unbiased coin, heads and tails each occur about 50% of the time in the long run. To generate binomial random numbers in R, use the rbinom(n, size, prob) command.

rbinom(n, size, prob) # the command has three parameters

where
‘$n$’ is the number of observations
‘$size$’ is the number of trials (it may be zero or more)
‘$prob$’ is the probability of success on each trial, for example 1/2
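
Because the draws are random, the output changes from run to run. A minimal sketch of making a simulation reproducible with set.seed(), assuming reproducibility is wanted; the seed value 123 and the name flips are arbitrary.

```r
set.seed(123)   # arbitrary seed, chosen only for illustration
flips <- rbinom(n = 10, size = 1, prob = 1/2)   # ten single coin flips (1 = success)
flips

set.seed(123)   # resetting the same seed reproduces the same draws
identical(flips, rbinom(n = 10, size = 1, prob = 1/2))
```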

Examples of Generating Binomial Random Numbers

  • One coin is tossed 10 times with a probability of success=0.5
    (a fair, unbiased coin, as p=1/2)
    rbinom(n=10, size=1, prob=1/2)
    OUTPUT: 1 1 0 0 1 1 1 1 0 1
  • Two coins are tossed 10 times with a probability of success=0.5
    rbinom(n=10, size=2, prob=1/2)
    OUTPUT: 2 1 2 1 2 0 1 0 0 1
  • One coin is tossed one hundred thousand times with a probability of success=0.5
    rbinom(n=100000, size=1, prob=1/2)
  • Five coins are tossed one hundred thousand times; store the simulation results in the vector $x$
    x <- rbinom(n=100000, size=5, prob=1/2)
    total number of successes (heads) across all tosses
    sum(x)
    find the frequency distribution
    table(x)
    create a relative frequency table (in percent)
    t <- table(x) / length(x) * 100
    plot the frequency distribution
    plot(table(x), ylab = "Frequency", main = "size=5,prob=0.5")
[Figure: frequency distribution of simulated binomial random numbers, size=5, prob=0.5]
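
One way to check such a simulation is to compare the simulated proportions with the theoretical binomial probabilities from dbinom(). This is a sketch only; the seed and the names simulated and theoretical are illustrative.

```r
set.seed(1)   # arbitrary seed for reproducibility
x <- rbinom(n = 100000, size = 5, prob = 1/2)

# factor() keeps all six outcomes 0..5 even if one never occurs
simulated   <- as.numeric(table(factor(x, levels = 0:5))) / length(x)
theoretical <- dbinom(0:5, size = 5, prob = 1/2)

round(cbind(outcome = 0:5, simulated, theoretical), 4)
```

With this many draws, the two columns should agree to about two decimal places.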

Probability Distributions in R: A Comprehensive Tutorial

This article discusses probability distributions in the R language.

We often make probabilistic statements when working with statistical probability distributions. We usually want to know four things:

  • The density (PDF) at a particular value,
  • The cumulative distribution (CDF) at a particular value,
  • The quantile value corresponding to a particular probability, and
  • A random draw of values from a particular distribution.

Probability Distributions in R Language

The R language has built-in functions for obtaining the density, the cumulative distribution, the quantiles, and random draws for many distributions.

Consider a random variable $X$ which is $N(\mu = 2, \sigma^2 = 16)$. We want to:

1) Calculate the value of PDF at $x=3$ (that is, the height of the curve at $x=3$)

# the three calls below are equivalent
dnorm(x = 3, mean = 2, sd = sqrt(16))
dnorm(x = 3, mean = 2, sd = 4)
dnorm(x = 3, 2, 4)

2) Calculate the value of the CDF at $x=3$ (that is, $P(X\le 3)$)

pnorm(q = 3, mean = 2, sd = 4)

3) Calculate the quantile for probability 0.975

qnorm(p = 0.975, mean = 2, sd = 4)

4) Generate a random sample of size $n = 10$

rnorm(n = 10, mean = 2, sd = 4)
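
The CDF and the quantile function are inverses of each other, which gives a quick sanity check on the calls above. A short sketch for the same $N(\mu = 2, \sigma^2 = 16)$ variable; the variable names are illustrative.

```r
p      <- pnorm(q = 3, mean = 2, sd = 4)   # P(X <= 3)
x_back <- qnorm(p, mean = 2, sd = 4)       # recovers 3, up to rounding error
x_back

# Standardizing first gives the same probability from the standard normal
p_std <- pnorm((3 - 2) / 4)
all.equal(p, p_std)
```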

There are many probability distributions available in the R Language. I will list only a few.

Distribution   Density     Quantile    CDF         Random
Binomial       dbinom()    qbinom()    pbinom()    rbinom()
t              dt()        qt()        pt()        rt()
Poisson        dpois()     qpois()     ppois()     rpois()
F              df()        qf()        pf()        rf()
Chi-Square     dchisq()    qchisq()    pchisq()    rchisq()

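Observe that each function name is the distribution's name in R with a prefix added: d for the density, p for the cumulative distribution function, q for the quantile function, and r for random number generation.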

Distribution   Name in R   Parameters
Binomial       binom       size = number of trials, prob = probability of success for one trial
Geometric      geom        prob = probability of success for one trial
Poisson        pois        lambda = mean
Beta           beta        shape1, shape2
Chi-Square     chisq       df = degrees of freedom
F              f           df1, df2 = degrees of freedom
Logistic       logis       location, scale
Normal         norm        mean, sd
Student’s t    t           df = degrees of freedom
Weibull        weibull     shape, scale
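
A brief sketch of how the prefixes combine with the distribution names and parameters listed above. The particular values are arbitrary illustrations.

```r
d_val <- dbinom(3, size = 10, prob = 0.5)   # P(X = 3) for X ~ Binomial(10, 0.5)
p_val <- ppois(2, lambda = 4)               # P(X <= 2) for X ~ Poisson(4)
q_val <- qchisq(0.95, df = 1)               # 95th percentile of chi-square with 1 df
r_val <- rt(5, df = 10)                     # five random draws from t with 10 df

d_val; p_val; q_val; r_val
```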

Drawing the Density Function

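The density function dnorm() can be used to draw a graph of the normal distribution (the d-function of any other distribution works the same way). Let us compare two normal distributions, both with mean = 20, one with sd = 6 and the other with sd = 3.

For this purpose, we need $x$-axis values covering about three standard deviations on either side of the mean for the wider curve: $\overline{x} \pm 3SD \Rightarrow 20 \pm 3\times 6$, that is, from 2 to 38. A grid from 0 to 40 covers this range.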

xaxis <- seq(0, 40, 0.5)
y1 <- dnorm(xaxis, 20, 6)
y2 <- dnorm(xaxis, 20, 3)

plot(xaxis, y2, type = "l", main = "comparing two normal distributions", col = "blue")

lines(xaxis, y1, col = "red")
[Figure: comparing two normal probability distributions in R]

Finding Probabilities in R

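For the normal distribution, probabilities can be computed with the pnorm() function. When no mean or sd is supplied, pnorm() uses the standard normal distribution, so its argument is a Z-score.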

#Left Tailed Probability
pnorm(1.96)

#Area between two Z-scores
pnorm(1.96) - pnorm(-1.96)

Finding Right-Tailed Probabilities

1 - pnorm(1.96)
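
An alternative sketch for the right tail uses the lower.tail argument of pnorm(), which avoids the subtraction from 1. The names right_a and right_b are illustrative.

```r
right_a <- 1 - pnorm(1.96)
right_b <- pnorm(1.96, lower.tail = FALSE)
all.equal(right_a, right_b)

# By symmetry of the normal curve, this equals the left tail below -1.96
pnorm(-1.96)
```

For values far out in the tail, lower.tail = FALSE is generally the more accurate choice, because 1 - pnorm(z) loses precision when pnorm(z) is very close to 1.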

Solving a Real Problem

Suppose you took a standardized test that has a mean of 500 and a standard deviation of 100, and you scored 720. You are interested in your approximate percentile on this test.

To solve this problem, you have to find the Z-score of 720 and then use the pnorm( ) to find the percentile of your score.

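zscore <- scale(x = 720, center = 500, scale = 100)  # 1 x 1 matrix containing 2.2

# each call below returns the same percentile, about 0.986
pnorm(2.2)
pnorm(zscore[1,1])
pnorm(zscore[1])
pnorm(zscore[1, ])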
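
The Z-score step can also be skipped by passing the mean and standard deviation to pnorm() directly. A minimal sketch, with pct and score_back as illustrative names; the second call answers the reverse question of which score sits at a given percentile.

```r
pct <- pnorm(720, mean = 500, sd = 100)   # same value as pnorm(2.2)
pct

# Reverse question: which score corresponds to that percentile?
score_back <- qnorm(pct, mean = 500, sd = 100)
score_back
```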
