Exploring Data Distribution in R Language
Suppose we have univariate data and need to examine its distribution. A variety of tools and techniques are available for this. The simplest is to look at the numbers: summary() and fivenum() give numerical summaries, while stem() displays the values themselves as a stem-and-leaf plot. This post covers the basics of exploring data distribution in the R language.
Five Number Summary and Stem and Leaf Plot
One can use numeric and visual tools in exploring data distribution. For example,
attach(faithful)
summary(eruptions)
## Output
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.600   2.163   4.000   3.488   4.454   5.100
fivenum(eruptions)
## Output
1.6000 2.1585 4.0000 4.4585 5.1000
stem(eruptions)
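The first and third quartiles reported by summary() differ slightly from the matching values of fivenum() because the two functions use different definitions: summary() relies on quantile() with its default interpolation, while fivenum() returns Tukey's hinges. A minimal sketch that places the two side by side (the variable names are illustrative, and faithful$eruptions is used directly so the block does not depend on attach()):

```r
x <- faithful$eruptions  # same data as above, without attach()

quartiles <- quantile(x, probs = c(0.25, 0.75))  # definition used by summary()
hinges <- fivenum(x)[c(2, 4)]                    # Tukey's lower and upper hinges

# One row per definition: first column is the lower value, second the upper
rbind(quartiles = unname(quartiles), hinges = hinges)
```

For most data sets the two definitions agree closely, so either is fine for a quick look at the spread.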
Histogram and Density Plot
A stem-and-leaf display is much like a histogram, which R draws with the hist() function. The boxplot() function offers another way to visualize the distribution of the data.
hist(eruptions)
# make the bins smaller, and make a plot of density
hist(eruptions, seq(1.6, 5.2, 0.2), prob=TRUE)
lines(density(eruptions, bw=0.1))
rug(eruptions) # Show the actual data points
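As a companion to the histogram, boxplot() can be called on the same data. The following is a minimal sketch using base graphics only; the horizontal layout and the axis label are illustrative choices, not something the post prescribes.

```r
eruptions <- faithful$eruptions

# Horizontal box plot: the box spans the hinges, the inner line marks the median
bp <- boxplot(eruptions, horizontal = TRUE, xlab = "Eruption time (minutes)")
rug(eruptions)  # show the actual data points along the axis

bp$stats  # the five values used to draw the whiskers, box, and median
```

Note that a box plot summarizes the data with five numbers, so it cannot reveal the two modes that the histogram shows.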
The density() function can produce more elegant density plots; the curve added to the histogram in this example comes from density(). The bandwidth bw was chosen by trial and error because the default gives too much smoothing (it usually does for “interesting” densities). Better automated methods of bandwidth selection are also available; in this example, bw = "SJ" gives good results.
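To see the effect of the bandwidth, the default estimate and the bw = "SJ" estimate can be overlaid on the same histogram. This is a sketch under the same settings as the example above; the object names d_default and d_sj are illustrative.

```r
eruptions <- faithful$eruptions

d_default <- density(eruptions)             # default bandwidth rule
d_sj      <- density(eruptions, bw = "SJ")  # Sheather-Jones bandwidth selector

hist(eruptions, breaks = seq(1.6, 5.2, 0.2), prob = TRUE)
lines(d_default, lty = 2)  # dashed: default bandwidth, smoother curve
lines(d_sj)                # solid: "SJ" bandwidth
rug(eruptions)

# Compare the bandwidths actually used
c(default = d_default$bw, SJ = d_sj$bw)
```

A smaller bandwidth follows the two peaks more closely, at the cost of a wigglier curve.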
Empirical Cumulative Distribution Function
One can also plot the empirical cumulative distribution function using the ecdf() function.
plot(ecdf(eruptions), do.points = FALSE, verticals = TRUE)
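ecdf() returns a function, so the estimated distribution can also be queried numerically rather than only plotted. A small sketch (Fn is an illustrative name):

```r
eruptions <- faithful$eruptions
Fn <- ecdf(eruptions)  # Fn is itself a function of x

Fn(3)                 # proportion of eruptions lasting 3 minutes or less
mean(eruptions <= 3)  # the same proportion computed directly

quantile(Fn, c(0.25, 0.5, 0.75))  # quantiles read off the ECDF
```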
For the right-hand mode (eruptions of longer than 3 minutes), let us fit a normal distribution and overlay the fitted CDF.
long <- eruptions[eruptions > 3]
plot(ecdf(long), do.points = FALSE, verticals = TRUE)
x <- seq(3, 5.4, 0.01)
lines(x, pnorm(x, mean = mean(long), sd = sqrt(var(long))), lty = 3)
A quantile-quantile (Q-Q) plot allows a closer look at the fit:
par(pty = "s") # use a square plotting region
qqnorm(long)
qqline(long)
The Q-Q plot of long shows a reasonable fit but a shorter right tail than one would expect from a normal distribution. One can compare it with some simulated data from a t-distribution.
x <- rt(250, df = 5)
qqnorm(x)
qqline(x)
Because x is a random sample from a t-distribution, this plot will usually show longer tails than expected for a normal distribution.
Normality Test in R
Formal tests can assess whether the data are consistent with a normal distribution. The Shapiro-Wilk normality test, run with the shapiro.test() function, tests the null hypothesis that the data come from a normal distribution.
shapiro.test(eruptions)
## Output
        Shapiro-Wilk normality test

data:  eruptions
W = 0.84592, p-value = 9.036e-16
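The very small p-value is strong evidence against normality for the full eruptions data, which has two modes. The Kolmogorov-Smirnov test, run with the ks.test() function, compares the data with a fully specified distribution. Note that "pnorm" with no further arguments means the standard normal distribution (mean 0, standard deviation 1), so the call below tests against that distribution rather than a normal distribution fitted to the data.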
ks.test(eruptions, "pnorm")
## Output
        Asymptotic one-sample Kolmogorov-Smirnov test

data:  eruptions
D = 0.94857, p-value < 2.2e-16
alternative hypothesis: two-sided

Warning message:
In ks.test.default(eruptions, "pnorm") :
  ties should not be present for the one-sample Kolmogorov-Smirnov test
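Since ks.test(eruptions, "pnorm") compares the data with a standard normal distribution, a more informative check passes the mean and standard deviation explicitly. The sketch below applies both tests to the long eruptions only, in the spirit of the normal fit to long shown earlier. The object names are illustrative, and because the parameters are estimated from the same data, the Kolmogorov-Smirnov p-value should be read as approximate.

```r
# Long eruptions only (the right-hand mode), as in the ECDF example
eruptions <- faithful$eruptions
long <- eruptions[eruptions > 3]

# Shapiro-Wilk needs no distribution parameters
sw_long <- shapiro.test(long)

# Kolmogorov-Smirnov against a normal with the sample mean and sd;
# the extra arguments are passed on to pnorm(). A warning about ties
# is expected because the recorded durations are rounded.
ks_long <- ks.test(long, "pnorm", mean = mean(long), sd = sd(long))

sw_long
ks_long
```

Comparing these results with the tests on the full data shows how much of the apparent non-normality comes from mixing the two groups of eruptions.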
Combining these techniques gives valuable insight into the distribution of univariate data, helps identify potential outliers, and allows normality assumptions to be assessed before further statistical analysis.