Exploring Data Distribution in R

Suppose we have univariate data and want to examine its distribution. Many tools are available for this, and the simplest is to look at the numbers themselves. The summary() and fivenum() functions give numerical summaries, while stem() prints a stem-and-leaf display that shows the shape of the distribution. This post covers the basics of exploring data distributions in R.

Five Number Summary and Stem and Leaf Plot

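Numeric and visual tools complement each other when exploring a distribution. The examples below use the eruptions variable (eruption durations in minutes) from the built-in faithful data set.

attach(faithful)      # make the columns of faithful available by name
summary(eruptions)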

## Output
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.600   2.163   4.000   3.488   4.454   5.100 

fivenum(eruptions)

## Output
[1] 1.6000 2.1585 4.0000 4.4585 5.1000

stem(eruptions)

Figure: Stem-and-leaf display of the eruptions data
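
The quartiles reported by summary() and fivenum() above differ slightly (2.163 versus 2.1585). A minimal sketch of why, assuming the default settings of both functions: summary() takes its quartiles from quantile(), which interpolates between order statistics, while fivenum() returns Tukey's hinges. The variable names x, q_summary, and hinges are illustrative only.

```r
# Use the column directly so the example does not depend on attach()
x <- faithful$eruptions

# Quartiles as used by summary(): quantile() with its default type = 7
q_summary <- unname(quantile(x, probs = c(0.25, 0.75)))

# Tukey's hinges: elements 2 and 4 of the five-number summary
hinges <- fivenum(x)[c(2, 4)]

print(q_summary)
print(hinges)
```

For most practical purposes the two definitions agree closely; the difference matters mainly when results must be reproduced digit for digit.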

Histogram and Density Plot

A stem-and-leaf display is much like a histogram, which R draws with the hist() function. The boxplot() function gives another compact view of the distribution.
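
The boxplot mentioned above can be sketched as follows. This assumes the faithful data set used throughout this post; the object name box_stats is illustrative. boxplot.stats() returns the numbers behind the plot, and its hinges are the same values that fivenum() reports.

```r
x <- faithful$eruptions

# Horizontal boxplot of eruption durations
boxplot(x, horizontal = TRUE, xlab = "Eruption duration (minutes)")

# The values used to draw the box and whiskers:
# lower whisker, lower hinge, median, upper hinge, upper whisker
box_stats <- boxplot.stats(x)$stats
print(box_stats)
```

Note that a boxplot summarizes the data through its quartiles, so it does not reveal the two modes that a histogram of this data set shows.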

hist(eruptions)   # default bins, counts on the y-axis

# Make the bins smaller and plot densities instead of counts
hist(eruptions, breaks = seq(1.6, 5.2, 0.2), prob = TRUE)
lines(density(eruptions, bw = 0.1))   # overlay a kernel density estimate
rug(eruptions)                        # show the actual data points

Figure: Histogram of eruptions with a kernel density estimate and rug

The density() function produces more elegant density plots, and the line added in this example comes from density(). The bandwidth bw = 0.1 was chosen by trial and error because the default gives too much smoothing (it usually does for “interesting” densities). Better automated methods of bandwidth selection are also available; in this example, bw = "SJ" gives a good result.
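
As a rough sketch of the automated alternative mentioned above, the following compares the default bandwidth with the Sheather-Jones ("SJ") choice. bw.nrd0() and bw.SJ() are the selectors behind density()'s default and its bw = "SJ" option; the names bw_default and bw_sj are illustrative.

```r
x <- faithful$eruptions

bw_default <- bw.nrd0(x)  # rule of thumb used by density() by default
bw_sj <- bw.SJ(x)         # Sheather-Jones selector, same as bw = "SJ"
print(c(default = bw_default, SJ = bw_sj))

# Histogram with both density estimates overlaid
hist(x, breaks = seq(1.6, 5.2, 0.2), prob = TRUE)
lines(density(x), lty = 2)             # default bandwidth (dashed)
lines(density(x, bw = "SJ"), lty = 1)  # Sheather-Jones bandwidth (solid)
rug(x)
```

A smaller bandwidth means less smoothing, which is what the two modes of this data set need.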

Empirical Cumulative Distribution Function

The empirical cumulative distribution function (ECDF) can be plotted with the ecdf() function.

plot(ecdf(eruptions), do.points = FALSE, verticals = TRUE)
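
It may help to know that ecdf() returns a function rather than a plot. A small sketch, assuming the faithful data set from above (Fn and p_short are illustrative names): evaluating the function at a value gives the proportion of observations less than or equal to it.

```r
x <- faithful$eruptions

Fn <- ecdf(x)  # Fn is itself a function

# Proportion of eruptions lasting at most 3 minutes
p_short <- Fn(3)
print(p_short)

# The same proportion computed directly from the data
print(mean(x <= 3))
```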

For the right-hand mode (eruptions longer than 3 minutes), let us fit a normal distribution and overlay the fitted CDF.

long <- eruptions[eruptions > 3]   # keep the right-hand mode only
plot(ecdf(long), do.points = FALSE, verticals = TRUE)
x <- seq(3, 5.4, 0.01)
# Fitted normal CDF, drawn as a dotted line
lines(x, pnorm(x, mean = mean(long), sd = sd(long)), lty = 3)

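A normal quantile-quantile plot helps to examine the fit more carefully.

par(pty = "s")   # square plotting region
qqnorm(long)     # normal quantile-quantile plot of the long eruptions
qqline(long)     # reference line through the first and third quartiles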

The quantile-quantile (Q-Q) plot of long shows a reasonable fit but a shorter right tail than one would expect from a normal distribution. For comparison, consider some simulated data from a t distribution.

x <- rt(250, df = 5)
qqnorm(x)
qqline(x)

Because this is a random sample from a t distribution with 5 degrees of freedom, the plot will usually show longer tails than expected for a normal distribution.
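
To make the comparison reproducible, one option is to fix the random seed and draw a normal sample of the same size next to the t sample. This is a sketch; the seed value and the names x_t and x_norm are arbitrary choices, not part of the original example.

```r
set.seed(123)  # arbitrary seed, only for reproducibility

x_t <- rt(250, df = 5)  # heavy-tailed sample
x_norm <- rnorm(250)    # normal sample for reference

# Side-by-side Q-Q plots
op <- par(mfrow = c(1, 2), pty = "s")
qqnorm(x_t, main = "t, df = 5"); qqline(x_t)
qqnorm(x_norm, main = "Normal"); qqline(x_norm)
par(op)  # restore the previous graphics settings

# Heavier tails also show up in the theoretical quantiles:
# the 99th percentile of t(5) lies further out than that of N(0, 1)
print(c(t5 = qt(0.99, df = 5), normal = qnorm(0.99)))
```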

Normality Test in R

The plots above can be complemented by formal tests of normality. The Shapiro-Wilk test, available through the shapiro.test() function, tests the null hypothesis that the data come from a normal distribution; a small p-value is evidence against normality.

shapiro.test(eruptions)

## Output
		Shapiro-Wilk normality test

data:  eruptions
W = 0.84592, p-value = 9.036e-16
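
The very small p-value above is unsurprising, since the histogram of the full data set shows two modes. A sketch of a more targeted check, assuming the same cut-off of 3 minutes used earlier, applies the test to the long eruptions only (sw_all and sw_long are illustrative names).

```r
x <- faithful$eruptions
long <- x[x > 3]  # right-hand mode only

sw_all <- shapiro.test(x)
sw_long <- shapiro.test(long)

# Compare the p-values for the full data and the long eruptions
print(c(all = sw_all$p.value, long = sw_long$p.value))
```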

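The Kolmogorov-Smirnov test, available through the ks.test() function, compares the data with a fully specified distribution. Called with "pnorm" and no further arguments, it tests against the standard normal distribution (mean 0, standard deviation 1), so the mean and sd arguments must be supplied to test against any other normal distribution.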

ks.test(eruptions, "pnorm")

## Output
        Asymptotic one-sample Kolmogorov-Smirnov test

data:  eruptions
D = 0.94857, p-value < 2.2e-16
alternative hypothesis: two-sided

Warning message:
In ks.test.default(eruptions, "pnorm") :
  ties should not be present for the one-sample Kolmogorov-Smirnov test
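
The statistic D = 0.94857 above mostly reflects that every eruption lasts at least 1.6 minutes, which is far out in the upper tail of a standard normal distribution. A hedged sketch of a more meaningful comparison passes the sample mean and standard deviation to ks.test(). Note the assumption involved: the usual p-value is not strictly valid when the parameters are estimated from the same sample, so it should be treated as indicative only. The names ks_std and ks_fit are illustrative.

```r
x <- faithful$eruptions

# suppressWarnings() hides the ties warning shown above
ks_std <- suppressWarnings(ks.test(x, "pnorm"))  # against N(0, 1)
ks_fit <- suppressWarnings(
  ks.test(x, "pnorm", mean = mean(x), sd = sd(x))  # against a fitted normal
)

# Compare the two test statistics
d_values <- c(standard = unname(ks_std$statistic),
              fitted = unname(ks_fit$statistic))
print(d_values)
```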

Taken together, these techniques give insight into the distribution of univariate data, help identify potential outliers, and assess the normality assumptions that many further statistical analyses rely on.
