Exploring Data in R

Examining data (exploring data), particularly through graphical examination and representation, is an important prelude to statistical data analysis and modeling. Note that there are some limitations on the kinds of graphs that we can create.

One should also be familiar with standard procedures for exploratory data analysis, statistical graphics, and data transformation. Graphical representations of data can be categorized by the nature (or type) of the variables, the number of variables, and the objective of the analysis. For example, to compare groups, comparison graphs such as bar graphs can be used; to examine the kind of relationship between variables, a scatter plot can be useful.

  • Distributional Displays:
    The distributional displays include stem and leaf display, histograms, density estimates, quantile comparison plots, and box plots.
  • Plots of the Relationship between two variables:
    The graphical representations for the relationship between two variables include various versions of scatter plots, scatter plot smoothers, bivariate density estimates, and parallel box plots.
  • Multivariate Displays:
    Multivariate graphical representations include scatter plot matrices, coplots, and dynamic three dimensional scatter plots.
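
As a rough illustration of the list above, the following sketch draws parallel box plots, a scatter plot matrix, and a coplot with base R graphics. It assumes the built-in mtcars data set as example data; the choice of the variables mpg, wt, hp, and cyl is only for illustration.

```r
# Assumption: the built-in mtcars data set stands in for the data.

# Parallel box plots: mpg for each number of cylinders.
bp <- boxplot(mpg ~ cyl, data = mtcars,
              xlab = "Cylinders", ylab = "Miles per gallon")

# Scatter plot matrix of three quantitative variables.
vars <- mtcars[, c("mpg", "wt", "hp")]
pairs(vars)

# Coplot: mpg against wt, conditioned on the number of cylinders.
coplot(mpg ~ wt | factor(cyl), data = mtcars)
```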

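The following examples show how to explore data in R. The variable names used (mpg, wt, and cyl) match the built-in mtcars data set, so the code presumably assumes that these columns have been made available, for example with attach(mtcars).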

Stem and Leaf display and Histogram in R

hist(mpg, nclass=3, col=3)
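
A minimal sketch of both displays named in the heading, assuming that mpg is the column of that name in the built-in mtcars data set. The stem() function prints the stem-and-leaf display to the console, while hist() draws the histogram and invisibly returns the bin counts. The value breaks = 5 is an illustrative choice; R treats it as a suggestion for the number of bins, and nclass is a compatibility alias for the same argument.

```r
# Assumption: mpg refers to the mpg column of the built-in mtcars data.
mpg <- mtcars$mpg

# Stem-and-leaf display, printed as text in the console.
stem(mpg)

# Histogram; breaks = 5 only suggests the number of bins.
h <- hist(mpg, breaks = 5, col = 3, main = "Histogram of mpg")

# Bin counts computed by hist().
print(h$counts)
```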

Density Estimates

The following R code overlays kernel density estimates on a histogram, giving a smoothed representation of the distribution.

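hist(mpg, probability=TRUE, ylab='Density')  # histogram on the density scale
lines(density(mpg), lwd=2)                   # default-bandwidth estimate, thick line
points(mpg, rep(0, length(mpg)), pch="|")    # one-dimensional scatter plot at the bottom
lines(density(mpg, adjust=0.9), lwd=1)       # bandwidth 0.9 times the default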

The hist() function constructs the histogram, with probability = TRUE specifying density scaling. The first lines() call draws the density estimate on the graph; lwd=2 doubles the default line width. The points() function draws a one-dimensional scatter plot along the bottom of the graph, using a vertical bar as the plotting symbol. The second call to density(), with adjust=0.9, uses a bandwidth 0.9 times the default value.
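
The effect of adjust can be checked directly, because the object returned by density() stores the bandwidth actually used in its bw component. The sketch below again assumes that mpg comes from the built-in mtcars data set; the names d_default, d_narrow, and bw_ratio are illustrative.

```r
# Assumption: mpg refers to the mpg column of the built-in mtcars data.
mpg <- mtcars$mpg

d_default <- density(mpg)               # default bandwidth
d_narrow  <- density(mpg, adjust = 0.9) # bandwidth scaled by adjust

# Ratio of the bandwidths used by the two estimates.
bw_ratio <- d_narrow$bw / d_default$bw
print(bw_ratio)
```

A smaller adjust gives a narrower bandwidth and therefore a less smooth curve.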

Quantile Comparison Plots

Quantile plots help in comparing the distribution of a variable with a theoretical distribution such as the normal distribution.


Note that the qqPlot() function is available in the car package. The older qq.plot() function is defunct.
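
As a hedged illustration that runs without installing car, the sketch below uses the base R functions qqnorm() and qqline() instead of qqPlot() to compare a variable with the normal distribution. It assumes that mpg is the column of that name in the built-in mtcars data set.

```r
# Assumption: mpg refers to the mpg column of the built-in mtcars data.
mpg <- mtcars$mpg

# Normal quantile comparison plot; qqnorm() also returns the plotted
# theoretical (x) and sample (y) quantiles.
qq <- qqnorm(mpg, main = "Normal Q-Q plot of mpg")

# Reference line through the first and third quartiles.
qqline(mpg, lty = 2)
```

Points that fall close to the reference line suggest that the variable is approximately normally distributed.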

Relationship Graphs

To explore the relationship between two quantitative variables, use the plot() function; for an enhanced scatter plot of two variables, use the scatterplot() function from the car package. This function plots the variables together with least-squares and non-parametric regression lines. For example,

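library(car)           # provides scatterplot()
plot(mpg, wt)          # basic scatter plot
scatterplot(mpg, wt)   # adds least-squares and smoothed lines
scatterplot(mpg, wt, labels=rownames(mtcars))  # car names as labels; assumes the data are mtcars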
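
What scatterplot() adds can be approximated with base R alone. The sketch below overlays a least-squares line and a lowess smooth on an ordinary scatter plot; it is an approximation under stated assumptions, not a reproduction of scatterplot(), and it assumes that mpg and wt come from the built-in mtcars data set. The names fit and sm are illustrative.

```r
# Assumption: mpg and wt are columns of the built-in mtcars data.
plot(mtcars$mpg, mtcars$wt, xlab = "mpg", ylab = "wt")

# Least-squares regression line of wt on mpg.
fit <- lm(wt ~ mpg, data = mtcars)
abline(fit, lwd = 2)

# Non-parametric (lowess) smooth of the same relationship.
sm <- lowess(mtcars$mpg, mtcars$wt)
lines(sm, lty = 2)
```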
