Exploring Data in R

Examining (exploring) data, particularly through graphical examination and representation, is an important prelude to statistical data analysis and modeling. Note that there are some limitations on the kinds of graphs that we can create.

One should be familiar with standard procedures for exploratory data analysis, statistical graphics, and data transformation. We can categorize the graphical representation of data based on the nature (or type) of the variable, the number of variables, and the objective of the analysis. For example, if we are comparing groups, then comparison graphs such as bar graphs can be used, and if we are interested in the kind of relationship between variables, then a scatter plot can be useful.

  • Distributional Displays:
    The distributional displays include stem and leaf displays, histograms, density estimates, quantile comparison plots, and box plots.
  • Plots of the Relationship between two variables:
    The graphical representations of the relationship between two variables include various versions of scatter plots, scatter plot smoothers, bivariate density estimates, and parallel box plots.
  • Multivariate Displays:
    Multivariate graphical representations include scatter plot matrices, coplots, and dynamic three-dimensional scatter plots.
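
A few of the displays listed above that are not demonstrated later can be sketched with base R graphics alone. This is only a minimal illustration on the built-in mtcars data; the choice of variables and the helper data frame name cars are assumptions made for the example.

```r
# Parallel box plots: distribution of mpg within each cylinder group
bp <- boxplot(mpg ~ cyl, data = mtcars,
              xlab = "Cylinders", ylab = "Miles per gallon")
bp$n  # number of cars in each group

# Scatter plot matrix of a few (arbitrarily chosen) variables
pairs(mtcars[, c("mpg", "wt", "hp", "disp")])

# Coplot: mpg against wt, conditioned on the number of cylinders.
# 'cars' is a hypothetical copy of mtcars with cyl stored as a factor.
cars <- transform(mtcars, cyl = factor(cyl))
coplot(mpg ~ wt | cyl, data = cars)
```

Storing cyl as a factor makes coplot() draw one panel per cylinder group instead of overlapping numeric intervals.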

For exploring the data in R, the following are some examples:

Stem and Leaf display and Histogram in R

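# Make the columns of mtcars (mpg, wt, cyl, ...) available by name
attach(mtcars)
hist(mpg)                    # histogram with default bins
hist(mpg, nclass=3, col=3)   # roughly 3 bins, filled with palette color 3
stem(mpg)                    # stem and leaf display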
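
To see what a histogram call actually computes, one option is to ask hist() for its result without plotting. The sketch below uses mtcars$mpg directly so that it does not depend on attach(); the object names h and h3 are arbitrary.

```r
# Compute the histogram without drawing it
h <- hist(mtcars$mpg, plot = FALSE)
h$breaks   # bin boundaries chosen by hist()
h$counts   # number of observations in each bin

# A requested number of classes is treated as a suggestion only:
# hist() rounds the break points to "pretty" values.
h3 <- hist(mtcars$mpg, nclass = 3, plot = FALSE)
length(h3$counts)
```

Inspecting the breaks this way can help explain why the plotted number of bars may differ from the number requested.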

Density Estimates

Consider the following R code for a representation of distribution by smoothing the histogram.

hist(mpg, probability=TRUE, ylab='Density')
lines(density(mpg), lwd=2)   # lwd is an argument of lines(), not density()
points(mpg, rep(0, length(mpg)), pch="|")
lines(density(mpg, adjust=0.9), lwd=1)

The hist() function constructs the histogram, with probability=TRUE specifying density scaling. The lines() function draws the density estimate on the graph with double the default line thickness because of the parameter lwd=2. The points() function draws a one-dimensional scatter plot at the bottom of the graph, using a vertical bar as the plotting symbol. The second call to density(), inside lines() with adjust=0.9, specifies a bandwidth of 0.9 times the default value.
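
The effect of adjust can be checked directly, since density() returns the bandwidth it used in the bw component. The following is a small sketch; the names x, d_default, and d_narrow are chosen for this example, and rug() is used here as an alternative way to mark the observations on the axis.

```r
x <- mtcars$mpg

d_default <- density(x)                # default bandwidth
d_narrow  <- density(x, adjust = 0.9)  # 0.9 times the default bandwidth

# Ratio of the two bandwidths
d_narrow$bw / d_default$bw

hist(x, probability = TRUE, xlab = "mpg", ylab = "Density", main = "")
lines(d_default, lwd = 2)              # thicker line: default bandwidth
lines(d_narrow, lwd = 1, lty = 2)      # dashed line: slightly narrower bandwidth
rug(x)                                 # tick marks at the observed values
```

A smaller bandwidth follows the data more closely, so the dashed curve is expected to be slightly less smooth than the solid one.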

Quantile Comparison Plots

Quantile plots help in comparing the distribution of a variable with a theoretical distribution such as the normal distribution.

library(car)
qqPlot(mpg)

Note that the qqPlot() function is available in the car package. The older qq.plot() function is defunct.
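
If the car package is not installed, a comparable normal quantile comparison plot can be sketched with base R. This is an alternative, not what qqPlot() does internally; in particular, it draws only a reference line and no confidence envelope. The names x and qq are arbitrary.

```r
x <- mtcars$mpg

# qqnorm() plots sample quantiles against theoretical normal quantiles
# and (invisibly) returns the plotted coordinates.
qq <- qqnorm(x, main = "Normal Q-Q plot of mpg")
qqline(x)  # reference line through the first and third quartiles

# The theoretical quantiles are normal quantiles at the
# probability points given by ppoints().
head(sort(qq$x))
head(qnorm(ppoints(length(x))))
```

Points that fall close to the reference line suggest that the variable is approximately normally distributed.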

Relationship Graphs

To explore the relationship between two quantitative variables, use the plot() function; for an enhanced version of the scatter plot, use the scatterplot() function from the car package. This function plots the variables with least-squares and non-parametric regression lines. For example,

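plot(mpg, wt)
scatterplot(mpg, wt)
# Label the two most unusual points with the car names (row names of mtcars).
# The id= argument assumes car version 3.0 or later.
scatterplot(mpg, wt, id=list(labels=rownames(mtcars), n=2))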
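
The two kinds of lines mentioned above can also be added by hand with base R, which makes it clearer what is being drawn. This is a rough sketch only: lowess() is used here as a stand-in smoother and is not necessarily the one scatterplot() uses, and the names fit and smooth_fit are arbitrary.

```r
# Plain scatter plot of weight against miles per gallon
plot(mtcars$mpg, mtcars$wt, xlab = "mpg", ylab = "wt")

# Least-squares regression line
fit <- lm(wt ~ mpg, data = mtcars)
abline(fit, lty = 2)

# Non-parametric smooth (lowess), drawn as a solid line
smooth_fit <- lowess(mtcars$mpg, mtcars$wt)
lines(smooth_fit)

coef(fit)  # intercept and slope of the least-squares line
```

Where the smooth departs noticeably from the straight line, the relationship may not be well described as linear.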
