Examining data (exploring data), particularly through graphical examination and representation, is an important prelude to statistical data analysis and modeling. Note that the kinds of graphs we can create are limited, for example by the type and number of variables involved.
One should be familiar with standard procedures for exploratory data analysis, statistical graphics, and data transformation. We can categorize graphical representations of data by the nature (or type) of the variables, the number of variables, and the objective of the analysis. For example, if we are comparing groups, comparison graphs such as bar graphs can be used. If we are interested in the kind of relationship between variables, a scatter plot can be useful.
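As an illustration of a comparison graph, the following is a minimal sketch of a bar graph that compares group means. The built-in mtcars data set and the choice of mpg grouped by cyl are assumptions made only for this example.

```r
# Mean miles per gallon for each cylinder group in the built-in mtcars data
mean_mpg <- tapply(mtcars$mpg, mtcars$cyl, mean)

# Bar graph comparing the group means
barplot(mean_mpg,
        xlab = "Number of cylinders",
        ylab = "Mean miles per gallon",
        main = "Mean mpg by cylinder count")
```

Each bar corresponds to one group, so differences between the groups can be read directly from the bar heights.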
- Distributional Displays: The distributional displays include stem and leaf displays, histograms, density estimates, quantile comparison plots, and box plots.
- Plots of the Relationship between Two Variables: The graphical representations of the relationship between two variables include various versions of scatter plots, scatter plot smoothers, bivariate density estimates, and parallel box plots.
- Multivariate Displays: Multivariate graphical representations include scatter plot matrices, coplots, and dynamic three-dimensional scatter plots.
For exploring the data in R, the following are some examples:
Stem and Leaf Display and Histogram in R
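attach(mtcars)                  # make the columns of mtcars (e.g. mpg) available by name
hist(mpg)                       # histogram with the default bins
hist(mpg, nclass = 3, col = 3)  # about three bins, filled with color 3 of the palette
stem(mpg)                       # stem and leaf display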
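The box plot, which is also listed among the distributional displays, can be sketched in the same way. The example below assumes the same mtcars data and refers to the column explicitly, so it does not depend on attach().

```r
# Box plot of a single variable: median, hinges, whiskers, and possible outliers
boxplot(mtcars$mpg, ylab = "Miles per gallon", main = "Box plot of mpg")

# Five-number summary (minimum, lower hinge, median, upper hinge, maximum)
five <- fivenum(mtcars$mpg)
five
```

The five-number summary gives the numerical counterpart of what the box plot displays.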
Exploring Data in R: Density Estimates
Consider the following R code, which represents the distribution by overlaying a smoothed density estimate on the histogram.
hist(mpg, probability = TRUE, ylab = "Density")
lines(density(mpg), lwd = 2)
points(mpg, rep(0, length(mpg)), pch = "|")
lines(density(mpg, adjust = 0.9), lwd = 1)
The hist() function constructs the histogram, with probability = TRUE specifying density scaling. The first call to lines() draws the density estimate on the graph, with the line drawn at double thickness because of the parameter lwd = 2. The points() function draws a one-dimensional scatter plot at the bottom of the graph, using a vertical bar as the plotting symbol. The second call to lines(), in which density() is called with adjust = 0.9, specifies a bandwidth of 0.9 times the default value.
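The effect of the adjust argument can be checked directly, because density() stores the bandwidth it actually used in the bw component of its result. The following is a small sketch; the variable names and the extra value adjust = 2 are chosen only for illustration.

```r
x <- mtcars$mpg

d_default <- density(x)                # default bandwidth
d_narrow  <- density(x, adjust = 0.9)  # 0.9 times the default bandwidth
d_wide    <- density(x, adjust = 2)    # twice the default bandwidth

# Bandwidths actually used by each estimate
c(default = d_default$bw, narrow = d_narrow$bw, wide = d_wide$bw)

# Overlay the three estimates on common axes
x_range <- range(d_wide$x)
y_max   <- max(d_default$y, d_narrow$y, d_wide$y)
plot(d_default, xlim = x_range, ylim = c(0, y_max),
     main = "Effect of adjust on the density estimate")
lines(d_narrow, lty = 2)
lines(d_wide, lty = 3)
legend("topright", lty = 1:3,
       legend = c("adjust = 1", "adjust = 0.9", "adjust = 2"))
```

A smaller bandwidth follows the data more closely, while a larger one produces a smoother curve.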
Quantile Comparison Plots
Quantile plots help in comparing the distribution of a variable with a theoretical distribution such as the normal distribution.
library(car)
qqPlot(mpg)
Note that the qqPlot() function is available in the car package. The older qq.plot() function is defunct.
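If the car package is not installed, a comparable normal quantile comparison plot can be drawn with base R. This is only a sketch of an alternative, not a replacement for qqPlot().

```r
# Sample quantiles of mpg against theoretical normal quantiles
qq <- qqnorm(mtcars$mpg, main = "Normal Q-Q plot of mpg")

# Reference line through the first and third quartiles
qqline(mtcars$mpg, lty = 2)
```

Points that fall close to the reference line suggest that the variable is approximately normally distributed.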
Exploring Data: Relationship Graphs
To explore the relationship between two quantitative variables, use the plot() function; for an enhanced version of the scatter plot, use the scatterplot() function from the car package. This function plots the variables together with least-squares and non-parametric regression lines. For example,
plot(mpg, wt)
scatterplot(mpg, wt)
scatterplot(mpg, wt, labels = rownames(mtcars))  # row names of mtcars as point labels
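The parallel box plots and multivariate displays listed earlier can also be produced with base R graphics. The following is a minimal sketch; the selection of columns from mtcars is an assumption made only for this example.

```r
# Parallel box plots: distribution of mpg within each cylinder group
bp <- boxplot(mpg ~ cyl, data = mtcars,
              xlab = "Number of cylinders", ylab = "Miles per gallon")

# Scatter plot matrix of several quantitative variables
vars <- mtcars[, c("mpg", "wt", "hp", "disp")]
pairs(vars, main = "Scatter plot matrix")

# Coplot: mpg against wt, conditioned on the number of cylinders
coplot(mpg ~ wt | factor(cyl), data = mtcars, rows = 1)
```

The scatter plot matrix shows every pairwise relationship at once, while the coplot shows how one relationship changes across the levels of a third variable.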
FAQs about R Language
- What do you mean by exploring data?
- What are the objectives of exploratory data analysis?
- What are the important visualizations for exploratory data analysis?
- For exploratory analysis, which graph is used for comparison purposes?
- For exploratory analysis, which graph is used to explore the relationship between variables?
- What is a quantile comparison plot?
- What is the objective of density estimation graphs?
- Name some of the multivariate plots used for EDA.