Introduction to Scatter Plots in R Language
Scatter plots (scatter diagrams) are bivariate graphical representations for examining the relationship between two quantitative variables, which makes them essential for visualizing correlations and trends in data. They can be drawn in several ways in R, and this article discusses how to make several kinds of them.
The plot function in R
When two numeric vectors are passed to the plot() function (one for the horizontal and the other for the vertical coordinates), its default behavior is to draw a scatter diagram. For example,
library(car)
attach(Prestige)
plot(income, prestige)
will draw a simple scatterplot of prestige by income.
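The same default behavior can be tried without the car package and without attach(). The following is a minimal sketch that assumes the built-in mtcars data set as a stand-in for Prestige; the variables wt and mpg belong to mtcars and are not part of the example above.

```r
# Stand-in data (assumption): mtcars ships with base R, unlike Prestige
x <- mtcars$wt    # horizontal coordinates
y <- mtcars$mpg   # vertical coordinates

# Two numeric vectors: plot() draws a scatter plot by default
plot(x, y,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     main = "Scatter plot from two numeric vectors")

# Equivalent call that avoids attach(): with() looks the names up in the data frame
with(mtcars, plot(wt, mpg))
```

Using with() or the $ operator keeps the variables tied to their data frame, which avoids name clashes that attach() can cause when several data sets are in use.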
The interpretation of a scatterplot is often assisted by enhancing the plot with least-squares or non-parametric regression lines. For this purpose, the scatterplot() function in the car package can be used; it also adds marginal boxplots for the two variables.
scatterplot(prestige ~ income, lwd = 3)
Note that in the scatterplot, the non-parametric regression curve is drawn by a local regression smoother. Local regression works by fitting a least-squares line in the neighborhood of each observation, placing greater weight on points closer to the focal observation. A fitted value for the focal observation is extracted from each local regression, and the resulting fitted values are connected to produce the non-parametric regression line.
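The local regression idea described above can be sketched in a few lines of base R. This is an illustration only, not the code that scatterplot() runs: the helper local_fit(), its span argument, the tricube weight function, and the mtcars stand-in data are all assumptions made for the example, and the robustness iterations used by real smoothers are left out.

```r
# Hypothetical helper: local linear regression with tricube weights.
# 'span' is the fraction of observations used in each neighborhood.
local_fit <- function(x, y, span = 0.5) {
  n <- length(x)
  k <- max(2, ceiling(span * n))              # neighborhood size
  fitted <- numeric(n)
  for (i in seq_len(n)) {
    d <- abs(x - x[i])                        # distance to the focal observation
    h <- sort(d)[k]                           # distance to the k-th nearest point
    w <- ifelse(d < h, (1 - (d / h)^3)^3, 0)  # tricube: closer points weigh more
    fit <- lm(y ~ x, weights = w)             # weighted least-squares line
    fitted[i] <- predict(fit, newdata = data.frame(x = x[i]))
  }
  fitted
}

# Stand-in data (assumption): mtcars from base R
x <- mtcars$wt
y <- mtcars$mpg
yhat <- local_fit(x, y, span = 0.5)

# Connect the fitted values in order of x to draw the curve
o <- order(x)
plot(x, y, xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
lines(x[o], yhat[o], lwd = 3)
```

A smaller span uses fewer neighbors per fit and follows the data more closely; a larger span gives a smoother curve.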
Coded Scatterplots
The scatterplot() function can also be used to create coded scatterplots. For this purpose, a categorical variable determines the color or plotting symbol used for each category. For example, let us plot prestige by income, coded by the type of occupation:
scatterplot(prestige ~ income | type)
Note that the variables in the scatterplot are given in formula style (as y ~ x | groups).
The coded scatterplot indicates that the relationship between prestige and income may well be linear within occupation types. The slope of the relationship looks steepest for blue-collar (bc) occupations, and least steep for professional and managerial occupations.
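A visual impression of different slopes can be checked numerically by fitting a separate least-squares line within each category. The sketch below assumes the built-in iris data set as a stand-in (Species plays the role of the grouping variable); it also shows one way to draw a coded scatter plot with base R graphics instead of scatterplot().

```r
# Stand-in data (assumption): iris ships with base R
groups <- iris$Species

# Coded scatter plot in base R: one color and one symbol per category
plot(iris$Sepal.Length, iris$Petal.Length,
     col = as.integer(groups), pch = as.integer(groups),
     xlab = "Sepal length", ylab = "Petal length")
legend("topleft", legend = levels(groups),
       col = seq_along(levels(groups)), pch = seq_along(levels(groups)))

# Slope of the least-squares line within each category
slopes <- sapply(split(iris, groups), function(d) {
  coef(lm(Petal.Length ~ Sepal.Length, data = d))[["Sepal.Length"]]
})
slopes
```

Comparing the entries of slopes gives a quick numeric counterpart to judging steepness by eye.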
Jittering Scatter Plots
Jittering the data by adding a small random quantity to each coordinate serves to separate the overplotted points.
data(Vocab)
attach(Vocab)
plot(education, vocabulary)  # without jittering
plot(jitter(education), jitter(vocabulary))  # with jittering
The degree of jittering can be controlled via the factor argument. For example, specifying factor = 2 doubles the jitter.
plot(jitter(education, factor = 2), jitter(vocabulary, factor = 2))
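As a separate, self-contained check of what jittering does, the sketch below counts distinct plotting positions before and after jittering. The data are simulated integer values (an assumption made for illustration, not the Vocab data), so many observations share exactly the same coordinates.

```r
set.seed(1)  # jitter() is random; fix the seed for a reproducible picture

# Hypothetical integer-valued data: many observations share the same coordinates
x <- sample(0:5, 200, replace = TRUE)
y <- sample(0:5, 200, replace = TRUE)

# Number of distinct plotting positions before and after jittering
n_raw      <- nrow(unique(data.frame(x, y)))
n_jittered <- nrow(unique(data.frame(x = jitter(x), y = jitter(y))))

# Side-by-side comparison of the raw and jittered plots
op <- par(mfrow = c(1, 2))
plot(x, y, main = "Raw")
plot(jitter(x, factor = 2), jitter(y, factor = 2), main = "Jittered, factor = 2")
par(op)
```

With only six possible values on each axis, the raw plot can show at most 36 distinct points, while the jittered version spreads the observations around each grid position.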
Let’s add the least-squares and non-parametric regression lines.
abline(lm(vocabulary ~ education), lwd = 3, lty = 2)
lines(lowess(education, vocabulary, f = 0.2), lwd = 3)
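The lowess() function (short for locally weighted scatterplot smoothing, a form of locally weighted regression) returns the coordinates of the local regression curve, which is then drawn by lines(). The f argument sets the span of the local regression, that is, the proportion of points that influence the smooth at each value.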
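The effect of the span can be seen by drawing two curves with different f values. This is a minimal sketch that assumes the built-in cars data set as a stand-in for Vocab; the values 0.2 and 0.8 are arbitrary choices for illustration.

```r
# Stand-in data (assumption): cars from base R (speed and stopping distance)
fit_narrow <- lowess(cars$speed, cars$dist, f = 0.2)  # small span: wigglier curve
fit_wide   <- lowess(cars$speed, cars$dist, f = 0.8)  # large span: smoother curve

# lowess() returns a list with components x (sorted) and y (smoothed values)
str(fit_narrow)

plot(cars$speed, cars$dist, xlab = "Speed", ylab = "Stopping distance")
lines(fit_narrow, lwd = 2, lty = 1)
lines(fit_wide, lwd = 2, lty = 2)
legend("topleft", legend = c("f = 0.2", "f = 0.8"), lty = 1:2, lwd = 2)
```

Because the returned object is a list with x and y components, it can be passed directly to lines().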
Using these different kinds of graphical representations of relationships between variables may help to identify some hidden information (hidden due to overplotting).
See more on plot() function