Scatter Plots In R

Scatter plots (scatter diagrams) are bivariate graphical representations for examining the relationship between two quantitative variables. Here we will discuss how to make several kinds of scatter plots in R.

When two numeric vectors are passed to plot() (the first for the horizontal and the second for the vertical coordinates), its default behaviour is to draw a scatter diagram. For example,

plot(income, prestige)

will draw a simple scatterplot of prestige by income.
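As a self-contained sketch of this default behaviour, the snippet below uses simulated data, because the income and prestige vectors above are assumed to be already available in the workspace. The names x_income and y_prestige and the simulated values are illustrative assumptions, not the data used in this article.

```r
# Simulated stand-ins for two numeric variables (hypothetical values).
set.seed(1)
x_income   <- runif(50, min = 1000, max = 25000)
y_prestige <- 20 + 0.002 * x_income + rnorm(50, sd = 8)

# Two numeric vectors: the first gives the horizontal axis,
# the second gives the vertical axis.
plot(x_income, y_prestige,
     xlab = "Income", ylab = "Prestige",
     main = "Prestige by income (simulated)")

# The formula interface (y ~ x) draws the same scatter plot.
plot(y_prestige ~ x_income)
```

Note that the order of the variables is reversed between the two calls: plot(x, y) versus plot(y ~ x).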

The interpretation of a scatterplot is often assisted by enhancing the plot with least-squares or non-parametric regression lines. For this purpose, scatterplot() from the car package can be used; it also adds marginal boxplots for the two variables:

library(car)  # provides scatterplot()
scatterplot(prestige ~ income, lwd = 3)

Note that in the scatterplot, the non-parametric regression curve is drawn by a local regression smoother. Local regression works by fitting a least-squares line in the neighborhood of each observation, placing greater weight on points closer to the focal observation. A fitted value for the focal observation is extracted from each local regression, and the resulting fitted values are connected to produce the non-parametric regression line.
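The procedure just described can be sketched in a few lines of base R. This is only an illustration of the idea, not the code used by scatterplot(): local_fit() is a hypothetical helper, the tricube weight function and the nearest-neighbour bandwidth are assumptions chosen for the sketch, and the robustness iterations that smoothers such as lowess() also perform are omitted. The data are simulated.

```r
# Minimal sketch of local linear regression (hypothetical helper).
local_fit <- function(x, y, span = 0.5) {
  n <- length(x)
  k <- max(3, ceiling(span * n))      # number of neighbours used per local fit
  fitted <- numeric(n)
  for (i in seq_len(n)) {
    d <- abs(x - x[i])                # distance from the focal observation
    h <- sort(d)[k]                   # bandwidth: distance to the k-th nearest point
    # Tricube weights: points closer to the focal observation weigh more,
    # points at or beyond the bandwidth get weight zero.
    w <- ifelse(d < h, (1 - (d / h)^3)^3, 0)
    fit <- lm(y ~ x, weights = w)     # weighted least-squares line
    # Keep only the fitted value at the focal observation.
    fitted[i] <- predict(fit, newdata = data.frame(x = x[i]))
  }
  fitted
}

# Simulated example: a noisy sine curve.
set.seed(42)
x <- sort(runif(100, 0, 10))
y <- sin(x) + rnorm(100, sd = 0.3)
y_hat <- local_fit(x, y, span = 0.3)

plot(x, y)
lines(x, y_hat, lwd = 3)              # connect the fitted values
```

A smaller span uses fewer neighbours in each local fit, so the curve follows the data more closely; a larger span gives a smoother curve.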

Coded Scatterplots

The scatterplot() function can also be used to create coded scatterplots, in which a categorical variable determines the color or plotting symbol of each point. For example, let us plot prestige by income, coded by the type of occupation:

scatterplot(prestige ~ income | type)

Note that the variables in scatterplot() are given in formula style (as y ~ x | groups).

The coded scatterplot indicates that the relationship between prestige and income may well be linear within occupation types. The slope of the relationship looks steepest for blue-collar (bc) occupations, and least steep for professional and managerial occupations.
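A coded scatterplot can also be approximated with base graphics, and the visual impression of different within-group slopes can be checked numerically. The sketch below uses simulated data; the group labels A, B and C and the slopes in true_slope are hypothetical and do not come from the occupation data discussed above.

```r
# Simulated data with three groups and a different slope in each group.
set.seed(2)
grp <- factor(rep(c("A", "B", "C"), each = 30))
x <- runif(90, 0, 10)
true_slope <- c(A = 3, B = 1, C = 2)          # hypothetical slopes
y <- unname(true_slope[as.character(grp)]) * x + rnorm(90)

# Color and plotting symbol are indexed by the integer codes of the factor.
plot(x, y, col = as.integer(grp), pch = as.integer(grp))
legend("topleft", legend = levels(grp),
       col = seq_along(levels(grp)), pch = seq_along(levels(grp)))

# Fit and draw one least-squares line per group, and collect the slopes.
slopes <- sapply(levels(grp), function(g) {
  fit <- lm(y ~ x, subset = grp == g)
  abline(fit, col = which(levels(grp) == g))
  unname(coef(fit)["x"])
})
slopes
```

Comparing the estimated slopes across groups is a quick numerical check on what the coded plot suggests.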

Jittering Scatterplots

Jittering the data by adding a small random quantity to each coordinate serves to separate the overplotted points.

plot(education, vocabulary)                  # without jittering
plot(jitter(education), jitter(vocabulary))  # with jittering

The degree of jittering can be controlled via the factor argument. For example, specifying factor = 2 doubles the jitter.

plot(jitter(education, factor = 2), jitter(vocabulary, factor = 2))
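To see what jitter() does to the data, the following sketch counts overplotted points before and after jittering. The vectors edu and vocab are simulated integer scores, used here as hypothetical stand-ins for discrete variables like education and vocabulary.

```r
# Simulated discrete scores (hypothetical stand-ins).
set.seed(3)
edu   <- sample(8:16, 500, replace = TRUE)
vocab <- sample(0:10, 500, replace = TRUE)

# Many observations share exactly the same pair of coordinates.
n_dup_before <- sum(duplicated(cbind(edu, vocab)))
n_dup_before

# Jittering adds a small uniform random quantity to each coordinate.
edu_j   <- jitter(edu, factor = 2)
vocab_j <- jitter(vocab, factor = 2)
n_dup_after <- sum(duplicated(cbind(edu_j, vocab_j)))
n_dup_after

# The displacement stays small relative to the spacing between
# distinct values, so the overall pattern is preserved.
max_shift <- max(abs(edu_j - edu))
max_shift

plot(edu_j, vocab_j)
```

Because the added noise is random, the jittered plot changes slightly on each run unless a seed is set with set.seed().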

Let us add the least-squares and non-parametric regression lines.

abline(lm(vocabulary ~ education), lwd = 3, lty = 2)
lines(lowess(education, vocabulary, f = 0.2), lwd = 3)

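The lowess() function (the name is short for locally weighted scatterplot smoothing, a form of locally weighted regression) returns the coordinates of the local-regression curve, which lines() then draws. The span of the local regression is set by the f argument to lowess(): it is the proportion of points that influence the smooth at each value, so larger values give a smoother curve.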
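The effect of the span can be seen by comparing two values of f on the same data. The sketch below uses simulated data (a noisy sine curve); the object names fit_narrow and fit_wide are illustrative, and f = 2/3 is the default span of lowess().

```r
# Simulated data: a noisy sine curve.
set.seed(4)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.4)

# lowess() returns a list with components x (sorted) and y (smoothed values).
fit_narrow <- lowess(x, y, f = 0.2)   # small span: follows local features
fit_wide   <- lowess(x, y, f = 2/3)   # default span: much smoother curve
str(fit_narrow)

plot(x, y)
lines(fit_narrow, lwd = 3)            # lines() accepts the list directly
lines(fit_wide, lwd = 3, lty = 2)

# Mean squared distance of each smooth from the true curve sin(x).
mse_narrow <- mean((fit_narrow$y - sin(fit_narrow$x))^2)
mse_wide   <- mean((fit_wide$y - sin(fit_wide$x))^2)
c(narrow = mse_narrow, wide = mse_wide)
```

On strongly curved data like this, the wide span smooths away real structure, whereas on roughly linear data a wider span would mainly reduce noise.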

Using these different kinds of graphical representations may help to reveal information about the relationship between variables that would otherwise be hidden by overplotting.
