Statistical Computing and Graphics in R

# R Graphics

## Graphical Representation in R

R offers many graphical representations of qualitative and quantitative data. This post discusses only the histogram, bar plot, and box plot.

### Histogram

To visualize a single variable, draw a histogram with the hist() function. Histograms are used to judge the shape and distribution of the data graphically, and also to check the normality of a variable.

Let us use the data from the iris dataset.

attach(iris)
hist(Petal.Width)

We can enhance the histogram by using some arguments/parameters related to the hist( ) function. For example,

hist(Petal.Width,
     xlab = "Petal Width",
     ylab = "Frequency",
     main = "Histogram of Petal Width from Iris Data set",
     breaks = 10,
     col = "dodgerblue",
     border = "orange")

If these arguments are not provided, R will attempt to intelligently guess them, especially the number of breaks. See the YouTube tutorial for a graphical representation of the histogram.
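To see what R actually chose, the histogram can be computed without drawing it. The sketch below is only an illustration (the object names h_default, h_ten, and h_exact are made up here); it relies on hist() returning its break points and counts, and on the documented behavior that a single number passed to breaks is treated as a suggestion, while a vector of break points is used as given.

```r
# Compute the histogram without plotting to inspect R's choices.
h_default <- hist(iris$Petal.Width, plot = FALSE)
h_default$breaks   # break points chosen by R
h_default$counts   # number of observations in each bin

# A single number is only a suggestion; R rounds to "pretty" break points.
h_ten <- hist(iris$Petal.Width, breaks = 10, plot = FALSE)
length(h_ten$counts)

# A vector of break points is used exactly as supplied.
h_exact <- hist(iris$Petal.Width, breaks = seq(0, 2.5, by = 0.5), plot = FALSE)
h_exact$counts
```

Comparing length(h_ten$counts) with the requested 10 shows whether R followed the suggestion for this variable.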

### Barplots

The bar plots are the best choice for visual inspection of a categorical variable (or a numeric variable with a finite number of values), or a rank variable. Usually, one can use bar plots for comparison purposes. See the example,

# mtcars is a built-in dataset, not a package, so attach it instead of calling library()
attach(mtcars)
barplot(table(cyl))
barplot(table(cyl),
        ylab = "Frequency",
        xlab = "Cylinders (4, 6, 8)",
        main = "Number of cylinders",
        col = "green",
        border = "blue")
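The heights of the bars are simply the frequencies produced by table(). As a small sketch (the names counts and mids are chosen here for illustration), the counts can be inspected directly, and the bar midpoints returned by barplot() can be used to print each count above its bar.

```r
# Frequency table of the number of cylinders
counts <- table(mtcars$cyl)
counts

# barplot() returns the x-coordinates of the bar midpoints
mids <- barplot(counts,
                ylab = "Frequency",
                xlab = "Cylinders",
                ylim = c(0, max(counts) + 2))   # leave room for the labels

# Write each count just above its bar
text(mids, as.vector(counts), labels = as.vector(counts), pos = 3)
```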

### Boxplots

One can use Boxplots to visualize the normality, skewness, and existence of outliers in the data based on five-number summary statistics.

boxplot(mpg)
boxplot(Petal.Width)
boxplot(Petal.Length)
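The five-number summary behind a box plot can be computed directly. The following sketch uses fivenum() and boxplot.stats() from base R; the labels attached to the result are added here for readability only.

```r
x <- mtcars$mpg

# Tukey's five-number summary: minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(x)
names(five) <- c("min", "lower hinge", "median", "upper hinge", "max")
five

# boxplot.stats() returns the values that boxplot() draws
stats <- boxplot.stats(x)
stats$stats   # whisker ends, hinges, and median
stats$out     # observations flagged as outliers, if any
```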


One can also compare a numerical variable across the values of a categorical/grouping variable. For example,

boxplot(mpg ~ cyl, data = mtcars)

R reads the formula mpg ~ cyl as: "Plot the mpg variable against the cyl variable using the dataset mtcars." The symbol ~ is used to specify a formula in R.

boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Cylinders",
        ylab = "Miles per Gallon",
        pch = 20,
        cex = 2,
        col = "pink",
        border = "black")
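The same formula notation works in other base R functions. As an illustrative sketch (the names med and bp are chosen here), aggregate() computes the median of mpg for each value of cyl, and these medians match the middle lines of the grouped box plots.

```r
# Median mpg within each cylinder group, using the same formula
med <- aggregate(mpg ~ cyl, data = mtcars, FUN = median)
med

# boxplot() can return its statistics without drawing anything
bp <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)
bp$names        # group labels
bp$stats[3, ]   # third row holds the group medians
```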

See How to perform descriptive statistics


## Scatter Plots In R

Scatter plots (scatter diagrams) are bivariate graphical representations for examining the relationship between two quantitative variables. Here we will discuss how to make several kinds of scatter plots in R.

When two numeric vectors are provided as arguments to the plot() function (one for the horizontal and the other for the vertical coordinates), its default behavior is to draw a scatter diagram. For example,

library(car)
attach(Prestige)
plot(income, prestige)

will draw a simple scatterplot of prestige by income.

The interpretation of a scatterplot is often assisted by enhancing the plot with least-squares or non-parametric regression lines. For this purpose, scatterplot() in the car package can be used; it also adds marginal boxplots for the two variables.

scatterplot(prestige ~ income, lwd = 3 )

Note that in the scatterplot, the non-parametric regression curve is drawn by a local regression smoother. Local regression works by fitting a least-squares line in the neighborhood of each observation, placing greater weight on points closer to the focal observation. A fitted value for the focal observation is extracted from each local regression, and the resulting fitted values are connected to produce the non-parametric regression line.
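The idea of local regression can be sketched in a few lines of base R. The helper local_linear() below is hypothetical and written only for illustration: it performs a single pass with tricube weights and omits the refinements (such as robustness iterations) that real smoothers apply, so it is not the code that scatterplot() uses. The built-in cars dataset stands in for the Prestige data.

```r
# Hypothetical helper: one pass of local linear regression.
local_linear <- function(x, y, span = 0.5) {
  k <- max(3, ceiling(span * length(x)))     # points in each neighbourhood
  vapply(seq_along(x), function(i) {
    d <- abs(x - x[i])                       # distance from the focal observation
    h <- sort(d)[k]                          # distance to the k-th nearest point
    w <- pmax(1 - (d / h)^3, 0)^3            # tricube weights, zero outside the window
    fit <- lm(y ~ x, weights = w)            # weighted least-squares line
    unname(coef(fit)[1] + coef(fit)[2] * x[i])   # fitted value at the focal point
  }, numeric(1))
}

fitted_vals <- local_linear(cars$speed, cars$dist, span = 0.5)

# Connect the fitted values, ordered by x, to draw the smooth curve
ord <- order(cars$speed)
plot(cars$speed, cars$dist, xlab = "Speed", ylab = "Stopping distance")
lines(cars$speed[ord], fitted_vals[ord], lwd = 2)
```

A smaller span uses fewer neighbours for each local fit and so gives a curve that follows the data more closely.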

### Coded Scatterplots

The scatterplot() function can also be used to create coded scatterplots. For this purpose, a categorical variable is used to assign a different color or plotting symbol to each category. For example, let us plot prestige by income, coded by the type of occupation:

scatterplot(prestige ~ income | type)

Note that variables in the scatterplot are given in a formula-style (as y ~ x | groups).

The coded scatterplot indicates that the relationship between prestige and income may well be linear within occupation types. The slope of the relationship looks steepest for blue-collar (bc) occupations, and least steep for professional and managerial occupations.
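A coded scatterplot can also be built with base R alone. The sketch below is an assumption-laden illustration rather than an equivalent of scatterplot(): it uses the built-in iris data in place of Prestige, codes the points by Species, and fits one least-squares line per group so that the within-group slopes can be compared numerically.

```r
groups <- iris$Species
codes <- as.integer(groups)   # 1, 2, 3: used for both color and symbol

# Points colored and shaped by group
plot(iris$Petal.Length, iris$Petal.Width,
     col = codes, pch = codes,
     xlab = "Petal Length", ylab = "Petal Width")
legend("topleft", legend = levels(groups),
       col = seq_along(levels(groups)), pch = seq_along(levels(groups)))

# One least-squares line per group; keep the slopes for comparison
slopes <- sapply(split(iris, groups), function(d) {
  fit <- lm(Petal.Width ~ Petal.Length, data = d)
  abline(fit, col = as.integer(d$Species[1]))
  unname(coef(fit)[2])
})
slopes
```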

### Jittering scatterplots

Jittering the data by adding a small random quantity to each coordinate serves to separate the overplotted points.

data(Vocab)
attach(Vocab)
# without jittering
plot(education, vocabulary)
# with jittering
plot(jitter(education), jitter(vocabulary))

The degree of jittering can be controlled via the factor argument. For example, specifying factor = 2 doubles the jitter.

plot(jitter(education, factor = 2), jitter(vocabulary, factor = 2))
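The effect of factor can be checked on a small made-up vector. According to the jitter() documentation, when amount is not given the noise is uniform within factor/5 times the smallest gap between distinct values; for values one unit apart this means at most 0.2 with the default and 0.4 with factor = 2. The vector x below is hypothetical and produces no plot.

```r
set.seed(42)                   # make the random noise reproducible
x <- rep(1:5, each = 4)        # heavily tied integer data

j1 <- jitter(x)                # default: factor = 1
j2 <- jitter(x, factor = 2)    # twice as much noise

max(abs(j1 - x))               # stays within 0.2
max(abs(j2 - x))               # stays within 0.4
```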

Let's add the least-squares and non-parametric regression lines.

abline(lm(vocabulary ~ education), lwd=3, lty = 2)
lines(lowess(education, vocabulary, f = 0.2), lwd = 3)

The lowess() function (the name is short for locally weighted scatterplot smoothing, a form of locally weighted regression) returns the coordinates of the local regression curve, which lines() then draws. The span of the local regression is set by the f argument to lowess().
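To see what lowess() returns and how f changes the curve, here is a brief sketch on the built-in cars data (chosen here so that no extra package is needed; the object names are illustrative).

```r
# lowess() returns a list with components x (sorted) and y (smoothed values)
s_wide <- lowess(cars$speed, cars$dist)             # default span, f = 2/3
s_narrow <- lowess(cars$speed, cars$dist, f = 0.2)  # smaller span, wigglier curve
str(s_narrow)

plot(cars$speed, cars$dist, xlab = "Speed", ylab = "Stopping distance")
lines(s_wide, lwd = 2)              # smoother curve
lines(s_narrow, lwd = 2, lty = 2)   # follows the data more closely
```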

Using these different kinds of graphical representations of relationships between variables may help to identify some hidden information (hidden due to overplotting).

See more on plot() function

## Exploring Data in R

Examining data (exploring data), particularly through graphical examination and representation, is an important prelude to statistical data analysis and modeling. Note that there are some limitations on the kinds of graphs that we can create.

One should be familiar with standard procedures for exploratory data analysis, statistical graphics, and data transformation. We can categorize graphical representations of data by the nature (or type) of the variable, the number of variables, and the objective of the analysis. For example, if we are comparing groups, comparison graphs such as bar graphs can be used; if we are interested in the kind of relationship between variables, a scatter plot can be useful.

• Distributional Displays:
The distributional displays include stem and leaf displays, histograms, density estimates, quantile comparison plots, and box plots.
• Plots of the Relationship between two variables:
The graphical representations of the relationship between two variables include various versions of scatter plots, scatter plot smoothers, bivariate density estimates, and parallel box plots.
• Multivariate Displays:
Multivariate graphical representations include scatter plot matrices, coplots, and dynamic three-dimensional scatter plots.
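As a brief sketch of the multivariate displays named in the list above, base R provides pairs() for scatter plot matrices and coplot() for conditioning plots. The choice of variables and the names vars and cars_f are illustrative assumptions.

```r
# Scatter plot matrix of three quantitative variables
vars <- mtcars[, c("mpg", "wt", "hp")]
pairs(vars, main = "Scatter plot matrix of mpg, wt, and hp")

# Coplot: mpg against wt, drawn separately for each number of cylinders
cars_f <- transform(mtcars, cyl = factor(cyl))
coplot(mpg ~ wt | cyl, data = cars_f)
```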

For exploring the data in R, the following are some examples:

### Stem and Leaf display and Histogram in R

attach(mtcars)
hist(mpg)
hist(mpg, nclass=3, col=3)
stem(mpg)

### Density Estimates

Consider the following R code for a representation of distribution by smoothing the histogram.

hist(mpg, probability = TRUE, ylab = "Density")
lines(density(mpg), lwd = 2)
points(mpg, rep(0, length(mpg)), pch = "|")
lines(density(mpg, adjust = 0.9), lwd = 1)

The hist() function constructs the histogram, with probability = TRUE specifying density scaling. The lines() function draws the density estimate on the graph, with lwd = 2 doubling the line thickness. The points() function draws a one-dimensional scatter plot at the bottom of the graph, using a vertical bar as the plotting symbol. The second call to density(), with adjust = 0.9, specifies a bandwidth of 0.9 times the default value.
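The effect of adjust on the bandwidth can be verified from the objects that density() returns. This short check uses the bw component of the result; the object names are chosen for illustration.

```r
d_default <- density(mtcars$mpg)               # default bandwidth
d_narrow <- density(mtcars$mpg, adjust = 0.9)  # 0.9 times the default

d_default$bw                 # bandwidth selected by default
d_narrow$bw / d_default$bw   # ratio of the two bandwidths
```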

### Quantile Comparison Plots

Quantile plots help in comparing the distribution of a variable with a theoretical distribution such as the normal distribution.

library(car)
qqPlot(mpg)

Note that the qqPlot() function is available in the car package. The older qq.plot() function is defunct.
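If the car package is not available, a plain normal quantile plot can be built from base R. The sketch below constructs it by hand to show what is being compared: the sorted data against the normal quantiles at the probability points given by ppoints(). It omits the confidence envelope that qqPlot() adds.

```r
y <- mtcars$mpg
n <- length(y)

# Theoretical normal quantiles at the plotting positions
theoretical <- qnorm(ppoints(n))

# Sorted sample values against theoretical quantiles
plot(theoretical, sort(y),
     xlab = "Normal quantiles", ylab = "Sample quantiles of mpg")
qqline(y)   # reference line through the first and third quartiles

# qqnorm() produces the same coordinates in one call
qq <- qqnorm(y, plot.it = FALSE)
```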

### Relationship Graphs

To explore the relationship between two quantitative variables, use the plot() function; for a more enhanced version of a scatter plot between two variables, use the scatterplot() function from the car package. This function plots the variables with least-squares and non-parametric regression lines. For example,

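plot(mpg, wt)
scatterplot(mpg, wt)
# cyl is a plain vector without row names; the car names are the row names of mtcars.
# In car 3.0 and later, labels are passed as id = list(labels = rownames(mtcars)).
scatterplot(mpg, wt, labels = rownames(mtcars))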

## Greek letters in R plot label and title

Question: How can one include Greek letters (symbols) in R plot labels?
Answer: Greek letters or symbols can be included in the titles and labels of a graph using the expression() function. Following are some examples.

Note that in these examples, random data are generated from a normal distribution. You can use your own data set to produce graphs that have symbols or Greek letters in their labels or titles.

Example 1:

> mycoef <- rnorm (1000)
> hist(mycoef, main = expression(beta) )

where beta in expression() is the Greek letter $\beta$. A histogram similar to the following will be produced.

Example 2:

> sample <- rnorm(mean=5, sd=1, n=100)
> hist(sample, main = expression(paste("sampled values, ", mu, "=5, ", sigma, "=1")))

where mu and sigma are the symbols $\mu$ and $\sigma$ respectively. Now the histogram will look like
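In Example 2 the values 5 and 1 are typed into the label by hand. As a hedged sketch, bquote() can insert computed values into a plotmath label instead; the variable names vals, m, s, and main_label are chosen here for illustration, and the hats mark the quantities as estimates.

```r
set.seed(123)
vals <- rnorm(100, mean = 5, sd = 1)

m <- round(mean(vals), 2)   # sample mean
s <- round(sd(vals), 2)     # sample standard deviation

# .(m) and .(s) are replaced by their current values inside bquote()
main_label <- bquote(paste("sampled values, ",
                           hat(mu) == .(m), ", ",
                           hat(sigma) == .(s)))
hist(vals, main = main_label, xlab = "vals")
```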

Example 3:

> curve(dnorm, from = -3, to = 3, n = 1000, main = "Normal Probability Density Function")

will produce the curve of the Normal probability density function ranging from $-3$ to $3$.

To add the normal density function formula to the plot, use the text() function together with expression() and paste(), that is

> text(-2, 0.3, expression(f(x) == paste(frac(1, sqrt(2 * pi * sigma^2)), " ", e^{frac(-(x - mu)^2, 2 * sigma^2)})), cex = 1.2)

Now the updated curve of Normal probability density function will be

Example 4:

> x <- dnorm( seq(-3, 3, 0.001))
> plot(seq(-3, 3, 0.001), cumsum(x)/sum(x), type = "l", col = "blue", xlab = "x", main = "Normal Cumulative Distribution Function")

The Normal Cumulative Distribution function will look like,

To add the formula, use the text() function together with expression() and paste(), that is

> text(-1.5, 0.7, expression(Phi(x) == paste(frac(1, sqrt(2 * pi)), " ", integral(e^{-t^2/2} * dt, -infinity, x))), cex = 1.2)

The Curve of Normal Cumulative Distribution Function and its formula in the plot will look like,
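The curve in Example 4 is a numerical approximation: it accumulates density values on a grid truncated at -3 and 3. As a quick check, offered as a sketch with illustrative names, it can be compared with pnorm(), which evaluates the normal cumulative distribution function directly.

```r
grid <- seq(-3, 3, by = 0.001)

# Approximation used in Example 4: normalized running sum of the density
approx_cdf <- cumsum(dnorm(grid)) / sum(dnorm(grid))

# Cumulative distribution function from pnorm()
exact_cdf <- pnorm(grid)

# Largest discrepancy between the two curves
max(abs(approx_cdf - exact_cdf))

plot(grid, exact_cdf, type = "l", col = "blue", xlab = "x", ylab = "Probability",
     main = "Normal Cumulative Distribution Function")
lines(grid, approx_cdf, lty = 2, col = "red")   # nearly indistinguishable
```

The small remaining gap comes mainly from the tail probability outside the interval from -3 to 3, which the truncated grid ignores.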
