One-Way ANOVA in R: A Comprehensive Guide

In this post, we will learn about one-way ANOVA in R Language.

The two-sample t-test or z-test is used to compare the means of two groups drawn from independent populations. However, if there are more than two groups, One-Way ANOVA (analysis of variance) or one of its extensions can be used in R.

Introduction to One-Way ANOVA

The test statistic associated with ANOVA is the F-statistic (also called the F-ratio). In the ANOVA procedure, an observed F-value is computed and then compared with a critical F-value derived from the relevant F-distribution. The F-distribution is a family of distributions, each defined by two numbers (the degrees of freedom). Note that the F-value cannot be negative, as it is a ratio of variances and variances are never negative.

One-Way ANOVA is also known as one-factor ANOVA. It is the extension of the independent two-sample test for comparing means when there are more than two groups. The data in One-Way ANOVA are organized into several groups based on a single grouping variable (also called a factor variable).

To compute the F-value, the ratio of the “variance between groups” to the “variance within groups” needs to be computed. The assumptions of ANOVA should also be checked before performing the ANOVA test. We will learn how to perform One-Way ANOVA in R.
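The ratio described above can be worked through by hand. The following is a minimal sketch, assuming the built-in mtcars data with mpg as the response and cyl as the grouping variable; the object names (ss_between, ms_within, and so on) are illustrative only.

```r
# Sketch: one-way ANOVA F-value computed by hand (mtcars is an assumed example)
y <- mtcars$mpg
g <- factor(mtcars$cyl)

k <- nlevels(g)                      # number of groups
n <- length(y)                       # total number of observations
group_means <- tapply(y, g, mean)    # mean of each group
group_sizes <- tapply(y, g, length)  # size of each group

# between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - mean(y))^2)
ss_within  <- sum((y - ave(y, g))^2)   # ave() returns each observation's group mean

# mean squares: sums of squares divided by their degrees of freedom
ms_between <- ss_between / (k - 1)
ms_within  <- ss_within / (n - k)

f_value <- ms_between / ms_within
f_crit  <- qf(0.95, df1 = k - 1, df2 = n - k)   # critical value at the 5% level
p_value <- pf(f_value, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

c(F = f_value, critical = f_crit, p = p_value)
```

The observed F-value should agree with the one reported by summary() for the corresponding aov( ) fit.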

One-Way ANOVA in R

Suppose we are interested in whether the mean miles per gallon (mpg) differs with the number of cylinders in an automobile, using the built-in “mtcars” dataset.

Let us get some basic insight into the data before performing the ANOVA.

# load and attach the data mtcars
attach(mtcars)
# see the variable names and initial observations
head(mtcars)

Let us find the mean mpg for each cylinder group

# mean mpg for each number of cylinders
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

Let us draw the boxplot of each group

boxplot(mpg ~ cyl, main="Boxplot", xlab="Number of Cylinders", ylab="mpg")

Now, One-Way ANOVA can be performed in R using the aov( ) function. For example,

aov(mpg ~ cyl)

The variable “mpg” is continuous and the variable “cyl” is the grouping variable. In the output, note the degrees of freedom for “cyl”: it is one. This is not the intended analysis, because “cyl” has three groups and should therefore have two degrees of freedom. Since “cyl” is stored as a numeric variable, aov( ) treats it as a continuous predictor. The grouping variable in ANOVA should be a factor. For this purpose, the “cyl” variable can be converted to a factor as

cyl <- as.factor(cyl)

Now re-issue the aov( ) function as

aov(mpg ~ cyl)

Now the results will be as required. To get the ANOVA table, use the summary( ) function as

summary(aov (mpg ~ cyl))

Let’s store the ANOVA results obtained from aov( ) in an object, say res

res <- aov(mpg ~ cyl)
summary(res)
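The effect of the factor conversion on the degrees of freedom can be checked directly. The sketch below is one possible variant, not the approach used above: it passes the data frame through the data argument and wraps cyl in factor() inside the formula, so it does not depend on attach(). The object names fit_numeric and fit_factor are illustrative.

```r
# Sketch: degrees of freedom with cyl as numeric versus as a factor
fit_numeric <- aov(mpg ~ cyl, data = mtcars)           # cyl treated as a numeric predictor
fit_factor  <- aov(mpg ~ factor(cyl), data = mtcars)   # cyl treated as a grouping factor

# summary() of an aov fit is a list whose first element is the ANOVA table
df_numeric <- summary(fit_numeric)[[1]][["Df"]][1]
df_factor  <- summary(fit_factor)[[1]][["Df"]][1]

c(numeric = df_numeric, factor = df_factor)
```

With three cylinder groups, the factor version should report two degrees of freedom for the grouping term and the numeric version one.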

Post-Hoc Analysis (Multiple Pairwise Comparison)

Post-hoc tests, or multiple pairwise comparison tests, help in finding out which groups differ significantly from one another and which do not. They allow multiple pairwise comparisons without inflating the type-I error. To understand this, suppose the level of significance (type-I error) of each test is 5%. With three groups there are three pairwise comparisons and, assuming the three tests are independent, the probability of making at least one type-I error (the maximum family-wise error rate) is

$1-(0.95 \times 0.95 \times 0.95) \approx 14.3\%$

This is the probability of having at least one false alarm (type-I error) across the three comparisons.
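The arithmetic generalizes to any number of independent comparisons as $1-(1-\alpha)^m$. A short sketch, where alpha, m, and fwer are illustrative names:

```r
# Sketch: family-wise error rate for m independent tests at level alpha
alpha <- 0.05
m <- choose(3, 2)            # number of pairwise comparisons among three groups
fwer <- 1 - (1 - alpha)^m    # probability of at least one type-I error
fwer

# how the rate grows with the number of comparisons
sapply(c(1, 3, 6, 10), function(m) 1 - (1 - alpha)^m)
```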

Tukey’s post-hoc test can be performed, and the differences in group means from the test plotted, as follows.

# Tukey Honestly Significant Differences
TukeyHSD(res)
plot(TukeyHSD(res))

Diagnostic Plots (Checking Model Assumptions)

The diagnostic plots can be used to check the assumptions of homoscedasticity (homogeneity of variance) and normality, and to detect influential observations.

layout(matrix(c(1,2,3,4), 2,2))
plot(res)

Figure: Diagnostic Plots for One-Way ANOVA in R

Levene’s Test

To check the homogeneity of variance assumption of ANOVA, Levene’s test can be used. For this purpose, the leveneTest( ) function, which is available in the car package, can be used.

library(car)
leveneTest(res)
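Levene’s test can also be sketched with base R only, which may help show what the test does: it is a one-way ANOVA on the absolute deviations of each observation from its group center. The sketch below assumes the group median as the center (to my understanding this is also the default in leveneTest( ), but check the car documentation); the names y_mpg, grp, abs_dev, and levene_by_hand are illustrative.

```r
# Sketch: Levene-type test by hand (median-centered), using base R only
y_mpg <- mtcars$mpg
grp   <- factor(mtcars$cyl)

# absolute deviation of each observation from its group median
abs_dev <- abs(y_mpg - ave(y_mpg, grp, FUN = median))

# one-way ANOVA on the absolute deviations
levene_by_hand <- anova(lm(abs_dev ~ grp))
levene_by_hand
```

A small p-value in this table would suggest that the group variances differ.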

https://itfeature.com

https://gmstat.com

Statistical Models in R Language: Secrets

The R language provides an interlocking suite of facilities that make fitting statistical models very simple. The output from statistical models in R is minimal, and one needs to ask for the details by calling extractor functions.

Defining Statistical Models in R Language

The template for a statistical model is a linear regression model with independent, homoscedastic errors, that is
$$y_i = \sum_{j=0}^p \beta_j x_{ij}+ e_i, \quad e_i \sim NID(0, \sigma^2), \quad i=1,2,\dots, n$$

In matrix form, the statistical model can be written as

$$y=X\beta+e$$

where $y$ is the dependent (response) variable and $X$ is the model matrix or design matrix (matrix of regressors), with columns $x_0, x_1, \cdots, x_p$, the determining variables. Usually, $x_0$ is a column of ones defining an intercept term in the statistical model.
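The design matrix $X$ can be inspected with model.matrix(). The following sketch assumes the mtcars data with wt and hp as regressors (an illustrative choice) and solves the normal equations only to show the link between $y = X\beta + e$ and the coefficients returned by lm().

```r
# Sketch: the design matrix behind a formula, with its column of ones
X <- model.matrix(mpg ~ wt + hp, data = mtcars)
head(X)        # first column "(Intercept)" is the column of ones (x_0)
colnames(X)

# least-squares estimate of beta from the normal equations (for illustration;
# lm() uses a QR decomposition internally)
y <- mtcars$mpg
beta_hat <- solve(crossprod(X), crossprod(X, y))
beta_hat

coef(lm(mpg ~ wt + hp, data = mtcars))   # should agree with beta_hat
```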

Statistical Model Examples

Suppose $y, x, x_0, x_1, x_2, \cdots$ are numeric variables and $X$ is a matrix. The following are some examples that specify statistical models in R.

  • y ~ x    or   y ~ 1 + x
    Both examples imply the same simple linear regression model of $y$ on $x$. The first formula has an implicit intercept term and the second has an explicit intercept term.
  • y ~ 0 + x  or  y ~ -1 + x  or  y ~ x - 1
    All these imply the same simple linear regression model of $y$ on $x$ through the origin, without an intercept term.
  • log(y) ~ x1 + x2
    Implies a multiple regression of the transformed variable $\log(y)$ on $x_1$ and $x_2$ with an implicit intercept term.
  • y ~ poly(x, 2)  or  y ~ 1 + x + I(x^2)
    Both imply a polynomial regression model of $y$ on $x$ of degree 2 (second-degree polynomial). The first formula uses orthogonal polynomials and the second uses explicit powers as a basis.
  • y ~ X + poly(x, 2)
    Multiple regression of $y$ with a model matrix consisting of the design matrix $X$ as well as polynomial terms in $x$ to degree 2.
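The equivalences in the list above can be verified by fitting the formulas. This is a minimal sketch assuming the mtcars data, with mpg standing in for $y$ and wt for $x$; the fit_* names are illustrative.

```r
# Sketch: checking that equivalent formulas give the same fitted model
fit_a <- lm(mpg ~ wt, data = mtcars)        # implicit intercept
fit_b <- lm(mpg ~ 1 + wt, data = mtcars)    # explicit intercept
all.equal(coef(fit_a), coef(fit_b))

fit_c <- lm(mpg ~ 0 + wt, data = mtcars)    # regression through the origin
fit_d <- lm(mpg ~ wt - 1, data = mtcars)    # same model, different notation
names(coef(fit_c))                          # only "wt", no intercept

# orthogonal polynomial basis versus explicit powers: the coefficients differ,
# but the fitted values are the same
fit_poly <- lm(mpg ~ poly(wt, 2), data = mtcars)
fit_raw  <- lm(mpg ~ 1 + wt + I(wt^2), data = mtcars)
all.equal(fitted(fit_poly), fitted(fit_raw))
```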

Note that the operator ~ defines a model formula in the R language. The form of an ordinary linear regression model is $response\,\, \sim \,\, op_1\,\, term_1\,\, op_2\,\, term_2\,\, op_3\,\, term_3\,\, \cdots $,

where

  • The response is a vector or matrix defining the response (dependent) variable(s).
  • $op_i$ is an operator, either + or -, implying the inclusion or exclusion of a term in the model. The first operator, $op_1$, is optional.
  • $term_i$ is either a matrix or vector or 1. It may be a factor or a formula expression consisting of factors, vectors, or matrices connected by formula operators.
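The inclusion (+) and exclusion (-) operators can be explored with terms() and update(). A short sketch, assuming mtcars variable names; f, f_dropped, and f_noint are illustrative object names.

```r
# Sketch: how + and - include or exclude terms in a model formula
f <- mpg ~ wt + hp + qsec

f_dropped <- update(f, . ~ . - hp)   # exclude the term hp
f_noint   <- update(f, . ~ . - 1)    # exclude the intercept

attr(terms(f), "term.labels")          # "wt" "hp" "qsec"
attr(terms(f_dropped), "term.labels")  # "wt" "qsec"
attr(terms(f), "intercept")            # 1: intercept included
attr(terms(f_noint), "intercept")      # 0: intercept excluded
```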

FAQs about Statistical Models in R

  1. How are statistical models specified in the R language?
  2. How is linear regression performed in the R language using a formula?
  3. How can linear regression be performed without an intercept in R?
  4. How can polynomial regression be performed in R?
  5. Write about the ~ operator in R.
https://rfaqs.com


Handling Missing Values in R: A Quick Guide

The article is about Handling Missing Values in R Language.

Question: What are the differences between missing values in R and other Statistical Packages?

Answer: Missing values (NA) cannot be used in comparisons, as already discussed in the previous post on missing values in R. In other statistical packages (software), a “missing value” is often represented by a code that is very high or very low in magnitude, such as 99 or -99. These coded values are treated as missing, yet they can still be compared with other values, and other values can be compared with them.

In the R language, NA is used for all kinds of missing data, while in other packages missing strings and missing numbers are represented differently, for example, empty quotation marks for strings, and periods or very large or small numbers for numeric values. Similarly, in R a non-NA value cannot be interpreted as missing, while in other packages particular ordinary values are designated as missing.
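The behavior described above can be seen in a few lines. This sketch uses made-up vectors (x, s, x_coded) purely for illustration, including a hypothetical 99/-99 coding of the kind used by other packages.

```r
# Sketch: NA in comparisons, and recoding sentinel values to NA
x <- c(3, NA, 8)
x == NA        # every element is NA: comparing with NA never gives TRUE or FALSE
x > 5          # FALSE NA TRUE: the missing element stays missing
is.na(x)       # FALSE TRUE FALSE: the way to test for missingness

s <- c("a", NA, "")
is.na(s)       # FALSE TRUE FALSE: an empty string is not a missing value

# data imported with 99 / -99 as missing codes (hypothetical coding)
x_coded <- c(12, 99, 7, -99)
x_coded[x_coded %in% c(99, -99)] <- NA
mean(x_coded, na.rm = TRUE)   # mean of the non-missing values 12 and 7
```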

Handling Missing Values in R

Question: What are NA options in R?
Answer: In the previous post on missing values, I introduced the is.na() function as a tool for both finding and creating missing values. The is.na() function is one of several functions built around NA. Most of the other functions for missing values (NA) are possible settings for the na.action option (or for the na.action argument of modeling functions). The possible na.action settings within R are:

  • na.omit() and na.exclude(): These functions return the object with observations removed if they contain any missing (NA) values. The difference between these two functions na.omit() and na.exclude() can be seen in some prediction and residual functions.
  • na.pass(): This function returns the object unchanged.
  • na.fail(): This function returns the object only if it contains no missing values; otherwise, it signals an error.

To understand these NA options use the following lines of code.

getOption("na.action")

(m <- as.data.frame(matrix(c(1 : 5, NA), ncol=2)))
na.omit(m)
na.exclude(m)
try(na.fail(m))   # na.fail() signals an error here because m contains an NA
na.pass(m)
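The difference between na.omit() and na.exclude() shows up in residual and prediction functions. The following sketch uses a small made-up data frame d with one missing response value; all names are illustrative.

```r
# Sketch: na.omit versus na.exclude in a fitted model
d <- data.frame(x = c(1, 2, 3, 4, 5),
                y = c(2.1, NA, 6.2, 7.9, 10.1))

fit_omit    <- lm(y ~ x, data = d, na.action = na.omit)
fit_exclude <- lm(y ~ x, data = d, na.action = na.exclude)

# both fits use the same four complete rows, so the coefficients agree
all.equal(coef(fit_omit), coef(fit_exclude))

length(residuals(fit_omit))      # 4: the incomplete row is dropped
length(residuals(fit_exclude))   # 5: padded with NA to match the rows of d
residuals(fit_exclude)
```

With na.exclude, residuals and fitted values line up with the rows of the original data, which is convenient when adding them back as columns.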

Note that it is wise to investigate the missing values in your data set and to make use of the help files for all the functions you intend to use for handling missing values. You should either be aware of and comfortable with the default treatment of missing values, or specify the treatment of missing values you want for your analysis.

FAQs about Missing Values in R

  1. What is meant by a missing value?
  2. How can one handle missing values in R?
  3. What is NA in R?
  4. How can one identify missing values in R?
  5. What is the is.na() function?
  6. What is the use of the na.omit() function in R?
  7. Why is it important to investigate missing values before performing any data analysis?
