Statistical Computing and Graphics in R

# Comparison Tests

t-test and ANOVA

## Mean Comparison Tests: Hypothesis Testing (One Sample and Two Sample)

Here we learn the basics of performing mean comparison tests: hypothesis testing for a one-sample test, a two-sample independent test, and a dependent-sample test. We will also learn how to find p-values for a given distribution (such as the t-distribution) and critical region values, and how to perform one-tailed and two-tailed hypothesis tests.

How to Perform One-Sample t-Test in R

A recent article in The Wall Street Journal reported that the 30-year mortgage rate is now less than 6%. A sample of eight small banks in the Midwest revealed the following 30-year rates (in percent); the values appear in the code below.

At the 0.01 significance level (probability of type-I error), can we conclude that the 30-year mortgage rate for small banks is less than 6%?

Manual Calculations for One-Sample t-Test and Confidence Interval

# Manual way
X <- c(4.8, 5.3, 6.5, 4.8, 6.1, 5.8, 6.2, 5.6)
xbar <- mean(X)
s <- sd(X)
mu = 6
n = length(X)
df = n - 1
tcal = (xbar - mu)/(s/sqrt(n) )
tcal

c(xbar - qt(0.995, df = df) * s/sqrt(n), xbar + qt(0.995, df = df) * s/sqrt(n))
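Putting the pieces above together, a minimal self-contained sketch of the decision rule for this left-tailed test (the data, t-statistic, and critical value are the ones computed above):

```r
# Decision rule for the left-tailed test at the 0.01 level
X <- c(4.8, 5.3, 6.5, 4.8, 6.1, 5.8, 6.2, 5.6)
n <- length(X)
df <- n - 1
tcal <- (mean(X) - 6) / (sd(X) / sqrt(n))  # about -1.62
tcrit <- qt(0.01, df = df)                 # left-tail critical value, about -3.00
tcal < tcrit                               # FALSE: do not reject H0
```

Since the calculated t does not fall below the critical value, we cannot conclude at the 0.01 level that the 30-year mortgage rate for small banks is less than 6%.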

Critical Values from t-Table

# Critical Value for Left Tail
qt(0.01, df = df, lower.tail = T)

# Critical Value for Right Tail
qt(0.99, df = df, lower.tail = T)

# Critical Value for Both Tails
qt(0.995, df = df)

Finding p-Values

# p-value (alternative is less)
pt(tcal, df = df)

# p-value (alternative is greater)
1 - pt(tcal, df = df)

# p-value (alternative two-tailed or not equal to); -abs() makes this
# correct whether tcal is negative or positive
2 * pt(-abs(tcal), df = df)


Performing One-Sample Confidence Interval and t-test Using Built-in Function

# Left Tail test
t.test(x = X, mu = 6, alternative = c("less"), conf.level = 0.99)

# Right Tail test
t.test(x = X, mu = 6, alternative = c("greater"), conf.level = 0.99)

# Two Tail test
t.test(x = X, mu = 6, alternative = c("two.sided"), conf.level = 0.99)
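The object returned by t.test( ) is a list, so individual pieces of the output can be extracted by name rather than read off the printed display. A small sketch using the same data:

```r
X <- c(4.8, 5.3, 6.5, 4.8, 6.1, 5.8, 6.2, 5.6)
tt <- t.test(x = X, mu = 6, alternative = "less", conf.level = 0.99)
tt$statistic  # the calculated t value
tt$p.value    # the p-value of the test
tt$conf.int   # the confidence bound
```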


How to Perform Two-Sample t-Test in R

Suppose we have two samples stored in two vectors $X$ and $Y$, as shown in the R code below. We are interested in comparing the means of two groups of people with respect to (say) their wages in a certain week.

X = c(70, 82, 78, 70, 74, 82, 90)
Y = c(60, 80, 91, 89, 77, 69, 88, 82)

Manual Calculations for Two-Sample t-Test and Confidence Interval

nx = length(X)
ny = length(Y)
xbar = mean(X)
sx = sd(X)
ybar = mean(Y)
sy = sd(Y)
df = nx + ny - 2
# Pooled Standard Deviation/ Variance
SP = sqrt( ( (nx-1) * sx^2 + (ny-1) * sy^2) / df )
tcal = (( xbar - ybar ) - 0) / (SP *sqrt(1/nx + 1/ny))
tcal

# Confidence Interval
LL <- (xbar - ybar) - qt(0.975, df)* sqrt((SP^2 *(1/nx + 1/ny) ))
UL <- (xbar - ybar) + qt(0.975, df)* sqrt((SP^2 *(1/nx + 1/ny) ))
c(LL, UL)

Finding p-values

# The p-value at the left-hand side of Critical Region
pt(tcal, df )

# The p-value for the two-tailed Critical Region; -abs() makes this
# correct whether tcal is negative or positive
2 * pt(-abs(tcal), df)

# The p-value at the right-hand side of Critical Region
1 - pt(tcal, df)

Finding Critical Values from t-Table

# Left Tail
qt(0.025, df = df, lower.tail = T)

# Right Tail
qt(0.975, df = df, lower.tail = T)

# Both Tails (at the 0.05 level the critical values are the 0.025 and 0.975 quantiles)
qt(c(0.025, 0.975), df = df)

Performing Two-Sample Confidence Interval and t-test Using Built-in Function

# Left Tail test
t.test(X, Y, alternative = c("less"), var.equal = T)

# Right Tail test
t.test(X, Y, alternative = c("greater"), var.equal = T)

# Two Tail test
t.test(X, Y, alternative = c("two.sided"), var.equal = T)
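Note that var.equal = T forces the pooled (Student's) version used in the manual calculations above. If the argument is omitted, t.test( ) defaults to Welch's t-test, which does not assume equal variances:

```r
X <- c(70, 82, 78, 70, 74, 82, 90)
Y <- c(60, 80, 91, 89, 77, 69, 88, 82)
t.test(X, Y)                    # Welch's t-test (default, unequal variances)
t.test(X, Y, var.equal = TRUE)  # pooled two-sample t-test, as above
```

The Welch version adjusts the degrees of freedom downward, so the two calls generally report different df and slightly different p-values.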

Note that if the $X$ and $Y$ variables come from a data frame, the two-sample t-test can be performed using the formula notation (~). Let’s first make a data frame from the vectors $X$ and $Y$.

data <- data.frame(values = c(X, Y), group = c(rep("A", nx), rep("B", ny)))

t.test(values ~ group, data = data, alternative = "less", var.equal = T)
t.test(values ~ group, data = data, alternative = "greater", var.equal = T)
t.test(values ~ group, data = data, alternative = "two.sided", var.equal = T)


To understand probability distribution functions in R, see: Probability Distributions in R

## One-Way ANOVA in R

The two-sample t- or z-test is used to compare two groups from independent populations. However, if there are more than two groups, analysis of variance (ANOVA) can be used.

The test statistic associated with ANOVA is the F-test (also called the F-ratio). In the ANOVA procedure, an observed F-value is computed and then compared with a critical F-value derived from the relevant F-distribution. The F-value comes from a family of F-distributions, each defined by two numbers (the degrees of freedom). Note that the F-distribution cannot be negative, as it is the ratio of two variances and variances are always positive.

One-Way ANOVA is also known as one-factor ANOVA. It is the extension of the independent two-sample test for comparing means when there are more than two groups. The data in One-Way ANOVA are organized into several groups based on a grouping variable (also called a factor variable).

To compute the F-value, the ratio of the variance between groups to the variance within groups needs to be computed. The assumptions of ANOVA should also be checked before performing the test. We will learn how to perform One-Way ANOVA in R.
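As with the t-distribution, the qf( ) and pf( ) functions give critical values and p-values for the F-distribution. A sketch, assuming for illustration 3 groups and 32 total observations (so the degrees of freedom are 2 and 29), with a hypothetical observed F of 4.5:

```r
df1 <- 2   # between-groups degrees of freedom (k - 1, with k = 3 groups)
df2 <- 29  # within-groups degrees of freedom (N - k, with N = 32)
qf(0.95, df1, df2)                     # critical F at the 0.05 level
pf(4.5, df1, df2, lower.tail = FALSE)  # p-value for a hypothetical F of 4.5
```

Because the F-distribution is non-negative, the test is always right-tailed: a large observed F means the between-groups variance dominates the within-groups variance.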

Suppose we are interested in the difference in miles per gallon based on the number of cylinders in an automobile, using the dataset “mtcars”.

Let us get some basic insight into the data before performing the ANOVA.

# load and attach the data mtcars
attach(mtcars)
# see the variable names and initial observations
head(mtcars)

Let us find the mean of “mpg” for each cylinder group

tapply(mpg, cyl, mean)

Let us draw the boxplot of each group

boxplot(mpg ~ cyl, main="Boxplot", xlab="Number of Cylinders", ylab="mpg")

Now, perform One-Way ANOVA in R using the aov( ) function. For example,

aov(mpg ~ cyl)

The variable “mpg” is continuous and “cyl” is the grouping variable. From the output, note the degrees of freedom for “cyl”: it is one, which means the results are not correct, since the degrees of freedom should be two (there are three groups in “cyl”). The grouping variable in ANOVA must have factor as its mode (data type). For this purpose, the “cyl” variable can be converted to a factor as

cyl <- as.factor(cyl)

Now re-issue the aov( ) function as

aov(mpg ~ cyl)

Now the results will be as required. To get the ANOVA table, use the summary( ) function as

summary(aov (mpg ~ cyl))

Let’s store the ANOVA results obtained from aov( ) in an object, say res

res <- aov(mpg ~ cyl)
summary(res)
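If the F statistic and p-value are needed programmatically rather than read off the printed table, they can be pulled out of the summary object. A small sketch (summary( ) on an aov fit returns a list whose first element is the ANOVA table):

```r
# Self-contained: refit the model with cyl converted to a factor inline
res <- aov(mpg ~ as.factor(cyl), data = mtcars)
tab <- summary(res)[[1]]
tab[["F value"]][1]  # observed F value for cyl
tab[["Pr(>F)"]][1]   # its p-value
```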

#### Post-Hoc Analysis

Post-hoc tests, or multiple pairwise comparison tests, help in finding out which groups differ significantly from one another and which do not. Post-hoc tests allow for multiple pairwise comparisons without inflating the type-I error. To understand this, suppose the level of significance (type-I error) is 5%. Then, assuming three independent comparisons, the probability of making at least one type-I error (the maximum family-wise error rate) will be

$1-(0.95 \times 0.95 \times 0.95) = 0.1426 \approx 14.26\%$

This gives the probability of having at least one false alarm (type-I error).
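The family-wise error rate calculation above generalizes to any number of comparisons; in R:

```r
alpha <- 0.05  # per-comparison significance level
k <- 3         # number of independent comparisons
1 - (1 - alpha)^k  # family-wise error rate, about 0.1426
```

With the six pairwise comparisons that three groups actually generate, the same formula with k = 6 gives an even larger rate, which is why adjusted procedures such as Tukey's HSD are preferred.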

To perform Tukey’s post-hoc test and plot the differences in group means from Tukey’s test:

# Tukey Honestly Significant Differences
TukeyHSD(res)
plot(TukeyHSD(res))

#### Diagnostic Plots

Diagnostic plots can be used to check the assumptions of homoscedasticity and normality, and to spot influential observations.

layout(matrix(c(1,2,3,4), 2,2))
plot(res)

#### Levene’s Test

To check the homogeneity-of-variance assumption of ANOVA, Levene’s test can be used. For this purpose, the leveneTest( ) function from the car package can be applied to the fitted model.

library(car)
leveneTest(res)