Load Data from R Library (2020)

This tutorial shows how to read data from an R library. Many R packages include datasets that can be used for learning and practicing R functions. For example, the car package provides the Duncan dataset. To use the Duncan data, you first have to load the car package, which must be installed on your system. Let us read data from the R library and work with the Duncan dataset.

Getting Data from R Library

To read or load data stored in an R library, load the library first.

library(car)
data(Duncan)
attach(Duncan)

If the car package is not installed on your system, you can install it with the following command. Note that your system must be connected to the internet.

install.packages("car")
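To avoid reinstalling a package that is already present, the installation can be made conditional. The following is a minimal sketch; ensure_package() is a hypothetical helper name introduced here for illustration, while requireNamespace() and install.packages() are standard R functions.

```r
# ensure_package() is a hypothetical helper, not part of base R or car.
ensure_package <- function(pkg) {
  # requireNamespace() returns TRUE when the package can be loaded
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needs an internet connection
  }
  # Report whether the package is available after the attempt
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call never triggers an installation
ensure_package("stats")
```

Calling ensure_package("car") before library(car) would follow the same pattern.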

Reading Data from R Library

The attach( ) function makes each variable accessible by name, without prefixing it with the dataset name. After attaching the Duncan dataset, one can access a variable, say education, directly instead of writing Duncan$education. Let us use some functions to explore the data from the R library.

head(Duncan)

The head( ) function displays the first six observations with their variable names in a table-like format. It helps to understand the structure of the dataset.

summary(Duncan)

For quantitative variables, the summary( ) function will provide five-number summary statistics with the mean value. For qualitative variables, the summary( ) function will provide the frequency of each group.
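The two behaviours of summary( ) can be seen side by side on a single dataset. This sketch uses iris, which ships with base R, instead of Duncan so that it runs without the car package; the object names num_summary and fac_summary are chosen here for illustration.

```r
# iris has four numeric columns and one factor column (Species)
num_summary <- summary(iris$Sepal.Length)  # quantitative variable
fac_summary <- summary(iris$Species)       # qualitative variable

# Five-number summary plus the mean
num_summary

# Frequency of each group
fac_summary
```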

To draw a scatter plot, one can use the plot( ) function. For example,

plot(education, income)
Scatter plot of Education and Income

The scatter plot shows the strength and direction of the relationship between education (“Percentage of occupational incumbents in 1950 who were high school graduates”) and income (“Percentage of occupational incumbents in the 1950 US Census who earned $3,500 or more per year”).

Getting Basic Data Information

To check how many observations (rows) and columns a dataset has, one can use the nrow( ) and ncol( ) functions. For example,

nrow(Duncan)
ncol(Duncan)
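Both counts can also be obtained in one call, together with a compact view of the column types. The sketch below uses the built-in mtcars dataset so that it runs without the car package; the same calls should work on Duncan once car is loaded.

```r
# dim() returns the number of rows and columns in one vector
dims <- dim(mtcars)
dims

# str() shows each column with its type and first few values
str(mtcars)

# names() lists the variable names
names(mtcars)
```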

To get the definition of a dataset and its variables, one can read the dataset documentation:

?Duncan

To see the list of pre-loaded data, type the function data( ):

data( )

It is best practice to attach only one dataset at a time when reading data from an R library or importing it from a data file. To remove a data frame from the search path, use the detach( ) function.
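As an alternative to attaching, with( ) evaluates an expression inside a data frame without changing the search path. The sketch below uses the built-in mtcars dataset and also shows attach( ) paired with detach( ); the names mean_mpg, on_path, and off_path are chosen here for illustration.

```r
# with() gives access to columns by name for a single expression
mean_mpg <- with(mtcars, mean(mpg))
mean_mpg

# Pairing attach() with detach() keeps the search path clean
attach(mtcars)
on_path <- "mtcars" %in% search()      # TRUE while attached
detach(mtcars)
off_path <- !("mtcars" %in% search())  # TRUE after detaching
```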

Exercise for Data from R Library

Try the following datasets and make use of all the functions discussed in this lecture.

mtcars
iris
ToothGrowth
PlantGrowth
USArrests


Mean Comparison Tests in R

Comparing means between groups is fundamental in statistical analysis and data science. Whether you are testing drug efficacy, evaluating marketing campaigns, or analyzing experimental data, the R language offers powerful tools for mean comparison tests.

Mean Comparison Tests in R Language

Here, we learn the basics of performing mean comparison tests in the R language: hypothesis testing for the one-sample test, the two-sample independent test, and the dependent-sample test. We will also learn how to find p-values and critical values for a distribution such as the t-distribution, and how to perform one-tailed and two-tailed hypothesis tests.

How to Perform One-Sample t-Test in R

A recent article in The Wall Street Journal reported that the 30-year mortgage rate is now less than 6%. A sample of eight small banks in the Midwest revealed the following 30-year rates (in percent):

4.8, 5.3, 6.5, 4.8, 6.1, 5.8, 6.2, 5.6

At the 0.01 significance level (probability of type-I error), can we conclude that the 30-year mortgage rate for small banks is less than 6%?

Manual Calculations for One-Sample t-Test and Confidence Interval

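The one-sample mean comparison test can be performed manually.

# Manual way
X <- c(4.8, 5.3, 6.5, 4.8, 6.1, 5.8, 6.2, 5.6)
xbar <- mean(X)   # sample mean
s <- sd(X)        # sample standard deviation
mu <- 6           # hypothesized mean
n <- length(X)
df <- n - 1       # degrees of freedom
# Calculated t statistic
tcal <- (xbar - mu) / (s / sqrt(n))
tcal
# 99% confidence interval for the mean
c(xbar - qt(0.995, df = df) * s / sqrt(n), xbar + qt(0.995, df = df) * s / sqrt(n))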
Mean Comparison Tests: One sample Confidence Interval
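In formula form, the manual calculation above corresponds to the following left-tailed test, where $\mu_0 = 6$ is the hypothesized mean, $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, and $n$ is the sample size. The subscript notation for the t quantile is a convention chosen here.

$$H_0: \mu \ge \mu_0 \quad \text{versus} \quad H_1: \mu < \mu_0$$

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}, \qquad df = n - 1$$

The 99% confidence interval computed in the last line of the code is

$$\bar{x} \pm t_{0.995,\, n-1} \cdot \frac{s}{\sqrt{n}}$$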

Critical Values from t-Table

# Critical Value for Left Tail
qt(0.01, df = df, lower.tail = TRUE)
# Critical Value for Right Tail
qt(0.99, df = df, lower.tail = TRUE)
# Critical Value for Both Tails (use the negative and positive of this value)
qt(0.995, df = df)
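The critical value is only useful once it is compared with the calculated statistic. The following self-contained sketch applies the left-tailed decision rule to the mortgage data; the names mu0, alpha, t_crit, and reject_h0 are chosen here for illustration.

```r
# Mortgage rates from the example
X <- c(4.8, 5.3, 6.5, 4.8, 6.1, 5.8, 6.2, 5.6)
mu0 <- 6       # hypothesized mean
alpha <- 0.01  # significance level
n <- length(X)

# Calculated t statistic
tcal <- (mean(X) - mu0) / (sd(X) / sqrt(n))

# Left-tail critical value
t_crit <- qt(alpha, df = n - 1)

# Reject H0 only if the statistic falls below the critical value
reject_h0 <- tcal < t_crit
c(tcal = tcal, t_crit = t_crit)
reject_h0
```

For these data, the statistic does not fall in the rejection region, so at the 0.01 level there is not enough evidence to conclude that the mean rate is less than 6%.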

Finding p-Values

# p-value (alternative is less)
pt(tcal, df = df)
# p-value (alternative is greater)
1 - pt(tcal, df = df)
# p-value (alternative two tailed or not equal to)
2 * pt(-abs(tcal), df = df)

Performing One-Sample Confidence Interval and t-test Using Built-in Function

One can perform one sample mean comparison test using built-in functions available in the R Language.

# Left Tail test
t.test(x = X, mu = 6, alternative = c("less"), conf.level = 0.99)
# Right Tail test
t.test(x = X, mu = 6, alternative = c("greater"), conf.level = 0.99)
# Two Tail test
t.test(x = X, mu = 6, alternative = c("two.sided"), conf.level = 0.99)
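The object returned by t.test( ) can be stored and its components inspected, which makes it easy to check the manual results against the built-in function. This is a minimal sketch; the names res, tcal, and p_manual are chosen here for illustration.

```r
X <- c(4.8, 5.3, 6.5, 4.8, 6.1, 5.8, 6.2, 5.6)
n <- length(X)

# Manual statistic and left-tail p-value
tcal <- (mean(X) - 6) / (sd(X) / sqrt(n))
p_manual <- pt(tcal, df = n - 1)

# Built-in left-tailed test, stored instead of printed
res <- t.test(X, mu = 6, alternative = "less", conf.level = 0.99)

res$statistic  # calculated t value
res$parameter  # degrees of freedom
res$p.value    # p-value
res$conf.int   # one-sided 99% confidence interval

# Decision based on the p-value
res$p.value < 0.01
```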

How to Perform a Two-Sample t-Test in R

Suppose we have two samples stored in two vectors $X$ and $Y$, as shown in the R code below. We are interested in a mean comparison test between two groups of people regarding (say) their wages in a certain week.

X = c(70, 82, 78, 70, 74, 82, 90)
Y = c(60, 80, 91, 89, 77, 69, 88, 82)

Manual Calculations for Two-Sample t-Test and Confidence Interval

The manual calculation for the two-sample t-test as a mean comparison test is as follows.

nx = length(X)
ny = length(Y)
xbar = mean(X)
sx = sd(X)
ybar = mean(Y)
sy = sd(Y)
df = nx + ny - 2
# Pooled standard deviation (square root of the pooled variance)
SP = sqrt( ( (nx-1) * sx^2 + (ny-1) * sy^2) / df )
tcal = (( xbar - ybar ) - 0) / (SP *sqrt(1/nx + 1/ny))
tcal
# Confidence Interval
LL <- (xbar - ybar) - qt(0.975, df)* sqrt((SP^2 *(1/nx + 1/ny) ))
UL <- (xbar - ybar) + qt(0.975, df)* sqrt((SP^2 *(1/nx + 1/ny) ))
c(LL, UL)
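In formula form, the code above computes the pooled standard deviation, the test statistic, and the 95% confidence interval for the difference in means. Here $n_x$, $n_y$ are the sample sizes and $s_x$, $s_y$ the sample standard deviations; the subscript notation for the t quantile is a convention chosen here.

$$S_p = \sqrt{\frac{(n_x - 1)\, s_x^2 + (n_y - 1)\, s_y^2}{n_x + n_y - 2}}$$

$$t = \frac{(\bar{x} - \bar{y}) - 0}{S_p \sqrt{\dfrac{1}{n_x} + \dfrac{1}{n_y}}}, \qquad df = n_x + n_y - 2$$

$$(\bar{x} - \bar{y}) \pm t_{0.975,\, df} \cdot S_p \sqrt{\frac{1}{n_x} + \frac{1}{n_y}}$$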

Finding p-values

# The p-value for the left-tailed test
pt(tcal, df)
# The p-value for the two-tailed test
2 * pt(-abs(tcal), df)
# The p-value for the right-tailed test
1 - pt(tcal, df)

Finding Critical Values from the t-Table

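# Critical values at the 0.05 significance level
# Left Tail
qt(0.05, df = df, lower.tail = TRUE)
# Right Tail
qt(0.95, df = df, lower.tail = TRUE)
# Both tails (use the negative and positive of this value)
qt(0.975, df = df)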

Performing Two-Sample Confidence Interval and T-test using Built-in Function

One can perform two sample mean comparison tests using built-in functions in R Language.

# Left Tail test
t.test(X, Y, alternative = c("less"), var.equal = T)
# Right Tail test
t.test(X, Y, alternative = c("greater"), var.equal = T)
# Two Tail test
t.test(X, Y, alternative = c("two.sided"), var.equal = T)
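The var.equal = T argument matters: without it, t.test( ) defaults to var.equal = FALSE and performs Welch's test, which does not pool the variances. The sketch below compares both versions and checks the pooled result against the manual formula; the names pooled, welch, sp, and t_manual are chosen here for illustration.

```r
X <- c(70, 82, 78, 70, 74, 82, 90)
Y <- c(60, 80, 91, 89, 77, 69, 88, 82)
nx <- length(X)
ny <- length(Y)

# Manual pooled-variance t statistic
sp <- sqrt(((nx - 1) * var(X) + (ny - 1) * var(Y)) / (nx + ny - 2))
t_manual <- (mean(X) - mean(Y)) / (sp * sqrt(1 / nx + 1 / ny))

# Pooled test (equal variances assumed)
pooled <- t.test(X, Y, var.equal = TRUE)

# Welch test (the default, equal variances not assumed)
welch <- t.test(X, Y)

pooled$parameter  # degrees of freedom: nx + ny - 2
welch$parameter   # Welch degrees of freedom, usually not an integer
```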

Note that if the $X$ and $Y$ variables are stored in a data frame, the two-sample t-test is performed using the formula symbol (~). Let’s first make a data frame from the vectors $X$ and $Y$.

data <- data.frame(values = c(X, Y), group = c(rep("A", nx), rep("B", ny)))
t.test(values ~ group, data = data, alternative = "less", var.equal = T)
t.test(values ~ group, data = data, alternative = "greater", var.equal = T)
t.test(values ~ group, data = data, alternative = "two.sided", var.equal = T)
Mean Comparison Test in R

To understand probability distribution functions in R, see: Probability Distributions in R


Factors in R (Categorical Data): Learning Made Easy

Factors in the R language are used to represent categorical data. Factors can be ordered or unordered. One can think of a factor as an integer vector where each integer has a label. Modeling functions such as lm() and glm() treat factors specially. Factors are the data objects used for categorical data, and their distinct values are stored as levels. They can be created from both string and integer variables.
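The difference between ordered and unordered factors, and the integer codes behind the labels, can be sketched as follows. The values "low", "medium", and "high" and the object names are hypothetical examples chosen here for illustration.

```r
# Ordered factor: the levels have a meaningful order
sizes <- factor(c("low", "high", "medium", "low"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)
is.ordered(sizes)
sizes[1] < sizes[2]  # comparisons are allowed for ordered factors

# Unordered factor: the levels are just labels
answers <- factor(c("yes", "no", "yes"))
is.ordered(answers)

# The underlying integer codes (levels are sorted: "no" = 1, "yes" = 2)
as.integer(answers)
```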

Using factors with labels is better than using integers as factors are self-describing; having a variable that has values “Male” and “Female” is better than a variable having values 1 and 2.

Creating a Simple Factor in R

The following example creates a simple factor variable that has two levels.

# Simple factor with two levels
x <- factor(c("yes", "yes", "no", "yes", "no"))
# computes frequency of factors
table(x)
# strips out the class
unclass(x)
Factors in R

The order of the levels can be set using the levels argument to factor(). This can be important in linear modeling because the first level is used as the baseline level.

x <- factor(c("yes","yes","no","yes","no"), levels = c("yes","no"))
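The effect of the levels argument can be checked with levels( ). The sketch below also uses relevel( ), a base R function that moves a chosen level to the first (baseline) position of an existing factor; the object names are chosen here for illustration.

```r
answers <- c("yes", "yes", "no", "yes", "no")

# Default: levels are sorted alphabetically, so "no" comes first
x_default <- factor(answers)
levels(x_default)

# Explicit order: "yes" becomes the first (baseline) level
x_custom <- factor(answers, levels = c("yes", "no"))
levels(x_custom)

# relevel() changes the baseline of an existing factor
x_relevel <- relevel(x_default, ref = "yes")
levels(x_relevel)
```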

Naming Factors in R

Factors can be given new names using the labels argument. The labels argument replaces the old values of the variable with new ones, matched to the levels in order. For example,

x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"), labels = c(1, 2))
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"), labels = c("Level-1", "Level-2"))

x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"), labels = c("group-1", "group-2"))

Suppose you have a factor variable with numerical values and you want to compute the mean. Calling mean() on the numeric vector returns its average, but calling mean() on the factor returns NA with a warning message. To calculate the mean of the original numeric values of the "f" variable, convert the factor back to numbers through its levels, using the levels() function. For example,

# vector
v <- c(10,20,20,50,10,20,10,50,20)
# vector converted to factor
f <- factor(v)
# mean of the vector
mean(v)

# mean of factor: returns NA with a warning
mean(f)
# mean of the original values, recovered through the levels
mean(as.numeric(levels(f)[f]))
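A common mistake is to call as.numeric( ) directly on the factor, which returns the internal integer codes rather than the original values. The following sketch contrasts the codes with two ways of recovering the values; the names codes, values, and values2 are chosen here for illustration.

```r
v <- c(10, 20, 20, 50, 10, 20, 10, 50, 20)
f <- factor(v)

# Integer codes of the levels "10", "20", "50", not the values themselves
codes <- as.numeric(f)
codes

# Recover the original values through the character labels
values <- as.numeric(as.character(f))

# Equivalent: convert the levels once, then index by the factor
values2 <- as.numeric(levels(f))[f]

mean(values)
```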

Use of cut() Function in R

The cut() function in R can also be used to convert a numeric variable into a factor. The breaks argument describes how ranges of numbers are converted to factor values. If breaks is set to a single number, the resulting factor is created by dividing the range of the variable into that number of equal-length intervals. However, if a vector of values is given to the breaks argument, the values in the vector are used as the breakpoints. The number of levels of the resulting factor will be one less than the number of values in the vector provided to the breaks argument. For example,

attach(mtcars)
cut(mpg, breaks = 3)
factors <- cut(mpg, breaks = c(10, 18, 25, 30, 35) )
table(factors)
Factors in R using Cut Function

You will notice that the default label for factors produced by the cut() function in R contains the actual range of values that were used to divide the variable into factors.
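The default range labels can be replaced through the labels argument of cut(). In the sketch below, the group names "low", "medium", "high", and "very high" are hypothetical labels chosen here for illustration. By default the intervals are closed on the right (right = TRUE), so a value of exactly 18 falls in the first group.

```r
mpg <- mtcars$mpg
brks <- c(10, 18, 25, 30, 35)

# Default labels show the interval ranges
default_groups <- cut(mpg, breaks = brks)
levels(default_groups)

# Custom labels, one per interval (one fewer than the number of breaks)
groups <- cut(mpg, breaks = brks,
              labels = c("low", "medium", "high", "very high"))
table(groups)
```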


https://itfeature.com