Reading Text Files In R Language: A Quick Guide

Data already saved in a file, such as a text (*.txt) file or a file created in MS Excel, SPSS, or other software, can be imported into R. Before importing (reading) such data, keep the following points in mind:

  1. Spreadsheets usually reserve the first row as a header (names of variables), while the first column is used to identify the sampling unit (observation number).
  2. Avoid names and field values that contain blank spaces; each word may be interpreted as a separate variable, resulting in errors.
  3. To concatenate words, use a full stop (.) instead of a space between words.
  4. Name variables with short or abbreviated names.
  5. Avoid variable names that contain symbols such as ?, $, %, ^, *, (, ), -, #, <, >, /, |, \, [, ], {, and }.
  6. Delete comments you have made in your Excel file.
  7. Make sure missing values in your dataset are indicated with NA.
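
As a small illustration of points 2 to 5, base R's make.names() converts arbitrary labels into syntactically valid variable names. The labels below are made up for the example; read.table() applies the same kind of repair when its check.names argument is TRUE (the default).

```r
# Hypothetical column labels of the kind that cause trouble on import
raw_names <- c("blood pressure", "income$", "weight (kg)", "2nd score")

# make.names() replaces invalid characters with "." and prefixes "X"
# to names that would otherwise start with a digit
clean_names <- make.names(raw_names)
print(clean_names)
```

Renaming the columns in the source file is still preferable, since automatically repaired names can be hard to read.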

Preparing R workspace

Before importing data into R, it is often helpful to clear the workspace by deleting all objects with the following line of code:

rm(list = ls() )

The rm( ) function removes objects from a specified environment. Because ls( ) is called without arguments, it lists every object in the current workspace, so all datasets and user-defined functions are deleted.
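
The behaviour can be tried safely in a separate environment rather than in the real workspace. This is a minimal sketch; the environment name scratch and the objects in it are made up for the illustration.

```r
# A scratch environment stands in for the workspace in this sketch
scratch <- new.env()
assign("x", 1:5, envir = scratch)
assign("f", function(z) z^2, envir = scratch)

ls(scratch)                               # lists the two objects
rm(list = ls(scratch), envir = scratch)   # remove everything ls() reported
length(ls(scratch))                       # no visible objects remain
```

Note that ls( ) does not report objects whose names start with a dot unless all.names = TRUE is given, so such hidden objects survive this kind of cleanup.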

Confirm your working directory before importing a file to R, using

getwd()

If needed, change the working directory, for example:

setwd("D:\\Stat\\STA-654")

Note that the directory (folder) in the path above must already exist; create it first if necessary.
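
The folder can also be created from within R. The sketch below builds a hypothetical Stat/STA-654 folder under tempdir() so that it is safe to run anywhere; replace project_dir with your own path.

```r
# Hypothetical project folder; tempdir() keeps the sketch harmless
project_dir <- file.path(tempdir(), "Stat", "STA-654")

if (!dir.exists(project_dir)) {
  dir.create(project_dir, recursive = TRUE)  # also creates parent folders
}

old_dir <- setwd(project_dir)  # setwd() returns the previous directory
getwd()
setwd(old_dir)                 # restore the original working directory
```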

Reading Text Files in R

Reading text files in R is straightforward. If you have data in a *.txt or tab-delimited text file, you can import it with the read.table( ) function. Suppose we have a data file named "Hald.txt" stored at the path "D:\STAT\STA-654\Hald.txt". In R code, write the path with forward slashes (or doubled backslashes). The following line of code reads the file:

datafile <- read.table ("D:/stat/sta-654/Hald.txt", header = TRUE)

If the data are stored at a web address, you can import them in the same way:

datafile <- read.table ("http://itfeature.com/wp-content/uploads/2020/03/Hald.txt", header = TRUE)

Note that the first argument of read.table() provides the name (with extension) of the file that you want to import into R. The header argument specifies whether the first line of the data file contains the column names. The Hald.txt file will be imported as a data.frame object.
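
For readers without the Hald.txt file, the same call can be tried on a small file written to a temporary location. The file contents below are made-up values, and sep = "\t" is added here only to state the tab delimiter explicitly.

```r
# Write a tiny tab-delimited file to a temporary location (made-up values)
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tscore\tgroup",
             "1\t7.5\tA",
             "2\tNA\tB",
             "3\t9.1\tA"), tmp)

# header = TRUE takes the first line as variable names;
# the string NA in the file is read as a missing value
demo <- read.table(tmp, header = TRUE, sep = "\t")
str(demo)
class(demo)
sum(is.na(demo$score))
```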

Load Data from R Library (2020)

Here we discuss how to read data from an R library. Many R packages contain datasets and can therefore serve as data libraries. For example, the car package provides the Duncan dataset, which can be used for learning and practicing different R functions. To use the Duncan data, the car package must be installed and loaded first. Let us read data from the R library and make use of the Duncan dataset.

Getting Data from R Library

To read or load data stored in an R library, load the library first.

library(car)
data(Duncan)
attach(Duncan)

If the car package is not installed on your system, install it using the following command. Note that your system should be connected to the internet.

install.packages("car")

Reading Data from R Library

The attach( ) function makes each variable accessible without prefixing it with the dataset name. After attaching the Duncan dataset, one can access a variable, say education, directly instead of writing Duncan$education. Let us use some functions to read data from the R library.
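
A common alternative to attach( ) is with( ), which evaluates an expression inside a data frame without changing the search path. The sketch below uses the built-in mtcars dataset instead of Duncan so that it runs without the car package.

```r
# mtcars ships with base R, so no extra package is needed for this sketch
m1 <- mean(mtcars$mpg)          # explicit dataset$variable reference
m2 <- with(mtcars, mean(mpg))   # same value, nothing added to the search path
c(m1, m2)
```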

head(Duncan)

The head( ) function will display the top six observations with their variable names in table-type format. It will help to understand the structure of the dataset.

summary(Duncan)

For quantitative variables, the summary( ) function will provide five-number summary statistics with the mean value. For qualitative variables, the summary( ) function will provide the frequency of each group.
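
The two kinds of output can be compared on single columns. This sketch uses the built-in iris dataset (an assumption made so the example runs without the car package), which has both numeric columns and a factor.

```r
num_summary <- summary(iris$Sepal.Length)  # quantitative variable
cat_summary <- summary(iris$Species)       # qualitative variable (factor)

print(num_summary)    # minimum, quartiles, median, mean, maximum
print(cat_summary)    # number of observations in each group
names(num_summary)
```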

To plot a scatter plot one can use the plot function. For example,

plot(education, income)

Figure: Scatter plot of Education and Income

The scatter plot shows the strength and direction of the relationship between education ("Percentage of occupational incumbents in 1950 who were high school graduates") and income ("Percentage of occupational incumbents in the 1950 US Census who earned $3,500 or more per year").

Getting Basic Data Information

To check how many observations (rows) and columns are in a dataset, one can make use of the nrow( ) and ncol( ) functions. For example,

nrow(Duncan)
ncol(Duncan)
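
Both numbers can also be obtained in one call with dim( ), and str( ) adds the type of each column. The sketch uses the built-in mtcars dataset so that it runs without the car package.

```r
dims <- dim(mtcars)   # c(number of rows, number of columns)
dims
str(mtcars)           # one line per column: name, type, first values
```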

To get the description of a dataset and its variables, one can read the dataset documentation:

?Duncan

To see the list of datasets available in the currently loaded packages, call the data( ) function:

data( )

It is best practice to attach only one dataset at a time when reading data from the R library or importing from a data file. To remove a data frame from the search path, use the detach( ) function.
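
The effect of attach( ) and detach( ) on the search path can be inspected with search( ). The sketch below uses the built-in mtcars dataset as a stand-in for Duncan.

```r
attach(mtcars)
"mtcars" %in% search()   # TRUE while the data frame is attached
mean(mpg)                # the column is reachable by its bare name
detach(mtcars)
"mtcars" %in% search()   # FALSE once it has been detached
```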

Exercise for Data from R Library

Try the following datasets and make use of all the functions discussed in this lecture.

mtcars
iris
ToothGrowth
PlantGrowth
USArrests
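
One possible way to start the exercise is to loop over the dataset names and apply a few of the functions to each. This is only a sketch: it assumes the correctly spelled built-in datasets ToothGrowth and USArrests, and the vector name dataset_names is made up.

```r
dataset_names <- c("mtcars", "iris", "ToothGrowth", "PlantGrowth", "USArrests")

for (nm in dataset_names) {
  d <- get(nm)   # fetch the data frame by its name
  cat(nm, ":", nrow(d), "rows,", ncol(d), "columns\n")
  print(head(d, 3))
}
```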

Mean Comparison Tests: Hypothesis Testing (One Sample and Two Sample)

Here we learn some basics about how to perform Mean Comparison Tests: hypothesis testing for one sample test, two-sample independent test, and dependent sample test. We will also learn how to find the p-values for a certain distribution such as t-distribution, and critical region values. We will also see how to perform one-tailed and two-tailed hypothesis tests.

How to Perform One-Sample t-Test in R

A recent article in The Wall Street Journal reported that the 30-year mortgage rate is now less than 6%. A sample of eight small banks in the Midwest revealed the following 30-year rates (in percent)

4.8   5.3   6.5   4.8   6.1   5.8   6.2   5.6

At the 0.01 significance level (probability of type-I error), can we conclude that the 30-year mortgage rate for small banks is less than 6%?
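
In symbols, the question corresponds to a left-tailed test. A sketch of the hypotheses and the test statistic follows, where $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n = 8$ the sample size, and $\mu_0 = 6$ the hypothesized rate (the notation is introduced here for illustration):

$$H_0: \mu \ge 6 \quad \text{versus} \quad H_1: \mu < 6, \qquad t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}, \qquad df = n - 1 = 7$$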

Manual Calculations for One-Sample t-Test and Confidence Interval

A one-sample mean comparison test can be performed manually.

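# Manual way
X <- c(4.8, 5.3, 6.5, 4.8, 6.1, 5.8, 6.2, 5.6)
xbar <- mean(X)   # sample mean
s <- sd(X)        # sample standard deviation
mu <- 6           # hypothesized mean
n <- length(X)
df <- n - 1
tcal <- (xbar - mu) / (s / sqrt(n))   # test statistic
tcal
# 99% two-sided confidence interval for the mean
c(xbar - qt(0.995, df = df) * s / sqrt(n), xbar + qt(0.995, df = df) * s / sqrt(n))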

Critical Values from t-Table

# Critical Value for Left Tail
qt(0.01, df = df, lower.tail = TRUE)
# Critical Value for Right Tail
qt(0.99, df = df, lower.tail = TRUE)
# Critical Value for Both Tails (use -/+ this value)
qt(0.995, df = df)
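
The decision for the left-tailed mortgage-rate test can be sketched by comparing the calculated statistic with the left-tail critical value. The names tcrit and reject_h0 are introduced here for illustration only.

```r
X <- c(4.8, 5.3, 6.5, 4.8, 6.1, 5.8, 6.2, 5.6)
n <- length(X)
tcal <- (mean(X) - 6) / (sd(X) / sqrt(n))
tcrit <- qt(0.01, df = n - 1)   # left-tail critical value at alpha = 0.01

# Reject H0 only if the statistic falls below the critical value
reject_h0 <- tcal < tcrit
c(tcal = tcal, tcrit = tcrit)
reject_h0
```

For these data the statistic does not fall in the rejection region, so the sample does not support the claim that the rate is below 6% at the 0.01 level.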

Finding p-Values

# p-value (alternative is less)
pt(tcal, df = df)
# p-value (alternative is greater)
1 - pt(tcal, df = df)
# p-value (alternative two tailed or not equal to)
# -abs() keeps the result valid whether tcal is negative or positive
2 * pt(-abs(tcal), df = df)

Performing One-Sample Confidence Interval and t-test Using Built-in Function

One can perform a one-sample mean comparison test using the built-in functions available in the R Language.

# Left Tail test
t.test(x = X, mu = 6, alternative = c("less"), conf.level = 0.99)
# Right Tail test
t.test(x = X, mu = 6, alternative = c("greater"), conf.level = 0.99)
# Two Tail test
t.test(x = X, mu = 6, alternative = c("two.sided"), conf.level = 0.99)
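
The object returned by t.test( ) is a list, so its components can be extracted and checked against the manual calculation. A minimal sketch for the left-tailed case (the names res and pval are made up here):

```r
X <- c(4.8, 5.3, 6.5, 4.8, 6.1, 5.8, 6.2, 5.6)
res <- t.test(X, mu = 6, alternative = "less", conf.level = 0.99)

# Manual statistic and left-tail p-value for comparison
n <- length(X)
tcal <- (mean(X) - 6) / (sd(X) / sqrt(n))
pval <- pt(tcal, df = n - 1)

all.equal(unname(res$statistic), tcal)   # manual and built-in statistics agree
all.equal(res$p.value, pval)             # and so do the p-values
res$conf.int                             # one-sided 99% confidence interval
```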

How to Perform Two-Sample t-Test in R

Suppose we have two samples stored in two vectors $X$ and $Y$, as shown in the R code. We are interested in a mean comparison test between two groups of people regarding (say) their wages in a certain week.

X = c(70, 82, 78, 70, 74, 82, 90)
Y = c(60, 80, 91, 89, 77, 69, 88, 82)

Manual Calculations for Two-Sample t-Test and Confidence Interval

The manual calculation for the two-sample t-test (assuming equal variances) is as follows.

nx = length(X)
ny = length(Y)
xbar = mean(X)
sx = sd(X)
ybar = mean(Y)
sy = sd(Y)
df = nx + ny - 2
# Pooled Standard Deviation/ Variance 
SP = sqrt( ( (nx-1) * sx^2 + (ny-1) * sy^2) / df )
tcal = (( xbar - ybar ) - 0) / (SP *sqrt(1/nx + 1/ny))
tcal
# Confidence Interval
LL <- (xbar - ybar) - qt(0.975, df)* sqrt((SP^2 *(1/nx + 1/ny) ))
UL <- (xbar - ybar) + qt(0.975, df)* sqrt((SP^2 *(1/nx + 1/ny) ))
c(LL, UL)
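
Written as formulas, the code above computes the pooled standard deviation, the test statistic, and the 95% confidence limits. The symbols mirror the R variable names, with $s_p$ standing for SP:

$$s_p = \sqrt{\frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2}}, \qquad t = \frac{(\bar{x} - \bar{y}) - 0}{s_p\sqrt{\frac{1}{n_x} + \frac{1}{n_y}}}$$

$$(\bar{x} - \bar{y}) \pm t_{0.975,\, df}\; s_p\sqrt{\frac{1}{n_x} + \frac{1}{n_y}}, \qquad df = n_x + n_y - 2$$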

Finding p-values

# The p-value for a left-tailed test
pt(tcal, df)
# The p-value for a two-tailed test (valid for negative or positive tcal)
2 * pt(-abs(tcal), df)
# The p-value for a right-tailed test
1 - pt(tcal, df)

Finding Critical Values from t-Table

# Left Tail (alpha = 0.05)
qt(0.05, df = df, lower.tail = TRUE)
# Right Tail (alpha = 0.05)
qt(0.95, df = df, lower.tail = TRUE)
# Both tails (alpha = 0.05): critical values are -/+ this quantile
qt(0.975, df = df)

Performing Two-Sample Confidence Interval and T-test using Built-in Function

One can perform a two-sample mean comparison test using the built-in functions in the R Language.

# Left Tail test
t.test(X, Y, alternative = c("less"), var.equal = T)
# Right Tail test
t.test(X, Y, alternative = c("greater"), var.equal = T)
# Two Tail test
t.test(X, Y, alternative = c("two.sided"), var.equal = T)
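
As a check, the manual pooled-variance calculation can be compared with the components of the t.test( ) result. A minimal sketch for the two-tailed case (the name res is made up here):

```r
X <- c(70, 82, 78, 70, 74, 82, 90)
Y <- c(60, 80, 91, 89, 77, 69, 88, 82)
nx <- length(X)
ny <- length(Y)
df <- nx + ny - 2
SP <- sqrt(((nx - 1) * var(X) + (ny - 1) * var(Y)) / df)   # pooled SD
tcal <- (mean(X) - mean(Y)) / (SP * sqrt(1 / nx + 1 / ny))

res <- t.test(X, Y, alternative = "two.sided", var.equal = TRUE)
all.equal(unname(res$statistic), tcal)           # same test statistic
all.equal(unname(res$parameter), df)             # same degrees of freedom
all.equal(res$p.value, 2 * pt(-abs(tcal), df))   # same two-tailed p-value
```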

Note that if the $X$ and $Y$ variables are stored in a data frame, the two-sample t-test is performed using the formula symbol (~). Let's first make the data frame from vectors $X$ and $Y$.

data <- data.frame(values = c(X, Y), group = c(rep("A", nx), rep("B", ny)))
t.test(values ~ group, data = data, alternative = "less", var.equal = T)
t.test(values ~ group, data = data, alternative = "greater", var.equal = T)
t.test(values ~ group, data = data, alternative = "two.sided", var.equal = T)
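
The vector and formula interfaces should give the same result as long as the group order matches. The sketch below checks this; the data frame name wages is made up here to avoid masking the data( ) function.

```r
X <- c(70, 82, 78, 70, 74, 82, 90)
Y <- c(60, 80, 91, 89, 77, 69, 88, 82)
wages <- data.frame(values = c(X, Y),
                    group  = rep(c("A", "B"), times = c(length(X), length(Y))))

res_vec  <- t.test(X, Y, var.equal = TRUE)
res_form <- t.test(values ~ group, data = wages, var.equal = TRUE)

# Group levels are sorted (A, then B), so both calls test mean(A) - mean(B)
all.equal(unname(res_vec$statistic), unname(res_form$statistic))
all.equal(res_vec$p.value, res_form$p.value)
```

Because the direction of a one-tailed alternative depends on this ordering, it is worth checking the group labels before using "less" or "greater" with the formula interface.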

To understand probability distribution functions in R, see: Probability Distributions in R
