Shapiro-Wilk Test in R (2024)

Before performing a statistical test that assumes normality (for example, the one-sample t-test), one should check that assumption. In this article, we discuss the Shapiro-Wilk test in R. The hypotheses are

$H_0$: The data are normally distributed

$H_1$: The data are not normally distributed

Performing Shapiro-Wilk Test in R

To check the normality using the Shapiro-Wilk test in R, we will use a built-in data set of mtcars.

attach(mtcars)
shapiro.test(mpg)

Figure: Shapiro-Wilk Test in R Checking Normality Assumption

The p-value from the Shapiro-Wilk test (approximately 0.12) is greater than the 0.05 level of significance, so we fail to reject the null hypothesis: the $mpg$ variable is consistent with a normal distribution.

  • By looking at the p-value, one can decide whether to reject or fail to reject the null hypothesis of normality:
    • If the p-value is less than the chosen significance level (e.g., 0.05), reject the null hypothesis and conclude that the data are likely not normally distributed.
    • If the p-value is greater than the chosen significance level, one fails to reject the null hypothesis. The data are then consistent with normality, but this does not confirm that they are normal.
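
The decision rule above can be sketched in code. This is a minimal illustration, assuming a significance level of 0.05; the names `result`, `alpha`, and `decision` are chosen here for the example and do not come from the original post.

```r
# Run the test on mpg without relying on attach()
result <- shapiro.test(mtcars$mpg)

alpha <- 0.05  # assumed significance level

# result$statistic holds W and result$p.value holds the p-value
decision <- if (result$p.value < alpha) {
  "reject H0: the data are likely not normal"
} else {
  "fail to reject H0: no evidence against normality"
}

cat("W =", round(result$statistic, 4), "\n")
cat("p-value =", round(result$p.value, 4), "\n")
cat("Decision:", decision, "\n")
```

Storing the test object makes it possible to reuse the p-value in later code instead of reading it off the printed output.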

The normality can be visualized using a QQ plot.

# QQ Plot from Base Package
qqnorm(mpg, pch = 1, frame.plot = FALSE)
qqline(mpg, col="red", lwd = 2)

Figure: QQ Plot from Base Package

From the QQ plot of the base package, it can be seen that a few points deviate from the reference line, which suggests a slight departure of the $mpg$ variable from normality.

# QQ plot from car Package
library(car)
qqPlot(mpg)

Figure: QQ Plot from car Package

From the QQ plot (with confidence interval band), one can observe that the $mpg$ variable is approximately normally distributed.

Note that

  • The Shapiro-Wilk test is generally more powerful than other normality tests such as the Kolmogorov-Smirnov test, particularly for small to moderate samples. R's shapiro.test() accepts between 3 and 5000 observations.
  • It is important to visually inspect the data using a histogram or Q-Q plot to complement the Shapiro-Wilk test results for a more comprehensive assessment of normality.
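
As a complement to the QQ plots, a histogram with a fitted normal curve can be drawn with base graphics. The following is only a sketch: the variable name `mpg_values`, the plot title, and the axis label are illustrative choices, and the curve uses the sample mean and standard deviation as the normal parameters.

```r
mpg_values <- mtcars$mpg

# Density-scaled histogram so that the normal curve is on the same scale
h <- hist(mpg_values, freq = FALSE,
          main = "Histogram of mpg", xlab = "Miles per gallon")

# Overlay a normal density with the sample mean and standard deviation
curve(dnorm(x, mean = mean(mpg_values), sd = sd(mpg_values)),
      add = TRUE, col = "red", lwd = 2)
```

If the bars roughly follow the red curve, the visual impression agrees with a non-significant Shapiro-Wilk result.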

https://itfeature.com

Data Frame in R Language

Introduction to Data Frame in R Language

In the R programming language, a data frame is a two-dimensional data structure made up of rows and columns. Every column must have the same number of rows. The intersection of a row and a column can be considered a cell, and each cell of the data frame is identified by a combination of row number and column number.

A data frame in R Programming Language has:

  • Rows: Represent individual observations or data points.
  • Columns: Represent variables or features being measured. Each column holds values for a single variable across all observations.
  • Data Types: Columns can hold data of different types, including numeric, character, logical (TRUE/FALSE), and factors (categorical variables).
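
A small sketch of these three properties follows. The data frame `students` and its values are made up for illustration; `stringsAsFactors = FALSE` is set explicitly so that the character column behaves the same in older and newer versions of R.

```r
# Hypothetical data: one row per student, one column per variable
students <- data.frame(
  name   = c("Ali", "Sara", "Omar"),    # character
  age    = c(21, 23, 22),               # numeric
  passed = c(TRUE, FALSE, TRUE),        # logical
  grade  = factor(c("A", "C", "B")),    # factor (categorical)
  stringsAsFactors = FALSE
)

dim(students)            # number of rows and columns
sapply(students, class)  # data type stored in each column
str(students)            # compact overview of the structure
```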

One can modify, extract, and re-arrange the contents of a data frame; this process is called data frame manipulation. A data frame can be created with the following general syntax.

Data Frame Syntax in R

The general syntax of a data frame in R Language is

df <- data.frame(first_column = c(data values separated with commas),
                 second_column = c(data values separated with commas),
                 ......
                 )

An example of a data frame in the R Programming language is

df = data.frame(age = c(23, 24, 25, 26, 23, 25, 29, 20),
                marks = c(99, 80, 67, 56, 98, 65, 45, 77),
                grade = c("A", "A", "C", "D", "A", "B", "F", "B")
                )
print(df)

Figure: Data Frame in R Language
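
Individual cells, columns, and rows of this data frame can be reached by position or by name. The snippet below is a short sketch that rebuilds the same `df` so it runs on its own; the names `one_cell`, `one_grade`, and `high_marks` are introduced only for this illustration.

```r
df <- data.frame(age = c(23, 24, 25, 26, 23, 25, 29, 20),
                 marks = c(99, 80, 67, 56, 98, 65, 45, 77),
                 grade = c("A", "A", "C", "D", "A", "B", "F", "B"))

one_cell   <- df[2, "marks"]        # cell in row 2 of the marks column
one_grade  <- df$grade[7]           # 7th value of the grade column
high_marks <- df[df$marks > 90, ]   # rows whose marks exceed 90

print(one_cell)
print(one_grade)
print(high_marks)
```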

One can name or rename the columns and rows of the data frame

# Naming / renaming columns 
colnames(df) <- c("Age", "Score", "Grade")

# Naming / renaming rows
row.names(df) <- c("1st", "2nd", "3rd", "4th", "5th", "6th", "7th", "8th")

Figure: Data Frame in R Language colnames and row names
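
When only one column needs a new name, it can be targeted by its current name instead of retyping the whole vector. This is a sketch on a smaller, made-up data frame; the `student_` prefix for the row names is an arbitrary choice.

```r
df <- data.frame(age = c(23, 24, 25), marks = c(99, 80, 67))

# Rename only the column that is currently called "marks"
names(df)[names(df) == "marks"] <- "Score"

# Generate row names programmatically instead of typing each one
rownames(df) <- paste0("student_", seq_len(nrow(df)))

print(df)
```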

Subsetting a Data Frame

The subset() function returns a new data frame that keeps only the selected rows and columns, so it can be used to drop specified column(s). To understand subsetting a data frame, let us create a data frame first.

# creating a data frame
df = data.frame(row1 = 0:3, row2 = 3:6, row3 = 6:9)

# creating a subset
df <- subset(df, select = c(row1, row2))

Figure: Subsetting a data frame
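
subset() can also filter rows with a logical condition and drop a column with a minus sign in `select`. The sketch below reuses the same three columns; the result name `kept` and the condition `row1 > 1` are chosen only for illustration.

```r
df <- data.frame(row1 = 0:3, row2 = 3:6, row3 = 6:9)

# Keep rows where row1 is greater than 1 and drop the row3 column
kept <- subset(df, row1 > 1, select = -row3)

print(kept)
```

Note that `row1`, `row2`, and `row3` are column names here, despite their spelling.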

Question: Data Frame in R Language

Suppose we have a frequency distribution of sales from a sample of 100 sales receipts.

Price Value     Number of Sales
0 to 20         16
20 to 40        18
40 to 60        14
60 to 80        24
80 to 100       20
100 to 120      8

Calculate the mean, median, variance, standard deviation, and coefficient of variation by using the R code.

Solution

# Create a data frame
df <- data.frame(lower_class = seq(0, 100, by = 20),
                 upper_class = seq(20, 120, by = 20),
                 freq = c(16, 18, 14, 24, 20, 8))

# mid points
m <- (df["lower_class"] + df["upper_class"])/2

mf <- df["freq"] * m
mfsquare <- df["freq"] * m^2


data <- cbind(df, m, mf, mfsquare)
colnames(data) <- c("LL","UL", "freq" , "M", "mf", "mf2")

# Computation
avg = sum(data$mf)/sum(data$freq)
var = (sum(data$mf2) - sum(data$mf)^2 / sum(data$freq))/(sum(data$freq)-1)
sd = sqrt(var)
CV = sd/avg * 100

## Outputs
paste("Mean = ", round(avg, 3))
paste("Variance = ", round(var, 3))
paste("Standard Deviation = ", round(sd, 3))
paste("Coefficient of Variation = ", round(CV, 3))

Figure: Frequency Distribution and Descriptive Statistics
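
The question also asks for the median, which the solution above does not compute. One common approach for grouped data is the interpolation formula $\text{Median} = L + \frac{n/2 - cf}{f} \times h$, where $L$ is the lower limit of the median class, $cf$ is the cumulative frequency before it, $f$ is its frequency, and $h$ is the class width. The sketch below applies this formula; the variable names are illustrative and the formula choice is an assumption, not taken from the original solution.

```r
lower <- seq(0, 100, by = 20)
upper <- seq(20, 120, by = 20)
freq  <- c(16, 18, 14, 24, 20, 8)

n        <- sum(freq)
cum_freq <- cumsum(freq)

# Median class: first class whose cumulative frequency reaches n/2
k <- which(cum_freq >= n / 2)[1]

L         <- lower[k]                 # lower limit of the median class
h         <- upper[k] - lower[k]      # class width
f         <- freq[k]                  # frequency of the median class
cf_before <- if (k > 1) cum_freq[k - 1] else 0

grouped_median <- L + ((n / 2 - cf_before) / f) * h
paste("Median = ", round(grouped_median, 3))
```

For these data the median class is 60 to 80, so the result lies between 60 and 80.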

Using Logical Conditions for Selecting Rows and Columns

To select rows and columns using logical conditions, we consider the iris data set. Suppose we want the rows whose Sepal.Length is higher than the median Sepal.Length and whose Petal.Width >= 1.7. In the code below, each value in the Sepal.Length variable (column) is compared with the median value of Sepal.Length. Similarly, each value of Petal.Width is compared with 1.7, and only the rows that satisfy both conditions are returned.

attach(iris) 

iris[(Sepal.Length > median(Sepal.Length) & Petal.Width >= 1.7), ]
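
The same selection can be written without attach(), which avoids name clashes with other objects in the workspace. This is a sketch of two equivalent forms; the names `med_sl`, `sel_index`, and `sel_subset` are introduced only for this example.

```r
med_sl <- median(iris$Sepal.Length)

# Bracket indexing with explicit column references
sel_index <- iris[iris$Sepal.Length > med_sl & iris$Petal.Width >= 1.7, ]

# The same filter expressed with subset()
sel_subset <- subset(iris, Sepal.Length > median(Sepal.Length) &
                             Petal.Width >= 1.7)

# Both approaches should select the same rows
all.equal(sel_index, sel_subset)
head(sel_index)
```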

One can select columns by type (for example, only the numeric or only the factor columns) or rows by the value of a variable, as in the code below:

# Selecting Numeric Columns only
iris[ , sapply(iris, is.numeric)]

# Selecting factor columns only
iris[, sapply(iris, is.factor)]

# Selecting only certain Species
iris[Species == "virginica", ]

Omitting Missing Observations in a Data Frame

# Omit rows with missing data
na.omit(iris)

# Flag missing values in every cell (TRUE where a value is NA)
apply(iris, 2, is.na)

# Keep only the rows that have no missing values
iris[complete.cases(iris), ]
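
Because the iris data set contains no missing values, the effect of these functions is easier to see on a small data frame that does. The data frame `scores` below is made up purely for illustration.

```r
# Hypothetical data with two missing values
scores <- data.frame(id   = 1:4,
                     math = c(90, NA, 75, 60),
                     art  = c(NA, 80, 85, 70))

na_per_column <- colSums(is.na(scores))   # count of NAs in each column
row_is_full   <- complete.cases(scores)   # TRUE for rows without any NA
complete_rows <- na.omit(scores)          # drops rows 1 and 2

print(na_per_column)
print(row_is_full)
print(complete_rows)
```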


https://gmstat.com

Important MCQs R Package Development 13

This post is a quiz of MCQs on R package development. The quiz also contains questions about Git. There are a total of 17 questions, and some of the questions have multiple correct answers. Let us start with the MCQs on R Package Development.

Online MCQs about R Package Development

1. How is attaching a package namespace different from loading a namespace?

2. In which sub-directory of an R package should tests be placed?

3. What does the ::: operator do?

4. Which of the following functions from the `devtools` package are you likely to use often, rather than just once per package, when building a package?

5. What is Git?

6. What is the purpose of the DESCRIPTION file in a package?

7. Which of the following files and folders are required in an R package?

8. What is a pull request on GitHub?

9. When a test fails in a call to expect_that(), what happens?

10. For packages that require C code, what should be installed on your system?

11. What is the purpose of the Imports field in the DESCRIPTION file?

12. Which of the following files and subdirectories will be included in the initial package directory if you create a new package using the ‘create’ function from ‘devtools’?

13. What does the is_a() function do in the context of testthat?

14. Which of the following are good reasons to build an R Package?

15. Which of the following are good reasons for open-sourcing your software?

16. Which of the following statements correctly describes how R functions should be defined with the package directory?

17. The GNU General Public License is called a copyleft license because
