Best Statistical Inference Quiz in R 14

The article contains a Statistical Inference quiz in the R language with answers. There are 16 questions in the Statistical Inference Quiz in R; the MCQs cover probability and regression models. Let us start with the Statistical Inference Quiz in R.

Statistical Inference Quiz in R Language

1. Consider the mtcars data set. Fit a model with mpg as the outcome that includes numbers of cylinders as a factor variable and weight as a possible confounding variable. Compare the effect of 8 versus 4 cylinders on mpg for the adjusted and unadjusted by-weight models. Here, adjusted means including the weight variable as a term in the regression model and unadjusted means the model without weight included. What can be said about the effect comparing 8 and 4 cylinders after looking at models with and without weight included?
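
A minimal sketch of how one might compare the two fits (the model names are illustrative; the coefficient name assumes factor(cyl) uses 4 cylinders as the reference level, which is the default here):

# unadjusted and weight-adjusted models for the 8- vs 4-cylinder comparison
fit_unadj <- lm(mpg ~ factor(cyl), data = mtcars)
fit_adj   <- lm(mpg ~ factor(cyl) + wt, data = mtcars)
coef(fit_unadj)["factor(cyl)8"]   # 8 vs 4 cylinders, ignoring weight
coef(fit_adj)["factor(cyl)8"]     # 8 vs 4 cylinders, holding weight constant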

 
 
 
 

2. Consider the mtcars data set. Fit a model with mpg as the outcome that includes the number of cylinders as a factor variable and weight as a confounder. Now fit a second model with mpg as the outcome that includes the interaction between the number of cylinders (as a factor variable) and weight. Give the P-value for the likelihood ratio test comparing the two models and suggest a model using 0.05 as a type I error rate significance benchmark.
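
One possible way to carry out the comparison; anova() gives the nested-model F-test, while lmtest::lrtest() (if the lmtest package is available) gives the likelihood ratio test directly:

fit_main <- lm(mpg ~ factor(cyl) + wt, data = mtcars)
fit_int  <- lm(mpg ~ factor(cyl) * wt, data = mtcars)
anova(fit_main, fit_int)   # a p-value above 0.05 favours the simpler (no-interaction) model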

 
 
 
 

3. Consider the following data set

x <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)

Give the slope dfbeta for the point with the highest hat value.

fit5 <- lm(y ~ x)   # fit the simple linear regression first; influence.measures() needs the fitted model
influence.measures(fit5)$infmat[which.max(abs(influence.measures(fit5)$infmat[, 2])), 2]

 
 
 
 

4. Suppose that diastolic blood pressures (DBPs) for men aged 35-44 are normally distributed with a mean of 80 (mm Hg) and a standard deviation of 10. About what is the probability that a random 35-44-year-old has a DBP less than 70?
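
This can be computed with the normal CDF in R:

pnorm(70, mean = 80, sd = 10)   # roughly 0.16, i.e., about a 16% chance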

 
 
 
 

5. Consider the mtcars data set. Fit a model with mpg as the outcome that includes the number of cylinders as a factor variable and weight included in the model as

lm(mpg ~ I(wt * 0.5) + factor(cyl), data = mtcars)

How is the wt coefficient interpreted?
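
A quick sketch that fits the rescaled and the original weight terms side by side (the coefficient name I(wt * 0.5) is what lm() produces for this formula; since wt is measured in 1,000 lb, one unit of wt/2 corresponds to 2,000 lb):

fit_half <- lm(mpg ~ I(wt * 0.5) + factor(cyl), data = mtcars)
fit_wt   <- lm(mpg ~ wt + factor(cyl), data = mtcars)
coef(fit_half)["I(wt * 0.5)"]   # change in mpg per 2,000 lb increase in weight
coef(fit_wt)["wt"]              # change in mpg per 1,000 lb increase, for comparison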

 
 
 
 

6. Brain volume for adult women is normally distributed with a mean of about 1,100 cc and a standard deviation of 75 cc. What brain volume represents the 95th percentile?
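
The quantile function gives the required percentile directly:

qnorm(0.95, mean = 1100, sd = 75)   # about 1223 cc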

 
 
 
 

7. The respiratory disturbance index (RDI), a measure of sleep disturbance, for a specific population has a mean of 15 (sleep events per hour) and a standard deviation of 10. They are not normally distributed. Give your best estimate of the probability that a sample average RDI of 100 people is between 14 and 16 events per hour.
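
By the central limit theorem the sample mean of 100 observations is approximately normal with standard error 10 / sqrt(100) = 1, so one way to estimate the probability is:

pnorm(16, mean = 15, sd = 1) - pnorm(14, mean = 15, sd = 1)   # roughly 0.68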

 
 
 
 

8. Consider the following data set
x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05)
Fit the regression through the origin and get the slope treating $y$ as the outcome and $x$ as the regressor.

(Hint, do not center the data since we want regression through the origin, not through the means of the data.)
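
One way to fit regression through the origin in R is to drop the intercept with - 1 in the formula:

x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05)
coef(lm(y ~ x - 1))   # slope through the origin, equal to sum(x * y) / sum(x^2)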

 
 
 
 

9. Consider the following data set
x <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)
Give the hat diagonal for the most influential point
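
The hat diagonals can be obtained with hatvalues(); a minimal sketch:

x <- c(0.586, 0.166, -0.042, -0.614, 11.72)
y <- c(0.549, -0.026, -0.127, -0.751, 1.344)
fit <- lm(y ~ x)
max(hatvalues(fit))   # hat diagonal of the highest-leverage (most influential) point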

 
 
 
 

10. Consider the following PMF shown below in R
x <- 1:4
p <- x/sum(x)
temp <- rbind(x, p)
rownames(temp) <- c("X", "Prob")
temp
What is the mean?
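
The mean of a discrete PMF is sum(x * p); in R:

x <- 1:4
p <- x / sum(x)
sum(x * p)   # E(X) = 3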

 
 
 
 

11. Consider a standard uniform density. The mean for this density is 0.5 and the variance is 1 / 12. If you sample 1,000 observations from this distribution and take the sample mean, what value would you expect it to be near?
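
A quick simulation illustrates the law of large numbers here:

set.seed(1)          # for reproducibility
mean(runif(1000))    # should be close to the population mean, 0.5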

 
 
 
 

12. Consider the following data set. What is the intercept for fitting the model with $x$ as the predictor and $y$ as the outcome?
x <- c(0.8, 0.47, 0.51, 0.73, 0.36, 0.58, 0.57, 0.85, 0.44, 0.42)
y <- c(1.39, 0.72, 1.55, 0.48, 1.19, -1.59, 1.23, -0.65, 1.49, 0.05)
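
Using the x and y vectors above, the intercept comes straight from lm():

coef(lm(y ~ x))["(Intercept)"]   # intercept of the least-squares line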

 
 
 
 

13. The number of people showing up at a bus stop is assumed to be Poisson with a mean of 5 people per hour. You watch the bus stop for 3 hours. About what’s the probability of viewing 10 or fewer people?
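
Over 3 hours the Poisson mean is 5 * 3 = 15 events, so the lower-tail probability is:

ppois(10, lambda = 15)   # roughly 0.12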

 
 
 
 

14. Load the mtcars data with data(mtcars) from the datasets package and fit the regression model with mpg as the outcome and weight as the predictor. Give the slope coefficient.
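
A minimal sketch of the fit:

data(mtcars)
coef(lm(mpg ~ wt, data = mtcars))["wt"]   # slope: change in mpg per 1,000 lb of weight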

 
 
 
 

15. You flip a fair coin 5 times, about what’s the probability of getting 4 or 5 heads?
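
With a fair coin this is a binomial upper-tail probability:

pbinom(3, size = 5, prob = 0.5, lower.tail = FALSE)   # P(4 or 5 heads), about 0.19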

 
 
 
 

16. Consider the mtcars data set. Fit a model with mpg as the outcome that includes the number of cylinders as a factor variable and weight as a confounder. Give the adjusted estimate for the expected change in mpg comparing 8 cylinders to 4.
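
A sketch of the adjusted model; the coefficient name assumes 4 cylinders is the reference level of factor(cyl):

fit <- lm(mpg ~ factor(cyl) + wt, data = mtcars)
coef(fit)["factor(cyl)8"]   # expected change in mpg, 8 vs 4 cylinders, holding weight fixed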

 
 
 
 



Shapiro-Wilk Test in R (2024)

One should check the assumption of normality before performing a statistical test that requires it. In this article, we will discuss the Shapiro-Wilk test in R. The hypotheses are

$H_0$: The data are normally distributed

$H_1$: The data are not normally distributed

Performing Shapiro-Wilk Test in R

To check the normality using the Shapiro-Wilk test in R, we will use a built-in data set of mtcars.

attach(mtcars)
shapiro.test(mpg)
Shapiro-Wilk Test in R Checking Normality Assumption

The results indicate that the $mpg$ variable can be treated as normally distributed, since the p-value from the Shapiro-Wilk test is greater than the 0.05 level of significance.

  • By looking at the p-value, one can decide whether or not to reject the null hypothesis of normality (see the sketch after this list):
    • If the p-value is less than the chosen significance level (e.g., 0.05), reject the null hypothesis and conclude that the data are likely not normally distributed.
    • If the p-value is greater than the chosen significance level, one fails to reject the null hypothesis, suggesting the data might be normal (but this does not necessarily confirm normality).
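
A minimal sketch of this decision rule in code, using the 0.05 level assumed above:

res <- shapiro.test(mtcars$mpg)
res$p.value          # the p-value reported by the test
res$p.value < 0.05   # if FALSE, fail to reject normality at the 5% level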

The normality can be visualized using a QQ plot.

# QQ Plot from Base Package
qqnorm(mpg, pch = 1, frame.plot = FALSE)
qqline(mpg, col="red", lwd = 2)

QQ Plot from Base Package

From the base-package QQ plot, it can be seen that a few points deviate from the reference line, which may pull the $mpg$ variable away from exact normality.

# QQ plot from car Package
library(car)
qqPlot(mpg)
QQ Plot from car Package

From the QQ plot (with confidence interval band), one can observe that the $mpg$ variable is approximately normally distributed.

Note that

  • The Shapiro-Wilk test is generally more powerful than other normality tests like the Kolmogorov-Smirnov test for smaller sample sizes (typically less than 5000).
  • It is important to visually inspect the data using a histogram or Q-Q plot to complement the Shapiro-Wilk test results for a more comprehensive assessment of normality.


Data Frame in R Language

Introduction to Data Frame in R Language

In the R programming language, a data frame is a two-dimensional data structure. Data frame objects contain rows and columns, and every column must have the same length (the same number of rows). The cross-section of a row and a column can be considered a cell, and each cell of the data frame is associated with a combination of a row number and a column number.

A data frame in the R Programming Language has:

  • Rows: Represent individual observations or data points.
  • Columns: Represent variables or features being measured. Each column holds values for a single variable across all observations.
  • Data Types: Columns can hold data of different types, including numeric, character, logical (TRUE/FALSE), and factors (categorical variables).

One can modify, extract, and rearrange the data contents of a data frame; this process is called manipulation of the data frame. To create a data frame, the general syntax below can be followed.

Data Frame Syntax in R

The general syntax of a data frame in R Language is

df <- data.frame(first_column  = c(data values separated by commas),
                 second_column = c(data values separated by commas),
                 ...
                 )

An exemplary data frame in the R Programming language is

df = data.frame(age = c(23, 24, 25, 26, 23, 25, 29, 20),
                marks = c(99, 80, 67, 56, 98, 65, 45, 77),
                grade = c("A", "A", "C", "D", "A", "B", "F", "B")
                )
print(df)
Data Frame in R Language

One can name or rename the columns and rows of the data frame

# Naming / renaming columns 
colnames(df) <- c("Age", "Score", "Grad")

# Naming / renaming rows
row.names(df) <- c("1st", "2nd", "3rd", "4th", "5th", "6th", "7th", "8th")
Data Frame in R Language colnames and row names

Subsetting a Data Frame

The subset() function can be used to create a new data set that keeps only the specified column(s); the result contains the included columns, and the excluded columns are left out. To understand subsetting a data frame, let us create a data frame first.

# creating a data frame
df = data.frame(row1 = 0:3, row2 = 3:6, row3 = 6:9)

# creating a subset
df <- subset(df, select = c(row1, row2))
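
Columns can also be dropped rather than kept by putting a minus sign in select; a small sketch using the df created above:

# drop a column instead of selecting columns
subset(df, select = -row2)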
subsetting a data frame

Question: Data Frame in R Language

Suppose we have a frequency distribution of sales from a sample of 100 sales receipts.

Price Value     Number of Sales
0 to 20         16
20 to 40        18
40 to 60        14
60 to 80        24
80 to 100       20
100 to 120       8

Calculate the mean, median, variance, standard deviation, and coefficient of variation by using the R code.

Solution

# Create a data frame

df <- data.frame(lower_class = seq(0, 100, by = 20), upper_class=seq(20, 120, by=20), freq = c(16, 18, 14, 24, 20, 8))

# mid points
m <- (df["lower_class"] + df["upper_class"])/2

mf <- df["freq"] * m
mfsquare <- df["freq"] * m^2


data <- cbind(df, m, mf, mfsquare)
colnames(data) <- c("LL","UL", "freq" , "M", "mf", "mf2")

# Computation
avg = sum(data$mf)/sum(data$freq)
var = (sum(data$mf2) - sum(data$mf)^2 / sum(data$freq))/(sum(data$freq)-1)
sd = sqrt(var)
CV = sd/avg * 100

## Outputs
paste("Mean = ", round(avg, 3))
paste("Variance = ", round(var, 3))
paste("Standard Deviation = ", round(sd, 3))
paste("Coefficient of Variation = ", round(CV, 3))
Frequency Distribution and Descriptive Statistics

Using Logical Conditions for Selecting Rows and Columns

For selecting rows and columns using logical conditions, we consider the iris data set. Suppose we are interested in selecting the rows whose Sepal.Length is greater than the median Sepal.Length and whose Petal.Width >= 1.7. In the code below, each value in the Sepal.Length column is compared with the median of Sepal.Length, and similarly each value of Petal.Width is compared with 1.7, to extract the required rows.

attach(iris) 

iris[(Sepal.Length > median(Sepal.Length) & Petal.Width >= 1.7), ]

One can select only the numeric columns from the data frame by following the code below

# Selecting Numeric Columns only
iris[ , sapply(iris, is.numeric)]

# Selecting factor columns only
iris[, sapply(iris, is.factor)]

# Selecting only certain Species
 iris[Species == "virginica", ]

Omitting Missing Observations in a Data Frame

# Omit rows with missing data
na.omit(iris)

# check each value for missing data (applied column by column)
apply(iris, 2, is.na)

# keep only the rows with no missing values
iris[complete.cases(iris), ]
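
To count missing values per column rather than flag each cell, one could also do:

# number of missing values in each column
colSums(is.na(iris))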
