Data Analysis - R Programming FAQs

Descriptive Summary in R

September 4, 2024 | July 22, 2024 by Muhammad Imdad Ullah

Introduction to Descriptive Summary in R

Statistics is the study of data: describing the properties of data (descriptive statistics) and drawing conclusions about a population based on information in a sample (inferential statistics). In this article, we will discuss how to compute a descriptive summary in R (descriptive statistics in R programming).

Example: Twenty elementary school children were asked whether they live with both parents (B), father only (F), mother only (M), or someone else (S), and how many brothers they have. Their responses are as follows:

Case	Sex	No. of Brothers	Case	Sex	No. of Brothers
M	Female	3	B	Male	2
B	Female	2	F	Male	1
B	Female	3	B	Male	0
M	Female	4	M	Male	0
F	Male	3	M	Male	3
S	Male	1	B	Female	4
B	Male	2	B	Female	3
M	Male	2	F	Male	2
F	Female	4	B	Female	1
B	Female	3	M	Female	2

Suppose the following computations are required as part of the descriptive summary in R:

Construct a frequency distribution table in R for the living arrangement (case) of the children.
Draw bar and pie charts of the frequency distribution for each category using R code.

Creating the Frequency Table in R

# Enter the data as a character vector
x <- c("M", "B", "B", "M", "F", "S", "B", "M", "F", "B", "B", "F", "B", "M", "M", "B", "B", "F", "B", "M")

# Create the frequency table using the table() function
tabx <- table(x)
tabx

# Output
x
B F M S 
9 4 6 1
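
The counts above can also be expressed as relative frequencies, which a frequency distribution table usually reports next to the raw counts. The following is a minimal sketch using base R's prop.table(); the object names rel_freq and freq_table are illustrative and not part of the original example.

```r
# Living arrangement of the twenty children (same data as above)
x <- c("M", "B", "B", "M", "F", "S", "B", "M", "F", "B",
       "B", "F", "B", "M", "M", "B", "B", "F", "B", "M")

tabx <- table(x)               # absolute frequencies
rel_freq <- prop.table(tabx)   # relative frequencies (they sum to 1)

# Combine counts, proportions, and percentages into one table
# (freq_table is an illustrative name)
freq_table <- data.frame(
  Case      = names(tabx),
  Frequency = as.vector(tabx),
  Relative  = as.vector(rel_freq),
  Percent   = as.vector(100 * rel_freq)
)
print(freq_table)
```

Here category B accounts for 9 of the 20 responses, that is, 45 percent.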

Draw a Bar Chart and Pie Chart from the Frequency Table

# Draw a bar chart of the frequency table in green, with a main title and axis labels
barplot(tabx, xlab = "Case", ylab = "Frequency", main = "Sample of twenty elementary school children", col = "green")

# Draw a pie chart of the frequency table with a main title
pie(tabx, main = "Sample of twenty elementary school children")

[Figure: bar chart and pie chart of the frequency table]

Descriptive Statistics for Air Quality Data

Consider the air quality data for computing a numerical and graphical descriptive summary in R. The airquality data set is included in the R datasets package.

# Extract the temperature variable only (column 4, named Temp)
Temperature <- airquality$Temp
hist(Temperature)

hist(Temperature, main = "Maximum daily temperature at La Guardia Airport", xlab = "Temperature in degrees Fahrenheit", xlim = c(50, 100), col = "darkmagenta", freq = TRUE)

h <- hist(Temperature, ylim = c(0,40))
text(h$mids, h$counts, labels=h$counts, adj=c(0.5, -0.5))

[Figure: histogram of temperature with the count printed above each bar]

In the above histogram, the frequency of each bar is drawn at the top of each bar by using the text() function.

Note that to change the number of classes (or the class interval), pass the breaks argument a sequence running from the minimum to the maximum of the data. Using seq() with length.out = n + 1 divides the range into $n$ classes of equal width; the call below produces 6 classes.

hist(Temperature, breaks = seq(min(Temperature), max(Temperature), length.out = 7))
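
To see why length.out = n + 1 yields n classes, the break points and the resulting counts can be inspected without drawing anything. This is a small sketch, assuming the temperature column is taken as airquality$Temp; the names n_classes and brks are illustrative.

```r
Temperature <- airquality$Temp

n_classes <- 6   # illustrative choice of the number of classes
# n + 1 equally spaced break points give n classes of equal width
brks <- seq(min(Temperature), max(Temperature), length.out = n_classes + 1)

# plot = FALSE returns the histogram object without drawing it
h <- hist(Temperature, breaks = brks, plot = FALSE)

length(h$counts)   # number of classes
diff(brks)         # class widths (all equal)
sum(h$counts)      # every observation falls into exactly one class
```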

[Figure: histogram of temperature with custom breaks]

Median for Ungrouped Data

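Numeric descriptive statistics such as the median, mean, and other summary statistics can be computed with built-in functions. Note that base R has no built-in function for the statistical mode; mode() returns the storage mode of an object.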

median(Temperature)
## Output 79
mean(Temperature)
summary(Temperature)
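
Since base R does not provide a function for the statistical mode, one way to obtain it is from a frequency table. The helper below is a hypothetical sketch (the name stat_mode is not a base R function); it returns all tied values when there is more than one mode.

```r
# Hypothetical helper: return the most frequent value(s) of a vector
stat_mode <- function(x) {
  tab <- table(x)
  modes <- names(tab)[tab == max(tab)]
  # table() stores the values as character names; convert back for numeric input
  if (is.numeric(x)) as.numeric(modes) else modes
}

stat_mode(c(1, 2, 2, 3))            # 2
stat_mode(c("a", "b", "b", "a"))    # "a" "b" (ties are all returned)
stat_mode(airquality$Temp)          # most frequent temperature(s)
```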

Numerical Descriptive Statistics in R Programming Language

A custom function for computing the median can also be written. For example:

arithmetic.median <- function(xx) {
  s <- sort(xx)   # sort once and reuse
  n <- length(s)
  if (n %% 2 == 0) {
    # even length: average of the two middle values
    (s[n / 2] + s[n / 2 + 1]) / 2
  } else {
    # odd length: the middle value
    s[(n + 1) / 2]
  }
}
arithmetic.median(Temperature)
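
A quick way to gain confidence in a hand-written median is to compare it with the built-in median() on both odd-length and even-length input. The sketch below restates the function compactly so that it runs on its own; the test vectors odd_x and even_x are made up for illustration.

```r
# Compact restatement of the custom median (sorts once)
arithmetic.median <- function(xx) {
  s <- sort(xx)
  n <- length(s)
  if (n %% 2 == 0) (s[n / 2] + s[n / 2 + 1]) / 2 else s[(n + 1) / 2]
}

odd_x  <- c(7, 1, 5)       # odd length: the middle value is returned
even_x <- c(7, 1, 5, 3)    # even length: mean of the two middle values

arithmetic.median(odd_x)   # 5
arithmetic.median(even_x)  # 4

# Agreement with the built-in function on the temperature data
all.equal(arithmetic.median(airquality$Temp), median(airquality$Temp))
```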

Computing Quartiles and IQR

The quantiles (quartiles, deciles, and percentiles) can be computed using the quantile() function in R. The interquartile range (IQR) can be computed using the IQR() function; note that R is case-sensitive, so iqr() will not work.

y = airquality[, 4]  # temperature variable

quantile(y)

quantile(y, probs = c(0.25,0.5,0.75))
quantile(y, probs = c(0.30,0.50,0.70,0.90))

IQR(y)
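
The interquartile range is the difference between the third and first quartiles, so it can be recovered from quantile() directly. A minimal sketch, assuming the temperature column is taken as airquality$Temp (iqr_manual is an illustrative name):

```r
y <- airquality$Temp

q <- quantile(y, probs = c(0.25, 0.75))   # first and third quartiles
iqr_manual <- unname(q[2] - q[1])         # Q3 - Q1

iqr_manual
IQR(y)   # same value; both use the default quantile type (type = 7)
```

quantile() supports nine definitions through its type argument, so results can differ slightly from textbook hand calculations.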

Quartiles Descriptive summary in R Programming Language

One can create a custom function for the computation of Quartiles and IQR. For example,

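quart <- function(x) {
  x <- sort(x)
  n <- length(x)
  m <- (n + 1) / 2            # position of the median
  if (floor(m) != m) {
    # even n: split the data into two equal halves
    l <- m - 1/2; u <- m + 1/2
  } else {
    # odd n: exclude the median from both halves
    l <- m - 1; u <- m + 1
  }
  q1 <- median(x[1:l])        # median of the lower half
  q3 <- median(x[u:n])        # median of the upper half
  c(Q1 = q1, Q3 = q3, IQR = q3 - q1)
}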

quart(y)
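
The custom function takes the medians of the lower and upper halves of the data, which is not the same definition that quantile() uses by default, so the two can disagree. The small sketch below illustrates this on the made-up vector z; it repeats the function so the block runs on its own.

```r
quart <- function(x) {
  x <- sort(x)
  n <- length(x)
  m <- (n + 1) / 2
  if (floor(m) != m) { l <- m - 1/2; u <- m + 1/2 } else { l <- m - 1; u <- m + 1 }
  q1 <- median(x[1:l]); q3 <- median(x[u:n])
  c(Q1 = q1, Q3 = q3, IQR = q3 - q1)
}

z <- 1:9   # illustrative data with an odd number of values

quart(z)                             # Q1 = 2.5, Q3 = 7.5, IQR = 5
quantile(z, probs = c(0.25, 0.75))   # 3 and 7 under the default type 7
IQR(z)                               # 4
```

Neither answer is wrong; they follow different quartile conventions, which matters mostly for small samples.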

FAQs in R Language

How can one perform descriptive statistics in the R language?
Discuss the strategy for creating a frequency table in R.
How can pie charts and bar charts be drawn in the R language? Discuss the commands and their important arguments.
What default function is used to compute the quartiles of a data set?
You are interested in computing the median for grouped and ungrouped data in R. Write a customized R function.
Create a user-defined function that computes the quartiles and IQR of an input data set.

https://itfeature.com

https://gmstat.com

Shapiro-Wilk Test in R (2024)

June 20, 2024 | May 24, 2024 by Muhammad Imdad Ullah

One should test the assumption of normality before performing a statistical test that requires it, such as the one-sample t-test. In this article, we will discuss the Shapiro-Wilk test in R. The hypotheses are

$H_0$: The data are normally distributed

$H_1$: The data are not normally distributed

Performing Shapiro-Wilk Test in R

To check normality using the Shapiro-Wilk test in R, we will use the built-in mtcars data set.

# Extract the variable instead of attaching the whole data frame
mpg <- mtcars$mpg
shapiro.test(mpg)

Shapiro-Wilk Test in R Checking Normality Assumption

The p-value from the Shapiro-Wilk test (approximately 0.12) is greater than the 0.05 level of significance, so the null hypothesis of normality is not rejected for the $mpg$ variable. This means there is no significant evidence against normality; it does not prove that the data are normal.

By looking at the p-value, one can decide whether to reject or fail to reject the null hypothesis of normality:
- If the p-value is less than the chosen significance level (e.g., 0.05), reject the null hypothesis and conclude that the data are likely not normally distributed.
- If the p-value is greater than the chosen significance level, fail to reject the null hypothesis; the data might be normal, but this does not confirm normality.
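
The decision rule above can be written out in code, because shapiro.test() returns an object whose statistic and p.value components can be read directly. This is a minimal sketch; the significance level alpha = 0.05 and the wording of the decision strings are assumptions made for illustration.

```r
alpha <- 0.05                     # chosen significance level (assumption)
res <- shapiro.test(mtcars$mpg)   # returns an object of class "htest"

res$statistic   # W statistic
res$p.value     # p-value of the test

# Apply the decision rule described above
decision <- if (res$p.value < alpha) {
  "reject H0: the data are unlikely to be normally distributed"
} else {
  "fail to reject H0: no significant evidence against normality"
}
decision
```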

The normality can be visualized using a QQ plot.

# QQ Plot from Base Package
qqnorm(mpg, pch = 1, frame = FALSE)
qqline(mpg, col = "red", lwd = 2)

From the QQ plot of the base package, it can be seen that a few points deviate from the reference line, which suggests some departure from normality for the $mpg$ variable.

# QQ plot from car Package
library(car)
qqPlot(mpg)

From the QQ plot (with confidence interval band), one can observe that the $mpg$ variable is approximately normally distributed.

Note that

The Shapiro-Wilk test is generally more powerful than other normality tests, such as the Kolmogorov-Smirnov test, for small to moderate sample sizes. In R, shapiro.test() accepts sample sizes between 3 and 5000.
It is important to visually inspect the data using a histogram or Q-Q plot to complement the Shapiro-Wilk test results for a more comprehensive assessment of normality.
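
The sample-size limit of shapiro.test() can be checked directly. The sketch below uses simulated data (the seed and the object names are arbitrary) and shows one possible workaround, testing a random subsample, which is an illustrative choice rather than a general recommendation.

```r
set.seed(1)                   # arbitrary seed, for reproducibility only
big_sample <- rnorm(5001)     # one observation over the limit

# Capture the error message instead of stopping the script
msg <- tryCatch(
  { shapiro.test(big_sample); "test ran" },
  error = function(e) conditionMessage(e)
)
msg   # error message reporting the allowed sample-size range

# One possible workaround: test a random subsample of at most 5000 values
sub_res <- shapiro.test(sample(big_sample, 5000))
sub_res$p.value
```

With samples this large, a histogram or Q-Q plot is usually the more informative check.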


R Language: A Quick Reference Guide – IV

September 5, 2024 | October 17, 2023 by Muhammad Imdad Ullah

R Quick Reference Guide

R Language: A Quick Reference Guide gives short descriptions of widely used commands in R programming. It helps beginners and intermediate users of the R programming language look up functions quickly. The Quick Reference is organized into groups. Let us start with R Language: A Quick Reference – IV.

This Quick Reference will help in performing different descriptive statistics on vectors, matrices, lists, data frames, arrays, and factors.

Basic Descriptive Statistics in R Language

The following is a list of widely used functions that are helpful in computing descriptive statistics. They are not descriptive statistics functions themselves, but they are useful for computing other descriptive statistics.

R Command	Short Description
sum(x1, x2, … , xn)	Computes the sum/total of the $n$ numeric values given as arguments
prod(x1, x2, … , xn)	Computes the product of the $n$ numeric values given as arguments
min(x1, x2, … , xn)	Gives the smallest of the $n$ values given as arguments
max(x1, x2, …, xn)	Gives the largest of the $n$ values given as arguments
range(x1, x2, … , xn)	Gives both the smallest and the largest of the $n$ values given as arguments
pmin(x1, x2, …)	Returns the parallel (element-wise) minima of the input vectors
pmax(x1, x2, …)	Returns the parallel (element-wise) maxima of the input vectors
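
The difference between min()/max() and pmin()/pmax() is easiest to see on two short vectors. The vectors a and b below are made up for illustration.

```r
a <- c(1, 5, 3)
b <- c(4, 2, 6)

sum(a, b)     # 21: total of all values in both vectors
prod(a)       # 15
min(a, b)     # 1: single smallest value across all arguments
max(a, b)     # 6: single largest value across all arguments
range(a, b)   # 1 6
pmin(a, b)    # 1 2 3: element-wise (parallel) minima
pmax(a, b)    # 4 5 6: element-wise (parallel) maxima
```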

Statistical Descriptive Statistics in R Language

The following functions are used to compute measures of central tendency, measures of dispersion, and measures of position.

R Command	Short Description
mean(x)	Computes the arithmetic mean of all elements in $x$
sd(x)	Computes the standard deviation of all elements in $x$
var(x)	Computes the variance of all elements in $x$
median(x)	Computes the median of all elements in $x$
quantile(x)	Computes the median, quartiles, and extremes in $x$
quantile(x, p)	Computes the quantiles specified by $p$
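
As a brief illustration of the functions in this table, the sketch below applies them to a small made-up vector x. Note that var() and sd() use the sample formulas with divisor n - 1.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

mean(x)                    # 5
median(x)                  # 4.5
var(x)                     # sample variance (divisor n - 1)
sd(x)                      # square root of var(x)
quantile(x)                # minimum, quartiles, and maximum
quantile(x, c(0.1, 0.9))   # 10th and 90th percentiles
```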

Cumulative Summaries in R Language

The following functions are also helpful in computing the other descriptive calculations.

R Command	Short Description
cumsum(x)	Computes the cumulative sum of $x$
cumprod(x)	Computes the cumulative product of $x$
cummin(x)	Computes the cumulative minimum of $x$
cummax(x)	Computes the cumulative maximum of $x$
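
The cumulative functions return a vector of the same length as the input, where each element summarizes all values up to that position. A short sketch on made-up data; the counts vector reuses the frequencies 9, 4, 6, and 1 only to show a cumulative frequency calculation.

```r
x <- c(3, 1, 4, 1, 5)

cumsum(x)    # 3  4  8  9 14
cumprod(x)   # 3  3 12 12 60
cummin(x)    # 3 1 1 1 1
cummax(x)    # 3 3 4 4 5

# Cumulative frequencies from a vector of class counts
counts <- c(B = 9, F = 4, M = 6, S = 1)
cumsum(counts)   # 9 13 19 20
```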

Sorting and Ordering Elements in R Language

The sorting and ordering functions are especially useful in non-parametric methods.

R Command	Short Description
sort(x)	Sorts all elements of $x$ in ascending order
sort(x, decreasing = TRUE)	Sorts all elements of $x$ in descending order
rev(x)	Reverse the elements in $x$
order(x)	Get the ordering permutation of $x$
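
The difference between sort() and order() is that sort() returns the sorted values, while order() returns the positions that would sort them. The sketch below uses a made-up vector; rank() is not listed in the table above and is included only because ranks are what many non-parametric methods work with.

```r
x <- c(30, 10, 20)

sort(x)                      # 10 20 30
sort(x, decreasing = TRUE)   # 30 20 10
rev(x)                       # 20 10 30 (reverses, does not sort)

idx <- order(x)              # 2 3 1: positions of the sorted values
x[idx]                       # 10 20 30, same as sort(x)

rank(x)                      # 3 1 2: rank of each element
```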

Sequence and Repetition of Elements in R Language

These functions are used to generate a sequence of numbers or repeat the set of numbers $n$ times.

R Command	Short Description
a:b	Generates a sequence of numbers from $a$ to $b$ in steps of size 1
seq(n)	Generates a sequence of numbers from 1 to $n$
seq(a, b)	Generates a sequence of numbers from $a$ to $b$ in steps of size 1; it is the same as a:b
seq(a, b, by=s)	Generates a sequence of numbers from $a$ to $b$ in steps of size $s$.
seq(a, b, length=n)	Generates a sequence of numbers having length $n$ from $a$ to $b$
rep(x, n)	Repeats the whole of $x$ $n$ times
rep(x, each=n)	Repeats the elements of $x$, each element is repeated $n$ times
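
The sequence and repetition functions can be tried on small inputs as follows. The values are arbitrary; length.out is the full name of the argument that the table abbreviates as length.

```r
3:7                         # 3 4 5 6 7
seq(5)                      # 1 2 3 4 5
seq(3, 7)                   # same as 3:7
seq(0, 1, by = 0.25)        # 0.00 0.25 0.50 0.75 1.00
seq(0, 1, length.out = 5)   # the same five values
rep(c(1, 2), 3)             # 1 2 1 2 1 2
rep(c(1, 2), each = 3)      # 1 1 1 2 2 2
```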


R Language: A Quick Reference – I