Exploring Data in R: A Comprehensive R Tutorial

Examining data (exploring data), particularly through graphical examination and representation, is an important prelude to statistical data analysis and modeling. Note that the nature of the data limits the kinds of graphs that can be created.

One should be familiar with standard procedures for exploratory data analysis, statistical graphics, and data transformation. We can categorize the graphical representation of data based on the variable’s nature (or type), the number of variables, and the objective of the analysis. For example, if we are comparing groups, then comparison graphs such as bar graphs can be used. If we are interested in the kind of relationship between variables, then a scatter plot can be useful.

  • Distributional Displays:
    The distributional displays include stem and leaf displays, histograms, density estimates, quantile comparison plots, and box plots.
  • Plots of the Relationship between two variables:
    The graphical representations of the relationship between two variables include various versions of scatter plots, scatter plot smoothers, bivariate density estimates, and parallel box plots.
  • Multivariate Displays:
    Multivariate graphical representations include scatter plot matrices, coplots, and dynamic three-dimensional scatter plots.
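
As a minimal sketch of the multivariate displays named above, base R's pairs() draws a scatter plot matrix and coplot() draws conditioning plots. The choice of the mtcars columns mpg, wt, and hp, and the helper object name cars, are assumptions made only for illustration.

```r
# Scatter plot matrix of three quantitative variables (illustrative choice)
vars <- c("mpg", "wt", "hp")
pairs(mtcars[, vars], main = "Scatter plot matrix")

# Copy of mtcars with cyl stored as a factor (hypothetical helper object)
cars <- transform(mtcars, cyl = factor(cyl))

# Coplot: mpg against wt, one panel per number of cylinders
coplot(mpg ~ wt | cyl, data = cars)
```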

For exploring the data in R, the following are some examples:

Stem and Leaf Display and Histogram in R

attach(mtcars)
hist(mpg)
hist(mpg, nclass = 3, col = 3)
stem(mpg)
Figure: Histogram (Exploring Data in R)
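
The numbers behind these distributional displays can also be inspected without drawing anything. The following sketch, which uses mtcars$mpg directly rather than the attached data (an assumption for self-containment), pulls the bin counts from hist() and the five-number summary that a box plot would show.

```r
mpg <- mtcars$mpg

# Compute the histogram bins without plotting
h <- hist(mpg, plot = FALSE)
h$breaks   # bin boundaries
h$counts   # number of observations per bin

# Summary used by a box plot:
# lower whisker, lower hinge, median, upper hinge, upper whisker
bp <- boxplot.stats(mpg)
bp$stats
```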

Exploring Data in R: Density Estimates

The following R code represents the distribution by overlaying a smoothed density estimate on the histogram.

hist(mpg, probability = TRUE, ylab = 'Density')
# lwd is an argument of lines(), not of density()
lines(density(mpg), lwd = 2)
points(mpg, rep(0, length(mpg)), pch = "|")
lines(density(mpg, adjust = 0.9), lwd = 1)

The hist() function constructs the histogram, with probability = TRUE specifying density scaling. The first lines() call draws the density estimate on the graph at double the default line thickness because of the parameter lwd = 2. The points() function draws a one-dimensional scatter plot at the bottom of the graph, using a vertical bar as the plotting symbol. The second call to density(), with adjust = 0.9, specifies a bandwidth of 0.9 times the default value.
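
The effect of adjust can be checked from the objects that density() returns. This is a small sketch; the object names d_default and d_narrow are illustrative only.

```r
mpg <- mtcars$mpg

d_default <- density(mpg)                # default bandwidth
d_narrow  <- density(mpg, adjust = 0.9)  # bandwidth scaled by 0.9

# Ratio of the two bandwidths
d_narrow$bw / d_default$bw
```

A smaller adjust gives a narrower bandwidth and therefore a less smooth density curve.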

Quantile Comparison Plots

Quantile plots help in comparing the distribution of a variable with a theoretical distribution such as the normal distribution.

library(car)
qqPlot(mpg)

Note that the qqPlot() function is available in the car package. The qq.plot() function is defunct.
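
If the car package is not installed, a comparable normal quantile comparison plot can be sketched with base R's qqnorm() and qqline(). This is an alternative to qqPlot(), not a description of how qqPlot() works internally; the object names are illustrative.

```r
mpg <- mtcars$mpg

# Normal Q-Q plot: sample quantiles against theoretical normal quantiles
qq <- qqnorm(mpg, main = "Normal Q-Q plot of mpg")
qqline(mpg, lty = 2)   # reference line through the first and third quartiles

# Theoretical quantiles that qqnorm() places on the x-axis
theoretical <- qnorm(ppoints(length(mpg)))
```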

Exploring Data: Relationship Graphs

To explore the relationship between two quantitative variables, use the plot() function; for an enhanced scatter plot of two variables, use the scatterplot() function from the car package. This function plots the variables with least-squares and non-parametric regression lines. For example,

plot(mpg, wt)
scatterplot(mpg, wt)
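# label points with the car names; in recent versions of car, labels are passed through the id argument
scatterplot(mpg, wt, id = list(labels = rownames(mtcars), n = 2))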
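
A similar picture can be sketched in base R by adding a least-squares line and a lowess smoother to an ordinary scatter plot. This is only an approximation of what scatterplot() draws, and the object name fit is illustrative.

```r
# Least-squares fit of weight on miles per gallon
fit <- lm(wt ~ mpg, data = mtcars)

plot(mtcars$mpg, mtcars$wt, xlab = "mpg", ylab = "wt")
abline(fit, lwd = 2)                            # least-squares line
lines(lowess(mtcars$mpg, mtcars$wt), lty = 2)   # non-parametric smoother

coef(fit)   # intercept and slope of the fitted line
```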

FAQs about R Language

  1. What do you mean by exploring data?
  2. What are the objectives of exploratory data analysis?
  3. What are the important visualizations for exploratory data analysis?
  4. For exploratory analysis, which graph is used for comparison purposes?
  5. For exploratory analysis, which graph is used to explore the relationship between variables?
  6. What is a quantile comparison plot?
  7. What is the objective of density estimation graphs?
  8. Name some of the multivariate plots used for EDA.

Best R Language MCQs 1

This post presents MCQs on the R language that test your ability to perform basic operations on objects in R and your understanding of some basic concepts. The quiz may also improve your computational understanding and give you practice with R language MCQs.

Online Multiple Choice Questions about R Language

  1. R is an __________ programming language?
  2. Which of the following is a primary tool for debugging?
  3. _______ command is used to skip an iteration of a loop.
  4. R is an interpreted language. It can be accessed through _____________?
  5. In R Language every operation has a ______ call.
  6. In 1991, R was created by Ross Ihaka and Robert Gentleman in the Department of Statistics at the University of
  7. Many quantitative analysts use R as their ____ tool.
  8. Who developed R?
  9. Vectors come in two parts _____ and _____.
  10. The ____________ in R is a vector.
  11. R is technically much closer to the Scheme language than it is to the original _____ language.
  12. Which of the following commands will find the maximum value in the vector x, excluding the missing values
  13. _________ initiates an infinite loop right from the start.
  14. R was named partly after the first names of ____ R authors.
  15. Packages are useful in collecting sets into a _____ unit
  16. _________ and _________ are types of matrices functions?
  17. How many types of R objects are present in the R data type?
  18. Which function is used to create the vector with more than one element?
  19. R Language functionality is divided into a number of ________


https://itfeature.com

https://gmstat.com

One-Way ANOVA in R: A Comprehensive Guide

In this post, we will learn about one-way ANOVA in R Language.

The two-sample t-test or z-test is used to compare two groups from independent populations. However, if there are more than two groups, One-Way ANOVA (analysis of variance) or one of its extensions can be used in R.

Introduction to One-Way ANOVA

The test statistic associated with ANOVA is the F-statistic (also called the F-ratio). In the ANOVA procedure, an observed F-value is computed and then compared with a critical F-value derived from the relevant F-distribution. The F-value comes from a family of F-distributions defined by two numbers (the degrees of freedom). Note that the F-statistic cannot be negative because it is a ratio of variances, and variances are never negative.

The One-Way ANOVA is also known as one-factor ANOVA. It is the extension of the independent two-sample test for comparing means when there are more than two groups. The data in One-Way ANOVA are organized into several groups based on a single grouping variable (also called a factor variable).

To compute the F-value, the ratio of the “variance between groups” to the “variance within groups” needs to be computed. The assumptions of ANOVA should also be checked before performing the ANOVA test. We will learn how to perform One-Way ANOVA in R.
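
The ratio described above can be worked out by hand. The following sketch uses the mtcars data (mpg grouped by cyl) and illustrative object names; it computes the between-group and within-group mean squares and their ratio.

```r
y <- mtcars$mpg
g <- factor(mtcars$cyl)

n <- length(y)      # total number of observations
k <- nlevels(g)     # number of groups

group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

# Sum of squares between groups and within groups
ss_between <- sum(group_sizes * (group_means - mean(y))^2)
ss_within  <- sum((y - ave(y, g))^2)   # ave() gives each observation its group mean

# Mean squares and the F-ratio with (k - 1, n - k) degrees of freedom
ms_between <- ss_between / (k - 1)
ms_within  <- ss_within / (n - k)
f_value    <- ms_between / ms_within
p_value    <- pf(f_value, k - 1, n - k, lower.tail = FALSE)

f_value
```

The value of f_value should agree with the F value reported by summary(aov(y ~ g)).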

One-Way ANOVA in R

Suppose we are interested in whether miles per gallon differs by the number of cylinders in an automobile, using the “mtcars” dataset.

Let us get some basic insight into the data before performing the ANOVA.

# load and attach the data mtcars
attach(mtcars)
# see the variable names and initial observations
head(mtcars)

Let us find the mean of each cylinder group

# fit the model inline, since no ANOVA object has been stored yet
print(model.tables(aov(mpg ~ factor(cyl)), "means"), digits = 4)
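
The group means can also be obtained without fitting a model. A minimal sketch with tapply(), referring to mtcars columns directly so it does not depend on attach(); the object names are illustrative.

```r
# Mean mpg for each number of cylinders
group_means <- tapply(mtcars$mpg, mtcars$cyl, mean)

# Number of cars in each cylinder group
group_sizes <- table(mtcars$cyl)

round(group_means, 2)
group_sizes
```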

Let us draw the boxplot of each group

boxplot(mpg ~ cyl, main="Boxplot", xlab="Number of Cylinders", ylab="mpg")

Now, perform One-Way ANOVA in R using the aov( ) function. For example,

aov(mpg ~ cyl)

The variable “mpg” is continuous and the variable “cyl” is the grouping variable. In the output, note the degrees of freedom for “cyl”: it is one. This means the results are not correct, because the degrees of freedom should be two as there are three groups in “cyl”. The grouping variable in an ANOVA must be a factor; since “cyl” is stored as a numeric variable, aov( ) treats it as a continuous predictor. The “cyl” variable can be converted to a factor as

cyl <- as.factor(cyl)

Now re-issue the aov( ) function as

aov(mpg ~ cyl)

Now the results will be as required. To get the ANOVA table, use the summary( ) function as

summary(aov (mpg ~ cyl))
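
The change in the degrees of freedom can be confirmed by fitting both versions side by side. This sketch passes data = mtcars instead of relying on attach(), and the object names fit_numeric and fit_factor are illustrative.

```r
# cyl treated as a numeric predictor (a regression slope, 1 df)
fit_numeric <- aov(mpg ~ cyl, data = mtcars)

# cyl treated as a grouping factor with three levels (2 df)
fit_factor <- aov(mpg ~ factor(cyl), data = mtcars)

summary(fit_numeric)[[1]][["Df"]]   # 1 and 30
summary(fit_factor)[[1]][["Df"]]    # 2 and 29
```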

Let’s store the ANOVA results obtained from aov( ) in an object, say res

res <- aov(mpg ~ cyl)
summary(res)

Post-Hoc Analysis (Multiple Pairwise Comparison)

Post-hoc tests or multiple pairwise comparison tests help in finding out which groups differ (significantly) from one another and which do not. The post-hoc tests allow for multiple pairwise comparisons without inflating the type-I error. To understand this, suppose the level of significance (type-I error) is 5% and three pairwise comparisons are made. Then, assuming the three tests are independent, the probability of making at least one type-I error (the maximum family-wise error rate) will be

$1-(0.95 \times 0.95 \times 0.95) = 0.142625 \approx 14.3\%$

This is the probability of having at least one false alarm (type-I error).
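
The same calculation can be sketched in R for any number of comparisons. The object names are illustrative, and the Bonferroni line is included only as one simple way of keeping the family-wise rate at or below the nominal level.

```r
alpha <- 0.05
m <- choose(3, 2)   # 3 pairwise comparisons among 3 groups

# Family-wise error rate for m independent tests at level alpha
fwer <- 1 - (1 - alpha)^m   # 0.142625

# Per-comparison level under a Bonferroni correction
bonferroni_alpha <- alpha / m

fwer
bonferroni_alpha
```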

To perform Tukey’s post-hoc test and plot the differences in group means from Tukey’s test, use

# Tukey Honestly Significant Differences
TukeyHSD(res)
plot(TukeyHSD(res))
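
The pairwise results can also be pulled out of the TukeyHSD object as a matrix for further use. This is a self-contained sketch; cars, fit, and tk are illustrative names, and the model is refitted here so the block does not depend on the attached data.

```r
# Copy of mtcars with cyl stored as a factor (hypothetical helper object)
cars <- transform(mtcars, cyl = factor(cyl))
fit  <- aov(mpg ~ cyl, data = cars)

# Matrix with one row per pair of groups and columns diff, lwr, upr, p adj
tk <- TukeyHSD(fit)$cyl
tk

# Pairs whose adjusted p-value is below 5%
rownames(tk)[tk[, "p adj"] < 0.05]
```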

Diagnostic Plots (Checking Model Assumptions)

The diagnostic plots can be used to check the assumptions of homoscedasticity (constant variance) and normality, and to detect influential observations.

layout(matrix(c(1,2,3,4), 2,2))
plot(res)
Figure: Diagnostic Plots for One-Way ANOVA in R
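
The visual checks can be supplemented with simple numerical ones on the residuals. The sketch below is an optional addition rather than part of the diagnostic plots themselves; it uses base R's shapiro.test() for normality and compares the residual variance across groups. The object names are illustrative.

```r
# Copy of mtcars with cyl stored as a factor (hypothetical helper object)
cars <- transform(mtcars, cyl = factor(cyl))
fit  <- aov(mpg ~ cyl, data = cars)

res_values <- residuals(fit)

# Shapiro-Wilk test of normality on the residuals
shapiro.test(res_values)

# Residual variance within each cylinder group
tapply(res_values, cars$cyl, var)
```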

Levene’s Test

To check the homogeneity-of-variance assumption of ANOVA, Levene’s test can be used. The leveneTest( ) function, available in the car package, performs this test.

library(car)
leveneTest(res)
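
The idea behind the test can be sketched in base R: Levene's test is a one-way ANOVA on the absolute deviations of each observation from its group center. The version below uses the group median as the center, which is assumed here to match the default of leveneTest( ); the object names are illustrative.

```r
g <- factor(mtcars$cyl)
y <- mtcars$mpg

# Absolute deviation of each observation from its group median
abs_dev <- abs(y - ave(y, g, FUN = median))

# One-way ANOVA on the absolute deviations
lev <- anova(lm(abs_dev ~ g))
lev
```

A large F value (small p-value) in this table suggests that the group variances are not equal.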
