Creating Vectors in R, Subsetting, and Vectorization

This article covers creating vectors in the R language. You will also learn quick, short methods for subsetting vectors in R and how vectorized operations work.

Creating Vectors in R Using c() Function

The c() function creates vectors of objects in R by concatenating (combining) its arguments into a single one-dimensional vector. Unlike a matrix, a vector has no row or column orientation. The following examples create different types of vectors in R.

# Numeric vector
x <- c(1, 2, 5, 0.5, 10, 20, pi)
# Logical vector
x <- c(TRUE, FALSE, FALSE, T, T, F)
# Character vector
x <- c("a", "z", "good", "bad", "null hypothesis")
# Integer vector 
x <- 9 : 29   # (colon operator is used)
x <- c(1L, 5L, 0L, 15L)
# Complex vector
x <- c(1+0i, 2+4i, 0+0i)
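To confirm which kind of vector each call produced, one can inspect the result with class(), typeof(), and length(). The following is a small sketch; the variable names (num_vec, int_vec, and so on) are only illustrative and do not appear in the examples above.

```r
num_vec <- c(1, 2, 5, 0.5, 10, 20, pi)
int_vec <- c(1L, 5L, 0L, 15L)
seq_vec <- 9:29
chr_vec <- c("a", "z", "good")
lgl_vec <- c(TRUE, FALSE, TRUE)
cpx_vec <- c(1+0i, 2+4i)

class(num_vec)   # "numeric"
typeof(num_vec)  # "double"
class(int_vec)   # "integer"
class(seq_vec)   # "integer" (the colon operator yields integers here)
class(chr_vec)   # "character"
class(lgl_vec)   # "logical"
class(cpx_vec)   # "complex"
length(seq_vec)  # 21
```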

Using vector() Function

The vector() function creates a vector of $n$ elements filled with a default value: zero for a numeric vector, an empty string for a character vector, FALSE for a logical vector, and 0+0i for a complex vector.

# Numeric vector of length 10 (default is zero)
x <- vector("numeric", length = 10)
# Integer vector of length 10 (default is integer zeros)
x <- vector("integer", length = 10)
# Character vector of length 10 (default is empty string)
x <- vector("character", length = 10)
# Logical vector of length 10 (default is FALSE)
x <- vector("logical", length = 10)
# Complex vector of length 10 (default is 0+0i)
x <- vector("complex", length=10)
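As a quick check of these defaults, the sketch below builds short vectors (length 3 is an arbitrary choice) and compares them with the shorthand constructors numeric(), character(), and logical(), which should give the same results.

```r
x_num <- vector("numeric", length = 3)
x_chr <- vector("character", length = 3)
x_lgl <- vector("logical", length = 3)

x_num  # 0 0 0
x_chr  # "" "" ""
x_lgl  # FALSE FALSE FALSE

# Shorthand constructors produce identical vectors
identical(x_num, numeric(3))    # TRUE
identical(x_chr, character(3))  # TRUE
identical(x_lgl, logical(3))    # TRUE
```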

Creating Vectors with Mixed Objects

When objects of different types are mixed in a vector, coercion occurs: R converts every element to the most flexible type present, in the order logical, integer, numeric, complex, character.

The following are examples:

# coerce to character vector 
y <- c(1.2, "good")
y <- c("a", T)
# coerce to a numeric vector
y <- c(T, 2)

As the above examples show, coercion gives every element of the vector the same class.
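The result of implicit coercion can be verified with class(). The sketch below repeats the mixed vectors under illustrative names (y1 to y4) and adds one integer/numeric case that is not in the examples above.

```r
y1 <- c(1.2, "good")   # numeric + character
y2 <- c("a", TRUE)     # character + logical
y3 <- c(TRUE, 2)       # logical + numeric
y4 <- c(1L, 2.5)       # integer + numeric

class(y1)  # "character"
y1         # "1.2" "good"
class(y2)  # "character"
y2         # "a" "TRUE"
class(y3)  # "numeric"
y3         # 1 2
class(y4)  # "numeric"
```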

Explicitly Coercing Objects to Other Class

Objects can be explicitly coerced from one class to another using the as.character(), as.numeric(), as.integer(), as.complex(), and as.logical() functions. For example:

x <- 0:6
as.numeric(x)
as.logical(x)
as.character(x)
as.complex(x)

Note that nonsensical coercion results in NAs (missing values). For example,

x <- c("a", "b", "c")
as.numeric(x)
as.logical(x)
as.complex(x)
as.integer(x)
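The sketch below shows what such a failed coercion returns and contrasts it with strings that can be converted. suppressWarnings() is used here only to keep the output quiet; without it, as.numeric() warns that NAs were introduced by coercion.

```r
x <- c("a", "b", "c")

num_x <- suppressWarnings(as.numeric(x))
num_x         # NA NA NA
is.na(num_x)  # TRUE TRUE TRUE

# Strings that look like numbers convert cleanly
as.numeric(c("1", "2.5", "10"))  # 1.0 2.5 10.0

# as.logical() recognizes strings such as "TRUE", "false", and "T";
# anything else becomes NA
as.logical(c("TRUE", "false", "T", "yes"))  # TRUE FALSE TRUE NA
```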

Vectorization in R

Many operations in the R language are vectorized: arithmetic operators (+, -, *, and /) and comparison operators are applied element by element. For example,

x <- 1 : 4
y <- 6 : 9

# Arithmetic operations
x + y
x - y
x * y
x / y
# Logical Operation
x >= 2
x < 3
y == 8

Without vectorization (as in many other languages), one has to write a for loop to perform element-by-element operations on vectors.
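To make the contrast concrete, the following sketch adds two vectors first with an explicit loop and then with the vectorized operator. The names z_loop and z_vec are illustrative.

```r
x <- 1:4
y <- 6:9

# Loop version: preallocate the result, then fill it element by element
z_loop <- vector("numeric", length = length(x))
for (i in seq_along(x)) {
  z_loop[i] <- x[i] + y[i]
}

# Vectorized version: one expression, no explicit loop
z_vec <- x + y

z_loop                # 7 9 11 13
all(z_loop == z_vec)  # TRUE
```

Both versions give the same result; the vectorized form is shorter and is generally the more idiomatic choice in R.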

Subsetting Vectors in R Language

Subsetting in the R language can be done easily. Subsetting a vector means extracting some of its elements, which is done with square brackets ([ ]). For example:

x <- c(1, 6, 10, -15, 0, 13, 5, 2, 10, 9)

# Subsetting examples
x[1]   # extract first element of x vector
x[1:5] # extract first five values of x
x[-1]  # extract all values except first
x[x > 2] # extracts all elements that are greater than 2

head(x)  # extracts first 6 elements of x
tail(x)  # extracts last 6 elements of x

x[x > 5 & x < 10]  # extracts elements that are greater than 5 but less than 10

One can also use the subset() function to extract the desired elements using logical conditions. For example,

subset(x, x > 5)
subset(x, x > 5 & x < 10)
subset(x, !x < 0 )
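One practical difference between the two approaches appears when the vector contains missing values: logical indexing with [ ] returns an NA for every NA in the condition, whereas subset() drops those elements. The vector x_na below is made up for illustration.

```r
x_na <- c(1, 6, NA, 10, -15)

x_na[x_na > 5]                  # 6 NA 10 (the NA is carried along)
subset(x_na, x_na > 5)          # 6 10    (the NA is dropped)
x_na[!is.na(x_na) & x_na > 5]   # 6 10    (same result with [ ])
which(x_na > 5)                 # 2 4     (positions of matching elements)
```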


Summary Statistics in R

In this article, you will learn how to compute summary statistics in the R language for a data set and, finally, create a data quality report file. Let us start learning “Computing Summary Statistics in R”.

We will treat each step as a task, so that the work is completed in sequence and is easier to follow.

Task 1: Load and View Data Set

It is better to confirm the working directory using getwd() and save your data in the working directory, or save the data in the required folder and then set the path of this folder (directory) in R using the setwd() function.

getwd()
data <- read.csv("data.csv")

Task 2: Calculate Measure of Frequency Metrics in R

Before calculating the frequency metrics, it is better to check the data structure and some other useful information about the data. For example,

Note: here we are using the mtcars data set.

data <- mtcars
str(data)
head(data)
length(data$cyl)
length(unique(data$cyl))
table(data$cyl)

freq <- table(data$cyl)
freq <- sort(freq, decreasing = TRUE)
print(freq)
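Counts are often easier to compare as proportions. The sketch below is an addition to the task and uses prop.table() on the same frequency table; it refers to mtcars directly so that it runs on its own.

```r
freq <- table(mtcars$cyl)              # counts per number of cylinders
sort(freq, decreasing = TRUE)          # 8 cylinders: 14, 4: 11, 6: 7

prop <- prop.table(freq)               # relative frequencies (sum to 1)
round(100 * prop, 1)                   # percentages
names(freq)[which.max(freq)]           # most frequent category: "8"
```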

The above lines of code report the structure and first rows of the data set, the number of observations, the number of unique categories of the cylinder variable, its frequency table, and finally the frequencies sorted in decreasing order.

Task 3: Calculate the Measure of Central Tendency in R

Here we will calculate the common measures of central tendency: mean, median, and mode. The mean and median can be computed with the following commands:

mean(data$mpg)
mean(data$mpg, na.rm = T)
median(data$mpg)
median(data$mpg, na.rm = T)

Note the use of the na.rm argument. If there are missing values in the data, na.rm should be set to TRUE so that they are removed before the computation. Since the mtcars data set does not contain any missing values, the results of both calls will be the same.

There is no built-in function to compute the most repeated value (the statistical mode) of a variable; R's mode() function returns the storage mode of an object instead. However, the mode can be calculated by combining different functions. For example:

# for continuous variable
uniquevalues <- unique(data$hp)
uniquevalues[which.max(tabulate(match(data$hp, uniquevalues)))]
# for categorical variable
uniquevalues <- unique(data$cyl)
uniquevalues[which.max(tabulate(match(data$cyl, uniquevalues)))]
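The same combination of functions can be wrapped in a small helper so that it is not repeated for every variable. This is only a sketch: stat_mode is a hypothetical name, missing values are dropped as an added assumption, and when several values tie for the highest count the first one encountered is returned.

```r
# Hypothetical helper: most frequent value of a vector
stat_mode <- function(v) {
  v <- v[!is.na(v)]                          # ignore missing values
  uniq <- unique(v)                          # candidate values
  counts <- tabulate(match(v, uniq))         # how often each candidate occurs
  uniq[which.max(counts)]                    # first value with the highest count
}

stat_mode(c(2, 3, 3, 5, 3, 2))   # 3
stat_mode(c("a", "b", "b"))      # "b"
stat_mode(mtcars$cyl)            # 8
```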

Task 4: Calculate Measure of Dispersion in R Programming

Measures of dispersion such as the range, variance, and standard deviation can be computed as follows:

min(data$disp)
min(data$disp, na.rm = T)
max(data$disp)
max(data$disp, na.rm = T)
range(data$disp, na.rm = T)
var(data$disp, na.rm = T)
sd(data$disp, na.rm = T)
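Note that range() returns the minimum and maximum rather than a single number. The sketch below derives the width of the range with diff() and adds two further measures, the interquartile range and the coefficient of variation, which go beyond the calls listed above. The variable names are illustrative.

```r
disp <- mtcars$disp

r <- range(disp)             # c(minimum, maximum)
range_width <- diff(r)       # maximum - minimum
iqr_disp <- IQR(disp)        # interquartile range
cv_disp <- sd(disp) / mean(disp)   # coefficient of variation

range_width
iqr_disp
cv_disp

# The standard deviation is the square root of the variance
all.equal(sd(disp), sqrt(var(disp)))   # TRUE
```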

Task 5: Calculate Additional Quality Data Metrics

To compute more data metrics, we must be aware of the data type of each variable. Suppose we have numbers whose data type is set to character. For example,

test <- as.character(1:3)

Finding the mean of such a character variable (the numbers are converted to the character class) results in NA with a warning.

mean(test)

[1] NA 
Warning message: In mean.default(test) : argument is not numeric or logical: returning NA
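One way to handle this, sketched below, is to check the type first and convert the values back to numeric before computing the mean.

```r
test <- as.character(1:3)

class(test)        # "character"
is.numeric(test)   # FALSE

# Convert back to numeric before summarizing
test_num <- as.numeric(test)
mean(test_num)     # 2
```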

Therefore, one must be aware of the data type and class of the variable for which calculations are being performed. The class of a variable in R can be checked using the class() function. For example:

class(data$hp)
class(mtcars)

It may also be useful if we know the number of missing observations in the data set.

test2 <- c(NA, 2, 55, 10, NA)

sum(is.na(test2))
sum(is.na(data$hp))

Note that the data set we are using does not contain any missing values.

Task 6: Computing Summary Statistics in R on all Columns

There are functions in R that can be applied to each column of a data set to perform certain calculations. For example, the apply() function with MARGIN = 2 applies a function to every column; passing length as that function returns the number of observations in each column. Similarly, sapply() applies a function to each column of a data frame.

apply(data, MARGIN=2, length)

sapply(data, function(x) min(x, na.rm=T))
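When the function passed to sapply() returns several named values, the result is a matrix with one column per variable. The sketch below uses this to compute three statistics for every column at once; the name col_stats is illustrative, and mtcars is used directly so the example runs on its own.

```r
col_stats <- sapply(mtcars, function(x) {
  c(min  = min(x, na.rm = TRUE),
    mean = mean(x, na.rm = TRUE),
    max  = max(x, na.rm = TRUE))
})

dim(col_stats)                      # 3 11 (3 statistics, 11 columns)
rownames(col_stats)                 # "min" "mean" "max"
round(col_stats["mean", "mpg"], 2)  # 20.09
```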

Let us create a user-defined function that can compute the minimum, maximum, mean, total, number of missing values, unique values, and data type of each variable (column) of the data frame.

quality_data <- function(df = NULL) {
  # Stop early if no usable data frame is supplied
  if (is.null(df) || !is.data.frame(df) || nrow(df) == 0)
    stop("Please pass a non-empty data frame")

  # One row per column of df, one column per metric
  summary_tab <- do.call(data.frame,
    list(
      Min      = sapply(df, function(x) min(x, na.rm = TRUE)),
      Max      = sapply(df, function(x) max(x, na.rm = TRUE)),
      Mean     = sapply(df, function(x) mean(x, na.rm = TRUE)),
      Total    = sapply(df, length),
      NULLS    = sapply(df, function(x) sum(is.na(x))),
      Unique   = sapply(df, function(x) length(unique(x))),
      DataType = sapply(df, class)
    )
  )

  # Round only the numeric metric columns
  nums <- vapply(summary_tab, is.numeric, FUN.VALUE = logical(1))
  summary_tab[, nums] <- round(summary_tab[, nums], digits = 3)

  return(summary_tab)
}

quality_data(data)

Task 7: Generate a Quality Data Report File

df_quality <- quality_data(data)
df_quality <- cbind(columns = rownames(df_quality),
                    data.frame(df_quality, row.names = NULL)  )

write.csv(df_quality, "Data Quality Report.csv", row.names = FALSE)

# Alternatively, add a timestamp (day-month-year-HourMinuteSecond) to the file name
write.csv(df_quality, paste0("Data Quality Report ",
      format(Sys.time(), "%d-%m-%Y-%H%M%S"), ".csv"),
      row.names = FALSE)

The write.csv() function will create a file that contains all the results produced by the quality_data() function.
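To check that a report file was written as expected, it can be read back with read.csv(). The sketch below is self-contained: it uses a tiny made-up report (two rows built from mtcars) and writes to a temporary directory via tempdir(), both of which are assumptions for illustration rather than part of the task above.

```r
# Small stand-in for the quality report
report <- data.frame(
  columns = c("mpg", "cyl"),
  Min     = c(min(mtcars$mpg), min(mtcars$cyl)),
  Max     = c(max(mtcars$mpg), max(mtcars$cyl))
)

# Timestamped file name in a temporary folder
stamp    <- format(Sys.time(), "%d-%m-%Y-%H%M%S")
out_file <- file.path(tempdir(), paste0("Data Quality Report ", stamp, ".csv"))

write.csv(report, out_file, row.names = FALSE)

file.exists(out_file)        # TRUE
check <- read.csv(out_file)  # read the report back
names(check)                 # "columns" "Min" "Max"
nrow(check)                  # 2
```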

That’s all about calculating descriptive statistics in R. There are many other descriptive measures that we will cover in future posts.

To learn about importing and exporting different data files, see the post on Importing and Exporting Data in R.

FAQs in R

  1. What summary statistics can easily be computed in R?
  2. How can a data set be loaded into the current workspace?
  3. What functions can be used to compute different measures of dispersion in the R language?
  4. How can the summary statistics of all columns be computed at once in R?
  5. What measures of central tendency can be computed in R?
  6. What functions can be used to get information about the loaded data set in R?
  7. How can missing observations be identified in R?


Scatter Plots In R

Introduction to Scatter Plots in R Language

Scatter plots (scatter diagrams) are bivariate graphical representations for examining the relationship between two quantitative variables. They are essential for visualizing correlations and trends in data: a scatter plot helps identify the direction and strength of the relationship between the two variables, and whether the trend is linear or non-linear. If there are more than two variables in a data set, one can draw a scatter plot matrix for all or selected pairs of quantitative variables.

Scatter plots in R can be drawn in several ways. Here we will discuss how to make several kinds of scatter plots in R.

The plot function in R

When two numeric vectors are provided as arguments to the plot() function (one for the horizontal and the other for the vertical coordinates), its default behavior is to draw a scatter diagram. For example,

library(car)
attach(Prestige)
plot(income, prestige)

will draw a simple scatterplot of prestige by income.
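The same kind of plot can be drawn without attach() by using the formula interface of plot() with a data argument. The sketch below swaps in the built-in mtcars data set (an assumption, so that no extra package is needed) and uses cor() to put a number on the direction and strength of the relationship.

```r
# Scatter plot of mpg (vertical axis) against wt (horizontal axis)
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     main = "mpg by weight",
     pch = 19)

# Pearson correlation: the sign gives the direction, the size the strength
r <- cor(mtcars$wt, mtcars$mpg)
round(r, 2)   # -0.87
```

A correlation close to -1 agrees with the downward trend visible in the plot.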

The interpretation of a scatterplot is often assisted by enhancing the plot with least-squares or non-parametric regression lines. For this purpose, the scatterplot() function in the car package can be used; it also adds marginal boxplots for the two variables.

scatterplot(prestige ~ income, lwd = 3 )

Note that in the scatterplot, the non-parametric regression curve is drawn by a local regression smoother. Local regression works by fitting a least-squares line in the neighborhood of each observation, placing greater weight on points closer to the focal observation. A fitted value for the focal observation is extracted from each local regression, and the resulting fitted values are connected to produce the non-parametric regression line.
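The idea of local regression can be sketched in a few lines of base R. The code below is a simplified illustration and not the smoother used by scatterplot(): the helper names tricube and local_line_fit are hypothetical, the tricube weight function and the span of 0.5 are assumptions, and the robustness steps of production smoothers are left out. The built-in mtcars data stand in for the Prestige data.

```r
# Tricube kernel: weights fall smoothly from 1 (at the focal point) to 0
tricube <- function(u) ifelse(abs(u) < 1, (1 - abs(u)^3)^3, 0)

# Hypothetical helper: fitted value of a local weighted line at every x[i]
# (span is assumed to lie between 0 and 1)
local_line_fit <- function(x, y, span = 0.5) {
  n <- length(x)
  k <- min(n, max(3, ceiling(span * n)))  # neighbours used in each local fit
  fitted <- numeric(n)
  for (i in seq_len(n)) {
    d <- abs(x - x[i])        # distances from the focal observation
    h <- sort(d)[k]           # bandwidth: distance to the k-th nearest point
    w <- tricube(d / h)       # closer points receive larger weights
    # Weighted least-squares line through the neighbourhood
    xw <- sum(w * x) / sum(w)
    yw <- sum(w * y) / sum(w)
    slope <- sum(w * (x - xw) * (y - yw)) / sum(w * (x - xw)^2)
    fitted[i] <- yw + slope * (x[i] - xw)   # fitted value at the focal point
  }
  fitted
}

x <- mtcars$wt
y <- mtcars$mpg
fit <- local_line_fit(x, y, span = 0.5)

# Connect the fitted values (in order of x) to draw the smooth curve
ord <- order(x)
plot(x, y, xlab = "wt", ylab = "mpg")
lines(x[ord], fit[ord], lwd = 3)
```

A larger span uses more neighbours in each local fit and therefore gives a smoother curve.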

Coded Scatterplots

The scatterplot() function can also be used to create coded scatterplots. For this purpose, a categorical variable is used to color the points or to assign a different symbol to each category. For example, let us plot prestige by income, coded by the type of occupation:

scatterplot(prestige ~ income | type)

Note that variables in the scatterplot are given in a formula-style (as y ~ x | groups).

The coded scatterplot indicates that the relationship between prestige and income may well be linear within occupation types. The slope of the relationship looks steepest for blue-collar (bc) occupations, and least steep for professional and managerial occupations.

Jittering Scatter Plots

Jittering the data by adding a small random quantity to each coordinate serves to separate the overplotted points.

data(Vocab)
attach(Vocab)
# without jittering
plot(education, vocabulary)
# with jittering
plot(jitter(education), jitter(vocabulary))

The degree of jittering can be controlled via the factor argument. For example, specifying factor = 2 doubles the jitter.

plot(jitter(education, factor = 2), jitter(vocabulary, factor = 2))
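The effect of the factor argument can also be checked numerically. The sketch below uses made-up integer data (edu) and a fixed seed, both assumptions for illustration. With the default settings, jitter() adds uniform noise of at most factor/5 times the smallest gap between distinct values, so for integer data the shift is at most 0.2 with factor = 1 and at most 0.4 with factor = 2.

```r
set.seed(1)                    # arbitrary seed, only for reproducibility
edu <- rep(0:20, times = 5)    # made-up integer data with heavy overplotting

j1 <- jitter(edu)              # default: factor = 1
j2 <- jitter(edu, factor = 2)  # twice the jitter

max(abs(j1 - edu))   # no more than 0.2
max(abs(j2 - edu))   # no more than 0.4
```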

Let’s add the least-squares and non-parametric regression lines.

abline(lm(vocabulary ~ education), lwd = 3, lty = 2)
lines(lowess(education, vocabulary, f = 0.2), lwd = 3)

The lowess() function (an acronym for locally weighted regression) returns the coordinates of the local regression curve, which lines() then draws. The f argument sets the span of the local regression, that is, the proportion of points that influence the smooth at each value.
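The influence of the span is easiest to see by drawing two lowess curves on the same plot. The sketch below uses the built-in mtcars data (an assumption, chosen so that no extra package is required) and compares f = 0.2 with the default f = 2/3; the colors and line types are arbitrary.

```r
x <- mtcars$wt
y <- mtcars$mpg

plot(x, y, xlab = "wt", ylab = "mpg")
abline(lm(y ~ x), lwd = 2, lty = 2)                 # least-squares line

lw_small <- lowess(x, y, f = 0.2)                   # small span: wigglier curve
lw_large <- lowess(x, y, f = 2/3)                   # default span: smoother curve
lines(lw_small, lwd = 2, col = "red")
lines(lw_large, lwd = 2, col = "blue")

# lowess() returns a list with components x (sorted) and y (smoothed values)
names(lw_small)      # "x" "y"
length(lw_small$x)   # 32
```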

Using these different kinds of graphical representations of relationships between variables may help to identify some hidden information (hidden due to overplotting).

FAQs about Scatter Plots in R

  1. How can one draw a scatter plot in the R language?
  2. What is the importance of scatter plots?
  3. What function can be used to draw scatter plots in R?
  4. What is the use of scatterplot() function in R?
  5. What is meant by a coded scatter plot?
  6. What are jittering scatter plots in R?
  7. What are the important arguments of a plot() function to draw a scatter plot?
