Statistical Computing and Graphics in R

# Data Structure

## Factors in R (Categorical Data)

Factors are used to represent categorical data in the R language. Factors can be ordered or unordered. One can think of a factor as an integer vector where each integer has a label. Modeling functions such as lm() and glm() treat factors specially. A factor stores the distinct categories as levels, and its values can come from either character or integer data.

Using factors with labels is better than using integers because factors are self-describing; having a variable that has values “Male” and “Female” is better than a variable having values 1 and 2.

### Creating a Simple Factor

Create a simple factor that has two levels:

# Simple factor with two levels
x <- factor(c("yes", "yes", "no", "yes", "no"))
# computes frequency of factors
table(x)

# strips out the class
unclass(x)

The order of the levels can be set using the levels argument to factor(). This can be important in linear modeling because the first level is used as the baseline level.

x <- factor(c("yes","yes","no","yes","no"), levels = c("yes","no"))
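
The effect of the levels argument can be checked by comparing the level order and the underlying integer codes. The following is a minimal sketch; the object names x_default and x_ordered are chosen here for illustration only.

```r
# Default: levels are sorted alphabetically, so "no" is the first level
x_default <- factor(c("yes", "yes", "no", "yes", "no"))
levels(x_default)      # "no" "yes"
as.integer(x_default)  # 2 2 1 2 1

# Explicit order: "yes" becomes the first (baseline) level
x_ordered <- factor(c("yes", "yes", "no", "yes", "no"),
                    levels = c("yes", "no"))
levels(x_ordered)      # "yes" "no"
as.integer(x_ordered)  # 1 1 2 1 2
```

The data are the same in both cases; only the mapping between labels and integer codes changes.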

The levels of a factor can be given new names using the labels argument (the abbreviated form label also works because R matches argument names partially). The labels replace the old values of the variable with new ones, matched in order to the levels argument. For example,

x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"), labels = c(1, 2))


x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"), labels = c("Level-1", "Level-2"))

x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"), labels = c("group-1", "group-2"))

Suppose you have a factor variable whose levels are numerical values, and you want to compute the mean. Applying mean() to the original numeric vector gives the average, but applying mean() to the factor returns NA with a warning, because a factor is not numeric. To calculate the mean of the original numeric values of the factor f, convert the values back using the levels() function. For example,

# vector
v <- c(10,20,20,50,10,20,10,50,20)
# vector converted to factor
f <- factor(v)

# mean of the vector
mean(v)

# mean of factor
mean(f)

mean(as.numeric(levels(f)[f]))
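
The conversion through levels() is needed because as.numeric() applied directly to a factor returns the internal integer codes, not the original values. The sketch below illustrates the difference; the names mean_codes and mean_values are introduced here for illustration.

```r
v <- c(10, 20, 20, 50, 10, 20, 10, 50, 20)
f <- factor(v)

# Integer codes of the levels "10", "20", "50"
as.numeric(f)                  # 1 2 2 3 1 2 1 3 2

# Original values recovered through the level labels
as.numeric(as.character(f))    # 10 20 20 50 10 20 10 50 20

mean_codes  <- mean(as.numeric(f))             # mean of the codes, not of the data
mean_values <- mean(as.numeric(levels(f))[f])  # mean of the original values
```

Here mean_values agrees with mean(v), while mean_codes does not.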

### Use of cut() Function to Create a Factor Variable

The cut() function can also be used to convert a numeric variable into a factor. The breaks argument describes how ranges of numbers are converted to factor values. If breaks is set to a single number, the resulting factor is created by dividing the range of the variable into that number of equal-length intervals. However, if a vector of values is given to breaks, the values in the vector are used as the breakpoints, and the number of levels of the resulting factor is one less than the number of values in that vector. For example,

attach(mtcars)
cut(mpg, breaks = 3)

factors <- cut(mpg, breaks = c(10, 18, 25, 30, 35) )

table(factors)


You will notice that the default label for factors produced by cut() function contains the actual range of values that were used to divide the variable into factors.
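
The interval labels can be replaced with descriptive names through the labels argument of cut(). The following is a small sketch on a made-up vector (scores is hypothetical data, not taken from mtcars), and the group names are arbitrary.

```r
# Hypothetical values for illustration
scores <- c(12, 17, 18, 24, 25, 29, 33)

# Default labels show the intervals, open on the left and closed on the right
default_groups <- cut(scores, breaks = c(10, 18, 25, 30, 35))
levels(default_groups)   # "(10,18]" "(18,25]" "(25,30]" "(30,35]"

# Custom labels: one label per interval (number of breaks minus one)
named_groups <- cut(scores, breaks = c(10, 18, 25, 30, 35),
                    labels = c("low", "medium", "high", "very high"))
table(named_groups)
```

Because the intervals are closed on the right by default, the value 18 falls into the first group and 25 into the second.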

## Introduction: Matrices in R

A matrix is a two-dimensional rectangular data set in which all columns must have the same mode (numeric, character, etc.) and the same length. It can be created by passing a vector to the matrix() function.

The general syntax of creating matrices in R is:

matrix_name <- matrix(vector, nrow = r, ncol = c,
byrow = FALSE, dimnames = list(char_vector_rownames,
char_vector_colnames)
)

byrow = TRUE indicates that the matrix will be filled by rows.

dimnames provides optional labels for the columns and rows.

## Creating Matrices in R

Following the general syntax of the matrix( ) function, let us create a matrix from a vector of the first 20 numbers.

### Example 1:

# Generate matrix having 5 rows and 4 columns
y1 <- matrix(1:20, nrow = 5, ncol = 4); y1

# Output
> y1
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

y2 <- matrix(1:20, nrow = 5, ncol = 4, byrow = FALSE); y2

# Output
> y2
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20

y3 <- matrix(1:20, nrow = 5, ncol = 4, byrow = TRUE); y3

# Output
> y3
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16
[5,]   17   18   19   20

### Example 2:

elements <- c(11, 23, 29, 67)
rownames <- c("R1", "R2")
colnames <- c("C1", "C2")
m1 <- matrix(elements, nrow = 2, ncol = 2, byrow = TRUE,
dimnames = list(rownames, colnames)
)

# Output
> m1
   C1 C2
R1 11 23
R2 29 67

Try

> nrow = 4 and ncol = 1, byrow = FALSE

Note the difference. You may also have some errors related to the number of rows or columns. Therefore, if you change the number of rows or columns then ensure that you have the same number of row names and column names too.
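
One possible version of this exercise is sketched below. The row names R1 to R4 and the single column name C1 are assumptions made here for illustration; the second call deliberately supplies too few row names to show the kind of error mentioned above.

```r
elements <- c(11, 23, 29, 67)

# Four rows and one column need four row names and one column name
m_col <- matrix(elements, nrow = 4, ncol = 1, byrow = FALSE,
                dimnames = list(c("R1", "R2", "R3", "R4"), "C1"))
dim(m_col)   # 4 1

# Supplying only two row names for four rows raises an error;
# tryCatch() captures the message instead of stopping the script
res <- tryCatch(
  matrix(elements, nrow = 4, ncol = 1,
         dimnames = list(c("R1", "R2"), "C1")),
  error = function(e) conditionMessage(e)
)
res
```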

#### Matrix Operations in R

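In the R language, there are operators and functions that can be used to perform computations on one or more matrices. Some basic matrix operations in R are addition (+), subtraction (-), matrix multiplication (%*%), element-wise division (/), transpose (t()), inverse (solve()), and extraction of the diagonal (diag()).

The following are some examples related to these operators and matrix functions.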

m1 <- matrix(c(11, 23, 9, 35), nrow = 2)
m2 <- matrix(c(5, 19, 11, 20), nrow =2)
m3 <- m1 + m2
m4 <- m1 - m2
m5 <- m1 %*% m2
m6 <- m1 / m2

m1t <- t(m1)
m1tminv <- solve(m1t %*% m1)
diag(m1tminv)

# Output

> m1
     [,1] [,2]
[1,]   11    9
[2,]   23   35
> m2
     [,1] [,2]
[1,]    5   11
[2,]   19   20
> m3
     [,1] [,2]
[1,]   16   20
[2,]   42   55
> m4
     [,1] [,2]
[1,]    6   -2
[2,]    4   15
> m5
     [,1] [,2]
[1,]  226  301
[2,]  780  953
> m6
         [,1]      [,2]
[1,] 2.200000 0.8181818
[2,] 1.210526 1.7500000
> m1t
     [,1] [,2]
[1,]   11   23
[2,]    9   35
> m1tminv
            [,1]        [,2]
[1,]  0.04121954 -0.02853175
[2,] -0.02853175  0.02051509
> diag(m1tminv)
[1] 0.04121954 0.02051509
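
A common source of confusion is the difference between the element-wise operator * and the matrix product %*%. The sketch below uses the same m1 and m2 as above; the names elementwise, matprod, m1_inv, and identity_check are introduced here for illustration.

```r
m1 <- matrix(c(11, 23, 9, 35), nrow = 2)
m2 <- matrix(c(5, 19, 11, 20), nrow = 2)

# Element-wise product: multiplies matching entries
elementwise <- m1 * m2

# Matrix product: rows of m1 times columns of m2
matprod <- m1 %*% m2

# solve() returns the inverse; multiplying back should give the identity
m1_inv <- solve(m1)
identity_check <- m1 %*% m1_inv
all.equal(identity_check, diag(2))
```

The comparison uses all.equal() instead of == because the product is only equal to the identity matrix up to floating-point rounding.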


Some other important functions can be used to perform computations on matrices in R. These matrix operations are described for a matrix $X$; you can use your own matrix instead.

Suppose we have a matrix X with the following elements.

X <- matrix(1:20, nrow = 4, ncol = 5)
X
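
As an illustration, a few base R functions that are often applied to such a matrix are sketched below. This selection is an assumption made here and may not match the list the original text intended.

```r
X <- matrix(1:20, nrow = 4, ncol = 5)

dim(X)            # number of rows and columns: 4 5
nrow(X)           # 4
ncol(X)           # 5
t(X)              # transpose, a 5 x 4 matrix
rowSums(X)        # 45 50 55 60
colMeans(X)       # 2.5 6.5 10.5 14.5 18.5
X[2, 3]           # element in row 2, column 3: 10
X[, 1]            # first column: 1 2 3 4
apply(X, 1, max)  # maximum of each row: 17 18 19 20
```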

#### Obtaining $\beta$’s using Matrices in R

Suppose we have a dataset with a response variable and a few regressors. There are many ways to create the data (or variables): one can create a vector for each variable, a data frame for all of the variables, a matrix, or read data stored in a file.

Here we try it using vectors, then bind the vectors where required.

y  <- c(5, 6, 7, 9, 8, 4, 3, 2, 1, 6, 0, 7)
x1 <- c(4, 5, 6, 7, 8, 3, 4, 9, 9, 8, 7, 5)
x2 <- c(10, 22, 23, 10, 11, 14, 15, 16, 17, 12, 11, 17)
x  <- cbind(1, x1, x2)

The cbind( ) function is used to create a matrix x. Note that 1 is also bound to get the intercept term (the model with the intercept term). Let us compute $\beta$’s from OLS using matrix functions and operators.

xt <- t(x)
xtx <- xt %*% x
xtxinv <- solve(xtx)
xty <- xt %*% y
b <- xtxinv %*% xty

The output is

> x
        x1 x2
 [1,] 1  4 10
 [2,] 1  5 22
 [3,] 1  6 23
 [4,] 1  7 10
 [5,] 1  8 11
 [6,] 1  3 14
 [7,] 1  4 15
 [8,] 1  9 16
 [9,] 1  9 17
[10,] 1  8 12
[11,] 1  7 11
[12,] 1  5 17
> xt
   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
      1    1    1    1    1    1    1    1    1     1     1     1
x1    4    5    6    7    8    3    4    9    9     8     7     5
x2   10   22   23   10   11   14   15   16   17    12    11    17
> xtx
         x1   x2
    12   75  178
x1  75  515 1103
x2 178 1103 2854
> xtxinv
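
The matrix computation above corresponds to the OLS estimator $\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y$. One way to check the result, sketched here as an assumption about how a reader might verify it, is to compare it with the coefficients returned by lm() on the same data.

```r
y  <- c(5, 6, 7, 9, 8, 4, 3, 2, 1, 6, 0, 7)
x1 <- c(4, 5, 6, 7, 8, 3, 4, 9, 9, 8, 7, 5)
x2 <- c(10, 22, 23, 10, 11, 14, 15, 16, 17, 12, 11, 17)
x  <- cbind(1, x1, x2)

# Coefficients from the matrix formula
b <- solve(t(x) %*% x) %*% t(x) %*% y

# Coefficients from the built-in linear model function
fit <- lm(y ~ x1 + x2)

# The two sets of estimates should agree up to rounding error
all.equal(as.vector(b), unname(coef(fit)))
```

lm() does not form the inverse explicitly, so small numerical differences between the two approaches are expected and are absorbed by all.equal().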



## Creating, Subsetting, and Vectorization in R

### Creating Vectors in R Using c() Function

The c() function can be used to create vectors of objects. It concatenates its arguments into a one-dimensional object (comparable, in a sense, to a row or column matrix). The following are some examples of creating different types of vectors in R.

# Numeric vector
x <- c(1, 2, 5, 0.5, 10, 20, pi)

# Logical vector
x <- c(TRUE, FALSE, FALSE, T, T, F)

# Character vector
x <- c("a", "z", "good", "bad", "null hypothesis")

# Integer vector
x <- 9 : 29   # (colon operator is used)
x <- c(1L, 5L, 0L, 15L)

# Complex vector
x <- c(1+0i, 2+4i, 0+0i)

### Using vector() Function

The vector() function creates a vector of $n$ elements filled with a default value: zero for a numeric vector, an empty string for a character vector, FALSE for a logical vector, and 0+0i for a complex vector.

# Numeric vector of length 10 (default is zero)
x <- vector("numeric", length = 10)

# Integer vector of length 10 (default is integer zeros)
x <- vector("integer", length = 10)

# Character vector of length 10 (default is empty string)
x <- vector("character", length = 10)

# Logical vector of length 10 (default is FALSE)
x <- vector("logical", length = 10)

# Complex vector of length 10 (default is 0+0i)
x <- vector("complex", length = 10)

### Creating Vectors with Mixed objects

When objects of different types are mixed in a vector, coercion occurs: R converts all elements to the most flexible type present, in the order logical, numeric, character.

The following are examples:

# coerce to character vector
y <- c(1.2, "good")
y <- c("a", T)

# coerce to a numeric vector
y <- c(T, 2)

As the above examples show, coercion makes every element of the vector belong to the same class.
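
The result of implicit coercion can be inspected with class(). The following short sketch repeats the mixed vectors from above and adds one integer and double combination for illustration.

```r
class(c(1.2, "good"))   # "character"
class(c("a", TRUE))     # "character"
class(c(TRUE, 2))       # "numeric"
class(c(1L, 2.5))       # "numeric" (integer mixed with double)

c(TRUE, 2)              # TRUE becomes 1: 1 2
c("a", TRUE)            # TRUE becomes the string "TRUE"
```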

### Explicitly Coercing Objects to Other Class

Objects can be explicitly coerced from one class to another using the as.character(), as.numeric(), as.integer(), as.complex(), and as.logical() functions. For example,

x <- 0:6
as.numeric(x)
as.logical(x)
as.character(x)
as.complex(x)

Note that nonsensical coercion results in NAs (missing values). For example,

x <- c("a", "b", "c")
as.numeric(x)
as.logical(x)
as.complex(x)
as.integer(x)
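
In practice, the NAs produced by a failed coercion are usually counted or located before further analysis. The sketch below uses a hypothetical vector mixed, and wraps the conversion in suppressWarnings() only to keep the output quiet.

```r
# Hypothetical input where one entry is not a valid number
mixed <- c("1", "2.5", "three")

# as.numeric() warns about the entry it cannot convert
converted <- suppressWarnings(as.numeric(mixed))
converted                 # 1.0 2.5 NA

# Count and locate the failed conversions
sum(is.na(converted))     # 1
which(is.na(converted))   # 3

# as.logical() recognises only a few spellings of TRUE and FALSE
as.logical(c("TRUE", "false", "T", "yes"))   # TRUE FALSE TRUE NA
```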

### Vectorization in R

Many operations in the R language are vectorized. Arithmetic operators (+, -, *, and /) and comparison operators (such as >=, <, and ==) are applied element by element. For example,

x <- 1 : 4
y <- 6 : 9
x + y
x - y
x * y
x / y
x >= 2
x < 3
y == 8

Without vectorization (as in many other languages), one has to use a for loop to perform an element-by-element operation on vectors.
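
To make the contrast concrete, the sketch below computes the same sum once with an explicit loop and once with the vectorized operator. The names z_loop and z_vec are introduced here for illustration.

```r
x <- 1:4
y <- 6:9

# Loop version: fill a pre-allocated result one element at a time
z_loop <- numeric(length(x))
for (i in seq_along(x)) {
  z_loop[i] <- x[i] + y[i]
}

# Vectorized version: a single expression
z_vec <- x + y

z_loop   # 7 9 11 13
z_vec    # 7 9 11 13
```

Both versions give the same values; the vectorized form is shorter and is generally the idiomatic choice in R.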

### Subsetting Vectors in R Language

Subsetting a vector means extracting some of its elements. Square brackets ([ ]) are used for this purpose. For example,

x <- c(1, 6, 10, -15, 0, 13, 5, 2, 10, 9)

# Subsetting  Examples
x[1]   # extract first element of x vector
x[1:5] # extract first five values of x
x[-1]  # extract all values except first
x[x > 2] # extracts all elements that are greater than 2
head(x)  # extracts first 6 elements of x
tail(x)  # extracts last 6 elements of x

x[x > 5 & x < 10]  # extracts elements that are greater than 5 but less than 10

One can also use the subset() function to extract the desired elements using logical operators. For example,

subset(x, x > 5)
subset(x, x > 5 & x < 10)
subset(x, !x < 0 )
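
Two related idioms are sketched below on the same vector: which() returns the positions of the matching elements instead of their values, and a negative range drops several elements at once. This is an additional illustration, not part of the original examples.

```r
x <- c(1, 6, 10, -15, 0, 13, 5, 2, 10, 9)

# Positions of the elements greater than 5
which(x > 5)       # 2 3 6 9 10

# Values at those positions, same result as x[x > 5]
x[which(x > 5)]    # 6 10 13 10 9

# Drop the first three elements
x[-(1:3)]          # -15 0 13 5 2 10 9

# For a vector without missing values, subset() and [ ] agree
identical(subset(x, x > 5), x[x > 5])
```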

## Reading and Writing JSON files in R

A JSON file stores simple data structures and objects in JavaScript Object Notation (JSON) format. JSON is a standard, lightweight data interchange format that is primarily used for transmitting data between a web application and a server. A JSON file is a text file that is language independent, self-describing, and easy to understand. Here we will discuss reading and writing JSON files in the R language using the R package “rjson”.

Since the JSON file format is text only, it can be sent to and from a server and used as a data format by any programming language. The data in a JSON file is nested and hierarchical. Let us start reading and writing JSON files in R.

## Reading JSON files in R

R can read JSON files using the rjson package, which must be installed first.

Issue the following command in the R console to install the rjson package.

install.packages("rjson")

Let us create a JSON file. Copy the following lines into a text editor such as Notepad. Save the file with a .json extension, choosing the file type as All Files (*.*). Suppose the file is named “data.json” and is stored on the “D:” drive.

{
"ID":["1","2","3","4","5","6","7","8" ],
"Name":["Rick","Dan","Michelle","Ryan","Gary","Nina","Simon","Guru" ],
"Salary":["623.3","515.2","611","729","843.25","578","632.8","722.5" ],

"StartDate":[ "1/1/2012","9/23/2013","11/15/2014","5/11/2014","3/27/2015","5/21/2013",
"7/30/2013","6/17/2014"],
"Dept":[ "IT","Operations","IT","HR","Finance","IT","Operations","Finance"]
}

To read a JSON file, the rjson package needs to be loaded. Use fromJSON( ) function to read the file.

# Load the rjson package.
library(rjson)
# Give the data file name to the function.
result <- fromJSON(file = "D:\\data.json")
# Print the result.
print(result)

The JSON data can now be converted to a data frame for further analysis using the as.data.frame() function.

# Convert JSON file to a data frame.
json_data_frame <- as.data.frame(result)
print(json_data_frame)
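
Because every value in the example file is stored as a string, the columns of the resulting data frame are character columns. The sketch below shows one possible way to convert them; it assumes the rjson package is installed, reads a shortened copy of the data from an inline string (json_text is a name introduced here) instead of from the file, and assumes the dates are in month/day/year order.

```r
library(rjson)

# Shortened copy of the example data, supplied as a string
json_text <- '{
  "ID": ["1", "2"],
  "Salary": ["623.3", "515.2"],
  "StartDate": ["1/1/2012", "9/23/2013"]
}'

result <- fromJSON(json_str = json_text)
df <- as.data.frame(result, stringsAsFactors = FALSE)

# Convert the text columns to more useful types
df$ID        <- as.integer(df$ID)
df$Salary    <- as.numeric(df$Salary)
df$StartDate <- as.Date(df$StartDate, format = "%m/%d/%Y")

str(df)
```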

## Writing JSON objects to a .json file

To write JSON Object to file, the toJSON() function from the rjson library can be used to prepare a JSON object and then use the write() function for writing the JSON object to a local file.

Let us create a list of objects as follows:

list1 <- vector(mode="list", length=2)
list1[[1]] <- c("apple", "banana", "rose")
list1[[2]] <- c("fruit", "fruit", "flower")

Convert the above list to JSON:

jsonData <- toJSON(list1)

Write the JSON object to a file:

write(jsonData, "output.json")
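
A simple way to confirm that the file was written correctly is to read it back. The sketch below assumes the rjson package is installed and writes to a temporary file (out_file is a name introduced here) so that it does not overwrite anything in the working directory.

```r
library(rjson)

list1 <- vector(mode = "list", length = 2)
list1[[1]] <- c("apple", "banana", "rose")
list1[[2]] <- c("fruit", "fruit", "flower")

# Convert the list to a JSON string and write it to a temporary file
json_data <- toJSON(list1)
out_file <- tempfile(fileext = ".json")
write(json_data, out_file)

# Read the file back and inspect the structure
read_back <- fromJSON(file = out_file)
length(read_back)
read_back[[1]]
```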
