Exploring Data Distribution in R


Suppose we have univariate data and need to examine its distribution. A variety of tools and techniques are available for this. The simplest is to look at the numbers themselves: summary() and fivenum() give numerical summaries, while stem() produces a stem-and-leaf display of the values. This post covers the basics of exploring data distribution in the R Language.

Five Number Summary and Stem and Leaf Plot

One can use both numeric and visual tools to explore a data distribution. For example, using the eruptions variable of the built-in faithful data set:

attach(faithful)
summary(eruptions)

## Output
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.600   2.163   4.000   3.488   4.454   5.100 

fivenum(eruptions)

## Output
[1] 1.6000 2.1585 4.0000 4.4585 5.1000
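
The quartiles reported by summary() and fivenum() differ slightly (2.163 versus 2.1585) because the two functions use different definitions: summary() relies on quantile() with its default interpolation, while fivenum() returns Tukey's hinges. A minimal sketch of the comparison, assuming faithful$eruptions is accessed directly instead of through attach():

```r
# Compare the two quartile definitions on the built-in faithful data
eruptions <- faithful$eruptions

q <- quantile(eruptions, probs = c(0.25, 0.75)) # definition used by summary()
h <- fivenum(eruptions)[c(2, 4)]                # Tukey's lower and upper hinges

q
h
```

For large samples the two definitions usually agree closely, so the choice rarely changes the interpretation.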

stem(eruptions)

Histogram and Density Plot

A stem-and-leaf display is like a histogram, and R's hist() function plots histograms directly. The boxplot() function can also be used to visualize the distribution of the data.

hist(eruptions)

# make the bins smaller, and make a plot of density
hist(eruptions, seq(1.6, 5.2, 0.2), prob = TRUE)
lines(density(eruptions, bw = 0.1))
rug(eruptions) # show the actual data points

The density() function can produce more elegant density plots, and the line added above comes from it. The bandwidth bw was chosen by trial and error because the default gives too much smoothing (it usually does for “interesting” densities). Better automated methods of bandwidth selection are also available; in the above example, bw = "SJ" gives good results.
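
As a rough illustration of the bandwidth remark, the sketch below overlays the default density estimate and the bw = "SJ" estimate on the same histogram. The variable names d_default and d_sj are chosen here for illustration only.

```r
# Compare the default bandwidth with the Sheather-Jones ("SJ") selector
eruptions <- faithful$eruptions

d_default <- density(eruptions)            # default bandwidth rule
d_sj      <- density(eruptions, bw = "SJ") # automated SJ bandwidth

# The bandwidths actually used by each estimate
c(default = d_default$bw, SJ = d_sj$bw)

hist(eruptions, breaks = seq(1.6, 5.2, 0.2), prob = TRUE)
lines(d_default, lty = 2) # smoother, flattens the two modes
lines(d_sj)               # follows the two modes more closely
rug(eruptions)
```

The dashed curve should look noticeably smoother than the solid one, which is the over-smoothing mentioned above.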

Empirical Cumulative Distribution Function

One can also plot the empirical cumulative distribution function by using the function ecdf.

plot(ecdf(eruptions), do.points = FALSE, verticals = TRUE)
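
Besides being plotted, the object returned by ecdf() is itself a function that can be evaluated at any point. A small sketch, where Fn is just an illustrative name:

```r
# ecdf() returns a step function giving the proportion of values <= x
eruptions <- faithful$eruptions
Fn <- ecdf(eruptions)

Fn(3)                # proportion of eruptions lasting at most 3 minutes
mean(eruptions <= 3) # the same proportion computed directly
```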

For the right-hand mode (eruptions of longer than 3 minutes), let us fit a normal distribution and overlay the fitted CDF.

long <- eruptions[eruptions > 3]
plot(ecdf(long), do.points = FALSE, verticals = TRUE)
x <- seq(3, 5.4, 0.01)
lines(x, pnorm(x, mean = mean(long), sd = sd(long)), lty = 3)

# Normal Q-Q plot of the same data
par(pty = "s") # square plotting region
qqnorm(long)
qqline(long)

The normal quantile-quantile (Q-Q) plot of long shows a reasonable fit but a shorter right tail than one would expect from a normal distribution. One can compare it with some simulated data from a t-distribution.

x <- rt(250, df = 5)
qqnorm(x)
qqline(x)

This will usually show longer tails than expected for a normal distribution, since x is a random sample from a t-distribution with 5 degrees of freedom.

Normality Test in R

To test formally whether the data follow a normal distribution, one can use the Shapiro-Wilk normality test:

shapiro.test(eruptions)

## Output
		Shapiro-Wilk normality test

data:  eruptions
W = 0.84592, p-value = 9.036e-16

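The very small p-value leads to rejecting normality for the full eruptions data, as expected given its two modes. The Kolmogorov-Smirnov test, available through the ks.test() function, can also compare the data with a normal distribution. Note that calling it with "pnorm" alone, as below, compares the data with the standard normal distribution (mean 0, standard deviation 1), which explains the very large value of D; to compare with a fitted normal distribution, pass the mean and sd arguments as well.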

ks.test(eruptions, "pnorm")

## Output
        Asymptotic one-sample Kolmogorov-Smirnov test

data:  eruptions
D = 0.94857, p-value < 2.2e-16
alternative hypothesis: two-sided

Warning message:
In ks.test.default(eruptions, "pnorm") :
  ties should not be present for the one-sample Kolmogorov-Smirnov test
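
For comparison, here is a hedged sketch of the Kolmogorov-Smirnov test against a normal distribution fitted to the long eruptions only. Because the mean and standard deviation are estimated from the same sample, the usual distribution theory does not strictly apply, so the p-value should be read as approximate.

```r
eruptions <- faithful$eruptions
long <- eruptions[eruptions > 3]

# Test against N(mean(long), sd(long)) instead of the standard normal.
# suppressWarnings() silences the warning about ties seen above.
res <- suppressWarnings(
  ks.test(long, "pnorm", mean = mean(long), sd = sd(long))
)

res$statistic # much smaller D than in the test against N(0, 1)
res$p.value
```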

Combining the above techniques gives valuable insight into the distribution of univariate data, helps identify potential outliers, and lets one assess normality assumptions before further statistical analysis.


cbind and rbind Forming Partitioned Matrices in R

Introduction to Forming Partitioned Matrices in R

In the R language, partitioned matrices (also known as block matrices) can easily be formed by combining smaller matrices or vectors into larger ones. This is very useful for organizing and manipulating data, particularly when dealing with large matrices.

Matrices can be built up from other matrices or vectors by using the functions cbind() and rbind(). The cbind() function binds vectors or matrices together column-wise (horizontally), while the rbind() function binds them together row-wise (vertically).

cbind() Function

The cbind() function combines matrices or vectors column-wise. All matrix arguments must have the same number of rows.

A <- matrix(1:4, nrow = 2)
B <- matrix(5:8, nrow = 2)
C <- cbind(A, B)
C

## Output
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8

The arguments to the cbind() function must be either vectors of any length or matrices with the same number of rows (that is, the same column size). The result is a matrix with the concatenated arguments, here A and B, forming the columns.

If some of the arguments to the cbind() function are vectors, they may be shorter than the column size (number of rows) of any matrices present, in which case they are cyclically extended to match the matrix column size (or the length of the longest vector if no matrices are given).
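
The cyclic extension of short vectors can be seen in a small sketch. The objects M and V are illustrative names, and a 3 x 2 matrix is used here instead of the 2 x 2 matrices above.

```r
A <- matrix(1:6, nrow = 3) # 3 x 2 matrix

# The single value 0 is recycled to length 3 to match the rows of A
M <- cbind(0, A, c(10, 20, 30))
M
dim(M)

# With vectors only, shorter ones are recycled to the longest length
V <- cbind(1:4, 1:2)
V
```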

rbind() Function

The rbind() function combines matrices or vectors row-wise. All matrix arguments must have the same number of columns.

A <- matrix(1:4, nrow = 2)
B <- matrix(5:8, nrow = 2)
C <- rbind(A, B)
C

## Output
     [,1] [,2]
[1,]    1    3
[2,]    2    4
[3,]    5    7
[4,]    6    8

The rbind() function does the corresponding operation for rows. In this case, any vector arguments, possibly cyclically extended, are taken as row vectors.

For vector and matrix arguments, the result of cbind() or rbind() always has matrix status. Hence, rbind(x) and cbind(x) are the simplest ways to explicitly treat a vector x as a row or column matrix, respectively.
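
A short check of the matrix status of the result, using an illustrative vector x:

```r
x <- c(2, 4, 6)

dim(x)        # NULL: a plain vector has no dimensions
dim(cbind(x)) # 3 1: a column matrix
dim(rbind(x)) # 1 3: a row matrix
```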

Creating a 2 x 2 Block Matrix using cbind() and rbind()

# Create four smaller matrices
A <- matrix(1:4, nrow = 2, ncol = 2)
B <- matrix(5:8, nrow = 2, ncol = 2)
C <- matrix(9:12, nrow = 2, ncol = 2)
D <- matrix(13:16, nrow = 2, ncol = 2)

# Combine them in different block layouts
m1 <- rbind(cbind(A, B), cbind(C, D)) # 4 x 4: 2 x 2 blocks, [A B; C D]
m2 <- cbind(cbind(A, B), cbind(C, D)) # 2 x 8: one row of blocks, [A B C D]
m3 <- cbind(rbind(A, B), rbind(C, D)) # 4 x 4: 2 x 2 blocks, [A C; B D]
m4 <- rbind(rbind(A, B), rbind(C, D)) # 8 x 2: one column of blocks

Visualizing Partitioned Matrices

To visualize partitioned matrices, one can use packages such as ggplot2 or lattice. For simple visualizations, base R functions like image() or heatmap() are enough.
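
One possible base R sketch with image() is shown below. The orientation fix and the block boundary lines are choices made for this illustration, not the only way to draw such a plot.

```r
A <- matrix(1:4, nrow = 2)
B <- matrix(5:8, nrow = 2)
C <- matrix(9:12, nrow = 2)
D <- matrix(13:16, nrow = 2)
m <- rbind(cbind(A, B), cbind(C, D)) # 4 x 4 block matrix [A B; C D]

# image() draws rows along the x-axis, so reverse the rows and transpose
# to show the matrix in the same orientation as it is printed
image(t(m[nrow(m):1, ]), axes = FALSE, main = "Block matrix [A B; C D]")

# Mark the block boundaries (image() places the cell centers on [0, 1])
abline(h = 0.5, v = 0.5, lwd = 2)
```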

Applications of Partitioned Matrices

  • Organizing Data: Grouping related data into blocks can improve readability and understanding.
  • Matrix Operations: Performing operations on submatrices can be more efficient than working with the entire matrix.
  • Linear Algebra: Many linear algebra operations, such as matrix multiplication and inversion, can be performed on partitioned matrices using block matrix operations.
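
As a small illustration of block matrix operations, the sketch below checks that a product of two partitioned matrices can be computed block by block. The second matrix N and its block layout are made up for this example.

```r
A <- matrix(1:4, nrow = 2)
B <- matrix(5:8, nrow = 2)
C <- matrix(9:12, nrow = 2)
D <- matrix(13:16, nrow = 2)

M <- rbind(cbind(A, B), cbind(C, D)) # [A B; C D]
N <- rbind(cbind(D, C), cbind(B, A)) # [D C; B A], an illustrative layout

# Blocks of M %*% N computed from the smaller matrices
top_left     <- A %*% D + B %*% B
bottom_right <- C %*% C + D %*% A

P <- M %*% N
all.equal(top_left, P[1:2, 1:2])
all.equal(bottom_right, P[3:4, 3:4])
```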

Practical Applications of Matrices

  • Block Matrix Operations: Perform matrix operations on individual blocks, such as multiplication, inversion, or solving linear systems.
  • Statistical Modeling: Use partitioned matrices to represent complex statistical models, such as mixed-effects models.
  • Sparse Matrix Representation: Efficiently store and manipulate large sparse matrices by partitioning them into smaller, denser blocks.
  • Machine Learning: Organize and process large datasets in a structured manner.

By effectively using cbind() and rbind(), one can create complex matrix structures in R that are useful in a wide range of data analysis, modeling, and computational tasks.

Forming Partitioned matrices in R Language https://rfaqs.com


The Class of an Object In R Language

Introduction to Class of an Object in R

In the R language, all objects have a class, which can be reported using the class() function. For simple vectors, this is just the mode, such as numeric, character, list, or logical. Other possible values of the class include array, matrix, factor, and data.frame.

A special attribute known as the class of the object is used to allow for an object-oriented style of programming in the R language. For example, an object of class “data.frame” will be printed in a certain way, the plot() function will display it graphically in a certain way, and other generic functions such as summary() will react to it as an argument in a way sensitive to its class.

How to Determine the Class of an Object in R

The class() function is used to determine the class of an object. For example,

class(mtcars) # "data.frame"

x <- c(1, 2, 3)
class(x) # "numeric"

y <- c("a", "b", "c")
class(y) # "character"

z <- c(TRUE, FALSE)
class(z) # "logical"

Common Object Classes in R

Here are some of the most common object classes in R:

  1. Integer: Represents integer values.
  2. Numeric: Represents numerical data.
  3. Character: Represents text strings.
  4. Factor: Represents categorical data.
  5. Logical: Represents logical values (TRUE or FALSE).
  6. Date: Represents dates.
  7. List: Represents a collection of objects of different types.
  8. Matrix: Represents a two-dimensional array of values of a single type.
  9. Data Frame: Represents a tabular data structure with rows and columns.
  10. POSIXct: Represents date and time.
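
The classes listed above can be checked directly with class(). The example values are arbitrary:

```r
class(1L)                       # "integer"
class(3.14)                     # "numeric"
class("text")                   # "character"
class(factor(c("low", "high"))) # "factor"
class(TRUE)                     # "logical"
class(Sys.Date())               # "Date"
class(list(1, "a"))             # "list"
class(matrix(1:4, nrow = 2))    # "matrix" "array" in R >= 4.0.0
class(mtcars)                   # "data.frame"
class(Sys.time())               # "POSIXct" "POSIXt"
```

Note that an object can carry more than one class, as the matrix and POSIXct examples show.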

It is important to note that one can define one’s own classes using the S3 or S4 object-oriented systems. This allows the user to define specific methods and behavior for different objects.
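
A minimal S3 sketch of a user-defined class follows. The class name "temperature", its constructor, and its fields are hypothetical and serve only to illustrate the idea.

```r
# Constructor for a hypothetical S3 class "temperature"
temperature <- function(value, unit = "C") {
  structure(list(value = value, unit = unit), class = "temperature")
}

# print() method, picked automatically for objects of class "temperature"
print.temperature <- function(x, ...) {
  cat("Temperature:", x$value, x$unit, "\n")
  invisible(x)
}

t1 <- temperature(21.5)
class(t1)
print(t1)
```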

Why Classes Matter in R Language

The class of an object determines how R behaves when a user applies functions to it. In simple words, a class defines the object’s type and determines the operations that can be performed on it. For instance:

  • Arithmetic operations: These are typically performed on numeric objects.
  • String manipulation: These are performed on character objects.
  • Statistical analysis: These are often performed on numeric or factor objects.

The importance of classes can be described as:

  • Method Dispatch: The class of an object in R language determines which function to call when you apply a generic function to it. For example, the summary() function behaves differently for numeric vectors, data frames, and linear models.
  • Object-Oriented Programming: R supports object-oriented programming, and classes are fundamental to this paradigm. One can create custom classes to represent complex data structures and define methods to operate on these objects.
  • Data Manipulation: Understanding the class of an object helps one to choose the appropriate functions for data manipulation. For instance, one might use different functions for subsetting, sorting, and summarizing numeric vectors, character vectors, and data frames.
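
Method dispatch can be sketched with a small generic function. The generic describe() and its methods are invented for this illustration and are not part of base R.

```r
# A hypothetical generic that dispatches on the class of its argument
describe <- function(x, ...) UseMethod("describe")

describe.numeric   <- function(x, ...) paste("numeric vector with mean", mean(x))
describe.character <- function(x, ...) paste("character vector of length", length(x))
describe.default   <- function(x, ...) paste("object of class", class(x)[1])

describe(c(1, 2, 3))  # uses describe.numeric
describe(c("a", "b")) # uses describe.character
describe(mtcars)      # no data.frame method, so describe.default is used
```

The same mechanism is what makes summary() and print() behave differently for vectors, data frames, and model objects.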

Remove the Class of an Object in R

To temporarily remove the effect of a class from an object, one can use the unclass() function. For example, since mtcars has the class “data.frame”, typing just mtcars at the command prompt will print it in data frame form, which is rather like a matrix.

mtcars

In contrast, typing unclass(mtcars) will print it as an ordinary list.

unclass(mtcars)

## Output
$mpg
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4

$cyl
 [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4

$disp
 [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8 275.8 275.8 472.0 460.0 440.0  78.7  75.7  71.1 120.1 318.0 304.0 350.0 400.0  79.0 120.3  95.1 351.0 145.0 301.0 121.0

$hp
 [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52  65  97 150 150 245 175  66  91 113 264 175 335 109

$drat
 [1] 3.90 3.90 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 3.92 3.07 3.07 3.07 2.93 3.00 3.23 4.08 4.93 4.22 3.70 2.76 3.15 3.73 3.08 4.08 4.43 3.77 4.22 3.62 3.54 4.11

$wt
 [1] 2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440 4.070 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520 3.435 3.840 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780

$qsec
 [1] 16.46 17.02 18.61 19.44 17.02 20.22 15.84 20.00 22.90 18.30 18.90 17.40 17.60 18.00 17.98 17.82 17.42 19.47 18.52 19.90 20.01 16.87 17.30 15.41 17.05 18.90 16.70 16.90 14.50 15.50 14.60 18.60

$vs
 [1] 0 0 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 1 0 0 0 1

$am
 [1] 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1

$gear
 [1] 4 4 4 3 3 3 3 4 4 4 4 3 3 3 3 3 3 4 4 4 3 3 3 3 3 4 5 5 5 5 5 4

$carb
 [1] 4 4 1 1 2 1 4 2 2 4 4 3 3 3 4 4 4 1 2 1 1 2 2 4 2 1 2 2 4 6 8 2

attr(,"row.names")
 [1] "Mazda RX4"       "Mazda RX4 Wag"    "Datsun 710"       "Hornet 4 Drive"   "Hornet Sportabout"
 [6] "Valiant"         "Duster 360"       "Merc 240D"        "Merc 230"         "Merc 280"
[11] "Merc 280C"       "Merc 450SE"       "Merc 450SL"       "Merc 450SLC"      "Cadillac Fleetwood" 
[16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"      "Honda Civic"     "Toyota Corolla" 
[21] "Toyota Corona"    "Dodge Challenger"  "AMC Javelin"     "Camaro Z28"      "Pontiac Firebird"   
[26] "Fiat X1-9"       "Porsche 914-2"     "Lotus Europa"     "Ford Pantera L"  "Ferrari Dino"       
[31] "Maserati Bora"   "Volvo 142E"
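
A compact way to see the effect of unclass() without printing the whole object is to inspect the class before and after. The factor f is an extra illustration:

```r
class(mtcars)          # "data.frame"
class(unclass(mtcars)) # "list"

# unclass() returns a modified copy; mtcars itself is unchanged
inherits(mtcars, "data.frame")

# For a factor, unclass() reveals the underlying integer codes
f <- factor(c("b", "a", "b"))
unclass(f)
```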

Changing the Class of an Object in R

While it’s generally not recommended to manually change an object’s class, there are functions like as.numeric(), as.character(), as.factor(), etc., that can coerce objects into different classes. However, be cautious, as inappropriate coercion can lead to unexpected results.
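
Two common coercion surprises, sketched with made-up values:

```r
f <- factor(c("10", "20", "10"))

as.numeric(f)               # 1 2 1: the internal codes, not the labels
as.numeric(as.character(f)) # 10 20 10: convert via character first

# Text that is not a number becomes NA (normally with a warning,
# which is silenced here)
suppressWarnings(as.numeric(c("1.5", "abc"))) # 1.5 NA
```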

Understanding object classes is fundamental to effective R programming. By recognizing the class of an object, you can choose the appropriate functions and operations to work with it, work effectively with R’s diverse data structures, and leverage its data analysis capabilities.

FAQs about Class of an Object

  1. What is the concept of class in R language?
  2. How can one check the class of an object?
  3. For different data types (modes) what are the common classes used in R?
  4. How can one change the class of an object?
  5. Give examples to determine the class of different objects.
