Introduction to Data Frame in R Language
In the R programming language, a data frame is a two-dimensional data structure made up of rows and columns. Every column must have the same number of rows. The intersection of a row and a column can be thought of as a cell, and each cell is identified by its row number and column number.
A data frame in the R programming language has:
- Rows: Represent individual observations or data points.
- Columns: Represent variables or features being measured. Each column holds values for a single variable across all observations.
- Data Types: Each column holds values of one type, while different columns can have different types, including numeric, character, logical (TRUE/FALSE), and factors (categorical variables).
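As a minimal sketch of these three ideas, the snippet below builds a small, made-up data frame and inspects its rows, columns, and column types. The name `students` and all of its values are illustrative assumptions, not part of the article's examples.

```r
# Hypothetical data frame used only for illustration
students <- data.frame(
  name   = c("Ali", "Sara", "Omar"),   # character column
  age    = c(23, 24, 25),              # numeric column
  passed = c(TRUE, FALSE, TRUE),       # logical column
  grade  = factor(c("A", "C", "B")),   # factor (categorical) column
  stringsAsFactors = FALSE             # keep 'name' as character in older R versions too
)

nrow(students)           # number of rows (observations)
ncol(students)           # number of columns (variables)
sapply(students, class)  # data type of each column

# A single cell is addressed by its row number and column number
students[2, 2]           # row 2, column 2: the age of the second student
```

The call to `sapply(students, class)` is a quick way to confirm that each column really has the type you expect before doing any analysis.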
One can modify, extract, and re-arrange the contents of a data frame; this process is called data frame manipulation. A data frame is created with the data.frame() function, using the general syntax below.
Data Frame Syntax in R
The general syntax for creating a data frame in R is:
    df <- data.frame(first_column  = c(data values separated by commas),
                     second_column = c(data values separated by commas),
                     ......
    )
An example of a data frame in R is:
    df <- data.frame(
      age   = c(23, 24, 25, 26, 23, 25, 29, 20),
      marks = c(99, 80, 67, 56, 98, 65, 45, 77),
      grade = c("A", "A", "C", "D", "A", "B", "F", "B")
    )
    print(df)
One can name or rename the columns and rows of the data frame:
    # Naming / renaming columns
    colnames(df) <- c("Age", "Score", "Grade")
    # Naming / renaming rows
    row.names(df) <- c("1st", "2nd", "3rd", "4th", "5th", "6th", "7th", "8th")
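Beyond renaming, the manipulations mentioned earlier (modifying, extracting, and re-arranging) can be sketched as follows. This is only an illustration: the data frame `scores`, the derived column `passed`, and the pass mark of 60 are assumptions made for the example.

```r
# Small illustrative data frame (values are made up)
scores <- data.frame(
  age   = c(23, 24, 25, 26),
  marks = c(99, 80, 67, 56)
)

# Modify: add a new column derived from an existing one
scores$passed <- scores$marks >= 60

# Extract: a single column as a vector, or a block of rows and columns
scores$marks
scores[1:2, c("age", "marks")]

# Re-arrange: sort the rows by marks in ascending order
sorted <- scores[order(scores$marks), ]
sorted
```

Note that `order()` returns row positions, so indexing with it re-arranges whole rows and keeps each student's values together.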
Subsetting a Data Frame
The subset() function creates a new data frame that contains only the selected rows and columns of the original. With the select argument, only the named columns are kept and all other columns are dropped; the original data frame is unchanged unless the result is assigned back to it. To understand subsetting, let us first create a data frame.
    # Creating a data frame (the columns are named row1, row2, and row3)
    df <- data.frame(row1 = 0:3, row2 = 3:6, row3 = 6:9)
    # Creating a subset that keeps only the columns row1 and row2
    df <- subset(df, select = c(row1, row2))
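subset() can also filter rows and drop columns by name. The sketch below recreates the same toy data frame so that it runs on its own; the result names `rows_kept`, `without_row3`, and `both` are illustrative.

```r
# Same toy data frame as above, recreated so the snippet is self-contained
df <- data.frame(row1 = 0:3, row2 = 3:6, row3 = 6:9)

# Keep only the rows that satisfy a condition
rows_kept <- subset(df, row1 >= 2)

# Drop a column by negating it in select
without_row3 <- subset(df, select = -row3)

# Combine both: filter rows and choose columns in one call
both <- subset(df, row2 > 3, select = c(row1, row3))
```

The second argument of subset() is the row condition and `select` controls the columns, so either one can be used alone or both together.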
Question: Data Frame in R Language
Suppose we have a frequency distribution of sales from a sample of 100 sales receipts.
Price Value | Number of Sales |
---|---|
0 to 20 | 16 |
20 to 40 | 18 |
40 to 60 | 14 |
60 to 80 | 24 |
80 to 100 | 20 |
100 to 120 | 8 |
Calculate the mean, median, variance, standard deviation, and coefficient of variation using R.
Solution
    # Create a data frame
    df <- data.frame(lower_class = seq(0, 100, by = 20),
                     upper_class = seq(20, 120, by = 20),
                     freq = c(16, 18, 14, 24, 20, 8))

    # Class midpoints (m), and the products f*m and f*m^2
    m <- (df["lower_class"] + df["upper_class"]) / 2
    mf <- df["freq"] * m
    mfsquare <- df["freq"] * m^2
    data <- cbind(df, m, mf, mfsquare)
    colnames(data) <- c("LL", "UL", "freq", "M", "mf", "mf2")

    # Computation (sample variance with divisor n - 1)
    avg <- sum(data$mf) / sum(data$freq)
    var <- (sum(data$mf2) - sum(data$mf)^2 / sum(data$freq)) / (sum(data$freq) - 1)
    sd <- sqrt(var)
    CV <- sd / avg * 100

    ## Outputs
    paste("Mean = ", round(avg, 3))
    paste("Variance = ", round(var, 3))
    paste("Standard Deviation = ", round(sd, 3))
    paste("Coefficient of Variation = ", round(CV, 3))
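The question also asks for the median, which the solution code does not compute. A minimal sketch follows, assuming the usual interpolation formula for grouped data, Median = L + ((n/2 - cf) / f) * h, where L is the lower limit of the median class, cf the cumulative frequency before it, f its frequency, and h its width. The variable names are illustrative and not taken from the solution above.

```r
# Frequency distribution from the question
lower <- seq(0, 100, by = 20)        # lower class limits
upper <- seq(20, 120, by = 20)       # upper class limits
freq  <- c(16, 18, 14, 24, 20, 8)    # class frequencies

n     <- sum(freq)                   # total number of receipts
cum_f <- cumsum(freq)                # cumulative frequencies
k     <- which(cum_f >= n / 2)[1]    # index of the median class

L_k <- lower[k]                         # lower limit of the median class
h   <- upper[k] - lower[k]              # width of the median class
cf  <- if (k > 1) cum_f[k - 1] else 0   # cumulative frequency before the median class
f_k <- freq[k]                          # frequency of the median class

med <- L_k + ((n / 2 - cf) / f_k) * h
paste("Median = ", round(med, 3))
```

For this table the median class is 60 to 80, and the interpolated median is about 61.667.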
Using Logical Conditions for Selecting Rows and Columns
To select rows and columns using logical conditions, we use the built-in iris data set. Suppose we are interested in selecting the rows whose Sepal.Length is higher than the median of Sepal.Length and whose Petal.Width is at least 1.7. In the code below, each value of the Sepal.Length variable (column) is compared with the median of Sepal.Length. Similarly, each value of Petal.Width is compared with 1.7, and only the rows that satisfy both conditions are returned.
    # attach() makes the columns of iris available by name
    attach(iris)
    iris[(Sepal.Length > median(Sepal.Length) & Petal.Width >= 1.7), ]
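The same selection can be written without attach(), which some users prefer because attached columns can be masked by other objects of the same name. The sketch below shows two equivalent alternatives; the names `keep`, `selected`, and `selected2` are illustrative.

```r
# Build the logical condition with explicit iris$ references (no attach() needed)
keep <- iris$Sepal.Length > median(iris$Sepal.Length) & iris$Petal.Width >= 1.7
selected <- iris[keep, ]

# The same selection with subset(), which looks up column names inside iris
selected2 <- subset(iris, Sepal.Length > median(Sepal.Length) & Petal.Width >= 1.7)

# Both approaches pick the same rows
identical(rownames(selected), rownames(selected2))
```

Storing the condition in a logical vector such as `keep` also makes it easy to count the matching rows with `sum(keep)` before extracting them.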
One can select only the numeric columns, only the factor columns, or only the rows of a certain species with the code below:
    # Selecting numeric columns only
    iris[, sapply(iris, is.numeric)]
    # Selecting factor columns only (drop = FALSE keeps the result a data frame)
    iris[, sapply(iris, is.factor), drop = FALSE]
    # Selecting only certain Species
    iris[iris$Species == "virginica", ]
Omitting Missing Observations in a Data Frame
    # Omit rows with missing data
    na.omit(iris)
    # Flag missing values column by column (returns a logical matrix)
    apply(iris, 2, is.na)
    # Keep only the rows without any missing values
    iris[complete.cases(iris), ]
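The iris data set contains no missing values, so the calls above return it unchanged. To see the effect, here is a small sketch on a made-up data frame that does contain NA values; the name `obs` and its values are assumptions for illustration only.

```r
# Made-up data frame with some missing values (NA)
obs <- data.frame(
  id     = 1:5,
  height = c(170, NA, 165, 180, NA),
  weight = c(65, 72, NA, 80, 58)
)

colSums(is.na(obs))     # number of missing values in each column
complete.cases(obs)     # TRUE for rows that have no missing values

clean <- na.omit(obs)   # drop every row that has at least one NA
clean
```

Here only the rows with id 1 and 4 are complete, so `clean` keeps those two rows.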