In the R programming language, a data frame is a two-dimensional data structure made up of rows and columns. All columns must have the same number of rows. The intersection of a row and a column is a cell, so each cell of the data frame is identified by a combination of row number and column number.

One can modify, extract, and rearrange the contents of a data frame; this process is called data frame manipulation. To create a data frame, the following general syntax can be used:

### Data Frame Syntax in R

```
df <- data.frame(first_column = c(data values separated with commas),
                 second_column = c(data values separated with commas),
                 ......
)
```

An example of a data frame in R is

```r
df <- data.frame(
  age   = c(23, 24, 25, 26, 23, 25, 29, 20),
  marks = c(99, 80, 67, 56, 98, 65, 45, 77),
  grade = c("A", "A", "C", "D", "A", "B", "F", "B")
)
print(df)
```
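To make the idea of rows, columns, and cells concrete, the short sketch below rebuilds the example data frame and inspects it. The object name `df_demo` is used here only for illustration.

```r
# Rebuild the example data frame (same values as the example above)
df_demo <- data.frame(
  age   = c(23, 24, 25, 26, 23, 25, 29, 20),
  marks = c(99, 80, 67, 56, 98, 65, 45, 77),
  grade = c("A", "A", "C", "D", "A", "B", "F", "B")
)

dim(df_demo)         # number of rows and columns: 8 3
str(df_demo)         # type of each column
df_demo[3, "marks"]  # cell in row 3 of column "marks": 67
df_demo[3, 2]        # the same cell, addressed by position
```

A cell can be addressed either by position or by column name; addressing by name is usually more robust when columns are later re-ordered.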

One can name or rename the columns and rows of the data frame:

```r
# Naming / renaming columns
colnames(df) <- c("Age", "Score", "Grade")

# Naming / renaming rows
row.names(df) <- c("1st", "2nd", "3rd", "4th", "5th", "6th", "7th", "8th")
```
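Assigning to `colnames()` replaces every column name at once. When only one column needs a new name, a possible approach is to index the names vector, as in the sketch below. The small data frame `df_demo` is a hypothetical stand-in, not part of the original example.

```r
# A small illustrative data frame (hypothetical values)
df_demo <- data.frame(age = c(23, 24), marks = c(99, 80), grade = c("A", "A"))

# Rename only the "marks" column, leaving the others untouched
names(df_demo)[names(df_demo) == "marks"] <- "Score"

names(df_demo)  # "age" "Score" "grade"
```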

### Subsetting a Data Frame

The subset() function can be used to create a new data frame that keeps only the selected column(s) and drops the rest. In effect, the columns are split into an included set and an excluded set, and only the included columns are returned. To understand subsetting a data frame, let us create a data frame first.

```r
# creating a data frame
df <- data.frame(row1 = 0:3, row2 = 3:6, row3 = 6:9)

# creating a subset that keeps only the columns row1 and row2
df <- subset(df, select = c(row1, row2))
```
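`subset()` also accepts a row condition as its second argument, and a minus sign in `select` drops a column instead of keeping it. The following is a minimal sketch; the names `df_sub`, `kept`, and `kept2` are illustrative only.

```r
df_sub <- data.frame(row1 = 0:3, row2 = 3:6, row3 = 6:9)

# Keep the rows where row1 > 0 and drop the column row3
kept <- subset(df_sub, row1 > 0, select = -row3)

# The same selection written with bracket indexing
kept2 <- df_sub[df_sub$row1 > 0, c("row1", "row2")]

print(kept)
```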

**Question: Data Frame in R Language**

Suppose we have a frequency distribution of sales from a sample of 100 sales receipts.

Price value | Number of Sales |
---|---|
0 to 20 | 16 |
20 to 40 | 18 |
40 to 60 | 14 |
60 to 80 | 24 |
80 to 100 | 20 |
100 to 120 | 8 |

Calculate mean, median, variance, standard deviation and coefficient of variation by using R code.

**Solution**

```r
# Create a data frame of class limits and frequencies
df <- data.frame(lower_class = seq(0, 100, by = 20),
                 upper_class = seq(20, 120, by = 20),
                 freq = c(16, 18, 14, 24, 20, 8))

# mid points, f * m, and f * m^2
m <- (df["lower_class"] + df["upper_class"]) / 2
mf <- df["freq"] * m
mfsquare <- df["freq"] * m^2

data <- cbind(df, m, mf, mfsquare)
colnames(data) <- c("LL", "UL", "freq", "M", "mf", "mf2")

# Computation
avg <- sum(data$mf) / sum(data$freq)
var <- (sum(data$mf2) - sum(data$mf)^2 / sum(data$freq)) / (sum(data$freq) - 1)
sd <- sqrt(var)
CV <- sd / avg * 100

## Outputs
paste("Mean = ", round(avg, 3))
paste("Variance = ", round(var, 3))
paste("Standard Deviation = ", round(sd, 3))
paste("Coefficient of Variation = ", round(CV, 3))
```
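The question also asks for the median, which the solution above does not compute. One possible way to obtain it is the usual interpolation formula for grouped data, median = L + ((N/2 - CF) / f) * h, where L is the lower limit of the median class, CF the cumulative frequency before it, f its frequency, and h its width. The sketch below assumes that formula; all variable names are illustrative.

```r
lower <- seq(0, 100, by = 20)
upper <- seq(20, 120, by = 20)
freq  <- c(16, 18, 14, 24, 20, 8)

n       <- sum(freq)                       # total number of observations
cum_f   <- cumsum(freq)                    # cumulative frequencies
k       <- which(cum_f >= n / 2)[1]        # index of the median class
cf_prev <- if (k > 1) cum_f[k - 1] else 0  # cumulative frequency before it
h       <- upper[k] - lower[k]             # width of the median class

med <- lower[k] + (n / 2 - cf_prev) / freq[k] * h
paste("Median = ", round(med, 3))
```

For these data, the median class is 60 to 80, and the interpolated median is about 61.667.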

### Using Logical Conditions for Selecting Rows and Columns

To select rows and columns using logical conditions, we consider the iris data set. Suppose we are interested in selecting the rows whose Sepal.Length is higher than the median of Sepal.Length and whose Petal.Width >= 1.7. In the code below, each value of the Sepal.Length variable (column) is compared with the median value of Sepal.Length. Similarly, each value of Petal.Width is compared with 1.7, and only the rows that satisfy both conditions are extracted.

```r
attach(iris)
iris[(Sepal.Length > median(Sepal.Length) & Petal.Width >= 1.7), ]
```
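Because `attach()` places the columns on the search path, names can be masked in longer scripts. As a hedged alternative, the same rows can be selected without attaching the data, either with `subset()` or with explicit `iris$` references. The names `sel_subset` and `sel_index` below are illustrative.

```r
# Same selection without attach(): column names are looked up inside iris
sel_subset <- subset(iris,
                     Sepal.Length > median(Sepal.Length) & Petal.Width >= 1.7)

# Equivalent form using explicit iris$ references
sel_index <- iris[iris$Sepal.Length > median(iris$Sepal.Length) &
                    iris$Petal.Width >= 1.7, ]

# Both approaches should pick the same rows
identical(rownames(sel_subset), rownames(sel_index))
```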

One can select only the numeric columns, only the factor columns, or only the rows of a certain species by following the code below:

```r
# Selecting numeric columns only
iris[, sapply(iris, is.numeric)]

# Selecting factor columns only
iris[, sapply(iris, is.factor)]

# Selecting only certain Species
iris[iris$Species == "virginica", ]
```

### Omitting Missing Observations in a Data Frame

```r
# Omit rows with missing data
na.omit(iris)

# Flag missing values in every column (returns a logical matrix)
apply(iris, 2, is.na)

# Keep only the rows that have no missing values
iris[complete.cases(iris), ]
```
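The built-in iris data set has no missing values, so the calls above return it unchanged. To see the effect, the sketch below works on a copy with a few values deliberately set to `NA`. The copy `iris_na` and the chosen positions are assumptions made purely for illustration.

```r
# Copy iris and inject a few missing values (illustrative positions)
iris_na <- iris
iris_na[c(2, 5), "Sepal.Length"] <- NA
iris_na[5, "Petal.Width"] <- NA

# Count missing values per column
colSums(is.na(iris_na))

# Number of rows with at least one missing value
n_incomplete <- sum(!complete.cases(iris_na))

# Drop the incomplete rows
clean <- na.omit(iris_na)
nrow(clean)  # 148: rows 2 and 5 are removed from the original 150
```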