Factors in R (Categorical Data)

Factors in R Language are used to represent categorical data in the R language. Factors can be ordered or unordered. One can think of a factor as an integer vector where each integer has a label. Factors are specially treated by modeling functions such as lm() and glm().  Factors are the data objects used for categorical data and store it as levels. Factors can store both string and integer variables.

Using factors with labels is better than using integers as factors are self-describing; having a variable that has values “Male” and “Female” is better than a variable having values 1 and 2.

Creating a Simple Factor

create a simple factor that has two levels

# Simple factor with two levels
x <- factor(c("yes", "yes", "no", "yes", "no"))
# computes frequency of factors
table(x)

# strips out the class
unclass(x)

The order of the levels can be set using the levels argument to factor(). This can be important in linear modeling because the first level is used as the baseline level.

x <- factor(c("yes","yes","no","yes","no"), levels = c("yes","no"))

Factors can be given names using the label argument. The label argument changes the old values of the variable to a new one. For example,

x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"), label = c(1,2) )


x <- factor(c("yes","yes","no","yes","no"), levels = c("yes","no"), label = c("Level-1", "level-2"))

x <- factor(c("yes","yes","no","yes","no"), levels = c("yes","no"), label = c("group-1", "group-2"))

Suppose, you have a factor variable with numerical values. You want to compute the mean. The mean vector will result in the average value of the vector, but the mean of the factor variable will result in a warning message. To calculate the mean of the original numeric values of the "f" variable, you have to convert the values using the level argument. For example,

# vector
v <- c(10,20,20,50,10,20,10,50,20)
# vector converted to factor
f <- factor(v)

# mean of the vector
mean(v)

# mean of factor
mean(f)

mean(as.numeric(levels(f)[f]))

Use of cut( ) Function to Create a Factor Variable

The the cut( ) function can also be used to convert a numeric variable into factor. The breaks argument can be used to describe how ranges of numbers will be converted to factor values. If the breaks argument is set to a single number then the resulting factor will be created by dividing the range of the variable into that number of equal-length intervals. However, if a vector of values is given to the breaks argument, the values in the vectors are used to determine the breakpoint. The number of levels of the resultant factor will be one less than the number of values in the vector provided to the breaks argument. For example,

attach(mtcars)
cut(mpg, breaks = 3)

factors <- cut(mpg, breaks = c(10, 18, 25, 30, 35) )

table(factors)


You will notice that the default label for factors produced by cut() function contains the actual range of values that were used to divide the variable into factors.