Logistic Regression Models in R

This article covers the use and application of logistic regression models in the R language. In a logistic regression model, the response variable ($y$) is categorical with two levels (binary or dichotomous), such as 1 or 0 (TRUE/FALSE). The model estimates the probability of the binary response through a mathematical equation that relates the response variable to the predictor(s). The built-in glm() function in R can be used to perform logistic regression analysis.

Probability and Odds Ratio

Logistic regression works with odds. If $p$ is the probability of success and $q=1-p$ is the probability of failure, the odds in favour of success are $\frac{p}{q}=\frac{p}{1-p}$.

Note that a probability can be converted to odds, and odds can be converted back to a probability. However, unlike a probability, odds can exceed 1. For example, if the probability of an event is 0.25, the odds in favour of that event are $\frac{0.25}{0.75}\approx 0.33$, and the odds against the same event are $\frac{0.75}{0.25}=3$.
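The conversion between probability and odds can be sketched in a few lines of R. The helper names prob_to_odds() and odds_to_prob() are illustrative only and are not part of base R.

```r
# Convert a probability to the odds in favour of the event
prob_to_odds <- function(p) p / (1 - p)

# Convert odds back to a probability
odds_to_prob <- function(odds) odds / (1 + odds)

p <- 0.25
odds_for     <- prob_to_odds(p)        # odds in favour of the event
odds_against <- 1 / odds_for           # odds against the same event
p_back       <- odds_to_prob(odds_for) # recovers the original probability

round(c(odds_for = odds_for, odds_against = odds_against, p_back = p_back), 2)
```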

Logistic Regression Models in R (Example)

In the built-in dataset “mtcars”, the column “am” describes the transmission mode, coded as a binary value (0 = automatic, 1 = manual). Let us fit logistic regression models with “am” as the response variable and other columns such as “vs”, “wt”, “hp”, and “cyl” as regressors, as given below:

Logistic Regression with one Dichotomous Predictor

logmodel1 <- glm(am ~ vs, family = "binomial", data = mtcars)
summary(logmodel1)

Logistic Regression with one Continuous Predictor

If the predictor variable is continuous, the logistic regression model is specified in R as given below:

logmodel2 <- glm(am ~ wt, family = "binomial", data = mtcars)
summary(logmodel2)
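Once such a model is fitted, predicted probabilities can be obtained with predict(). The following is a minimal sketch; the object names fit_wt and new_cars and the chosen weights are hypothetical and used only for illustration.

```r
# Fit the model with an explicit data argument
fit_wt <- glm(am ~ wt, family = binomial, data = mtcars)

# Hypothetical car weights (in 1000 lbs) at which to predict
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5))

# type = "response" returns probabilities of am = 1 rather than logits
p_manual <- predict(fit_wt, newdata = new_cars, type = "response")
cbind(new_cars, p_manual)
```

Without type = "response", predict() returns values on the logit (link) scale.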

Multiple Predictors in Logistic Regression

The following is an example of a logistic regression model with more than one predictor. Diagnostic plots are also drawn for this model.

logmodel3 <- glm(am ~ cyl + hp + wt, family = "binomial", data = mtcars)
summary(logmodel3)
plot(logmodel3)

Note: in a logistic regression model, both dichotomous and continuous variables can be used as predictors.

Figure: Logistic regression models in R and diagnostic plots

In R, the coefficients returned by logistic regression are logits, that is, log odds. To convert a logit to odds (or a slope coefficient to an odds ratio), exponentiate it; to convert a logit $\beta$ to a probability, use $\frac{e^\beta}{1+e^\beta}$. For example,

logmodel1 <- glm(am ~ vs, family = "binomial", data = mtcars)
logit_coef <- coef(logmodel1)               # coefficients on the logit (log-odds) scale
exp(logit_coef)                             # odds scale
exp(logit_coef) / (1 + exp(logit_coef))     # inverse logit
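As a worked check of the inverse-logit conversion, the sketch below computes the fitted probability of am = 1 for each level of vs. Because the slope is a difference in log odds, it is added to the intercept before converting. The object names (fit_vs, logit_vs0, and so on) are illustrative assumptions.

```r
fit_vs <- glm(am ~ vs, family = binomial, data = mtcars)
b <- coef(fit_vs)

# Logit for each group: intercept alone (vs = 0), intercept + slope (vs = 1)
logit_vs0 <- b[["(Intercept)"]]
logit_vs1 <- b[["(Intercept)"]] + b[["vs"]]

# Inverse logit by hand and with the built-in plogis()
p_vs0 <- exp(logit_vs0) / (1 + exp(logit_vs0))
p_vs1 <- plogis(logit_vs1)

# Observed proportions of am = 1 within each level of vs, for comparison
obs <- tapply(mtcars$am, mtcars$vs, mean)
c(p_vs0 = p_vs0, p_vs1 = p_vs1)
obs
```

With a single dichotomous predictor, the fitted probabilities should match the observed group proportions up to numerical tolerance.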

Generalized Linear Models (GLM) in R

Generalized linear models (GLMs) can be used when the distribution of the response variable is non-normal, or when the mean of the response is linear in the predictors only after a transformation. GLMs are flexible extensions of linear models that are used to fit regression models to non-Gaussian data.

A regression model can be classified as either linear or non-linear.

Generalized Linear Models

The basic form of a generalized linear model is
\begin{align*}
g(\mu_i) &= X_i' \beta \\
&= \beta_0 + \sum\limits_{j=1}^p x_{ij} \beta_j
\end{align*}
where $\mu_i=E(Y_i)$ is the expected value of the response variable $Y_i$ given the predictors, $g(\cdot)$ is a smooth and monotonic link function that connects $\mu_i$ to the predictors, $X_i'=(x_{i0}, x_{i1}, \cdots, x_{ip})$ is the known vector of predictor values for the $i$th observation with $x_{i0}=1$, and $\beta=(\beta_0, \beta_1, \cdots, \beta_p)'$ is the unknown vector of regression coefficients.
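The relationship between the linear predictor and the mean can be verified by hand on a fitted model. The sketch below uses the built-in cars dataset and a Gaussian family with the identity link; the object names fit, X, beta, eta, and mu are chosen here for illustration.

```r
# Gaussian GLM with the identity link on the built-in cars data
fit <- glm(dist ~ speed, family = gaussian(link = "identity"), data = cars)

X    <- model.matrix(fit)     # design matrix; the first column is x_i0 = 1
beta <- coef(fit)             # estimated coefficient vector
eta  <- drop(X %*% beta)      # linear predictor X_i' beta

# mu_i = g^{-1}(eta_i); linkinv() is the inverse link stored in the family object
mu <- family(fit)$linkinv(eta)

all.equal(unname(mu), unname(fitted(fit)))
```

For other links (for example, logit), the same steps apply; only the inverse link function changes.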

Fitting Generalized Linear Models

The glm() function fits a generalized linear model using the generic call shown below. The formula argument is specified in the same way as in the lm() function for linear regression.

mod <- glm(formula, family = gaussian, data = data.frame)

The family argument is a description of the error distribution and link function to be used in the model.

The class of generalized linear models is specified by giving a symbolic description of the linear predictor and a description of the error distribution. The link functions accepted by each family of response distributions are given below. The family name can be passed as the family argument of the glm() function.

Family Name         Link Functions
binomial            logit, probit, cloglog
gaussian            identity, log, inverse
Gamma               identity, inverse, log
inverse.gaussian    $1/\mu^2$, identity, inverse, log
poisson             log, identity, sqrt
quasi               logit, probit, cloglog, identity, inverse, log, $1/\mu^2$, sqrt
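Each family is a function that returns a family object, and its link can be inspected or changed through the link argument. The following sketch lists the default links and fits a probit model as one example of a non-default link; the names default_links and fit_probit are illustrative.

```r
# Default link of each family object
default_links <- sapply(
  list(binomial = binomial(), gaussian = gaussian(), Gamma = Gamma(),
       inverse.gaussian = inverse.gaussian(), poisson = poisson()),
  function(fam) fam$link
)
default_links

# Requesting a non-default link: probit instead of logit
fit_probit <- glm(am ~ wt, family = binomial(link = "probit"), data = mtcars)
family(fit_probit)$link
```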

Generalized Linear Models Example in R

Consider the “cars” dataset available in R. Let us fit a generalized linear regression model to the data set, taking the “dist” variable as the response variable and the “speed” variable as the predictor. Both the linear and the generalized linear models are fitted in the example below.

data(cars)
head(cars)

# Scatter plot with a smoothed trend line (no attach() needed)
scatter.smooth(x = cars$speed, y = cars$dist, main = "Dist ~ Speed")

# Linear Model
lm_fit <- lm(dist ~ speed, data = cars)
summary(lm_fit)

# Generalized Linear Model (Gaussian family, identity link)
glm_fit <- glm(dist ~ speed, data = cars, family = "gaussian")
plot(glm_fit)       # diagnostic plots
summary(glm_fit)
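With the Gaussian family and the identity link, the generalized linear model reduces to the ordinary linear model, so both fits should give the same estimates. A short check, using the illustrative object names fit_lm and fit_glm:

```r
fit_lm  <- lm(dist ~ speed, data = cars)
fit_glm <- glm(dist ~ speed, family = gaussian, data = cars)

# Coefficients should agree up to numerical tolerance
all.equal(coef(fit_lm), coef(fit_glm))

# The residual sum of squares from lm() equals the deviance of the Gaussian GLM
all.equal(sum(residuals(fit_lm)^2), deviance(fit_glm))
```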
Figure: Diagnostic plots of generalized linear models


Important R Language MCQs ggplot2 with Answers 8

The quiz “R Language MCQS ggplot2” will help you check your ability to execute some basic operations on objects in the R language, and it will also help you understand some basic concepts. This quiz may also improve your computational understanding.

Quiz about R Language

1. Which of the following are standards of tidy data?

2. Which summary functions can you use to preview data frames in the R language?

3. When programming in R, what is a pipe used as an alternative for?

4. What is the class of the object defined by the expression x <- c(4,5,10)?

5. Which R function can be used to make changes to a data frame?

6. In R, the following are all atomic data types EXCEPT:

7. Data analysts are working with customer information from their company’s sales data. The first and last names are in separate columns, but they want to create one column with both names instead. Which of the following functions can they use?

8. You are cleaning a data frame with improperly formatted column names. To clean the data frame, you want to use the clean_names() function. Which column names will be changed by clean_names() with default parameters?

9. Write the R commands for generating 700 random variables from normal distribution by using the following information: Mean = 14, SD = 3, n = 5, k = 2000.

10. For the population y <- c(1,2,3,4,5), write the R command to find the mean.

11. Suppose you want to simulate a coin toss 20 times in R. Write the command.

12. A data scientist is trying to print a data frame, but the console output produces too many rows and columns to be readable. What could they use instead of a data frame to make printing more readable?

13. Which R command obtains 1000 random numbers from a normal distribution with mean 0 and variance 1?

14. In ggplot2, an _____ is a visual property of an object in your plot.

15. For the population y <- c(1,2,3,4,5), write the R command to find the median.

16. Let us have 1000 random samples of size 6 under SRSWOR from the following population (111, 150, 121, 198, 112, 136, 114, 129, 117, 115, 186, 110, 121, 115, 114). Which R command repeats this procedure 1500 times?

17. Why are tibbles a useful variation of data frames?

18. A data analyst is working with the penguins data. They write the following code:
penguins %>%

The variable species includes three penguin species: Adelie, Chinstrap, and Gentoo. What code chunk does the analyst add to create a data frame that only includes the Gentoo species?

19. How can sampling with and without replacement be done in R?

20. Data analysts are cleaning their data in R. They want to be sure that their column names are unique and consistent to avoid any errors in their analysis. What R function can they use to do this automatically?

