Generalized linear models (GLMs) can be used when the response variable is not normally distributed, or when its mean is linear in the predictors only after a transformation. GLMs are flexible extensions of linear models for fitting regression models to non-Gaussian data.
Introduction to Generalized Linear Models
Generalized Linear Models (GLMs) in R are an extension of linear regression that allow for response variables with non-normal distributions. GLMs are used to model relationships between a dependent variable and one or more independent variables. Generalized Linear Models consist of three components:
- Random Component: Specifies the probability distribution of the response variable (e.g., Gaussian, Binomial, Poisson).
- Systematic Component: The linear predictor, which is a linear combination of the predictors (independent variables).
- Link Function: Connects the mean of the response variable to the linear predictor (e.g., identity, logit, log).
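As a small illustration of these components, R's family objects bundle the distribution and the link function together. The sketch below uses only base R; the value 0.25 and the variable names are chosen purely for illustration.

```r
# A family object carries the distribution (random component) and the link.
fam <- binomial(link = "logit")

fam$family   # name of the distribution
fam$link     # name of the link function

# linkfun maps a mean to the linear-predictor scale; linkinv maps it back.
mu      <- 0.25
eta     <- fam$linkfun(mu)    # logit(0.25) = log(0.25 / 0.75)
mu_back <- fam$linkinv(eta)   # the inverse logit recovers 0.25
```

The systematic component is not stored in the family object; it comes from the model formula supplied when the model is fitted.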
Regression models can be classified as linear or non-linear. A GLM remains linear in its coefficients, because the non-linearity enters only through the link function.
Basic Form of a Generalized Linear Model
The basic form of a generalized linear model is
\begin{align*}
g(\mu_i) &= X_i' \beta \\
&= \beta_0 + \sum\limits_{j=1}^p x_{ij} \beta_j
\end{align*}
where $\mu_i=E(Y_i)$ is the expected value of the response variable $Y_i$ given the predictors, $g(\cdot)$ is a smooth and monotonic link function that connects $\mu_i$ to the predictors, $X_i'=(x_{i0}, x_{i1}, \cdots, x_{ip})$ is the known vector of predictor values for the $i$th observation with $x_{i0}=1$, and $\beta=(\beta_0, \beta_1, \cdots, \beta_p)'$ is the unknown vector of regression coefficients.
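The relationship between the linear predictor and the mean can be checked numerically. The sketch below is one possible illustration, assuming a Poisson family with a log link fitted to R's built-in cars data. That model choice is made only for the example and is not a recommendation for these data.

```r
# Poisson GLM with a log link on the built-in cars data (illustrative choice)
fit <- glm(dist ~ speed, family = poisson(link = "log"), data = cars)

X    <- model.matrix(fit)   # each row is X_i' with x_i0 = 1
beta <- coef(fit)           # estimated (beta_0, beta_1)

eta <- drop(X %*% beta)     # linear predictor X_i' beta
mu  <- exp(eta)             # inverse of the log link gives mu_i

# Both should agree with what predict() and fitted() report
all.equal(unname(eta), unname(predict(fit, type = "link")))
all.equal(unname(mu),  unname(fitted(fit)))
```

Here g is the log function, so applying exp() to the linear predictor returns the fitted means.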
Syntax of glm() Function
In R, GLMs are fitted using the glm() function. The basic syntax of the glm() function is
glm(formula, family, data)
- formula: Specifies the model (e.g., y ~ x1 + x2).
- family: Describes the distribution and link function (e.g., gaussian(link = "identity"), binomial(link = "logit"), poisson(link = "log")).
- data: The dataset containing the variables.
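A concrete call may make the three arguments easier to see. The sketch below assumes the built-in mtcars data and treats the 0/1 transmission indicator am as a binary response. The dataset and the choice of wt as predictor are illustrative assumptions, not part of the original example.

```r
# Logistic regression: transmission type (am, coded 0/1) against weight (wt)
fit_logit <- glm(am ~ wt,                           # formula
                 family = binomial(link = "logit"), # distribution and link
                 data = mtcars)                     # data frame with am and wt

coef(fit_logit)     # estimates on the log-odds scale
family(fit_logit)   # reports the family and link that were used
```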
Fitting Generalized Linear Models
The glm() function fits a generalized linear model using the generic form shown below, in which formula and data.frame are placeholders for the model formula and the data frame. The formula argument is the same as that used in the lm() function for the linear regression model.
mod <- glm(formula, family = gaussian, data = data.frame)
The family argument is a description of the error distribution and link function to be used in the model.
A generalized linear model is specified by giving a symbolic description of the linear predictor and a description of the error distribution. The link functions available for each family of response distributions are given below. The family name is passed as the family argument of the glm() function.
Link Functions for Different Families
Family Name | Link Functions |
---|---|
binomial | logit , probit , cloglog |
gaussian | identity , log , inverse |
Gamma | identity , inverse , log |
inverse.gaussian | $1/ \mu^2$, identity , inverse , log |
poisson | log , identity , sqrt |
quasi | logit , probit , cloglog , identity , inverse , log , $1/ \mu^2$, sqrt |
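The default link of each family can be inspected directly, and a different link can be requested through the link argument. The following is a minimal sketch using base R only. The Gamma model on the cars data is an assumption made for illustration.

```r
# Default links reported by the family objects themselves
gaussian()$link           # "identity"
binomial()$link           # "logit"
poisson()$link            # "log"
Gamma()$link              # "inverse"
inverse.gaussian()$link   # "1/mu^2"

# Requesting a non-default link: Gamma response with a log link
fit_gamma <- glm(dist ~ speed, family = Gamma(link = "log"), data = cars)
family(fit_gamma)$link    # "log"
```

The log link is often chosen for Gamma models because it keeps the fitted means positive for any value of the linear predictor.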
Generalized Linear Models, GLM Example in R
Consider the “cars” dataset available in R. Let us fit a generalized linear model to these data, taking the “dist” variable as the response and the “speed” variable as the predictor. Both a linear model and a generalized linear model are fitted in the example below.
data(cars)
head(cars)
scatter.smooth(x = cars$speed, y = cars$dist, main = "Dist ~ Speed")
# Linear Model
lm_fit <- lm(dist ~ speed, data = cars)
summary(lm_fit)
# Generalized Linear Model (Gaussian family, identity link)
glm_fit <- glm(dist ~ speed, data = cars, family = "gaussian")
summary(glm_fit)
# Diagnostic plots for the fitted GLM
plot(glm_fit)
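With the Gaussian family and identity link, the GLM coincides with the ordinary linear model, which can be verified numerically. The sketch below refits both models so that it runs on its own. The object names are chosen for illustration.

```r
lm_fit  <- lm(dist ~ speed, data = cars)
glm_fit <- glm(dist ~ speed, family = gaussian(link = "identity"), data = cars)

# The coefficient estimates agree up to numerical tolerance
all.equal(coef(lm_fit), coef(glm_fit))

# For the Gaussian family, the residual deviance is the residual sum of squares
all.equal(deviance(glm_fit), sum(residuals(lm_fit)^2))
```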
Diagnostic Plots of Generalized Linear Models
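One way to produce the diagnostic plots is to call plot() on the fitted model object. The sketch below assumes the Gaussian GLM for the cars data and arranges the four default plots in a 2 x 2 grid. The layout is a presentation choice, not a requirement.

```r
glm_fit <- glm(dist ~ speed, family = gaussian, data = cars)

op <- par(mfrow = c(2, 2))   # show the four default plots on one page
plot(glm_fit)                # residuals vs fitted, Q-Q, scale-location, residuals vs leverage
par(op)                      # restore the previous graphics settings
```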
Generalized Linear Models Types and Applications
GLM Type | Response Variable | Real-Life Example |
---|---|---|
Logistic Regression | Binary (0/1) | Customer churn, disease diagnosis |
Poisson Regression | Count data | Insurance claims, website visits |
Gamma Regression | Positive, skewed continuous | Insurance claim amounts, machine failure time |
Multinomial Regression | Multi-category | Product choice, species classification |
Negative Binomial Regression | Overdispersed count data | Accident counts, sick days |
Ordinal Regression | Ordered categories | Customer satisfaction, disease severity |
Tweedie Regression | Zero-inflated continuous | Insurance claims with many zeros |
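As an illustration of the count-data rows in the table, the sketch below fits a Poisson regression to the built-in warpbreaks data and computes a rough overdispersion check. The dataset and the use of the quasipoisson family as a simple remedy are assumptions made for this example. A negative binomial model would need an add-on package and is not shown.

```r
# Poisson regression for counts of warp breaks
fit_pois <- glm(breaks ~ wool + tension, family = poisson, data = warpbreaks)

# Rough overdispersion check: Pearson chi-square divided by residual df.
# Values well above 1 suggest more variability than the Poisson model allows.
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
dispersion

# The quasi-Poisson family keeps the same coefficients but estimates the dispersion
fit_quasi <- glm(breaks ~ wool + tension, family = quasipoisson, data = warpbreaks)
summary(fit_quasi)$dispersion
```

The coefficient estimates are identical in the two fits; only the standard errors are rescaled by the estimated dispersion.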