Data Analysis - R Programming FAQs

Generalized Linear Models (GLM) in R

March 21, 2025August 9, 2022 by Muhammad Imdad Ullah

The generalized linear models (GLM) can be used when the distribution of the response variable is non-normal or when the response variable is transformed into linearity. The GLMs are flexible extensions of linear models that are used to fit the regression models to non-Gaussian data.

Introduction to Generalized Linear Models

Generalized Linear Models (GLMs) in R are an extension of linear regression that allow for response variables with non-normal distributions. GLMs are used to model relationships between a dependent variable and one or more independent variables. Generalized Linear Models consist of three components:

Random Component: Specifies the probability distribution of the response variable (e.g., Gaussian, Binomial, Poisson).
Systematic Component: The linear predictor, which is a linear combination of the predictors (independent variables).
Link Function: Connects the mean of the response variable to the linear predictor (e.g., identity, logit, log).

One can classify a regression model as linear or non-linear regression models.

Basic Form of a Generalized Linear Models

The basic form of a Generalized linear model is
\begin{align*}
g(\mu_i) &= X_i’ \beta \\
&= \beta_0 + \sum\limits_{j=1}^p x_{ij} \beta_j
\end{align*}
where $\mu_i=E(U_i)$ is the expected value of the response variable $Y_i$ given the predictors, $g(\cdot)$ is a smooth and monotonic link function that connects $\mu_i$ to the predictors, $X_i’=(x_{i0}, x_{i1}, \cdots, x_{ip})$ is the known vector having $i$th observations with $x_{i0}=1$, and $\beta=(\beta_0, \beta_1, \cdots, \beta_p)’$ is the unknown vector of regression coefficients.

Syntax of glm() Function

In R, GLMs are fitted using the glm() function. The basic syntax of glm() function is

glm(formula, family, data)

formula: Specifies the model (e.g., y ~ x1 + x2).
family: Describes the distribution and link function (e.g., gaussian(link = "identity"), binomial(link = "logit"), poisson(link = "log")).
data: The dataset containing the variables.

Fitting Generalized Linear Models

The glm() is a function that can be used to fit a generalized linear model, using the generic form of the model below. The formula argument is similar to that used in the lm() function for the linear regression model.

mod <- glm(formula, family = gaussian, data = data.frame)

The family argument is a description of the error distribution and link function to be used in the model.

The class of generalized linear models is specified by giving a symbolic description of the linear predictor and a description of the error distribution. The link functions for different families of the probability distribution of the response variables are given below. The family name can be used as an argument in the glm( ) function.

Link Functions for Different Families

Family Name	Link Functions
`binomial`	`logit` , `probit`, `cloglog`
`gaussian`	`identity`, `log`, `inverse`
`Gamma`	`identity`, `inverse`, `log`
`inverse gaussian`	$1/ \mu^2$, `identity`, `inverse`,`log`
`poisson`	`logit`, `probit`, `cloglog`, `identity`, `inverse`
`quasi`	`log`, $1/ \mu^2$, `sqrt`

Generalized Linear Models, GLM Example in R

Consider the “cars” dataset available in R. Let us fit a generalized linear regression model on the data set by assuming the “dist” variable as the response variable, and the “speed” variable as the predictor. Both the linear and generalized linear models are performed in the example below.

data(cars)
head(cars)
attach(cars)

scatter.smooth(x=speed, y=dist, main = "Dist ~ Speed")

# Linear Model
lm(dist ~ speed, data = cars)
summary(lm(dist ~ speed, data = cars)

# Generalized Linear Model
glm(dist ~ speed, data=cars, family = "gaussian")
plot(glm(dist ~ speed, data = cars))
summary(glm(dist ~ speed, data = cars))

Diagnostic Plots of Generalized Linear Models

Generalized Linear Models Types and Applications

GLM Type	Response Variable	Real-Life Example
Logistic Regression	Binary (0/1)	Customer churn, disease diagnosis
Poisson Regression	Count data	Insurance claims, website visits
Gamma Regression	Positive, skewed continuous	Insurance claim amounts, machine failure time
Multinomial Regression	Multi-category	Product choice, species classification
Negative Binomial Regression	Overdispersed count data	Accident counts, sick days
Ordinal Regression	Ordered categories	Customer satisfaction, disease severity
Tweedie Regression	Zero-inflated continuous	Insurance claims with many zeros

https://gmstat.com

Weighted Least Squares In R: A Quick WLS Tutorial

September 5, 2024August 18, 2021 by Muhammad Imdad Ullah

Introduction to Weighted Least Squares in R

This post will discuss the implementation of Weighted Least Squares (WLS) in R. The OLS method minimizes the sum of squared residuals, while the WLS weights the square residuals. The WLS technique is used when the OLS assumption related to constant variance in the errors is violated.

The WLS technique is also known as weighted linear regression it is a generalization of ordinary least squares (OLS) and linear regression in which knowledge of the variance of observations is incorporated into the regression.

WLS in R Language

Let us perform the WLS in R.

Here we will use the mtcars dataset. You need to load this data set first using the attach() function. For example,

attach(mtcars)

Consider the following example regarding the weighted least squares in which the reciprocal of $wt$ variable is used as weights. Here two different weighted models are performed and then check the fit of the model using anova() function.

# Weighted Model 1
w_model1 <- lm(mpg ~ wt + hp, data = mtcars)

# Weighted Model 2
w_model2 <- lm(mpg ~ wt + hp, data = mtcars, weights = 1/wt)

Check/ Test The Model Fit

To check the model fit, summary statistics of the fitted model, and different diagnostic plots of the fitted model, one can use the built-in functions as,

anova(w_model1, w_model2 )
summary(w_model1)
summary(w_model2)

plot(w_model1)
plot(w_model2)

The output of the First Weighted Model is:

Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.941 -1.600 -0.182  1.050  5.854 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
hp          -0.03177    0.00903  -3.519  0.00145 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared:  0.8268,	Adjusted R-squared:  0.8148 
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

The output of the Second Weighted Model is

Call:
lm(formula = mpg ~ wt + hp, data = mtcars, weights = 1/wt)

Weighted Residuals:
    Min      1Q  Median      3Q     Max 
-2.2271 -1.0538 -0.3894  0.6397  3.7627 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.002317   1.541462  25.302  < 2e-16 ***
wt          -4.443823   0.688300  -6.456 4.59e-07 ***
hp          -0.031460   0.009776  -3.218  0.00317 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.554 on 29 degrees of freedom
Multiple R-squared:  0.8389,	Adjusted R-squared:  0.8278 
F-statistic: 75.49 on 2 and 29 DF,  p-value: 3.189e-12

Graphical Representation of Models

The graphical representation of both models is:

Weighted Least Squares (Model 1) — Diagnostic Plots for WLS model: Model-1

Weighted Least Squares Model-2 — Diagnostic Plots for WLS model: Model-2

FAQS Weighted Least Square in R

How weighted least squares can be performed in R Lanague?
How lm() function can be used to conduct a WLS in R?
What are the important arguments for performing a weighted least squares model in R?

Learn about Generalized Least Squares

Curvilinear Regression in R: A Quick Reference

September 5, 2024August 7, 2021 by Muhammad Imdad Ullah

Introduction to Curvilinear Regression in R Language

In this post, we will learn about some basics of curvilinear regression in R.

The curvilinear/non-linear regression analysis is used to determine if there is a non-linear trend exists between $X$ and $Y$.

Adding more parameters to an equation results in a better fit to the data. A quadratic and cubic equation will always have higher $R^2$ than the linear regression model. Similarly, a cubic equation will usually have higher $R^2$ than a quadratic one.

Logarithmic and Polynomial Relationships

The logarithmic relationship can be described as follows:
$$Y=m\, log(x)++c$$
the polynomial relationship can be described as follows:
$$Y=m_1x + m_2x^2 + m_3x^3 + m_nx^n + c$$

The logarithmic example is more akin to a simple regression, whereas the polynomial example is multiple regression. Logarithmic relationships are common in the natural world; you may encounter them in many circumstances. Drawing the relationships between response and predictor variables as a scatter plot is generally a good starting point.

Consider the following data that are related in a curvilinear form,

Growth	Nutrient
2	2
9	4
11	6
12	8
13	10
14	16
17	22
19	28
17	30
18	36
20	48

Performing Curvilinear Regression in R

Let us perform a curvilinear regression in R language.

Growth <- c(2, 9, 11, 12, 13, 14, 17, 19, 17, 18, 20)
Nutrient <- c(2, 4, 6, 8, 10, 16, 22, 28, 30, 36, 48)
data <- as.data.frame(cbind(Growth, Nutrient))

ggplot(data, aes(Nutrient, Growth) ) +
  geom_point() +
  stat_smooth()

The Scatter plot shows the relationship appears to be a logarithmic one.

Linear Regression in R

Let us carry out a linear regression using the lm() function by taking the $\log$ of the predictor variable rather than the basic variable itself.

data <- cbind(Growth, Nutrient)
mod  <- lm(Growth~log(Nutrient, data))
summary(mod)

##
Call:

lm(formula = Growth ~ log(Nutrient), data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.2274 -0.9039  0.5400  0.9344  1.3097 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.6914     1.0596   0.652     0.53    
log(Nutrient)   5.1014     0.3858  13.223 3.36e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.229 on 9 degrees of freedom
Multiple R-squared:  0.951,     Adjusted R-squared:  0.9456 
F-statistic: 174.8 on 1 and 9 DF,  p-value: 3.356e-07

FAQS about Curvilinear Regression in R

Write in detail about curvilinear regression models.
How visually one can guess the curvilinear relationship between the response and predictor variable?
What may be the consequences, if a curvilinear relationship is estimated using a simple linear regression model?

Learn about Performing Linear Regression in R

Learn Statistics