An introduction to logistic regression

Alessandro Filazzola Centre for Urban Environments

Generalized Linear Models

A univariate analysis is a statistical test that compares a response variable to a series of predictors. There are many different versions including linear regression, ANOVAs, ANCOVAs, and GLMs. We are going to briefly touch upon the math and then go into the differences between these models.

First, let’s touch upon the math behind linear regressions. The form for a linear regression is as follows:

E [Y] = β X + ε

This equation states that the expected value of a dependent variable (E[Y]) is determined by the value of a predictor variable (X) multiplied by a parameter coefficient (β) plus a residual (ε). In other words, your response is determined by your predictor multiplied by some value plus or minus some difference.

Ex. Let’s say that you are a jogger that runs every day and you want to determine the speed that you run at. You record the duration and distance that you run every day. Thus, the predictor is the duration of your run per day, the response is the distance that was run, and the parameter coefficient is the speed that you run at. The residuals in this model are represented by unexplained variation from this relationship, such as you having to stop to tie your shoe that would increase the length but not the distance.

duration <- c(30,60,45,44,36,58,57,22,25)
distance <- c(2.5,5.1,3.9,3.9,3,4.9,4.8,1.7,2.1)

plot(duration,distance)

To determine this parameter coefficient we run a linear model using the ~ tilda as the regressor. The formula is represented as such:

m1 <- lm(distance~duration) ## linear model with response ~ predictor
summary(m1) ## see output of model

## 
## Call:
## lm(formula = distance ~ duration)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.10485 -0.05361 -0.02854  0.03275  0.17091 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.119385   0.096981  -1.231    0.258    
## duration     0.087465   0.002201  39.745 1.66e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09038 on 7 degrees of freedom
## Multiple R-squared:  0.9956, Adjusted R-squared:  0.995 
## F-statistic:  1580 on 1 and 7 DF,  p-value: 1.663e-09

Here the parameter coefficient is under Estimate as 0.087 kilometres per minute or 5.22 kph. We also notice that the p value is extremely small and the R squared close to 1. These inform us that the model fits accurately. We can plot this relationship using the abline function. The distance between the line and each point is the residuals ε

plot(duration,distance)
abline(m1, col="red") ## adds line from model

A Generalized Linear Model (GLM) is extremely similar to that of the LM in this regard. However, where the GLM differs is that it has an extra parameter called a “link function” (g^-1). The link function determines how the mean of the response is related to the predictor. Thus, the GLM is represented as:

E [Y] = (β X + ε) * g^-1

Why use a GLM? There are three reasons why GLMs may be more appropriate:

The response variable Yi is not normally distributed or the range is restricted (e.g. binomial, Poisson)
The variance of Y depends on the mean.
GLMs can fit to particular distributions without transformations. The benefits of this include i) the homogeneity of variance does NOT need to be satisfied and ii) the response variable does not need to be changed.

This last assumption has some associated complexities that could be read more on here. However, as a rule, log transforming a response variable and then fitting it to a normal distribution is not the same as conducting a GLM with a log link.

E(log(Y)) ≠ log(E(Y))

Logistic regression - a form of GLM

To best understand how a GLM is executed, it is a good idea to go through an example. As said above, at the most basic, a GLM compares a response to one or more predictors. Let us consider a population of people that have a disease. Three drugs are being considered as potentially curing this disease. Eighty people are randomly divided into four groups, three for each drug treatment and one as a control. We would like to see if a drug was effective.

This is an example of a logistic regression that is a member of the GLM family fit to a binomial distribution. The response variable here is not normal because the outcome of an individual can only be two values: 1 = cured or 0 = not cured.

drug.trial <- read.csv("data//drug.trial.csv")

##load data
head(drug.trial)

##   response predictor
## 1        1   control
## 2        0   control
## 3        0   control
## 4        0   control
## 5        1   control
## 6        0   control

## drug.trial <- read.csv("https://raw.githubusercontent.com/afilazzola/UoALogisticRegression/main/data/drug.trial.csv")

The formula has the same syntax as the LM shown above. The response is followed by the tilda and the predictors. Here, we also specify where the data is coming from. In this case it is coming from the drug.trial data frame we created. Lastly, the main nuance of using the GLM function is the specification of a family. Since our data is binary and follows a binomial distribution, that is what we refer to it as.

m2<- glm(response~predictor, data=drug.trial, family="binomial") ##glm for drugs potential to curse

summary(m2) ##output of model

## 
## Call:
## glm(formula = response ~ predictor, family = "binomial", data = drug.trial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6651  -0.8446  -0.7585   1.0935   1.6651  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     -1.0986     0.5164  -2.127  0.03338 * 
## predictordrug1   0.2513     0.7105   0.354  0.72354   
## predictordrug2   1.2993     0.6846   1.898  0.05772 . 
## predictordrug3   2.1972     0.7303   3.009  0.00262 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 110.453  on 79  degrees of freedom
## Residual deviance:  96.947  on 76  degrees of freedom
## AIC: 104.95
## 
## Number of Fisher Scoring iterations: 4

The output of the model reveals lots of information. The most important things to look at are the coefficients and the residual deviance. The GLM takes one of the predictor factors and treats it as a “dummy” variable. The result is that all the other factors are compared against that first value. These P-values therefore represent a comparison of each drug to the control which in itself is interesting, but not exactly what we are looking for. It is also important to examine the residual deviance and the residual degrees of freedom. As a general rule of thumb, the deviance should not be more than double the degrees of freedom, otherwise your data is overdispersed. You can read more here about dummy variables and overdisperson.

The summary results does not quite answer our question. We are interested if the drugs have an effect and if so, which drug is the best? To do this we need to conduct a significance test on the predictor to see if the effect is significantly different than zero. For binomial distributions this can be done with a chisq test.

anova(m2, test="Chisq") ## chisq test on predictor from model

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: response
## 
## Terms added sequentially (first to last)
## 
## 
##           Df Deviance Resid. Df Resid. Dev Pr(>Chi)   
## NULL                         79    110.453            
## predictor  3   13.506        76     96.947 0.003661 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we see that degrees freedom from the model (i.e. n-1), the residual deviance and DF again from the model output. The p-value here identifies that the effect is significant. R for some reason refers to the Chi-sq value from this test as “deviance”. Judging by the high Chi-sq value, we can believe that the effect of this predictor is significant. We now must determine which drug treatment is most effective.

Let’s try another example with a continuous variable such as the effects of winter air temperature on lake ice

## Create some data
iceData <- data.frame(icefree = c(0,0,0,0,1,0,1,1,1,1), temperature = c(-10,-7.5,-8.6,-5.2,-5,-2.3,0.5,0,1.1,3.7))

m3<- glm(icefree~temperature,  family="binomial", data=iceData) 
summary(m3)

## 
## Call:
## glm(formula = icefree ~ temperature, family = "binomial", data = iceData)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.46564  -0.33562  -0.00841   0.38789   1.65777  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   2.1369     1.6037   1.333   0.1827  
## temperature   0.6439     0.3644   1.767   0.0773 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13.8629  on 9  degrees of freedom
## Residual deviance:  6.1633  on 8  degrees of freedom
## AIC: 10.163
## 
## Number of Fisher Scoring iterations: 6

anova(m3, test="Chisq")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: icefree
## 
## Terms added sequentially (first to last)
## 
## 
##             Df Deviance Resid. Df Resid. Dev Pr(>Chi)   
## NULL                            9    13.8629            
## temperature  1   7.6996         8     6.1633 0.005523 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

library(ggplot2)
## Plot the logistic regression
ggplot(iceData, aes(x= temperature, y=icefree)) + geom_point() + geom_smooth(method="glm", method.args=list(family="binomial"), se=F)

## Plot your own range of values
m3Out <- data.frame(temperature = -10:4)
m3Out[,"icefree"] <- predict(m3, m3Out, type="response")

ggplot(m3Out, aes(x=temperature, y=icefree)) + geom_line() + geom_point(data=iceData, aes(x= temperature, y=icefree))

## Plot the log odds model
m3Out[,"logOdds"] <- predict(m3, m3Out, type="link")

ggplot(m3Out, aes(x=temperature, y=logOdds)) + geom_line()

## Psuedo R2
m3null<- glm(icefree~1,  family="binomial", data=iceData) ##glm for drugs potential to curse
1-logLik(m3)/logLik(m3null)

## 'log Lik.' 0.5554085 (df=2)

Examine model sensitivity and specificity

library(caret)
library(pROC)

## List actual vs predicted
actualData <- iceData$icefree
predictData <- predict( m3, iceData, type="response")

## set threshold
predictData <- ifelse(predictData > 0.5, 1, 0)

## Create confusion matrix
confusionMat <- table(actualData, predictData)
confusionMat

##           predictData
## actualData 0 1
##          0 4 1
##          1 1 4

## Extract outputs
confusionMatrix(confusionMat)

## Confusion Matrix and Statistics
## 
##           predictData
## actualData 0 1
##          0 4 1
##          1 1 4
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.4439, 0.9748)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.05469         
##                                           
##                   Kappa : 0.6             
##                                           
##  Mcnemar's Test P-Value : 1.00000         
##                                           
##             Sensitivity : 0.8             
##             Specificity : 0.8             
##          Pos Pred Value : 0.8             
##          Neg Pred Value : 0.8             
##              Prevalence : 0.5             
##          Detection Rate : 0.4             
##    Detection Prevalence : 0.5             
##       Balanced Accuracy : 0.8             
##                                           
##        'Positive' Class : 0               
##

## Create ROC graph
m3ROC <-  roc(icefree~temperature, data = iceData)
plot(m3ROC)

m3ROC ## get AUC

## 
## Call:
## roc.formula(formula = icefree ~ temperature, data = iceData)
## 
## Data: temperature in 5 controls (icefree 0) < 5 cases (icefree 1).
## Area under the curve: 0.96

Pairwise comparisons and packages

To see which drug is most effective, we need to conduct pairwise comparisons. There are many packages that are available in R to do this. One of the easier ones to work with is emmeans.

library(emmeans)

## Welcome to emmeans.
## NOTE -- Important change from versions <= 1.41:
##     Indicator predictors are now treated as 2-level factors by default.
##     To revert to old behavior, use emm_options(cov.keep = character(0))

The syntax for conducting a pairwise comparison is model, type of comparison (in this case pairwise), and lastly adjustment type. Tukey is a common technique that is also used for ANOVAs.

# emmeans(m3, pairwise~predictor, adjust="tukey") ##conduct pairwise comparisons on predictor

The output of this function provides a lot of information including the model statistics for each of the groups. However, the most relative content now to our question is the p.value to determine which groups were different from one another and the estimate to see in what direction. Here, we can see that the only group that is significantly different is drug3. This drug was more effective than a control and drug1, although it had a similar cure rate as drug2. This is a typical workflow for a GLM that begins by testing the effect of the predictor and is followed by a pairwise comparison to determine where the significance can be attributed to.

Different types of GLMs

There are many different types of GLMs that are possible in addition to the binomial family. Some of them involve changing the family in the glm function to the most appropriate data distribution. Other commons families include:

family	type of data
binomial	binary (y = 0,1) or proportion (0<y<1).
poisson	discrete count (y = 0, 1, 2, 3 … i).
gamma	continuous data that is right skewed, positive, and is on the log scale.
quasibinomial	binary or proportion data that is overdispersed
quasipoisson	discrete count data that is overdispersed.

Quasibinomial and quasipoisson compensate for overdispersed data with an extra parameter ϕ. Theta here is a dispersion parameter that attempts to describe additional variance that cannot be explained by the distribution alone. As a general rule of thumb follow this workflow.