A univariate analysis is a statistical test that compares a response variable to a series of predictors. There are many different versions including linear regression, ANOVAs, ANCOVAs, and GLMs. We are going to briefly touch upon the math and then go into the differences between these models.
First, let’s touch upon the math behind linear regressions. The form for a linear regression is as follows:
E [Y] = β X + ε
This equation states that the expected value of a dependent variable (E[Y]) is determined by the value of a predictor variable (X) multiplied by a parameter coefficient (β) plus a residual (ε). In other words, your response is determined by your predictor multiplied by some value plus or minus some difference.
Ex. Let’s say that you are a jogger that runs every day and you want to determine the speed that you run at. You record the duration and distance that you run every day. Thus, the predictor is the duration of your run per day, the response is the distance that was run, and the parameter coefficient is the speed that you run at. The residuals in this model are represented by unexplained variation from this relationship, such as you having to stop to tie your shoe that would increase the length but not the distance.
library(LearnCommAnalysis)
duration <- c(30,60,45,44,36,58,57,22,25)
distance <- c(2.5,5.1,3.9,3.9,3,4.9,4.8,1.7,2.1)
plot(duration,distance)
To determine this parameter coefficient we run a linear model using the ~
tilda as the regressor. The formula is represented as such:
m1 <- lm(distance~duration) ## linear model with response ~ predictor
summary(m1) ## see output of model
##
## Call:
## lm(formula = distance ~ duration)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.10485 -0.05361 -0.02854 0.03275 0.17091
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.119385 0.096981 -1.231 0.258
## duration 0.087465 0.002201 39.745 1.66e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09038 on 7 degrees of freedom
## Multiple R-squared: 0.9956, Adjusted R-squared: 0.995
## F-statistic: 1580 on 1 and 7 DF, p-value: 1.663e-09
Here the parameter coefficient is under Estimate as 0.087 kilometres per minute or 5.22 kph. We also notice that the p value is extremely small and the R squared close to 1. These inform us that the model fits accurately. We can plot this relationship using the abline
function. The distance between the line and each point is the residuals ε
plot(duration,distance)
abline(m1, col="red") ## adds line from model
A Generalized Linear Model (GLM) is extremely similar to that of the LM in this regard. However, where the GLM differs is that it has an extra parameter called a “link function” (g-1). The link function determines how the mean of the response is related to the predictor. Thus, the GLM is represented as:
E [Y] = (β X + ε) * g^-1
Why use a GLM? There are three reasons why GLMs may be more appropriate:
This last assumption has some associated complexities that could be read more on here. However, as a rule, log transforming a response variable and then fitting it to a normal distribution is not the same as conducting a GLM with a log link.
E(log(Y)) ≠ log(E(Y))
To best understand how a GLM is executed, it is a good idea to go through an example. As said above, at the most basic, a GLM compares a response to one or more predictors. Let us consider a population of people that have a disease. Three drugs are being considered as potentially curing this disease. Eighty people are randomly divided into four groups, three for each drug treatment and one as a control. We would like to see if a drug was effective.
This is an example of a logistic regression that is a member of the GLM family fit to a binomial distribution. The response variable here is not normal because the outcome of an individual can only be two values: 1 = cured or 0 = not cured.
##load data
head(drug.trial)
## response predictor
## 1 1 control
## 2 0 control
## 3 0 control
## 4 0 control
## 5 1 control
## 6 0 control
## drug.trial <- read.csv("https://raw.githubusercontent.com/afilazzola/YorkU.BioStats.2019/master/data/drug.trial.csv")
The formula has the same syntax as the LM shown above. The response is followed by the tilda and the predictors. Here, we also specify where the data is coming from. In this case it is coming from the drug.trial data frame we created. Lastly, the main nuance of using the GLM function is the specification of a family. Since our data is binary and follows a binomial distribution, that is what we refer to it as.
m2<- glm(response~predictor, data=drug.trial, family="binomial") ##glm for drugs potential to curse
summary(m2) ##output of model
##
## Call:
## glm(formula = response ~ predictor, family = "binomial", data = drug.trial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6651 -0.8446 -0.7585 1.0935 1.6651
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0986 0.5164 -2.127 0.03338 *
## predictordrug1 0.2513 0.7105 0.354 0.72354
## predictordrug2 1.2993 0.6846 1.898 0.05772 .
## predictordrug3 2.1972 0.7303 3.009 0.00262 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 110.453 on 79 degrees of freedom
## Residual deviance: 96.947 on 76 degrees of freedom
## AIC: 104.95
##
## Number of Fisher Scoring iterations: 4
The output of the model reveals lots of information. The most important things to look at are the coefficients and the residual deviance. The GLM takes one of the predictor factors and treats it as a “dummy” variable. The result is that all the other factors are compared against that first value. These P-values therefore represent a comparison of each drug to the control which in itself is interesting, but not exactly what we are looking for. It is also important to examine the residual deviance and the residual degrees of freedom. As a general rule of thumb, the deviance should not be more than double the degrees of freedom, otherwise your data is overdispersed. You can read more here about dummy variables and overdisperson.
The summary results does not quite answer our question. We are interested if the drugs have an effect and if so, which drug is the best? To do this we need to conduct a significance test on the predictor to see if the effect is significantly different than zero. For binomial distributions this can be done with a chisq test.
anova(m2, test="Chisq") ## chisq test on predictor from model
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: response
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 79 110.453
## predictor 3 13.506 76 96.947 0.003661 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here we see that degrees freedom from the model (i.e. n-1), the residual deviance and DF again from the model output. The p-value here identifies that the effect is significant. R for some reason refers to the Chi-sq value from this test as “deviance”. Judging by the high Chi-sq value, we can believe that the effect of this predictor is significant. We now must determine which drug treatment is most effective.
To see which drug is most effective, we need to conduct pairwise comparisons. There are many packages that are available in R to do this. One of the easier ones to work with is lsmeans
.
library(emmeans)
## Warning: package 'emmeans' was built under R version 3.5.3
The syntax for conducting a pairwise comparison is model, type of comparison (in this case pairwise), and lastly adjustment type. Tukey is a common technique that is also used for ANOVAs.
emmeans(m2, pairwise~predictor, adjust="tukey") ##conduct pairwise comparisons on predictor
## $emmeans
## predictor emmean SE df asymp.LCL asymp.UCL
## control -1.099 0.516 Inf -2.1107 -0.0865
## drug1 -0.847 0.488 Inf -1.8037 0.1091
## drug2 0.201 0.449 Inf -0.6803 1.0816
## drug3 1.099 0.516 Inf 0.0865 2.1107
##
## Results are given on the logit (not the response) scale.
## Confidence level used: 0.95
##
## $contrasts
## contrast estimate SE df z.ratio p.value
## control - drug1 -0.251 0.710 Inf -0.354 0.9848
## control - drug2 -1.299 0.685 Inf -1.898 0.2289
## control - drug3 -2.197 0.730 Inf -3.009 0.0140
## drug1 - drug2 -1.048 0.663 Inf -1.580 0.3903
## drug1 - drug3 -1.946 0.710 Inf -2.739 0.0313
## drug2 - drug3 -0.898 0.685 Inf -1.312 0.5555
##
## Results are given on the log odds ratio (not the response) scale.
## P value adjustment: tukey method for comparing a family of 4 estimates
The output of this function provides a lot of information including the model statistics for each of the groups. However, the most relative content now to our question is the p.value to determine which groups were different from one another and the estimate to see in what direction. Here, we can see that the only group that is significantly different is drug3. This drug was more effective than a control and drug1, although it had a similar cure rate as drug2. This is a typical workflow for a GLM that begins by testing the effect of the predictor and is followed by a pairwise comparison to determine where the significance can be attributed to.
There are many different types of GLMs that are possible in addition to the binomial
family. Some of them involve changing the family in the glm function to the most appropriate data distribution. Other commons families include:
family | type of data |
---|---|
binomial | binary (y = 0,1) or proportion (0<y<1). |
poisson | discrete count (y = 0, 1, 2, 3 … i). |
gamma | continuous data that is right skewed, positive, and is on the log scale. |
quasibinomial | binary or proportion data that is overdispersed |
quasipoisson | discrete count data that is overdispersed. |
Quasibinomial and quasipoisson compensate for overdispersed data with an extra parameter ϕ
. Theta here is a dispersion parameter that attempts to describe additional variance that cannot be explained by the distribution alone. As a general rule of thumb follow this workflow.
Here is an example of bird surveys in Southern Ontario in forests, urban parks, and city centers. Determine if there is a difference between the habitats and which was the greatest.
head(bird.data)
## species.rich habitat
## 1 5 forest
## 2 7 forest
## 3 1 forest
## 4 7 forest
## 5 3 forest
## 6 5 forest
A Generalized Linear Mixed Model (GLMM) is a type of GLM that includes a random effect. Random effects are variables that are included in the model to account for common variance among your replicates. Let’s consider a regular linear regression, where you have different levels of animal grazing. Here, the number of cattle per hectare is a fixed effect and is represented by the x-axis (predictor variable). The response variable is amount of surveyed biomass.
## Make up some data
head(ranch)
## cattle biomass year
## 1 20 0.09915519 2014
## 2 18 0.32291192 2015
## 3 19 0.35357170 2016
## 4 18 0.40782605 2017
## 5 17 0.67523442 2014
## 6 19 0.78686544 2015
plot(ranch$cattle, ranch$biomass, ylab="Biomass (kg/m)", xlab="Cattle (per ha)", pch=19)
We can see the relationship is linear, but these samples are not independent from each other, which violates an assumption of linear regression. These samples were collected over four different years 2014 - 2017. If you were interested in examining year effects on grazing you could include year as a fixed effect and have an interaction term grazing * year
to satisfy that assumption. However, what if the assumption is that the differences among years effect everything equally. That is to say, different years can have different biomass values at the sites because of grow conditions, but cattle at their respective densities, will always eat a consistent amount. In this case, you fit year as a random effect because you are not interested in year effects, but are still accounting the associated variation. Another way to consider it, is a random effect is something that should effect the y-intercept of your linear model and a fixed effect is something that would effect the slope. Although there are random effects that consider slope as well (see here), this will cover most instances.
NB One thing important to note is that random vs. fixed effects are determined by your research question. If your question was “What are the effects of inter-annual variability on grazing?” you must treat year as a fixed effect.
To conduct a mixed model requires the package lme4
and the random factor is fitted using the operator (1 | randomeff)
library(lme4)
m1 <- lmer(biomass ~ cattle + (1|year), data=ranch)
summary(m1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: biomass ~ cattle + (1 | year)
## Data: ranch
##
## REML criterion at convergence: 65.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.2432 -0.8638 -0.2438 0.5126 2.1241
##
## Random effects:
## Groups Name Variance Std.Dev.
## year (Intercept) 0.000 0.000
## Residual 1.373 1.172
## Number of obs: 20, groups: year, 4
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 6.57422 0.82442 7.974
## cattle -0.35766 0.06036 -5.925
##
## Correlation of Fixed Effects:
## (Intr)
## cattle -0.948
## convergence code: 0
## boundary (singular) fit: see ?isSingular
But wait?! There are no p.values? There is a fairly long, but humourous explanation by Douglas Bates who wrote the lme4
package as to why there are no reported p.values. The short version is that the degrees freedom can only be approximated and thus p.values cannot be calculated. There are a few solutions, the easiest being comparing the mixed model to one with just an intercept. This effectively determines if the fixed effect is signficant but gets unweidely the more predictor variables that are in your model.
m1 <- lmer(biomass ~ cattle + (1|year), data=ranch)
## boundary (singular) fit: see ?isSingular
m0 <- lmer(biomass ~ 1 + (1|year), data=ranch)
## boundary (singular) fit: see ?isSingular
anova(m1,m0, test="Chisq")
## refitting model(s) with ML (instead of REML)
## Data: ranch
## Models:
## m0: biomass ~ 1 + (1 | year)
## m1: biomass ~ cattle + (1 | year)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## m0 3 88.636 91.623 -41.318 82.636
## m1 4 68.996 72.979 -30.498 60.996 21.64 1 3.289e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The solution recommended by Bates is to use a Markov Chain Monte Carlo sample, but I find this computer intensive and some people struggle understanding the output. Instead, I take the same approach as SAS (i.e. Satterthwaite’s Method) to calculate p.values in mixed models that can be replicated using the lmerTest
package
library(lmerTest)
m1 <- lmer(biomass ~ cattle + (1|year), data=ranch)
anova(m1)
## Type III Analysis of Variance Table with Satterthwaite's method
## Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
## cattle 48.22 48.22 1 18 35.11 1.314e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
GAMs borrow all the flexibility of GLMs with the exception of being able to account for non-linearity. A person with the above knowledge could easily fit a basic GAM once they have learned the syntax in R. However, there is some degree of complexity in determining the smoothing term that is associated with the fit. Dr. Gavin Simpson has many resources and tutorials that do a great job explaining everything that is GAM. In a sense, a GAM will do almost anything you ever want because it is a GLM with no boundaries. There are two general caveats though before you consider the GAM route:
The main package for GAMs are mgcv
. We can explore fitting a simple GAM model with the previous ranch data. What changes in a GAM is the presence of a smoothing term.
library(mgcv)
head(gam.data)
## year ndvi
## 1 1980 1.53
## 2 1981 1.57
## 3 1982 1.53
## 4 1983 1.85
## 5 1984 1.84
## 6 1985 1.85
## Plot the data
ggplot(gam.data, aes(x=year, y=ndvi)) + geom_point()
## Run a linear model
m1 <- lm(ndvi~year, data=gam.data)
summary(m1)
##
## Call:
## lm(formula = ndvi ~ year, data = gam.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5634 -0.2160 -0.0344 0.2661 0.8303
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -29.277823 15.253649 -1.919 0.0648 .
## year 0.015790 0.007646 2.065 0.0479 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3808 on 29 degrees of freedom
## Multiple R-squared: 0.1282, Adjusted R-squared: 0.09815
## F-statistic: 4.265 on 1 and 29 DF, p-value: 0.04794
## add linear model to plot
ggplot(gam.data, aes(x=year, y=ndvi)) + geom_point()+ geom_smooth(method = "lm", formula = y ~x)
## check the fit
par(mfrow = c(2,2))
plot(m1)
When we plot changes in NDVI over three decades, we notice that there appears to be a non-linear trend, potentially following some form of cycle or determine by a driver that shifted twice significantly. If we were to fit a typical linear model, the response is significant. However, when we look at the diagnostic plots, there are some strongly suspicious findings. Particularly, with the residuals vs the fitted values.
Instead, we fit a GAM with year fitted to a smoothing term. You can specify the type of polynomial you want fit, but since this is more of a data exploration, I will leave that argument blank.
## Fit GAM
m2 <- gam(ndvi~s(year), data=gam.data, method="REML")
summary(m2)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## ndvi ~ s(year)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.22387 0.03134 70.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(year) 6.642 7.767 16.59 2.97e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.811 Deviance explained = 85.3%
## -REML = 2.5671 Scale est. = 0.030446 n = 31
## check diagnostics
par(mfrow = c(2,2))
gam.check(m2)
##
## Method: REML Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [-1.169553e-11,-1.179057e-12]
## (score 2.567067 & scale 0.03044605).
## Hessian positive definite, eigenvalue range [1.398249,15.10734].
## Model rank = 10 / 10
##
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
##
## k' edf k-index p-value
## s(year) 9.00 6.64 1.47 0.98
## see final plot
ggplot(gam.data, aes(x=year, y=ndvi)) + geom_point() + geom_smooth(method = "gam", formula = y ~s(x))
Here we can see from the diagnostic plots that the residuals more closely resemble the fitted values and that we have a relatively normal distribution in the residuals. Using ggplot
has a feature that allows for the plotting on GAMs in addition to the others we described, such as glm and linear models.
GAMs can also fitted random effects, and they fit a form of interaction that is similar to running two nested GAMs separated by the respective factors. Careful attention must be paid to the fitting of the smoothing terms and splines. It is not okay to trust the defaults.
## Change back plot pattern
par(mfrow = c(1,1))