Univariate analysis

2.0 What are univariate analyses?

A univariate analysis is a statistical test that compares a response variable to a series of predictors. There are many different versions including linear regression, ANOVAs, ANCOVAs, and GLMs. We are going to briefly touch upon the math and then go into the differences between these models.

First, let’s touch upon the math behind linear regressions. The form for a linear regression is as follows:

E [Y] = β X + ε

This equation states that the expected value of a dependent variable (E[Y]) is determined by the value of a predictor variable (X) multiplied by a parameter coefficient (β) plus a residual (ε). In other words, your response is determined by your predictor multiplied by some value plus or minus some difference.

Ex. Let’s say that you are a jogger that runs every day and you want to determine the speed that you run at. You record the duration and distance that you run every day. Thus, the predictor is the duration of your run per day, the response is the distance that was run, and the parameter coefficient is the speed that you run at. The residuals in this model are represented by unexplained variation from this relationship, such as you having to stop to tie your shoe that would increase the length but not the distance.

library(LearnCommAnalysis)


duration <- c(30,60,45,44,36,58,57,22,25)
distance <- c(2.5,5.1,3.9,3.9,3,4.9,4.8,1.7,2.1)

plot(duration,distance)

To determine this parameter coefficient we run a linear model using the ~ tilda as the regressor. The formula is represented as such:

m1 <- lm(distance~duration) ## linear model with response ~ predictor
summary(m1) ## see output of model

## 
## Call:
## lm(formula = distance ~ duration)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.10485 -0.05361 -0.02854  0.03275  0.17091 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.119385   0.096981  -1.231    0.258    
## duration     0.087465   0.002201  39.745 1.66e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09038 on 7 degrees of freedom
## Multiple R-squared:  0.9956, Adjusted R-squared:  0.995 
## F-statistic:  1580 on 1 and 7 DF,  p-value: 1.663e-09

Here the parameter coefficient is under Estimate as 0.087 kilometres per minute or 5.22 kph. We also notice that the p value is extremely small and the R squared close to 1. These inform us that the model fits accurately. We can plot this relationship using the abline function. The distance between the line and each point is the residuals ε

plot(duration,distance)
abline(m1, col="red") ## adds line from model

A Generalized Linear Model (GLM) is extremely similar to that of the LM in this regard. However, where the GLM differs is that it has an extra parameter called a “link function” (g^-1). The link function determines how the mean of the response is related to the predictor. Thus, the GLM is represented as:

E [Y] = (β X + ε) * g^-1

Why use a GLM? There are three reasons why GLMs may be more appropriate:

The response variable Yi is not normally distributed or the range is restricted (e.g. binomial, Poisson)
The variance of Y depends on the mean.
GLMs can fit to particular distributions without transformations. The benefits of this include i) the homogeneity of variance does NOT need to be satisfied and ii) the response variable does not need to be changed.

This last assumption has some associated complexities that could be read more on here. However, as a rule, log transforming a response variable and then fitting it to a normal distribution is not the same as conducting a GLM with a log link.

E(log(Y)) ≠ log(E(Y))

2.1 Working example of a GLM

To best understand how a GLM is executed, it is a good idea to go through an example. As said above, at the most basic, a GLM compares a response to one or more predictors. Let us consider a population of people that have a disease. Three drugs are being considered as potentially curing this disease. Eighty people are randomly divided into four groups, three for each drug treatment and one as a control. We would like to see if a drug was effective.

This is an example of a logistic regression that is a member of the GLM family fit to a binomial distribution. The response variable here is not normal because the outcome of an individual can only be two values: 1 = cured or 0 = not cured.

##load data
head(drug.trial)

##   response predictor
## 1        1   control
## 2        0   control
## 3        0   control
## 4        0   control
## 5        1   control
## 6        0   control

## drug.trial <- read.csv("https://raw.githubusercontent.com/afilazzola/YorkU.BioStats.2019/master/data/drug.trial.csv")

The formula has the same syntax as the LM shown above. The response is followed by the tilda and the predictors. Here, we also specify where the data is coming from. In this case it is coming from the drug.trial data frame we created. Lastly, the main nuance of using the GLM function is the specification of a family. Since our data is binary and follows a binomial distribution, that is what we refer to it as.

m2<- glm(response~predictor, data=drug.trial, family="binomial") ##glm for drugs potential to curse

summary(m2) ##output of model

## 
## Call:
## glm(formula = response ~ predictor, family = "binomial", data = drug.trial)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6651  -0.8446  -0.7585   1.0935   1.6651  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     -1.0986     0.5164  -2.127  0.03338 * 
## predictordrug1   0.2513     0.7105   0.354  0.72354   
## predictordrug2   1.2993     0.6846   1.898  0.05772 . 
## predictordrug3   2.1972     0.7303   3.009  0.00262 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 110.453  on 79  degrees of freedom
## Residual deviance:  96.947  on 76  degrees of freedom
## AIC: 104.95
## 
## Number of Fisher Scoring iterations: 4

The output of the model reveals lots of information. The most important things to look at are the coefficients and the residual deviance. The GLM takes one of the predictor factors and treats it as a “dummy” variable. The result is that all the other factors are compared against that first value. These P-values therefore represent a comparison of each drug to the control which in itself is interesting, but not exactly what we are looking for. It is also important to examine the residual deviance and the residual degrees of freedom. As a general rule of thumb, the deviance should not be more than double the degrees of freedom, otherwise your data is overdispersed. You can read more here about dummy variables and overdisperson.

The summary results does not quite answer our question. We are interested if the drugs have an effect and if so, which drug is the best? To do this we need to conduct a significance test on the predictor to see if the effect is significantly different than zero. For binomial distributions this can be done with a chisq test.

anova(m2, test="Chisq") ## chisq test on predictor from model

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: response
## 
## Terms added sequentially (first to last)
## 
## 
##           Df Deviance Resid. Df Resid. Dev Pr(>Chi)   
## NULL                         79    110.453            
## predictor  3   13.506        76     96.947 0.003661 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here we see that degrees freedom from the model (i.e. n-1), the residual deviance and DF again from the model output. The p-value here identifies that the effect is significant. R for some reason refers to the Chi-sq value from this test as “deviance”. Judging by the high Chi-sq value, we can believe that the effect of this predictor is significant. We now must determine which drug treatment is most effective.

2.2 Pairwise comparisons and packages

To see which drug is most effective, we need to conduct pairwise comparisons. There are many packages that are available in R to do this. One of the easier ones to work with is lsmeans.

library(emmeans)

## Warning: package 'emmeans' was built under R version 3.5.3

The syntax for conducting a pairwise comparison is model, type of comparison (in this case pairwise), and lastly adjustment type. Tukey is a common technique that is also used for ANOVAs.

emmeans(m2, pairwise~predictor, adjust="tukey") ##conduct pairwise comparisons on predictor

## $emmeans
##  predictor emmean    SE  df asymp.LCL asymp.UCL
##  control   -1.099 0.516 Inf   -2.1107   -0.0865
##  drug1     -0.847 0.488 Inf   -1.8037    0.1091
##  drug2      0.201 0.449 Inf   -0.6803    1.0816
##  drug3      1.099 0.516 Inf    0.0865    2.1107
## 
## Results are given on the logit (not the response) scale. 
## Confidence level used: 0.95 
## 
## $contrasts
##  contrast        estimate    SE  df z.ratio p.value
##  control - drug1   -0.251 0.710 Inf -0.354  0.9848 
##  control - drug2   -1.299 0.685 Inf -1.898  0.2289 
##  control - drug3   -2.197 0.730 Inf -3.009  0.0140 
##  drug1 - drug2     -1.048 0.663 Inf -1.580  0.3903 
##  drug1 - drug3     -1.946 0.710 Inf -2.739  0.0313 
##  drug2 - drug3     -0.898 0.685 Inf -1.312  0.5555 
## 
## Results are given on the log odds ratio (not the response) scale. 
## P value adjustment: tukey method for comparing a family of 4 estimates

The output of this function provides a lot of information including the model statistics for each of the groups. However, the most relative content now to our question is the p.value to determine which groups were different from one another and the estimate to see in what direction. Here, we can see that the only group that is significantly different is drug3. This drug was more effective than a control and drug1, although it had a similar cure rate as drug2. This is a typical workflow for a GLM that begins by testing the effect of the predictor and is followed by a pairwise comparison to determine where the significance can be attributed to.

2.3 Different types of GLMs

There are many different types of GLMs that are possible in addition to the binomial family. Some of them involve changing the family in the glm function to the most appropriate data distribution. Other commons families include:

family	type of data
binomial	binary (y = 0,1) or proportion (0<y<1).
poisson	discrete count (y = 0, 1, 2, 3 … i).
gamma	continuous data that is right skewed, positive, and is on the log scale.
quasibinomial	binary or proportion data that is overdispersed
quasipoisson	discrete count data that is overdispersed.

Quasibinomial and quasipoisson compensate for overdispersed data with an extra parameter ϕ. Theta here is a dispersion parameter that attempts to describe additional variance that cannot be explained by the distribution alone. As a general rule of thumb follow this workflow.

2.4 Try it yourself

Here is an example of bird surveys in Southern Ontario in forests, urban parks, and city centers. Determine if there is a difference between the habitats and which was the greatest.

head(bird.data)

##   species.rich habitat
## 1            5  forest
## 2            7  forest
## 3            1  forest
## 4            7  forest
## 5            3  forest
## 6            5  forest

2.5 Generalized Linear Mixed Model

A Generalized Linear Mixed Model (GLMM) is a type of GLM that includes a random effect. Random effects are variables that are included in the model to account for common variance among your replicates. Let’s consider a regular linear regression, where you have different levels of animal grazing. Here, the number of cattle per hectare is a fixed effect and is represented by the x-axis (predictor variable). The response variable is amount of surveyed biomass.

## Make up some data
head(ranch)

##   cattle    biomass year
## 1     20 0.09915519 2014
## 2     18 0.32291192 2015
## 3     19 0.35357170 2016
## 4     18 0.40782605 2017
## 5     17 0.67523442 2014
## 6     19 0.78686544 2015

plot(ranch$cattle, ranch$biomass, ylab="Biomass (kg/m)", xlab="Cattle (per ha)", pch=19)

We can see the relationship is linear, but these samples are not independent from each other, which violates an assumption of linear regression. These samples were collected over four different years 2014 - 2017. If you were interested in examining year effects on grazing you could include year as a fixed effect and have an interaction term grazing * year to satisfy that assumption. However, what if the assumption is that the differences among years effect everything equally. That is to say, different years can have different biomass values at the sites because of grow conditions, but cattle at their respective densities, will always eat a consistent amount. In this case, you fit year as a random effect because you are not interested in year effects, but are still accounting the associated variation. Another way to consider it, is a random effect is something that should effect the y-intercept of your linear model and a fixed effect is something that would effect the slope. Although there are random effects that consider slope as well (see here), this will cover most instances.

NB One thing important to note is that random vs. fixed effects are determined by your research question. If your question was “What are the effects of inter-annual variability on grazing?” you must treat year as a fixed effect.

To conduct a mixed model requires the package lme4 and the random factor is fitted using the operator (1 | randomeff)

library(lme4)

m1 <- lmer(biomass ~ cattle + (1|year), data=ranch)
summary(m1)

## Linear mixed model fit by REML ['lmerMod']
## Formula: biomass ~ cattle + (1 | year)
##    Data: ranch
## 
## REML criterion at convergence: 65.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.2432 -0.8638 -0.2438  0.5126  2.1241 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  year     (Intercept) 0.000    0.000   
##  Residual             1.373    1.172   
## Number of obs: 20, groups:  year, 4
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  6.57422    0.82442   7.974
## cattle      -0.35766    0.06036  -5.925
## 
## Correlation of Fixed Effects:
##        (Intr)
## cattle -0.948
## convergence code: 0
## boundary (singular) fit: see ?isSingular

But wait?! There are no p.values? There is a fairly long, but humourous explanation by Douglas Bates who wrote the lme4 package as to why there are no reported p.values. The short version is that the degrees freedom can only be approximated and thus p.values cannot be calculated. There are a few solutions, the easiest being comparing the mixed model to one with just an intercept. This effectively determines if the fixed effect is signficant but gets unweidely the more predictor variables that are in your model.

m1 <- lmer(biomass ~ cattle + (1|year), data=ranch)

## boundary (singular) fit: see ?isSingular

m0 <- lmer(biomass ~ 1 + (1|year), data=ranch)

## boundary (singular) fit: see ?isSingular

anova(m1,m0, test="Chisq")

## refitting model(s) with ML (instead of REML)

## Data: ranch
## Models:
## m0: biomass ~ 1 + (1 | year)
## m1: biomass ~ cattle + (1 | year)
##    Df    AIC    BIC  logLik deviance Chisq Chi Df Pr(>Chisq)    
## m0  3 88.636 91.623 -41.318   82.636                            
## m1  4 68.996 72.979 -30.498   60.996 21.64      1  3.289e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The solution recommended by Bates is to use a Markov Chain Monte Carlo sample, but I find this computer intensive and some people struggle understanding the output. Instead, I take the same approach as SAS (i.e. Satterthwaite’s Method) to calculate p.values in mixed models that can be replicated using the lmerTest package

library(lmerTest)

m1 <- lmer(biomass ~ cattle + (1|year), data=ranch)
anova(m1)

## Type III Analysis of Variance Table with Satterthwaite's method
##        Sum Sq Mean Sq NumDF DenDF F value    Pr(>F)    
## cattle  48.22   48.22     1    18   35.11 1.314e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.6 Generalized Additive Models

GAMs borrow all the flexibility of GLMs with the exception of being able to account for non-linearity. A person with the above knowledge could easily fit a basic GAM once they have learned the syntax in R. However, there is some degree of complexity in determining the smoothing term that is associated with the fit. Dr. Gavin Simpson has many resources and tutorials that do a great job explaining everything that is GAM. In a sense, a GAM will do almost anything you ever want because it is a GLM with no boundaries. There are two general caveats though before you consider the GAM route:

Fitting polynomials beyond quadratic rarely make sense ecologically (but remember that rarely is not the same as never and climate data often requires more complex polynomials)
Currently, there is preference for GLMs over GAMs particularly when one includes interaction terms.

The main package for GAMs are mgcv. We can explore fitting a simple GAM model with the previous ranch data. What changes in a GAM is the presence of a smoothing term.

library(mgcv)

head(gam.data)

##   year ndvi
## 1 1980 1.53
## 2 1981 1.57
## 3 1982 1.53
## 4 1983 1.85
## 5 1984 1.84
## 6 1985 1.85

## Plot the data
ggplot(gam.data, aes(x=year, y=ndvi)) + geom_point()

## Run a linear model
m1 <- lm(ndvi~year, data=gam.data)
summary(m1)

## 
## Call:
## lm(formula = ndvi ~ year, data = gam.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5634 -0.2160 -0.0344  0.2661  0.8303 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -29.277823  15.253649  -1.919   0.0648 .
## year          0.015790   0.007646   2.065   0.0479 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3808 on 29 degrees of freedom
## Multiple R-squared:  0.1282, Adjusted R-squared:  0.09815 
## F-statistic: 4.265 on 1 and 29 DF,  p-value: 0.04794

## add linear model to plot
ggplot(gam.data, aes(x=year, y=ndvi)) + geom_point()+  geom_smooth(method = "lm", formula = y ~x)

## check the fit
par(mfrow = c(2,2))
plot(m1)

When we plot changes in NDVI over three decades, we notice that there appears to be a non-linear trend, potentially following some form of cycle or determine by a driver that shifted twice significantly. If we were to fit a typical linear model, the response is significant. However, when we look at the diagnostic plots, there are some strongly suspicious findings. Particularly, with the residuals vs the fitted values.

Instead, we fit a GAM with year fitted to a smoothing term. You can specify the type of polynomial you want fit, but since this is more of a data exploration, I will leave that argument blank.

## Fit GAM
m2 <- gam(ndvi~s(year), data=gam.data, method="REML")
summary(m2)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## ndvi ~ s(year)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.22387    0.03134   70.96   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##           edf Ref.df     F  p-value    
## s(year) 6.642  7.767 16.59 2.97e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.811   Deviance explained = 85.3%
## -REML = 2.5671  Scale est. = 0.030446  n = 31

## check diagnostics
par(mfrow = c(2,2))
gam.check(m2)

## 
## Method: REML   Optimizer: outer newton
## full convergence after 6 iterations.
## Gradient range [-1.169553e-11,-1.179057e-12]
## (score 2.567067 & scale 0.03044605).
## Hessian positive definite, eigenvalue range [1.398249,15.10734].
## Model rank =  10 / 10 
## 
## Basis dimension (k) checking results. Low p-value (k-index<1) may
## indicate that k is too low, especially if edf is close to k'.
## 
##           k'  edf k-index p-value
## s(year) 9.00 6.64    1.47    0.98

## see final plot
ggplot(gam.data, aes(x=year, y=ndvi)) + geom_point() +  geom_smooth(method = "gam", formula = y ~s(x))

Here we can see from the diagnostic plots that the residuals more closely resemble the fitted values and that we have a relatively normal distribution in the residuals. Using ggplot has a feature that allows for the plotting on GAMs in addition to the others we described, such as glm and linear models.

GAMs can also fitted random effects, and they fit a form of interaction that is similar to running two nested GAMs separated by the respective factors. Careful attention must be paid to the fitting of the smoothing terms and splines. It is not okay to trust the defaults.

## Change back plot pattern
par(mfrow = c(1,1))