For Loops

To vectorize or not to vectorize

N.B. Start with a new R project

Basics

The concept of a for loop is something pervsaive across many programming languages. Packages and vectorization are generally optimized versions for loops that are executed in a more efficient programming language such as C++. However, R still allows utilization of for loops.

When I was first getting started trying to iterate through many tasks, I found for loops relatively inutitive compared to some other methods such as vectorization. I believe the greater understandability has to do with the structure. A for loop, similar to a function, is a task completed normally but then with one variable that changes, which allows the iterations. Let’s take a look at an example of what that might look like getting the correlation between variables among some groups.

We are going to use the penguin dataset by Allison Horst.

library(palmerpenguins)
penguins <- penguins[!is.na(penguins$body_mass_g),]
penguins <- data.frame(penguins) ## drop NAs and convert to  data.frame
str(penguins)
## 'data.frame':    342 obs. of  8 variables:
##  $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ bill_length_mm   : num  39.1 39.5 40.3 36.7 39.3 38.9 39.2 34.1 42 37.8 ...
##  $ bill_depth_mm    : num  18.7 17.4 18 19.3 20.6 17.8 19.6 18.1 20.2 17.1 ...
##  $ flipper_length_mm: int  181 186 195 193 190 181 195 193 190 186 ...
##  $ body_mass_g      : int  3750 3800 3250 3450 3650 3625 4675 3475 4250 3300 ...
##  $ sex              : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 NA NA NA ...
##  $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
### separate linear models for each species
adelie <- subset(penguins, species=="Adelie")
corAde <- cor(adelie$bill_length_mm, adelie$bill_depth_mm, use="complete.obs")
chinstrap <- subset(penguins, species=="Chinstrap")
corchin <- cor(chinstrap$bill_length_mm, chinstrap$bill_depth_mm, use="complete.obs")
gentoo <- subset(penguins, species=="Gentoo")
corGen <- cor(gentoo$bill_length_mm, gentoo$bill_depth_mm, use="complete.obs")
penguinDF <- data.frame(species=c("Adelie","Chinstrap","Gentoo"), 
                        corValue =c(corAde,corchin,corGen) )

The above approach has a few issues with it. Besides being labour intensive, the copy-paste of multiple lines runs of the risk of an error appearing. The person analyzing this data might miss an argument, such as the use arguement, or might misspell one of the target species they are analyzing. Without an error message produced by R, there is no way to determine that an error was made along the way. Most importantly, this strategy does not scale easily. For three species, it is annoying, for 1000 species it is impractical.

We can iterate through these same operations using a for loop with less code and less potential of errors. First let’s take a look at the basics of what the for loop syntax is. At the beginning of the loop you specify the character to iterate through and what values you want them to be assigned. Here list the numbers 1 through 5 by printing the iterating value of i.

for(i in 1:5){
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

We can extend this logic to our previous example of subsetting and correlations. Copying down the same syntax from the correlation example, we create a generic dataframe that will be written over each iteration. The important difference here is that as the loop works it will continue to write over objects. We will need to specify a value that comes out so that we get a value for each iteration.

## list species
speciesNames <- c("Adelie","Chinstrap","Gentoo")

## Empty vector to fill
outCor <- c()

for(i in 1:3){
tempPenguin <- subset(penguins, species==speciesNames[i]) ## subset species by that iteration
corVal <- cor(tempPenguin$bill_length_mm, tempPenguin$bill_depth_mm, use="complete.obs") ##conduct correlation
outCor <- c(outCor,corVal) ## join empty vector with new cor value
}
outCor
## [1] 0.3914917 0.6535362 0.6433839

While it is common to use a number to iterate through, you can also use characters or factors like we have in this example. There are some instances where this would be preferable.

## Empty vector to fill
outCor <- c()

for(i in c("Adelie","Chinstrap","Gentoo")){
tempPenguin <- subset(penguins, species== i ) ## subset species by that iteration
corVal <- cor(tempPenguin$bill_length_mm, tempPenguin$bill_depth_mm, use="complete.obs") ##conduct correlation
outCor <- c(outCor,corVal) ## join empty vector with new cor value
}
outCor
## [1] 0.3914917 0.6535362 0.6433839

Let’s take a look at another example but with linear models where we extract the co-efficients from those models.

## Empty data.frame to fill
coefout <- data.frame()

for(i in c("Adelie","Chinstrap","Gentoo")){
model <- lm(flipper_length_mm ~ body_mass_g, data= penguins, species == i )
dfTemp <- data.frame(species= i, 
                      coefficient = coef(model)[2],
                      intercept = coef(model)[2])
coefout <- rbind(coefout,dfTemp)
}
coefout
##                species coefficient   intercept
## body_mass_g     Adelie 0.006676867 0.006676867
## body_mass_g1 Chinstrap 0.011905064 0.011905064
## body_mass_g2    Gentoo 0.009039136 0.009039136

It is also possible to nest for loops within one another. Nested for loops are more common and more efficient in other programming languages. However, they are still possible in R and have their utility. We can repeat the above example using another subset.

## Empty data.frame to fill
coefout <- data.frame()

for(i in c("Adelie","Chinstrap","Gentoo")){
  for(j in 2007:2009){
    
model <- lm(flipper_length_mm ~ body_mass_g, data= penguins, species == i & year == j )
dfTemp <- data.frame(species= i, 
                     year = j,
                      coefficient = coef(model)[2],
                      intercept = coef(model)[2])
coefout <- rbind(coefout,dfTemp)
  }
}
coefout
##                species year coefficient   intercept
## body_mass_g     Adelie 2007 0.006421678 0.006421678
## body_mass_g1    Adelie 2008 0.007188878 0.007188878
## body_mass_g2    Adelie 2009 0.006616450 0.006616450
## body_mass_g3 Chinstrap 2007 0.009309690 0.009309690
## body_mass_g4 Chinstrap 2008 0.010534351 0.010534351
## body_mass_g5 Chinstrap 2009 0.015660848 0.015660848
## body_mass_g6    Gentoo 2007 0.006379473 0.006379473
## body_mass_g7    Gentoo 2008 0.010169139 0.010169139
## body_mass_g8    Gentoo 2009 0.011224290 0.011224290

Here the sky is the limit. Really, anything you can do once, can be set to run infinitely using for loops. The vectors supplied to i or j can be hundreds, thousands, or longer. You can have multiple nestings within one another. Our examples here use dataframes and vectors, but all this could be repeated with rasters, plots, graphics, or any object loaded into R. There are many, many options. This is where I believe the for loop has its greatest strength. Using this formula, anything that can be loaded into R can be replicated this way.

Formula in designing for loops

There is a formula that I try and follow. The formula is as follows:

  1. Create a object to receive exports from the loop.
  2. Specify a vector to change (either a number to iterate through or specific elements)
  3. Run the operation (remember most objects in this section will be constantly written over each iteration)
  4. Export needed elements to external object

Let’s try that formula with some plots

par(mfrow=c(3,3))
for(i in 1:3){
  for(j in 1:3){
  tempdf <- rnorm(1000, mean= i, sd = j )
  hist(tempdf, main=paste0("Mean = ",i," & SD = ",j))
  }
}

Next, we are going to look at troubleshooting some problems in for loops. The most common error that will arise is that one iteration will break. This can arise for a number of reasons, but typically involves an error message in a function that prevents completion of the loop, such as the example below.

par(mfrow=c(1,3))
for(i in 2006:2008){
  m1 <- lm(bill_length_mm ~ body_mass_g, data=penguins, year==i)
  hist(residuals(m1))
}

Troubleshooting loops

Troubleshooting is the next logical step. However, there can often be times you want the entire loop to complete because either a) the errors can be skipped, or b) all errors need to be identified to properly troubleshoot. We will explore next for a and TryCatch for b. There is also the break function for those looking to exit the loop specifically at the error to try and resolve it.

Flowcharts curtosy of www.datamentor.io

## Loop without conditional next
valOut <- c() ## empty vector
for (i in 0:5){
  tempVal <- log(i) 
  valOut <- c(valOut,tempVal)
}
valOut ## output
## [1]      -Inf 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
## Adding in next statement
valOut <- c()
for (i in 0:5){
  tempVal <- log(i) 
  if(is.infinite(tempVal)) { ## test expression of the loop
  next ## skip remainder of loop if test is true
  }
  valOut <- c(valOut,tempVal)
}
valOut
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379

While next is good for pushing through on conditions, it is less helpful when it comes to errors. That is where tryCatch comes in. tryCatch will allow the loop to continue through and has the helpful option of outputting an error message. Let’s apply it to the above example with the linear model.

par(mfrow=c(1,3))
for(i in 2006:2008){
  tryCatch({ ## wrap the entire loop in a Try catch
  m1 <- lm(bill_length_mm ~ body_mass_g, data=penguins, year==i)
  hist(residuals(m1))
  }, error = function(e) print(i)) ## specify what occurs with an error messagge
}
## [1] 2006

Here we specify which iteration of the loop causes an error to occupy. Importantly, the iterations that are without error are still produced. These are all effective at trying to resolve bugs and troubleshoot errors in complex loops or massive iterations. There is also no limit to the applicatino of tryCatch. Multiple tryCatch conditions can be implemement at different stages of particularly long loops. This allows for the identification of errors at different points.

While we have showed the application of tryCatch for loops, its has an equally important role in functions. Using the same syntax and error identification, one can use tryCatch to develop functions that provide useful error messages.

Practice using for loops

Let’s practice writing for loops!

Ex1. For our first prompt, your task is to fix this faulty for loop:

## Create output dataframe
outDF <- data.frame()

for(i in c(4,6,8)){
  m1 <- lm(mpg ~ hp, data=mtcars, cyl==i) ## linear model through mtcars dataset for each cylinder type
  results <- data.frame(anova(m1)) ## extract model co-efficients
  results[,"cyl"] <- i
  outDF <- rbind(results)
}
outDF
##           Df    Sum.Sq  Mean.Sq F.value    Pr..F. cyl
## hp         1  6.854271 6.854271 1.04985 0.3257538   8
## Residuals 12 78.345729 6.528811      NA        NA   8

Answer Ex.1

## Create output dataframe
outDF <- data.frame()

for(i in c(4,6,8)){
  m1 <- lm(mpg ~ hp, data=mtcars, cyl==i) ## linear model through mtcars dataset for each cylinder type
  results <- data.frame(anova(m1)) ## extract model co-efficients
  results[,"cyl"] <- i
  outDF <- rbind(outDF, results) ## NEEDED TO ADD BIND TO OUTPUT DF
}
outDF
##            Df      Sum.Sq    Mean.Sq    F.value     Pr..F. cyl
## hp          1  55.7389699 55.7389699 3.39764764 0.09839858   4
## Residuals   9 147.6464846 16.4051650         NA         NA   4
## hp1         1   0.2046882  0.2046882 0.08205609 0.78602021   6
## Residuals1  5  12.4724547  2.4944909         NA         NA   6
## hp2         1   6.8542713  6.8542713 1.04984990 0.32575378   8
## Residuals2 12  78.3457287  6.5288107         NA         NA   8


Ex2. Using the mtcars dataset, write a for loop that conducts ANOVAs testing for differences in miles per gallon (mpg) for cylinder type, gear, and am. *hint: the aov function does not inherently accept character vectors.

## Create output dataframe
anovaOut <- data.frame()

for(i in c("cyl","gear","am")){
 i 
}

Answer Ex.2

for(i in c("cyl","gear","am")){
 predictor <- as.factor( mtcars[,i])
 m1 <- aov(mpg ~ predictor, data=mtcars)
 print(summary(m1))
}
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## predictor    2  824.8   412.4    39.7 4.98e-09 ***
## Residuals   29  301.3    10.4                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## predictor    2  483.2  241.62    10.9 0.000295 ***
## Residuals   29  642.8   22.17                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## predictor    1  405.2   405.2   16.86 0.000285 ***
## Residuals   30  720.9    24.0                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Ex3. Lastly, check for normality among the residuals of the loop we created above using the Shapiro-Wilks test shapiro.test:

for(i in 2007:2009){
  m1 <- lm(bill_length_mm ~ body_mass_g, data=penguins, year==i)
  
}

Answer Ex.3

for(i in 2007:2009){
  m1 <- lm(bill_length_mm ~ body_mass_g, data=penguins, year==i)
  shap1 <- shapiro.test(residuals(m1))
  print(shap1)
}
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m1)
## W = 0.95606, p-value = 0.001211
## 
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m1)
## W = 0.93658, p-value = 4e-05
## 
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(m1)
## W = 0.9564, p-value = 0.0006984


Next Module

Home Vectorization