N.B. Start with a new R project
The concept of a for
loop is something pervsaive across many programming languages. Packages and vectorization are generally optimized versions for
loops that are executed in a more efficient programming language such as C++. However, R still allows utilization of for
loops.
When I was first getting started trying to iterate through many tasks, I found for
loops relatively inutitive compared to some other methods such as vectorization. I believe the greater understandability has to do with the structure. A for
loop, similar to a function, is a task completed normally but then with one variable that changes, which allows the iterations. Let’s take a look at an example of what that might look like getting the correlation between variables among some groups.
We are going to use the penguin dataset by Allison Horst.
library(palmerpenguins)
penguins <- penguins[!is.na(penguins$body_mass_g),]
penguins <- data.frame(penguins) ## drop NAs and convert to data.frame
str(penguins)
## 'data.frame': 342 obs. of 8 variables:
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num 39.1 39.5 40.3 36.7 39.3 38.9 39.2 34.1 42 37.8 ...
## $ bill_depth_mm : num 18.7 17.4 18 19.3 20.6 17.8 19.6 18.1 20.2 17.1 ...
## $ flipper_length_mm: int 181 186 195 193 190 181 195 193 190 186 ...
## $ body_mass_g : int 3750 3800 3250 3450 3650 3625 4675 3475 4250 3300 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 1 2 NA NA NA ...
## $ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
### separate linear models for each species
adelie <- subset(penguins, species=="Adelie")
corAde <- cor(adelie$bill_length_mm, adelie$bill_depth_mm, use="complete.obs")
chinstrap <- subset(penguins, species=="Chinstrap")
corchin <- cor(chinstrap$bill_length_mm, chinstrap$bill_depth_mm, use="complete.obs")
gentoo <- subset(penguins, species=="Gentoo")
corGen <- cor(gentoo$bill_length_mm, gentoo$bill_depth_mm, use="complete.obs")
penguinDF <- data.frame(species=c("Adelie","Chinstrap","Gentoo"),
corValue =c(corAde,corchin,corGen) )
The above approach has a few issues with it. Besides being labour intensive, the copy-paste of multiple lines runs of the risk of an error appearing. The person analyzing this data might miss an argument, such as the use
arguement, or might misspell one of the target species they are analyzing. Without an error message produced by R, there is no way to determine that an error was made along the way. Most importantly, this strategy does not scale easily. For three species, it is annoying, for 1000 species it is impractical.
We can iterate through these same operations using a for
loop with less code and less potential of errors. First let’s take a look at the basics of what the for
loop syntax is. At the beginning of the loop you specify the character to iterate through and what values you want them to be assigned. Here list the numbers 1 through 5 by printing the iterating value of i.
for(i in 1:5){
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
We can extend this logic to our previous example of subsetting and correlations. Copying down the same syntax from the correlation example, we create a generic dataframe that will be written over each iteration. The important difference here is that as the loop works it will continue to write over objects. We will need to specify a value that comes out so that we get a value for each iteration.
## list species
speciesNames <- c("Adelie","Chinstrap","Gentoo")
## Empty vector to fill
outCor <- c()
for(i in 1:3){
tempPenguin <- subset(penguins, species==speciesNames[i]) ## subset species by that iteration
corVal <- cor(tempPenguin$bill_length_mm, tempPenguin$bill_depth_mm, use="complete.obs") ##conduct correlation
outCor <- c(outCor,corVal) ## join empty vector with new cor value
}
outCor
## [1] 0.3914917 0.6535362 0.6433839
While it is common to use a number to iterate through, you can also use characters or factors like we have in this example. There are some instances where this would be preferable.
## Empty vector to fill
outCor <- c()
for(i in c("Adelie","Chinstrap","Gentoo")){
tempPenguin <- subset(penguins, species== i ) ## subset species by that iteration
corVal <- cor(tempPenguin$bill_length_mm, tempPenguin$bill_depth_mm, use="complete.obs") ##conduct correlation
outCor <- c(outCor,corVal) ## join empty vector with new cor value
}
outCor
## [1] 0.3914917 0.6535362 0.6433839
Let’s take a look at another example but with linear models where we extract the co-efficients from those models.
## Empty data.frame to fill
coefout <- data.frame()
for(i in c("Adelie","Chinstrap","Gentoo")){
model <- lm(flipper_length_mm ~ body_mass_g, data= penguins, species == i )
dfTemp <- data.frame(species= i,
coefficient = coef(model)[2],
intercept = coef(model)[2])
coefout <- rbind(coefout,dfTemp)
}
coefout
## species coefficient intercept
## body_mass_g Adelie 0.006676867 0.006676867
## body_mass_g1 Chinstrap 0.011905064 0.011905064
## body_mass_g2 Gentoo 0.009039136 0.009039136
It is also possible to nest for
loops within one another. Nested for loops are more common and more efficient in other programming languages. However, they are still possible in R and have their utility. We can repeat the above example using another subset.
## Empty data.frame to fill
coefout <- data.frame()
for(i in c("Adelie","Chinstrap","Gentoo")){
for(j in 2007:2009){
model <- lm(flipper_length_mm ~ body_mass_g, data= penguins, species == i & year == j )
dfTemp <- data.frame(species= i,
year = j,
coefficient = coef(model)[2],
intercept = coef(model)[2])
coefout <- rbind(coefout,dfTemp)
}
}
coefout
## species year coefficient intercept
## body_mass_g Adelie 2007 0.006421678 0.006421678
## body_mass_g1 Adelie 2008 0.007188878 0.007188878
## body_mass_g2 Adelie 2009 0.006616450 0.006616450
## body_mass_g3 Chinstrap 2007 0.009309690 0.009309690
## body_mass_g4 Chinstrap 2008 0.010534351 0.010534351
## body_mass_g5 Chinstrap 2009 0.015660848 0.015660848
## body_mass_g6 Gentoo 2007 0.006379473 0.006379473
## body_mass_g7 Gentoo 2008 0.010169139 0.010169139
## body_mass_g8 Gentoo 2009 0.011224290 0.011224290
Here the sky is the limit. Really, anything you can do once, can be set to run infinitely using for loops. The vectors supplied to i
or j
can be hundreds, thousands, or longer. You can have multiple nestings within one another. Our examples here use dataframes and vectors, but all this could be repeated with rasters, plots, graphics, or any object loaded into R. There are many, many options. This is where I believe the for
loop has its greatest strength. Using this formula, anything that can be loaded into R can be replicated this way.
for loops
There is a formula that I try and follow. The formula is as follows:
Let’s try that formula with some plots
par(mfrow=c(3,3))
for(i in 1:3){
for(j in 1:3){
tempdf <- rnorm(1000, mean= i, sd = j )
hist(tempdf, main=paste0("Mean = ",i," & SD = ",j))
}
}
Next, we are going to look at troubleshooting some problems in for
loops. The most common error that will arise is that one iteration will break. This can arise for a number of reasons, but typically involves an error message in a function that prevents completion of the loop, such as the example below.
par(mfrow=c(1,3))
for(i in 2006:2008){
m1 <- lm(bill_length_mm ~ body_mass_g, data=penguins, year==i)
hist(residuals(m1))
}
Troubleshooting is the next logical step. However, there can often be times you want the entire loop to complete because either a) the errors can be skipped, or b) all errors need to be identified to properly troubleshoot. We will explore next
for a and TryCatch
for b. There is also the break
function for those looking to exit the loop specifically at the error to try and resolve it.
Flowcharts curtosy of www.datamentor.io
## Loop without conditional next
valOut <- c() ## empty vector
for (i in 0:5){
tempVal <- log(i)
valOut <- c(valOut,tempVal)
}
valOut ## output
## [1] -Inf 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
## Adding in next statement
valOut <- c()
for (i in 0:5){
tempVal <- log(i)
if(is.infinite(tempVal)) { ## test expression of the loop
next ## skip remainder of loop if test is true
}
valOut <- c(valOut,tempVal)
}
valOut
## [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379
While next
is good for pushing through on conditions, it is less helpful when it comes to errors. That is where tryCatch
comes in. tryCatch
will allow the loop to continue through and has the helpful option of outputting an error message. Let’s apply it to the above example with the linear model.
par(mfrow=c(1,3))
for(i in 2006:2008){
tryCatch({ ## wrap the entire loop in a Try catch
m1 <- lm(bill_length_mm ~ body_mass_g, data=penguins, year==i)
hist(residuals(m1))
}, error = function(e) print(i)) ## specify what occurs with an error messagge
}
## [1] 2006
Here we specify which iteration of the loop causes an error to occupy. Importantly, the iterations that are without error are still produced. These are all effective at trying to resolve bugs and troubleshoot errors in complex loops or massive iterations. There is also no limit to the applicatino of tryCatch
. Multiple tryCatch
conditions can be implemement at different stages of particularly long loops. This allows for the identification of errors at different points.
While we have showed the application of tryCatch
for loops, its has an equally important role in functions. Using the same syntax and error identification, one can use tryCatch
to develop functions that provide useful error messages.
for
loopsLet’s practice writing for
loops!
Ex1. For our first prompt, your task is to fix this faulty for loop:
## Create output dataframe
outDF <- data.frame()
for(i in c(4,6,8)){
m1 <- lm(mpg ~ hp, data=mtcars, cyl==i) ## linear model through mtcars dataset for each cylinder type
results <- data.frame(anova(m1)) ## extract model co-efficients
results[,"cyl"] <- i
outDF <- rbind(results)
}
outDF
## Df Sum.Sq Mean.Sq F.value Pr..F. cyl
## hp 1 6.854271 6.854271 1.04985 0.3257538 8
## Residuals 12 78.345729 6.528811 NA NA 8
Answer Ex.1
## Create output dataframe
outDF <- data.frame()
for(i in c(4,6,8)){
m1 <- lm(mpg ~ hp, data=mtcars, cyl==i) ## linear model through mtcars dataset for each cylinder type
results <- data.frame(anova(m1)) ## extract model co-efficients
results[,"cyl"] <- i
outDF <- rbind(outDF, results) ## NEEDED TO ADD BIND TO OUTPUT DF
}
outDF
## Df Sum.Sq Mean.Sq F.value Pr..F. cyl
## hp 1 55.7389699 55.7389699 3.39764764 0.09839858 4
## Residuals 9 147.6464846 16.4051650 NA NA 4
## hp1 1 0.2046882 0.2046882 0.08205609 0.78602021 6
## Residuals1 5 12.4724547 2.4944909 NA NA 6
## hp2 1 6.8542713 6.8542713 1.04984990 0.32575378 8
## Residuals2 12 78.3457287 6.5288107 NA NA 8
Ex2. Using the mtcars
dataset, write a for loop that conducts ANOVAs testing for differences in miles per gallon (mpg) for cylinder type, gear, and am. *hint: the aov
function does not inherently accept character vectors.
## Create output dataframe
anovaOut <- data.frame()
for(i in c("cyl","gear","am")){
i
}
Answer Ex.2
for(i in c("cyl","gear","am")){
predictor <- as.factor( mtcars[,i])
m1 <- aov(mpg ~ predictor, data=mtcars)
print(summary(m1))
}
## Df Sum Sq Mean Sq F value Pr(>F)
## predictor 2 824.8 412.4 39.7 4.98e-09 ***
## Residuals 29 301.3 10.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## predictor 2 483.2 241.62 10.9 0.000295 ***
## Residuals 29 642.8 22.17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Df Sum Sq Mean Sq F value Pr(>F)
## predictor 1 405.2 405.2 16.86 0.000285 ***
## Residuals 30 720.9 24.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Ex3. Lastly, check for normality among the residuals of the loop we created above using the Shapiro-Wilks test shapiro.test
:
for(i in 2007:2009){
m1 <- lm(bill_length_mm ~ body_mass_g, data=penguins, year==i)
}
Answer Ex.3
for(i in 2007:2009){
m1 <- lm(bill_length_mm ~ body_mass_g, data=penguins, year==i)
shap1 <- shapiro.test(residuals(m1))
print(shap1)
}
##
## Shapiro-Wilk normality test
##
## data: residuals(m1)
## W = 0.95606, p-value = 0.001211
##
##
## Shapiro-Wilk normality test
##
## data: residuals(m1)
## W = 0.93658, p-value = 4e-05
##
##
## Shapiro-Wilk normality test
##
## data: residuals(m1)
## W = 0.9564, p-value = 0.0006984