Must go faster

Below are two examples of scripts that currently are inefficient. We ask that you discuss amongst yourselves ways of improving these codes to be faster. We can employ the Microbenchmark package to assess if you successfully improved the speed.

How to use microbenchmark

Microbenchmark runs the same functions repeatedly and then calculates run times based on min, median, and max from 100 evaluations. Running the same operations repeatedly helps account for variation in individual run times. You can add multiple expressions to same call to compare them side by side. For our purposes here, we are simply trying to beat our best time.

library(microbenchmark)

## Warning: package 'microbenchmark' was built under R version 3.6.3

microbenchmark(

{
## For loops
sums <- c()
for(i in 1:ncol(mtcars)){
  t <- sum(mtcars[,i])
  sums <- c(sums,t)
}

},

{
## Apply family
apply(mtcars, 2, sum)
  
}, unit="ms")

## Unit: milliseconds
##                                                                                                                expr
##  {     sums <- c()     for (i in 1:ncol(mtcars)) {         t <- sum(mtcars[, i])         sums <- c(sums, t)     } }
##                                                                                       {     apply(mtcars, 2, sum) }
##     min      lq     mean  median      uq    max neval cld
##  1.8881 1.95465 2.172234 2.02575 2.28515 3.8610   100   b
##  0.1024 0.11085 0.133713 0.12285 0.13335 0.8181   100  a

Exercise 1 - Guessing a word for hangman

Create a foreach loop that will guess the written word in parallel compared to the current script

word <- c("Batman")


guess <- c()
for(i in 1:nchar(word)){
  
  for(j in 1:52){
    if(substr(word,i,i) == c(letters,LETTERS)[j]) {
  break
  }
  }
  guess <- c(guess,c(letters,LETTERS)[j]) ## output guess letter
  guess <- paste(guess, collapse="")
}
print(paste0(guess, " Number of guesses = ", (i*52)+j))

## [1] "Batman Number of guesses = 326"

Exercise 2 - Tune a Random Forest model

Find the optimal tuning for a Random Forest using foreach. Random forest has parameters that can be tuned to minimize error and increase prediction. Most notably, these include the mtry parameter which determines the number of predictors to test at each split and the ntrees parameter which determines the number of trees to grow. The caret package has a nice wrapper that allows for model tuning between one of these parameters, but it takes a bit of extra work to test both simultatenous and to select a specific metric to evaluate. We can accomplish the same thing using loops but since random forest typically runs slow, a for loop will take a long time. Write a parallel loop to execute the below code faster.

Here is a good resource on nesting foreach loops. To transform multiple nested loops into a single parallelized operation, you’ll need to make use of the %:% operator.

library(randomForest)
set.seed(11)

source("Parallelization//climSim.r")

head(climate)
head(population)

trees <- c(500,1000,1500,2000)
mtrys <- 1:ncol(climate)

outDF <- data.frame()
for(i in trees){
  for(j in mtrys){
  
  rf1 <- randomForest(Pop1 ~ MAT + MAP + TMN + TMX, importance=T, ntree=i, data=population) 
  RSQ <- rf1$rsq[length(rf1$rsq)] ## extract psuedo rsquared values
  MSE <- rf1$mse[length(rf1$mse)] ## extracted mean square errors
  tempDF <- data.frame(ntrees = i, mtry = j, rsq=RSQ, mse=MSE) ## create dataframe of parameters and outputs
  outDF <- rbind(outDF, tempDF)
  }
}
outDF