Below are two examples of scripts that are currently inefficient. Discuss among yourselves how this code could be made faster. We can use the microbenchmark package to check whether your changes actually improve the speed.
microbenchmark runs each expression repeatedly (100 evaluations by default) and reports summary statistics of the run times, including the minimum, median, and maximum. Running the same operations repeatedly helps account for variation in individual run times. You can pass multiple expressions to the same call to compare them side by side. For our purposes here, we are simply trying to beat our best time.
library(microbenchmark)
microbenchmark(
  {
    ## For loops
    sums <- c()
    for(i in 1:ncol(mtcars)){
      t <- sum(mtcars[,i])
      sums <- c(sums,t)
    }
  },
  {
    ## Apply family
    apply(mtcars, 2, sum)
  }, unit="ms")
## Unit: milliseconds
## expr
## { sums <- c() for (i in 1:ncol(mtcars)) { t <- sum(mtcars[, i]) sums <- c(sums, t) } }
## { apply(mtcars, 2, sum) }
## min lq mean median uq max neval cld
## 1.8881 1.95465 2.172234 2.02575 2.28515 3.8610 100 b
## 0.1024 0.11085 0.133713 0.12285 0.13335 0.8181 100 a
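One possible way to try to beat these times is sketched here; it is an illustration, not the definitive answer. It preallocates the result vector instead of growing it with c(), and also tries vapply() and colSums(), which avoid the explicit loop. The helper name loop_prealloc is made up for this sketch, and naming the expressions is optional but makes the output easier to read. Timings depend on your machine, so none are shown.

```r
library(microbenchmark)

# Hypothetical helper: same column sums as the for loop above,
# but the result vector is allocated once up front.
loop_prealloc <- function(df) {
  out <- numeric(ncol(df))
  for (i in seq_along(df)) {
    out[i] <- sum(df[[i]])
  }
  out
}

# Named expressions appear under those names in the "expr" column.
timings <- microbenchmark(
  prealloc = loop_prealloc(mtcars),
  vapply   = vapply(mtcars, sum, numeric(1)),
  colSums  = colSums(mtcars),
  times = 100,
  unit = "ms"
)
print(timings)
```

All three expressions should return the same column sums, so the only thing being compared is run time.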
Create a foreach loop that guesses the written word in parallel, and compare its run time with the current script:
word <- "Batman"
alphabet <- c(letters, LETTERS) ## 52 candidate characters
guess <- ""
n_guesses <- 0 ## running count of comparisons made
for(i in 1:nchar(word)){
  for(j in 1:52){
    n_guesses <- n_guesses + 1
    if(substr(word,i,i) == alphabet[j]) {
      break
    }
  }
  guess <- paste0(guess, alphabet[j]) ## append the matched letter
}
print(paste0(guess, " Number of guesses = ", n_guesses))
## [1] "Batman Number of guesses = 77"
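A minimal sketch of one parallel version follows, assuming the foreach and doParallel packages are installed and that two workers are available. Each task handles one character position. Here the number of guesses is taken to be the total number of comparisons across all letters, which is an assumption about what the count should mean. For a word this short, the overhead of starting a cluster will probably outweigh any gain, so treat this as a pattern rather than a speed-up.

```r
library(foreach)
library(doParallel)

word <- "Batman"
alphabet <- c(letters, LETTERS)

cl <- makeCluster(2)   # assumption: two worker processes
registerDoParallel(cl)

# One task per character position. match() returns the index of the first
# matching letter, which equals the number of comparisons a linear scan needs.
res <- foreach(i = seq_len(nchar(word)), .combine = rbind) %dopar% {
  j <- match(substr(word, i, i), alphabet)
  data.frame(letter = alphabet[j], tries = j, stringsAsFactors = FALSE)
}

stopCluster(cl)

guess <- paste(res$letter, collapse = "")
print(paste0(guess, " Number of guesses = ", sum(res$tries)))
```

foreach returns results in input order by default, so the letters can be pasted together directly.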
Find the optimal tuning for a random forest using foreach. Random forest has parameters that can be tuned to minimize error and improve prediction accuracy. Most notably, these include the mtry parameter, which determines the number of predictors to test at each split, and the ntree parameter, which determines the number of trees to grow. The caret package has a nice wrapper for tuning one of these parameters, but it takes a bit of extra work to test both simultaneously and to select a specific metric to evaluate. We can accomplish the same thing using loops, but since random forest typically runs slowly, a for loop will take a long time. Write a parallel loop to execute the code below faster.
Here is a good resource on nesting foreach loops. To transform multiple nested loops into a single parallelized operation, you’ll need to make use of the %:% operator.
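To make the %:% pattern concrete, here is a small sketch on a toy grid rather than on the random forest itself. The variables a and b and the product column are made up for illustration. It uses %do% so it runs sequentially without a registered backend; swapping in %dopar% after registering one is what would parallelize it.

```r
library(foreach)

# Two nested foreach calls merged with %:% into a single stream of tasks.
# Each (a, b) combination produces one row; rbind stacks them in order.
grid <- foreach(a = 1:3, .combine = rbind) %:%
  foreach(b = c(10, 20), .combine = rbind) %do% {
    data.frame(a = a, b = b, product = a * b)
  }

grid
```

The result has one row per combination of the two loop variables, which is the same shape a tuning grid over two parameters would produce.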
library(randomForest)
set.seed(11)
source("Parallelization//climSim.r")
head(climate)
head(population)
trees <- c(500,1000,1500,2000)
mtrys <- 1:ncol(climate)
outDF <- data.frame()
for(i in trees){
  for(j in mtrys){
    rf1 <- randomForest(Pop1 ~ MAT + MAP + TMN + TMX, importance=TRUE, ntree=i, mtry=j, data=population) ## pass mtry=j so the inner loop actually varies mtry
    RSQ <- rf1$rsq[length(rf1$rsq)] ## extract pseudo R-squared with all trees grown
    MSE <- rf1$mse[length(rf1$mse)] ## extract mean squared error with all trees grown
    tempDF <- data.frame(ntrees = i, mtry = j, rsq=RSQ, mse=MSE) ## create data frame of parameters and outputs
    outDF <- rbind(outDF, tempDF)
  }
}
outDF
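A hedged sketch of how the nested loop might be parallelized is given here. Because the contents of climSim.r are not shown, it uses the built-in mtcars data and a made-up formula (mpg ~ wt + hp + disp) as stand-ins, with smaller ntree values to keep it quick. The two workers and the per-task set.seed() call are also assumptions; adapt the formula, data, and grid to population for the actual exercise.

```r
library(foreach)
library(doParallel)
library(randomForest)

cl <- makeCluster(2)   # assumption: two worker processes
registerDoParallel(cl)

trees <- c(100, 200)   # stand-in grid, smaller than the exercise
mtrys <- 1:3           # at most the number of predictors in the formula

# .packages loads randomForest on each worker before the body runs.
outDF <- foreach(i = trees, .combine = rbind, .packages = "randomForest") %:%
  foreach(j = mtrys, .combine = rbind) %dopar% {
    set.seed(11)  # simple per-task seed so repeated runs give the same fit
    rf <- randomForest(mpg ~ wt + hp + disp, data = mtcars, ntree = i, mtry = j)
    data.frame(ntrees = i, mtry = j, rsq = rf$rsq[i], mse = rf$mse[i])
  }

stopCluster(cl)

# Row with the lowest mean squared error across the grid
outDF[which.min(outDF$mse), ]
```

Each task returns a one-row data frame, so nothing is grown inside a loop on the main process; .combine = rbind assembles the grid once the workers finish.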