Vectorization

Install and load packages

library(microbenchmark)
library(here) # Look where your folder begins.
library(magrittr)
library(dplyr)
library(tidyr)
library(broom)

Quick aside on the genius of the here package:

Whether they’re collaborating with colleagues, uploading data to repositories after submitting a paper, or posting scripts on GitHub for anyone to use, people who write code share it constantly. A great time-saving strategy for streamlining code-sharing is to use here, a package intended to facilitate file referencing by getting rid of absolute filepaths (which make your code fragile and difficult to run on other people’s systems) and instead using relative filepaths (which make your code easy to transfer to other people’s systems). See Jenny Bryan’s overview of the package for more information and convincing arguments for adopting here.

Where does here begin?

For this workshop, you’ve created a new R project in, for example, a folder where you keep all your R code. The folder that contains this R project is where here begins (aka your top-level directory).

here::here() # this is where your folder begins.
## [1] "D:/RStudio/FastR"

How do I change the root folder?

If you want to change your root folder, close your R session, move the R project folder to the location where you want it to start, and open a new R session. Typing here::here() again should then show your new root location.

What is vectorization?

Vectorization means executing a function across a vector of elements at once. In the context of R, this means giving your computer more flexibility in how it runs the operation. To understand why that matters, we first need to understand how R works as a programming language. Big thanks to Dr. Noam Ross for an excellent explanation.

R is a high-level interpreted language, which is one of the reasons it’s so appealing for scientists without a computer science degree. The extra legwork of assigning floating-point numbers, strings, pointers in memory, etc. is all handled behind the scenes. A simple task like reassigning an object from 25.2 to “foo” takes just two lines of code.

x <- 25.2
x <- "foo"

This same operation in a different language like C would require something closer to 10 lines of code. Additionally, C is a compiled language: the entire program is organized and optimized into binary machine code before it runs, rather than being executed line by line as in R. As a result, C can reach efficiencies that are unavailable to pure R code. All is not lost, though, because many R functions are written in C, C++, or Fortran. For example, lme4, a popular package for mixed-effect models, is written mostly in C++. The apply family is another example: by using it, the bulk of the operations are executed in C. By trusting your computer to execute operations in the most efficient way possible, they tend to run faster.
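As a minimal illustration (my sketch, not from the workshop materials): elementwise arithmetic like + is itself vectorized in base R, so the loop below and the one-liner produce the same result, with the one-liner doing its looping in C.

```r
x <- 1:5

## Loop version: R interprets each iteration separately
out <- numeric(length(x))
for (i in seq_along(x)) {
  out[i] <- x[i] + 1
}

## Vectorized version: "+" loops over the whole vector in C internally
out_vec <- x + 1

identical(out, as.numeric(out_vec)) ## TRUE
```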

To vectorize or not to vectorize?

Some say that you should always vectorize in R whenever you possibly can. There is certainly merit to this: in almost all cases, vectorized code performs as well as or better than unvectorized code. However, there is a learning curve, and transforming code into vectorized form takes extra thought. Depending on how much time you invest, that can have a net-zero or even negative effect on your productivity. To help you decide whether to invest time in optimizing your code through vectorization, it makes sense to explain what vectorization means as well as the pros and cons.

CONS

As far as I know, there is only one con: vectorization is not intuitive. It requires a bit of abstract thinking to implement. If you did it all the time, vectorization might come naturally to you, but new and complex tasks require some time spent blankly staring at your script wondering what to do: time that could be spent throwing a for loop in front of your code and having it already running. What complicates this further is that for the easier instances where vectorization would be particularly applicable, base R and newer packages have already written wrappers that take away the advantage.

PROS

Let’s look at an example.

You need the sum of each column in a dataset of 100 columns and 10 rows. Three ways you can execute this operation are: 1. using a for loop, 2. using the apply family, and 3. using the colSums function in base R.

The speed of vectorization

### create dataset
dataset <- rnorm(1000, 10, 2) %>% matrix(nrow = 10, ncol = 100) %>% data.frame()
### Loop
microbenchmark({
  
sums <- c()
for(i in 1:100){
  t <- sum(dataset[,i])
  sums <- c(sums,t)
}

}, unit="ms")

### Apply
microbenchmark({

apply(dataset, 2, sum)
  
}, unit="ms")


### ColSums
microbenchmark({

colSums(dataset)
  
}, unit="ms")

In this example, we find that the vectorized apply family executes the function on average 5.5 times faster. The colSums wrapper is tied with the apply family, mostly because it uses the same underlying algorithms. Thus, you could probably last a long time in R just using wrapper functions and for loops. But this is a workshop on making things go faster, and a 5.5x boost sounds pretty appealing when your functions take hours to run. Let’s take a look at why vectorization works.

Other benefits of vectorization besides speed

There are two other reasons one might prefer to vectorize besides the speed gains. The first is that the code is often shorter. In the previous example, the for loop required five lines of code: specifying an empty vector to fill, the for loop itself, the operation to be conducted iteratively, and finally appending the output. By contrast, the apply family executes everything within a single line, which looks more elegant.

## For loop
sums <- c() ## specify empty vector
for(i in 1:100){
  t <- sum(dataset[,i]) ## conduct iterated operation
  sums <- c(sums,t) ## append
}


### Apply
apply(dataset, 2, sum) ## dataframe, by columns, function

The second benefit is that for loops leave their outputs and iteration objects in the environment. This can be bad practice, filling up your memory or potentially causing a conflict down the road. For example, after the above for loop is executed, i = 100. In any future operation where i is called (such as another loop), there is a risk its value will be 100.
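A quick demonstration of that leftover-variable issue (my sketch, not workshop code): the loop counter persists in the global environment after the loop finishes, whereas apply-family calls keep their function arguments in a local scope and leave nothing behind.

```r
for (i in 1:100) {
  ## (loop body)
}

exists("i")   ## TRUE: the counter is still in the environment
leftover <- i ## its final value, 100

rm(i)         ## clean up explicitly if you want to avoid future conflicts
```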

The Apply Family

There are multiple versions of apply (e.g. apply, sapply, vapply), including parallel versions (e.g. parLapply). Today we will look at apply and lapply. We will also look at executing these operations in the tidyverse.

Apply works across dataframes to execute a function, running across either rows (MARGIN = 1) or columns (MARGIN = 2). The function can be built into R, come from a package, or be user-specified. Let’s look at some examples.

## Sum across rows
apply(dataset, 1, sum)

## Sum across columns
apply(dataset, 2, sum)

## Mean across columns
apply(dataset, 2, mean)

## User-specified functions
se <- function(x) {sd(x)/sqrt(length(x))} ## standard error
apply(dataset, 2, se) 

zScore <- function(x) { (x - mean(x)) / sd(x) }
apply(dataset, 2, zScore) 
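Though this workshop focuses on apply and lapply, here is a quick sketch (my addition) of how sapply and vapply relate: both apply a function over a list or vector, but vapply makes you declare the expected output type, so it errors loudly instead of silently changing shape.

```r
vals <- list(a = 1:3, b = 4:6)

## sapply() guesses how to simplify the result (here, a named vector)
s <- sapply(vals, sum)

## vapply() requires a template for each element's result, which is
## safer in scripts because unexpected output types become errors
v <- vapply(vals, sum, FUN.VALUE = numeric(1))
```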

The lapply function works very similarly to a for loop but outputs data as a list. The list can usually be simplified using a do.call function. While the operation below is inefficient, even compared to a for loop, there are instances where lapply is considerably faster, such as reading in 100 CSVs and combining them into a single dataframe.

## using lapply for sum across columns
sumOut <- lapply(1:100, function(i){
    sum(dataset[,i]) ## conduct iterated operation
})
do.call(c, sumOut)
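Here is a sketch of that read-many-CSVs pattern (the folder and file names are invented; temp files keep the example self-contained): lapply reads each file into a dataframe, and do.call(rbind, ...) binds them into one.

```r
## Write two small example CSVs to a temporary folder
csv_dir <- file.path(tempdir(), "example_csvs")
dir.create(csv_dir, showWarnings = FALSE)
for (f in c("a.csv", "b.csv")) {
  write.csv(data.frame(x = 1:2), file.path(csv_dir, f), row.names = FALSE)
}

## Read every CSV in the folder and combine into a single dataframe
files <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
all_data <- do.call(rbind, lapply(files, read.csv))
```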

Tidyverse

Vectorization gets a lot of its power through combination with the tidyverse. This allows rapid computation on subsets of the data that would take considerably more lines of code and run slower using for loops.

What is the tidyverse?

The tidyverse “is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.” The tidyverse contains some extremely useful packages such as

  • ggplot2, which lets you make fantastic graphics;
  • tidyr, which helps you “tidy” (clean) up your data; and
  • dplyr, which lets you manipulate your data, like you would with pliers (hence d-plyr).

Save this built-in dataset as a dataframe: the “Effect of Vitamin C on Tooth Growth in Guinea Pigs” dataset contains data from an experiment wherein 60 guinea pigs were given vitamin C either as pure vitamin C (VC) or as orange juice (OJ) (column “supp”) in doses of 0.5, 1, or 2 mg/day (column “dose”). The researchers recorded the growth of their teeth (column “len”).

teeth <- ToothGrowth

Today we will work through examples of these tidyverse functions:

  • mutate()
  • group_by()
  • summarize()
  • filter()
  • select()

Then we’ll introduce some functions from the broom package that are especially useful for working with linear models such as

  • tidy()
  • augment()
  • glance()

Prequel: magrittr’s pipes

A fantastic tool that is worth getting to know is the magrittr package’s collection of operators called pipes. As stated on their webpage,

"The magrittr package offers a set of operators which make your code more readable by:

  • structuring sequences of data operations left-to-right (as opposed to from the inside and out),
  • avoiding nested function calls,
  • minimizing the need for local variables and function definitions, and
  • making it easy to add steps anywhere in the sequence of operations."
We’ll frequently set up our code like this:

new_object <- old_object %>% function()

If you’d rather overwrite your original object, you can use either of these setups:

old_object <- old_object %>% function()

old_object %<>% function() (notice that the pipe has two arrows inside it now)
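To make the readability gain concrete, here is a small example of my own comparing a nested call with its piped equivalent; both compute the same number.

```r
library(magrittr)

x <- c(1, 4, 9, 16)

## Nested: read from the inside out
r1 <- round(mean(sqrt(x)), 1)

## Piped: read from left to right
r2 <- x %>% sqrt() %>% mean() %>% round(1)

identical(r1, r2) ## TRUE
```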

mutate()

The mutate() function lets us create new columns based on existing columns. In this example, we’ll create a new column that takes the “len” data (guinea pig tooth length) and divides it by 100 to create a rescaled version, len_cm.

The . after mutate( is a placeholder for the object being piped in; here, we’re performing mutate() on the object named after the assignment arrow (the teeth object).

teeth <- teeth %>%
  dplyr::mutate(.,
                len_cm = len / 100)

head(teeth)
##    len supp dose len_cm
## 1  4.2   VC  0.5  0.042
## 2 11.5   VC  0.5  0.115
## 3  7.3   VC  0.5  0.073
## 4  5.8   VC  0.5  0.058
## 5  6.4   VC  0.5  0.064
## 6 10.0   VC  0.5  0.100

We can also mutate multiple columns at once. Let’s add a new column that calculates the mean tooth length for all guinea pigs:

teeth <- teeth %>%
  dplyr::mutate(.,
                len_cm = len / 100,
                mean_len = mean(len))

head(teeth)
##    len supp dose len_cm mean_len
## 1  4.2   VC  0.5  0.042 18.81333
## 2 11.5   VC  0.5  0.115 18.81333
## 3  7.3   VC  0.5  0.073 18.81333
## 4  5.8   VC  0.5  0.058 18.81333
## 5  6.4   VC  0.5  0.064 18.81333
## 6 10.0   VC  0.5  0.100 18.81333


group_by()

So now we know the mean tooth length for all guinea pigs, but that’s not very helpful. It’d be more informative to know the mean tooth length for specific groups of guinea pigs, based on the dose of vitamin C administered and the method of administration. First, let’s group the subjects by the method of administration (supp) to find the mean tooth length per group (VC or OJ):

teeth_2 <- teeth %>%
  dplyr::group_by(supp) %>%
  dplyr::mutate(.,
         mean_len = mean(len))

head(as.data.frame(teeth_2), 60)
##     len supp dose len_cm mean_len
## 1   4.2   VC  0.5  0.042 16.96333
## 2  11.5   VC  0.5  0.115 16.96333
## 3   7.3   VC  0.5  0.073 16.96333
## 4   5.8   VC  0.5  0.058 16.96333
## 5   6.4   VC  0.5  0.064 16.96333
## 6  10.0   VC  0.5  0.100 16.96333
## 7  11.2   VC  0.5  0.112 16.96333
## 8  11.2   VC  0.5  0.112 16.96333
## 9   5.2   VC  0.5  0.052 16.96333
## 10  7.0   VC  0.5  0.070 16.96333
## 11 16.5   VC  1.0  0.165 16.96333
## 12 16.5   VC  1.0  0.165 16.96333
## 13 15.2   VC  1.0  0.152 16.96333
## 14 17.3   VC  1.0  0.173 16.96333
## 15 22.5   VC  1.0  0.225 16.96333
## 16 17.3   VC  1.0  0.173 16.96333
## 17 13.6   VC  1.0  0.136 16.96333
## 18 14.5   VC  1.0  0.145 16.96333
## 19 18.8   VC  1.0  0.188 16.96333
## 20 15.5   VC  1.0  0.155 16.96333
## 21 23.6   VC  2.0  0.236 16.96333
## 22 18.5   VC  2.0  0.185 16.96333
## 23 33.9   VC  2.0  0.339 16.96333
## 24 25.5   VC  2.0  0.255 16.96333
## 25 26.4   VC  2.0  0.264 16.96333
## 26 32.5   VC  2.0  0.325 16.96333
## 27 26.7   VC  2.0  0.267 16.96333
## 28 21.5   VC  2.0  0.215 16.96333
## 29 23.3   VC  2.0  0.233 16.96333
## 30 29.5   VC  2.0  0.295 16.96333
## 31 15.2   OJ  0.5  0.152 20.66333
## 32 21.5   OJ  0.5  0.215 20.66333
## 33 17.6   OJ  0.5  0.176 20.66333
## 34  9.7   OJ  0.5  0.097 20.66333
## 35 14.5   OJ  0.5  0.145 20.66333
## 36 10.0   OJ  0.5  0.100 20.66333
## 37  8.2   OJ  0.5  0.082 20.66333
## 38  9.4   OJ  0.5  0.094 20.66333
## 39 16.5   OJ  0.5  0.165 20.66333
## 40  9.7   OJ  0.5  0.097 20.66333
## 41 19.7   OJ  1.0  0.197 20.66333
## 42 23.3   OJ  1.0  0.233 20.66333
## 43 23.6   OJ  1.0  0.236 20.66333
## 44 26.4   OJ  1.0  0.264 20.66333
## 45 20.0   OJ  1.0  0.200 20.66333
## 46 25.2   OJ  1.0  0.252 20.66333
## 47 25.8   OJ  1.0  0.258 20.66333
## 48 21.2   OJ  1.0  0.212 20.66333
## 49 14.5   OJ  1.0  0.145 20.66333
## 50 27.3   OJ  1.0  0.273 20.66333
## 51 25.5   OJ  2.0  0.255 20.66333
## 52 26.4   OJ  2.0  0.264 20.66333
## 53 22.4   OJ  2.0  0.224 20.66333
## 54 24.5   OJ  2.0  0.245 20.66333
## 55 24.8   OJ  2.0  0.248 20.66333
## 56 30.9   OJ  2.0  0.309 20.66333
## 57 26.4   OJ  2.0  0.264 20.66333
## 58 27.3   OJ  2.0  0.273 20.66333
## 59 29.4   OJ  2.0  0.294 20.66333
## 60 23.0   OJ  2.0  0.230 20.66333

Now let’s incorporate dosage (dose) as another grouping so that the subjects are grouped by both supp and dose:

teeth_2 <- teeth %>%
  dplyr::group_by(supp, dose) %>%
  dplyr::mutate(.,
         mean_len = mean(len))

head(as.data.frame(teeth_2), 60)
##     len supp dose len_cm mean_len
## 1   4.2   VC  0.5  0.042     7.98
## 2  11.5   VC  0.5  0.115     7.98
## 3   7.3   VC  0.5  0.073     7.98
## 4   5.8   VC  0.5  0.058     7.98
## 5   6.4   VC  0.5  0.064     7.98
## 6  10.0   VC  0.5  0.100     7.98
## 7  11.2   VC  0.5  0.112     7.98
## 8  11.2   VC  0.5  0.112     7.98
## 9   5.2   VC  0.5  0.052     7.98
## 10  7.0   VC  0.5  0.070     7.98
## 11 16.5   VC  1.0  0.165    16.77
## 12 16.5   VC  1.0  0.165    16.77
## 13 15.2   VC  1.0  0.152    16.77
## 14 17.3   VC  1.0  0.173    16.77
## 15 22.5   VC  1.0  0.225    16.77
## 16 17.3   VC  1.0  0.173    16.77
## 17 13.6   VC  1.0  0.136    16.77
## 18 14.5   VC  1.0  0.145    16.77
## 19 18.8   VC  1.0  0.188    16.77
## 20 15.5   VC  1.0  0.155    16.77
## 21 23.6   VC  2.0  0.236    26.14
## 22 18.5   VC  2.0  0.185    26.14
## 23 33.9   VC  2.0  0.339    26.14
## 24 25.5   VC  2.0  0.255    26.14
## 25 26.4   VC  2.0  0.264    26.14
## 26 32.5   VC  2.0  0.325    26.14
## 27 26.7   VC  2.0  0.267    26.14
## 28 21.5   VC  2.0  0.215    26.14
## 29 23.3   VC  2.0  0.233    26.14
## 30 29.5   VC  2.0  0.295    26.14
## 31 15.2   OJ  0.5  0.152    13.23
## 32 21.5   OJ  0.5  0.215    13.23
## 33 17.6   OJ  0.5  0.176    13.23
## 34  9.7   OJ  0.5  0.097    13.23
## 35 14.5   OJ  0.5  0.145    13.23
## 36 10.0   OJ  0.5  0.100    13.23
## 37  8.2   OJ  0.5  0.082    13.23
## 38  9.4   OJ  0.5  0.094    13.23
## 39 16.5   OJ  0.5  0.165    13.23
## 40  9.7   OJ  0.5  0.097    13.23
## 41 19.7   OJ  1.0  0.197    22.70
## 42 23.3   OJ  1.0  0.233    22.70
## 43 23.6   OJ  1.0  0.236    22.70
## 44 26.4   OJ  1.0  0.264    22.70
## 45 20.0   OJ  1.0  0.200    22.70
## 46 25.2   OJ  1.0  0.252    22.70
## 47 25.8   OJ  1.0  0.258    22.70
## 48 21.2   OJ  1.0  0.212    22.70
## 49 14.5   OJ  1.0  0.145    22.70
## 50 27.3   OJ  1.0  0.273    22.70
## 51 25.5   OJ  2.0  0.255    26.06
## 52 26.4   OJ  2.0  0.264    26.06
## 53 22.4   OJ  2.0  0.224    26.06
## 54 24.5   OJ  2.0  0.245    26.06
## 55 24.8   OJ  2.0  0.248    26.06
## 56 30.9   OJ  2.0  0.309    26.06
## 57 26.4   OJ  2.0  0.264    26.06
## 58 27.3   OJ  2.0  0.273    26.06
## 59 29.4   OJ  2.0  0.294    26.06
## 60 23.0   OJ  2.0  0.230    26.06

That’s more like it!

However, there’s a better way to organize this data…


summarize()

If you don’t need to keep the individual subjects’ tooth length data, then it makes sense to use summarize() to condense the data into fewer rows. Our output will have one row for each combination of supp x dose.

teeth_3 <- teeth %>%
  dplyr::group_by(supp, dose) %>%
  dplyr::summarize(mean_len = mean(len))
## `summarise()` has grouped output by 'supp'. You can override using the `.groups` argument.
head(teeth_3)
## # A tibble: 6 x 3
## # Groups:   supp [2]
##   supp   dose mean_len
##   <fct> <dbl>    <dbl>
## 1 OJ      0.5    13.2 
## 2 OJ      1      22.7 
## 3 OJ      2      26.1 
## 4 VC      0.5     7.98
## 5 VC      1      16.8 
## 6 VC      2      26.1

You can even summarize multiple times, such as if you wanted to first get mean values per supp x dose group and then average those values to get mean length per supp group.

teeth_4 <- teeth %>%
  dplyr::group_by(supp, dose) %>%
  dplyr::summarize(mean_len = mean(len)) %>%
  dplyr::group_by(supp) %>%
  dplyr::summarize(mean_len = mean(mean_len))
## `summarise()` has grouped output by 'supp'. You can override using the `.groups` argument.
head(teeth_4)
## # A tibble: 2 x 2
##   supp  mean_len
##   <fct>    <dbl>
## 1 OJ        20.7
## 2 VC        17.0

What if you just want to summarize all the rows of your dataset? Then you can use summarize() in conjunction with the across() “helper” function.

Let’s calculate the mean values for dose and len_cm. To do so, you’d set up your code like this. Be sure you’re not selecting any non-numeric columns, like supp.

teeth_5 <- teeth %>%
  dplyr::summarize(across(dose:len_cm, mean))

teeth_5
##       dose    len_cm
## 1 1.166667 0.1881333

You can even summarize multiple variables in multiple ways, as in this example, where we calculate the mean, maximum, and minimum values for len and dose. In the set of parentheses determining which calculations to perform (mean, max, min), be sure to put the suffix you want appended to the column name first (e.g. mean, maximum, minimum), then the = sign, then the R-recognized function name (mean, max, min).

teeth_6 <- teeth %>%
  dplyr::summarize(across(c(len, dose),
                          c(mean = mean,
                            maximum = max,
                            minimum = min))) %T>%
  print()
##   len_mean len_maximum len_minimum dose_mean dose_maximum dose_minimum
## 1 18.81333        33.9         4.2  1.166667            2          0.5

You’ll also notice that I added a special pipe, %T>% (the “tee” pipe), plus the print() function at the end of the command. This tells R to perform the function after the tee pipe as a side effect without passing its result along, so it doesn’t affect the object being created (teeth_6). This pipe is very useful when creating plots, for instance.
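A tiny sketch (my addition) of the tee pipe in action: print() runs for its side effect, but the original vector continues down the pipeline unchanged.

```r
library(magrittr)

total <- c(1, 2, 3) %T>%
  print() %>%  ## prints the vector, then passes it along unchanged
  sum()

total ## 6
```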

Let’s imagine that you only want to focus on the guinea pigs given orange juice. How would you get rid of the other rows?


filter()

The filter() function lets you filter out the rows you don’t want. Here we’ve added a line of code to keep only the rows where the guinea pig was given vitamin C as orange juice (OJ), not as pure vitamin C.

teeth_7 <- teeth %>%
  dplyr::filter(supp == "OJ")

head(teeth_7)
##    len supp dose len_cm mean_len
## 1 15.2   OJ  0.5  0.152 18.81333
## 2 21.5   OJ  0.5  0.215 18.81333
## 3 17.6   OJ  0.5  0.176 18.81333
## 4  9.7   OJ  0.5  0.097 18.81333
## 5 14.5   OJ  0.5  0.145 18.81333
## 6 10.0   OJ  0.5  0.100 18.81333


select()

Similarly, the select() function lets you keep only the columns you want. Our new dataframe only includes the first three columns, from len to dose. Both of these commands accomplish the same goal, whether you refer to columns by name or by index position.

teeth_8 <- teeth %>%
  dplyr::select(len:dose)

head(teeth_8)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
teeth_8 <- teeth %>%
  dplyr::select(1:3)

head(teeth_8)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5



Helpful Functions for Linear Models

Let’s load a new built-in dataset: warpbreaks. This dataset records the number of breaks in yarn that accumulated during weaving, based on the yarn’s wool type and the amount of tension.

yarn <- warpbreaks

We’re going to set up some linear models (one for each type of wool) and then use three functions from the broom package to summarize important information about the models in a format that makes it simple to compare and plot them.

As stated in the broom overview page, here is the role of each function:

tidy():

“Summarizes information about model components”

augment():

“Adds information about observations to a dataset”

glance():

“Reports information about the entire model”


The long way

mod1 <- lm(breaks ~ tension,
           yarn %>% filter(wool == "A"))

mod2 <- lm(breaks ~ tension,
           yarn %>% filter(wool == "B"))


tidy()

broom::tidy(mod1)
## # A tibble: 3 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)     44.6      4.34     10.3  2.91e-10
## 2 tensionM       -20.6      6.13     -3.35 2.66e- 3
## 3 tensionH       -20        6.13     -3.26 3.32e- 3
# broom::tidy(mod2)


augment()

broom::augment(mod1)
## # A tibble: 27 x 8
##    breaks tension .fitted .resid  .hat .sigma .cooksd .std.resid
##     <dbl> <fct>     <dbl>  <dbl> <dbl>  <dbl>   <dbl>      <dbl>
##  1     26 L          44.6 -18.6  0.111   12.6 0.0953      -1.51 
##  2     30 L          44.6 -14.6  0.111   12.9 0.0586      -1.19 
##  3     54 L          44.6   9.44 0.111   13.1 0.0247       0.770
##  4     25 L          44.6 -19.6  0.111   12.6 0.106       -1.59 
##  5     70 L          44.6  25.4  0.111   12.0 0.179        2.07 
##  6     52 L          44.6   7.44 0.111   13.2 0.0153       0.607
##  7     51 L          44.6   6.44 0.111   13.2 0.0115       0.525
##  8     26 L          44.6 -18.6  0.111   12.6 0.0953      -1.51 
##  9     67 L          44.6  22.4  0.111   12.3 0.139        1.83 
## 10     18 M          24    -6    0.111   13.2 0.00996     -0.489
## # ... with 17 more rows
# broom::augment(mod2)


glance()

broom::glance(mod1)
## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.378         0.326  13.0      7.29 0.00336     2  -106.  220.  225.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# broom::glance(mod2)


The fastR way to conduct linear models:

Notice that we are using the group_by function to perform the same command on each wool type, all within the same chunk of code.


tidy()

# tidy
mod_all.tidy <- yarn %>%
  group_by(wool) %>%
  do(fit = broom::tidy(lm(breaks ~ tension,
              data = .))) %>%
  unnest(fit) %T>%
  print()
## # A tibble: 6 x 6
##   wool  term        estimate std.error statistic  p.value
##   <fct> <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 A     (Intercept)   44.6        4.34    10.3   2.91e-10
## 2 A     tensionM     -20.6        6.13    -3.35  2.66e- 3
## 3 A     tensionH     -20          6.13    -3.26  3.32e- 3
## 4 B     (Intercept)   28.2        2.79    10.1   3.91e-10
## 5 B     tensionM       0.556      3.94     0.141 8.89e- 1
## 6 B     tensionH      -9.44       3.94    -2.39  2.48e- 2


augment()

mod_all.augment <- yarn %>%
  group_by(wool) %>%
  do(fit = broom::augment(lm(breaks ~ tension,
              data = .))) %>%
  unnest(fit) %T>%
  print()
## # A tibble: 54 x 9
##    wool  breaks tension .fitted .resid  .hat .sigma .cooksd .std.resid
##    <fct>  <dbl> <fct>     <dbl>  <dbl> <dbl>  <dbl>   <dbl>      <dbl>
##  1 A         26 L          44.6 -18.6  0.111   12.6 0.0953      -1.51 
##  2 A         30 L          44.6 -14.6  0.111   12.9 0.0586      -1.19 
##  3 A         54 L          44.6   9.44 0.111   13.1 0.0247       0.770
##  4 A         25 L          44.6 -19.6  0.111   12.6 0.106       -1.59 
##  5 A         70 L          44.6  25.4  0.111   12.0 0.179        2.07 
##  6 A         52 L          44.6   7.44 0.111   13.2 0.0153       0.607
##  7 A         51 L          44.6   6.44 0.111   13.2 0.0115       0.525
##  8 A         26 L          44.6 -18.6  0.111   12.6 0.0953      -1.51 
##  9 A         67 L          44.6  22.4  0.111   12.3 0.139        1.83 
## 10 A         18 M          24    -6    0.111   13.2 0.00996     -0.489
## # ... with 44 more rows


glance()

mod_all.glance <- yarn %>%
  group_by(wool) %>%
  do(fit = broom::glance(lm(breaks ~ tension,
              data = .))) %>%
  unnest(fit) %T>%
  print()
## # A tibble: 2 x 13
##   wool  r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##   <fct>     <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1 A         0.378         0.326 13.0       7.29 0.00336     2 -106.   220.  225.
## 2 B         0.253         0.190  8.37      4.06 0.0303      2  -94.1  196.  201.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
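Note that do() has since been superseded in dplyr. A more current equivalent is sketched below (my addition; it assumes the purrr package, which is not loaded above, and the result name mod_all.tidy2 is invented): nest the data by wool, then map the model and tidy() over each piece.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

mod_all.tidy2 <- warpbreaks %>%
  nest(data = -wool) %>%                                   ## one row per wool type
  mutate(fit    = map(data, ~ lm(breaks ~ tension, data = .x)),
         tidied = map(fit, tidy)) %>%
  select(wool, tidied) %>%
  unnest(tidied)                                           ## 6 rows: 3 terms x 2 wools
```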

Practice

Let’s practice writing vectorized code!

Ex.1 First, use the apply function to find the median value for each row of the mtcars data frame.

Answer Ex.1

apply(mtcars, 1, median)
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##               4.000               4.000               4.000               3.215 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##               3.440               3.460               4.000               4.000 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##               4.000               4.000               4.000               4.070 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##               3.730               3.780               5.250               5.424 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##               5.345               4.000               4.000               4.000 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##               3.700               3.520               3.435               4.000 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##               3.845               4.000               4.430               4.000 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##               5.000               6.000               8.000               4.000


Ex.2 Next, find the average number of breaks from yarn with wool type A and tensions L & M only. You should end up with one dataframe with 2 rows and 2 columns (“tension” and “mean_breaks”) showing mean A/L breaks and A/M breaks. Hint: Take it one function at a time (you’ll use 3) and check your code as you go to see that you’re on the right track.

Answer Ex.2

yarn %>%
  filter(wool == "A" & tension != "H") %>%
  group_by(tension) %>%
  summarise(mean_breaks = mean(breaks)) %T>%
  print()
## # A tibble: 2 x 2
##   tension mean_breaks
##   <fct>         <dbl>
## 1 L              44.6
## 2 M              24


Ex.3 Lastly, choose only the rows with supp = VC from the teeth data frame and then create 2 new columns: one with the product of len x dose (called “len_x_dose”) and the other with the ranking for the len values (called “rank”) in descending order (the largest value should rank at 1). Hint: here’s how to write the ranking function.

Answer Ex.3

teeth %>%
  filter(supp == "VC") %>%
  mutate(len_x_dose = len*dose,
         rank = row_number(desc(len)))
##     len supp dose len_cm mean_len len_x_dose rank
## 1   4.2   VC  0.5  0.042 18.81333       2.10   30
## 2  11.5   VC  0.5  0.115 18.81333       5.75   21
## 3   7.3   VC  0.5  0.073 18.81333       3.65   25
## 4   5.8   VC  0.5  0.058 18.81333       2.90   28
## 5   6.4   VC  0.5  0.064 18.81333       3.20   27
## 6  10.0   VC  0.5  0.100 18.81333       5.00   24
## 7  11.2   VC  0.5  0.112 18.81333       5.60   22
## 8  11.2   VC  0.5  0.112 18.81333       5.60   23
## 9   5.2   VC  0.5  0.052 18.81333       2.60   29
## 10  7.0   VC  0.5  0.070 18.81333       3.50   26
## 11 16.5   VC  1.0  0.165 18.81333      16.50   15
## 12 16.5   VC  1.0  0.165 18.81333      16.50   16
## 13 15.2   VC  1.0  0.152 18.81333      15.20   18
## 14 17.3   VC  1.0  0.173 18.81333      17.30   13
## 15 22.5   VC  1.0  0.225 18.81333      22.50    9
## 16 17.3   VC  1.0  0.173 18.81333      17.30   14
## 17 13.6   VC  1.0  0.136 18.81333      13.60   20
## 18 14.5   VC  1.0  0.145 18.81333      14.50   19
## 19 18.8   VC  1.0  0.188 18.81333      18.80   11
## 20 15.5   VC  1.0  0.155 18.81333      15.50   17
## 21 23.6   VC  2.0  0.236 18.81333      47.20    7
## 22 18.5   VC  2.0  0.185 18.81333      37.00   12
## 23 33.9   VC  2.0  0.339 18.81333      67.80    1
## 24 25.5   VC  2.0  0.255 18.81333      51.00    6
## 25 26.4   VC  2.0  0.264 18.81333      52.80    5
## 26 32.5   VC  2.0  0.325 18.81333      65.00    2
## 27 26.7   VC  2.0  0.267 18.81333      53.40    4
## 28 21.5   VC  2.0  0.215 18.81333      43.00   10
## 29 23.3   VC  2.0  0.233 18.81333      46.60    8
## 30 29.5   VC  2.0  0.295 18.81333      59.00    3

