library(microbenchmark)
library(here) # Look where your folder begins.
library(magrittr)
library(dplyr)
library(tidyr)
library(broom)
Quick aside on the genius of the here
package:
Whether they’re collaborating with colleagues, uploading to data repositories after submitting a paper, or uploading scripts to Github for anyone to use, people who write code constantly share their code. A great time-saving strategy for streamlining code-sharing is to use here
, a package intended to facilitate file referencing by getting rid of absolute filepaths (which makes your code fragile and difficult to integrate into other people’s systems) and instead using relative filepaths (which makes your code easier to transfer to other people’s systems). See Jenny Bryan’s overview of the package for more information and convincing arguments for adopting here
.
here
begin?For this workshop, you’ve created a new R project in, for example, a folder where you keep all your R code. The folder that contains this R project is where here
begins (aka your top-level directory).
here::here() # this is where your folder begins.
## [1] "D:/RStudio/FastR"
If you want to change your root folder, you should close your R session, move the current R project folder to the location where you want it to start, and then open a new R session, type here::here()
again, and then the output should show you your new root location.
Vectorization means executing a function across a vector of elements. What this means in the context of R and your functions is that you are giving more flexibility to your computer to run the operation. To understand what that means, we need to first understand how R works as a programming lanuage. Big thanks to Dr. Noam Ross for an excellent explanation
R is a high-level interpreted language, which is one of the reasons it’s so appealing for scientists without a computer science degree. The extra leg work of assinging floating numbers, strings, pointers in the memory, etc, all are handled behind the scenes. A simple task of renaming an object from 25.2 to “foo” can occur within two lines of code.
x <- 25.2
x <- "foo"
This same operation in a different language like C would require something closer to 10 lines of code. Additionally, C is a compiled language where the entire program is organized and optimized to be run in binary machine code, rather than the line-by-line execution of R. As a result, C has the potential for greater efficiencies that are unavailable to R. All is not lost though because many R functions are written in C, C++, or Fortran. For example, lme4
a popular package for mixed-effect models is written mostly in C++. The apply
family is one of those examples and by using it, the bulk of the operations are being executed in C. By trusting your computer to execute the operations in the most efficient way possible, operations tend to run faster.
Some say that you should always vectorize in R whenever you possible can. There certainly is some merit to using vectorization as in almost all cases, vectorized code will perform equally or better than unvectorized code. However, there is a learning curve and a bit of extra thought that is necessary to transform code into the vectorized format. Depending on how much time is invested, that can result in a net-zero or negative effect on your productivity. To help you make the decision whether to invest time in optimizing your code through vectorization, it makes sense to explain what vectorization means as well as the pros and cons.
As far as I know, there is only one con: Vectorization is not intuitive. It requires a bit of abstract thinking to implement. Maybe if you did it all time, vectorization will come naturally to you, but new and complex tasks requires some time where you are blankly staring at your script thinking what to do. Time that could be spent throwing a for
loop infront of your code and having it already runing. What complicates this further is that easier instances where vectorization would be particularly applicable, base R and new packages have already written wrappers that take away the advantage.
Let’s look at an example.
You need the sum for a dataset of 100 columns and 10 rows. Three ways you can execute this operation are 1. using a for loop, 2. using the apply family, 3. using the colSums
function in base R.
### create dataset
dataset <- rnorm(1000, 10, 2) %>% matrix(. , nrow=10, ncol=100) %>% data.frame()
### Loop
microbenchmark({
sums <- c()
for(i in 1:100){
t <- sum(dataset[,i])
sums <- c(sums,t)
}
}, unit="ms")
### Apply
microbenchmark({
apply(dataset, 2, sum)
}, unit="ms")
### ColSums
microbenchmark({
colSums(dataset)
}, unit="ms")
In this example, we can find that the vectorized apply family executes the function on average 5.5 times faster. The colsums wrapper is tied with the apply family, mostly because it uses the same underlying algorithms. Thus, you could probably last a long time in R just using wrapper functions and for loops. But, this is a workhop on making this go faster and getting a 5.5x boost sounds pretty appealing when your functions take hours to run. Let’s take a look at why vectorization works
There are two other reasons one might perfer to vectorize besides the speed gains. The first is that the code is often shorter. In the previous example, the for loop required five lines of code that included specifying an empty vector to fill, the for
loop specification, the operation to be conducted iteratively, and finally appending the output. By contrast, the apply family executes everything within a single line which looks more elegant.
## For loop
sums <- c() ## specify empty vector
for(i in 1:100){
t <- sum(dataset[,i]) ## conduct iterated operation
sums <- c(sums,t) ## append
}
### Apply
apply(dataset, 2, sum) ## dataframe, by columns, function
The second benefit is that for
loops keep outputs and objects in the environment. This can be bad practice by filling up your memory or potentially causing a conflict down the road. For example, after the above for
loop is executed i = 100
. That means a future operation where i
is called, there is the risk its value will be 100 (such as another loop).
Apply
FamilyThere are multiple versions of apply (e.g. apply, sapply, vapply) including parallel versions (e.g. parLapply). Today we will look at apply
and lapply
. We will also look at executing these operations in tidyverse
.
Apply works across dataframes to execute a function. It will either run across rows (MARGIN=1 default) or columns (MARGIN=2). The function can be inherent in R, from a package or user specified. Let’s look at some examples.
## Sum across rows
apply(dataset, 1, sum)
## Sum across columns
apply(dataset, 2, sum)
## Mean across columns
apply(dataset, 2, mean)
## User specifed functions
se <- function(x) {sd(x)/sqrt(length(x))} ## standard error
apply(dataset, 2, se)
zScore <- function(x) { (x - mean(x) / sd(x))}
apply(dataset, 2, zScore)
The lapply family works very similar to the for
loop but outputs data as a list. These can usually simplified using a do.call
function. While this operation is very inefficient, even compared to for
loops, there are instances where this is considerably faster, such as reading in 100 CSVs and combining into a single dataframe.
## using lapply for sum across columns
sumOut <- lapply(1:100, function(i){
sum(dataset[,i]) ## conduct iterated operation
})
do.call(c, sumOut)
Vectorization gets a lot of power through combination with tidyverse
. This allows for rapid computations of subsets of the data that would considerably more lines of code and slower using for
loops.
The tidyverse “is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.” The tidyverse contains some extremely useful packages such as
ggplot2
, which lets you make fantastic graphics;tidyr
, which helps you “tidy” (clean) up your data; anddplyr
, which lets you manipulate your data, like you would with pliers (hence d-plyr). Save this built-in dataset as a dataframe: The “The Effect of Vitamin C on Tooth Growth in Guinea Pigs” dataset contains data from an experiment wherein 60 guinea pigs were given vitamin C either as pure vitamin C (VC) or orange juice (OJ) (column “supp”) in doses of 0.5, 1, or 2 mg/day (column “dose”). The researchers recorded the growth of their teeth (column “len”).
teeth <- ToothGrowth
mutate()
group_by()
summarize()
filter()
select()
broom
package that are especially useful for working with linear models such astidy()
augment()
glance()
magrittr
’s pipesA fantastic tool that is worth getting to know is the magrittr
package’s collection of operators called pipes. As stated on their webpage,
"The
magrittr
package offers a set of operators which make your code more readable by:
- structuring sequences of data operations left-to-right (as opposed to from the inside and out),
- avoiding nested function calls,
- minimizing the need for local variables and function definitions, and
- making it easy to add steps anywhere in the sequence of operations."
new_object <- old_object %>% function()
old_object <- old_object %>% function()
old_object %<>% function()
(notice that the pipe has two arrows inside it now)
mutate()
The mutate()
function lets us create new columns based on existing columns. In this example, we’ll create a new column that takes the “len” data- guinea pig tooth length measured in mm- and divide it by 100 to create a similar column with cm as the unit of length.
The .
after mutate( refers to how we’re performing the mutate()
function on the object already named after the assignment arrow (the teeth object).
teeth <- teeth %>%
dplyr::mutate(.,
len_cm = len / 100)
head(teeth)
## len supp dose len_cm
## 1 4.2 VC 0.5 0.042
## 2 11.5 VC 0.5 0.115
## 3 7.3 VC 0.5 0.073
## 4 5.8 VC 0.5 0.058
## 5 6.4 VC 0.5 0.064
## 6 10.0 VC 0.5 0.100
We can also mutate multiple columns at once. Let’s add a new column that calculates the mean tooth length for all guinea pigs:
teeth <- teeth %>%
dplyr::mutate(.,
len_cm = len / 100,
mean_len = mean(len))
head(teeth)
## len supp dose len_cm mean_len
## 1 4.2 VC 0.5 0.042 18.81333
## 2 11.5 VC 0.5 0.115 18.81333
## 3 7.3 VC 0.5 0.073 18.81333
## 4 5.8 VC 0.5 0.058 18.81333
## 5 6.4 VC 0.5 0.064 18.81333
## 6 10.0 VC 0.5 0.100 18.81333
group_by()
So now we know the mean tooth length for all guinea pigs, but that’s not very helpful. It’d be more informative to know the mean tooth length for specific groups of guinea pigs- based on the doses of vitamin C adminstered and the method of administration. First, let’s group the subjects by the method of administration (supp) to find the mean tooth length per group (VC or OJ):
teeth_2 <- teeth %>%
dplyr::group_by(supp) %>%
dplyr::mutate(.,
mean_len = mean(len))
head(as.data.frame(teeth_2), 60)
## len supp dose len_cm mean_len
## 1 4.2 VC 0.5 0.042 16.96333
## 2 11.5 VC 0.5 0.115 16.96333
## 3 7.3 VC 0.5 0.073 16.96333
## 4 5.8 VC 0.5 0.058 16.96333
## 5 6.4 VC 0.5 0.064 16.96333
## 6 10.0 VC 0.5 0.100 16.96333
## 7 11.2 VC 0.5 0.112 16.96333
## 8 11.2 VC 0.5 0.112 16.96333
## 9 5.2 VC 0.5 0.052 16.96333
## 10 7.0 VC 0.5 0.070 16.96333
## 11 16.5 VC 1.0 0.165 16.96333
## 12 16.5 VC 1.0 0.165 16.96333
## 13 15.2 VC 1.0 0.152 16.96333
## 14 17.3 VC 1.0 0.173 16.96333
## 15 22.5 VC 1.0 0.225 16.96333
## 16 17.3 VC 1.0 0.173 16.96333
## 17 13.6 VC 1.0 0.136 16.96333
## 18 14.5 VC 1.0 0.145 16.96333
## 19 18.8 VC 1.0 0.188 16.96333
## 20 15.5 VC 1.0 0.155 16.96333
## 21 23.6 VC 2.0 0.236 16.96333
## 22 18.5 VC 2.0 0.185 16.96333
## 23 33.9 VC 2.0 0.339 16.96333
## 24 25.5 VC 2.0 0.255 16.96333
## 25 26.4 VC 2.0 0.264 16.96333
## 26 32.5 VC 2.0 0.325 16.96333
## 27 26.7 VC 2.0 0.267 16.96333
## 28 21.5 VC 2.0 0.215 16.96333
## 29 23.3 VC 2.0 0.233 16.96333
## 30 29.5 VC 2.0 0.295 16.96333
## 31 15.2 OJ 0.5 0.152 20.66333
## 32 21.5 OJ 0.5 0.215 20.66333
## 33 17.6 OJ 0.5 0.176 20.66333
## 34 9.7 OJ 0.5 0.097 20.66333
## 35 14.5 OJ 0.5 0.145 20.66333
## 36 10.0 OJ 0.5 0.100 20.66333
## 37 8.2 OJ 0.5 0.082 20.66333
## 38 9.4 OJ 0.5 0.094 20.66333
## 39 16.5 OJ 0.5 0.165 20.66333
## 40 9.7 OJ 0.5 0.097 20.66333
## 41 19.7 OJ 1.0 0.197 20.66333
## 42 23.3 OJ 1.0 0.233 20.66333
## 43 23.6 OJ 1.0 0.236 20.66333
## 44 26.4 OJ 1.0 0.264 20.66333
## 45 20.0 OJ 1.0 0.200 20.66333
## 46 25.2 OJ 1.0 0.252 20.66333
## 47 25.8 OJ 1.0 0.258 20.66333
## 48 21.2 OJ 1.0 0.212 20.66333
## 49 14.5 OJ 1.0 0.145 20.66333
## 50 27.3 OJ 1.0 0.273 20.66333
## 51 25.5 OJ 2.0 0.255 20.66333
## 52 26.4 OJ 2.0 0.264 20.66333
## 53 22.4 OJ 2.0 0.224 20.66333
## 54 24.5 OJ 2.0 0.245 20.66333
## 55 24.8 OJ 2.0 0.248 20.66333
## 56 30.9 OJ 2.0 0.309 20.66333
## 57 26.4 OJ 2.0 0.264 20.66333
## 58 27.3 OJ 2.0 0.273 20.66333
## 59 29.4 OJ 2.0 0.294 20.66333
## 60 23.0 OJ 2.0 0.230 20.66333
Now let’s incorporate dosage (dose) as another grouping so that the subjects are grouped by both supp and dose:
teeth_2 <- teeth %>%
dplyr::group_by(supp, dose) %>%
dplyr::mutate(.,
mean_len = mean(len))
head(as.data.frame(teeth_2), 60)
## len supp dose len_cm mean_len
## 1 4.2 VC 0.5 0.042 7.98
## 2 11.5 VC 0.5 0.115 7.98
## 3 7.3 VC 0.5 0.073 7.98
## 4 5.8 VC 0.5 0.058 7.98
## 5 6.4 VC 0.5 0.064 7.98
## 6 10.0 VC 0.5 0.100 7.98
## 7 11.2 VC 0.5 0.112 7.98
## 8 11.2 VC 0.5 0.112 7.98
## 9 5.2 VC 0.5 0.052 7.98
## 10 7.0 VC 0.5 0.070 7.98
## 11 16.5 VC 1.0 0.165 16.77
## 12 16.5 VC 1.0 0.165 16.77
## 13 15.2 VC 1.0 0.152 16.77
## 14 17.3 VC 1.0 0.173 16.77
## 15 22.5 VC 1.0 0.225 16.77
## 16 17.3 VC 1.0 0.173 16.77
## 17 13.6 VC 1.0 0.136 16.77
## 18 14.5 VC 1.0 0.145 16.77
## 19 18.8 VC 1.0 0.188 16.77
## 20 15.5 VC 1.0 0.155 16.77
## 21 23.6 VC 2.0 0.236 26.14
## 22 18.5 VC 2.0 0.185 26.14
## 23 33.9 VC 2.0 0.339 26.14
## 24 25.5 VC 2.0 0.255 26.14
## 25 26.4 VC 2.0 0.264 26.14
## 26 32.5 VC 2.0 0.325 26.14
## 27 26.7 VC 2.0 0.267 26.14
## 28 21.5 VC 2.0 0.215 26.14
## 29 23.3 VC 2.0 0.233 26.14
## 30 29.5 VC 2.0 0.295 26.14
## 31 15.2 OJ 0.5 0.152 13.23
## 32 21.5 OJ 0.5 0.215 13.23
## 33 17.6 OJ 0.5 0.176 13.23
## 34 9.7 OJ 0.5 0.097 13.23
## 35 14.5 OJ 0.5 0.145 13.23
## 36 10.0 OJ 0.5 0.100 13.23
## 37 8.2 OJ 0.5 0.082 13.23
## 38 9.4 OJ 0.5 0.094 13.23
## 39 16.5 OJ 0.5 0.165 13.23
## 40 9.7 OJ 0.5 0.097 13.23
## 41 19.7 OJ 1.0 0.197 22.70
## 42 23.3 OJ 1.0 0.233 22.70
## 43 23.6 OJ 1.0 0.236 22.70
## 44 26.4 OJ 1.0 0.264 22.70
## 45 20.0 OJ 1.0 0.200 22.70
## 46 25.2 OJ 1.0 0.252 22.70
## 47 25.8 OJ 1.0 0.258 22.70
## 48 21.2 OJ 1.0 0.212 22.70
## 49 14.5 OJ 1.0 0.145 22.70
## 50 27.3 OJ 1.0 0.273 22.70
## 51 25.5 OJ 2.0 0.255 26.06
## 52 26.4 OJ 2.0 0.264 26.06
## 53 22.4 OJ 2.0 0.224 26.06
## 54 24.5 OJ 2.0 0.245 26.06
## 55 24.8 OJ 2.0 0.248 26.06
## 56 30.9 OJ 2.0 0.309 26.06
## 57 26.4 OJ 2.0 0.264 26.06
## 58 27.3 OJ 2.0 0.273 26.06
## 59 29.4 OJ 2.0 0.294 26.06
## 60 23.0 OJ 2.0 0.230 26.06
That’s more like it!
However, there’s a better way to organize this data…
summarize()
If you don’t need to keep the individual subjects’ tooth length data, then it makes sense to use summarize()
to condense our summarized data into fewer rows. Our output file will have one row for each combination of supp x dose.
teeth_3 <- teeth %>%
dplyr::group_by(supp, dose) %>%
dplyr::summarize(mean_len = mean(len))
## `summarise()` has grouped output by 'supp'. You can override using the `.groups` argument.
head(teeth_3)
## # A tibble: 6 x 3
## # Groups: supp [2]
## supp dose mean_len
## <fct> <dbl> <dbl>
## 1 OJ 0.5 13.2
## 2 OJ 1 22.7
## 3 OJ 2 26.1
## 4 VC 0.5 7.98
## 5 VC 1 16.8
## 6 VC 2 26.1
You can even summarize multiple times, such as if you wanted to first get mean values per supp x dose group and then average those values to get mean length per supp group.
teeth_4 <- teeth %>%
dplyr::group_by(supp, dose) %>%
dplyr::summarize(mean_len = mean(len)) %>%
dplyr::group_by(supp) %>%
dplyr::summarize(mean_len = mean(mean_len))
## `summarise()` has grouped output by 'supp'. You can override using the `.groups` argument.
head(teeth_4)
## # A tibble: 2 x 2
## supp mean_len
## <fct> <dbl>
## 1 OJ 20.7
## 2 VC 17.0
What if you just want to summarize all the rows of your dataset? Then you can use summarize()
in conjunction with the across()
“helper” function.
Let’s calculate the mean values for dose and len_cm. To do so, you’d set up your code like this. Be sure you’re not selecting any non-numeric columns, like supp.
teeth_5 <- teeth %>%
dplyr::summarize(across(dose:len_cm, mean))
teeth_5
## dose len_cm
## 1 1.166667 0.1881333
You can even summarize multiple variables in multiple ways, like in this example, where we’re calculating both mean, maximum, and minimum values for dose and length. In the set of parentheses determining which calculations to perform (mean, max, min), be sure to put the suffix you want to append to the column name first (e.g. mean, maximum, minimum), then the = sign, then the R-recognized calculation name (mean, max, min).
teeth_6 <- teeth %>%
dplyr::summarize(across(c(len, dose),
c(mean = mean,
maximum = max,
minimum = min))) %T>%
print()
## len_mean len_maximum len_minimum dose_mean dose_maximum dose_minimum
## 1 18.81333 33.9 4.2 1.166667 2 0.5
You’ll also notice that I added a special pipe- %T>%- plus the print()
function at the end of the command. This tells R to perform the functions after the special pipe but to not associate those functions with the object being created (teeth_6). This pipe is very useful when creating plots, for instance.
Let’s imagine that you only want to focus on the guinea pigs given orange juice. How would you get rid of the other rows?
filter()
The filter()
function lets you filter out the rows you don’t want. Here we’ve added a line of code to keep only the rows where the guinea pig was given vitamin C as orange juice (OJ), not as pure vitamin C.
teeth_7 <- teeth %>%
dplyr::filter(supp == "OJ")
head(teeth_7)
## len supp dose len_cm mean_len
## 1 15.2 OJ 0.5 0.152 18.81333
## 2 21.5 OJ 0.5 0.215 18.81333
## 3 17.6 OJ 0.5 0.176 18.81333
## 4 9.7 OJ 0.5 0.097 18.81333
## 5 14.5 OJ 0.5 0.145 18.81333
## 6 10.0 OJ 0.5 0.100 18.81333
select()
Similarly, the select()
function lets you filter out the columns you don’t want. Our new dataframe only includes the first three columns: from len to dose. Both of these commands accomplishes the same goal, whether you name the columns by their names or index position.
teeth_8 <- teeth %>%
dplyr::select(len:dose)
head(teeth_8)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
teeth_8 <- teeth %>%
dplyr::select(1:3)
head(teeth_8)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
Let’s load a new built-in dataset: warpbreaks. This dataset contains data regarding the number of breaks in yarn accumulated while weaving based on the yarn’s wool type and amount of tension.
yarn <- warpbreaks
We’re going to set up some linear models- one for each type of wool- and then use three functions from the broom
package to summarize important information about the models which will be generated in a format that makes it simple to compare and plot models.
As stated in the broom
overview page, here is the role of each function:
tidy()
:“Summarizes information about model components”
augment()
“Reports information about the entire model”
glance()
“Adds informations about observations to a dataset”
mod1 <- lm(breaks ~ tension,
yarn %>% filter(wool == "A"))
mod2 <- lm(breaks ~ tension,
yarn %>% filter(wool == "B"))
tidy()
broom::tidy(mod1)
## # A tibble: 3 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 44.6 4.34 10.3 2.91e-10
## 2 tensionM -20.6 6.13 -3.35 2.66e- 3
## 3 tensionH -20 6.13 -3.26 3.32e- 3
# broom::tidy(mod2)
augment()
broom::augment(mod1)
## # A tibble: 27 x 8
## breaks tension .fitted .resid .hat .sigma .cooksd .std.resid
## <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 26 L 44.6 -18.6 0.111 12.6 0.0953 -1.51
## 2 30 L 44.6 -14.6 0.111 12.9 0.0586 -1.19
## 3 54 L 44.6 9.44 0.111 13.1 0.0247 0.770
## 4 25 L 44.6 -19.6 0.111 12.6 0.106 -1.59
## 5 70 L 44.6 25.4 0.111 12.0 0.179 2.07
## 6 52 L 44.6 7.44 0.111 13.2 0.0153 0.607
## 7 51 L 44.6 6.44 0.111 13.2 0.0115 0.525
## 8 26 L 44.6 -18.6 0.111 12.6 0.0953 -1.51
## 9 67 L 44.6 22.4 0.111 12.3 0.139 1.83
## 10 18 M 24 -6 0.111 13.2 0.00996 -0.489
## # ... with 17 more rows
# broom::augment(mod2)
glance()
broom::glance(mod1)
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.378 0.326 13.0 7.29 0.00336 2 -106. 220. 225.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# broom::glance(mod2)
Notice that we are using the group_by
function perform the same command on each wool type but in the same chunk of code.
tidy()
# tidy
mod_all.tidy <- yarn %>%
group_by(wool) %>%
do(fit = broom::tidy(lm(breaks ~ tension,
data = .))) %>%
unnest(fit) %T>%
print()
## # A tibble: 6 x 6
## wool term estimate std.error statistic p.value
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 A (Intercept) 44.6 4.34 10.3 2.91e-10
## 2 A tensionM -20.6 6.13 -3.35 2.66e- 3
## 3 A tensionH -20 6.13 -3.26 3.32e- 3
## 4 B (Intercept) 28.2 2.79 10.1 3.91e-10
## 5 B tensionM 0.556 3.94 0.141 8.89e- 1
## 6 B tensionH -9.44 3.94 -2.39 2.48e- 2
augment()
mod_all.augment <- yarn %>%
group_by(wool) %>%
do(fit = broom::augment(lm(breaks ~ tension,
data = .))) %>%
unnest(fit) %T>%
print()
## # A tibble: 54 x 9
## wool breaks tension .fitted .resid .hat .sigma .cooksd .std.resid
## <fct> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A 26 L 44.6 -18.6 0.111 12.6 0.0953 -1.51
## 2 A 30 L 44.6 -14.6 0.111 12.9 0.0586 -1.19
## 3 A 54 L 44.6 9.44 0.111 13.1 0.0247 0.770
## 4 A 25 L 44.6 -19.6 0.111 12.6 0.106 -1.59
## 5 A 70 L 44.6 25.4 0.111 12.0 0.179 2.07
## 6 A 52 L 44.6 7.44 0.111 13.2 0.0153 0.607
## 7 A 51 L 44.6 6.44 0.111 13.2 0.0115 0.525
## 8 A 26 L 44.6 -18.6 0.111 12.6 0.0953 -1.51
## 9 A 67 L 44.6 22.4 0.111 12.3 0.139 1.83
## 10 A 18 M 24 -6 0.111 13.2 0.00996 -0.489
## # ... with 44 more rows
glance()
mod_all.glance <- yarn %>%
group_by(wool) %>%
do(fit = broom::glance(lm(breaks ~ tension,
data = .))) %>%
unnest(fit) %T>%
print()
## # A tibble: 2 x 13
## wool r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 A 0.378 0.326 13.0 7.29 0.00336 2 -106. 220. 225.
## 2 B 0.253 0.190 8.37 4.06 0.0303 2 -94.1 196. 201.
## # ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
Let’s practice writing vectorized code!
Ex.1 First, use the apply
function to find the median value across columns of the “dataset” data frame.
Answer Ex.1
apply(mtcars, 1, median)
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## 4.000 4.000 4.000 3.215
## Hornet Sportabout Valiant Duster 360 Merc 240D
## 3.440 3.460 4.000 4.000
## Merc 230 Merc 280 Merc 280C Merc 450SE
## 4.000 4.000 4.000 4.070
## Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental
## 3.730 3.780 5.250 5.424
## Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla
## 5.345 4.000 4.000 4.000
## Toyota Corona Dodge Challenger AMC Javelin Camaro Z28
## 3.700 3.520 3.435 4.000
## Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa
## 3.845 4.000 4.430 4.000
## Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E
## 5.000 6.000 8.000 4.000
Ex.2 Next, find the average number of breaks from yarn with wool type A and tensions L & M only. You should end up with one dataframe with 2 rows and 2 columns (“tension” and “mean_breaks”) showing mean A/L breaks and A/M breaks. Hint: Take it one function at a time (you’ll use 3) and check your code as you go to see that you’re on the right track.
Answer Ex.2
yarn %>%
filter(wool == "A" & tension != "H") %>%
group_by(tension) %>%
summarise(mean_breaks = mean(breaks)) %T>%
print()
## # A tibble: 2 x 2
## tension mean_breaks
## <fct> <dbl>
## 1 L 44.6
## 2 M 24
## # A tibble: 2 x 2
## tension mean_breaks
## <fct> <dbl>
## 1 L 44.6
## 2 M 24
Ex.3 Lastly, choose only the rows with supp = VC from the teeth data frame and then create 2 new columns: one with the product of len x dose (called “len_x_dose”) and the other with the ranking for the len values (called “rank”) in descending order (the largest value should rank at 1). Hint: here’s how to write the ranking function.
Answer Ex.3
teeth %>%
filter(supp == "VC") %>%
mutate(len_x_dose = len*dose,
rank = row_number(desc(len)))
## len supp dose len_cm mean_len len_x_dose rank
## 1 4.2 VC 0.5 0.042 18.81333 2.10 30
## 2 11.5 VC 0.5 0.115 18.81333 5.75 21
## 3 7.3 VC 0.5 0.073 18.81333 3.65 25
## 4 5.8 VC 0.5 0.058 18.81333 2.90 28
## 5 6.4 VC 0.5 0.064 18.81333 3.20 27
## 6 10.0 VC 0.5 0.100 18.81333 5.00 24
## 7 11.2 VC 0.5 0.112 18.81333 5.60 22
## 8 11.2 VC 0.5 0.112 18.81333 5.60 23
## 9 5.2 VC 0.5 0.052 18.81333 2.60 29
## 10 7.0 VC 0.5 0.070 18.81333 3.50 26
## 11 16.5 VC 1.0 0.165 18.81333 16.50 15
## 12 16.5 VC 1.0 0.165 18.81333 16.50 16
## 13 15.2 VC 1.0 0.152 18.81333 15.20 18
## 14 17.3 VC 1.0 0.173 18.81333 17.30 13
## 15 22.5 VC 1.0 0.225 18.81333 22.50 9
## 16 17.3 VC 1.0 0.173 18.81333 17.30 14
## 17 13.6 VC 1.0 0.136 18.81333 13.60 20
## 18 14.5 VC 1.0 0.145 18.81333 14.50 19
## 19 18.8 VC 1.0 0.188 18.81333 18.80 11
## 20 15.5 VC 1.0 0.155 18.81333 15.50 17
## 21 23.6 VC 2.0 0.236 18.81333 47.20 7
## 22 18.5 VC 2.0 0.185 18.81333 37.00 12
## 23 33.9 VC 2.0 0.339 18.81333 67.80 1
## 24 25.5 VC 2.0 0.255 18.81333 51.00 6
## 25 26.4 VC 2.0 0.264 18.81333 52.80 5
## 26 32.5 VC 2.0 0.325 18.81333 65.00 2
## 27 26.7 VC 2.0 0.267 18.81333 53.40 4
## 28 21.5 VC 2.0 0.215 18.81333 43.00 10
## 29 23.3 VC 2.0 0.233 18.81333 46.60 8
## 30 29.5 VC 2.0 0.295 18.81333 59.00 3