Introduction to R functions

Introduction

One of the more powerful aspects of R relative to other statistical software such as SPSS is the ability to create your own functions. Functions are extremely powerful because they allow you to automate repetitive tasks and reduce the bloated code that comes with the traditional "copy-paste" workflow. They become especially useful when combined with loops or the apply family. However, that will be covered later in this presentation.

Why use functions?

  • save complicated operations or mathematical calculations
  • reduce repetitive tasks
  • standardize formats (e.g. plots, maps, etc)
  • combine with apply family for automation of tasks

Function Syntax

The basic syntax of a function is fairly simple:

function.name <- function(arguments){ arguments + operation }

Here, arguments are named placeholders that will later be used inside the function body. The body generally has two components: the part of the operation that remains unchanged each time the function is used, and the arguments, whose values change with each call.

Ex.1

add.3 <- function(x){
  x + 3
}

add.3(x=5)
## [1] 8

Two things typically remain unchanged between uses of a function. The first is the calculation itself: in the example above, the addition operation never changes. Functions like this are useful when you have a particular index you repeatedly calculate but the inputs keep changing.

Ex.2

The log response ratio (LRR) is a common effect-size calculation for comparing two groups in a treatment-versus-control design.

LRR <- function(treatment,control){
  log(treatment/control)
}


LRR(treatment=30, control=20)
## [1] 0.4054651

Here we see the log response ratio of our two values. Naming the arguments is not strictly necessary: depending on how you set up your arguments, R will assume the inputs are supplied in the order the arguments were defined. The function can also handle any data format (number, vector, data frame, shapefile, etc.), as long as the operations inside the function can handle it.
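For instance, positional and named matching give the same result. A small sketch (repeating the LRR definition so it runs on its own):

```r
## LRR as defined above
LRR <- function(treatment, control){
  log(treatment/control)
}

## positional matching: first value -> treatment, second -> control
LRR(30, 20)
## [1] 0.4054651

## naming the arguments lets you supply them in any order
LRR(control = 20, treatment = 30)
## [1] 0.4054651
```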

Ex.3

treat1 <- c(30,40,50,35,45,55)
control1 <- c(20,25,30,23,25,20)

LRR(treat1, control1)
## [1] 0.4054651 0.4700036 0.5108256 0.4198538 0.5877867 1.0116009

To drive the point home a bit more, you can even embed functions within other functions, something that turns out to be more useful than you might expect.

Ex.4

## Create a function to calculate an average (base R's mean() does this too)
average <- function(x){
  sum(x)/length(x) 
}

## assign LRR to object
experiment.results <- LRR(treat1, control1)

## calculate the average of results from LRR
average(experiment.results)
## [1] 0.5675893

The second major component of a function that remains consistent between uses is its settings. This is especially true when you are running analyses with specific assumptions, weightings, or syntax.

For example, imagine you would like to run a series of generalized linear models with specific settings, such as a Poisson family. The function would look like this:

glm.pois <- function(predictor, response){
  m1 <- glm(response ~ predictor, family = poisson) ## glm formula with Poisson family
  anova(m1, test = "Chisq") ## test for a difference among groups
}

y <- rpois(10, 5) ## draw 10 random values from a Poisson distribution
x <- rep(c("group1","group2"), 5) ## create a grouping variable

glm.pois(x,y)
## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: y
## 
## Terms added sequentially (first to last)
## 
## 
##           Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                          9     7.2103         
## predictor  1        0         8     7.2103        1
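A related point: an R function returns the value of its last evaluated expression, so to get several results out of one function, wrap them in a list. A small sketch (lrr.summary is a hypothetical helper, not part of the examples above):

```r
## return the raw log response ratios along with their mean and count
lrr.summary <- function(treatment, control){
  lrr <- log(treatment/control)  ## same calculation as LRR above
  list(values = lrr, mean = mean(lrr), n = length(lrr))
}

res <- lrr.summary(c(30, 40, 50), c(20, 25, 30))
res$mean
## [1] 0.4620981
res$n
## [1] 3
```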

Functions for plots

Creating publication-quality plots in R takes effort. The aesthetics are getting better, and packages such as ggplot2 help format plots to be more appealing, but the defaults are often not publication quality. It can be helpful to create a function that matches your ideal plot archetype and then edit only the inputs.

Let's use the built-in cars dataset to plot the correlation between speed and stopping distance.

cars
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17
## 11    11   28
## 12    12   14
## 13    12   20
## 14    12   24
## 15    12   28
## 16    13   26
## 17    13   34
## 18    13   34
## 19    13   46
## 20    14   26
## 21    14   36
## 22    14   60
## 23    14   80
## 24    15   20
## 25    15   26
## 26    15   54
## 27    16   32
## 28    16   40
## 29    17   32
## 30    17   40
## 31    17   50
## 32    18   42
## 33    18   56
## 34    18   76
## 35    18   84
## 36    19   36
## 37    19   46
## 38    19   68
## 39    20   32
## 40    20   48
## 41    20   52
## 42    20   56
## 43    20   64
## 44    22   66
## 45    23   54
## 46    24   70
## 47    24   92
## 48    24   93
## 49    24  120
## 50    25   85

We can use the default plot to see the output. On first look, the output is less than ideal. To resolve this I will write a wrapper function containing my preferred defaults (a set of such defaults could also be kept in a separate R script and loaded with the source function).

Ex.5

plot(cars)

plot.better <- function(x, y, ...){
  plot(x, y, pch = 21, cex = 2, cex.axis = 1.3, cex.lab = 1.5, bg = "grey70", ...)
}

plot.better(cars[,"speed"],cars[,"dist"], xlab="speed", ylab="distance")

Here, I have added the ... syntax. This allows additional arguments to be passed through your function, arguments that are likely to change each time you use it. This is helpful because you can combine arguments that remain static, arguments that always change, and arguments that change only some of the time.
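The same ... trick works outside of plotting. As a minimal sketch, here is a wrapper around paste that fixes the separator while forwarding everything else (dash.paste is a made-up example function):

```r
dash.paste <- function(...){
  paste(..., sep = "-")  ## the separator is static; the pieces come through ...
}

dash.paste("2024", "01", "15")
## [1] "2024-01-15"
```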

Functions for summaries

The reason I find functions so helpful is that they can be embedded in other functions or automation tools in R. The apply family is a perfect example of this and is what we will cover shortly. First, though, a quick example using another method. The aggregate function in R is useful for computing summary statistics by a grouping variable. The syntax is similar to a linear model and can be written as response ~ predictor. The third argument is the function to apply to each group. Typically, I use mean to get averages per treatment, but a whole host of other metrics can be used as well.

Let us first start with a dataset with two drug treatments and a control on 60 individuals.

Ex.6

response <- c(4.31068872,4.02239617,5.63827473,7.72906176,7.35096786,4.2539145,7.9450572,3.5814285,6.430872,4.64378698,3.55665898,4.99139648,4.036698,5.46499184,3.85910911,3.5291655,5.90729568,5.674248,7.23597678,4.39489848,7.63737,9.6482263,7.28738049,7.4192216,10.59532723,9.2013317,9.4829496,7.1599644,6.940219,7.30966038,7.10753942,9.28011484,8.97616,9.08440512,7.2778324,7.8634062,8.15142375,8.9925216,8.69784123,5.5644162,10.92730496,10.1363948,9.41244444,10.263204,11.38765203,13.2568931,14.4508848,7.5576591,8.530619,12.19368536,6.74443276,13.08433484,10.727451,13.27716208,9.49188068,9.8798724,10.98806094,9.79678656,10.81542819,7.57395576)

treatment <- c(rep("control",20),rep("drugA",20),rep("drugB",20))

data <- data.frame(treatment, response)
data
##    treatment  response
## 1    control  4.310689
## 2    control  4.022396
## 3    control  5.638275
## 4    control  7.729062
## 5    control  7.350968
## 6    control  4.253914
## 7    control  7.945057
## 8    control  3.581428
## 9    control  6.430872
## 10   control  4.643787
## 11   control  3.556659
## 12   control  4.991396
## 13   control  4.036698
## 14   control  5.464992
## 15   control  3.859109
## 16   control  3.529165
## 17   control  5.907296
## 18   control  5.674248
## 19   control  7.235977
## 20   control  4.394898
## 21     drugA  7.637370
## 22     drugA  9.648226
## 23     drugA  7.287380
## 24     drugA  7.419222
## 25     drugA 10.595327
## 26     drugA  9.201332
## 27     drugA  9.482950
## 28     drugA  7.159964
## 29     drugA  6.940219
## 30     drugA  7.309660
## 31     drugA  7.107539
## 32     drugA  9.280115
## 33     drugA  8.976160
## 34     drugA  9.084405
## 35     drugA  7.277832
## 36     drugA  7.863406
## 37     drugA  8.151424
## 38     drugA  8.992522
## 39     drugA  8.697841
## 40     drugA  5.564416
## 41     drugB 10.927305
## 42     drugB 10.136395
## 43     drugB  9.412444
## 44     drugB 10.263204
## 45     drugB 11.387652
## 46     drugB 13.256893
## 47     drugB 14.450885
## 48     drugB  7.557659
## 49     drugB  8.530619
## 50     drugB 12.193685
## 51     drugB  6.744433
## 52     drugB 13.084335
## 53     drugB 10.727451
## 54     drugB 13.277162
## 55     drugB  9.491881
## 56     drugB  9.879872
## 57     drugB 10.988061
## 58     drugB  9.796787
## 59     drugB 10.815428
## 60     drugB  7.573956
## calculate means for each treatment

aggregate(response~treatment, data=data, FUN=mean)
##   treatment  response
## 1   control  5.227844
## 2     drugA  8.183866
## 3     drugB 10.524805

Here we calculated the mean for each treatment by specifying which function should operate on each treatment group. This is not restricted to base R functions: it can be expanded to include linear models, complex index calculations, or anything else applied at the group level. Let's try it with a user-defined function.

## function for standard error
se <- function(x) sqrt(var(x)/length(x))

aggregate(response~treatment, data=data, FUN=se)
##   treatment  response
## 1   control 0.3265749
## 2     drugA 0.2725698
## 3     drugB 0.4596505
## calculate confidence intervals

conf95 <- function(x) sqrt(var(x, na.rm = TRUE)/length(na.omit(x))) * 1.96

aggregate(response~treatment, data=data, FUN=conf95)
##   treatment  response
## 1   control 0.6400869
## 2     drugA 0.5342369
## 3     drugB 0.9009150
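aggregate can even return several statistics at once if the supplied function returns a named vector; the result then prints the statistics side by side as matrix columns. A sketch with toy data standing in for the drug-trial data above:

```r
se <- function(x) sqrt(var(x)/length(x))  ## standard error, as defined above

## toy data standing in for the drug-trial data built above
df <- data.frame(treatment = rep(c("control", "drugA"), each = 3),
                 response  = c(4, 5, 6, 8, 9, 10))

## anonymous function returning both the mean and the standard error
res.agg <- aggregate(response ~ treatment, data = df,
                     FUN = function(x) c(mean = mean(x), se = se(x)))
res.agg
```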

There are similar tools in the tidyverse within dplyr. For example, for a dataset with site and temperature columns:

data %>% group_by(site) %>% summarize(temp.avg = mean(temperature), temp.se = conf95(temperature))
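A runnable version of that idea, using a small treatment/response dataset in the style of Ex.6 (this assumes the dplyr package is installed; conf95 is the function defined above):

```r
library(dplyr)

conf95 <- function(x) sqrt(var(x, na.rm = TRUE)/length(na.omit(x))) * 1.96

## small stand-in for the drug-trial data from Ex.6
df <- data.frame(treatment = rep(c("control", "drugA", "drugB"), each = 4),
                 response  = c(4, 5, 6, 5, 8, 9, 7, 8, 10, 11, 12, 9))

## group-wise mean and 95% confidence interval, dplyr style
out <- df %>%
  group_by(treatment) %>%
  summarize(resp.avg = mean(response), resp.ci = conf95(response))
out
```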