1.0 Introduction

(modified from Julie Lowndes workshop)

To start off, it helps to say what R is and why anyone should bother to use it. R is a programming language that is typically used for data management and statistical analysis. More broadly, R is a place where you can manage your data, analyze it, collaborate, and keep a record of what you do. R has a number of advantages over other statistical software.

  1. R is free!
  2. R combines the roles of many different kinds of software, including:
  • data manipulation
  • statistical analysis
  • graphics
  • databases
  • automation
  • machine learning
  3. R is open source, allowing people to develop new tools to be used in the R environment.

1.1 R basics, workspace and working directory

Launch RStudio

Start by opening a new project under File -> New Project -> New Directory -> Empty Directory -> Name the project. Once your project has been opened, open an R script under File -> New File -> R script.

Now you are ready to get started in R. RStudio should currently show four distinct panels. In the top left is your R script, which acts as a notepad for everything you write. In the top right is the environment, which stores the objects that you create or load; we will discuss it more later. In the bottom right is the file browser, which shows your working directory, where your data and scripts are saved. Lastly, in the bottom left is the R console, where the code from your script is run.

Let’s start with some basic arithmetic. Type the following in the script panel. You can then run the code by selecting the line and clicking Run in the top right corner of the script panel, or by using the Cmd/Ctrl+Enter shortcut.

3 * 4
## [1] 12

R uses the <- symbol to assign a value to a variable.

x <- 3
x
## [1] 3

Here the object, in this case x, is on the left and the value to be assigned is on the right. We can therefore assign numbers to objects and perform calculations with them.

x <- 3
y <- 4
x*y
## [1] 12

Often we give objects names that are easier to understand. An object name can combine letters, numbers, periods, and underscores, as long as it starts with a letter and does not contain any spaces.

product.of.x.y <- x*y
product.of.x.y
## [1] 12

The computer is extremely literal with its inputs. Typos and mismatched capitalization are common sources of errors, because R is case-sensitive: x and X are different names. Keeping names simple helps reduce those errors.
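
As a small sketch of how literal R is, the lines below create an object and then look it up under two spellings. The name total.count is made up for this illustration; exists() is a base R function that reports whether a name is defined.

```r
total.count <- 12        ## hypothetical object, for illustration only

exists("total.count")    ## the name exactly as it was typed
## [1] TRUE
exists("Total.Count")    ## different capitalization, so a different name
## [1] FALSE
```

Only the first spelling is found, which is why a stray capital letter is enough to produce an error.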

Comparison operators are another feature that R shares with most programming languages. Each one returns TRUE or FALSE. They are as follows:

  • == means ‘is equal to’
  • != means ‘is not equal to’
  • < means ‘is less than’
  • > means ‘is greater than’
  • <= means ‘is less than or equal to’
  • >= means ‘is greater than or equal to’

product.of.x.y == 12
## [1] TRUE
product.of.x.y < 5
## [1] FALSE
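
The comparisons above each test a single value. As a further sketch, the same operators also work element by element on a vector of several values. The vector speeds below is made up for this illustration, and c() is the base R function that combines values into a vector.

```r
speeds <- c(4, 7, 10, 12)   ## hypothetical values, for illustration only

speeds >= 10                ## one TRUE or FALSE per element
## [1] FALSE FALSE  TRUE  TRUE
sum(speeds >= 10)           ## TRUE counts as 1, so this counts the matches
## [1] 2
```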

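One operator that is uncommon in general programming but widely used for statistics in R is the tilde ~. It separates the response from the predictors in a model formula and can be read as “is modeled as a function of” (or “regressed on”). It will be discussed in the statistics part of this workshop.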

1.2 Functions in R

R has a nearly endless variety of functions. Some are already loaded in what is referred to as base R; many more are available through additional packages that can be installed, and new ones come out every day. Here we are only going to cover those available in base R. First, it is important to understand how functions work.

In its most basic form, a function is something that takes a list of arguments and does something with them. A simple example is the sequence function seq(). It has three basic arguments: the starting number, the ending number, and the increment between consecutive numbers.

seq(from=1 , to=10, by=1)
##  [1]  1  2  3  4  5  6  7  8  9 10
seq.2.5 <- seq(from=0, to=10, by=2.5)
seq.2.5
## [1]  0.0  2.5  5.0  7.5 10.0

Functions have an order to their arguments, and some arguments have default values. For instance, if we do not name the from, to, and by arguments, R matches the supplied values to them by position.

seq(1,10,1)
##  [1]  1  2  3  4  5  6  7  8  9 10

R also uses 1 as the default for the by argument when no value is supplied.

seq(1,10)
##  [1]  1  2  3  4  5  6  7  8  9 10

R has your back and tries to run the function with whatever information you have provided. Be careful, though! Sometimes your function will run, but the defaults are not what you intended. It is always best to check the structure of a function before relying on it. To do this, we call up the help page using a question mark: ?seq.
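
One way to guard against surprises from argument order and defaults is to name the arguments explicitly. The sketch below is only an illustration: named arguments can be given in any order, and length.out is another argument of seq() that sets how many values to return instead of the step size.

```r
seq(by = 2, to = 9, from = 1)           ## named arguments, in any order
## [1] 1 3 5 7 9
seq(from = 0, to = 1, length.out = 5)   ## ask for five evenly spaced values
## [1] 0.00 0.25 0.50 0.75 1.00
```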

The help page is broken down into sections:

  • Description: An extended description of what the function does.
  • Usage: The arguments of the function and their default values.
  • Arguments: An explanation of the data each argument is expecting.
  • Details: Any important details to be aware of.
  • Value: The data the function returns.
  • See Also: Any related functions you might find useful.
  • Examples: Some examples for how to use the function.
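
Besides the help page, the arguments and defaults of a function can be inspected from the console. This is a minimal sketch using the base R functions args() and formals(). It assumes we look at seq.default, the method that seq() uses for ordinary numbers, because seq() itself only passes its arguments along.

```r
args(seq.default)             ## print the arguments and their default values
names(formals(seq.default))   ## just the argument names, as a character vector
```

The first three names returned are from, to, and by, which is the positional order used in the earlier examples.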

1.3 R environment and types of data

You may have noticed that the top right panel has been populated with values. Currently there should be four objects present. These are the objects you have created, which are stored in R's memory. If you type the name of any of them, such as seq.2.5, the respective values will be returned.

You can also see a list of these objects using a function:

objects() ## list objects in R
## [1] "product.of.x.y" "seq.2.5"        "x"              "y"

If you want to remove a particular object that could be interfering with something else, you can do that as well:

rm(product.of.x.y) ## remove object
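
As a brief sketch, rm() can also remove several objects in one call. The names a.temp and b.temp are made up for this illustration so that none of the workshop objects are touched.

```r
a.temp <- 1              ## hypothetical objects, for illustration only
b.temp <- 2
rm(a.temp, b.temp)       ## remove both in a single call
exists("a.temp")         ## confirm that the object is gone
## [1] FALSE
```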

You may have also noticed that there are different types of data stored as objects. Let’s take a look at some we have created already. The class function tells us what type of data is stored.

class(x)
## [1] "numeric"

Here x is classified as numeric. It is stored as a numeric vector, meaning a one-dimensional series of numbers (in this case, a series of length one). R has the capacity to store everything from numbers to images to website links, all as objects.

R comes with several sample data sets that are available at all times. One of them is called cars. We can see it here:
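
To illustrate a few other basic types, the sketch below calls class() on values typed directly into the console. The values themselves are arbitrary examples.

```r
class("speed")        ## text in quotes is character data
## [1] "character"
class(TRUE)           ## TRUE and FALSE are logical data
## [1] "logical"
class(c(1.5, 2, 7))   ## c() combines values into a numeric vector
## [1] "numeric"
```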

cars
##    speed dist
## 1      4    2
## 2      4   10
## 3      7    4
## 4      7   22
## 5      8   16
## 6      9   10
## 7     10   18
## 8     10   26
## 9     10   34
## 10    11   17
## 11    11   28
## 12    12   14
## 13    12   20
## 14    12   24
## 15    12   28
## 16    13   26
## 17    13   34
## 18    13   34
## 19    13   46
## 20    14   26
## 21    14   36
## 22    14   60
## 23    14   80
## 24    15   20
## 25    15   26
## 26    15   54
## 27    16   32
## 28    16   40
## 29    17   32
## 30    17   40
## 31    17   50
## 32    18   42
## 33    18   56
## 34    18   76
## 35    18   84
## 36    19   36
## 37    19   46
## 38    19   68
## 39    20   32
## 40    20   48
## 41    20   52
## 42    20   56
## 43    20   64
## 44    22   66
## 45    23   54
## 46    24   70
## 47    24   92
## 48    24   93
## 49    24  120
## 50    25   85

The data is stored as a dataframe, which is a collection of equal-length vectors organized into columns, each with a heading. Let’s take a look at what class it belongs to and its top five rows.

class(cars) ## what is the class of cars
## [1] "data.frame"
head(cars,5) ## list top five rows
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16
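
A few other base R functions give a quick overview of a dataframe without printing every row. The sketch below is one possible set of checks on cars.

```r
dim(cars)       ## number of rows, then number of columns
## [1] 50  2
names(cars)     ## column headings
## [1] "speed" "dist"
str(cars)       ## compact description of each column
summary(cars)   ## minimum, quartiles, mean and maximum of each column
```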

Dataframes are organized as a series of rows and columns. Using square brackets [] we can reference certain portions of the dataframe. If we want certain rows, the numbers that go before the comma are the indices of those rows.

cars[1,] ## first row
##   speed dist
## 1     4    2
cars[1:5,] ## first five rows
##   speed dist
## 1     4    2
## 2     4   10
## 3     7    4
## 4     7   22
## 5     8   16

Alternatively, if we want a particular column, we can use either its column number or its name. This is placed after the comma.

cars[,2] ## second column
##  [1]   2  10   4  22  16  10  18  26  34  17  28  14  20  24  28  26  34
## [18]  34  46  26  36  60  80  20  26  54  32  40  32  40  50  42  56  76
## [35]  84  36  46  68  32  48  52  56  64  66  54  70  92  93 120  85
cars[,"dist"] ## second column using title
##  [1]   2  10   4  22  16  10  18  26  34  17  28  14  20  24  28  26  34
## [18]  34  46  26  36  60  80  20  26  54  32  40  32  40  50  42  56  76
## [35]  84  36  46  68  32  48  52  56  64  66  54  70  92  93 120  85
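
A common shorthand for selecting a column by name is the $ operator. The sketch below checks that it returns the same vector as the bracket notation used above.

```r
head(cars$speed, 3)                   ## first three values of the speed column
## [1] 4 4 7
identical(cars$dist, cars[,"dist"])   ## confirm the two forms agree
## [1] TRUE
```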

If we want a particular value, we can combine the two. This allows us to extract particular values or change them in the dataset.

cars[2,"dist"] ## second row, second column
## [1] 10
cars[2,"dist"] <- 9 ## change that value to 9

cars[2,"dist"] ## check that the value was changed
## [1] 9
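
Square brackets also accept a comparison in place of row numbers, which keeps only the rows where the comparison is TRUE. The sketch below is an illustration; the cutoff of 20 and the name fast.cars are arbitrary choices.

```r
fast.cars <- cars[cars[,"speed"] > 20,]   ## rows where speed is above 20
nrow(fast.cars)                           ## count the rows that were kept
## [1] 7
```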

Other popular data classes are lists and matrices. A list is a series of objects that have been grouped together into another object. A matrix is arranged in rows and columns like a dataframe, but every value in it must be of the same type, whereas each column of a dataframe can hold a different type.
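
As a minimal sketch of these two classes, the lines below build a small matrix and a small list with the base R functions matrix() and list(). All names and values are made up for this illustration.

```r
m <- matrix(1:6, nrow = 2)   ## six numbers arranged in 2 rows and 3 columns
dim(m)
## [1] 2 3
m[2,3]                       ## row 2, column 3, same bracket notation as a dataframe
## [1] 6

info <- list(name = "cars", rows = 50, speeds = c(4, 7, 8))   ## mixed types in one object
info$name                    ## extract an element by name with $
## [1] "cars"
info$speeds[2]               ## second value of the speeds element
## [1] 7
```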

1.4 Data visualization

Let’s do a simple plot of the cars data set. We want to plot distance against speed. However, right now each of those vectors is stored inside the dataframe. We can extract them with square brackets.

speed <- cars[,"speed"]
dist <- cars[,"dist"]

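We can draw a simple x-by-y scatterplot and add a fitted line. Here line() fits a robust line (Tukey's resistant line) rather than an ordinary least-squares regression.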

plot(speed, dist)
z <- line(cars)
abline(coef(z), col = "red")
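
The line() function used above fits Tukey's resistant line, which is a robust alternative to ordinary least squares. If a least-squares line of best fit is wanted, one option is the base R function lm(), sketched below. The formula dist ~ speed uses the tilde operator to say that dist is modeled as a function of speed; the object name fit and the colour are arbitrary choices.

```r
fit <- lm(dist ~ speed, data = cars)   ## least-squares regression of dist on speed
plot(cars[,"speed"], cars[,"dist"])    ## redraw the scatterplot
abline(fit, col = "blue")              ## add the least-squares line
coef(fit)                              ## intercept and slope of the fitted line
```

Drawing both lines on the same plot is a quick way to see how much a few extreme points pull the least-squares fit away from the resistant one.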

A histogram is great for checking the distribution of a data set, and the hist function draws one. We can generate random values from a normal distribution using rnorm, then check their shape using hist.

pop1 <- rnorm(100) ## generate 100 random numbers from a normal population

hist(pop1) ## plot randomly generated numbers

hist(dist) ## plot distribution of distances
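
Because rnorm draws new random numbers on every run, the histogram of pop1 will look slightly different each time. The sketch below shows one way to make the result repeatable with set.seed(). The seed value, the mean and standard deviation, and the number of breaks are all arbitrary choices for illustration.

```r
set.seed(42)                            ## fix the random number generator
pop2 <- rnorm(100, mean = 10, sd = 2)   ## 100 draws with mean 10 and sd 2
h <- hist(pop2, breaks = 20,            ## suggest about 20 bins
          main = "Simulated population", xlab = "Value")
sum(h$counts)                           ## every value falls in some bin
## [1] 100
```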