Chapter 1 R basics
1.1 Setup
1.1.1 Downloading and installing R and RStudio
When we want to describe or analyze our data, our computer will do the number crunching and make the pretty graphs, but we have to tell it what to do.
R is a programming language that is particularly good for analyzing and visualizing quantitative data, and it is one of the most-used programming languages in data science today. RStudio is a user-friendly software application that provides a collection of several components that we will use a lot. It is possible to use R without RStudio (you can talk to your computer using R directly via a command line), but RStudio makes things a lot easier!
You will need to download and install R and RStudio separately. To do this, please follow the instructions for downloading and installing both programs in Appendix A in Grolemund (2014).
If you have used R before, I recommend that you re-download both of these so you are working with the latest versions.
1.1.2 Using RStudio
RStudio allows you to see several different sub-windows at once. You’ll see different tabs in each sub-window that show different things. You can customize which of these you see (and where to put them), but for this course, the most important ones will be:
- Console: This will be on the bottom left (or the whole left side) of the screen by default when you open RStudio. The console is where all the action happens: you type in commands and the computer does your bidding!
- Display window: On the upper left, you can view files like datasets or your R scripts (if you don’t have any files open, this will be empty and the whole left side will be taken up by the console).
- Environment: This is usually on the top right. This will show you the list of all objects saved in your workspace, including datasets and variables.
- Plots: This is usually on the bottom right. When you create graphs, they will show up here.
1.2 Using the console
This is a basic introduction to what you can do with the console. We’ll be discussing this more systematically later. Try out the following things by entering them into the command line after >.
1.2.2 Objects
You’ll be working with different types of data in R: numbers, lists, and dataframes (like spreadsheets, a grid of columns and rows). In order to save and work with this data, we use objects (these are also called variables). An object is just a name you can use to store data.
When you have data (a number, list, or spreadsheet) you want to work with, you can assign it a name to save it as an object. This gives you a way to refer to the data. Objects are assigned with the symbol <-. When you create an object, it will be saved as part of your Environment (you can see this in the top right panel). This will be very useful when we do more complicated operations.
Object names:
- must begin with a letter
- can only contain letters, numbers, underscores, or periods. No spaces allowed!
The first line of the following code assigns the number 3 to the object ‘cats.’ After assigning this variable, you can see the result if you just type ‘cats’ in the console. You can then do other operations with this. You can change the value of the object, and subsequent operations will refer to the updated object.
Run the lines of code below in the console. Try to figure out what answer you will get before you run the code.
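Here is a minimal sketch of the kind of code this passage describes. The object name cats and the value 3 come from the text; the second value (10) and the object dogs are made up for illustration. The results are deliberately not shown, so you can predict them first.

```r
# Assign the number 3 to an object named cats
cats <- 3

# Typing the name of the object prints its stored value
cats

# Use the object in a calculation
cats + 2

# Reassign the object: later operations use the updated value
cats <- 10   # illustrative new value, not from the original text
cats + 2

# Objects can be built from other objects ('dogs' is a hypothetical name)
dogs <- cats * 2
dogs
```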
In these examples, each object just consists of a single number, but as we will see in the future, objects can refer to non-numeric characters, lists, and entire data sets.
1.3 Packages
After you have downloaded R, your computer will have all the information it needs to interpret commands written in the R language, and the initial installation also includes some functions and datasets. However, one of the nice things about R is that people can create “packages” that are bundles of code, functions, or datasets. This allows us to do different types of things that we wouldn’t be able to do with the base package. To use a package, you have to do the following:
- Install the package: this will download the package and you will have it on your computer. You only have to do this once (unless you get a new computer!).
- Load the package: this will load the package into your workspace. You have to do this every time you open R (because every time you close R, it cleans up everything in your workspace)
A package we will be using frequently in this course is tidyverse.
Install (you only have to do this once, and you must be connected to the internet):
Load (you have to do this every time you open R)
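The two steps above can be sketched as follows. install.packages() and library() are the standard R functions for installing and loading. The requireNamespace() check is an addition of this sketch (not part of the steps described above) so that the download is skipped when tidyverse is already installed.

```r
# Install: needed only once per computer, and requires an internet connection.
# The check below skips the download if tidyverse is already installed.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load: needed at the start of every R session
library(tidyverse)
```

Note that the package name is in quotes when installing but not when loading.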
1.4 R scripts
As you have seen above, commands need to be entered into the console to run them. However, if you type your commands directly into the console as you have done above, this means that you have to start from scratch if you want to reproduce what you’ve done another time. This creates a lot of extra work!
To make things easier for yourself, and to make sure your work is documented, you can create an R script. An R script is simply a fancy name for a plain text file that contains a list of all of the commands you want to run, in order. The filename has the extension .R. This is just so your computer knows to open it in RStudio; it’s really just a plain text file. An R script will allow you to save the code you write, so you can have it for future reference, and to have a reproducible record of what you have done.
You can open the R script and view it in RStudio, then simply run the commands you have saved by sending them to the console. I highly recommend always saving your commands in an R script!
You can create your own empty R script or open an existing R script.
- Create a new R script in RStudio by going to File > New File > R Script.
- Open an existing R script by double-clicking on it (it should open in RStudio by default) or by going to File > Open File.
As an example, download and then try to open the R script corresponding to the first lab in this course: lab01.R.
Once you have your script open in RStudio, you can run the commands in the console in one of two ways. Note that you can run multiple lines/commands at once, which can save a lot of time!
- Copy-paste the relevant line of code into the console.
- Place your cursor on the line of code (or select it) and then use the keyboard shortcut CMD/CTRL+Enter, which will run it in the console.
R scripts should contain ONLY two things:
- R commands. These should be commands that can be run in the console
- Comments. If you want to include notes that are not commands, you can include them as “comments” by putting the hashtag symbol # at the beginning of the line. This is helpful for writing notes for humans (yourself or others!). R ignores everything on a line after this symbol, so if you run a comment line in the console, it won’t do anything (if you try to run notes in the console without a hashtag, R will interpret them as a command and give you a red error message).
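As an illustration, a very small script that contains only commands and comments might look like the sketch below. The file name, the question, and the object names are all hypothetical.

```r
# lab_example.R (hypothetical file name)
# Lines that start with # are comments: notes for humans that R skips.

# Question 1: store the number of pets and add them up
cats <- 3
dogs <- 2
total_pets <- cats + dogs   # a comment can also follow a command on the same line

# Print the answer
total_pets
```

Running the whole script from top to bottom should produce no errors, which is the check recommended for scripts you turn in.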
Most of your assignments in this course will be R scripts. Usually, I will create a template script, and you will modify it to include your answers. You will then save your version of the script for me to evaluate.
The scripts that you turn in for this course should include only the commands and comments that are necessary to answer the questions. I should be able to run your script from start to finish and get the same results you do. You can check this by selecting the whole script and clicking Run (or pressing CTRL/CMD+Enter), then making sure that there are no errors and that you are getting the answers you want. This is a good idea even when you are writing the script only for yourself.
1.5 Data structures
1.5.1 Vectors
There are different types of data structures in R. The most fundamental is probably the vector. You can think of a vector as a list of values. You can create a vector from any list of elements using the c() function (the ‘c’ stands for ‘combine’: R’s help page for c() is titled “Combine Values into a Vector or List”). The data below comes from Ethnologue.
language.families <- c("Afro-Asiatic","Austronesian","Indo-European","Niger-Congo","Sino-Tibetan","Trans-New Guinea","Other")
number.languages <- c(456, 1225, 447, 1536, 456, 476, 2641)
number.speakers.million <- c(596, 326, 3300, 600, 1400, 4, 1070)
Note that when you look at these objects by calling them in the console, they look different: the elements of a character vector are shown in quotes, while numbers are not.
language.families
## [1] "Afro-Asiatic" "Austronesian" "Indo-European" "Niger-Congo" "Sino-Tibetan" "Trans-New Guinea"
## [7] "Other"
number.languages
## [1] 456 1225 447 1536 456 476 2641
1.5.1.1 Data classes
R distinguishes between different ‘classes’ or types of data. Some classes of data are:
- numeric (numbers)
- factor or character: these are both also known as “strings,” and are interpreted as non-numerical characters.
  - R reads in non-numerical data as character by default, but we will generally be working with factors, so we will often need to explicitly tell R to interpret a column as a factor.
  - Don’t worry for now about the difference between these two types.
- logical (TRUE or FALSE)
You can see the class of data with the function class().
This will become important later, but for now here are some things to remember:
- R will always make an assumption about what class of data it’s working with.
- All items in a vector must be of the same class. If you try to create a vector with multiple classes, R will coerce everything into a single class.
- R can only do math with numbers, not characters.
- Strings (factor/character) must always be typed in quotes. Single and double quotes both work, but double quotes are the usual convention.
Look at the examples below. What do you expect the resulting class to be? Examine each vector (by typing the object name in the console), as well as its class.
A note on NA: NA stands for ‘not available’ and marks a missing value, like an empty cell in a table. NA is written without quotes, and is special because it can sit in a vector of any class without changing that class, so it can coexist with numbers. “NA” in quotes will be treated as a character.
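The vectors below are a sketch of the kind of examples you can use to test these rules. All object names and values are made up for illustration. Try to predict the class of each vector before you run class() on it.

```r
# One class per vector
nums  <- c(1, 2, 3)
words <- c("one", "two", "three")
flags <- c(TRUE, FALSE, TRUE)
class(nums)
class(words)
class(flags)

# Mixing classes: R coerces every element to a single class
mixed <- c(1, "two", 3)
mixed
class(mixed)

# A factor stores text as a fixed set of categories (levels)
fam <- factor(c("Other", "Other", "Austronesian"))
class(fam)
levels(fam)

# NA without quotes is a missing value; the vector keeps its class
with_na <- c(1, NA, 3)
class(with_na)
mean(with_na, na.rm = TRUE)   # na.rm = TRUE tells mean() to skip the NA

# "NA" in quotes is ordinary text, so the numbers get coerced
with_quoted_na <- c(1, "NA", 3)
class(with_quoted_na)
```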
1.5.2 Data frames
The data frame is probably the most useful structure for data analysis. You can think of it like a spreadsheet: a table of rows and columns. It’s possible to create your own data frames in R as shown below. Note that each column is a vector.
df <- data.frame(
  family = factor(c("Afro-Asiatic", "Austronesian", "Indo-European", "Niger-Congo", "Sino-Tibetan", "Trans-New Guinea", "Other")),
  num.langs = c(456, 1225, 447, 1536, 456, 476, 2641),
  num.speakers = c(596, 326, 3300, 600, 1400, 4, 1070)
)
df
##             family num.langs num.speakers
## 1     Afro-Asiatic       456          596
## 2     Austronesian      1225          326
## 3    Indo-European       447         3300
## 4      Niger-Congo      1536          600
## 5     Sino-Tibetan       456         1400
## 6 Trans-New Guinea       476            4
## 7            Other      2641         1070
1.5.3 Reading in an existing dataset
Usually you will not be creating data frames in R. Instead, you will be reading in data that has been created in a different program. This data will usually be in the form of a text file, with columns separated either by commas or tabs (commas in the example below).
To read in a dataset, you need to tell R where to find it; you do this by providing the path, or the location of the file. Paths can be confusing, all the more so because they are different on Mac vs. PC! If you’re having trouble locating the path to a file on your computer, here is a helpful resource.
We will be looking at a dataset from Nettle (1999) that includes information about the geography, population, and number of languages spoken in different countries. This dataset allows us to look at the relationship between geographical factors and linguistic diversity. Note that this command is written on two lines, but is actually just a single command: the %>% or pipe operator at the end of the first line tells R that the command is not finished (more on this later). The second line tells R to interpret the text columns as factors, not characters.
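The pattern described here (read a comma-separated file, then convert its text columns to factors) can be sketched as below. This is not the course's actual command: so that the example runs on any computer, it first writes a tiny made-up CSV file to a temporary location. The column names imitate the Nettle data, but the rows, the object name toy, and the path toy_path are all invented.

```r
library(dplyr)   # provides %>%, mutate(), and across()

# Write a small invented dataset to a temporary file (stand-in for a real data file)
toy_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Country,Population,MGS,Langs,Continent",
  "Alpha,4.1,6.5,18,Africa",
  "Beta,3.6,2.0,42,Asia",
  "Gamma,5.2,9.1,7,Oceania"
), toy_path)

# One command written on two lines: read the file, then turn
# every character column into a factor
toy <- read.csv(toy_path) %>%
  mutate(across(where(is.character), as.factor))

class(toy$Country)   # a text column, now a factor
class(toy$Langs)     # a number column, left as it was
```

For your own data, you would replace toy_path with the path to the file on your computer, in quotes.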
1.5.4 Examining data
Once we’ve read in the dataset and saved it as an object, we’re ready to look at it! R makes it easy to do this in many different ways.
You can view the whole dataset with the command View(). Note that this will open up a new tab in the top left window in RStudio, so if you’re working from a script, you’ll have to navigate back to it when you’re done looking at the data.
You can summarize the columns of a dataset; this is a very useful feature. Note that for columns that are numeric, the summary provides information about the range and average for the column; for columns that are strings, it does not (which makes sense, because R can only do math on numbers!).
You can also look at the top and bottom lines of a dataset with the functions head() and tail().
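As a quick sketch of these commands, the example below uses a small invented data frame called toy (not the Nettle data) so that it can be run on its own.

```r
# A made-up data frame with one factor column and one numeric column
toy <- data.frame(
  Country = factor(c("Alpha", "Beta", "Gamma", "Delta",
                     "Epsilon", "Zeta", "Eta", "Theta")),
  Langs = c(18, 42, 7, 210, 95, 3, 60, 130)
)

summary(toy)       # counts for the factor column; range and mean/median for the numeric one
head(toy)          # the first six rows (the default)
tail(toy, n = 3)   # the last three rows

# View(toy)        # opens a spreadsheet-style tab; needs an interactive session such as RStudio
```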
You can access specific rows or columns of data by providing the [row, column] number. You can also access columns with the $ sign followed by the name of the column.
nettle[2,] #gives the second row of the object 'nettle'
nettle[,5] #gives the fifth column
nettle$Langs #gives the column named Langs
Looking at the data in these ways allows us to get a sense of the data. For example, now we can see that the nettle dataset consists of 6 columns with countries and their corresponding population, geographical area, MGS (which stands for ‘mean growing season,’ or the average number of months crops can be grown in the country), (number of) languages, and continent. From the summary() function, we can also see summary statistics, or the range and mean/median, for any numeric columns. For example, we can already see that the number of languages in the countries in this dataset ranges from 1 to 862, with the mean being 89.73 languages.
You can see the numbers of rows and columns in a dataset as follows:
nrow(nettle)
## [1] 74
ncol(nettle)
## [1] 6
You can see the levels of a factor as follows:
levels(nettle$Continent)
## [1] "Africa" "Asia" "North America" "Oceania" "South America"
1.6 Manipulating dataframes using tidyverse
Note: for additional reading and practice with the concepts in this section, I highly recommend reading and doing the practice exercises in the freely available R for Data Science, Chapter 2, Sections 4.1-4.4.
Looking at the data is fun, but usually we want to extract certain information from the dataset instead of looking at the whole thing. We’ve seen how to extract specific rows and columns based on their indices above, but more often we’ll want to sort or filter data by certain criteria, just like you might do in Excel.
In this class, we’ll use the package tidyverse for data manipulation and making graphs. You will need to make sure this package is installed and loaded for pretty much everything you do.
1.6.1 Select specific columns
You can use the select() function to select certain columns of a dataframe. select() takes two arguments: the first is the name of the dataframe, and the second is a vector of the names of the columns you would like to select. Alternatively, you can exclude certain columns by putting - in front of the column name. See what happens for the following. Note that in the examples below, no new objects are being created or saved - you’re just viewing the result of what you have asked R to do.
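A minimal sketch of select(), using a small invented data frame (toy) in place of the Nettle data. The column names imitate that dataset, but the values are made up.

```r
library(dplyr)   # select() comes from dplyr, which is part of tidyverse

toy <- data.frame(
  Country = c("Alpha", "Beta", "Gamma"),
  Population = c(4.1, 3.6, 5.2),
  Langs = c(18, 42, 7)
)

# Keep only two columns, given as a vector of column names
select(toy, c(Country, Langs))

# Exclude one column and keep everything else
select(toy, -Population)

# Nothing was saved: toy still has all three columns
names(toy)
```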
1.6.2 Filter
To create a subset of a dataframe, including only rows that meet certain specifications, you can use the filter() function. The following operators will be useful:
- is equal to: ==
- is not equal to: != (the exclamation point typically means NOT)
- less than: <
- greater than: >
- less than or equal to: <=
- greater than or equal to: >=
You can also combine operators using the following:
- and: &
- or: |
Let’s say we wanted to look at only the countries that have a mean growing season (MGS) of less than 2 months, or countries that are reported to have between 100 and 200 languages, or only the countries in South America:
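The three filters just described can be sketched as below, again on a small invented data frame (toy) rather than the real data. Treating “between 100 and 200” as including both endpoints is an assumption of this sketch.

```r
library(dplyr)   # filter() comes from dplyr, which is part of tidyverse

toy <- data.frame(
  Country = c("Alpha", "Beta", "Gamma", "Delta"),
  MGS = c(1.5, 6.0, 0.8, 9.2),
  Langs = c(18, 150, 7, 210),
  Continent = c("Africa", "Asia", "South America", "South America")
)

# Countries with a mean growing season of less than 2 months
filter(toy, MGS < 2)

# Countries with between 100 and 200 languages (both conditions must hold)
filter(toy, Langs >= 100 & Langs <= 200)

# Countries in South America (note the double == and the quotes)
filter(toy, Continent == "South America")

# Either condition is enough: short growing season OR more than 200 languages
filter(toy, MGS < 2 | Langs > 200)
```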
1.6.3 Pipe notation
In this course, we will use what is called pipe notation. The pipe symbol is %>% or |>. A pipe is placed at the end of the line and tells R: “whatever object was calculated before, carry that forward to the next operation.” By default, that object becomes the first argument of the next function; with %>%, you can also use the symbol . to refer to it explicitly. We can use this to do multiple operations without creating a bunch of new objects. Note that the two following sequences of commands do the same thing.
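To make the equivalence concrete, here is a small sketch on a made-up numeric vector (langs). The first sequence saves an intermediate object; the others pass the result along with a pipe. The |> form needs R 4.1 or later.

```r
library(dplyr)   # makes the %>% pipe available

langs <- c(18, 42, 7, 210)   # invented values

# Sequence 1: save the intermediate result, then use it
langs_sorted <- sort(langs)
head(langs_sorted, 2)

# Sequence 2: the same steps with pipes and no intermediate object
langs %>%
  sort() %>%
  head(2)

# The same steps with R's built-in pipe
langs |>
  sort() |>
  head(2)

# With %>%, the . symbol stands for the object being carried forward
langs %>% head(x = ., n = 2)
```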
At first glance, it might look like the piping method is actually more complicated. However, having many different objects leaves more room for error, not to mention a messier workspace. Let’s say we want to get a summary of a specific subset of the nettle data: one that includes only countries with more than 200 languages, and just focuses on the population and languages of those countries. One way to do this is to do one step at a time, creating a new object for each.
But alternatively, you can start from the original, use pipes for each step, and see the same output, without creating any new objects.
##       Country    Population        Langs      
##  Australia:1   Min.   :3.580   Min.   :209.0  
##  Brazil   :1   1st Qu.:4.240   1st Qu.:234.0  
##  Cameroon :1   Median :4.940   Median :275.0  
##  India    :1   Mean   :4.761   Mean   :397.2  
##  Indonesia:1   3rd Qu.:5.190   3rd Qu.:427.0  
##  Mexico   :1   Max.   :5.930   Max.   :862.0  
##  (Other)  :3                                  
The output you see here will give the relevant information, so you can see the range and mean/median for the population and number of languages spoken in countries with over 200 languages spoken.
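The two approaches described above can be compared side by side in the sketch below. It uses a small invented data frame (toy) with column names that imitate the Nettle data, so its output will not match the summary shown above. The intermediate object names are hypothetical.

```r
library(dplyr)

toy <- data.frame(
  Country = c("Alpha", "Beta", "Gamma", "Delta", "Epsilon"),
  Population = c(4.1, 5.3, 3.6, 4.9, 5.2),
  MGS = c(6.5, 2.0, 9.1, 11.0, 7.4),
  Langs = c(18, 250, 7, 410, 95)
)

# Approach 1: one step at a time, with a new object for each step
toy_many_langs <- filter(toy, Langs > 200)
toy_many_langs_cols <- select(toy_many_langs, Country, Population, Langs)
summary(toy_many_langs_cols)

# Approach 2: start from the original and pipe each step into the next
toy %>%
  filter(Langs > 200) %>%
  select(Country, Population, Langs) %>%
  summary()
```

Both approaches print the same summary, but the second leaves no extra objects behind in the Environment.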
1.6.4 Summarizing values
We often want to calculate the average value for a variable across different categories of another variable (for example, the average number of languages spoken across countries in each continent separately). One way to do this is to create separate subsets of the data for each continent, and then find the average number of languages for each, as shown above. However, the summarize() function allows us to find the average for all continents with a single (piped) command.
First, the group_by(X) command tells R to split the data into different groups, based on the value of column X. This doesn’t actually do anything visible to the data; it just creates some invisible structure. Then, the summarize() function allows you to create a new column that holds the result of an operation performed over each of these subsets of data.
In the example below, the group_by(Continent) command tells R to (invisibly) group the data by Continent. Then, the summarize() command tells it to perform an operation on each level of the factor Continent. It will create a new column (called avgLangs), with the value of each row corresponding to the mean number of languages across all countries in that continent. Note that the other columns (like Population) no longer exist, because we have not asked R to do anything with those columns.
## # A tibble: 5 × 2
##   Continent     avgLangs
##   <fct>            <dbl>
## 1 Africa            67.9
## 2 Asia             124. 
## 3 North America     55.2
## 4 Oceania          318. 
## 5 South America     50.1
If we also want to get the average population by continent, we can simply add that to the summarize() command.
## # A tibble: 5 × 3
##   Continent     avgLangs avgPop
##   <fct>            <dbl>  <dbl>
## 1 Africa            67.9   3.97
## 2 Asia             124.    4.39
## 3 North America     55.2   3.91
## 4 Oceania          318.    3.39
## 5 South America     50.1   3.75
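To see exactly what group_by() and summarize() compute, it can help to try them on data small enough to check by hand. The sketch below uses an invented data frame (toy) with round numbers; the column names avgLangs and avgPop follow the output above.

```r
library(dplyr)

toy <- data.frame(
  Continent = factor(c("Africa", "Africa", "Asia", "Asia", "Asia")),
  Langs = c(10, 30, 100, 200, 300),
  Population = c(3, 5, 4, 4, 7)
)

by_continent <- toy %>%
  group_by(Continent) %>%
  summarize(avgLangs = mean(Langs),        # one mean per continent
            avgPop = mean(Population))     # a second summary column

by_continent
# Africa: avgLangs = (10 + 30) / 2 = 20,         avgPop = (3 + 5) / 2 = 4
# Asia:   avgLangs = (100 + 200 + 300) / 3 = 200, avgPop = (4 + 4 + 7) / 3 = 5
```

The result has one row per continent and only the grouping column plus the columns created inside summarize().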
1.7 Functions
Functions are shortcuts for operations. They are of the form function_name(x, y, z) where function_name is the name of the function and the value(s) in parentheses are called arguments. You’ve already used functions: for example, the built-in mean(x) function takes the mean value of its argument. You can also build your own custom functions to help do tasks more efficiently.
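As a small sketch of both ideas, the example below calls the built-in mean() and then defines a custom function. The function range_width() and the vector langs are hypothetical names invented for this illustration.

```r
langs <- c(18, 42, 7, 210)   # made-up values

# A built-in function with one argument
mean(langs)

# A custom function: the code between { } runs whenever the function is called,
# and the value of the last line is returned
range_width <- function(x) {
  max(x) - min(x)
}

# Call it just like a built-in function
range_width(langs)
```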
1.7.1 Example: generating random numbers
The function call rbinom(n, 1, 0.5) will generate a random series of 1s and 0s. You can think of it as a series of coin flips, with 1 as heads and 0 as tails (the name of the function is an abbreviation for ‘random binomial,’ since each value is drawn at random from a binomial distribution, here one with a single trial, so the only possible outcomes are 0 and 1). The function has 3 arguments in parentheses. The first (n) is how many coin flips there will be; the second is the number of trials per flip and the third is the probability of heads (don’t worry about these 2 for now; they will always be the same).
Note that in the output in the console, you’ll see numbers in [brackets] at the beginning of each line. These are NOT part of the series, they are just showing you the ‘index’ or the number of the item in the list (so [57] indicates that the first value on that line is the 57th item in the list). Note that in the examples below, showing random draws of 5 and 100 numbers, respectively, the output is also shown, as indicated by the ## at the beginning.
rbinom(5, 1, 0.5)
## [1] 0 0 0 0 1
rbinom(100, 1, 0.5)
## [1] 1 0 1 0 0 0 1 0 0 1 1 1 1 1 0 0 1 1 1 0 0 0 0 1 0 1 1 1 1 0 0 0 0 0 1 0 0 0 0 1 1 1 1 0 1 0 0 1 1 0 1 1 0 0 0 1
## [57] 1 1 1 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0 1 1 0 1 1 1 0 1 0 1 1 1 0 1 0 0 1 0 1 1 0 1 0 0 1 1
We can also combine functions. For example, if we want to find the proportion of heads in our n coin flips, we can take the mean of the series: because heads is 1 and tails is 0, the mean is the share of flips that came up heads. Note that we’re assigning each series to an object, then finding the mean of that object.
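A sketch of this, assuming series of 5 and 100 flips. The object names are hypothetical, and set.seed() is an addition of this sketch: it makes the ‘random’ draws repeat exactly each time the code is run, which is useful when you want reproducible results.

```r
set.seed(1)   # optional: fixes the random draws so the results can be reproduced

# Assign each series of coin flips to an object
flips_5   <- rbinom(5, 1, 0.5)
flips_100 <- rbinom(100, 1, 0.5)

# The mean of a series of 0s and 1s is the proportion of 1s (heads)
mean(flips_5)
mean(flips_100)

# The two steps can also be combined into a single nested call
mean(rbinom(100, 1, 0.5))
```

With only 5 flips the proportion can be far from 0.5; with more flips it tends to get closer.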
1.8 Troubleshooting
It is very common to run into errors, and these can take a long time to sort out, often only to find that the error comes down to a simple typo! This is very frustrating, but it’s a part of using R (or any programming language). The only way to get better at it is to practice debugging code. It will get easier with experience, though it never goes away completely: it happens to experienced coders as well.
There are many online forums with discussion and answers to common questions, including probably the largest, Stack Overflow.
In addition, this page provides an overview of some common errors for R specifically.