Chapter 1: The Basics of the Language
“Success is neither magical nor mysterious. Success is the natural consequence of consistently applying the basic fundamentals.” — Jim Rohn
R is open source statistical software made by statisticians. This means it generally speaks the language of statistics. This is very helpful when it comes to running analyses but can be confusing when you are starting to understand the code.
The best way to begin to learn it is by jumping right in. This chapter will provide the background and foundation to start using R right away. This background revolves around data and functions, two types of objects.
R is built on a few different types of virtual objects. An object, just like in the physical world, is something you can do things with. In the real world, we have objects that we use regularly. For example, we have chairs. Chairs are great for some things (sitting on, sleeping on, enjoying the beach) and horrible at others (playing basketball, flying, floating in the ocean). Similarly, in R each type of object is useful for certain things. The data types that we will discuss below are certain types of objects.
Because this is so analogous to the real world, it becomes quite natural to work with. You can have many objects in the computer’s memory, which allows flexibility in analyzing many different things within a single R session.[3] The main object types that you’ll work with are presented in the following table.
| Object | Description |
|---|---|
| Vector | A single column of data (‘a variable’) |
| Matrix | An n x p table of numeric data |
| Data Frame | An n x p table of data (numeric and/or categorical) |
| Function | Takes input and produces output; the workhorse of R |
| Operator | A special type of function (e.g., the assignment operator <-) |
Each of these objects will be introduced in this chapter, highlighting their definition and use. For your work, the first objects you work with will be data in various forms. Below, we explain the different data types and how they can combine into what is known as a data frame.
Early Advice: Don’t get overwhelmed. It may feel like there is a lot to learn, but taking things one at a time will work surprisingly quickly. I’ve designed this book to discuss what you need to know from the beginning. Other topics that are not discussed are things you can learn later and do not need to be of your immediate concern.
To begin understanding data in R, you must know about vectors. Vectors are, in essence, a single column of data: a variable. In R there are three main vector data types (variable types) that you’ll work with in research: numeric, factor, and character.
The first, numeric, is just that: numbers. In R, you can make a numeric variable with the code below:
x <- c(10.1, 2.1, 4.6, 2.3, 8.9)
c() is a function[4] that stands for “concatenate”, which basically glues the values inside the parentheses together into one vector. We use <- to put it into x. So in this case, x (which we could have named anything) is saving those values so we can work with them.[5] If we ran this code, the x object would be in the working memory of R and would stay there unless we remove it or until the end of the R session (i.e., we close R).
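As a small sketch of what "being in working memory" means, the code below reuses x after creating it and then removes it. The functions length(), mean(), and rm() are base R functions that the text has not introduced, so treat this as an illustration rather than part of the chapter's method.

```r
# Create the numeric vector from the text
x <- c(10.1, 2.1, 4.6, 2.3, 8.9)

# x now lives in R's working memory, so later lines can reuse it
n_values <- length(x)   # how many values x holds
avg <- mean(x)          # the arithmetic mean of those values

print(n_values)
print(avg)

# Remove x from memory; after this line, x no longer exists in the session
rm(x)
```

Closing R has the same effect as rm() for every object at once, which is why the code that creates your objects is normally kept in a script.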
A factor variable is a categorical variable (i.e., only a limited number of options exist). For example, race/ethnicity is a factor variable.
race <- c(1, 3, 2, 1, 1, 2, 1, 3, 4, 2)
The code above actually produces a numeric vector (since it was only given numbers). We can quickly tell R that it is indeed supposed to be a factor.
race <- factor(race, labels = c("white", "black", "hispanic", "asian"))
The factor() function tells R that the first thing, race, is actually a factor. The additional argument, labels, tells R what each of the values means. If we print out race we see that R has replaced the numeric values with the labels.
##  [1] white    hispanic black    white    white    black    white
##  [8] hispanic asian    black
## Levels: white black hispanic asian
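To see what the factor now contains, the sketch below inspects it with levels(), table(), and as.integer(). These are base R functions that the chapter has not introduced, and the names race_levels, race_counts, and race_codes are made up for illustration.

```r
race <- c(1, 3, 2, 1, 1, 2, 1, 3, 4, 2)
race <- factor(race, labels = c("white", "black", "hispanic", "asian"))

# The categories (levels) R now knows about, in label order
race_levels <- levels(race)

# How many observations fall in each category
race_counts <- table(race)

# The original integer codes are still stored behind the labels
race_codes <- as.integer(race)

print(race_levels)
print(race_counts)
print(race_codes)
```

The labels are matched to the sorted unique values, so 1 becomes "white", 2 becomes "black", and so on.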
Finally, and maybe less relevantly, there are character variables. These are words (known as strings). In research this is often where subjects give open responses to a question.
ch <- c("I think this is great.", "I would suggest you learn R.", "You seem quite smart.")
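A brief sketch of checking a character vector follows. It uses class(), which appears later in the chapter, and nchar(), a base R function not covered in the text; the names ch_class and ch_lengths are illustrative.

```r
ch <- c("I think this is great.",
        "I would suggest you learn R.",
        "You seem quite smart.")

ch_class <- class(ch)     # how R stores the vector
ch_lengths <- nchar(ch)   # number of characters in each response

print(ch_class)
print(length(ch))         # number of responses
print(ch_lengths)
```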
When we combine multiple variables into one, we create a data.frame. A data frame is like a spreadsheet table, like the ones you have probably seen in Microsoft’s Excel and IBM’s SPSS. Here’s a simple example:
df <- data.frame("A"=c(1,2,1,4,3), "B"=c(1.4,2.1,4.6,2.0,8.2), "C"=c(0,0,1,1,1))
df

##   A   B C
## 1 1 1.4 0
## 2 2 2.1 0
## 3 1 4.6 1
## 4 4 2.0 1
## 5 3 8.2 1
We can do quite a bit with the data.frame that we called df.[6] Once again, we could have called this data frame anything, although I recommend short names. If “A” and “C” are factors we can tell R so with the following code:
df$A <- factor(df$A, labels = c("level1", "level2", "level3", "level4"))
df$C <- factor(df$C, labels = c("Male", "Female"))
In the above code, the $ reaches into df to grab a variable (or column). The following code does the exact same thing:
df[["A"]] <- factor(df[["A"]], labels = c("level1", "level2", "level3", "level4"))
df[["C"]] <- factor(df[["C"]], labels = c("Male", "Female"))
and so does the following:
df[, "A"] <- factor(df[, "A"], labels = c("level1", "level2", "level3", "level4"))
df[, "C"] <- factor(df[, "C"], labels = c("Male", "Female"))
df[["A"]] grabs the A variable just like df$A. The last example shows that we can grab both columns and rows. In df[, "C"] there is an empty spot just ahead of the comma. It works like this: df[rows, columns]. Leaving the rows spot empty grabs every row. So we could specifically grab certain rows and certain columns.
df[1:3, "A"]
df[1:3, 1]
Both lines of the above code grab rows 1 through 3 and column “A”.
Finally, we can use the c() function to grab different rows and columns. To grab rows 1 and 5 and columns “B” and “C” you can do the following:
df[c(1,5), c("B", "C")]
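The sketch below checks these subsetting forms against each other. It rebuilds df from scratch (before any columns are turned into factors) so it runs on its own; identical() is a base R function not covered in the text, and the names b_dollar, b_double, b_bracket, and sub are made up for illustration.

```r
df <- data.frame("A" = c(1, 2, 1, 4, 3),
                 "B" = c(1.4, 2.1, 4.6, 2.0, 8.2),
                 "C" = c(0, 0, 1, 1, 1))

# Three equivalent ways to pull out the single column B
b_dollar  <- df$B
b_double  <- df[["B"]]
b_bracket <- df[, "B"]

print(identical(b_dollar, b_double))
print(identical(b_dollar, b_bracket))

# Rows 1 and 5, columns B and C: the result is itself a small data frame
sub <- df[c(1, 5), c("B", "C")]
print(sub)
```

Grabbing one column returns a vector, while grabbing several columns returns a data frame that keeps the original row numbers.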
We may also want to get more information about the data frame before we do any subsetting. There are a few nice functions to get information that can help us know what we should do next with our data.
## Get the variable names
names(df)

## [1] "A" "B" "C"

## Know what type of variable it is
class(df$A)

## [1] "factor"

## Get quick summary statistics for each variable
summary(df)

##       A           B             C
##  level1:2   Min.   :1.40   Male  :2
##  level2:1   1st Qu.:2.00   Female:3
##  level3:1   Median :2.10
##  level4:1   Mean   :3.66
##             3rd Qu.:4.60
##             Max.   :8.20

## Get the first 10 rows of your data
head(df, n=10)

##        A   B      C
## 1 level1 1.4   Male
## 2 level2 2.1   Male
## 3 level1 4.6 Female
## 4 level4 2.0 Female
## 5 level3 8.2 Female
We admit that the last one is a bit pointless since our data frame is only a few lines long. However, these functions can give you quick information about your data with hardly any effort on your part.
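Two more base R functions in the same spirit are dim() and str(). Neither is covered in the text, so this is only a sketch of how they could complement the functions above; the data frame is rebuilt here so the example runs on its own, and df_size is an illustrative name.

```r
df <- data.frame(
  A = factor(c(1, 2, 1, 4, 3),
             labels = c("level1", "level2", "level3", "level4")),
  B = c(1.4, 2.1, 4.6, 2.0, 8.2),
  C = factor(c(0, 0, 1, 1, 1), labels = c("Male", "Female"))
)

# Number of rows, then number of columns
df_size <- dim(df)
print(df_size)

# Compact overview: each variable's type and its first few values
str(df)
```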
Earlier we mentioned that c() was a “function.” Functions are how we do things with our data. There are probably hundreds of thousands of functions within your reach. In fact, you can create your own! We’ll discuss that more in later chapters.
For now, know that each named function has a name (the function name of c() is “c”), arguments, and output of some sort. Arguments are the information that you provide the function between the parentheses (e.g., we gave c() a bunch of numbers; we gave factor() two arguments: the variable and the labels for the variable’s levels). Output from a function varies to a massive degree but, in general, the output is what you are using the function for (e.g., for c() we wanted to create a vector, a variable, of data).
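To make the name, arguments, and output pieces concrete, here is a minimal sketch using mean(), a base R function the chapter has not introduced. The vector y and the result names are made up for illustration.

```r
# A vector with one missing value (NA)
y <- c(10.1, 2.1, NA, 2.3, 8.9)

# Name: mean. One argument: the data. Output: a single number (here NA,
# because one value is missing).
avg_default <- mean(y)

# Same function, one extra argument: na.rm = TRUE drops missing values first
avg_no_missing <- mean(y, na.rm = TRUE)

# Arguments given by name can appear in any order
avg_reordered <- mean(na.rm = TRUE, x = y)

print(avg_default)
print(avg_no_missing)
print(avg_reordered)
```

Changing the arguments changes the output while the function name stays the same, which is the pattern to look for when reading unfamiliar R code.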
At any point, by typing:
?functionname
we get information in the “Help” window. It provides information on how to use the function, including arguments, output, and examples.
After a quick note about operators, you will be shown several functions for both importing and saving data. Note that each has a name, arguments, and output.
A special type of function is called an operator. These take two inputs, a left hand side and a right hand side, and output some value. A very common operator is <-, known as the assignment operator. It takes what is on the right hand side and assigns it to the left hand side. We saw this earlier with vectors and data frames. Other operators exist, a few of which we will introduce in the following chapter. But again, an operator is just a special function.
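One way to convince yourself that an operator is just a function is to call it like one, by wrapping it in backticks. The sketch below does this with + (an operator not discussed in the text) and with <-; the object names are illustrative.

```r
# The usual way: left hand side, operator, right hand side
total <- 2 + 3

# The same operator called like any other function
total_fn <- `+`(2, 3)

# Assignment works the same way; this line is equivalent to y <- 10
`<-`(y, 10)

print(identical(total, total_fn))
print(y)
```

You would rarely write code this way, but it shows that the two inputs of an operator are simply its two arguments.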
Most of the time you’ll want to import data into R rather than manually entering it line by line, variable by variable.
There are some built-in ways to import many delimited[7] data types (e.g., comma delimited, also called a CSV; tab delimited; space delimited). Other packages[8] have been developed to help with this as well.
R Data File
The first, if it is an R data file in the form .RData, simply use the load() function with the name of the file. Note that you don’t assign this to a name such as df. Instead, it loads whatever R objects were saved to it.
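The following sketch shows a full round trip with an .RData file. It relies on save(), tempfile(), and unlink(), base R functions that the text does not cover, and the temporary path stands in for whatever file you actually have.

```r
# A small data frame to save
df <- data.frame(A = c(1, 2, 1), B = c(1.4, 2.1, 4.6))

# Hypothetical location: a temporary file ending in .RData
path <- tempfile(fileext = ".RData")
save(df, file = path)

# Remove the data frame so we can see load() bring it back
rm(df)

# No assignment here: load() restores the object under its original name
load(path)
print(df)

# Clean up the temporary file
unlink(path)
```

Because load() restores objects under the names they were saved with, it can silently overwrite an object of the same name that is already in memory.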
Most delimited files are saved with extensions such as .csv, .txt, or .dat. As long as you know the delimiter, this process is easy.
df <- read.table("file.csv", sep = ",", header=TRUE)  ## for csv
df <- read.table("file.txt", sep = "\t", header=TRUE) ## for tab delimited
df <- read.table("file.txt", sep = " ", header=TRUE)  ## for space delimited
The argument sep tells the function what kind of delimiter the data has, and header tells R if the first row contains the variable names (you can change it to FALSE if it doesn’t).
Note that at the end of the lines you see that I left a comment using #. I used two for stylistic purposes but only one is necessary. Anything after a # is not read by the computer; it’s just for us humans.
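To try read.table() without a data file on hand, the sketch below first writes a tiny comma delimited file to a temporary location and then reads it back with and without header = TRUE. The file contents, the column names, and the use of writeLines(), tempfile(), and unlink() are assumptions made for illustration.

```r
# Write a small CSV to a temporary file (hypothetical data)
path <- tempfile(fileext = ".csv")
writeLines(c("id,score,group",
             "1,1.4,a",
             "2,2.1,b",
             "3,4.6,a"),
           path)

# header = TRUE: the first row supplies the variable names
df <- read.table(path, sep = ",", header = TRUE)
print(names(df))
print(nrow(df))

# header = FALSE: the first row is treated as data and R invents names
no_header <- read.table(path, sep = ",", header = FALSE)
print(names(no_header))
print(nrow(no_header))

unlink(path)  # clean up
```

Getting header wrong is easy to spot: the variable names show up as the first row of data and every column is read as text.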
Heads up! Note that unless you are using the load function, you need to assign what is being read in to a name. In the examples, all were called df. In real life, you won’t run a bunch of different read functions to the same name because only the last one run would be saved (the others would be written over). However, if you have multiple data files to import you can assign them to different names and later merge them. Merging, also called joining, is something we’ll discuss in the next chapter.
Other Data Formats
Data from other statistical software such as SAS, SPSS, or Stata are also easy to get into R. We will use two powerful packages: haven and foreign. To install a package, simply run install.packages() with the package name in quotes, for example install.packages("haven"). This only needs to be run once on a computer. Then, to use it in a single R session (i.e., from when you open R to when you close it) run library() with the package name, as in the code below.
Using these packages, I will show you simple ways to bring your data in from other formats.
library(haven)
df <- read_dta("file.dta")      ## for Stata data
df <- read_spss("file.sav")     ## for SPSS data
df <- read_sas("file.sas7bdat") ## for this type of SAS file
library(foreign)
df <- read.xport("file.xpt")    ## for export SAS files
Finally, there are many ways to save data. Most of the read... functions have a corresponding write... function.
write.table(df, file="file.csv", sep = ",") ## to create a CSV data file
R automatically saves missing data as NA since that is what it is in R. But often when we write a CSV file, we might want it as blank or some other value. If that’s the case, we can add another argument na = " " after the sep argument.
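The sketch below writes a small data frame that contains missing values and then looks at the raw lines of the resulting file. It makes two choices that go beyond the text: na = "" (an empty string, so missing cells are truly blank rather than a single space) and row.names = FALSE (so R's row numbers are not written as an extra column). The data and the temporary path are made up for illustration.

```r
# A data frame with one missing value in each column
df <- data.frame(A = c(1, NA, 3), B = c("x", "y", NA))

# Hypothetical output location
path <- tempfile(fileext = ".csv")

# Write a CSV with blanks for missing values and no row-number column
write.table(df, file = path, sep = ",", na = "", row.names = FALSE)

# Look at the file exactly as it was written
written <- readLines(path)
print(written)

unlink(path)  # clean up
```

If another program will read the file, check which representation of missing data it expects before choosing the na value.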
Again, if you ever have questions about the specific arguments that a certain function has, you can simply run ?functionname. So, if you were curious about the different arguments in write.table simply run ?write.table. In the pane with the files, plots, packages, etc. a document will show up to give you more information.
R is designed to be flexible and do just about anything with data that you’ll need to do as a researcher. With this chapter under your belt, you can now read basic R code and import and save your data. The next chapter will introduce the “tidyverse” of methods that can help you join, reshape, summarize, group, and much more.
[3] An R session is any time you open R, do work, and then close R. Unless you are saving your workspace (which, in general, you shouldn’t do), each session starts with a clean slate: no objects are in memory and no packages are loaded. This is why we use scripts. Also, it makes your workflow extra transparent and replicable.
[4] R is all about functions. Functions tell R what to do with the data. You’ll see many more examples throughout the book.
[5] This is a great feature of R. It is called “object oriented” which basically means R creates objects to work with. I discuss this more in 1.2. Also, the fact that I said x “saves” the information is not entirely true but is useful to think this way.
[6] I used this name since df is common in online help and other resources.
[7] The delimiter is what separates the pieces of data.
[8] A package is an extension to R that gives you more functions (abilities) to work with data. Anyone can write a package, although to get it on the Comprehensive R Archive Network (CRAN) it needs to be vetted to a large degree. In fact, after some practice, you could write a package to help you more easily do your work.