Chapter 1: The Basics of the Language

“Success is neither magical nor mysterious. Success is the natural consequence of consistently applying the basic fundamentals.” — Jim Rohn

R is open source statistical software made by statisticians. This means it generally speaks the language of statistics. This is very helpful when it comes to running analyses but can be confusing when you are starting to understand the code.

The best way to begin to learn it is by jumping right into it. This chapter will provide the background and foundation to start using R right away. This background revolves around data and functions—two types of objects.

Objects

R is built on a few different types of virtual objects. An object, just like in the physical world, is something you can do things with. In the real world, we have objects that we use regularly. For example, we have chairs. Chairs are great for some things (sitting on, sleeping on, enjoying the beach) and horrible at others (playing basketball, flying, floating in the ocean). Similarly, in R each type of object is useful for certain things. The data types that we will discuss below are certain types of objects.

[Figure: Chair]

Because this is so analogous to the real world, it becomes quite natural to work with. You can have many objects in the computer’s memory, which allows flexibility in analyzing many different things within a single R session (note 3). The main object types that you’ll work with are presented in the following table.

Object       Description
Vector       A single column of data (‘a variable’)
Matrix       An n x p table of numeric data
Data Frame   An n x p table of data (numeric and/or categorical)
Function     Takes input and produces output; the workhorse of R
Operator     A special type of function (e.g. <-)

Each of these objects will be introduced in this chapter, highlighting their definition and use. For your work, the first objects you work with will be data in various forms. Below, we explain the different data types and how they can combine into what is known as a data.frame.

Early Advice: Don’t get overwhelmed. It may feel like there is a lot to learn, but taking things one at a time will work surprisingly quickly. I’ve designed this book to discuss what you need to know from the beginning. Other topics that are not discussed are things you can learn later and do not need to be of your immediate concern.

Data Types

To begin understanding data in R, you must know about vectors. Vectors are, in essence, a single column of data—a variable. In R there are three main vector data types (variable types) that you’ll work with in research:

  • numeric
  • factor
  • character

The first, numeric, is just that: numbers. In R, you can make a numeric variable with the code below:

x <- c(10.1, 2.1, 4.6, 2.3, 8.9)

The c() function (note 4) stands for “concatenate”: it glues the values inside the parentheses together into a single vector. We use <- to put that vector into x. So in this case, x (which we could have named anything) is saving those values so we can work with them (note 5). Once we run this code, the x object is in R’s working memory and stays there until we remove it or the R session ends (i.e., we close R).
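
As a small sketch of what working with x can look like, the lines below recreate the vector and pass it to a few base R functions. Note that length(), mean(), exists(), and rm() are not introduced in the text above, and the names n and avg are made up for illustration.

```r
## Recreate the numeric vector from the text
x <- c(10.1, 2.1, 4.6, 2.3, 8.9)

n   <- length(x)   ## how many values x holds: 5
avg <- mean(x)     ## their average: 5.6

## x stays in working memory until we remove it or the session ends
exists("x")        ## TRUE
rm(x)              ## remove x from working memory
exists("x")        ## FALSE
```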

A factor variable is a categorical variable (i.e., only a limited number of options exist). For example, race/ethnicity is a factor variable.

race <- c(1, 3, 2, 1, 1, 2, 1, 3, 4, 2)

The code above actually produces a numeric vector (since it was only provided numbers). We can quickly tell R that it is indeed supposed to be a factor.

race <- factor(race, 
               labels = c("white", "black", "hispanic", "asian"))

The factor() function tells R that the first thing—race—is actually a factor. The additional argument labels tells R what each of the values means. If we print out race we see that R has replaced the numeric values with the labels.

race
##  [1] white    hispanic black    white    white    black    white   
##  [8] hispanic asian    black   
## Levels: white black hispanic asian
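
To see what the factor holds beyond its printed labels, a short sketch using three base R functions may help. levels(), table(), and as.integer() are not covered in the text above and are shown here only as one possible way to inspect a factor.

```r
## Build the same factor as in the text, in one step
race <- factor(c(1, 3, 2, 1, 1, 2, 1, 3, 4, 2),
               labels = c("white", "black", "hispanic", "asian"))

levels(race)       ## the category labels, in order
table(race)        ## number of observations in each category
as.integer(race)   ## the underlying integer codes (1 through 4)
```

The labels are matched to the sorted numeric values, so 1 becomes "white", 2 becomes "black", and so on.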

Finally, and maybe less relevantly, there are character variables. These are words (known as strings). In research this is often where subjects give open responses to a question.

ch <- c("I think this is great.", 
        "I would suggest you learn R.", 
        "You seem quite smart.")
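
A brief sketch of inspecting a character vector follows. class() appears later in this chapter; nchar() is an additional base R function used here only for illustration.

```r
ch <- c("I think this is great.",
        "I would suggest you learn R.",
        "You seem quite smart.")

class(ch)    ## "character"
length(ch)   ## 3 responses
nchar(ch)    ## number of characters in each response: 22 28 21
```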

When we combine multiple variables into one table, we create a data.frame. A data frame is like a spreadsheet, like the ones you have probably seen in Microsoft’s Excel and IBM’s SPSS. Here’s a simple example:

df <- data.frame("A"=c(1,2,1,4,3),
                 "B"=c(1.4,2.1,4.6,2.0,8.2),
                 "C"=c(0,0,1,1,1))
df
##   A   B C
## 1 1 1.4 0
## 2 2 2.1 0
## 3 1 4.6 1
## 4 4 2.0 1
## 5 3 8.2 1

We can do quite a bit with the data.frame that we called df (note 6). Once again, we could have called this data frame anything, although I recommend short names. If “A” and “C” are factors, we can tell R by:

df$A <- factor(df$A, labels = c("level1", "level2", "level3", "level4"))
df$C <- factor(df$C, labels = c("Male", "Female"))

In the above code, the $ reaches into df to grab a variable (or column). The following code does the exact same thing:

df[["A"]] <- factor(df[["A"]], labels = c("level1", "level2", "level3", "level4"))
df[["C"]] <- factor(df[["C"]], labels = c("Male", "Female"))

and so does the following:

df[, "A"] <- factor(df[, "A"], labels = c("level1", "level2", "level3", "level4"))
df[, "C"] <- factor(df[, "C"], labels = c("Male", "Female"))

df[["A"]] grabs the A variable just like df$A. The last example shows that we can grab both columns and rows. In df[, "C"] there is an empty spot just ahead of the comma. It works like this: df[rows, columns]. Leaving the rows spot empty keeps every row. So we could specifically grab certain rows and certain columns.

df[1:3, "A"]
df[1:3, 1]

Both lines of the above code grab rows 1 through 3 of column “A” (the second line refers to the column by position, since A is the first column).

Finally, we can use the c() function inside the brackets to grab several rows and columns at once. To grab rows 1 and 5 and columns “B” and “C” you can do the following:

df[c(1,5), c("B", "C")]
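
The sketch below pulls the bracket examples together and checks what kind of object each one returns. It rebuilds df before any factor conversion, and the names one_col, two_cols, and all_rows are made up for illustration.

```r
df <- data.frame(A = c(1, 2, 1, 4, 3),
                 B = c(1.4, 2.1, 4.6, 2.0, 8.2),
                 C = c(0, 0, 1, 1, 1))

one_col  <- df[1:3, "A"]               ## one column comes back as a plain vector
two_cols <- df[c(1, 5), c("B", "C")]   ## several columns come back as a data frame
all_rows <- df[, "B"]                  ## empty rows spot: every row of column B

class(one_col)    ## "numeric"
class(two_cols)   ## "data.frame"
dim(two_cols)     ## 2 rows, 2 columns
```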

We may also want to get more information about the data frame before we do any subsetting. There are a few nice functions to get information that can help us know what we should do next with our data.

## Get the variable names
names(df)
## [1] "A" "B" "C"
## Know what type of variable it is
class(df$A)
## [1] "factor"
## Get quick summary statistics for each variable
summary(df)
##       A           B             C    
##  level1:2   Min.   :1.40   Male  :2  
##  level2:1   1st Qu.:2.00   Female:3  
##  level3:1   Median :2.10             
##  level4:1   Mean   :3.66             
##             3rd Qu.:4.60             
##             Max.   :8.20
## Get the first 10 rows of your data
head(df, n=10)
##        A   B      C
## 1 level1 1.4   Male
## 2 level2 2.1   Male
## 3 level1 4.6 Female
## 4 level4 2.0 Female
## 5 level3 8.2 Female

We admit that the last one is a bit pointless since our data frame is only a few lines long. However, these functions can give you quick information about your data with hardly any effort on your part.
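
Two more base R functions are often used alongside the ones above. dim() and str() are not part of the original list; this is only a sketch of how they might complement it.

```r
df <- data.frame(A = c(1, 2, 1, 4, 3),
                 B = c(1.4, 2.1, 4.6, 2.0, 8.2),
                 C = c(0, 0, 1, 1, 1))

dim(df)           ## number of rows and columns: 5 3
str(df)           ## compact overview: each variable's type and first values
head(df, n = 2)   ## a smaller n shows fewer rows (here, the first 2)
```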

Functions

Earlier we mentioned that c() was a “function.” Functions are how we do things with our data. There are probably hundreds of thousands of functions within your reach. In fact, you can create your own! We’ll discuss that more in later chapters.

For now, know that each function has a name (the function name of c() is “c”), arguments, and output of some sort. Arguments are the information that you provide to the function between the parentheses (e.g., we gave c() a bunch of numbers; we gave factor() two arguments: the variable and the labels for the variable’s levels). Output from a function varies to a massive degree but, in general, the output is what you are using the function for (e.g., for c() we wanted to create a vector, that is, a variable, of data).
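
To make the three pieces concrete, here is a minimal sketch using round(), a base R function not discussed above. Its name is round, its arguments are the data and digits, and its output is a new vector. The name rounded is made up for illustration.

```r
x <- c(10.1, 2.1, 4.6, 2.3, 8.9)

## name: round; arguments: x and digits; output: a new numeric vector
rounded <- round(x, digits = 0)
rounded   ## 10 2 5 2 9

## Arguments can be matched by position or by name; these calls are equivalent
identical(round(x, 0), round(x, digits = 0))   ## TRUE
```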

At any point, by typing:

?functionname

we get information in the “Help” window. It provides information on how to use the function, including arguments, output, and examples.
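
If you only want a quick reminder of a function's arguments, without opening the full help page, base R also offers args() and formals(). Neither is mentioned in the text above; the sketch below applies them to factor(), and arg_names is a made-up name.

```r
## ?factor opens the full help page; these stay in the console instead
args(factor)                          ## prints the function's argument list
arg_names <- names(formals(factor))   ## just the argument names
arg_names
```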

After a quick note about operators, you will be shown several functions for both importing and saving data. Note that each has a name, arguments, and output.

Operators

A special type of function is called an operator. These take two inputs—a left hand side and a right hand side—and output some value. A very common operator is <-, known as the assignment operator. It takes what is on the right hand side and assigns it to the left hand side. We saw this earlier with vectors and data frames. Other operators exist, a few of which we will introduce in the following chapter. But again, an operator is just a special function.
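
One way to see that an operator really is a function is to call it like one. In R, wrapping an operator in backticks lets you use it with ordinary function-call syntax. The object names y and z below are made up for illustration.

```r
## The usual form, with the operator between its two inputs
y <- 2 + 3

## The same operations written as ordinary function calls
`<-`(z, `+`(2, 3))

identical(y, z)   ## TRUE: both hold the value 5
```

You would rarely write code this way, but it shows that <- and + take a left hand side and a right hand side as their two arguments.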

Importing Data

Most of the time you’ll want to import data into R rather than manually entering it line by line, variable by variable.

There are some built-in ways to import many delimited (note 7) data types (e.g., comma delimited, also called CSV; tab delimited; space delimited). Other packages (note 8) have been developed to help with this as well.

R Data File

First, if the data are in an R data file with the extension .rda or .RData, simply use:

load("file.rda")

Note that you don’t assign this to a name such as df. Instead, it loads whatever R objects were saved to it.
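
The following sketch shows that behavior with a round trip: two objects are saved to a temporary file with save(), removed, and then restored by load() under their original names. save() and tempfile() are base R functions not covered above, and rda_path, scores, and notes are made-up names.

```r
scores <- c(10.1, 2.1, 4.6, 2.3, 8.9)
notes  <- c("first", "second")

rda_path <- tempfile(fileext = ".rda")   ## a throwaway file for this example
save(scores, notes, file = rda_path)     ## store both objects in one file

rm(scores, notes)                        ## clear them from memory
loaded <- load(rda_path)                 ## brings back scores and notes
loaded                                   ## the names that were restored
```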

Delimited Files

Most delimited files are saved as .csv, .txt, or .dat. As long as you know the delimiter, this process is easy.

df <- read.table("file.csv", sep = ",", header=TRUE)   ## for csv
df <- read.table("file.txt", sep = "\t", header=TRUE)  ## for tab delimited
df <- read.table("file.txt", sep = " ", header=TRUE)   ## for space delimited

The argument sep tells the function what kind of delimiter the data has and header tells R if the first row contains the variable names (you can change it to FALSE if it doesn’t).
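
To see what header changes, the sketch below writes a tiny comma delimited file to a temporary location and reads it both ways. The file contents and the names csv_path, with_header, and no_header are invented for illustration.

```r
csv_path <- tempfile(fileext = ".csv")   ## a throwaway file for this example
writeLines(c("id,score,group",
             "1,1.4,a",
             "2,2.1,b",
             "3,4.6,a"),
           csv_path)

with_header <- read.table(csv_path, sep = ",", header = TRUE)
names(with_header)   ## "id" "score" "group"
nrow(with_header)    ## 3

## With header = FALSE the first line is treated as data and R supplies names
no_header <- read.table(csv_path, sep = ",", header = FALSE)
names(no_header)     ## "V1" "V2" "V3"
nrow(no_header)      ## 4
```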

Note that at the end of the lines you see that I left a comment using #. I used two for stylistic purposes but only one is necessary. Anything after a # is not read by the computer; it’s just for us humans.

Heads up! Note that unless you are using the load function, you need to assign what is being read in to a name. In the examples, all were called df. In real life, you won’t run a bunch of different read functions to the same name because only the last one run would be saved (the others would be written over). However, if you have multiple data files to import you can assign them to different names and later merge them. Merging, also called joining, is something we’ll discuss in the next chapter.

Other Data Formats

Data from other statistical software such as SAS, SPSS, or Stata are also easy to get into R. We will use two powerful packages:

  1. haven
  2. foreign

To install, simply run:

install.packages("packagename")

This only needs to be run once on a computer. Then, to use it in a single R session (i.e. from when you open R to when you close it) run:

library(packagename)
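
If you are not sure whether a package is already installed, one possible check is sketched below. It relies on requireNamespace(), a base R function not discussed above, and the names pkgs and installed are made up. The install line is left as a comment so that nothing is downloaded unless you choose to run it.

```r
pkgs <- c("haven", "foreign")

## TRUE for each package that is already installed, FALSE otherwise
installed <- sapply(pkgs, requireNamespace, quietly = TRUE)
installed

## Install only the missing ones (run once per computer):
## install.packages(pkgs[!installed])
```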

Using these packages, I will show you simple ways to bring your data in from other formats.

library(haven)
df <- read_dta("file.dta")       ## for Stata data
df <- read_spss("file.sav")      ## for SPSS data
df <- read_sas("file.sas7bdat")  ## for this type of SAS file
library(foreign)
df <- read.xport("file.xpt")     ## for export SAS files

If you have another type of data file to import, online help found on sites like www.stackoverflow.com and www.r-bloggers.com often has the solution.

Saving Data

Finally, there are many ways to save data. Most of the read... functions have a corresponding write... function.

write.table(df, file="file.csv", sep = ",")  ## to create a CSV data file

By default, R writes missing data as NA since that is how R represents it. But often when we write a CSV file, we might want it as blank or some other value. If that’s the case, we can add another argument, na = "", after the sep argument.
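
Here is a sketch of writing a small data frame with missing values to a temporary CSV file so that the missing cells come out empty. The data and the names out and out_path are invented. The row.names = FALSE argument is an addition not discussed above: it assumes you do not want R's row numbers written as an extra column.

```r
out <- data.frame(A = c(1, NA, 3),
                  B = c("x", "y", NA))

out_path <- tempfile(fileext = ".csv")   ## a throwaway file for this example
write.table(out, file = out_path, sep = ",",
            na = "",             ## write missing values as empty cells
            row.names = FALSE)   ## assumption: leave out the row numbers

csv_lines <- readLines(out_path)
csv_lines   ## one header line plus three data lines, with no "NA" text
```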

Again, if you ever have questions about the specific arguments that a certain function has, you can simply run ?functionname. So, if you were curious about the different arguments in write.table, simply run ?write.table. In the pane with the files, plots, packages, etc., a document will show up to give you more information.

Conclusions

R is designed to be flexible and do just about anything with data that you’ll need to do as a researcher. With this chapter under your belt, you can now read basic R code and import and save your data. The next chapter will introduce the “tidyverse” of methods that can help you join, reshape, summarize, group, and much more.


  3. An R session is any time you open R, do work, and then close R. Unless you are saving your workspace (which, in general, you shouldn’t do), each session starts with a clean slate: no objects are in memory and no packages are loaded. This is why we use scripts. Scripts also make your workflow extra transparent and replicable.

  4. R is all about functions. Functions tell R what to do with the data. You’ll see many more examples throughout the book.

  5. This is a great feature of R. It is called “object oriented” which basically means R creates objects to work with. I discuss this more in 1.2. Also, the fact that I said x “saves” the information is not entirely true, but it is useful to think of it this way.

  6. I used this name since df is common in online help and other resources.

  7. The delimiter is what separates the pieces of data.

  8. A package is an extension to R that gives you more functions (abilities) to work with data. Anyone can write a package, although to get it on the Comprehensive R Archive Network (CRAN) it needs to be vetted to a large degree. In fact, after some practice, you could write a package to help you more easily do your work.