Chapter 8: Advanced Data Manipulation

“Every new thing creates two new questions and two new opportunities.” — Jeff Bezos

There’s so much more we can do with data in R than what we’ve presented. Two main topics we need to clarify here are:

  1. How do you reshape your data from wide to long form or vice versa?
  2. How can we simplify tasks that we need done many times?

We will introduce both ideas to you in this chapter. To discuss the first, we will introduce two functions to help you reshape your data: gather() and spread(). For the second, we need to talk about loops. Looping, for our purposes, refers to the ability to repeat something across many variables or data sets. There are many ways of doing this, but some are better than others. For looping, we’ll talk about:

  1. vectorized functions,
  2. for loops, and
  3. the apply family of functions.

Reshaping Your Data

We introduced you to wide form and long form of your data in Chapter 2. In reality, data can take on nearly infinite forms, but for most data in health, behavior, and social science, these two forms are sufficient to know. But the question remains: how do you change the form of your data?

In the tidyverse, the functions gather() and spread() can help with this in a simple way. To show you, we will use the fake data we started with in Chapter 2.

##    ID Var_Time1 Var_Time2
## 1   1 -0.205149    0.5350
## 2   2  0.497889    0.4928
## 3   3  0.530410    0.7172
## 4   4  0.026185    0.3038
## 5   5 -0.485566    0.6118
## 6   6 -0.003066    0.5458
## 7   7  0.646010    0.6828
## 8   8 -0.088697    0.9747
## 9   9  1.480590    0.9269
## 10 10  0.717556    0.4123

Notice that this data frame is in wide format (each ID is one row and there are multiple times or measurements per person). To change this to long format, we’ll use gather(). The first argument is the data.frame, followed by two variable names (names that will go into the new long form), and then the numbers of the columns that hold the measures (in this case, Var_Time1 and Var_Time2).

library(tidyverse)
long_form <- gather(d1, "measures", "values", 2:3)
long_form
##    ID  measures    values
## 1   1 Var_Time1 -0.205149
## 2   2 Var_Time1  0.497889
## 3   3 Var_Time1  0.530410
## 4   4 Var_Time1  0.026185
## 5   5 Var_Time1 -0.485566
## 6   6 Var_Time1 -0.003066
## 7   7 Var_Time1  0.646010
## 8   8 Var_Time1 -0.088697
## 9   9 Var_Time1  1.480590
## 10 10 Var_Time1  0.717556
## 11  1 Var_Time2  0.534965
## 12  2 Var_Time2  0.492754
## 13  3 Var_Time2  0.717174
## 14  4 Var_Time2  0.303823
## 15  5 Var_Time2  0.611775
## 16  6 Var_Time2  0.545812
## 17  7 Var_Time2  0.682821
## 18  8 Var_Time2  0.974716
## 19  9 Var_Time2  0.926901
## 20 10 Var_Time2  0.412281

As you can see, it took the variable names and put them in the first new variable, which we called “measures”. The actual values of the variables are now in the variable we called “values”. Finally, notice that each ID now has two rows (one for each measure).

To go in the opposite direction (long to wide) we can use the spread() function. All we do is provide the long-form data frame, the variable holding the measure names (measures), and the variable with the values (values).

wide_form <- spread(long_form, measures, values)
wide_form
##    ID Var_Time1 Var_Time2
## 1   1 -0.205149    0.5350
## 2   2  0.497889    0.4928
## 3   3  0.530410    0.7172
## 4   4  0.026185    0.3038
## 5   5 -0.485566    0.6118
## 6   6 -0.003066    0.5458
## 7   7  0.646010    0.6828
## 8   8 -0.088697    0.9747
## 9   9  1.480590    0.9269
## 10 10  0.717556    0.4123

And we are back to the wide form.

These steps can be followed for situations where there are many measures per person, many people per cluster, etc. In most cases, this is the way multilevel data analysis occurs (as we discussed in Chapter 6) and is a nice way to get our data ready for plotting.

The reshape() function

It is also possible to move multiple measures at once from wide to long or long to wide. The following figure, originally made for an introductory R class, shows the basics of the reshape() function.

Note a few important features:

  1. reshape() is used for moving from wide to long and long to wide. Here we just tell it the direction.
  2. To indicate multiple sets of columns, use a list of vectors (e.g., list(c("x1", "x2"), c("z1", "z2"))).
  3. reshape() can be used before or after subsetting, allowing for only the necessary variables to be included in the reshaping process.
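To make these points concrete, here is a minimal sketch of reshape() applied to the same d1 data frame from above. The argument values (v.names = "values", timevar = "time") are our own choices, not required names.

```r
## Wide to long: idvar names the ID column, varying lists the
## repeated-measure columns, and direction gives the target form.
long_form2 <- reshape(d1,
                      idvar     = "ID",
                      varying   = list(c("Var_Time1", "Var_Time2")),
                      v.names   = "values",
                      timevar   = "time",
                      direction = "long")

## Because long_form2 was made by reshape(), going back to wide
## only requires the direction.
wide_form2 <- reshape(long_form2, direction = "wide")
```

Note that reshape() remembers how it reshaped the data, which is why the trip back to wide form needs so few arguments.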

Repeating Actions (Looping)

Before diving into looping, you need to understand how to write your own functions.

Your Own Functions

Let’s create a function that estimates the mean (although it is completely unnecessary since there is already a perfectly good mean() function).

mean2 <- function(x){
  n <- length(x)
  m <- (1/n) * sum(x)
  return(m)
}

We create a function using the function() function.21 Within the function() we put an x. This is the argument that the function will ask for. Here, it is a numeric vector that we want to take the mean of. We then provide the meat of the function between the {}. Here, we did a simple mean calculation using length(x), which gives us the number of observations, and sum(), which sums the numbers in x.

Let’s give it a try:

v1 <- c(1,3,2,4,2,1,2,1,1,1)   ## vector to try
mean2(v1)                      ## our function
## [1] 1.8
mean(v1)                       ## the base R function
## [1] 1.8

Looks good! These functions that you create can do whatever you need them to (within the bounds of what R can do). I recommend starting with the code outside of a function and, once it works, putting it into a function. For example, we would start with:

n <- length(v1)
m <- (1/n) * sum(v1)
m
## [1] 1.8

and once things look good, we would put it into a function like we had before with mean2. It is an easy way to develop a good function and test it while developing it.

By creating your own functions, you can simplify your workflow and use them in loops, in the apply family of functions, and with the purrr package.
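As a quick sketch of that last point, purrr’s map() works much like lapply(): it applies a function to each element of a list and returns a list. Here we apply our mean2() function to a small list built from the v1 vector (the list and its names are just for illustration).

```r
library(purrr)   ## loaded as part of the tidyverse

## One result per list element, returned as a list
map(list(a = v1, b = v1 * 2), mean2)
```

We will not cover purrr further here, but it is a natural next step once the apply family feels comfortable.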

For practice, we will write one more function. Let’s make a function that takes a vector and gives us the N, the mean, and the standard deviation.

important_statistics <- function(x, na.rm=FALSE){
  N  <- length(x)
  M  <- mean(x, na.rm=na.rm)
  SD <- sd(x, na.rm=na.rm)
  
  final <- c(N, M, SD)
  return(final)
}

One of the first things you should note is that we included a second argument in the function, na.rm=FALSE (you can have as many arguments as you want, within reason). This argument has a default of FALSE, as it does in most functions that use the na.rm argument. We pass whatever is provided in na.rm on to both the mean() and sd() functions. Finally, you should notice that we took several pieces of information, combined them into the final object, and returned that.

Let’s try it out with the vector we created earlier.

important_statistics(v1)
## [1] 10.000  1.800  1.033
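To see why the na.rm argument matters, try the function on a copy of v1 with a missing value added (a quick sketch; the v1_miss name is ours).

```r
v1_miss <- c(v1, NA)

important_statistics(v1_miss)
## mean() and sd() return NA when missing values are present

important_statistics(v1_miss, na.rm = TRUE)
## the mean and SD are now computed on the non-missing values
```

One caution: length() has no na.rm argument, so the N reported by our function still counts the missing value either way.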

Looks good but we may want to change a few aesthetics. In the following code, we adjust it so we have each one labeled.

important_statistics2 <- function(x, na.rm=FALSE){
  N  <- length(x)
  M  <- mean(x, na.rm=na.rm)
  SD <- sd(x, na.rm=na.rm)
  
  final <- data.frame(N, "Mean"=M, "SD"=SD)
  return(final)
}
important_statistics2(v1)
##    N Mean    SD
## 1 10  1.8 1.033

We will come back to this function and use it in some loops and see what else we can do with it.

Vectorized

By construction, R is fastest when we use the vectorized form of doing things. For example, when we want to add two variables together, we can use the + operator. Like most functions in R, it is vectorized and so it is fast. Below we create a new vector using the rnorm() function, which produces normally distributed random values. The first argument is the length of the vector, followed by the mean and SD.

v2 <- rnorm(10, mean=5, sd=2)
add1 <- v1 + v2
round(add1, 3)
##  [1] 9.894 2.955 4.401 9.110 5.335 4.097 4.721 6.187 6.877 9.562

We will compare the speed of this to other ways of adding two variables together and see that it is the simplest and quickest.

For Loops

For loops have a bad reputation in the R world. This is because, in general, they are slow; they are among the slowest ways to iterate (i.e., repeat) functions. We start here to show you, in essence, what the apply family of functions is doing, often in a faster way.

At times, it is easiest to develop a for loop and then take it and use it within the apply or purrr functions. It can help you think through the pieces that need to be done in order to get your desired result.

For demonstration, we use a for loop to add two variables together. The code between the parentheses tells R how many times to loop. Here, we loop through 1:10 since there are ten observations in each vector; we could also specify this as 1:length(v1). When using for loops, we need to keep in mind that a variable must be initialized before we can fill it within the loop. That’s precisely what we do with add2, making it a numeric vector with 10 observations.

add2 <- vector("numeric", 10)   ## Initialize
for (i in 1:10){
  add2[i] <- v1[i] + v2[i]
}
round(add2, 3)
##  [1] 9.894 2.955 4.401 9.110 5.335 4.097 4.721 6.187 6.877 9.562

Same results! But, we’ll see later that it is much slower than the vectorized function.

The apply family

The apply family of functions that we’ll introduce are:

  1. apply()
  2. lapply()
  3. sapply()
  4. tapply()

Each essentially loops over the data you provide, applying a function (either one you created or an existing one). The different versions are extremely similar, with some minor differences: for apply() you tell it whether to iterate over the rows or the columns; lapply() iterates over the columns (or elements of a list) and outputs a list (hence the l); sapply() is similar to lapply() but simplifies the output to vectors or data frames where possible. tapply() differs the most because it can iterate over a variable by a grouping variable. We’ll show apply(), lapply(), and tapply() below.

For example, we can add two variables together here. We provide it the data.frame that has the variables we want to add together.

df <- data.frame(v1, v2)
add3 <- apply(df, 1, sum)
round(add3, 3)
##  [1] 9.894 2.955 4.401 9.110 5.335 4.097 4.721 6.187 6.877 9.562

The function apply() has three main arguments: a) the data.frame or matrix of data, b) the margin, where 1 means to apply the function to each row and 2 to each column, and c) the function to use.
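We mentioned sapply() above but have not shown it yet. As a quick sketch, it does the same column-wise iteration as lapply() but simplifies the result, here to a named vector, using our own mean2() function on the same df.

```r
sapply(df, mean2)   ## one value per column, simplified to a named vector
lapply(df, mean2)   ## the same values, but returned as a list
```

When each call returns a single value, sapply() is usually the more convenient of the two.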

We can also use one of our own functions such as important_statistics2() within the apply family.

lapply(df, important_statistics2)
## $v1
##    N Mean    SD
## 1 10  1.8 1.033
## 
## $v2
##    N  Mean    SD
## 1 10 4.514 2.791

This gives us a list of two elements, one for each variable, with the statistics that our function provides. With a little adjustment, we can make this into a data.frame using the do.call() function with "rbind".

do.call("rbind", lapply(df, important_statistics2))
##     N  Mean    SD
## v1 10 1.800 1.033
## v2 10 4.514 2.791

tapply() allows us to get information by a grouping factor. We are going to create a factor variable, group1, to go with the data frame we are using (df) and then get the mean of a variable by group.

group1 <- factor(sample(c(0,1), 10, replace=TRUE))
tapply(df$v1, group1, mean)
##     0     1 
## 2.000 1.333

We now have the means by each group. This, however, is largely replaced by the three-step summary that we learned earlier in dplyr using group_by() and summarize().
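For comparison, here is a sketch of that dplyr approach applied to the same variables (the mean_v1 column name is our own choice):

```r
data.frame(df, group1) %>%    ## combine the data and the grouping factor
  group_by(group1) %>%        ## declare the groups
  summarize(mean_v1 = mean(v1))
```

The result is the same group means, but returned as a data frame that is easy to extend with other summaries.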

These functions are useful in many situations, especially where there are no vectorized functions. You can always get an idea of whether to use a for loop or an apply function by giving it a try on a small subset of data to see if one is better and/or faster.

Speed Comparison

We can test to see how fast functions are with the microbenchmark package. Since it wants functions, we will create a function that uses the for loop.

forloop <- function(var1, var2){
  add2 <- vector("numeric", length(var1))
  for (i in seq_along(var1)){    ## loop over every element, not a hard-coded 1:10
    add2[i] <- var1[i] + var2[i]
  }
  return(add2)
}

Below, we can see that the vectorized version is nearly 50 times faster than the for loop and 300 times faster than apply(). Although the for loop was faster here, it can sometimes be slower than the apply functions; it just depends on the situation. But vectorized functions will almost always be much faster than anything else. It’s important to note that + is itself a function that can be called as we do below, highlighting the fact that anything that does something to an object in R is a function.

library(microbenchmark)
microbenchmark(forloop(v1, v2),
               apply(df, 1, sum),
               `+`(v1, v2))
## Unit: nanoseconds
##               expr   min    lq    mean  median      uq    max neval cld
##    forloop(v1, v2)  9368 10692 15305.9 11501.0 12819.0  73858   100  b 
##  apply(df, 1, sum) 74742 76953 87987.5 78567.5 94955.0 214625   100   c
##            v1 + v2   172   249   447.8   387.5   525.5   5345   100 a

Of course, as it says, the units are in nanoseconds. Whether a function takes 200 or 200,000 nanoseconds probably won’t change your life. However, if the function is being used repeatedly or on large data sets, this can make a difference.

Using “Anonymous Functions” in Apply

The last thing to know here is that you don’t need to create a named function every time you want to use apply. We can use what are called “anonymous” functions. Below, we use one to get the N and mean of the data.

lapply(df, function(x) rbind(length(x), mean(x, na.rm=TRUE)))
## $v1
##      [,1]
## [1,] 10.0
## [2,]  1.8
## 
## $v2
##        [,1]
## [1,] 10.000
## [2,]  4.514

So we don’t name the function, but we design it like we would a named function, just without the return(). We take x (which is a column of df), apply length() and mean(), and bind the results by rows. The first argument in the anonymous function will be the column or variable of the data you provide.

Here’s another example:

lapply(df, function(y) y * 2 / sd(y))
## $v1
##  [1] 1.936 5.809 3.873 7.746 3.873 1.936 3.873 1.936 1.936 1.936
## 
## $v2
##  [1]  6.37327 -0.03256  1.72029  3.66153  2.38987  2.21910  1.95023
##  [8]  3.71732  4.21144  6.13587

We take y (again, a column of df), multiply it by two, and divide by the standard deviation of y. Note that this is gibberish and not some special formula, but again, it shows how flexible these functions are.

The last two examples also show something important regarding the output:

  1. The output will be at the level of the anonymous function. The first had two numbers per variable because the function produced two summary statistics for each variable. In the second, we multiplied y by 2 (so it is still at the individual observation level) and then divided by the SD, which keeps the result at the observation level, so we get ten values for every variable.
  2. We can name the argument anything we want (as long as it is one word). We used x in the first and y in the second but as long as it is the same within the function, it doesn’t matter what you use.

Conclusions

These are useful tools to use in your own data manipulation beyond what we discussed with dplyr. It takes time to get used to making your own functions, so be patient with yourself as you learn how to get R to do exactly what you want in a condensed, replicable format.

With these new tricks up your sleeve, we can move on to more advanced plotting using ggplot2.


  1. That seemed like excessive use of the word function… It is important though. So, get used to it!