Introduction to The Tidyverse

class: center, middle, inverse, title-slide

# Introduction to The Tidyverse
## 🤓
### Tyson S. Barrett

---

# Introduction

## The Newest and Brightest

![](Figures/Rstats_logo.png)

---

# The Newest and Brightest
### Tidyverse

.pull-left[
* To manipulate data cleanly and with current tools, we are going to use the "tidyverse" group of methods.
* The tidyverse is a group of packages that share a simple syntax for many basic (and complex) data manipulation tasks.
* The packages can be installed from the Packages panel, or with `install.packages("tidyverse")`.
]

.pull-right[

After downloading it, simply use:

```r
library(tidyverse)
```

```
── Attaching packages ──────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
```

```
✓ ggplot2 3.3.2     ✓ purrr   0.3.4
✓ tibble  3.0.1     ✓ dplyr   1.0.0
✓ tidyr   1.1.0     ✓ stringr 1.4.0
✓ readr   1.3.1     ✓ forcats 0.5.0
```

```
Warning: package 'ggplot2' was built under R version 3.6.2
```

```
Warning: package 'tibble' was built under R version 3.6.2
```

```
Warning: package 'tidyr' was built under R version 3.6.2
```

```
Warning: package 'purrr' was built under R version 3.6.2
```

```
Warning: package 'dplyr' was built under R version 3.6.2
```

```
── Conflicts ─────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
```
to load all the functions!

]

---

# Tidyverse

Note that when we loaded `tidyverse` it attached 8 packages and told you of "conflicts". A conflict occurs when two or more loaded packages contain a function with the same name. By default, `R` uses the version from the package that was loaded last. For example, suppose two packages, `awesome` and `amazing`, both had a function called `make_really_great`, and we loaded `awesome` and then `amazing` like so:

```r
library(awesome)
library(amazing)
```

`R` will automatically use the function from `amazing`.

---

# Conflicts

We can still access the `awesome` version of the function (even though the name is the same, the two versions won't necessarily do the same thing). We do this with the `::` operator:

```r
awesome::make_really_great(arg)
```

That's a bit of an aside, but know that you can always get at a function even if it is "masked" from your current session.
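
As a concrete sketch of the same idea, the conflict that `tidyverse` reported between `dplyr::filter()` and `stats::filter()` can be handled with the package prefix. The data frame `toy_df` below is made up for illustration.

```r
library(dplyr)

# Hypothetical toy data, not part of NHANES
toy_df <- data.frame(id = 1:4, score = c(10, 20, 30, 40))

# dplyr was loaded after stats, so a bare filter() is dplyr's row filter
kept <- filter(toy_df, score > 15)

# The masked stats version is still reachable with the package prefix;
# here it computes a two-point moving average of the scores
smoothed <- stats::filter(toy_df$score, rep(1 / 2, 2), sides = 1)
```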

---

# Tidy Methods

.pull-left[
## The Tidy Data Way

I'm introducing this to you for a couple reasons.

1. It simplifies the code and makes the code more readable. As Mr. Wickham says, **there are always at least two collaborators on any project: you and future you.**
2. It is the cutting edge. The most influential individuals in the `R` world, including the makers and maintainers of `RStudio`, use these methods and syntax.

The majority of what you'll need to do with data as a researcher will be covered by these functions. 
]

.pull-right[
## Methods for Tidying
There are several methods that help tidy up your data:

1. Piping
2. Selecting and Filtering
3. Grouping and Summarizing
4. Reshaping
5. Joining (merging)

To help illustrate each aspect, we are going to use real data from the National Health and Nutrition Examination Survey (NHANES). I've provided this data [on my website](https://tysonstanley.github.io/assets/Data/NHANES.zip). I've cleaned it up somewhat already.
]

---

# Example: NHANES

## Import

```r
library(rio)
dem_df <- import("Data/NHANES_demographics_11.xpt")
med_df <- import("Data/NHANES_MedHeath_11.xpt")
men_df <- import("Data/NHANES_MentHealth_11.xpt")
act_df <- import("Data/NHANES_PhysActivity_11.xpt")
```

## Example: NHANES

Now we have four separate, but related, data sets in memory:

1. `dem_df` containing demographic information
2. `med_df` containing medical health information
3. `men_df` containing mental health information
4. `act_df` containing activity level information

---

# Example: NHANES

Since all of them have all-cap variable names, we are going to quickly change this with a little trick:

```r
names(dem_df) <- tolower(names(dem_df))
names(med_df) <- tolower(names(med_df))
names(men_df) <- tolower(names(men_df))
names(act_df) <- tolower(names(act_df))
```
This takes the names of the data frame (on the right hand side), changes them to lower case, and then assigns them back as the names of the data frame.[^names]

Note that these are not particularly helpful names, but they are the names provided in the original data source. If you have questions about the data, visit [here](http://wwwn.cdc.gov/Nchs/Nhanes/Search/Nhanes11_12.aspx).
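
For comparison, here is a minimal sketch of the same renaming on a made-up data frame (`toy_df` and its column names are illustrative, not NHANES variables). It also shows `rename_with()`, which is available in `dplyr` 1.0.0 and later, as a pipe-friendly alternative.

```r
library(dplyr)

# Hypothetical toy data with all-cap names
toy_df <- data.frame(ID = 1:3, AGE = c(34, 51, 27))

# Base R: overwrite the names with their lower-case versions
base_way <- toy_df
names(base_way) <- tolower(names(base_way))

# dplyr: apply tolower() to every column name inside a pipe
tidy_way <- toy_df %>%
  rename_with(tolower)

names(base_way)
names(tidy_way)
```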

---

# Example: NHANES
### We will now go through each aspect of the tidy way of working with data using these four data sets.

---

#  Piping

.pull-left[
![](Figures/pipe_logo.png)
]

.pull-right[

```r
me %>% 
  wake_up("8:00am") %>% 
  exercise(30, units = "minutes") %>% 
  shower(15, units = "minutes") %>% 
  eat_breakfast("toast") %>% 
  go_to_work("basement")
```
]

---

# Example: NHANES
### Piping

`%>%` is the pipe "operator". It takes what is on the left hand side and puts it in the right hand side's function.

```r
dem_df %>% summary()
```

So the above code takes the data frame `dem_df` and puts it into the `summary` function. This does the same thing as `summary(dem_df)`.

In this simple case, it doesn't really make the code more readable, but in more complex situations it can really help.
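
To give a feel for a slightly longer chain, here is a small sketch on a made-up vector (`values` is illustrative only). The nested and piped versions compute the same thing; only the reading order differs.

```r
library(dplyr)  # attaching dplyr also makes %>% available

values <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(values)), 1)

# The piped version reads left to right, one step per line
piped <- values %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```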

---

# Select and Filter

.pull-left[
![](Figures/dplyr_logo.png)
]

.pull-right[
In situations where you want or need to subset your data, two main forms exist:

1. **Selecting Variables**
2. **Filtering Rows**
]

---

# Example: NHANES
### Select and Filter

**Selecting Variables**

```r
df[, c("var1", "var2")]
df %>%
  select(var1, var2)
```

Here both do the same thing. The first, using `[`, is the "base R" way of selecting variables. The second, using the pipe and `select()`, is the tidyverse way. Both work great so the choice is yours.
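
A runnable sketch of both approaches, assuming a made-up data frame `toy_df` with three columns:

```r
library(dplyr)

# Hypothetical toy data
toy_df <- data.frame(
  var1 = 1:3,
  var2 = c("a", "b", "c"),
  var3 = c(TRUE, FALSE, TRUE)
)

# Base R: pick columns by name
base_cols <- toy_df[, c("var1", "var2")]

# tidyverse: the same columns with select()
tidy_cols <- toy_df %>%
  select(var1, var2)

# select() can also drop a column with a minus sign
dropped <- toy_df %>%
  select(-var3)
```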

---

# Example: NHANES
### Select and Filter

**Filtering Rows**

```r
df[df$var1 == 1, ]
df %>%
  filter(var1 == 1)
```

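Again, both do essentially the same thing. The first, using `[`, is the "base R" way of filtering rows so that you only keep the ones where `var1` in `df` is equal to `1`. The second, using `filter()`, is the tidyverse way. (They differ only when `var1` has missing values: `[` returns a row of `NA`s for each one, while `filter()` drops it.) Use whichever you prefer.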
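
A small sketch of how the two approaches treat missing values, using a made-up `toy_df`:

```r
library(dplyr)

# Hypothetical toy data with one missing value in var1
toy_df <- data.frame(
  var1 = c(1, 2, NA, 1),
  var2 = c("a", "b", "c", "d")
)

# Base R: the NA in the condition produces an extra all-NA row
base_rows <- toy_df[toy_df$var1 == 1, ]

# Wrapping the condition in which() keeps only the TRUE positions
base_rows_clean <- toy_df[which(toy_df$var1 == 1), ]

# tidyverse: filter() keeps only rows where the condition is TRUE
tidy_rows <- toy_df %>%
  filter(var1 == 1)

nrow(base_rows)
nrow(base_rows_clean)
nrow(tidy_rows)
```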

---

# Mutating Variables

Anytime you see `mutate()` it means you are adding a new variable or modifying an existing one.

```r
data %>% 
  mutate(new_var = do_something_function(old_var))
```

```r
data <- data %>% 
  mutate(new_var = do_something_function(old_var))
```
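
The difference between the two forms above is whether the result is saved. A minimal sketch, assuming a made-up data frame `toy_df` with a hypothetical `height_cm` column:

```r
library(dplyr)

# Hypothetical toy data
toy_df <- data.frame(height_cm = c(150, 160, 170))

# Without assignment, the result is only printed; toy_df is unchanged
toy_df %>%
  mutate(height_m = height_cm / 100)
cols_before <- ncol(toy_df)

# With assignment, the new variable is stored in the data frame
toy_df <- toy_df %>%
  mutate(height_m = height_cm / 100)
cols_after <- ncol(toy_df)
```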

---
# Example: NHANES
### Mutating

```r
## Our Grouping Variable as a factor
dem_df <- dem_df %>%
  mutate(citizen = factor(dmdcitzn))
```

---

# Grouping and Summarizing

A major aspect of analysis is comparing groups. Lucky for us, this is very simple in `R`. I call it the three step summary:

1. Data
2. Group by
3. Summarize
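
The three steps can be sketched on a made-up data frame (`toy_df`, `group`, and `score` are illustrative names, not NHANES variables):

```r
library(dplyr)

# Hypothetical toy data
toy_df <- data.frame(
  group = c("a", "a", "b", "b"),
  score = c(1, 3, 10, 20)
)

group_means <- toy_df %>%                 # 1. Data
  group_by(group) %>%                     # 2. Group by
  summarize(mean_score = mean(score))     # 3. Summarize

group_means
```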

---