class: center, middle, inverse, title-slide

# Data Visualization
## Cohen Chapter 2 .small[EDUC/PSY 6600]

---

# Always plot your data first!

.center[.Huge["Always." - Severus Snape]]

--
### Why?

.large[
- .dcoral[Outliers] and impossible values

- Determine correct .nicegreen[statistical approach]

- .bluer[Assumptions] and diagnostics

- Discover new .nicegreen[relationships]
]

---
background-image: url(figures/fig_misleading_graph1.png)
background-position: 90% 60%
background-size: 500px

# The Visualization Paradox

.pull-left[.large[
- Often the .dcoral[most informative] aspect of analysis
- .nicegreen[Communicates] the "data story" the best
- Most abused area of quantitative science
- Figures can be *very* .bluer[misleading]
]]

.footnote[[Misleading Graphs](https://venngage.com/blog/misleading-graphs/)]

---
background-image: url(figures/fig_misleading_graph1_fixed.png)
background-position: 50% 80%
background-size: 700px

# Much better

---
# Keys to Good Viz's

.pull-left[.huge[
- .dcoral[Graphical method] should match .dcoral[level of measurement]
- Label all axes and include figure caption
- .nicegreen[Simplicity and clarity]
- .bluer[Avoid ‘chartjunk’]
]]

.pull-right[.huge[
- Avoid 3D figures unless there are 3 or more variables (and even then, avoid them)
- Black & white, grayscale/pattern fine for most simple figures
]]

---
# Data Visualizations

.huge[Takes practice -- try a bunch of stuff]

### Resources

.large[.large[
- [Edward Tufte's books](https://www.edwardtufte.com/tufte/books_vdqi)
- ["R for Data Science"](http://r4ds.had.co.nz/) by Grolemund and Wickham
- ["Data Visualization for Social Science"](http://socviz.co/) by Healy
]]

---
# Frequency Distributions

.pull-left[
.huge[.bluer[Counting] the number of occurrences of unique events]

- Categorical or continuous
- Just like with `tableF()` and `table1()`
]

.pull-right[
.large[
Can see .dcoral[central tendency] (continuous data) or .dcoral[most common value] (categorical data)
]

.large[Can see .nicegreen[range and extremes]]
]

```
                    
                    ──────────────────────────────────────────────────────
                     x       Freq CumFreq Percent CumPerc Valid  CumValid
                     1       265  265     26.50%  26.50%  27.32% 27.32%  
                     2       222  487     22.20%  48.70%  22.89% 50.21%  
                     3       242  729     24.20%  72.90%  24.95% 75.15%  
                     4       241  970     24.10%  97.00%  24.85% 100.00% 
                     Missing 30   1000    3.00%   100.00%                
                    ──────────────────────────────────────────────────────
```
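
---
# Frequency Distributions by Hand

Each column of a frequency table is a simple calculation. A minimal sketch in base R, assuming a made-up vector `x` (not the data behind the table on the previous slide):

```r
# Hypothetical data: 10 responses on a 1-4 scale, two of them missing
x <- c(1, 2, 2, 3, 3, 3, 4, 4, NA, NA)

freq <- table(x, useNA = "ifany")     # counts, with NA as its own row

freq_table <- data.frame(
  value   = names(freq),
  Freq    = as.vector(freq),
  CumFreq = cumsum(as.vector(freq)),
  Percent = 100 * as.vector(freq) / length(x)
)
freq_table$CumPerc <- cumsum(freq_table$Percent)

# "Valid" percents use only the non-missing responses as the denominator
valid_percent <- 100 * prop.table(table(x))

freq_table
valid_percent
```

The valid percents are larger than the plain percents because the missing responses are dropped from the denominator.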

---
# Frequencies and Viz's Together ❤️

.pull-left[
### Bar Graph

![](u01_Ch2_DataViz_files/figure-html/unnamed-chunk-2-1.png)
]

.pull-right[
### Histogram

![](u01_Ch2_DataViz_files/figure-html/unnamed-chunk-3-1.png)
]

???
What differences do you notice between them?

##### Bar
- Do NOT touch each other
- Begin and terminate at real limits
- Centered on the value
- Height = frequency

##### Histogram
- Touch each other
- Begin and terminate at real limits
- Centered on interval midpoint
- Height = frequency
- Interval size or ‘bin’ determines shape
    - Too narrow or too wide problematic
- Useful for checking distributional assumptions
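
---
# Bin Width Changes the Shape

The same data can look very different depending on the interval size. A small sketch in base R, using invented `scores` and `hist(plot = FALSE)` to show only the bin counts:

```r
# Hypothetical scores, invented for illustration
scores <- c(2, 3, 3, 4, 5, 5, 5, 6, 7, 7, 8, 12, 13, 14, 19)

# hist(plot = FALSE) returns the bin counts without drawing anything
narrow <- hist(scores, breaks = seq(0, 20, by = 1),  plot = FALSE)
medium <- hist(scores, breaks = seq(0, 20, by = 5),  plot = FALSE)
wide   <- hist(scores, breaks = seq(0, 20, by = 20), plot = FALSE)

narrow$counts   # many small bins: jagged, with empty intervals
medium$counts   # a few bins: the long right tail is visible
wide$counts     # a single bin: all information about shape is lost
```

Every version counts the same 15 scores; only the grouping differs.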

---
# What does DISTRIBUTION mean?

### The way that the data points are scattered

.pull-left[.large[.large[
**For .dcoral[Continuous]**

- General shape
- Exceptions (outliers)
- Modes (peaks)
- Center & spread (chap 3)
- Histogram
]]]

.pull-right[.large[.large[
**For .nicegreen[Categorical]**

- Counts of each
- Percent or Rate (adjusts for an ‘out of’ to compare)
- Bar chart
- Pie chart - avoid!
]]]

---
class: inverse, center, middle

# Let's Apply This To the Ihno Dataset

---
background-image: url(figures/fig_inho_data_desc.png)
background-position: 50% 80%
background-size: 800px

# Reminder

---
# Read in the Data

```r
library(tidyverse)   # the easy button
library(rio)         # read in Excel files
library(furniture)   # nice tables

data_raw <- rio::import("Ihno_dataset.xls") %>% 
  dplyr::rename_all(tolower)   # converts all variable names to lower case
```

--
### And Clean It

```r
data_clean <- data_raw %>% 
  dplyr::mutate(majorF = factor(major,
                                levels = c(1, 2, 3, 4, 5),
                                labels = c("Psychology", "Premed",
                                           "Biology", "Sociology",
                                           "Economics"))) %>%
  dplyr::mutate(coffeeF = factor(coffee,
                                 levels = c(0, 1),
                                 labels = c("Not a regular coffee drinker",
                                            "Regularly drinks coffee")))
```
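
---
# What `factor()` Does

`levels` lists the codes stored in the raw data and `labels` gives the text to show for each code, in the same order. A minimal sketch with a made-up `coffee` vector (not the real column):

```r
# Hypothetical codes, as they might be stored in a spreadsheet
coffee <- c(0, 1, 1, 0, 1)

coffeeF <- factor(coffee,
                  levels = c(0, 1),    # values found in the raw data
                  labels = c("Not a regular coffee drinker",
                             "Regularly drinks coffee"))

table(coffee)    # counts by numeric code
table(coffeeF)   # same counts, readable labels
```

A code that is not listed in `levels` becomes `NA`, so `table(coffeeF, useNA = "ifany")` is a quick check for unexpected values after recoding.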

---
# Frequency Distributions

.pull-left[

```r
data_clean %>%                 
  furniture::tableF(majorF)
```

```
## 
## ─────────────────────────────────────────
##  majorF     Freq CumFreq Percent CumPerc
##  Psychology 29   29      29.00%  29.00% 
##  Premed     25   54      25.00%  54.00% 
##  Biology    21   75      21.00%  75.00% 
##  Sociology  15   90      15.00%  90.00% 
##  Economics  10   100     10.00%  100.00%
## ─────────────────────────────────────────
```
]

.pull-right[

```r
data_clean %>% 
  furniture::tableF(phobia)
```

```
## 
## ─────────────────────────────────────
##  phobia Freq CumFreq Percent CumPerc
##  0      12   12      12.00%  12.00% 
##  1      15   27      15.00%  27.00% 
##  2      12   39      12.00%  39.00% 
##  3      16   55      16.00%  55.00% 
##  4      21   76      21.00%  76.00% 
##  5      11   87      11.00%  87.00% 
##  6      1    88      1.00%   88.00% 
##  7      4    92      4.00%   92.00% 
##  8      4    96      4.00%   96.00% 
##  9      1    97      1.00%   97.00% 
##  10     3    100     3.00%   100.00%
## ─────────────────────────────────────
```
]

---
# Frequency Viz's

.Huge[For viz's, we will use `ggplot2`]

.huge[This provides a powerful, flexible framework for data visualizations]

.large[.large[
- It is built on making .dcoral[layers]
- Each plot has a .nicegreen["geom"] function
    - e.g. `geom_bar()` for bar charts, `geom_histogram()` for histograms, etc.
]]
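
---
# Layers, One at a Time

A rough sketch of how a plot is built up by adding layers with `+`. The `toy` data frame and the object names `p_base`, `p_bars`, and `p_done` are invented for illustration:

```r
library(ggplot2)

# Hypothetical data: the major of six students
toy <- data.frame(
  major = factor(c("Psychology", "Psychology", "Psychology",
                   "Biology", "Biology", "Economics"),
                 levels = c("Psychology", "Biology", "Economics"))
)

p_base <- ggplot(toy) + aes(major)   # data and mapping only: an empty panel
p_bars <- p_base + geom_bar()        # add a geom layer: bars of counts
p_done <- p_bars +
  labs(x = "Major", y = "Number of students")   # add axis labels

length(p_base$layers)   # no geom layers yet
length(p_bars$layers)   # one geom layer
```

Typing `p_done` at the console draws the finished plot. Saving each step in its own object is only for demonstration; in practice the pieces are usually chained in one expression.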

---
# Bar Charts

.pull-left[

```r
data_clean %>% 
  ggplot() +
  aes(majorF)
```

![](u01_Ch2_DataViz_files/figure-html/unnamed-chunk-9-1.png)
]

.pull-right[

```r
data_clean %>% 
  ggplot() +
  aes(majorF) +
  geom_bar()
```

![](u01_Ch2_DataViz_files/figure-html/unnamed-chunk-10-1.png)

]

---
# Bar Charts

```r
data_clean %>% 
  ggplot() +
  aes(coffee) +
  geom_bar()
```

![](u01_Ch2_DataViz_files/figure-html/unnamed-chunk-11-1.png)

---
# Histograms

```r
data_clean %>% 
  ggplot() +
  aes(phobia) +
  geom_histogram()
```

![](u01_Ch2_DataViz_files/figure-html/unnamed-chunk-12-1.png)

---
# Histograms (change number of bins)

```r
data_clean %>% 
  ggplot() +
  aes(phobia) +
  geom_histogram(bins = 8)
```

![](u01_Ch2_DataViz_files/figure-html/unnamed-chunk-13-1.png)

---
# Histograms (change bins to size 5)

```r
data_clean %>% 
  ggplot() +
  aes(phobia) +
  geom_histogram(binwidth = 5)
```

![](u01_Ch2_DataViz_files/figure-html/unnamed-chunk-14-1.png)

---
# Histograms

```r
data_clean %>% 
  ggplot() +
  aes(mathquiz) +
  geom_histogram(binwidth = 4)
```

![](u01_Ch2_DataViz_files/figure-html/unnamed-chunk-15-1.png)

---
# Histograms -by- a Factor (columns)

```r
data_clean %>% 
  ggplot() +
  aes(mathquiz) +
  geom_histogram(binwidth = 4) +
  facet_grid(. ~ coffeeF)
```

![](u01_Ch2_DataViz_files/figure-html/unnamed-chunk-16-1.png)

---
# Histograms -by- a Factor (rows)

```r
data_clean %>% 
  ggplot() +
  aes(mathquiz) +
  geom_histogram(binwidth = 4) +
  facet_grid(coffeeF ~ .)
```

![](u01_Ch2_DataViz_files/figure-html/unnamed-chunk-17-1.png)
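
---
# Reading the `facet_grid()` Formula

The formula reads as `rows ~ columns`, with `.` meaning "do not split in this direction". A small sketch on an invented `toy` data frame:

```r
library(ggplot2)

# Hypothetical quiz scores for two groups
toy <- data.frame(
  score = c(10, 12, 15, 18, 20, 22, 25, 30),
  group = factor(rep(c("A", "B"), each = 4))
)

p <- ggplot(toy) + aes(score) + geom_histogram(binwidth = 5)

p_cols <- p + facet_grid(. ~ group)   # panels side by side
p_rows <- p + facet_grid(group ~ .)   # panels stacked, sharing the x-axis

# Both versions split the data into one panel per level of `group`
nlevels(layer_data(p_cols)$PANEL)
nlevels(layer_data(p_rows)$PANEL)
```

Stacked rows line the panels up on a common x-axis, which tends to make it easier to compare where each group is centered.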

---

# Deciles (break into 10% chunks)

```r
data_clean %>% 
  dplyr::pull(statquiz) %>% 
  quantile(probs = c(.10, .20, .30, .40, .50, .60, .70, .80, .90))
```

```
## 10% 20% 30% 40% 50% 60% 70% 80% 90% 
## 4.0 6.0 6.0 7.0 7.0 8.0 8.0 8.0 8.1
```
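
---
# Deciles with `seq()`

Typing out nine probabilities is error-prone. A minimal sketch that generates them with `seq()`, using a made-up vector `x` rather than the quiz scores:

```r
# Hypothetical data: the whole numbers 0 through 100
x <- 0:100

decile_probs <- seq(0.10, 0.90, by = 0.10)   # same as c(.10, .20, ..., .90)
quantile(x, probs = decile_probs)
```

The same idea gives quartiles with `seq(0, 1, by = 0.25)`.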

---

# Deciles - with missing values

```r
data_clean %>% 
  dplyr::pull(mathquiz) %>% 
  quantile(probs = c(.10, .20, .30, .40, .50, .60, .70, .80, .90))
```

`Error in quantile.default(., probs = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, : missing values and NaN's not allowed if 'na.rm' is FALSE`

---
# Deciles - `na.rm = TRUE`

```r
data_clean %>% 
  dplyr::pull(mathquiz) %>% 
  quantile(probs = c(.10, .20, .30, .40, .50, .60, .70, .80, .90),
           na.rm = TRUE)
```

```
##  10%  20%  30%  40%  50%  60%  70%  80%  90% 
## 15.0 21.0 25.2 28.0 30.0 32.0 33.8 37.2 41.0
```
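
---
# Check for Missing Values First

A small sketch of the same behavior on an invented vector: count the missing values, see the error, then drop them with `na.rm = TRUE`. The message text inside `tryCatch()` is made up for this example.

```r
# Hypothetical scores with one missing value
x <- c(10, 20, 30, 40, NA)

sum(is.na(x))   # how many values are missing

# Without na.rm = TRUE, quantile() stops with an error
result <- tryCatch(quantile(x, probs = 0.50),
                   error = function(e) "error: missing values present")
result

quantile(x, probs = 0.50, na.rm = TRUE)   # median of the 4 observed scores
```

Counting the missing values first shows how many observations the reported quantiles are actually based on.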

---

# Quartiles (break into 4 chunks)

```r
data_clean %>% 
  dplyr::pull(statquiz) %>% 
  quantile(probs = c(0, .25, .50, .75, 1))
```

```
##   0%  25%  50%  75% 100% 
##    1    6    7    8   10
```

---

# Percentiles

```r
data_clean %>% 
  dplyr::pull(statquiz) %>% 
  quantile(probs = c(.01, .05, .173, .90))
```

```
##    1%    5% 17.3%   90% 
##  2.98  3.00  5.00  8.10
```
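
---
# Why Percentiles Can Fall Between Scores

By default `quantile()` interpolates between neighboring sorted values, which is why a percentile can be a number nobody actually scored. A minimal sketch with invented data:

```r
# Hypothetical data: five sorted scores
x <- c(10, 20, 30, 40, 50)

# Default (type = 7): position = (n - 1) * p + 1 = 4 * 0.10 + 1 = 1.4,
# so the result lies 40% of the way from x[1] = 10 to x[2] = 20
quantile(x, probs = 0.10)             # 14

# type = 1 returns an observed value instead of interpolating
quantile(x, probs = 0.10, type = 1)   # 10
```

Other software may use a different rule, so small differences in reported percentiles are not necessarily errors.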

---
class: inverse, center, middle

# Questions?

---
class: inverse, center, middle

# Next Topic

### Center and Spread