class: center, middle, inverse, title-slide

# Logistic Regression
## 🤗
### Tyson S. Barrett

---
background-image: url(phdcomics_stressvacations.png)
background-size: contain
background-repeat: no-repeat

---
background-image: url(phd_10minthesis.gif)
background-size: contain
background-repeat: no-repeat

---
class: inverse

# What is linear regression?
# What is logistic regression?
# How do we interpret it?
# How do we use it? (in R)

---

# Linear Regression

<img src="Logistic_Workshop_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

# Linear Regression

.pull-left[.center[
# 🤗

.coral[
### Interpretable estimates

### Simple with continuous outcomes

### Handles most types of predictors
]]]

--

.pull-right[.center[
# 😭

.coral[
### What if your outcome is **not** continuous?
]

.large[.large[Examples]]]
]

---

# .coral[Generalized] Linear Models

.large[.large[These **generalize** the regression framework to a wider range of data situations.]]

--

<br>

.large[.large[
To do so, they:

1. Can use a different **distribution** 📊
2. Use a **link** function ⛓
]]

---

# Common Combinations

.large[.large[
You don't need to know everything about distributions and links to understand how to use them.
]]

--
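.large[Some common pairings, as they appear in R's `glm()` (a sketch; `y`, `x`, and `dat` are placeholders, not part of the workshop data):]

```r
# a toy data frame just so the calls below run
dat <- data.frame(y = rbinom(20, 1, 0.5), x = rnorm(20))

glm(y ~ x, data = dat, family = gaussian(link = "identity"))  # ordinary linear regression
glm(y ~ x, data = dat, family = binomial(link = "logit"))     # logistic regression
glm(y ~ x, data = dat, family = poisson(link = "log"))        # counts
```

--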
.large[We will only discuss the .nicegreen[**Binomial - Logit**] combination.]

<br>

.large[.large[So, how does this turn regression into .coral[logistic regression]?]]

---

# Great Question!

### Quick Review of Regression

.large[.large[
$$Y = \beta_0 + \beta X + e_i$$
]]

<img src="Logistic_Workshop_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---
count: false

# Great Question!

### Quick Review of Regression

.large[.large[
$$Y = \beta_0 + \beta X + e_i$$
]]

.pull-left[.large[
- best fitting line
- both categorical and continuous predictors
- fast computation
]]

--

.pull-right[.large[
- estimates in `\(Y\)`'s units
- decent predictive accuracy
- commonly used
]]

--

.center[.large[.coral[To use logistic regression, we make a change to how this looks.]]]

---
class: inverse, middle, center

# Logistic Regression

---

# Logistics of Logistic Regression

.pull-left[
.large[.large[Regression + a **logit** link:]]

$$logit(Y) = \beta_0 + \beta X + e_i$$

where:

$$logit(Y) = log\left(\frac{Prob(Y = 1)}{1 - Prob(Y = 1)}\right)$$
]

--

.pull-right[
### Why is this cool?? 🥇

.large[
- The .coral[predicted values] stay between Probability = 0 and Probability = 1
- The .nicegreen[variance] is modeled appropriately for a binary outcome
- Its .bluer[usage] is very similar to conventional regression
]]

--

<br>

.center[.large[However: the output is not as interpretable as regular regression... Why?]]

---
class: inverse, center, middle

# Interpreting Logistic Regression

---

# Log-Odds, Probabilities, Odds Ratios, Oh My!

.large[.large[
- Log-Odds 😐
- Odds Ratios 🙂 .small[Odds ratios are much better → exponentiate the estimate]
]]

--
---
count: false

# Log-Odds, Probabilities, Odds Ratios, Oh My!

.large[.large[
- Log-Odds 😐
- Odds Ratios 🙂 .small[Odds ratios are much better → exponentiate the estimate]
- Probabilities 😃 .small[Probabilities are easiest to interpret → next slide]
]]

---

# Probabilities in Logistic Regression

.large[.large[
Often, the probability is of most interest in a study

- Odds ratios are also very important in most studies
]]

--

## How do we get to them from our model?

.large[.large[
1. Obtain the .nicegreen[Predicted Probabilities], and/or
2. Compute the .coral[Average Marginal Effects]
]]

---

# The Probabilities

.large[Let's get mathy!]

$$logit(Y) = \beta_0 + \sum_j^p \beta_j X_j + e_i$$

$$log\left(\frac{Prob(Y = 1)}{1 - Prob(Y = 1)}\right) = \beta_0 + \sum_j^p \beta_j X_j + e_i$$

$$\frac{Prob(Y = 1)}{1 - Prob(Y = 1)} = e^{\beta_0 + \sum_j^p \beta_j X_j + e_i}$$

$$Prob(Y = 1) = \frac{e^{\beta_0 + \sum_j^p \beta_j X_j + e_i}}{1 + e^{\beta_0 + \sum_j^p \beta_j X_j + e_i}}$$

---

# So...

.pull-left[
.large[.large[
The probability then:

1. Depends on all the estimates and values
2. Is not necessarily linear
]]]

.pull-right[
<img src="Logistic_Workshop_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
]

---

# Average Marginal Effects

.large[.large[
Each individual in the model has a .coral[**marginal effect**] for each variable
]
- To get a representative estimate for the sample, we take the average of the marginal effects
- This gives us the **Average Marginal Effect** (AME) for that variable
]

--

.large[.large[
The AME is an estimate in the outcome's original units (e.g., probabilities or risk)
]
- .nicegreen[A one unit increase in X is associated with an AME increase/decrease in the probability (or risk) of Y]
]

---

# Reporting

.large[.large[
- Report estimates through multiple avenues (both odds ratios and probabilities)
- Report uncertainty (confidence intervals, standard errors)
]]

---

# Interpretation Summary

.pull-left[
.large[.large[
### Outcomes
1. Odds Ratios
2. Probabilities
  - Predicted Probabilities
  - Average Marginal Effects
]]]

--

.pull-right[
.large[.large[
### Reporting
- Report estimates through multiple avenues (odds ratios and probabilities)
- Report uncertainty (confidence intervals, standard errors)
]]]

---

# Cool! So... how do we use it?

.large[
.large[If you are not an `R` user, you can ignore the syntax .nicegreen[but pay attention to the logic of it].]

.large[We'll use a fake data set about two popular TV shows: The Office and Parks and Recreation.]
]

<br>

.coral[.large[Note: We'll be ignoring some assumptions (like the fact that the data are nested).]]

---

# Dataset
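.large[A sketch of how such a data set might be built (the values here are made up; only the variable names match the models that follow):]

```r
# a hypothetical stand-in for the workshop data;
# spor, inco, prod1, and phys are the variables used below
set.seed(84322)
df <- data.frame(
  spor  = rbinom(33, 1, 0.2),   # plays a sport: 1 = yes, 0 = no
  inco  = rnorm(33, 47, 18),    # income
  prod1 = rnorm(33, 3.1, 1.3),  # productivity rating
  phys  = rnorm(33, 5.0, 2.2)   # physical health rating
)
head(df)
```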
---

# Start with .coral[Cross-Tabulations]

.large[.large[
Check for small cells, understand missingness
]]

```
─────────────────────────────────────────────────
                          Sport
                      No            Yes
                      n = 25        n = 7
 -------------------- ------------- -------------
 Income               46.0 (17.3)   52.9 (21.0)
 Productivity         3.2 (1.4)     2.9 (1.1)
 Physical-Health      4.7 (2.0)     6.3 (2.4)
 Married: Yes         5 (20%)       4 (57.1%)
 Race
    White             19 (76%)      7 (100%)
    Black             2 (8%)        0 (0%)
    Mexican American  2 (8%)        0 (0%)
    Indian            2 (8%)        0 (0%)
─────────────────────────────────────────────────
```

---

# How do we use it?

.large[
```r
fit1 <- glm(spor ~ inco,
            data = df,
            family = binomial(link = "logit"))
summary(fit1)
```
]

---

```
## 
## Call:
## glm(formula = spor ~ inco, family = binomial(link = "logit"), 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9122  -0.6921  -0.6283  -0.4655   2.0882  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -2.38299    1.40059  -1.701   0.0889 .
## inco         0.02152    0.02583   0.833   0.4049  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 34.106  on 32  degrees of freedom
## Residual deviance: 33.373  on 31  degrees of freedom
## AIC: 37.373
## 
## Number of Fisher Scoring iterations: 4
```

---

# Add Covariates

.large[
```r
fit2 <- glm(spor ~ inco + prod1 + phys,
            data = df,
            family = binomial(link = "logit"))
summary(fit2)
```
]

---

```
## 
## Call:
## glm(formula = spor ~ inco + prod1 + phys, family = binomial(link = "logit"), 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2201  -0.6434  -0.3933  -0.2191   2.4183  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -2.87936    1.74390  -1.651   0.0987 .
## inco         0.01171    0.03716   0.315   0.7527  
## prod1       -0.84335    0.48253  -1.748   0.0805 .
## phys         0.64104    0.36532   1.755   0.0793 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 34.106  on 32  degrees of freedom
## Residual deviance: 27.356  on 29  degrees of freedom
## AIC: 35.356
## 
## Number of Fisher Scoring iterations: 5
```

---

# Model Comparisons

.pull-left[
.large[.large[
Productivity and Physical Health were *borderline significant*

Is the second model .bluer[better] than the first?

- Use the .coral[likelihood ratio test] to investigate
]]]

.pull-right[
```r
anova(fit1, fit2, test = "LRT")
```

```
## Analysis of Deviance Table
## 
## Model 1: spor ~ inco
## Model 2: spor ~ inco + prod1 + phys
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1        31     33.373                       
## 2        29     27.356  2   6.0163  0.04938 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
]

.footnote[Make sure both models were fit on the same observations]

---
class: inverse, middle, center

# Modeling Considerations

---

# Diagnostics

.large[.large[
- Model fit
- Multicollinearity
- Prediction accuracy
- Residual deviance
]]

.footnote[see [Diagnostics in Logistic Regression](https://courses.washington.edu/b515/l14.pdf)]

---

# Assumptions

.large[.large[
1. Independence
2. No omitted influences
3. Right distribution
4. Accurate measurement
]]

---

# Other Considerations

.pull-left[
### .nicegreen[Missing Values]

- By default, uses listwise deletion (like regression)

### .dcoral[R-Squared?]
- There are approximations (pseudo- `\(R^2\)` measures), but nothing quite like `\(R^2\)`
]

--

.pull-right[
### .coral[Perfect Prediction/Separation]

- If any variable (or combination of variables) perfectly predicts the outcome, the model won't converge

### .bluer[Sample Size]

- Need a larger sample size than regular regression
- The proportion of yes/no outcomes also matters
]

---
class: inverse, middle, center

# Questions?
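---

# Getting Odds Ratios in R

.large[One way to get them (a sketch, using `fit2` from above): exponentiate the estimates and their confidence limits.]

```r
exp(coef(fit2))     # odds ratios for each predictor
exp(confint(fit2))  # 95% confidence intervals, on the odds ratio scale
```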
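---

# Getting Probabilities in R

.large[And the probability-scale quantities (a sketch; the `margins` package is one option for average marginal effects, not the only one):]

```r
# predicted probability of Y = 1 for each observation
predict(fit2, type = "response")

# average marginal effects, on the probability scale
library(margins)
summary(margins(fit2))
```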