class: center, middle, inverse, title-slide

# Logistic Regression
## 🤗
### Tyson S. Barrett

---
background-image: url(phdcomics_stressvacations.png)
background-size: contain
background-repeat: no-repeat

---
background-image: url(phd_10minthesis.gif)
background-size: contain
background-repeat: no-repeat

---
class: inverse

# What is linear regression?
# What is logistic regression?
# How do we interpret it?
# How do we use it? (in R)

---

# Linear Regression

<img src="Logistic_Workshop_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---

# Linear Regression

.pull-left[.center[
# 🤗

.coral[
### Interpretable estimates

### Simple with continuous outcomes

### Handles most types of predictors
]]]

--

.pull-right[.center[
# 😭

.coral[
### What if your outcome is **not** continuous?
]

.large[.large[Examples]]]
]

---

# .coral[Generalized] Linear Models

.large[.large[These **generalize** the regression framework to a wider range of data situations.]]

--

<br>

.large[.large[
To do so, they:

1. Can use a different **distribution** 📊
2. Use a **link** function ⛓
]]

---

# Common Combinations

.large[.large[
You don't need to know everything about distributions and links to understand how to use them.
]]

--
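.large[Some common pairings, as they appear in R's `glm()` (a sketch; `y`, `x`, and `dat` are placeholders, not part of the workshop data):]

```r
# a toy data frame just so the calls below run
dat <- data.frame(y = rbinom(20, 1, 0.5), x = rnorm(20))

glm(y ~ x, data = dat, family = gaussian(link = "identity"))  # ordinary linear regression
glm(y ~ x, data = dat, family = binomial(link = "logit"))     # logistic regression
glm(y ~ x, data = dat, family = poisson(link = "log"))        # counts
```

--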
.large[We will only discuss the .nicegreen[**Binomial - Logit**] combination.]

<br>

.large[.large[So, how does this turn regression into .coral[logistic regression]?]]

---

# Great Question!

### Quick Review of Regression

.large[.large[
$$Y = \beta_0 + \beta X + e_i$$
]]

<img src="Logistic_Workshop_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---
count: false

# Great Question!

### Quick Review of Regression

.large[.large[
$$Y = \beta_0 + \beta X + e_i$$
]]

.pull-left[.large[
- best fitting line
- both categorical and continuous predictors
- fast computation
]]

--

.pull-right[.large[
- estimates in `\(Y\)`'s units
- decent predictive accuracy
- commonly used
]]

--

.center[.large[.coral[To use logistic regression, we make a change to how this looks.]]]

---
class: inverse, middle, center

# Logistic Regression

---

# Logistics of Logistic Regression

.pull-left[
.large[.large[Regression + a **logit** link:]]

$$logit(Y) = \beta_0 + \beta X + e_i$$

where:

$$logit(Y) = log\left(\frac{Prob(Y = 1)}{1 - Prob(Y = 1)}\right)$$
]

--

.pull-right[
### Why is this cool?? 🥇

.large[
- The .coral[predicted values] stay between Probability = 0 and Probability = 1
- The .nicegreen[variance] is modeled appropriately for a binary outcome
- Its .bluer[usage] is very similar to conventional regression
]]

--

<br>

.center[.large[However: the output is not as interpretable as regular regression... Why?]]

---
class: inverse, center, middle

# Interpreting Logistic Regression

---

# Log-Odds, Probabilities, Odds Ratios, Oh My!

.large[.large[
- Log-Odds 😐
- Odds Ratios 🙂 .small[Odds ratios are much better → exponentiate the estimate]
]]

--
---
count: false

# Log-Odds, Probabilities, Odds Ratios, Oh My!

.large[.large[
- Log-Odds 😐
- Odds Ratios 🙂 .small[Odds ratios are much better → exponentiate the estimate]
- Probabilities 😃 .small[Probabilities are easiest to interpret → next slide]
]]

---

# Probabilities in Logistic Regression

.large[.large[
Often, the probability is of most interest in a study

- Odds ratios are also very important in most studies
]]

--

## How do we get to them from our model?

.large[.large[
1. Obtain the .nicegreen[Predicted Probabilities], and/or
2. Compute the .coral[Average Marginal Effects]
]]

---

# The Probabilities

.large[Let's get mathy!]

$$logit(Y) = \beta_0 + \sum_j^p \beta_j X_j + e_i$$

$$log\left(\frac{Prob(Y = 1)}{1 - Prob(Y = 1)}\right) = \beta_0 + \sum_j^p \beta_j X_j + e_i$$

$$\frac{Prob(Y = 1)}{1 - Prob(Y = 1)} = e^{\beta_0 + \sum_j^p \beta_j X_j + e_i}$$

$$Prob(Y = 1) = \frac{e^{\beta_0 + \sum_j^p \beta_j X_j + e_i}}{1 + e^{\beta_0 + \sum_j^p \beta_j X_j + e_i}}$$

---

# So...

.pull-left[
.large[.large[
The probability then:

1. Depends on all the estimates and values
2. Is not necessarily linear
]]]

.pull-right[
<img src="Logistic_Workshop_files/figure-html/unnamed-chunk-5-1.png" style="display: block; margin: auto;" />
]

---

# Average Marginal Effects

.large[.large[
Each individual in the model has a .coral[**marginal effect**] for each variable
]
- To get a representative estimate for the sample, we take the average of the marginal effects
- This gives us the **Average Marginal Effect** (AME) for that variable
]

--

.large[.large[
The AME is an estimate in the outcome's original units (e.g., probabilities or risk)
]
- .nicegreen[A one unit increase in X is associated with an AME increase/decrease in the probability (or risk) of Y]
]

---

# Reporting

.large[.large[
- Report estimates through multiple avenues (both odds ratios and probabilities)
- Report uncertainty (confidence intervals, standard errors)
]]

---

# Interpretation Summary

.pull-left[
.large[.large[
### Outcomes
1. Odds Ratios
2. Probabilities
  - Predicted Probabilities
  - Average Marginal Effects
]]]

--

.pull-right[
.large[.large[
### Reporting
- Report estimates through multiple avenues (odds ratios and probabilities)
- Report uncertainty (confidence intervals, standard errors)
]]]

---

# Cool! So... how do we use it?

.large[
.large[If you are not an `R` user, you can ignore the syntax .nicegreen[but pay attention to the logic of it].]

.large[We'll use a fake data set about two popular TV shows: The Office and Parks and Recreation.]
]

<br>

.coral[.large[Note: We'll be ignoring some assumptions (like the fact that the data are nested).]]

---

# Dataset
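.large[A sketch of how such a data set might be built (the values here are made up; only the variable names match the models that follow):]

```r
# a hypothetical stand-in for the workshop data;
# spor, inco, prod1, and phys are the variables used below
set.seed(84322)
df <- data.frame(
  spor  = rbinom(33, 1, 0.2),   # plays a sport: 1 = yes, 0 = no
  inco  = rnorm(33, 47, 18),    # income
  prod1 = rnorm(33, 3.1, 1.3),  # productivity rating
  phys  = rnorm(33, 5.0, 2.2)   # physical health rating
)
head(df)
```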
---

# Start with .coral[Cross-Tabulations]

.large[.large[
Check for small cells, understand missingness
]]

```
─────────────────────────────────────────────────
                          Sport
                      No            Yes
                      n = 25        n = 7
 -------------------- ------------- -------------
 Income               46.0 (17.3)   52.9 (21.0)
 Productivity         3.2 (1.4)     2.9 (1.1)
 Physical-Health      4.7 (2.0)     6.3 (2.4)
 Married: Yes         5 (20%)       4 (57.1%)
 Race
    White             19 (76%)      7 (100%)
    Black             2 (8%)        0 (0%)
    Mexican American  2 (8%)        0 (0%)
    Indian            2 (8%)        0 (0%)
─────────────────────────────────────────────────
```

---

# How do we use it?

.large[
```r
fit1 <- glm(spor ~ inco,
            data = df,
            family = binomial(link = "logit"))
summary(fit1)
```
]

---

```
## 
## Call:
## glm(formula = spor ~ inco, family = binomial(link = "logit"), 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9122  -0.6921  -0.6283  -0.4655   2.0882  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -2.38299    1.40059  -1.701   0.0889 .
## inco         0.02152    0.02583   0.833   0.4049  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 34.106  on 32  degrees of freedom
## Residual deviance: 33.373  on 31  degrees of freedom
## AIC: 37.373
## 
## Number of Fisher Scoring iterations: 4
```

---

# Add Covariates

.large[
```r
fit2 <- glm(spor ~ inco + prod1 + phys,
            data = df,
            family = binomial(link = "logit"))
summary(fit2)
```
]

---

```
## 
## Call:
## glm(formula = spor ~ inco + prod1 + phys, family = binomial(link = "logit"), 
##     data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2201  -0.6434  -0.3933  -0.2191   2.4183  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -2.87936    1.74390  -1.651   0.0987 .
## inco         0.01171    0.03716   0.315   0.7527  
## prod1       -0.84335    0.48253  -1.748   0.0805 .
## phys         0.64104    0.36532   1.755   0.0793 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 34.106  on 32  degrees of freedom
## Residual deviance: 27.356  on 29  degrees of freedom
## AIC: 35.356
## 
## Number of Fisher Scoring iterations: 5
```

---

# Model Comparisons

.pull-left[
.large[.large[
Productivity and Physical Health were *borderline significant*

Is the second model .bluer[better] than the first?

- Use the .coral[likelihood ratio test] to investigate
]]]

.pull-right[
```r
anova(fit1, fit2, test = "LRT")
```

```
## Analysis of Deviance Table
## 
## Model 1: spor ~ inco
## Model 2: spor ~ inco + prod1 + phys
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1        31     33.373                       
## 2        29     27.356  2   6.0163  0.04938 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
]

.footnote[Make sure both models were fit on the same observations]

---
class: inverse, middle, center

# Modeling Considerations

---

# Diagnostics

.large[.large[
- Model fit
- Multicollinearity
- Prediction accuracy
- Residual deviance
]]

.footnote[see [Diagnostics in Logistic Regression](https://courses.washington.edu/b515/l14.pdf)]

---

# Assumptions

.large[.large[
1. Independence
2. No omitted influences
3. Right distribution
4. Accurate measurement
]]

---

# Other Considerations

.pull-left[
### .nicegreen[Missing Values]

- By default, uses listwise deletion (like regression)

### .dcoral[R-Squared?]
- There are approximations (pseudo- `\(R^2\)` measures), but nothing quite like `\(R^2\)`
]

--

.pull-right[
### .coral[Perfect Prediction/Separation]

- If any variable (or combination of variables) perfectly predicts the outcome, the model won't converge

### .bluer[Sample Size]

- Need a larger sample size than regular regression
- The proportion of yes/no outcomes also matters
]

---
class: inverse, middle, center

# Questions?
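---

# Getting Odds Ratios in R

.large[One way to get them (a sketch, using `fit2` from above): exponentiate the estimates and their confidence limits.]

```r
exp(coef(fit2))     # odds ratios for each predictor
exp(confint(fit2))  # 95% confidence intervals, on the odds ratio scale
```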
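---

# Getting Probabilities in R

.large[And the probability-scale quantities (a sketch; the `margins` package is one option for average marginal effects, not the only one):]

```r
# predicted probability of Y = 1 for each observation
predict(fit2, type = "response")

# average marginal effects, on the probability scale
library(margins)
summary(margins(fit2))
```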