Chapter 7: Other Modeling Techniques

“Simplicity is the ultimate sophistication.” — Leonardo da Vinci

In this chapter we cover, however briefly, modeling techniques that are especially useful for making complex relationships easier to interpret. We will focus on mediation and moderation modeling, methods relating to structural equation modeling (SEM), and methods from machine learning that are applicable to our field. Although the machine learning methods may appear very different from mediation and SEM, each has advantages that can help in different situations. For example, SEM is useful when we know there is a high degree of measurement error or when our data have multiple indicators for each construct. On the other hand, regularized regression and random forests–two popular forms of machine learning–are great for exploring patterns and relationships when there are hundreds or thousands of variables that may predict an outcome.

Mediation modeling, although often used within SEM, can help us understand pathways of effect from one variable to another. It can also accommodate moderating variables (i.e., variables that interact with another variable).

So we’ll start by discussing mediation, then move on to SEM, and finish with machine learning techniques.

Mediation Modeling

Mediation modeling can be done via several packages. For now, we recommend using either lavaan (short for “latent variable analysis”)[1] or MarginalMediation (written by the author of this book). Although both are technically still “beta” versions, they perform very well, especially for simpler models, and they make mediation modeling straightforward.
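If you do not already have them, both packages can be installed in the usual way (this assumes both are available on CRAN when you read this):

install.packages(c("lavaan", "MarginalMediation"))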

Below, we fit the following mediation model: \[ depression = \beta_0 + \beta_1 asthma + \epsilon_1 \]

\[ time_{Sedentary} = \lambda_0 + \lambda_1 asthma + \lambda_2 depression + \epsilon_2 \]
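Given these two equations, the effects of interest follow directly: the indirect effect of asthma on sedentary time is the product of the path coefficients through depression, the direct effect is the coefficient on asthma in the second equation, and the total effect is their sum:

\[ indirect = \beta_1 \lambda_2, \qquad direct = \lambda_1, \qquad total = \lambda_1 + \beta_1 \lambda_2 \]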

In essence, we believe that asthma increases depression which in turn increases the amount of time spent being sedentary. To run this with MarginalMediation, we will use two distinct regression models (see Chapter 4) and combine them with mma(). The object pathbc is the model with sedentary behavior as the outcome and patha is the path leading to the mediator.

library(MarginalMediation)

df$sed_hr = df$sed/60  ## in hours instead of minutes

pathbc <- glm(sed_hr ~ dep + asthma, data = df)
patha  <- glm(dep ~ asthma, data = df)

mma(pathbc, patha,
    ind_effects = c("asthmaAsthma-dep"))
## 
## calculating a paths... b and c paths... Done.
## ┌───────────────────────────────┐
## │  Marginal Mediation Analysis  │
## └───────────────────────────────┘
## A marginal mediation model with:
##    1 mediators
##    1 indirect effects
##    1 direct effects
##    500 bootstrapped samples
##    95% confidence interval
##    n = 4632 
## 
## Formulas:
##    ◌ sed_hr ~ dep + asthma
##    ◌ dep ~ asthma 
## 
## Regression Models: 
## 
##      sed_hr ~ 
##                          Est      SE Est/SE P-Value
##         (Intercept)  5.88575 0.06262 93.986 0.00000
##         dep          0.04367 0.01112  3.927 0.00009
##         asthmaAsthma 0.41230 0.13908  2.965 0.00305
## 
##      dep ~ 
##                        Est      SE Est/SE P-Value
##         (Intercept)  2.948 0.07059 41.764       0
##         asthmaAsthma 1.459 0.18250  7.993       0
## 
## Unstandardized Mediated Effects: 
## 
##    Indirect Effects: 
## 
##      sed_hr ~ 
##                             Indirect   Lower  Upper
##         asthmaAsthma => dep   0.0637 0.03181 0.1076
## 
##    Direct Effects: 
## 
##      sed_hr ~ 
##                      Direct  Lower  Upper
##         asthmaAsthma 0.4123 0.1597 0.6644
## 
## 
## Standardized Mediated Effects: 
## 
##    Indirect Effects: 
## 
##      sed_hr ~ 
##                             Indirect   Lower   Upper
##         asthmaAsthma => dep    0.019 0.00949 0.03211
## 
##    Direct Effects: 
## 
##      sed_hr ~ 
##                      Direct   Lower  Upper
##         asthmaAsthma  0.123 0.04764 0.1982

This gives us the estimates of the individual regression models and the estimates of the indirect and direct effects.

To fit the exact same model with lavaan, we can do the following:

library(lavaan)

## Our model
model1 <- '
  dep ~ asthma
  sed_hr ~ dep + asthma
'
## sem function to run the model
fit <- sem(model1, data = df)
summary(fit)
## lavaan 0.6-3 ended normally after 30 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          5
## 
##                                                   Used       Total
##   Number of observations                          4614        4632
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   dep ~                                               
##     asthma            1.478    0.183    8.084    0.000
##   sed_hr ~                                            
##     dep               0.044    0.011    3.929    0.000
##     asthma            0.412    0.139    2.965    0.003
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .dep              19.597    0.408   48.031    0.000
##    .sed_hr           11.171    0.233   48.031    0.000

From the output, we see that asthma predicts depression and that depression predicts time spent sedentary. There is also a direct effect of asthma on sedentary behavior even after controlling for depression. We can further specify the model so that it explicitly tests the indirect and total effects.

## Our model
model2 <- '
  dep ~ a*asthma
  sed_hr ~ b*dep + c*asthma
 
  indirect := a*b
  total := c + a*b
'
## sem function to run the model
fit2 <- sem(model2, data = df)
summary(fit2)
## lavaan 0.6-3 ended normally after 30 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          5
## 
##                                                   Used       Total
##   Number of observations                          4614        4632
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   dep ~                                               
##     asthma     (a)    1.478    0.183    8.084    0.000
##   sed_hr ~                                            
##     dep        (b)    0.044    0.011    3.929    0.000
##     asthma     (c)    0.412    0.139    2.965    0.003
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .dep              19.597    0.408   48.031    0.000
##    .sed_hr           11.171    0.233   48.031    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     indirect          0.065    0.018    3.534    0.000
##     total             0.477    0.138    3.448    0.001

We defined a few things in the model. First, we gave the coefficients the labels a, b, and c. Doing so allows us to define the indirect and total effects. Here we see that the indirect effect, although small, is significant at \(p < .001\). The total effect is larger (not surprisingly) and is also significant.
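By default, the standard errors for these defined parameters are based on the delta method. Because the sampling distribution of a product such as a*b is often skewed, bootstrapped confidence intervals are a common alternative, and lavaan supports them directly (the 500 draws here are an arbitrary choice, and this can take a little while to run):

fit2_boot <- sem(model2, data = df, se = "bootstrap", bootstrap = 500)
parameterEstimates(fit2_boot, boot.ci.type = "perc")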

Also note that we can add other covariates to the regression equations if needed (e.g., controlling for age or gender), just as we do in regular regression.

## Our model
model2.1 <- '
  dep ~ asthma + ridageyr
  sed_hr ~ dep + asthma + ridageyr
'
## sem function to run the model
fit2.1 <- sem(model2.1, data = df)
summary(fit2.1)
## lavaan 0.6-3 ended normally after 33 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                          7
## 
##                                                   Used       Total
##   Number of observations                          4614        4632
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                       0.000
##   Degrees of freedom                                 0
##   Minimum Function Value               0.0000000000000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   dep ~                                               
##     asthma            1.462    0.183    7.980    0.000
##     ridageyr         -0.005    0.004   -1.330    0.183
##   sed_hr ~                                            
##     dep               0.044    0.011    3.927    0.000
##     asthma            0.412    0.139    2.956    0.003
##     ridageyr         -0.000    0.003   -0.063    0.950
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .dep              19.590    0.408   48.031    0.000
##    .sed_hr           11.171    0.233   48.031    0.000

Although we don’t demonstrate it fully here, we can also include moderation (“interactions”) as part of the mediation model (especially using the MarginalMediation package).
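To give a flavor of the main building block, below is a minimal sketch of a simple moderation model in lavaan using a manually created product term (asthma_num and dep_x_asthma are helper variables we create here, and this assumes "Asthma" is the factor level shown in the output above):

## numeric version of asthma and its product with depression
df$asthma_num   <- as.numeric(df$asthma == "Asthma")
df$dep_x_asthma <- df$dep * df$asthma_num

## moderation model: the product term carries the interaction
model_mod <- '
  sed_hr ~ dep + asthma_num + dep_x_asthma
'
fit_mod <- sem(model_mod, data = df)
summary(fit_mod)

A significant coefficient on dep_x_asthma would suggest that the effect of depression on sedentary time differs between those with and without asthma.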

Structural Equation Modeling

Instead of summing our depression items into a single score, we can use SEM to run the mediation model from above with depression as a latent variable instead.

## Our model
model3 <- '
  dep1 =~ dpq010 + dpq020 + dpq030 + dpq040 + dpq050 + dpq060 + dpq070 + dpq080 + dpq090
  dep1 ~ a*asthma
  sed_hr ~ b*dep1 + c*asthma

  indirect := a*b
  total := c + a*b
'
## sem function to run the model
fit3 <- sem(model3, data = df)
summary(fit3)
## lavaan 0.6-3 ended normally after 47 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         22
## 
##                                                   Used       Total
##   Number of observations                          4614        4632
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                    1065.848
##   Degrees of freedom                                43
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   dep1 =~                                             
##     dpq010            1.000                           
##     dpq020            1.096    0.024   45.136    0.000
##     dpq030            1.133    0.031   36.908    0.000
##     dpq040            1.149    0.030   38.066    0.000
##     dpq050            0.933    0.025   36.773    0.000
##     dpq060            0.929    0.022   42.107    0.000
##     dpq070            0.871    0.022   39.760    0.000
##     dpq080            0.686    0.019   36.325    0.000
##     dpq090            0.308    0.011   28.544    0.000
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   dep1 ~                                              
##     asthma     (a)    0.173    0.023    7.656    0.000
##   sed_hr ~                                            
##     dep1       (b)    0.342    0.105    3.275    0.001
##     asthma     (c)    0.418    0.139    2.998    0.003
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .dpq010            0.306    0.007   42.008    0.000
##    .dpq020            0.212    0.006   37.549    0.000
##    .dpq030            0.559    0.013   43.807    0.000
##    .dpq040            0.514    0.012   43.302    0.000
##    .dpq050            0.384    0.009   43.862    0.000
##    .dpq060            0.221    0.005   40.808    0.000
##    .dpq070            0.249    0.006   42.420    0.000
##    .dpq080            0.217    0.005   44.038    0.000
##    .dpq090            0.090    0.002   46.106    0.000
##    .sed_hr           11.179    0.233   48.012    0.000
##    .dep1              0.256    0.010   24.657    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     indirect          0.059    0.020    3.019    0.003
##     total             0.477    0.138    3.448    0.001

We defined dep1 as a latent variable using =~. Although the model does not fit the data well–“P-value (Chi-square) = 0.000”–it is informative for demonstration. We would likely need to find out how well the measurement model (dep1 =~ dpq010 + ... + dpq090) actually fits before putting it into a mediation model. We can do that via:

model4 <- '
  dep1 =~ dpq010 + dpq020 + dpq030 + dpq040 + dpq050 + dpq060 + dpq070 + dpq080 + dpq090
'
fit4 <- cfa(model4, data=df)
summary(fit4)
## lavaan 0.6-3 ended normally after 29 iterations
## 
##   Optimization method                           NLMINB
##   Number of free parameters                         18
## 
##   Number of observations                          4632
## 
##   Estimator                                         ML
##   Model Fit Test Statistic                     985.831
##   Degrees of freedom                                27
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   dep1 =~                                             
##     dpq010            1.000                           
##     dpq020            1.097    0.024   45.383    0.000
##     dpq030            1.128    0.031   36.962    0.000
##     dpq040            1.145    0.030   38.136    0.000
##     dpq050            0.927    0.025   36.630    0.000
##     dpq060            0.930    0.022   42.294    0.000
##     dpq070            0.870    0.022   39.941    0.000
##     dpq080            0.681    0.019   36.350    0.000
##     dpq090            0.307    0.011   28.609    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .dpq010            0.306    0.007   42.051    0.000
##    .dpq020            0.211    0.006   37.470    0.000
##    .dpq030            0.560    0.013   43.909    0.000
##    .dpq040            0.515    0.012   43.400    0.000
##    .dpq050            0.390    0.009   44.041    0.000
##    .dpq060            0.221    0.005   40.835    0.000
##    .dpq070            0.249    0.006   42.461    0.000
##    .dpq080            0.216    0.005   44.149    0.000
##    .dpq090            0.090    0.002   46.195    0.000
##     dep1              0.261    0.011   24.765    0.000

As we can see, there is a lack of fit in the measurement model. The chi-square test is not the only way to judge this, and it is possible that these depression questions measure more than one factor, which we could explore using exploratory factor analysis (EFA). We sketch both ideas below; a fuller treatment requires a few other packages.
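First, because the chi-square test is sensitive to sample size (we have over 4,600 observations), it is worth asking lavaan for additional fit indices:

fitMeasures(fit4, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))

Second, as a rough sketch of what an EFA could look like, base R’s factanal() will fit one (the choice of two factors here is purely for illustration; the psych package offers more flexible EFA tools):

## exploratory factor analysis of the nine depression items
efa_fit <- factanal(~ dpq010 + dpq020 + dpq030 + dpq040 + dpq050 +
                      dpq060 + dpq070 + dpq080 + dpq090,
                    factors = 2, data = df, na.action = na.omit)
efa_fit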

Machine Learning Techniques

We are briefly going to introduce some machine learning techniques that may be of interest to researchers. We will quickly introduce and demonstrate:

  1. Ridge, Lasso and Elastic Net
  2. Random Forests

In order to use these methods, we can use the fantastic caret package. It allows us to run nearly any type of machine learning technique: it wraps many other packages and gives us a single, simple syntax across all the methods.
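To get a sense of what caret offers, two helper functions are useful: one lists every supported method, the other shows the tuning parameters for a given method.

library(caret)
## all supported method names (there are hundreds)
names(getModelInfo())
## the tuning parameters for one method
modelLookup("glmnet")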

Ridge, Lasso and Elastic Net

Lasso and elastic net can do variable selection in addition to estimation, while ridge is great at handling correlated predictors. Each of these methods is often better than conventional approaches at prediction, and each can handle large numbers of predictors. To learn more, see “An Introduction to Statistical Learning” by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani; a free PDF is available on their website.
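To make the connection between the three methods concrete: all of them minimize a penalized least-squares criterion. In the formulation used by the glmnet package (up to scaling of the loss), this is

\[ \min_{\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - x_i^{\top}\beta \right)^2 + \lambda \left[ \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right] \]

where \(\alpha = 0\) gives ridge, \(\alpha = 1\) gives the lasso, values in between give the elastic net, and \(\lambda\) controls the overall strength of the penalty.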

The package wants the data in a very specific form. In particular, it cannot handle missing values, so we first remove them with na.omit(). We also keep only the outcome and the predictors we want to use. We include just a few here for the demonstration, but you can include many predictors.

df2 <- df %>%
  dplyr::select(riagendr, ridageyr, ridreth3, race, famsize, dep, asthma, sed_hr) %>%
  na.omit

Then we use the train() function to fit the different models. Below we tell it to use 10-fold cross-validation[2], which we don’t discuss in depth here, but it is an important topic to become familiar with. We fit a model that is either a ridge, a lasso, or an elastic net model depending on the tuning parameter alpha (alpha = 0 is ridge, alpha = 1 is lasso, and anything in between is an elastic net). This is done using the method = "glmnet" argument. We specify the model with the formula sed_hr ~ ., which means we want sed_hr to be the outcome and all the rest of the variables to be predictors.

library(caret)

## Use 10-fold cross validation
fitControl <- trainControl(method = "cv",
                           number = 10)

## Run the model
fit <- train(sed_hr ~ ., 
             method = "glmnet",
             data = df2,
             trControl = fitControl)
fit
## glmnet 
## 
## 4437 samples
##    7 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 3993, 3993, 3992, 3994, 3994, 3993, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda     RMSE   Rsquared  MAE  
##   0.10   0.0008293  3.282  0.04191   2.689
##   0.10   0.0082933  3.283  0.04165   2.690
##   0.10   0.0829328  3.284  0.04099   2.693
##   0.55   0.0008293  3.282  0.04186   2.690
##   0.55   0.0082933  3.284  0.04108   2.692
##   0.55   0.0829328  3.286  0.04026   2.700
##   1.00   0.0008293  3.282  0.04185   2.690
##   1.00   0.0082933  3.284  0.04099   2.693
##   1.00   0.0829328  3.291  0.03962   2.706
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.1 and lambda
##  = 0.0008293.
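If we want only the winning combination of tuning parameters, or predictions from the final refit model, both are easy to pull out of the train object:

## the selected tuning parameters
fit$bestTune
## predictions from the final model (shown here for the first few rows of df2)
head(predict(fit, newdata = df2))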

With this model, we can assess the most important predictors of sedentary behavior. We can do that with:

varImp(fit)
## glmnet variable importance
## 
##                   Overall
## raceOther          100.00
## raceBlack           83.52
## raceWhite           80.22
## raceOtherHispanic   28.12
## asthmaAsthma        10.87
## ridreth3             5.47
## riagendr             4.81
## famsize              4.54
## dep                  1.37
## ridageyr             0.00

This shows us that, of these variables, race was the most important, followed by asthma. Importantly, though, this model did not predict the outcome very well (the R-squared values above are all around .04), so these are likely not very important predictors overall.
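By default, caret tries only a small grid of alpha and lambda values, as seen in the output above. If we wanted a more thorough search, we could supply our own grid (the values below are arbitrary choices for illustration):

## a custom tuning grid for glmnet
enet_grid <- expand.grid(alpha  = seq(0, 1, by = 0.25),
                         lambda = 10^seq(-4, 0, length.out = 10))

fit_enet <- train(sed_hr ~ .,
                  method = "glmnet",
                  data = df2,
                  trControl = fitControl,
                  tuneGrid = enet_grid)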

Random Forests

Random forests is another machine learning method that can do fantastic prediction. It is built in a very different way than the methods we have discussed up to this point: rather than a linear modeling scheme, it is built on classification and regression trees (CART). Again, “An Introduction to Statistical Learning” is a great resource to learn more.

Conveniently, we can use the randomForest package (the caret package works here as well). We specify the model with the formula sed_hr ~ . just like before, meaning sed_hr is the outcome and all the rest of the variables are predictors.

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
fit_rf <- randomForest(sed_hr ~ ., data = df2)
fit_rf
## 
## Call:
##  randomForest(formula = sed_hr ~ ., data = df2) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 10.83
##                     % Var explained: 3.61
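The defaults shown above (500 trees and, for regression, roughly a third of the predictors tried at each split) are often reasonable, but they are easy to change, and importance = TRUE requests the permutation-based importance measure (the values below are illustrative, not recommendations):

fit_rf2 <- randomForest(sed_hr ~ ., data = df2,
                        ntree = 1000,       ## more trees than the default 500
                        mtry = 3,           ## variables tried at each split
                        importance = TRUE)  ## permutation importance
fit_rf2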

We can find out which variables were important in the model via:

par(mfrow=c(1,1))  ## back to one plot per page
varImpPlot(fit_rf)

We can see that age (ridageyr) is the most important variable, followed by depression (dep), with family size (famsize) the third most important in the random forest model.
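If you want the numbers behind the plot rather than the picture, importance() returns them as a matrix, and partialPlot() gives a rough sense of the shape of a single predictor’s marginal effect:

## numeric importance values behind varImpPlot()
importance(fit_rf)
## approximate marginal effect of age on predicted sedentary hours
partialPlot(fit_rf, df2, ridageyr)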

Apply It

This link contains a folder complete with an Rstudio project file, an RMarkdown file, and a few data files. Download it and unzip it to do the following steps.

Step 1

Open the Chapter4.Rproj file. This will open up RStudio for you.

Step 2

Once RStudio has started, in the panel on the lower-right, there is a Files tab. Click on that to see the project folder. You should see the data files and the Chapter4.Rmd file. Click on the Chapter4.Rmd file to open it. In this file, import the data and run each type of statistical analysis presented in this chapter (there are others that are presented in Chapters 4, 5, and 6 that you may have done already).

Once that code is in the file, click the knit button. This will create an HTML file with the code and output knitted together into one nice document. This can be read into any browser and can be used to show your work in a clean document.

Conclusions

Although we only discussed these methods briefly, that does not mean they are less important. On the contrary, they are essential upper-level statistical methods. Hopefully this brief introduction helped you see what R is capable of across a wide range of methods.

The next chapter begins our “advanced” topics, starting with “Advanced Data Manipulation”.


  1. The lavaan package has some great vignettes at http://lavaan.ugent.be/ to help with the other types of models it can handle.

  2. Cross-validation is a common way to reduce over-fitting and make sure your model is generalizable. Generally, you split your data into training and testing sets. It is very common in machine learning and is beginning to be practiced in academic fields as well. We recommend using it as often as you can, especially with these methods but also to make sure your other models are accurate on new data as well.