Chapter 7: Other Modeling Techniques

“Simplicity is the ultimate sophistication.” — Leonardo da Vinci

In this chapter we briefly cover modeling techniques that are especially useful for making complex relationships easier to interpret. We will focus on mediation and moderation modeling, methods related to structural equation modeling (SEM), and methods from machine learning that apply to our field. Although the machine learning methods may appear very different from mediation and SEM, each has advantages that help in different situations. For example, SEM is useful when we know there is a high degree of measurement error or when our data have multiple indicators for each construct. On the other hand, regularized regression and random forests (two popular forms of machine learning) are great for exploring patterns and relationships when there are hundreds or thousands of variables that may predict an outcome.

Mediation modeling, although often used within SEM, helps us understand pathways of effect from one variable to another. It pairs naturally with moderating variables (i.e., variables that interact with another variable).

We’ll start by discussing mediation, then move on to SEM, followed by machine learning.

Mediation Modeling

Mediation modeling can be done via several packages. For now, we recommend lavaan (short for “latent variable analysis”)[1]. Although it is technically still a “beta” version, it performs very well, especially for simpler models, and it makes mediation modeling straightforward.

Below, we model the following mediation model: \[ depression = \beta_0 + \beta_1 asthma + \epsilon_1 \]

\[ time_{Sedentary} = \lambda_0 + \lambda_1 asthma + \lambda_2 depression + \epsilon_2 \]

In essence, we believe that asthma increases depression, which in turn increases the amount of time spent being sedentary.
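
Under this specification, the indirect effect of asthma on sedentary time is the product \(\beta_1 \lambda_2\), the direct effect is \(\lambda_1\), and the total effect is their sum, \(\lambda_1 + \beta_1 \lambda_2\). These quantities reappear below as a*b, c, and c + a*b.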

library(lavaan)
df$sed_hr <- df$sed/60  ## sedentary time in hours instead of minutes

## Our model
model1 <- '
  dep ~ asthma
  sed_hr ~ dep + asthma
'
## sem function to run the model
fit <- sem(model1, data = df)
summary(fit)
## lavaan (0.5-23.1097) converged normally after  30 iterations
## 
##                                                   Used       Total
##   Number of observations                          4614        4632
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic                0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   dep ~                                               
##     asthma            1.478    0.183    8.084    0.000
##   sed_hr ~                                            
##     dep               0.044    0.011    3.929    0.000
##     asthma            0.412    0.139    2.965    0.003
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .dep              19.597    0.408   48.031    0.000
##    .sed_hr           11.171    0.233   48.031    0.000

From the output we see that asthma predicts depression and that depression predicts time spent sedentary. There is also a direct effect of asthma on sedentary behavior even after controlling for depression. We can further specify the model so that it estimates and tests the indirect and total effects.

## Our model
model2 <- '
  dep ~ a*asthma
  sed_hr ~ b*dep + c*asthma
 
  indirect := a*b
  total := c + a*b
'
## sem function to run the model
fit2 <- sem(model2, data = df)
summary(fit2)
## lavaan (0.5-23.1097) converged normally after  30 iterations
## 
##                                                   Used       Total
##   Number of observations                          4614        4632
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic                0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   dep ~                                               
##     asthma     (a)    1.478    0.183    8.084    0.000
##   sed_hr ~                                            
##     dep        (b)    0.044    0.011    3.929    0.000
##     asthma     (c)    0.412    0.139    2.965    0.003
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .dep              19.597    0.408   48.031    0.000
##    .sed_hr           11.171    0.233   48.031    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     indirect          0.065    0.018    3.534    0.000
##     total             0.477    0.138    3.448    0.001

We defined a few things in the model. First, we gave the coefficients the labels a, b, and c; doing so allows us to define the indirect and total effects. Here we see the indirect effect, although small, is significant at \(p < .001\). The total effect is larger (unsurprisingly) and is also significant.
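
Because the indirect effect is a product of two coefficients, its sampling distribution is often non-normal, so bootstrapped standard errors and confidence intervals are commonly recommended. A minimal sketch (this takes noticeably longer to run):

## Refit model2 with bootstrapped standard errors (1,000 draws)
fit2_boot <- sem(model2, data = df, se = "bootstrap", bootstrap = 1000)
## Percentile bootstrap confidence intervals for all parameters
parameterEstimates(fit2_boot, boot.ci.type = "perc")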

Also note that the regression equations can include other covariates if needed (e.g., to control for age or gender), just as in regular regression.

## Our model
model2.1 <- '
  dep ~ asthma + ridageyr
  sed_hr ~ dep + asthma + ridageyr
'
## sem function to run the model
fit2.1 <- sem(model2.1, data = df)
summary(fit2.1)
## lavaan (0.5-23.1097) converged normally after  33 iterations
## 
##                                                   Used       Total
##   Number of observations                          4614        4632
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic                0.000
##   Degrees of freedom                                 0
##   Minimum Function Value               0.0000000000000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   dep ~                                               
##     asthma            1.462    0.183    7.980    0.000
##     ridageyr         -0.005    0.004   -1.330    0.183
##   sed_hr ~                                            
##     dep               0.044    0.011    3.927    0.000
##     asthma            0.412    0.139    2.956    0.003
##     ridageyr         -0.000    0.003   -0.063    0.950
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .dep              19.590    0.408   48.031    0.000
##    .sed_hr           11.171    0.233   48.031    0.000

We can also include moderation (“interactions”) as part of the mediation model, as sketched below.
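
As a minimal sketch, suppose (purely hypothetically) that gender (riagendr) moderates the asthma-to-depression path. lavaan does not create product terms of observed variables for us, so we compute the interaction term first; asthma_gender is a name we made up:

## Hypothetical moderator: gender on the asthma -> depression path
df$asthma_gender <- df$asthma * df$riagendr

model_mod <- '
  dep ~ asthma + riagendr + asthma_gender
  sed_hr ~ dep + asthma
'
fit_mod <- sem(model_mod, data = df)
summary(fit_mod)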

Structural Equation Modeling

Instead of summing our depression items into a single score, we can use SEM to run the mediation model from above with depression as a latent variable.

## Our model
model3 <- '
  dep1 =~ dpq010 + dpq020 + dpq030 + dpq040 + dpq050 + dpq060 + dpq070 + dpq080 + dpq090
  dep1 ~ a*asthma
  sed_hr ~ b*dep1 + c*asthma

  indirect := a*b
  total := c + a*b
'
## sem function to run the model
fit3 <- sem(model3, data = df)
summary(fit3)
## lavaan (0.5-23.1097) converged normally after  47 iterations
## 
##                                                   Used       Total
##   Number of observations                          4614        4632
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic             1065.848
##   Degrees of freedom                                43
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   dep1 =~                                             
##     dpq010            1.000                           
##     dpq020            1.096    0.024   45.136    0.000
##     dpq030            1.133    0.031   36.908    0.000
##     dpq040            1.149    0.030   38.066    0.000
##     dpq050            0.933    0.025   36.773    0.000
##     dpq060            0.929    0.022   42.107    0.000
##     dpq070            0.871    0.022   39.760    0.000
##     dpq080            0.686    0.019   36.325    0.000
##     dpq090            0.308    0.011   28.544    0.000
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   dep1 ~                                              
##     asthma     (a)    0.173    0.023    7.656    0.000
##   sed_hr ~                                            
##     dep1       (b)    0.342    0.105    3.275    0.001
##     asthma     (c)    0.418    0.139    2.998    0.003
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .dpq010            0.306    0.007   42.008    0.000
##    .dpq020            0.212    0.006   37.549    0.000
##    .dpq030            0.559    0.013   43.807    0.000
##    .dpq040            0.514    0.012   43.302    0.000
##    .dpq050            0.384    0.009   43.862    0.000
##    .dpq060            0.221    0.005   40.808    0.000
##    .dpq070            0.249    0.006   42.420    0.000
##    .dpq080            0.217    0.005   44.038    0.000
##    .dpq090            0.090    0.002   46.106    0.000
##    .sed_hr           11.179    0.233   48.012    0.000
##    .dep1              0.256    0.010   24.657    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     indirect          0.059    0.020    3.019    0.003
##     total             0.477    0.138    3.448    0.001

We defined dep1 as a latent variable using =~. Although the model does not fit the data well (“P-value (Chi-square) = 0.000”), it is informative for demonstration. We would want to check how well the measurement model (dep1 =~ dpq010 + ... + dpq090) fits on its own before putting it into a mediation model. We can do that via:

model4 <- '
  dep1 =~ dpq010 + dpq020 + dpq030 + dpq040 + dpq050 + dpq060 + dpq070 + dpq080 + dpq090
'
fit4 <- cfa(model4, data=df)
summary(fit4)
## lavaan (0.5-23.1097) converged normally after  29 iterations
## 
##   Number of observations                          4632
## 
##   Estimator                                         ML
##   Minimum Function Test Statistic              985.831
##   Degrees of freedom                                27
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Standard Errors                             Standard
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   dep1 =~                                             
##     dpq010            1.000                           
##     dpq020            1.097    0.024   45.383    0.000
##     dpq030            1.128    0.031   36.962    0.000
##     dpq040            1.145    0.030   38.136    0.000
##     dpq050            0.927    0.025   36.630    0.000
##     dpq060            0.930    0.022   42.294    0.000
##     dpq070            0.870    0.022   39.941    0.000
##     dpq080            0.681    0.019   36.350    0.000
##     dpq090            0.307    0.011   28.609    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .dpq010            0.306    0.007   42.051    0.000
##    .dpq020            0.211    0.006   37.470    0.000
##    .dpq030            0.560    0.013   43.909    0.000
##    .dpq040            0.515    0.012   43.400    0.000
##    .dpq050            0.390    0.009   44.041    0.000
##    .dpq060            0.221    0.005   40.835    0.000
##    .dpq070            0.249    0.006   42.461    0.000
##    .dpq080            0.216    0.005   44.149    0.000
##    .dpq090            0.090    0.002   46.195    0.000
##     dep1              0.261    0.011   24.765    0.000

As we can see, there is a lack of fit in the measurement model. It is possible that these depression questions measure more than one factor, which we could explore using exploratory factor analysis. We don’t demonstrate a full analysis here, but it is possible in R with a few other packages; a brief sketch follows.
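
For instance, here is a minimal sketch using the psych package (one option among several; we assume it is installed), extracting two factors purely as a starting point:

## Exploratory factor analysis of the nine depression items with the
## psych package; two factors is an assumption to explore, not a result
library(psych)
dep_items <- df[, paste0("dpq0", 1:9, "0")]  ## dpq010 through dpq090
efa_fit <- fa(dep_items, nfactors = 2, rotate = "varimax")
efa_fit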

Machine Learning Techniques

We are going to briefly introduce some machine learning techniques that may be of interest to researchers. We will quickly demonstrate:

  1. Ridge, Lasso and Elastic Net
  2. Random Forests

Ridge, Lasso and Elastic Net

To fit ridge, lasso, or elastic net regression, we can use the fantastic glmnet package. Using the cv.glmnet() function we can run ridge (\(\alpha = 0\)), lasso (\(\alpha = 1\), the default), or elastic net (\(0 < \alpha < 1\)). Elastic net is a blend of the ridge and lasso penalties: the closer \(\alpha\) is to 1, the more it acts like lasso; the closer it is to 0, the more it acts like ridge.
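
Concretely, for a continuous outcome glmnet minimizes the penalized least-squares criterion

\[ \min_{\beta_0, \beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^\top \beta \right)^2 + \lambda \left[ \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right] \]

where \(\alpha\) mixes the two penalties and \(\lambda\) controls their overall strength.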

Lasso and elastic net can do variable selection in addition to estimation, while ridge is great at handling correlated predictors. All of them often predict better than conventional regression, and all can handle large numbers of predictors. To learn more, see “An Introduction to Statistical Learning” by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani; a free PDF is available on the book’s website.

The package wants the data in a specific form. First, we remove any missingness using na.omit(). We then take the predictors (without the outcome) and put them in a data matrix object; we only include a few here for demonstration, but you can include many predictors. We name the predictor matrix X and the outcome Y.

## Keep complete cases for a handful of predictors plus the outcome
df2 <- df %>%
  dplyr::select(riagendr, ridageyr, ridreth3, race, famsize, dep, asthma, sed_hr) %>%
  na.omit()

## Predictor matrix (everything except the outcome) and outcome vector
X <- df2 %>%
  dplyr::select(-sed_hr) %>%
  data.matrix()
Y <- df2$sed_hr
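
A quick sanity check that the design matrix looks as expected before fitting:

dim(X)       ## one row per complete case, one column per predictor
colnames(X)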

Then we use the cv.glmnet() function to fit the different models. The “cv” refers to cross-validation[2], which we don’t discuss in depth here, but it is an important topic to become familiar with. Below we fit a ridge, a lasso, and an elastic net model. The elastic net model leans toward the lasso penalty because its alpha is closer to 1 than to 0.

library(glmnet)

fit_ridge <- cv.glmnet(X, Y, alpha = 0)
fit_lasso <- cv.glmnet(X, Y, alpha = 1)
fit_enet  <- cv.glmnet(X, Y, alpha = .8)

The plots below show appropriate lambda values based on the mean squared error of the cross-validated predictions. The vertical dashed lines mark a reasonable range of lambda values to use.

plot(fit_ridge)

plot(fit_lasso)

plot(fit_enet)
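
The two dashed lines in each plot correspond to values stored on the fitted object: the lambda that minimizes the cross-validated error and the largest lambda within one standard error of that minimum. For example:

fit_lasso$lambda.min   ## lambda with the lowest cross-validated MSE
fit_lasso$lambda.1se   ## largest lambda within 1 SE of that minimum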

We can get the coefficients at a reasonable lambda. Specifically, we use the “1-SE rule” (the right-hand vertical dashed line in the plots above) by setting s = "lambda.1se". You can also supply a specific lambda value directly, but the 1-SE rule is a simple rule of thumb.

coef(fit_ridge, s = "lambda.1se")
## 8 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept)  5.7780966
## riagendr     0.0287649
## ridageyr    -0.0003536
## ridreth3     0.0364093
## race         0.0588428
## famsize     -0.0258347
## dep          0.0067781
## asthma       0.0662119
coef(fit_lasso, s = "lambda.1se")
## 8 x 1 sparse Matrix of class "dgCMatrix"
##                  1
## (Intercept) 5.6183
## riagendr    .     
## ridageyr    .     
## ridreth3    .     
## race        0.1622
## famsize     .     
## dep         .     
## asthma      .
coef(fit_enet, s = "lambda.1se")
## 8 x 1 sparse Matrix of class "dgCMatrix"
##                  1
## (Intercept) 5.7175
## riagendr    .     
## ridageyr    .     
## ridreth3    .     
## race        0.1322
## famsize     .     
## dep         .     
## asthma      .
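
Once fitted, these models predict with predict(). As a quick sketch, we reuse the training matrix X for illustration (in practice you would predict on held-out data; see the cross-validation footnote):

## Predicted sedentary hours at the 1-SE lambda; reusing the training
## matrix here is only for illustration, held-out data is better
pred_lasso <- predict(fit_lasso, newx = X, s = "lambda.1se")
head(pred_lasso)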

Although we introduce these regression methods only briefly, they are very important; we highly recommend learning more about them.

Random Forests

Random forests are another machine learning method that can produce fantastic predictions. They are built very differently from the methods we have discussed so far: not on a linear modeling scheme, but on classification and regression trees (CART). Again, “An Introduction to Statistical Learning” is a great resource to learn more.

Conveniently, we can use the randomForest package. We specify the model with the formula sed_hr ~ ., which means we want sed_hr as the outcome and all remaining variables as predictors.

library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
fit_rf <- randomForest(sed_hr ~ ., data = df2)
fit_rf
## 
## Call:
##  randomForest(formula = sed_hr ~ ., data = df2) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 10.81
##                     % Var explained: 3.77

We can find out which variables were important in the model via:

par(mfrow=c(1,1))  ## back to one plot per page
varImpPlot(fit_rf)

We can see that age (ridageyr) is the most important variable, followed by depression (dep), with family size (famsize) third most important in the random forest model.
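
The numeric values behind the plot are available via importance(); for a regression forest the default measure is IncNodePurity:

## Importance values behind the plot, most important first
imp <- importance(fit_rf)
imp[order(imp[, "IncNodePurity"], decreasing = TRUE), , drop = FALSE]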

Conclusions

Although we only discussed these methods briefly, that does not mean they are less important. On the contrary, they are essential upper-level statistical methods. We hope this brief introduction helped you see what R is capable of across a wide range of methods.

The next chapter begins our “advanced” topics, starting with “Advanced Data Manipulation”.


  1. The lavaan package has some great vignettes at http://lavaan.ugent.be/ to help with the other types of models it can handle.

  2. Cross-validation is a common way to reduce over-fitting and make sure your model is generalizable. Generally, you split your data into training and testing sets. It is very common in machine learning and is beginning to be practiced in academic fields as well. We recommend using it as often as you can, especially with these methods but also to make sure your other models are accurate on new data as well.