# Chapter 7: Other Modeling Techniques

“Simplicity is the ultimate sophistication.” — Leonardo da Vinci

In this chapter we cover, however briefly, modeling techniques that are especially useful to make complex relationships easier to interpret. We will focus on mediation and moderation modeling, methods relating to structural equation modeling (SEM), and methods applicable to our field from machine learning. Although these machine learning may appear very different than mediation and SEM, they each have advantages that can help in different situations. For example, SEM is useful when we know there is a high degree of measurement error or our data has multiple indicators for each construct. On the other hand, regularized regression and random forests–two popular forms of machine learning–are great to explore patterns and relationships there are hundreds or thousands of variables that may predict an outcome.

Mediation modeling, although often used within SEM, can help us understand pathways of effect from one variable to another. It is especially useful with moderating variables (i.e., variables that interact with another).

So we’ll start with discussing mediation, then we’ll move on to SEM, followed by machine learning.

## Mediation Modeling

Mediation modeling can be done via several packages. For now, we recommend using `lavaan`

(stands for “latent variable analysis”)^{19}. Although it is technically still a “beta” version, it performs very well especially for more simple models. It makes mediation modeling straightforward.

Below, we model the following mediation model: \[ depression = \beta_0 + \beta_1 asthma + \epsilon_1 \]

\[ time_{Sedentary} = \lambda_0 + \lambda_1 asthma + \lambda_2 depression + \epsilon_2 \]

In essence, we believe that asthma increases depression which in turn increases the amount of time spent being sedentary.

```
library(lavaan)
df$sed_hr = df$sed/60 ## in hours instead of minutes
## Our model
model1 <- '
dep ~ asthma
sed_hr ~ dep + asthma
'
## sem function to run the model
fit <- sem(model1, data = df)
summary(fit)
```

```
## lavaan (0.5-23.1097) converged normally after 30 iterations
##
## Used Total
## Number of observations 4614 4632
##
## Estimator ML
## Minimum Function Test Statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Information Expected
## Standard Errors Standard
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## dep ~
## asthma 1.478 0.183 8.084 0.000
## sed_hr ~
## dep 0.044 0.011 3.929 0.000
## asthma 0.412 0.139 2.965 0.003
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .dep 19.597 0.408 48.031 0.000
## .sed_hr 11.171 0.233 48.031 0.000
```

From the output we see asthma does predict depression and depression does predict time being sedentary. There is also a direct effect of asthma on sedentary behavior even after controlling for depression. We can further specify the model to have it give us the indirect effect and direct effects tested.

```
## Our model
model2 <- '
dep ~ a*asthma
sed_hr ~ b*dep + c*asthma
indirect := a*b
total := c + a*b
'
## sem function to run the model
fit2 <- sem(model2, data = df)
summary(fit2)
```

```
## lavaan (0.5-23.1097) converged normally after 30 iterations
##
## Used Total
## Number of observations 4614 4632
##
## Estimator ML
## Minimum Function Test Statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Information Expected
## Standard Errors Standard
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## dep ~
## asthma (a) 1.478 0.183 8.084 0.000
## sed_hr ~
## dep (b) 0.044 0.011 3.929 0.000
## asthma (c) 0.412 0.139 2.965 0.003
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .dep 19.597 0.408 48.031 0.000
## .sed_hr 11.171 0.233 48.031 0.000
##
## Defined Parameters:
## Estimate Std.Err z-value P(>|z|)
## indirect 0.065 0.018 3.534 0.000
## total 0.477 0.138 3.448 0.001
```

We defined a few things in the model. First, we gave the coefficients labels of `a`

, `b`

, and `c`

. Doing so allows us to define the `indirect`

and `total`

effects. Here we see the indirect effect, although small, is significant at \(p < .001\). The total effect is larger (not surprising) and is also significant.

Also note that we can make the regression equations have other covariates as well if we needed to (i.e. control for age or gender) just as we do in regular regression.

```
## Our model
model2.1 <- '
dep ~ asthma + ridageyr
sed_hr ~ dep + asthma + ridageyr
'
## sem function to run the model
fit2.1 <- sem(model2.1, data = df)
summary(fit2.1)
```

```
## lavaan (0.5-23.1097) converged normally after 33 iterations
##
## Used Total
## Number of observations 4614 4632
##
## Estimator ML
## Minimum Function Test Statistic 0.000
## Degrees of freedom 0
## Minimum Function Value 0.0000000000000
##
## Parameter Estimates:
##
## Information Expected
## Standard Errors Standard
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## dep ~
## asthma 1.462 0.183 7.980 0.000
## ridageyr -0.005 0.004 -1.330 0.183
## sed_hr ~
## dep 0.044 0.011 3.927 0.000
## asthma 0.412 0.139 2.956 0.003
## ridageyr -0.000 0.003 -0.063 0.950
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .dep 19.590 0.408 48.031 0.000
## .sed_hr 11.171 0.233 48.031 0.000
```

Although we don’t show it here, we can also do moderation (“interactions”) as part of the mediation model.

## Structural Equation Modeling

Instead of summing our depression variable, we can use SEM to run the mediation model from above but use the latent variable of depression instead.

```
## Our model
model3 <- '
dep1 =~ dpq010 + dpq020 + dpq030 + dpq040 + dpq050 + dpq060 + dpq070 + dpq080 + dpq090
dep1 ~ a*asthma
sed_hr ~ b*dep1 + c*asthma
indirect := a*b
total := c + a*b
'
## sem function to run the model
fit3 <- sem(model3, data = df)
summary(fit3)
```

```
## lavaan (0.5-23.1097) converged normally after 47 iterations
##
## Used Total
## Number of observations 4614 4632
##
## Estimator ML
## Minimum Function Test Statistic 1065.848
## Degrees of freedom 43
## P-value (Chi-square) 0.000
##
## Parameter Estimates:
##
## Information Expected
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## dep1 =~
## dpq010 1.000
## dpq020 1.096 0.024 45.136 0.000
## dpq030 1.133 0.031 36.908 0.000
## dpq040 1.149 0.030 38.066 0.000
## dpq050 0.933 0.025 36.773 0.000
## dpq060 0.929 0.022 42.107 0.000
## dpq070 0.871 0.022 39.760 0.000
## dpq080 0.686 0.019 36.325 0.000
## dpq090 0.308 0.011 28.544 0.000
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## dep1 ~
## asthma (a) 0.173 0.023 7.656 0.000
## sed_hr ~
## dep1 (b) 0.342 0.105 3.275 0.001
## asthma (c) 0.418 0.139 2.998 0.003
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .dpq010 0.306 0.007 42.008 0.000
## .dpq020 0.212 0.006 37.549 0.000
## .dpq030 0.559 0.013 43.807 0.000
## .dpq040 0.514 0.012 43.302 0.000
## .dpq050 0.384 0.009 43.862 0.000
## .dpq060 0.221 0.005 40.808 0.000
## .dpq070 0.249 0.006 42.420 0.000
## .dpq080 0.217 0.005 44.038 0.000
## .dpq090 0.090 0.002 46.106 0.000
## .sed_hr 11.179 0.233 48.012 0.000
## .dep1 0.256 0.010 24.657 0.000
##
## Defined Parameters:
## Estimate Std.Err z-value P(>|z|)
## indirect 0.059 0.020 3.019 0.003
## total 0.477 0.138 3.448 0.001
```

We defined `dep1`

as a latent variable using `=~`

. Although the model does not fit the data well–“`P-value (Chi-square) = 0.000`

”–it is informative for demonstration. We would likely need to find out how the measurement model (`dep1 =~ dpq010 + dpq020 + dpq030 +`

) actually fits before throwing it into a mediation model. We can do that via:

```
model4 <- '
dep1 =~ dpq010 + dpq020 + dpq030 + dpq040 + dpq050 + dpq060 + dpq070 + dpq080 + dpq090
'
fit4 <- cfa(model4, data=df)
summary(fit4)
```

```
## lavaan (0.5-23.1097) converged normally after 29 iterations
##
## Number of observations 4632
##
## Estimator ML
## Minimum Function Test Statistic 985.831
## Degrees of freedom 27
## P-value (Chi-square) 0.000
##
## Parameter Estimates:
##
## Information Expected
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## dep1 =~
## dpq010 1.000
## dpq020 1.097 0.024 45.383 0.000
## dpq030 1.128 0.031 36.962 0.000
## dpq040 1.145 0.030 38.136 0.000
## dpq050 0.927 0.025 36.630 0.000
## dpq060 0.930 0.022 42.294 0.000
## dpq070 0.870 0.022 39.941 0.000
## dpq080 0.681 0.019 36.350 0.000
## dpq090 0.307 0.011 28.609 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .dpq010 0.306 0.007 42.051 0.000
## .dpq020 0.211 0.006 37.470 0.000
## .dpq030 0.560 0.013 43.909 0.000
## .dpq040 0.515 0.012 43.400 0.000
## .dpq050 0.390 0.009 44.041 0.000
## .dpq060 0.221 0.005 40.835 0.000
## .dpq070 0.249 0.006 42.461 0.000
## .dpq080 0.216 0.005 44.149 0.000
## .dpq090 0.090 0.002 46.195 0.000
## dep1 0.261 0.011 24.765 0.000
```

As we can see, there is a lack of fit in the measurement model. It is possible that these depression questions could be measuring more than one factor. We could explore this using exploratory factor analysis. We don’t demonstrate that here, but know that it is possible to do in `R`

with a few other packages.

## Machine Learning Techniques

We are briefly going to introduce some machine learning techniques that may be of interest to researchers. We will quickly introduce and demonstrate:

- Ridge, Lasso and Elastic Net
- Random Forests

### Ridge, Lasso and Elastic Net

In order to do either ridge, lasso, or elastic net regression, we can use the fantastic `glmnet`

package. Using the `cv.glmnet()`

function we can run the ridge (\(alpha = 0\)), lasso (\(alpha = 1\) which is default), and elastic net (\(0 \leq alpha \leq 1\)). It turns out that elastic net is the combination of the ridge and lasso methods and the closer `alpha`

is to 1 the more it acts like lasso and the closer it is to 0 the more it acts like ridge.

Lasso and elastic net can do variable selection in addition to estimation. Ridge is great at handling correlated predictors. Each of them are better than conventional methods at prediction and each of them can handle large numbers of predictors. To learn more see “Introduction to Statistical Learning” by Daniela Witten, Gareth James, Robert Tibshirani, and Trevor Hastie. A free PDF is available on their website.

To use the package, it wants the data in a very specific form. First, we need to remove any missingness. We use `na.omit()`

to do this. We take all the predictors (without the outcome) and put it in a data matrix object. We only include a few for the demonstration but you can include *many* predictors. We name ours `X`

. `Y`

is our outcome.

```
df2 <- df %>%
dplyr::select(riagendr, ridageyr, ridreth3, race, famsize, dep, asthma, sed_hr) %>%
na.omit
X <- df2 %>%
dplyr::select(-sed_hr) %>%
data.matrix
Y <- df2$sed_hr
```

Then we use the `cv.glmnet()`

function to fit the different models. The “cv” refers to cross-validation^{20}, which we don’t discuss here, but it an important topic to become familiar with. Below we fit a ridge, a lasso, and an elastic net model. The elastic net model uses more of the lasso penalty because the `alpha`

is closer to 1 than 0.

```
library(glmnet)
fit_ridge <- cv.glmnet(X, Y, alpha = 0)
fit_lasso <- cv.glmnet(X, Y, alpha = 1)
fit_enet <- cv.glmnet(X, Y, alpha = .8)
```

The plots below show where appropriate `lambda`

values are based on the mean squared error of the cross-validated prediction. The vertical dashed lines show a reasonable range of lambda values that can be used.

`plot(fit_ridge)`

`plot(fit_lasso)`

`plot(fit_enet)`

We can get the coefficients at a reasonable `lambda`

. Specifically, we use the “1-SE rule” (near the right hand side vertical dashed lines in the above plots) by `s = "lambda.1se"`

. You can directly tell it what `lambda`

value you’d like but this is a simple rule of thumb.

`coef(fit_ridge, s = "lambda.1se")`

```
## 8 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 5.7780966
## riagendr 0.0287649
## ridageyr -0.0003536
## ridreth3 0.0364093
## race 0.0588428
## famsize -0.0258347
## dep 0.0067781
## asthma 0.0662119
```

`coef(fit_lasso, s = "lambda.1se")`

```
## 8 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 5.6183
## riagendr .
## ridageyr .
## ridreth3 .
## race 0.1622
## famsize .
## dep .
## asthma .
```

`coef(fit_enet, s = "lambda.1se")`

```
## 8 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 5.7175
## riagendr .
## ridageyr .
## ridreth3 .
## race 0.1322
## famsize .
## dep .
## asthma .
```

Although we briefly introduce these regression methods, they are indeed very important. We highly recommend learning more about them.

### Random Forests

Random forests is another machine learning method that can do fantastic prediction. It is built in a very different way than the methods we have discussed up to this point. It is not built on a linear modeling scheme; rather, it is built on classification and regression trees (CART). Again, “Introduction to Statistical Learning” is a great resource to learn more.

Conveniently, we can use the `randomForest`

package. We specify the model by the formula `sed_hr ~ .`

which means we want `sed_hr`

to be the outcome and all the rest of the variables to be predictors.

`library(randomForest)`

`## randomForest 4.6-12`

`## Type rfNews() to see new features/changes/bug fixes.`

```
##
## Attaching package: 'randomForest'
```

```
## The following object is masked from 'package:dplyr':
##
## combine
```

```
## The following object is masked from 'package:ggplot2':
##
## margin
```

```
fit_rf <- randomForest(sed_hr ~ ., data = df2)
fit_rf
```

```
##
## Call:
## randomForest(formula = sed_hr ~ ., data = df2)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 10.81
## % Var explained: 3.77
```

We can find out which variables were important in the model via:

```
par(mfrow=c(1,1)) ## back to one plot per page
varImpPlot(fit_rf)
```

We can see that age (`ridageyr`

) is the most important variable, depression (`dep`

) follows, with the family size (`famsize`

) the third most important in the random forests model.

## Conclusions

Although we only discussed these methods briefly, that does not mean they are less important. On the contrary, they are essential upper level statistical methods. This brief introduction hopefully helped you know what `R`

is capable of across a wide range of methods.

The next chapter begins our “advanced” topics, starting with “Advanced Data Manipulation”.

The

`lavaan`

package has some great vignettes at http://lavaan.ugent.be/ to help with the other types of models it can handle.↩Cross-validation is a common way to reduce over-fitting and make sure your model is generalizable. Generally, you split your data into training and testing sets. It is very common in machine learning and is beginning to be practiced in academic fields as well. We recommend using it as often as you can, especially with these methods but also to make sure your other models are accurate on new data as well.↩