## 3. Empirical Example

We now give an example of implementing the **regsem** package to further discuss the details. Here, we use an integrated dataset consisting of five publicly available datasets: National Comorbidity Survey—Baseline (NCS) [25]; NCS—Replication [26]; National Survey of American Life (NSAL) [26]; National Survey of American Life—Adolescent Supplement [27]; and the National Latino and Asian American Study [26]. Each survey was designed to study the prevalence and correlates of psychological disorders among individuals living in the United States. Combined, 25,159 respondents were surveyed; however, response rates on individual items within and across datasets vary. For demonstration purposes, we randomly selected a sample of 1000 to mimic a limited-sample-size situation in which regularization is desired. For this example, we selected 18 items (11 assessed within four datasets; 7 assessed within five datasets) that assessed the presence of symptom-level information (e.g., felt depressed most days, had a limited appetite most days, was so restless that others noticed) as it occurred during an individual's most severe depressive episode.

Given that this analysis is part of a larger study, our motivation was to combine depression items from multiple scales, assessed across the five datasets, to identify an optimal factor structure for use in additional analyses. Our eventual goal was an adequately fitting CFA model, as we desired some degree of clarity regarding the interpretation of each factor. However, given that we did not have an a priori model, we started with EFA to determine the number of factors.

We store the data in the object `dat.sub2`. The path diagram of the constructed model is displayed in Figure 1.

Before fitting the model in **regsem**, we need to organize our data. There are two major points here. The first is to convert the endogenous variables of the model to a continuous data type. This is because the **regsem** package works with the maximum likelihood discrepancy function, which assumes that the endogenous variables are continuous and normally distributed. Accordingly, even though there are options in **lavaan** that accommodate categorical variables, those options are not currently supported by **regsem**. For more discussion of categorical variables, see Section 4.5.

The second point concerns the scale of the variables. In regularized regression, it is suggested to standardize the variables before fitting the model, because the effect of the regularization is not orthogonal to the scales of the variables. For penalty types such as ridge and lasso, larger coefficients are penalized more, which biases the regularization toward penalizing features with smaller scales. For SEM, the covariance matrix becomes the correlation matrix after standardizing all of the variables. This does not create a problem for maximum likelihood without the penalty, since results based on the covariance and correlation matrices are essentially equivalent except for the scale. However, this invariance property of maximum likelihood no longer holds after the penalty is added; that is, results based on covariance and correlation are not equivalent for regularized SEM. Huang et al. [4] thus suggest fixing the scaling loadings deliberately, such that all of the latent variables have variances of around one. In their study of penalized-likelihood EFA, Jin et al. [17] suggest working with the correlation matrix and then transforming back to the covariance scale. We recommend standardizing only those variables that have at least one path to be penalized before fitting the regularized SEM model, to eliminate the effect of scale on variable/path selection. A second step of fitting the restricted model after the selection ("relaxed lasso") [28] could then be done on the original scale. We suggest that researchers transform variable scales carefully, based on the research question, before the analysis.

To evaluate the factor structure, we first used parallel analysis to identify an appropriate number of factors; five seemed the most appropriate. From this, we followed the procedure of Scharf and Nestler [19] and specified an EFA model with each factor loading penalized. This initial model is not identified; however, as the penalties increase, the degrees of freedom become positive as additional factor loadings are set to zero. In Table 1, we show the R code for constructing the model and organizing the example data.
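As a minimal sketch of the kind of specification Table 1 describes, a five-factor "EFA-style" **lavaan** model lets every item load on every factor, so the unpenalized model is intentionally not identified. The item names `DEP1`–`DEP18`, the factor names, and the object name `dat.sub2` are illustrative assumptions, not the paper's exact code:

```r
library(lavaan)

# Standardize the indicators whose loadings will be penalized
# (here, for simplicity, all 18 depression items).
dat.sub2 <- as.data.frame(scale(dat.sub2))

# Every item loads on every factor: the semi-confirmatory starting model.
items <- paste0("DEP", 1:18)
mod <- paste0(
  "f1 =~ ", paste(items, collapse = " + "), "\n",
  "f2 =~ ", paste(items, collapse = " + "), "\n",
  "f3 =~ ", paste(items, collapse = " + "), "\n",
  "f4 =~ ", paste(items, collapse = " + "), "\n",
  "f5 =~ ", paste(items, collapse = " + ")
)
```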

The model can be fitted in **lavaan** using any of its wrapper functions. **lavaan** by default sets `estimator = "ML"`. Researchers should not change this default, as the other options are not currently supported by **regsem**. When there are missing data, **regsem** currently supports listwise deletion and full information maximum likelihood (FIML). Since **lavaan** and **regsem** both use listwise deletion by default, we demonstrate only the use of listwise deletion for our example here. The FIML option and other options related to missing data will be further discussed in the Discussion section. Here, our model fitted by **lavaan** is not identified. This does not create a problem for **regsem**, since `regsem()` uses the **lavaan** object only to extract the sample covariance matrix and other aspects of the data. One can also specify `do.fit = FALSE` in this step. The model will become identified through path/variable selection. For an example of using **regsem** to identify EFA models, see [23,24].
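A sketch of this fitting step might look as follows; the model string `mod`, the data object `dat.sub2`, and the choice of the `cfa()` wrapper are illustrative assumptions:

```r
library(lavaan)

# std.lv = TRUE fixes latent variances to one, so no loading needs to
# serve as a scaling indicator; do.fit = FALSE skips estimation, since
# regsem() only needs the parsed model and the sample moments.
lav.out <- cfa(mod, data = dat.sub2,
               estimator = "ML",   # lavaan default; required by regsem
               std.lv = TRUE,
               do.fit = FALSE)
```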

After running a model in **lavaan**, we can then add penalties to the uncertain structural coefficients in **regsem**. There are multiple ways to specify this in the `pars_pen` argument. By default, **regsem** penalizes all regression parameters (`pars_pen = "regressions"`). One can also specify all loadings (`pars_pen = "loadings"`), or both (`pars_pen = c("regressions", "loadings")`). Since regularized SEM is semi-confirmatory, researchers may want to leave the theory-based part of the model unpenalized. Though those unpenalized parameters are estimated along with the penalized ones, they should not be included in the `pars_pen` argument. If parameter labels are used in the **lavaan** model specification, the labels of the parameters to be penalized can be passed directly to the `pars_pen` argument. Otherwise, one can find the corresponding parameter numbers by looking at the output of the `extractMatrices()` function. An example of the R code and output is shown in Table 2.
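The two specification styles might be sketched as follows; the fitted **lavaan** object `lav.out` and the label names `a1`, `a2` are hypothetical:

```r
library(regsem)

# Option 1: penalize all parameters of a given type.
reg.out <- regsem(lav.out, lambda = 0.1, type = "lasso",
                  pars_pen = "loadings")

# Option 2: penalize specific parameters via lavaan labels
# (assumes the model string labeled them, e.g., "f1 =~ a1*DEP1 + a2*DEP2 + ...").
reg.out2 <- regsem(lav.out, lambda = 0.1, type = "lasso",
                   pars_pen = c("a1", "a2"))
```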

The package **regsem** is built upon RAM notation [23,24]. The `extractMatrices()` function extracts and returns the RAM matrices of the SEM model estimated in **lavaan**. The filter matrix F (`$F`) indicates which variables are observed (as opposed to latent), the asymmetric matrix A (`$A_est`) stores the estimated direct path coefficients, and the symmetric matrix S (`$S_est`) stores the estimated variances and covariances. These matrices are then used to derive the model-implied covariance matrix. The `$A` and `$S` matrices in the `extractMatrices()` output store the parameter number of each estimated parameter. We can refer to these matrices and then pass the desired parameter numbers to the `pars_pen` argument. For more detail on RAM notation and its application to **regsem**, see Jacobucci et al. [3]. In our example, we would like to penalize all paths with loadings smaller than 0.5 in the rotated EFA model. We deviate from the Scharf and Nestler [19] procedure in this regard, as in our experience, allowing large factor loadings to go unpenalized helps in achieving a converged solution, since fewer constraints are placed on the model. The parameters penalized are summarized in Table 3.
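Locating the parameter numbers this way could be sketched as below; the 0.5 cutoff comes from the text, while the object names (`lav.out`, `pens`) are assumptions:

```r
library(regsem)

ram <- extractMatrices(lav.out)

# $A holds the parameter number of each free directed path (0 = fixed);
# $A_est holds the corresponding estimates.
A.num <- ram$A
A.est <- ram$A_est

# Collect the parameter numbers of every loading below the 0.5 cutoff.
small <- which(abs(A.est) < 0.5 & A.num > 0)
pens  <- A.num[small]   # pass this vector to pars_pen
```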

We can then set the penalty type (e.g., `type = "lasso"`), the number of penalty values we want to test (e.g., `n.lambda = 20`), and how much the penalty should increase at each step (e.g., `jump = 0.05`). The latter two arguments may vary for different models and data, as the impact of the penalty depends on the scale of $F_{ML}$. We suggest including a wider range of penalties initially, as **regsem** will terminate at higher penalty values once all penalized parameters have been set to zero. One can also determine the penalty range by looking at the parameter trajectories, i.e., the trajectory of the value of each penalized parameter at different penalty levels. One can further visualize the chosen "optimal" penalty level by setting the `show.minimum` argument equal to the desired criterion for optimality (detailed later). The parameter trajectories for our example are shown in Figure 2. At penalty 0.02, the model becomes identified, and as shown in the plot, the optimal penalty is 0.16.
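These settings map onto a `cv_regsem()` call along the following lines; the vector of parameter numbers `pens` and the object `lav.out` are illustrative assumptions:

```r
library(regsem)

cv.out <- cv_regsem(lav.out,
                    type = "lasso",
                    pars_pen = pens,  # hypothetical parameter numbers
                    n.lambda = 20,    # test 20 penalty values
                    jump = 0.05,      # increase lambda by 0.05 each step
                    metric = "BIC")

# Parameter trajectories across penalty levels, with the
# BIC-optimal penalty marked.
plot(cv.out, show.minimum = "BIC")
```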

For `cv_regsem()` to examine the penalties and select the best model, a fit index needs to be specified as well. As the penalty increases, some parameters are set to 0, making $F_{ML}$ larger (worse). However, the degrees of freedom increase as well. Further, we can see in Table 4 a large change in the parameter estimates from lambda = 0 to lambda = 0.02. This is caused by the unpenalized model being unidentified, as there are more parameters than cells in the covariance matrix. By adding even a small penalty, the dimensionality of the model is reduced, thus resulting in more stable parameter estimates. This can further be seen in Figure 2.

The **regsem** package, by default, uses the Bayesian Information Criterion (BIC) [9] to select a final model. The BIC takes both the likelihood and degrees of freedom into account, and thus can still improve (decrease) as the penalty increases. The final model is selected as the one with the lowest BIC value among all models that converged. For a comparison of the selection performance of other information criteria, see Jacobucci et al. [3] and Lüdtke et al. [29]. The output of our example model is shown in Table 5.

The output of `cv_regsem()` contains the parameter estimates of the models fitted at each penalty level in `$parameters`, their corresponding model fit information in `$fits`, and the parameter estimates of the best-fitting model according to the chosen fit metric (here, BIC) in `$final_pars`. Note that the best-fitting model is selected only among models that converged (`"conv" = 0` in `$fits` means converged, whereas `"conv" = 1` means non-convergence). In our example, the model did not converge at several penalty levels (e.g., when `lambda = 0.20`). One can explore such penalty values further by testing multiple starting values with the `multi_optim()` function, or this can be done automatically in `cv_regsem()` by setting `multi.iter = TRUE`. The demonstration R code for the case lambda = 0.20 and the corresponding output are shown in Table 5. The BIC value of the model at this penalty level is 22,120.89, which is larger than the BIC value of 21,787.29 at `lambda = 0.16`. Thus, the optimal model is still the one with a penalty level of 0.16.
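Rechecking a non-convergent penalty value with multiple starting values might be sketched as follows; the lambda value is taken from the text, while `lav.out` and `pens` are illustrative assumptions:

```r
library(regsem)

# Re-fit the lambda = 0.20 model from several random starting values.
mo.out <- multi_optim(lav.out,
                      lambda = 0.20,
                      type = "lasso",
                      pars_pen = pens)  # hypothetical parameter numbers

# Inspect the fit of the best solution found, including its BIC,
# for comparison against the converged models in the cv_regsem() output.
summary(mo.out)
```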

In `$final_pars` of the `cv_regsem()` output, we can see that several parameters (for example, those corresponding to the paths "f1 -> DEP1" and "f1 -> DEP4") now have estimates of zero. These are the paths to be removed from the current model. We recommend the two-step "relaxed" lasso [28]; that is, refitting the model with those paths removed. For this example, 51 paths are removed based on the selection result. The path diagram of the reduced model is displayed in Figure 3. Ideally, one should evaluate this simplified model on a new set of data. Here, we demonstrate this on another random sample of 1000 from our dataset. The refitted model has a CFI of 0.946, a TLI of 0.920, and an RMSEA of 0.095 (90% CI (0.089, 0.101)).
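This validation step could be sketched as below; the reduced model string `mod.reduced` and the holdout sample `dat.holdout` are assumptions for illustration:

```r
library(lavaan)

# Refit the reduced CFA (zeroed paths removed) on a fresh sample:
# the "relaxed lasso" second step, done on the original scale.
fit.reduced <- cfa(mod.reduced, data = dat.holdout, std.lv = TRUE)

# Fit indices reported in the text: CFI, TLI, RMSEA with its 90% CI.
fitMeasures(fit.reduced,
            c("cfi", "tli", "rmsea", "rmsea.ci.lower", "rmsea.ci.upper"))
```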

Given that we now have a final CFA model derived from the use of lasso penalties, with each factor retaining only a handful of loadings, we can interpret the meaning of the resultant factors. For example, factor one predominantly includes depression items related to changes in appetite and weight; factor two items pertain to symptoms of low energy; factor three relates to symptoms that assess energy levels more broadly, rather than just low energy; factor four items map onto behavioral changes associated with depression, such as appetite, weight, sleep, and talkativeness; and factor five most represents emotional distress, such as crying, feeling worthless, and suicidal thinking. However, it is important to note that there are substantial item cross-loadings, suggesting that multiple factors have considerable conceptual overlap, as demonstrated by the fact that four of the assessed symptoms load onto four of the five factors. Thus, we are not advocating for this as a solution for researchers moving forward, but rather as a conceptual demonstration.