Evaluating Cluster-Level Factor Models with lavaan and Mplus

Background: Researchers frequently use the responses of individuals in clusters to measure cluster-level constructs. Examples are the use of student evaluations to measure teaching quality, or the use of employee ratings of organizational climate. In earlier research, Stapleton and Johnson (2019) provided advice for measuring cluster-level constructs based on a simulation study with inadvertently confounded design factors. We extended their simulation study using both Mplus and lavaan to reveal how their conclusions were dependent on their study conditions. Methods: We generated data sets from the so-called configural model and the simultaneous shared-and-configural model, both with and without nonzero residual variances at the cluster level. We fitted models to these data sets using different maximum likelihood estimation algorithms. Results: Stapleton and Johnson's results were highly contingent on their confounded design factors. Convergence rates could be very different across algorithms, depending on whether between-level residual variances were zero in the population or in the fitted model. We discovered a worrying convergence issue with the default settings in Mplus, resulting in seemingly converged solutions that are not actually converged. Rejection rates of the normal-theory test statistic were as expected, while rejection rates of the scaled test statistic were seriously inflated in several conditions. Conclusions: The defaults in Mplus carry specific risks that are easily checked but not well advertised. Our results also shine a different light on earlier advice on the use of measurement models for shared factors.


Introduction
To measure constructs at the cluster level (termed shared constructs [1,2]), researchers frequently use the responses of individuals in clusters. For example, students' evaluations may be used to measure the teaching quality of instructors, patient reports may be used to evaluate the social skills of therapists, and residents' ratings may be used to evaluate neighborhood safety.
When multiple items are used to measure such cluster-level constructs, multilevel confirmatory factor analysis (CFA) models are useful. These models allow for the evaluation of the factor structure at the cluster level (modeling the (co)variances among item means across clusters), and at the individual level (modeling the (co)variances across individuals within clusters).
If the cluster-level construct, for example teaching quality, were perfectly measured using the responses of students, all students evaluating the same teacher would agree and provide exactly the same item scores. In that case, there would not be any systematic variance in the item scores within clusters (although there would still be variance due to sampling error).
In practice, individuals within a cluster do not all provide the same responses to the items, leading to systematic variance (and covariance) to be explained at the individual level. The question then arises how the variance within clusters should be modeled. Stapleton et al. [1] proposed a model for cluster-level constructs with a saturated model at the individual level. This work was updated by proposing and evaluating several two-level factor models in Stapleton and Johnson [2]. In this article, we will provide a simulation study to evaluate under what scenarios the proposed models are able to provide sensible results, partly by replicating the study by Stapleton and Johnson. A second aim is to investigate the types and frequency of estimation problems in the software packages Mplus [3] and lavaan [4], which have different default settings that could have important consequences for convergence problems and the quality of obtained results.

Different Types of Two-Level Models
Stapleton and Johnson [2] evaluated three different models for cluster-level constructs: the 'configural model', the 'unconstrained model', and the 'simultaneous shared-and-configural model'. We will introduce these three models in the next section.

Configural Model
Configural models are factor models in which the same factor structure is applied to the within and the between level, and the factor loadings are constrained to be equal across levels. The configural model decomposes the common factor(s) into a within-cluster and a between-cluster part, meaning that the between- and within-components can be interpreted as stemming from the same latent variable. For example, one could have measured collaborative playing skills in children in different classrooms using several items, hypothesizing that collaborative playing skills systematically vary within as well as across classrooms (for a teaching-quality example, see [5]). The configural model would allow one to interpret the differences in the within-level common factor as representing within-classroom differences in collaborative play, and the differences in the between-level common factor as representing between-classroom differences in collaborative play.
The cross-level invariance of factor loadings is necessary for a meaningful interpretation of factors at the two levels [1,6-12]. The left panel of Figure 1 shows a population configural model in which each indicator's factor loading is equal across levels.
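As an illustration, cross-level invariance of loadings can be imposed in lavaan by giving the loadings the same labels at both levels. The sketch below is hypothetical: the indicator names (y1-y5) are placeholders, and the NA* modifier frees the first loading so that the within-level factor can be identified by fixing its variance to 1.

```r
# Minimal sketch of a configural two-level model in lavaan syntax.
# Repeating the labels l1-l5 at both levels constrains each loading
# to be equal across levels (cross-level invariance).
library(lavaan)

model.configural <- '
  level: 1
    fw =~ NA*y1 + l1*y1 + l2*y2 + l3*y3 + l4*y4 + l5*y5
    fw ~~ 1*fw     # within-level factor variance fixed to 1 for identification
  level: 2
    fb =~ NA*y1 + l1*y1 + l2*y2 + l3*y3 + l4*y4 + l5*y5
    fb ~~ fb       # between-level factor variance freely estimated
'
# Hypothetical fitting call (mydata and clusterid are placeholders):
# fit <- sem(model.configural, data = mydata, cluster = "clusterid")
```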

Unconstrained Model
Unconstrained two-level factor models do not have cross-level invariance of factor loadings, implying that different constructs may be measured in each cluster, or that different structures altogether dictate covariation at each level. As a result, the common factors at the two levels do not reflect the within- and between-components of the same latent variable. Instead, the theoretical meaning of the latent variables is different at the two levels (as well as across clusters). An unconstrained model resembles the left panel of Figure 1, but without equality across levels of each indicator's loading.
In practice, it can be quite difficult to describe exactly how the interpretation of the factors varies across levels. For example, if the factor loading for a specific indicator is higher at the between level than at the within level, then the factor at the between level will represent more of the content of that specific indicator (and vice versa). If the pattern of higher and lower factor loadings varies across items, it becomes harder to appropriately label the common factor at each level. The unconstrained two-level model is therefore not a theoretically useful model. The model was still included in our (and Stapleton and Johnson's) study because it is regularly applied in practice, and because we argue that an unconstrained two-level model is capable of fitting data generated from a simultaneous shared-and-configural model.

Simultaneous Shared-and-Configural Model
The simultaneous shared-and-configural model was introduced by Stapleton et al. [1] and comprises a configural model with an added common factor at the between level (see the right panel of Figure 1). In this model, the configural factor represents a 'nuisance' factor, i.e., a common factor that is accidentally measured using the individual responses, and that also systematically varies over clusters. In the examples provided by Stapleton and Johnson [2], this nuisance factor could represent an individual's tendency to always agree with statements (acquiescence), which could also have a cluster component (e.g., some clusters exhibit more acquiescence than others). The additional between-level factor, which is uncorrelated with the configural factor, then represents the objective shared construct that was the focus of the research. This shared factor does not differ within clusters, and therefore has no within-level component.
Stapleton and Johnson [2] constrained the intraclass correlation (ICC) of the configural factor to a specific value, for example by constraining the variance of the between-level factor to (0.05/0.95) times the variance of the within-level factor, leading to an ICC of 0.05 for the configural factor. The authors argued that such a constraint was needed to identify the model. They appear to have overlooked that the cross-level constraint on the factor loadings already identifies the factor variance of the configural factor at the between level. The shared factor can be identified by either fixing its variance to 1 or fixing one of its factor loadings to 1. The additional constraint is therefore not necessary to identify the model.
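In lavaan syntax, such a shared-and-configural model can be sketched as follows, without any constraint on the configural factor's ICC. This is a hypothetical five-indicator example (y1-y5 are placeholder names); the shared factor is identified by fixing its variance to 1, and it is specified as orthogonal to the configural factor.

```r
# Sketch of a simultaneous shared-and-configural model in lavaan syntax.
# The labels l1-l5 impose cross-level equality of the configural loadings;
# NA* frees a first loading that lavaan would otherwise fix to 1.
library(lavaan)

model.shared <- '
  level: 1
    fw =~ NA*y1 + l1*y1 + l2*y2 + l3*y3 + l4*y4 + l5*y5
    fw ~~ 1*fw             # within-level configural factor variance fixed to 1
  level: 2
    fb =~ NA*y1 + l1*y1 + l2*y2 + l3*y3 + l4*y4 + l5*y5
    fb ~~ fb               # between-level configural factor variance free
    shared =~ NA*y1 + y2 + y3 + y4 + y5
    shared ~~ 1*shared     # identify the shared factor by fixing its variance
    shared ~~ 0*fb         # shared factor orthogonal to the configural factor
'
# fit <- sem(model.shared, data = mydata, cluster = "clusterid")
```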

Estimation of Two-Level Models
Different algorithms have been proposed to obtain maximum likelihood (ML) estimates, and their availability and defaults vary across software. In this section we discuss four algorithms available in Mplus, two of which are also available in lavaan. We also discuss the estimation of between-level residual variances, which was confounded with model type in Stapleton and Johnson's [2] simulation study.

Maximum Likelihood Estimation Algorithms
Both lavaan [4] and Mplus [3] use normal-theory ML estimation by default for continuous variables, using the observed information matrix to derive SEs. Although also available in lavaan, only Mplus defaults to a χ2 statistic and SEs that are robust to nonnormality. The robust χ2 statistic provided by lavaan and Mplus is asymptotically equivalent to Yuan and Bentler's T2* statistic [13]. This robust test statistic (with robust SEs) is requested in lavaan with the argument estimator = "MLR" (or equivalently, test = "yuan.bentler.mplus" and se = "robust.huber.white"). Convergence of the algorithm is determined by tracking criteria at each iteration, namely the log-likelihood function and its first derivative. After any of an algorithm's convergence criteria (i.e., the rules for stopping the optimizer from iterating further) have been met, it must be verified that the optimizer in fact converged on a maximum.
In Mplus, convergence for any ML estimation algorithm can be controlled with ANALYSIS options, such as the criterion for the observed-data log-likelihood (ODLL) derivative using CONVERGENCE (for QN) or MCONVERGENCE (for EM or EMA), as well as LOGCRITERION (absolute change in log-likelihood from the previous iteration) and RLOGCRITERION (relative change in log-likelihood from the previous iteration). The maximum number of iterations is set by ITERATIONS for QN and MITERATIONS for EM(A). The default optimizer used by lavaan is nlminb from the stats package, whose options can be set by passing a named list to the control = argument (see the ?lavOptions help page). lavaan's defaults are list(iter.max = 10000, abs.tol = .Machine$double.eps * 10, rel.tol = 1e-10). When using EM, lavaan has its own dedicated arguments, shown in the last example on the tutorial page: https://lavaan.ugent.be/tutorial/multilevel.html, accessed on 31 May 2021.
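To make these options concrete, the sketch below shows how the optimizer can be tuned in lavaan. The model, data, and cluster-variable names are placeholders, and the em.* values shown are merely illustrative; consult ?lavOptions for the actual defaults.

```r
# Sketch: controlling the optimizer in lavaan (hypothetical model and data).
library(lavaan)

# QN via nlminb, with control values mirroring lavaan's documented defaults:
# fit.qn <- sem(model, data = mydata, cluster = "clusterid",
#               control = list(iter.max = 10000L,
#                              abs.tol  = .Machine$double.eps * 10,
#                              rel.tol  = 1e-10))

# EM instead of QN, with lavaan's dedicated em.* stopping rules:
# fit.em <- sem(model, data = mydata, cluster = "clusterid",
#               optim.method = "em",
#               em.iter.max = 20000,   # maximum number of EM iterations
#               em.fx.tol   = 1e-08,   # tolerance for change in log-likelihood
#               em.dx.tol   = 1e-04)   # tolerance for the largest gradient element
```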
Verifying that the optimizer converged on a maximum involves checking the first and second derivatives of the log-likelihood function with respect to the estimated parameters, respectively called the "gradient" and the "Hessian". If ML estimates were obtained, each element of the gradient vector should be effectively zero (the "first-order condition"). Upon finding a nonzero gradient element, lavaan warns that the optimizer did not find an ML solution and "estimates below are most likely unreliable." Likewise, Mplus output will contain the message: "The model estimation did not terminate normally due to a non-zero derivative of the observed-data loglikelihood," referring to the first-order condition. Any nonzero element of the gradient indicates that the corresponding parameter estimate is not an ML estimate. But the reverse is not necessarily true: if the gradient consists only of zeros, the solution could be a minimum or a saddle point rather than a maximum. To verify that the solution is a maximum, the Hessian should be negative definite. Because the Hessian is computationally expensive to obtain, this "second-order condition" is rarely checked simply to verify convergence. However, multiplying the Hessian by −1 yields the information matrix, the inverse of which is the asymptotic covariance matrix of the estimated parameters (whose diagonal contains the squared SEs). Thus, if the information matrix is not positive definite (and so cannot be inverted), a warning is issued that SEs cannot be calculated.
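In lavaan, these first- and second-order conditions can also be inspected manually on a fitted object (here, fit is a placeholder for any fitted two-level model):

```r
# Sketch: checking convergence conditions on a fitted lavaan object.
# fit <- sem(model, data = mydata, cluster = "clusterid")   # hypothetical fit

# First-order condition: every gradient element should be numerically zero.
# max(abs(lavInspect(fit, "gradient")))

# Second-order condition: the information matrix (proportional to -1 times the
# Hessian) should be positive definite, i.e., all eigenvalues strictly positive.
# min(eigen(lavInspect(fit, "information"))$values) > 0
```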
Because EMA is the default algorithm in Mplus, we describe one more computational detail about the acceleration aspect (which seems to be shared by the FS algorithm, but we did not focus on that). When the log-likelihood does not change fast enough between iterations, Mplus attempts to accelerate optimization by switching from EM to QN. This can backfire if the QN step overshoots its target, instead decreasing the log-likelihood at the next iteration. In such circumstances, Mplus restarts the EM algorithm as though the user had selected EM instead of the default EMA, because EM alone (without switching to QN) will always increase the log-likelihood between iterations. However, we discovered that a solution that apparently converges with EM after EMA has failed does not necessarily converge when explicitly setting ALGORITHM = EM. Causes and potential consequences are provided in the Results section.

Between-Level Residual Variances in Two-Level Models
When strong factorial invariance across clusters holds in a two-level model, not only will factor loadings be equal across levels, but residual variances at the between level will also be zero [11,14-16]. For configural constructs (i.e., with cross-level invariance of factor loadings), any residual variance at the between level (θ B ) can therefore be interpreted as differences in intercepts across clusters (measurement bias, also called cluster bias by Jak, Oort, and Dolan [14]). Nonzero residual variance at the between level (θ B > 0) means that the cluster-level differences in the indicators are not all explained by cluster-level differences in the common factor. In other words, variables other than what was intended to be measured cause differences in the indicator scores across clusters. In practice, it may not be realistic to expect exactly zero cluster bias for all indicators, similar to how exact invariance of intercepts generally does not hold [17]. That is, cluster invariance may hold only approximately, implying small θ B instead of zero θ B . Moreover, some indicators may be subject to cluster bias while other indicators are not (representing partial invariance [18]).
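Continuing the hypothetical five-indicator example, cluster bias can be investigated by comparing a model in which θ B is freely estimated (the lavaan default) to one in which all between-level residual variances are fixed to zero:

```r
# Sketch: a configural model without cluster bias (all between-level residual
# variances fixed to zero). Indicator names are placeholders.
library(lavaan)

model.nobias <- '
  level: 1
    fw =~ l1*y1 + l2*y2 + l3*y3 + l4*y4 + l5*y5
  level: 2
    fb =~ l1*y1 + l2*y2 + l3*y3 + l4*y4 + l5*y5
    y1 ~~ 0*y1
    y2 ~~ 0*y2
    y3 ~~ 0*y3
    y4 ~~ 0*y4
    y5 ~~ 0*y5
'
# Assuming fit.free estimates theta.B and fit.nobias fixes it to zero,
# a likelihood-ratio test compares the two nested models:
# lavTestLRT(fit.free, fit.nobias)
```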
Sample estimates of θ B vary around the population values, so when they are (nearly) zero, estimates can frequently take negative values simply due to sampling error. Thus, if strong factorial invariance across clusters holds even approximately (θ B ≈ 0), then estimating θ B under non-negative constraints may lead to trouble with convergence in samples that would have contained at least one negative variance under unconstrained estimation. By default, lavaan does not restrict θ B estimates to be positive when using QN, while Mplus does. The EM algorithm requires θ B > 0 in both packages, because it requires the between-level model-implied covariance matrix to be positive definite. The minimum-variance requirement in Mplus (set with the ANALYSIS option VARIANCE) must be between 0 and 1, so negative values for θ B are not permitted. This requirement may therefore result in nonconvergence for populations with θ B ≈ 0.
In applications of (shared-and-)configural-construct models, it can be valuable to assess cluster bias to establish whether Level-2 residual variances should be constrained to zero, which could avoid negative-variance estimates. However, nonconvergence under minimum-variance-constrained estimation when θ B ≈ 0 would preclude comparing that model to one with strong factorial invariance across clusters (θ B = 0). In our simulation study, we explicitly crossed these design factors: zero vs. nonzero θ B in the population model, and fixed vs. estimated θ B in the fitted model. We focus on conditions where the population θ B is exactly zero, representing exact invariance, but we will also show some results based on conditions with θ B = 0.01 and θ B = 0.0001. Chen, Bollen, Paxton, Curran, and Kirby [19] describe the possible causes of, consequences of, and possible strategies to handle inadmissible solutions in more detail. Negative variance estimates could result from either model misspecification or sampling error. So before fixing negative variance estimates to zero, one should first test the null hypothesis that the parameter is admissible. For example, if the 95% confidence interval for a residual variance includes positive values, then one cannot reject the null hypothesis (using α = 0.05) that the true population value is indeed positive; in this case, if the model fits well and there are no other signs of misspecification, one could conclude that the true parameter is simply close enough to zero that sampling error occasionally yields a negative estimate. For more discussion about negative variance parameters and estimates, see [20-22].

Overview of the Study
In this article, we will provide an extensive simulation study to evaluate two-level models for measuring cluster-level constructs. We will replicate the analyses of Stapleton and Johnson [2] with slight alterations. One difference is that we evaluate a simultaneous shared-and-configural model with an unconstrained ICC, instead of unnecessarily fixing the ICC of the configural factor to various specific values. In addition, we generate data with zero or nonzero between-level residual variances, and we fit models with freely estimated or with fixed (to zero) between-level residual variances. We justify our design choices and provide our expectations following the description of our design factors in Section 2.

Materials and Methods
We generated all data using R [15] (version 4.0.4). We used both lavaan [4] (Version 0.6-7) and Mplus [3] (Version 8.5) to fit all models to those data. The R syntax to generate and analyze the Monte Carlo results is available from the Open Science Framework: https://osf.io/sdwam/ (accessed on 31 May 2021).

Data Generating Models
Stapleton and Johnson [2] generated data from two population models, namely a configural model with θ B = 0 and a shared model with θ B > 0. As this confounding of factor structure and the presence of θ B is not apparent from the article, their results are easily misinterpreted. We generated data from four different population models: models with or without an additional between-level construct, and with or without cluster bias (indicated by θ B > 0).
Where possible, we use the same population values for parameters as reported in Stapleton and Johnson [2]. Figure 1 shows the population models with (left panel) and without (right panel) the additional between-level factor. All population factor variances were 1. For the configural factor, all factor loadings at the within and between levels were 0.70. The factor loadings for the shared factor were all 0.40. The θ B values were either fixed to zero or chosen to standardize the between-level factor loadings (e.g., 1 − (0.40² + 0.70²) = 0.35 in the model with the shared factor). Stapleton and Johnson only generated data from the shared-and-configural model with θ B > 0 and from the configural-only model with θ B = 0, so these factors were confounded in their study design.
These parameter values led to item intraclass correlations (ICCs) of 0.50 in all conditions with θ B > 0. In conditions with θ B = 0, item ICCs were 0.32 in the configural-model conditions and 0.39 in the shared-model conditions. The ICC of the configural factor was 0.50 in all conditions.
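The reported item ICCs follow directly from these population values. The arithmetic below assumes total within-level indicator variances of 1 (implying within-level residual variances of 1 − 0.70² = 0.51), which is consistent with the reported ICCs:

```r
# Reproducing the reported item ICCs from the population parameter values.
lw <- 0.70   # configural loading (equal at the within and between level)
ls <- 0.40   # between-level shared-factor loading

within <- lw^2 + (1 - lw^2)   # total within-level indicator variance = 1

# shared model with cluster bias (theta.B = 0.35 standardizes level 2):
(lw^2 + ls^2 + 0.35) / (lw^2 + ls^2 + 0.35 + within)   # ICC = 0.50

# configural model without cluster bias (theta.B = 0):
lw^2 / (lw^2 + within)                                 # 0.49/1.49, about 0.32

# shared model without cluster bias (theta.B = 0):
(lw^2 + ls^2) / (lw^2 + ls^2 + within)                 # 0.65/1.65, about 0.39
```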

Sample Size Conditions
We generated data for 50, 100, and 200 clusters with a fixed size of 20 individuals per cluster. These are the same sample size conditions as Stapleton and Johnson [2], but without the 300-cluster condition.

Fitted Models
We fitted three different models (unconstrained, configural, shared-and-configural) to each simulated dataset, both with freely estimated θ B and fixed θ B = 0. The unconstrained model is a two-level CFA model with one factor at each level and with freely estimated factor loadings at both levels. The factor variances of the unconstrained model are fixed at one at both levels. This model has df = 10 in the condition with freely estimated θ B , and df = 15 in the condition with fixed θ B = 0.
The configural model adds cross-level invariance constraints to the unconstrained model. With factor loadings constrained to be equal across levels, the factor variance at the between level can be freely estimated. The configural model therefore has df = 14 in the condition with freely estimated θ B , and df = 19 in the condition with fixed θ B = 0.
The shared-and-configural model adds an orthogonal between-level common factor to the configural model. In this model, the factor variance of the shared factor is fixed at 1, and the factor loadings of the shared factor are freely estimated. For the configural part of the model, the factor loadings are again constrained to be equal across levels, the variance of the within-level configural factor is fixed at 1, and its variance at the between level is freely estimated. The shared-and-configural model has df = 9 in the conditions with freely estimated θ B , and df = 14 in the conditions with fixed θ B = 0.
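The reported degrees of freedom can be verified by counting sample moments and free parameters. The sketch assumes p = 5 indicators, which is consistent with all of the df values reported above:

```r
# Verifying the reported model df by counting moments and free parameters.
p <- 5
moments <- 2 * p * (p + 1) / 2 + p   # within + between (co)variances + p means = 35

# unconstrained, theta.B free: 2p loadings, 2p residual variances, p intercepts
moments - (2*p + 2*p + p)            # df = 10

# configural, theta.B free: p cross-level equality constraints on loadings,
# but the between-level factor variance becomes a free parameter
moments - (2*p + 2*p + p) + p - 1    # df = 14

# shared-and-configural, theta.B free: adds p shared-factor loadings
moments - (2*p + 2*p + p) + p - 1 - p   # df = 9

# fixing theta.B = 0 removes p free parameters, adding 5 df in each case
```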

Estimation Options
In all sample-size conditions, we fitted all models to data from all populations using ML estimation with the QN algorithm and an observed information matrix in lavaan and Mplus. In a follow-up study holding the number of clusters constant at 100, we also compared QN to EMA (the default algorithm in Mplus) and EM in Mplus. A third study used MLR (the default estimator in Mplus) to compare our rejection rates with those reported by Stapleton and Johnson [2].

Number of Conditions and Replications
The primary study design consisted of 2 (configural or shared-and-configural population model) × 2 (θ B = 0 or θ B > 0 in the population) × 3 (sample size) = 12 data conditions. We generated 1000 datasets per condition. For all 12,000 datasets, 3 (unconstrained, configural, or shared) × 2 (θ B = 0 or θ B > 0) = 6 models were fitted with ML estimation using the QN algorithm in both software packages. In the first follow-up study, the subset of 4000 datasets with 100 clusters were analyzed using ML estimation for the same six models in both software packages, but additionally using EMA and EM in Mplus. The second follow-up study used MLR, again with only QN in lavaan but QN and EMA in Mplus.

Expectations Regarding Convergence and Rejection Rates
Multilevel structural equation modeling is very susceptible to estimation problems. Especially with small numbers of clusters, and/or little variance at the between level, non-converged and inadmissible solutions are frequently observed [19-22]. Overall, we expected more convergence problems in conditions with fewer clusters [2,8]. Table 2 depicts the four data-generating models (irrespective of sample size) in the rows, and the six fitted models in the columns. The grey cells represent the six conditions that were evaluated by Stapleton and Johnson [2]. We expected that when the correct model was fitted (cells labeled with 'T'), models would converge for nearly 100% of samples (with lower convergence in smaller samples), and rejection rates would be close to the nominal α level. When θ B > 0 was not taken into account (i.e., by fixing θ B = 0 in the analysis), we expected high rejection rates (cells labeled 'R' in Table 2). In conditions with an overparameterized model (cells labeled 'O' in Table 2), such as when θ B = 0 but was freely estimated, we expected more convergence problems, but nominal rejection rates for the converged cases.
In the remaining cells (labeled 'S' in Table 2), the shared factor was omitted from the model. Because the between-level population loadings were proportional across indicators (i.e., they were all 0.70 for the configural factor and all 0.40 for the shared factor), the two between-level factors were perfectly confounded in Stapleton and Johnson's [2] population (which we replicated). Shared-factor loadings were only identified because the configural loadings were constrained to equality across levels. As a result, the shared-factor variance would have been absorbed by the single factor at the between level (inflating its variance), so the unconstrained and configural models should have fit perfectly to data generated from the shared-and-configural model. Therefore, we expected that rejection rates in these conditions would still be nominal, in contrast to the 100% rejection rates reported by Stapleton and Johnson [2] (p. 319), which they attributed to omitting the shared factor rather than to fixing θ B = 0 when data were generated with θ B > 0.

Table 2. Overview of the four data generation conditions (rows) and the six fitted models (columns).

Stapleton and Johnson [2] (their Table 1) also reported that when data were generated from a configural-model population, fitting the unconstrained model yielded around 50% rejection rates, which should not be the case because it is an overparameterized model, as our Table 2 shows. They also reported 0% convergence when fitting the shared-and-configural model to data generated from a configural-model population, which they attributed merely to "no between-cluster covariance to be modeled above that explained by the configural" factor [2] (p. 319). Because overparameterization did not prevent convergence in their unconstrained model, we expected their 0% convergence could be at least partly related to the unnecessary constraint on the configural construct's ICC in the shared-and-configural models. Since Stapleton and Johnson used the Mplus defaults (MLR + EMA), we will only compare the rejection rates we find with their findings in a follow-up study, and not in our primary study in which we apply ML with QN.

Convergence Rates in the Primary Study
Nonconvergence in lavaan only occurred when fitting the shared-and-configural models to datasets. Regardless of θ B in the data-generating model, convergence was consistently >95% when estimating θ B . When θ B = 0 in the population, oddly, convergence problems only occurred in lavaan when appropriately fixing θ B = 0, but convergence was still approximately 80% and improved in larger samples, whereas convergence in the same conditions was notably lower in Mplus and varied erratically across sample sizes.
Regardless of the fitted model (unconstrained, configural, or shared), Mplus additionally had convergence problems (nearly 0% convergence) when estimating θ B despite θ B = 0 in the population. In contrast, lavaan had 100% convergence in the same conditions (except the conditions described in the previous paragraph). In order to evaluate the effects of generating data with exact or approximate cluster invariance, we also evaluated the conditions with 100 clusters while fixing the population θ B to 0.0001 (the default minimum value for a variance in Mplus) or to 0.01. Table A1 shows that the convergence problems when estimating θ B freely persisted in conditions with approximate instead of exact cluster invariance, with 0-0.5% convergence in conditions with generated θ B = 0.0001, and 12.4-43.1% convergence in conditions with generated θ B = 0.01. As expected, nonconvergence was generally exacerbated by fewer clusters (except when Mplus fitted shared models with θ B = 0 in the population and analysis models). For all conditions in Table 3, lavaan either converged more often than Mplus or both packages converged in 100% of samples.

Rejection Rates in the Primary Study
Rejection rates using α = 0.05 are provided in Table 4. In populations with θ B > 0, rejection rates were as expected in both Mplus and lavaan. In fact, the rates are mostly identical in conditions where convergence rates were 100% for both lavaan and Mplus, reinforcing the expectation that the two software packages provide the same results when fitting the same model to the same data, using the same estimation routine and calculating the normal-theory χ2 test statistic. In the other conditions, differences in convergence rates caused small differences between the results obtained with lavaan and Mplus with ML.
Fixing θ B = 0 yielded 100% power to reject the model, and freely estimating θ B yielded rejection rates that did not appreciably differ from the nominal 5% and were closer in larger samples. In populations with θ B = 0, however, rejection rates were nearly 0% across conditions and software (except when they could not be calculated in conditions where Mplus did not converge).

Follow-Up Study Comparing ML Algorithms

Table 5 includes the same convergence and rejection rates for the 100-cluster conditions reported in Tables 3 and 4 (i.e., using the QN algorithm), as well as the EM(A) algorithms in Mplus. Convergence rates of EMA and QN were very similar in populations with θ B > 0. When θ B = 0 in the population, EMA converged in all samples, including the conditions that fitted θ B > 0 (for which QN failed in all samples). Convergence rates for EM were consistently zero for fitted models with θ B = 0, regardless of the population model. This implies some counterintuitive results with EM. For example, fitting the correct configural model with θ B = 0 led to 0% convergence, while fitting the overparameterized configural model with θ B > 0 (while θ B = 0 in the population) converged in 90.4% of the replications. In addition, EM convergence rates were particularly low in conditions where the shared model with θ B > 0 was fitted to data generated with θ B = 0. For the converged cases, there were no notable differences in rejection rates across the different algorithms.

Nonconvergence Anomaly with Mplus

Different Mplus estimation algorithms had similar rejection rates but different convergence rates in Table 5. In order to further investigate this finding, we focus on one condition from Table 5, where the population model is a shared model with θ B = 0, but θ B is freely estimated in the fitted model. In this condition, EM failed to converge in 98.1% of the replications, QN failed to converge in 100% of the replications, yet EMA converged for each sample.
We inspected the TECH8 output, which prints the optimization history for each replication, and found that all 998 converged replications using EMA contained the message: "The optimization algorithm has changed to the em algorithm". As noted in Section 1.2.1, this occurs when a QN step (used to accelerate convergence with EM) fails to improve the log-likelihood.

Follow-Up Study Comparing ML Algorithms
We investigated further by fitting the model to the first generated data set from this condition (i.e., as a single analysis, not using the MONTECARLO feature). The analysis yielded an apparently converged solution, with a scaled χ 2 (9) = 2.817, p = 0.971. However, the optimization history showed that the default EMA algorithm failed after 111 iterations, at which point Mplus switched to EM and appeared to converge after 243 iterations. When we explicitly selected ALGORITHM = EM in the Mplus input file, we saw that the MCONVERGENCE criterion of the EM algorithm was not fulfilled, despite the apparent convergence when EM was used following the failure of EMA. Close inspection of each iteration in the optimization history for this data set revealed that Iteration 243 of the explicitly requested EM had the same log-likelihood as the final (apparently converging) Iteration 243 of EM following failure of EMA; however, when explicitly requested, EM continued iterating until the maximum number of m-iterations (500) was reached and eventually failed to converge. When we ran the MONTECARLO analysis on all 1000 data sets in this condition with ALGORITHM = EM, the model converged on a solution in none of the data sets. After requesting help from the Mplus support team about this anomaly, we learned that the convergence criteria employed for EM after failure of EMA differ from the EM defaults. Specifically, for EM after EMA, Mplus does not check the first-order condition (i.e., whether the gradient consists of zeros). As a result, any model appears to converge, including models that do not converge with EM (or ODLL), algorithms for which the gradient is checked. Effectively, this means that for results obtained with EM-after-EMA, it is not verified that the obtained parameter estimates are actual ML estimates. It is important to note that this issue is not detectable from the Mplus output file. First, there is no indication of the change in convergence criteria.
The Mplus output will list a minimum derivative value of 0.0001 as one of the convergence criteria, while in reality this criterion is ignored. Users may verify this by setting the MCONVERGENCE criterion to a large number (like 1000) in an explicit EM analysis. This will lead to the same results as obtained with an EMA analysis in which Mplus switched to EM. Second, Mplus does not provide any warning about the switch from EMA to EM. Table A2 (first column) shows the number of replications across conditions for which the algorithm changed from EMA to EM.
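To make this check concrete, the relevant options can be sketched in a minimal Mplus input fragment (the VARIABLE and MODEL commands are omitted here and would come from the user's own analysis):

```
ANALYSIS:
  TYPE = TWOLEVEL;
  ESTIMATOR = MLR;
  ALGORITHM = EM;    ! request EM explicitly rather than the default EMA
OUTPUT:
  TECH8;             ! print the optimization history, including derivatives
```

If this explicit EM run fails to converge while the default EMA run appeared to converge after switching to EM, the EMA solution should be treated with suspicion.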

Follow-Up Study Comparing ML Algorithms with Robust Corrections
The final follow-up study was conducted to reveal under what conditions Stapleton and Johnson's [2] inflated rejection rates could be replicated, given that they used the default estimation options (MLR with EMA) in Mplus. Because MLR simply begins with ML estimation, the convergence rates for MLR were the same as those reported for ML in Table 5; robust corrections are calculated only after the algorithm has converged on ML estimates. Table 6 reports rejection rates of the scaled χ 2 statistic in the 100-cluster conditions using the QN and EMA algorithms in Mplus and QN in lavaan. The results obtained with MLR show that the rejection rates of the default test statistic in Mplus can be seriously inflated under certain conditions examined in this simulation study, namely in populations for which there is strong invariance across clusters (θ B = 0). The same inflation is not apparent when using MLR (with QN) in lavaan. However, it is apparent when using QN in Mplus, except in conditions where models did not converge, in which case rejection rates cannot be calculated. Because interpreting these differential results involves considerable detail across software packages, we first focus on how our results compare to those presented by Stapleton and Johnson [2], and then discuss the differences between software packages.

Comparison with Stapleton and Johnson (2019)
The last column of Table 6 shows the rejection rates reported by Stapleton and Johnson [2]. Their inflated rejection rates with 100 clusters were replicated with Mplus using EMA (the default, used in their simulation), but only in the conditions where population θ B = 0 in the configural model. That is, when the correct model was fitted to data generated under the configural model with no residual variance, rejection rates were 11%. Also, when fitting the overparameterized unconstrained model to the same data, rejection rates were as highly inflated as Stapleton and Johnson reported, but only when the model was additionally overparameterized by freely estimating θ B . When appropriately fixing θ B = 0, the rejection rates were not nearly as inflated (i.e., 13% rather than 51%). Furthermore, the same pattern of results just described (for fitting the unconstrained model to configural-model data) can be seen not only when fitting the configural or shared models to the same data, but also when fitting any model to data from populations with a shared construct (but again, only when population θ B = 0).
The pattern of Mplus EMA results in Table 6 challenges the conclusions offered by Stapleton and Johnson [2]. First, the highly inflated rejection rates of the unconstrained model (fitted to configural-model data) should not have been attributed to overparameterization. Stapleton and Johnson [2] (p. 319) contended that unnecessarily estimating additional parameters "used unnecessary degrees of freedom, and the χ 2 test criterion was lowered," perhaps overlooking that the χ 2 test statistic itself would also decrease in this case (on average, by as much as the decrease in df). Table 6 shows that the unconstrained model's rejection rates were not substantially larger than α = 5% when appropriately estimating θ B > 0, matching Stapleton and Johnson's results when fitting an unconstrained model to data from a population with a shared construct. Note that this is also consistent with our prediction that even the configural model (which is similar to, but more constrained than, the unconstrained model) should have rejection rates near the α level. When fitting the unconstrained model to data from a population without a shared construct, Stapleton and Johnson's high rejection rates were instead due to unnecessarily fixing θ B = 0.
Finally, Stapleton and Johnson [2] observed 100% power to reject a configural model when the population includes a shared factor, whereas we predicted rejection rates close to the α level. However, Table 6 shows that their result is only consistent with fixing θ B = 0 when the population θ B > 0. Appropriately estimating θ B yielded 5% rejection rates, supporting our claim that the shared and configural factors would be confounded due to proportionally equivalent loadings across indicators. All in all, Stapleton and Johnson's results are not generalizable because they did not independently vary whether θ B = 0 in the population and in the fitted model. Note that all of these inflated error rates occurred only when using the Mplus default setting (i.e., a scaled test statistic), because the same patterns were not found in Table 5. We elaborate on the inflated error rates found with the scaled test statistic in the discussion.

Comparison of MLR Results with lavaan and Mplus
Looking at the results of lavaan with MLR in Table 6, six conditions stand out in which models were seriously misspecified because population θ B > 0 but fitted θ B = 0. In these conditions, lavaan did not provide a test statistic. Closer inspection of these results indicated that the scaled test statistic was actually not defined, due to a negative trace involved in calculating the scaling factor (i.e., the trace of UΓ [23]). Users can obtain this quantity from lavaan using the function lavInspect() with the second argument "UGamma", as well as the U or Γ matrix separately using "UfromUGamma" or "gamma". Naturally, the same issue occurs in Mplus. However, instead of providing a warning, Mplus reports the unscaled test statistic and indicates 'Undefined' for the scaling correction factor. The MONTECARLO output of Mplus does not contain any information pointing to the scaled test statistic being undefined, so we only discovered this was happening by comparing with the results reported by lavaan. Given that the models are severely misspecified in these conditions, the uncorrected test statistics are very high and will not lead to wrong conclusions in practice. The second column in Table A1 indicates, per condition, the number of samples for which lavaan indicated that the test statistic was not available.
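As a rough numerical illustration of why a negative trace breaks the scaling, consider a mean-scaled statistic of the form T/c with scaling factor c = tr(UΓ)/df. The sketch below is not lavaan's or Mplus's actual implementation, and the matrices are arbitrary stand-ins rather than quantities from any fitted model:

```python
import numpy as np

def scaled_chi_square(t_ml, U, Gamma, df):
    """Mean-scaled test statistic T / c, with c = tr(U @ Gamma) / df.
    Returns None when the trace is not positive, mirroring the
    'undefined scaling correction factor' situation described above."""
    trace = np.trace(U @ Gamma)
    if trace <= 0:
        return None  # scaling correction factor is not defined
    c = trace / df
    return t_ml / c

# Arbitrary positive-definite stand-ins for U and Gamma (illustration only)
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
U = A @ A.T
B = rng.normal(size=(4, 4))
Gamma = B @ B.T

print(scaled_chi_square(10.0, U, Gamma, df=4))   # a positive scaled value
print(scaled_chi_square(10.0, -U, Gamma, df=4))  # None: trace is negative
```

In the defined case the ML statistic is simply divided by the average of the eigenvalue-like quantities captured by tr(UΓ)/df; when that trace is negative, no meaningful correction exists, which is why lavaan reports the statistic as unavailable.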
In the conditions where population θ B > 0 and fitted θ B > 0, the results obtained with lavaan (with QN) and Mplus were identical. For population conditions with θ B = 0 and fitted θ B = 0, the rejection rates obtained with Mplus with QN were somewhat larger than those obtained with lavaan. For example, the rejection rate was 0.004 in lavaan and 0.111 in Mplus when fitting the correct configural model with θ B = 0. Since the rejection rates did not differ across lavaan and Mplus with QN or EMA for the uncorrected test statistic (reported in Table 5), the difference in results must be rooted in how the scaled test statistics are calculated. In this condition there were three samples for which the scaling correction factor was not defined, implying that the Mplus results are partly based on unscaled test statistics. This may explain a small part of the difference across packages. Another source of differences may lie in how the packages proceed when the (augmented) observed information matrix is near-singular. Near-singularity of the observed information matrix occurs often, and usually it is not a reason for concern. Neither lavaan nor Mplus prints a warning when (equality or inequality) constraints are part of the model, and the two programs likely use different approaches to handle these near-singular cases. While lavaan uses a generalized inverse, the solution in Mplus is less clear. In some cases Mplus gives the warning: "An adjustment to the estimation of the information matrix has been made". The last column in Table A3 shows the number of replications per condition for which Mplus provided this warning.
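To illustrate the distinction, a generalized (Moore-Penrose) inverse remains defined where an ordinary inverse does not. The following is a toy numpy sketch, not code from either package:

```python
import numpy as np

# A symmetric, exactly singular "information-like" matrix: two identical
# rows/columns, as can happen when constraints make parameters redundant.
# (Toy numbers for illustration; not output from either package.)
info = np.array([[2.0, 2.0, 1.0],
                 [2.0, 2.0, 1.0],
                 [1.0, 1.0, 3.0]])

# The ordinary inverse does not exist for a singular matrix...
try:
    np.linalg.inv(info)
    ordinary_inverse_ok = True
except np.linalg.LinAlgError:
    ordinary_inverse_ok = False

# ...while the Moore-Penrose generalized inverse is always defined and
# satisfies info @ ginv @ info == info (up to numerical tolerance).
ginv = np.linalg.pinv(info)
print(ordinary_inverse_ok)
print(np.allclose(info @ ginv @ info, info))
```

In the near-singular (rather than exactly singular) case the ordinary inverse may still be computed but is numerically unstable, so the choice of fallback can plausibly produce the small cross-package differences described above.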
In the conditions where population θ B = 0 and fitted θ B > 0, lavaan showed low rejection rates, as expected. The 100% nonconvergence of Mplus with QN prevents comparison of results across software packages using the same QN algorithm. Mplus with EMA, however, resulted in severely inflated Type I error rates, ranging from 0.365 to 0.515. Table A3 shows that rejection rates were also inflated in conditions where θ B was freely estimated while population θ B was not exactly zero but 0.0001 or 0.01, although the inflation was less severe (but still around 12%) in the conditions with θ B = 0.01.

Summary of the Results
For all conditions in the primary study (comparing ML with QN across packages), lavaan either converged more often than Mplus or both packages converged in 100% of samples. Mplus never converged in conditions with population θ B = 0 but fitted θ B > 0. Rejection rates of the normal-theory χ 2 test statistic were identical across packages. Our comparison of ML algorithms in Mplus showed that EM did not converge in any condition for which fitted θ B = 0. With the default Mplus settings for two-level models (MLR + EMA), Mplus often switches to the EM algorithm. When this switch is made, Mplus ignores one of the main convergence criteria (i.e., whether the algorithm in fact converged on an ML estimate, as revealed by the first derivative), meaning that the obtained results may be based on a nonconverged solution. Users are notified neither that convergence criteria are ignored nor that the switch is made (unless they specifically request and pay attention to the TECH8 output, which seems unlikely to be common practice). In our second follow-up study, we found seriously inflated scaled test statistics in Mplus in populations with θ B = 0.
Based on the comparison of our results with those of Stapleton and Johnson [2], we showed that their advice is not generalizable because they did not independently vary whether θ B = 0 either in the population or in the fitted model. In all conditions in which they reported high rejection rates, this was the result of incorrectly fixing θ B = 0, or of inflated scaled test statistics obtained by using the default settings in Mplus.

Recommendation for Practice
Our investigation reveals that the defaults in the most popular multilevel SEM software (Mplus) carry specific risks that are easily checked but not well advertised. When using Mplus with the default settings for two-level models, it is strongly recommended to check Mplus's TECH8 output to verify convergence of the solution. Specifically, when the algorithm has switched from EMA to EM, we recommend verifying that the derivative criterion in the TECH8 output is sufficiently close to zero; one can also run the model again, explicitly requesting ALGORITHM = EM. If running the model with EM and the default convergence criteria does not lead to a converged solution, then one should be suspicious of the output obtained using EMA that switches to EM. Using the default settings in lavaan does not carry the same risks: convergence rates were good, rejection rates appropriate, and there are no hidden changes to convergence checks or test statistics. Therefore, when Mplus users find evidence that their apparently converged model might not actually have converged on a maximum, we recommend fitting the model (when possible) with lavaan, whose defaults do not cause the same rates of convergence problems, nor are lavaan's test statistics (and Type I error rates) inflated in the very conditions in which Mplus results are dubious (Table 6; see also Appendix A).
Researchers should be aware that the scaled χ 2 statistic can be seriously inflated. This is in line with earlier findings [24,25]. Moreover, there exist many computational options for (scaled) test statistics, and more research is needed to evaluate which of those work best in which conditions. Also, the default implementations differ across packages. Savalei and Rosseel [23] provide an overview of computational variations and how to apply them using lavaan.

Future Research
In this study we focused only on the statistical performance (convergence rates and rejection rates of standard and scaled test statistics) of the models proposed and evaluated in Stapleton and Johnson [2]. From a theoretical perspective, we also have some concerns with respect to the simultaneous shared-and-configural model. The goal of the proposed model is to disentangle the 'objective' shared construct from the shared part of what Stapleton and Johnson described as a nuisance factor. We find it hard to imagine a realistic situation in which one would want to measure an objective shared construct using individual responses. If an objective property of a cluster (for example, a neighborhood or a school class) is to be operationalized for a study, we think one should look at objective variables at the cluster level: for example, the crime rate per neighborhood, the size of the school class, or the gender of the teacher. Stapleton and Johnson similarly proposed measuring strictly Level-2 indicators in addition to the Level-1 indicators, but to be used as additional indicators of the shared factor (p. 325). From our perspective, as soon as one asks individuals in a cluster to rate a cluster-level object, one will most probably be measuring subjective perceptions of the construct (for example, perceived neighborhood safety or perceived teaching quality), for which within-cluster differences would be expected. Future research may therefore focus on the theoretical meaning and practical applications of the models evaluated in the current research.
It was known from earlier research that the scaled χ 2 as implemented in Mplus overrejects models when the sample size is not large enough, especially with large models [25]. In line with earlier findings based on two-level models [14,26], our simulation study shows that over-rejection by this test statistic is exacerbated when between-level residual variances are zero (or small; see Table A2) in the population. Because θ B = 0 is to be expected when strong factorial invariance across clusters holds (i.e., when there is no cluster bias, even approximately), it is reasonable to assume that θ B ≈ 0, at least for some of the indicators, in a substantial part of research settings. In our study, we did not evaluate conditions with partial cluster invariance. Future research may investigate the performance of the scaled χ 2 as implemented in Mplus when population residual variances are zero for some indicators but nonzero for others.
Several Bayesian approaches have been proposed and evaluated for estimating two-level factor models [21,22,27,28]. Bayesian methods with (weakly) informative priors may be able to avoid some of the estimation problems that occur with ML estimation, specifically when sample sizes are relatively small [28]. Alternatively, factor score regression methods [29,30] could be a solution to avoid certain estimation problems. For future research, it would be interesting to evaluate the performance of these alternative methods under the conditions of our simulation study.

Note: Uncon = Unconstrained model, Conf = Configural model, Shared = Simultaneous shared-and-configural model, θ B = between-level residual variances. Reported convergence rates for these conditions were all higher than 0.973, but across θ B conditions, the frequencies of Mplus switching to the EM algorithm (and ignoring one of the convergence criteria) are very similar to the frequencies reported in the first column of Table A2, so convergence status is effectively unknown.