Evaluating the Observed Log-Likelihood Function in Two-Level Structural Equation Modeling with Missing Data: From Formulas to R Code

This paper discusses maximum likelihood estimation for two-level structural equation models when data are missing at random at both levels. Building on existing literature, a computationally efficient expression is derived to evaluate the observed log-likelihood. Unlike previous work, the expression is valid for the special case where the model implied variance–covariance matrix at the between level is singular. Next, the log-likelihood function is translated to R code. A sequence of R scripts is presented, starting from a naive implementation and ending at the final implementation as found in the lavaan package. Along the way, various computational tips and tricks are given.


Introduction
Applied users are increasingly using multilevel structural equation modeling (SEM), partly due to its wide availability in several software packages. The open-source R package lavaan [1] currently only handles two-level continuous data using maximum likelihood (ML) estimation. Until now, data had to be complete: all cases with missing values were removed from the data before the analysis could start. However, in lavaan version 0.6-9, support for missing data is finally available. I was fortunate to find two excellent sources in the literature [2,3] that discuss the (observed) log-likelihood function for two-level SEMs. Both sources describe how this log-likelihood function can be re-expressed in a computationally convenient form. From here on, I will refer to these two results as the McDonald solution and the du Toit and du Toit solution, respectively. Interestingly, the two solutions are different but equivalent, which I will clarify in this paper. Both solutions also suffer from the requirement that the model-implied variance-covariance matrix of the variables at the between level (Σ_b) must be nonsingular. In practice, this often leads to numerical problems when Σ_b is singular by design, or nearly singular for a particular model/data combination.
I have several goals for this paper. First, I wish to document the exact formulas that are used by lavaan 0.6-9 to compute the so-called observed log-likelihood used to obtain the ML estimates for two-level SEMs when data are missing. Second, I will suggest a small modification to the existing solutions that will allow Σ b to be singular. Finally, I will show and discuss five R scripts that evaluate the log-likelihood function. An example model and dataset are also provided so that readers can run these scripts on their own machines. Because R can be very slow (compared to compiled programming languages), care is needed to obtain reasonably efficient code. The five scripts document the various stages of the development of "real" R code and hopefully offer some insight into the typical process that is used (at least by me) to translate formulas to production-ready R code.
The paper is organized as follows. In Section 2, I briefly review single-level SEM, with a focus on the log-likelihood function (or an equivalent discrepancy function), which is needed to obtain ML estimates. In Section 3, I discuss the log-likelihood function that is used for two-level SEM. In both sections, I first discuss the complete data setting, followed by the missing data setting. In Section 4, I discuss the R code listed in the Appendices. The paper ends with some final thoughts in Section 5.

SEM for Single-Level Data
In this section, I briefly describe the statistical model used in traditional SEM if the data originate from a single population. The purpose of this section is mostly to introduce notation and to provide some building blocks for the two-level sections. For Equations (1)- (8), I have used the LISREL all-y notation [4] to describe the basic formulas, but I could as well have used the Bentler-Weeks [5] or the RAM [6] notation. The remaining formulas are unaffected by this choice.

Complete Data
Let y be a P-dimensional vector randomly drawn from a population with mean vector µ and variance-covariance matrix Σ. A common working assumption is that the population distribution is normal. In an SEM analysis, a model is postulated that implies a certain structure for µ = E(y) and Σ = Var(y). The model typically consists of two parts: a measurement part and a structural part. The measurement part of the model is defined as

y = ν + Λη + ε, (1)

where y is a P × 1 vector of observed variables, ν is a P × 1 vector of intercepts, Λ is a P × M matrix of factor loadings relating the latent variables to the observed variables, and ε is a P × 1 vector of residual errors. The structural part of the model (also called the "latent variable model") is defined as

η = α + Bη + ζ, (2)

where η is an M × 1 vector of latent variables, α is an M × 1 vector of intercepts, B is an M × M matrix of coefficients for the regressions of the latent variables on each other, and ζ is an M × 1 vector of disturbance terms. The diagonal elements of B must be zero, and (I − B) should be invertible. To simplify the notation, I will use the convention that an observed variable involved in the structural part of the model is upgraded to a latent variable, with the observed variable as its only indicator with no measurement error. To derive the model-implied moments, the structural part is written in its so-called reduced form, where the endogenous variables only appear on the left-hand side of the equation:

η = (I − B)^{-1}(α + ζ). (3)

Substituting (3) into (1) gives the combined model

y = ν + Λ(I − B)^{-1}(α + ζ) + ε. (4)

Then, assuming E(ε) = 0 and Cov(ζ, ε) = 0 and using the notation Var(ζ) = Ψ and Var(ε) = Θ, it follows from (4) that

µ = ν + Λ(I − B)^{-1}α, (7)

Σ = Λ(I − B)^{-1} Ψ [(I − B)^{-1}]' Λ' + Θ. (8)

The model matrices are then ν, α, Λ, Θ, Ψ, and B. A structural model can be defined by setting some elements of these model matrices to a fixed constant (often zero or unity), while allowing other elements to be free. The T free elements are collected in a T-dimensional parameter vector θ.
For a given choice of values of θ, the model-implied moments can be written as µ(θ) and Σ(θ) to stress that they are a function of these free parameters. Given a set of i.i.d. data vectors Y = {y_1, y_2, ..., y_N}, ML estimation seeks those values of θ that maximize the multivariate normal log-likelihood given by

logl(θ) = −(NP/2) log 2π − (N/2) log|Σ(θ)| − (1/2) Σ_{i=1}^{N} (y_i − µ(θ))' Σ(θ)^{-1} (y_i − µ(θ)). (9)

If the data are complete (i.e., no missing values), this log-likelihood can be rewritten in terms of the sample moments:

logl(θ) = −(N/2) [ P log 2π + log|Σ(θ)| + tr(Σ(θ)^{-1} S) + (ȳ − µ(θ))' Σ(θ)^{-1} (ȳ − µ(θ)) ], (10)

where ȳ is the sample mean vector, S is the (biased) sample variance-covariance matrix (divided by N), and tr[·] is the trace operator that computes the sum of the diagonal elements of its matrix argument. The value of θ that maximizes the log-likelihood in (10) is called the ML estimator of θ and will be denoted by θ̂. The resulting model-implied moments will be denoted by µ(θ̂) and Σ(θ̂), or µ̂ and Σ̂ for short. The unrestricted (or saturated) model simply assumes that θ̂ = (ȳ', vech[S]')', where vech[·] is an operator that stacks the elements of the lower triangular part of a symmetric matrix columnwise in a vector. As a result, for the unrestricted model, Σ̂ = S and µ̂ = ȳ. In this case, the log-likelihood equals

logl(ȳ, S) = −(N/2) [ P log 2π + log|S| + P ],

because ȳ − µ(θ̂) = ȳ − ȳ = 0 and tr[Σ(θ̂)^{-1} S] = tr[S^{-1} S] = tr[I_P] = P, with I_P an identity matrix of dimension P. Comparing the restricted model to the saturated model can be accomplished by the standard likelihood ratio test (LRT) given by

LRT = −2 [ logl(θ̂) − logl(ȳ, S) ] = N F_ML(θ̂),

where

F_ML(θ) = log|Σ(θ)| + tr(Σ(θ)^{-1} S) − log|S| − P + (ȳ − µ(θ))' Σ(θ)^{-1} (ȳ − µ(θ)) (14)

is commonly called the normal theory-based ML discrepancy function [7]. Because logl(ȳ, S) does not depend on θ, minimizing (14) is equivalent to maximizing (10). In the common case, when the mean structure µ(θ) is unrestricted (hence µ(θ̂) = ȳ), the discrepancy function simplifies to

F_ML(θ) = log|Σ(θ)| + tr(Σ(θ)^{-1} S) − log|S| − P. (15)

To minimize the objective function in (14) or (15), an optimization method is needed. Popular choices include Newton-Raphson, Fisher scoring, and various quasi-Newton methods.
By default, the lavaan package uses a quasi-Newton method implemented in the nlminb() function, although a custom implementation of Fisher scoring is also available.
In any case, it is necessary to evaluate the objective function many, many times. Therefore, it is crucial to be able to evaluate this function in a fast and efficient way. Fortunately, from a computational point of view, both (14) and (15) are easy objective functions to evaluate. The only complication is the inverse and the determinant of Σ(θ), but this matrix has dimension P × P, and P is usually small. The determinant of S only needs to be computed once. Note that the computation time is independent of the sample size because (14) and (15) only rely on summary statistics.
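To make this concrete, here is a minimal R sketch of the complete-data discrepancy function in (14), computed from summary statistics only (the function name fml and all variable names are mine, not lavaan's):

```r
# Normal-theory ML discrepancy F_ML, as in (14):
# log|Sigma| + tr(Sigma^-1 S) - log|S| - P + (ybar - mu)' Sigma^-1 (ybar - mu)
fml <- function(Sigma, Mu, S, ybar) {
  P <- ncol(S)
  diff <- ybar - Mu
  log(det(Sigma)) + sum(diag(solve(Sigma, S))) -
    log(det(S)) - P + drop(t(diff) %*% solve(Sigma, diff))
}

# toy check: at Sigma = S and Mu = ybar, the discrepancy is zero
S    <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
ybar <- c(0.3, -0.1)
fml(S, ybar, S, ybar)  # 0 (up to floating point)
```

Note that the computation indeed touches only the summary statistics ȳ and S, never the raw data, so its cost does not depend on N.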

Missing Data
When some values in y are missing, I will assume in this paper that the missing mechanism is either missing completely at random (MCAR) or missing at random (MAR) [8]. In this case, ML can still be used to estimate θ by using all available data. This is sometimes called full information maximum likelihood (FIML) estimation. Because it is no longer possible to summarize the incomplete data by its sample mean and variance-covariance matrix, I revert to the casewise log-likelihood function. For notational purposes, it is convenient to introduce a selection matrix Q_i for each observation. A selection matrix can convert a vector that includes both observed and missing components (say, y_i) to a complete vector containing only the observed components (say, y_i^o). To construct these selection matrices, I start from an identity matrix and remove the rows that correspond to the missing values. For example, suppose P = 5 and the second and fourth value are missing in y_i; thus, Q_i is defined to be

Q_i = [ 1 0 0 0 0
        0 0 1 0 0
        0 0 0 0 1 ]. (16)

Let P_i be the number of rows of Q_i. P_i may be different from observation to observation. For each observation, µ_i and Σ_i are defined as follows:

µ_i(θ) = Q_i µ(θ) and Σ_i(θ) = Q_i Σ(θ) Q_i'. (17)

With this notation in place, the log-likelihood for a single observation can be written as

logl_i(θ) = −(P_i/2) log 2π − (1/2) log|Σ_i(θ)| − (1/2) (y_i^o − µ_i(θ))' Σ_i(θ)^{-1} (y_i^o − µ_i(θ)). (18)

The total log-likelihood for the entire sample is obtained by summing over the individual log-likelihoods:

logl(θ) = Σ_{i=1}^{N} logl_i(θ). (19)

In practice, instead of maximizing (19), it is convenient to minimize an equivalent expression where the first term of (18) is dropped (because it does not depend on θ) and where (18) is multiplied by a factor minus two. The resulting objective function for the entire sample then becomes

F(θ) = Σ_{i=1}^{N} [ log|Σ_i(θ)| + (y_i^o − µ_i(θ))' Σ_i(θ)^{-1} (y_i^o − µ_i(θ)) ]. (20)

Evaluating this objective function is much more costly than evaluating the objective function in (14) for the complete data setting. The determinant and the inverse of Σ_i(θ) need to be computed for every observation, and the computation time will increase (linearly) with the sample size.
To decrease the computational burden, two techniques are commonly used. The first technique is to combine all observations that share the same missing pattern: for these observations, the determinant and the inverse of Σ_i(θ) only need to be computed once. Moreover, the observations within a pattern can be summarized by their summary statistics, so that (14) can be used per pattern. In many datasets, the number of missing patterns is far smaller than the number of observations. The second technique relates to the computation of the determinant and the inverse of Σ(θ) for each pattern. Instead of computing these quantities from scratch for every pattern, they may be computed once for the complete pattern and then "updated" for each pattern by exploiting the well-known formulas for the inverse and determinant of a partitioned matrix (see Appendix B.1). This updating approach is much faster than computing the determinant and the inverse from scratch. By combining these two techniques, the computation time needed to evaluate (20) can be lowered dramatically.
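The updating idea can be sketched in a few lines of R. Given the inverse H of the full P × P matrix, the inverse of the submatrix for one missing pattern follows from the partitioned-inverse formula, without a new factorization (the helper name inverse_update is mine; lavaan's own version is lav_matrix_symmetric_inverse_update()):

```r
# Inverse of Sigma[-rm.idx, -rm.idx], computed from H = solve(Sigma) alone,
# via the partitioned-inverse formula: H.oo - H.om H.mm^-1 H.mo
inverse_update <- function(H, rm.idx) {
  H[-rm.idx, -rm.idx, drop = FALSE] -
    H[-rm.idx, rm.idx, drop = FALSE] %*%
    solve(H[rm.idx, rm.idx, drop = FALSE],
          H[rm.idx, -rm.idx, drop = FALSE])
}

set.seed(42)
Sigma <- crossprod(matrix(rnorm(25), 5, 5)) + diag(5)  # positive definite 5 x 5
H <- solve(Sigma)
# missing pattern: variables 2 and 4 unobserved
max(abs(inverse_update(H, c(2, 4)) - solve(Sigma[-c(2, 4), -c(2, 4)])))  # ~ 0
```

Only the (small) block of H corresponding to the removed variables needs to be factorized, which is where the speed gain comes from.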

Implementation in lavaan
Readers can explore the source code of the lavaan package, which can be found on GitHub at https://github.com/yrosseel/lavaan/tree/master/R. The file lav_mvnorm.R contains functions that are used when data are complete. The evaluation of the objective function for complete data in (14) is done in a function called lav_mvnorm_loglik_samplestats(). The file lav_mvnorm_missing.R contains various functions that handle missing data (assuming multivariate normality). The function used to evaluate the objective function in (20) in the presence of missing data is called lav_mvnorm_missing_loglik_samplestats(). The code for updating the determinant and the inverse of a symmetric matrix after removing rows and columns from that matrix can be found in the function lav_matrix_symmetric_inverse_update(), in the file lav_matrix.R.

SEM for Two-Level Data
In this section, I examine two-level data. Two-level data arise from a nested sampling scheme, where level-1 units (e.g., students) are nested within level-2 units (e.g., schools). Next, I briefly discuss the complete data setting before delving into the main topic of this paper: two-level data with missing values.

Complete Data
Given J clusters with variables measured at both levels, let y ji be the P-dimensional vector of level-1 variables from unit i in cluster j, where i = 1, 2, . . . , n j and j = 1, 2, . . . , J. Let z j be the K-dimensional vector of level-2 variables for cluster j. Combining z j with the n j observations of cluster j, a P j = K + (n j × P) dimensional vector v j can be defined containing all observed variables for a single cluster j: v j = (z j , y j1 , y j2 , . . . , y jn j ) .
(21) I will use the notation µ_j = E(v_j) and V_j = Var(v_j). In this paper, I only consider two-level SEMs with random intercepts. For models with random slopes, see [9]. When only random intercepts are involved, the level-1 vector y_ji can be split into a within and a between part:

y_ji = y_jb + y_jwi, (22)

where it is assumed that Cov(y_jb, y_jwi) = 0 and Cov(z_j, y_jwi) = 0 for all i and j. I will use the notation µ_y = E(y_ji), Σ_b = Var(y_jb), and Σ_w = Var(y_jwi) for all i and j. For some level-1 variables in the model, it may be convenient to set the between part in (22) to zero. I refer to these variables as "within-only" variables. The corresponding rows and columns in Σ_b are then zero. In this paper, it is always assumed that Σ_w is positive definite, but Σ_b can be singular. Further, I will use the notation µ_z = E(z_j), Σ_zz = Var(z_j), and Σ_zy = Cov(z_j, y_jb) for all j. The mean vector µ_j can then be written as

µ_j = ( µ_z', (1_{n_j} ⊗ µ_y)' )', (23)

where 1_{n_j} is a unity vector (all elements equal to one) of size n_j, and ⊗ is the Kronecker-product operator. The variance-covariance matrix V_j can be written as

V_j = [ Σ_zz             1_{n_j}' ⊗ Σ_zy
        1_{n_j} ⊗ Σ_yz   (1_{n_j} 1_{n_j}') ⊗ Σ_b + I_{n_j} ⊗ Σ_w ], (24)

where I use Σ_yz = Σ_zy'. Note that 1_{n_j} 1_{n_j}' is just a matrix of ones of dimension n_j × n_j. Structural models may now be defined by restricting the elements of µ_z, µ_y, Σ_zz, Σ_zy, Σ_w, and Σ_b to be a function of a parameter vector θ. This can be accomplished by setting up an SEM for each level, resulting in a set of model matrices for the within part and another set of model matrices for the between part-not unlike a two-group SEM analysis. For a given model and a given cluster, µ_j and V_j can be written as a function of θ. Given data Z = {v_1, v_2, . . .
, v_J} for a random sample of J clusters, the total log-likelihood for the complete sample is given by

logl(θ) = Σ_{j=1}^{J} logl_j(θ), (25)

where

logl_j(θ) = −(P_j/2) log 2π − (1/2) log|V_j(θ)| − (1/2) (v_j − µ_j(θ))' V_j(θ)^{-1} (v_j − µ_j(θ)). (26)

Again, maximizing the log-likelihood can be replaced by minimizing the objective function:

F(θ) = Σ_{j=1}^{J} [ log|V_j(θ)| + (v_j − µ_j(θ))' V_j(θ)^{-1} (v_j − µ_j(θ)) ]. (27)

Note the close similarity with Equations (18)-(20); a crucial difference, however, is that P_j (and therefore V_j) can become very large in applications. For example, consider a fairly large dataset with P = 20 and K = 5. For a given cluster with n_j = 100, P_j = 5 + 100 × 20 = 2005; this would necessitate finding the determinant and the inverse of a 2005 × 2005 dimensional matrix V_j, only for this cluster. From a computational point of view, this is not practical. To overcome this problem, the special structure of V_j must be exploited. Ideally, (27) can be rewritten so that the determinant and the inverse of the matrices that need to be computed are of size P or K. Before showing a solution, I must first introduce some notation. The matrix V_j in (24) consists of four blocks and can be written symbolically as

V_j = [ V_j(zz)  V_j(zy)
        V_j(yz)  V_j(yy) ].

For the corresponding blocks of the inverse of V_j, I will use the following notation:

V_j^{-1} = [ V_j^zz  V_j^zy
             V_j^yz  V_j^yy ].

Further, I define δ_j to be the difference between v_j and µ_j:

δ_j = v_j − µ_j = ( δ_j(z)', δ_j(y)' )',

where the last term shows that δ_j is partitioned into a z part and a y part-just like V_j. The quadratic form in (27) can be written as:

δ_j' V_j^{-1} δ_j = δ_j(z)' V_j^zz δ_j(z) + 2 δ_j(z)' V_j^zy δ_j(y) + δ_j(y)' V_j^yy δ_j(y) = q_j^zz + 2 q_j^zy + q_j^yy.

The objective function can then be written as follows:

F(θ) = Σ_{j=1}^{J} [ log|V_j(θ)| + q_j^zz + 2 q_j^zy + q_j^yy ]. (34)

The main task now is to find a computationally convenient expression for the three quadratic terms q_j^yy, q_j^zy, and q_j^zz as well as for the determinant of V_j. McDonald and Goldstein [10] accomplished this more than 30 years ago. They obtained expressions for the determinant and for the quadratic forms in which the n_j × P matrix Y_j contains all level-1 observations within cluster j. Although the resulting function may seem more complicated, it can be evaluated much faster than the original function in (27).
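To see why the naive route is hopeless, the direct construction of µ_j and the block matrix V_j in (24) for one cluster takes only a few lines of R (toy dimensions and matrices; purely illustrative):

```r
# Naive construction of mu_j and V_j for a single cluster (toy values)
P <- 3; K <- 2; nj <- 4
Sigma.w  <- diag(P); Sigma.b <- 0.5 * diag(P)
Sigma.zz <- diag(K); Sigma.zy <- matrix(0.2, K, P)
mu.y <- rep(0, P); mu.z <- rep(0, K)
ones <- matrix(1, nj, 1)

mu.j <- c(mu.z, kronecker(ones, matrix(mu.y, P, 1)))   # length K + nj*P
V.j  <- rbind(
  cbind(Sigma.zz, kronecker(t(ones), Sigma.zy)),
  cbind(kronecker(ones, t(Sigma.zy)),
        kronecker(ones %*% t(ones), Sigma.b) + kronecker(diag(nj), Sigma.w)))
dim(V.j)  # 14 x 14, i.e., (K + nj*P) x (K + nj*P)
```

Already for n_j = 100 and P = 20, V.j would be 2005 × 2005, which is exactly why the structured expressions are needed.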
A little-known fact (see [11,12]) is that instead of taking the sum over all clusters, it is possible to take the sum over all cluster sizes. The number of different cluster sizes (S) is often much smaller than the number of clusters (J). In the balanced case where all clusters have the same size, S = 1 and no summation is needed at all. Finally, further simplification can be achieved by making use of the pooled within-sample covariance matrix

S_w = (1 / (N − J)) Σ_{j=1}^{J} Σ_{i=1}^{n_j} (y_ji − ȳ_j)(y_ji − ȳ_j)',

where ȳ_j is the mean vector of cluster j. This results in the objective function that is effectively used by lavaan, in which N_s is the number of observations from all the clusters that have the same cluster size n_s, and δ_s, t_s, and g_s are now based on the observations of all clusters that share the same cluster size. The actual R code that is used by lavaan to evaluate this objective function can be found in the function lav_mvnorm_cluster_loglik_samplestats_2l() in the file lav_mvnorm_cluster.R.
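A pooled within-sample covariance matrix can be computed in R along the following lines (a sketch using the common N − J scaling; the helper name pooled_within is mine):

```r
# Pooled within-sample covariance: center each variable around its
# cluster mean, then divide the cross-product matrix by N - J
pooled_within <- function(Y, cluster) {
  cluster.means <- apply(Y, 2, function(col) ave(col, cluster))
  Yc <- Y - cluster.means
  crossprod(Yc) / (nrow(Y) - length(unique(cluster)))
}

# toy data: N = 4 observations on 2 variables, J = 2 clusters
Y <- matrix(c(1, 2, 3, 4,
              10, 20, 30, 40), nrow = 4)
pooled_within(Y, cluster = c(1, 1, 2, 2))
```

Because the between-cluster variation is removed by the centering step, this matrix summarizes only the within-level information.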

Missing Data
To handle missing values at both levels, it will be convenient to introduce selection matrices Q_ji and G_j for the first and second levels, respectively. The matrix Q_j is defined to be the collection of all Q_ji matrices from the same cluster, stacked on top of each other:

Q_j = ( Q_j1', Q_j2', ..., Q_jn_j' )'.

It is now possible to express µ_j and V_j as follows:

µ_j = ( (G_j µ_z)', (Q_j µ_y)' )'

and

V_j = [ G_j Σ_zz G_j'   G_j Σ_zy Q_j'
        Q_j Σ_yz G_j'   V_j(yy) ].

The V_j(yy) matrix is defined as

V_j(yy) = Q_j Σ_b Q_j' + ⊕_{i=1}^{n_j} Q_ji Σ_w Q_ji',

where the direct sum operator ⊕ is used to create a block diagonal matrix, with each block being equal to Q_ji Σ_w Q_ji'. If data are complete in cluster j, then G_j and Q_ji become identity matrices. In that case, Q_j Σ_b Q_j' equals 1_{n_j} 1_{n_j}' ⊗ Σ_b, ⊕_{i=1}^{n_j} Q_ji Σ_w Q_ji' equals I_{n_j} ⊗ Σ_w, G_j Σ_zy Q_j' equals 1_{n_j}' ⊗ Σ_zy, and G_j Σ_zz G_j' equals Σ_zz. Now that expressions for µ_j and V_j are available, the model parameters can be estimated using ML by minimizing the following objective function:

F(θ) = Σ_{j=1}^{J} [ log|V_j(θ)| + (v_j^o − µ_j(θ))' V_j(θ)^{-1} (v_j^o − µ_j(θ)) ], (51)

where v_j^o only contains the observed values from v_j:

v_j^o = ( z_j^o', y_j1^o', y_j2^o', ..., y_jn_j^o' )'.

The objective function can be written again as in (34). Just like in the complete case, the challenge is to rewrite this objective function so as to avoid computing the determinant and the inverse of the formidable matrix V_j directly. Below, I briefly describe two solutions, put forth by McDonald [2] and du Toit and du Toit [3], respectively.
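The selection matrices themselves are easy to build in R: start from an identity matrix and drop the rows of the missing entries (an illustrative helper of mine, not lavaan code):

```r
# Selection matrix for one (possibly incomplete) data vector:
# an identity matrix with the rows of the missing entries removed
selection_matrix <- function(y) diag(length(y))[!is.na(y), , drop = FALSE]

y  <- c(1.2, NA, 0.7, NA, -0.3)      # P = 5; entries 2 and 4 are missing
Qi <- selection_matrix(y)
dim(Qi)                              # 3 x 5
drop(Qi %*% replace(y, is.na(y), 0)) # picks out the observed values
```

The same helper serves for both levels: applied to z_j it produces G_j, applied to y_ji it produces Q_ji.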

The McDonald (1993) Solution
In Appendix B.1, two identities are shown to invert a partitioned matrix. McDonald starts from the second identity, and he develops an expression for V_j^yy. Using my notation, the expression becomes

V_j^yy = ( V_j(yy) − Q_j Σ_yz G_j' Σ_j^zz G_j Σ_zy Q_j' )^{-1},

where Σ_j^zz is defined by

Σ_j^zz = ( G_j Σ_zz G_j' )^{-1}.

By moving the within-level term to the front and rearranging the remaining terms, the expression can be rewritten as

V_j^yy = ( ⊕_{i=1}^{n_j} Q_ji Σ_w Q_ji' + Q_j (Σ_b − Σ_yz G_j' Σ_j^zz G_j Σ_zy) Q_j' )^{-1}. (55)

To simplify the notation, I will define

Λ_j = ⊕_{i=1}^{n_j} Q_ji Σ_w Q_ji'

and

Σ_j(b.z) = Σ_b − Σ_yz G_j' Σ_j^zz G_j Σ_zy.

This means Equation (55) can be written in a more compact form:

V_j^yy = ( Λ_j + Q_j Σ_j(b.z) Q_j' )^{-1}. (58)

This is a typical example wherein the Woodbury identity (see Appendix B.2) can be exploited. Importantly, McDonald assumes that both Σ_b and Σ_j(b.z) are positive definite. He then uses the classic Woodbury identity in (A25) to obtain the following expression for V_j^yy:

V_j^yy = Λ_j^{-1} − Λ_j^{-1} Q_j ( Σ_j(b.z)^{-1} + Q_j' Λ_j^{-1} Q_j )^{-1} Q_j' Λ_j^{-1}. (59)

Note that Λ_j is a block diagonal matrix. Therefore, the inverse Λ_j^{-1} can be easily computed by inverting the individual blocks (see Appendix B.4). From this, an expression for the quadratic form q_j^yy may be produced:

q_j^yy = δ_j(y)' Λ_j^{-1} δ_j(y) − p_j' ( Σ_j(b.z)^{-1} + A_j )^{-1} p_j,

where I have used the following notation:

A_j = Q_j' Λ_j^{-1} Q_j and p_j = Q_j' Λ_j^{-1} δ_j(y).

For the q_j^zz term, V_j(zz) must first be inverted, using the second identity for a partitioned matrix once more. The Woodbury push-through (right) identity (see Appendix B.2) may then be applied to the result, which yields an expression for V_j^zz and, from it, the quadratic form q_j^zz. Finally, instead of q_j^zy, an expression for q_j^yz must be found; the inverse of V_j(yz) is written out, and the identity from Equation (A27) is applied, which yields the quadratic form q_j^zy. For the determinant of V_j, the formula for the determinant of a partitioned matrix in (A24) can be used to write |V_j| as

|V_j| = |G_j Σ_zz G_j'| × |Λ_j + Q_j Σ_j(b.z) Q_j'|.

The first determinant is simple because the largest dimension of G_j Σ_zz G_j' is K × K (or smaller, if missing values are present in z_j). The argument of the second determinant, however, is still too large to be practical.
McDonald [2] suggests a recursive formula, where the second determinant is computed in an incremental fashion. However, here I will give an explicit solution, making use of the determinant identity in (A31):

|Λ_j + Q_j Σ_j(b.z) Q_j'| = |Λ_j| × |Σ_j(b.z)| × |Σ_j(b.z)^{-1} + A_j|.

The (log) determinant of the block-diagonal Λ_j is simply the sum of the (log) determinants of its blocks, and the remaining factors only involve P × P matrices. To summarize and for future reference, the final expressions are the determinant above together with the three quadratic forms q_j^yy, q_j^zy, and q_j^zz derived in this section. To make the connection with the expressions for the complete data setting, note that when data are complete, G_j and all Q_ji become identity matrices. Furthermore, when data are complete, both p_j and A_j contain Σ_w^{-1}. Therefore, it can be verified (using the identity (AB)^{-1} = B^{-1}A^{-1} for nonsingular A and B and the classic Woodbury identity) that the missing-data expressions reduce to their complete-data counterparts.
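The classic Woodbury identity that drives the McDonald solution is easy to verify numerically in R (toy matrices of my own, with both parts nonsingular):

```r
# Check: (Lambda + Q Sigma Q')^-1 equals
#   Lambda^-1 - Lambda^-1 Q (Sigma^-1 + Q' Lambda^-1 Q)^-1 Q' Lambda^-1
set.seed(1)
Lambda <- crossprod(matrix(rnorm(36), 6, 6)) + diag(6)  # 6 x 6, positive definite
Q      <- matrix(rnorm(18), 6, 3)
Sigma  <- crossprod(matrix(rnorm(9), 3, 3)) + diag(3)   # 3 x 3, positive definite

lhs <- solve(Lambda + Q %*% Sigma %*% t(Q))
Li  <- solve(Lambda)
rhs <- Li - Li %*% Q %*% solve(solve(Sigma) + t(Q) %*% Li %*% Q) %*% t(Q) %*% Li
max(abs(lhs - rhs))  # ~ 0
```

In the two-level formulas, the 6 × 6 part plays the role of the (large but block-diagonal) Λ_j and the 3 × 3 part the role of Σ_j(b.z): the inverse on the left is big, the one that actually has to be factorized on the right is small.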

The du Toit and du Toit (2008) Solution
To invert the partitioned matrix V_j, McDonald [2] made use of the second identity in (A21) to develop the formulas. By contrast, du Toit and du Toit [3] start from the first identity in (A20). Because the rest of the paper will focus on the McDonald solution, the details, together with the final expressions for the determinant and the three quadratic forms, are given in Appendix A.

An Extension to Allow Σ b to Be Singular
An important limitation of both approaches is that they both assume Σ_b and (in the case of McDonald) Σ_j(b.z) are nonsingular and therefore invertible. However, sometimes rows and columns in Σ_b are all zero (e.g., if they correspond to within-only variables), or restrictions in the model have been imposed that result in a singular Σ_b. For example, consider a one-factor model at the between level. If the residual variances are fixed to zero, then Σ_b will be singular. This is a common setting, as zero residual variances are sometimes used in cluster-invariant measurement models [13,14]. In other settings, Σ_b may be nearly singular, potentially leading to numerical instabilities. For all these reasons, it would be useful if the previous solutions could be adapted to handle the case where Σ_b is singular. The solution is simply to replace the classic Woodbury identity in (A25) by the more general version in (A26). I will illustrate how this changes the final expressions for the McDonald solution. The logic for the du Toit and du Toit approach is similar.
Consider again the expression for V_j^yy as it was presented in Equation (58):

V_j^yy = ( Λ_j + Q_j Σ_j(b.z) Q_j' )^{-1}.

Applying the classic Woodbury identity results in an expression (59) that contains the inverse of Σ_j(b.z). However, if the more general identity (A26) is used, then (59) may be rewritten as follows:

V_j^yy = Λ_j^{-1} − Λ_j^{-1} Q_j Σ_j(b.z) ( I_P + A_j Σ_j(b.z) )^{-1} Q_j' Λ_j^{-1},

with A_j = Q_j' Λ_j^{-1} Q_j. In this expression, Σ_j(b.z) does not need to be inverted. On the other hand, if Σ_j(b.z) is invertible, then it follows that

Σ_j(b.z) ( I_P + A_j Σ_j(b.z) )^{-1} = ( Σ_j(b.z)^{-1} + A_j )^{-1}.

The term on the right-hand side is used in the McDonald solution, but if it is replaced by the term on the left-hand side, alternative expressions for the quadratic forms emerge. None of these expressions requires Σ_b or Σ_j(b.z) to be nonsingular, making them much more useful in practice. The last task is to find an expression for the determinant of V_j that can be used when Σ_b or Σ_j(b.z) is singular. Here, I will make use of the determinant identity in (A32), which results in the following:

|Λ_j + Q_j Σ_j(b.z) Q_j'| = |Λ_j| × |I_P + Σ_j(b.z) A_j|.

Thus, the determinant of V_j can be written as

|V_j| = |G_j Σ_zz G_j'| × |Λ_j| × |I_P + Σ_j(b.z) A_j|,

which again does not require the inversion of Σ_j(b.z).
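The generalized identity (A26) can be checked numerically in the same way, now with a deliberately singular between-level matrix (toy matrices of my own):

```r
# Check the inverse-free variant: (Lambda + Q Sigma.b Q')^-1 equals
#   Lambda^-1 - Lambda^-1 Q Sigma.b (I + Q' Lambda^-1 Q Sigma.b)^-1 Q' Lambda^-1
set.seed(2)
Lambda  <- crossprod(matrix(rnorm(36), 6, 6)) + diag(6)  # 6 x 6, positive definite
Q       <- matrix(rnorm(18), 6, 3)
L       <- matrix(rnorm(3), 3, 1)
Sigma.b <- L %*% t(L)               # rank 1, hence singular

lhs <- solve(Lambda + Q %*% Sigma.b %*% t(Q))
Li  <- solve(Lambda)
rhs <- Li - Li %*% Q %*% Sigma.b %*%
  solve(diag(3) + t(Q) %*% Li %*% Q %*% Sigma.b) %*% t(Q) %*% Li
max(abs(lhs - rhs))  # ~ 0, although Sigma.b itself cannot be inverted
```

Only the well-conditioned matrix I + Q'Λ^{-1}QΣ_b is factorized, which is exactly what makes the singular (or nearly singular) Σ_b case numerically safe.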

Implementation in R Code
Formulas are useful to study, but for some readers, it may be more rewarding to study the implementation of these formulas in source code. Appendix C contains the source code of several short R scripts. The first script (main.R) contains the dataset and the example model and runs the code in the other scripts. Following this are five R scripts that each illustrate a different approach to evaluating the objective function in (51). The first of these (objective1.R) is a naive implementation that computes the determinant and the inverse of the full V_j matrix for each cluster; this script introduces the various ingredients in a simple way. In objective2.R, I will implement the formulas at the end of Section 3.5. This version is faster (see Table 1) than the naive version-as expected.
If the programmer feels that this implementation is still too slow, it may be tempting to rewrite the code in a compiled language, such as C/C++ or FORTRAN. Without any effort beyond the programming itself, this will result in much faster code. Instead, I will attempt to improve the R code itself to make it run even faster. In the objective3.R script, I will avoid the explicit construction of selection matrices and will include some minor improvements. This will further decrease the computation time. In the objective4.R script, I will make use of the missing patterns within each cluster. Unfortunately, this hardly leads to any improvement because the cluster sizes are rather modest in my example dataset. In the objective5.R script, I will use the missing patterns as found in the entire dataset. Although this requires much more housekeeping, the gain in computation time is significant. An overview of the computation time for the five objective functions is presented in Table 1. In the second column, the average computation time (over 100 replications) is shown in milliseconds. In the last column, I take the second objective function as the baseline and show the time differences in percentages. The R scripts are based on the McDonald solution, taking into account the extension for singular Σ_b matrices. Equivalent R scripts (not shown here) have been written for the du Toit and du Toit solution. However, the shortcut that is used in objective5.R does not seem to apply to the du Toit and du Toit solution. For this reason, I only report the scripts based on the McDonald solution. In what follows, I will briefly discuss each file, guiding readers through the source code.

The Main.R File
To explore the R code, readers should start with the file main.R, which is listed in Appendix C.1. The script starts with reading in the dataset (Lines 4-9). The (artificial) dataset contains 2500 observations of 20 variables, clustered in J = 200 clusters. The cluster sizes are 5, 10, 15, and 20, with 50 clusters of each size. The y1-y10 and x1-x3 variables are measured at the within level, whereas the z1-z4 and w1-w3 variables are measured at the between level. Roughly 10% of the data are missing at both levels. Then, I specify a two-level model using lavaan syntax (Lines 10-38). In this model, the y1-y6 variables are split into a within and a between part. The y7-y10 and x1-x3 variables are within-only variables. The z1-z4 and w1-w3 variables are between-only variables. Next, I call the sem() function (Lines 44-45), but with the do.fit = FALSE argument. This implies that only the model and the dataset are processed, without starting the optimization. In fact, I will simply keep the starting values and only use the resulting dummy.fit object to extract the model-implied statistics and information about the dataset. There are T = 83 free parameters in this model. The next lines in the script extract some useful information from the dummy.fit object. The lavmodel object (Line 52) contains the internal representation of the model, including all model matrices. Lp (Line 53) contains various information about the clusters: the number of clusters (Line 64), cluster size (Line 65), cluster index (Line 66), and an index of the between-level variables (Line 67). In Line 54, Y1 contains the raw dataset as a 2500 × 20 matrix, and Y2 (Line 55) contains the cluster aggregated data as a 200 × 20 matrix. The lav_model_implied() function (Line 57) is used to compute the model-implied statistics for this model-given the current set of parameters. This returns a list (implied), which contains a mean vector and a variance-covariance matrix per level. 
These implied statistics are then further broken down (Line 61) into smaller pieces: µ y , µ z , Σ w , Σ b , Σ zz , and Σ yz . These vectors and matrices will always be the input for the objective functions. For each of the five objective functions, I first read in the source code (e.g., Line 71), and then I run the function to compute the value of the objective function (e.g., Lines 72-74). Sometimes, I precompute a few quantities that are needed in the objective function. For example, for objective4.R, a list (MP) is needed that contains information about the missing patterns for every cluster (Lines 91-95). For objective5.R, information is needed about the missing patterns for the complete dataset at both levels (Lines 104-105). Finally, I use the microbenchmark package to benchmark the five different functions (Lines 112-129). The results are shown in Table 1.

The Objective1.R File
The first script for the objective function (objective1.R) is listed in Appendix C.2. This script is based on Section 3.2. Apart from the transpose of Σ_yz (Line 9), no preparations are needed. Everything happens within the main for-loop (Lines 12-81), which runs over all clusters. For each cluster, the contribution to the (log)likelihood is computed and stored in a vector loglik (Line 11). If the function argument loglik. is FALSE (the default), then the constant is omitted and the log-likelihood contribution is multiplied by a factor −2 (minus two) (Lines 74-79). The final value of the objective function is the sum over all cluster contributions (Line 83). For a given cluster (j), the cluster size n_j is determined first (Line 15). Then, it is checked whether there are any variables at the between level (Line 17); in my example, this is always the case. Next, the data for this cluster are collected in an n_j × P dimensional matrix Yw.j for the within level (Line 18) and in a K × 1 vector z.j for the between level (Line 19). For the data vector at the between level, the G_j selection matrix (Lines 21-26) is created to keep track of the missing values in z.j. If the data are complete, G_j will be an identity matrix. The missing values in z.j are removed (Line 27). Similarly, for the within level, a selection matrix Q_ji is created (Lines 34-42) for every observation in this cluster. A corresponding W.ji matrix, which corresponds to Q_ji Σ_w Q_ji', is also created (Line 43), because these matrices are needed to construct the block diagonal matrix ⊕_{i=1}^{n_j} Q_ji Σ_w Q_ji' (using the lav_matrix_bdiag() function, Line 54). All the Q_ji matrices for this cluster are then collected in a P_j × P dimensional matrix Q_j (Line 47). It is now possible to construct the variance-covariance matrix V_j for this cluster. First, the four parts (V.zz, V.zy, V.yz, and V.yy (Lines 51-54)) are constructed, and then the full V.j matrix (Lines 56-57) is assembled.
The est.j vector contains the model-implied means for both the between and the within level (Lines 58-59). The obs.j vector contains the observed data for both levels (Lines 60-61). Both vectors contain P j elements. The difference between these two vectors is stored in delta.j (Line 71). Perhaps the biggest task in this script (in terms of computation time) is the computation of the determinant and the inverse of the P j × P j dimensional matrix V j (Lines 68-69). Once the inverse has been computed, it is possible to compute the quadratic form δ j V j δ j , storing the (scalar) result in q.j (Line 72). Finally, the cluster contribution to the objective value is stored in loglik, and the next cluster can then be addressed. After repeating this J = 200 times, the final value for the objective function is stored in out and returned to the caller.

The Objective2.R File
The second script for the objective function (objective2.R), which is listed in Appendix C.3, is based on Section 3.3 and the extension in Section 3.5. The first part of the script (Lines 9-49) is identical to the objective1.R script. The only addition is the creation of the A.ji matrices, which are defined as A ji = Q ji ′ (Q ji Σ w Q ji ′) −1 Q ji . The sum of these n j matrices is then stored in the matrix A j , which corresponds to Q j ′ Λ j −1 Q j , a matrix that plays a crucial role in this script. The name for Λ j in this script is bdiag_sigma.w.j, indicating that this is a block diagonal matrix, where each block is based on a selection of rows and columns of the Σ w matrix (Line 52). The inverse (Λ j −1 ) is called bdiag_sigma.w.j.inv (Line 53). Next, Σ zz j (Line 57), Σ j(b.z) (Lines 58-60), and IBZA.j are computed, where the latter is defined as I P + Σ j(b.z) A j (Line 64) and is perhaps the most important matrix of this script. The model-implied means (est.j) and observed data (obs.j) are defined just like in the first script (Lines 70-74). However, the difference is split into a z part (delta.z) and a y part (delta.y) (Lines 76-81). Finally, p j (Line 82) and g j (Line 83) are defined. All the ingredients are now in place to compute the quadratic forms q.zz, q.zy, q.yy, and eventually q.j (Lines 85-93). The drop() function is used to make sure the result is a scalar, not a 1 × 1 dimensional matrix. Some care is needed for clusters where all elements in z.j are missing. In that case, G j is empty (zero rows), and only q.yy is used to form q.j (Lines 96-101). To compute the (log) determinant, the (log) determinant of Λ j is first computed by taking the sum of the n j (log) determinants of the W.ji blocks (Line 106). Then the (log) determinant of the IBZA.j matrix (Line 107) is computed, followed by the (log) determinant of G j Σ zz G j ′ (Line 108). The sum of these three log determinants provides the log determinant of the full V j matrix (Line 109).
The log-likelihood contribution for this cluster is stored in loglik (Lines 116-121), and this completes the computations for this cluster. After J = 200 repetitions, the final value (out) is obtained and returned to the caller.
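The payoff of this re-expression can be verified numerically. The sketch below takes the complete-data case without between-level z variables, where Λ j = I ⊗ Σ w and A j reduces to n j Σ w −1 ; the determinant identity underlying the script then replaces an (n j P) × (n j P) determinant by a P × P one. The matrices are simulated and the names are illustrative, not lavaan internals.

```r
# Check: log|V.yy| = nj * log|Sigma.w| + log|I + Sigma.b %*% A.j|
# for V.yy = I (x) Sigma.w + 11' (x) Sigma.b and A.j = nj * solve(Sigma.w).
set.seed(2)
P <- 3; nj <- 5
Sigma.w <- crossprod(matrix(rnorm(P * P), P)) + diag(P)
Sigma.b <- crossprod(matrix(rnorm(P * P), P)) * 0.2
ones <- matrix(1, nj, 1)
V.yy <- diag(nj) %x% Sigma.w + (ones %*% t(ones)) %x% Sigma.b

A.j <- nj * solve(Sigma.w)                 # sum of nj identical inverses
lhs <- as.numeric(determinant(V.yy)$modulus)
rhs <- nj * as.numeric(determinant(Sigma.w)$modulus) +
       as.numeric(determinant(diag(P) + Sigma.b %*% A.j)$modulus)
all.equal(lhs, rhs)   # TRUE: a P x P determinant replaces an (nj*P) one
```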

The Objective3.R File
The objective2.R script runs faster than the objective1.R script, but there is still much room for improvement. Perhaps the biggest problem in objective2.R is that W.ji must be computed and inverted (see Line 45 in objective2.R) for every observation. In addition, the construction and use of the selection matrices (Q ji , G j ) are not practical and (as I shall demonstrate) unnecessary. In the third script for the objective function (listed in Appendix C.4), the global structure of the objective2.R script is kept intact, but I try to improve the code to make it more efficient.
Recall that W.ji is just Σ w , but with the rows and columns that correspond to the missing values for this observation removed. In this script, I precompute the (log) determinant and the inverse of the full Σ w matrix (Lines 12-13). If an observation has missing values, I can "update" the inverse (and determinant) (Lines 52-57) and use this information to incrementally construct W.logdet.j (Line 58), A j (Lines 59-60), and p j (Line 63) while the individual observations of this cluster are processed. In addition, I update q.yy.a (Line 64), which is the part of q.yy that only depends on within-level information. When an observation is complete, there is no need to "update" the inverse of Σ w , and I can compute W.logdet.j (Line 67), A j (Line 68), p j (Line 71), and q.yy.a (Line 72) immediately using the precomputed (log) determinant and inverse of Σ w . In this script, I no longer explicitly use the selection matrices Q ji . Instead, I keep track of the missing-value indices (na.idx) and use these to remove the rows and columns of the matrices when needed. For the between part, I use a similar strategy. I precompute the (log) determinant and the inverse of Σ zz (Lines 23-24) and Σ b.z (Line 26) for the complete case. Then, for a specific cluster, I distinguish three scenarios: (1) all between values (of obs.z) are missing (Lines 87-88), (2) at least one value is missing, but not all (Lines 90-101), or (3) no values are missing (Lines 103-108). In the third scenario, the complete-data statistics can be used immediately. When some values are missing, the (log) determinant and the inverse of Σ zz (Lines 93-97) must be "updated." The selection matrix G j is no longer used. The key matrix IBZA.j is now constructed in a slightly fancier way (Lines 113-114) in order to avoid the construction of a diagonal matrix for every cluster.
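The "updating" of the inverse and (log) determinant can be based on the standard partitioned-inverse results from Appendix B.1. The sketch below is one possible implementation (lavaan's actual helper functions may be organized differently): given the precomputed inverse and log determinant of the full Σ w , the corresponding quantities for the submatrix with the rows and columns in na.idx removed follow without a fresh factorization of that submatrix.

```r
set.seed(3)
P <- 4
Sigma.w <- crossprod(matrix(rnorm(P * P), P)) + diag(P)
Sigma.w.inv    <- solve(Sigma.w)                         # precomputed once
Sigma.w.logdet <- as.numeric(determinant(Sigma.w)$modulus)

na.idx <- c(2, 4)                      # indices of the missing variables
# inverse of Sigma.w[-na.idx, -na.idx], using only blocks of Sigma.w.inv:
W.inv <- Sigma.w.inv[-na.idx, -na.idx] -
         Sigma.w.inv[-na.idx,  na.idx] %*%
         solve(Sigma.w.inv[na.idx, na.idx],
               Sigma.w.inv[na.idx, -na.idx])
# matching log-determinant update:
W.logdet <- Sigma.w.logdet +
  as.numeric(determinant(Sigma.w.inv[na.idx, na.idx, drop = FALSE])$modulus)

all.equal(W.inv, solve(Sigma.w[-na.idx, -na.idx]))                       # TRUE
all.equal(W.logdet,
          as.numeric(determinant(Sigma.w[-na.idx, -na.idx])$modulus))    # TRUE
```

Only a small system of the size of na.idx has to be solved, instead of refactorizing the observed-part submatrix for every observation.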
Another improvement is the construction of the quadratic forms (Lines 126-129), which avoids matrix multiplications and relies on scalar multiplications as much as possible. The quadratic form for the y part is split into two parts (q.yy.a and q.yy.b), where the former has already been computed while processing the individual observations of this cluster (Line 64 or 72). In a similar fashion, the z part is split into two parts (q.zz.a and q.zz.b), where the former is based on the raw data (delta.z) and the latter is based on the g j vector for this cluster. Because all the needed (log) determinants have already been computed along the way, V.j.logdet can be computed immediately (Line 143). The remainder of the script is identical to the objective2.R script.

The Objective4.R File
In the final paragraph of the last Appendix (12.B.6) of their book chapter, du Toit and du Toit [3] write: "One could also compute the patterns of missingness within each level-2 unit . . . " In the objective4.R script, listed in Appendix C.5, I followed this suggestion. First, the missing-pattern information for each cluster is precomputed. This is done in the main.R script (Lines 91-95), where a list (MP) is constructed with the missing-pattern information for each cluster. Instead of looping over all the observations within a cluster, the loop now runs over the missing patterns within that cluster (Line 46). For each pattern, the (log) determinant and the inverse of Σ w (Lines 56-60) are "updated," and the quantities W.logdet.j, A.j, p j , and q.yy.a are computed (Lines 61-68). Note that freq is used to account for the number of observations that belong to this pattern. Other small improvements in this script are the precentering of Y1w (Lines 17-18) and Z (Lines 23-24), which facilitates the construction of the vector of observed data for each cluster (Line 65 or 75 and Line 95 or 108). The computation of q.zz.a now happens in the between section, although this is nothing more than a cosmetic change. Finally, the NY and NZ scalars keep track of the number of nonmissing values in the cluster (at the within and the between level, respectively). They are used to compute P j , which is only needed if loglik. = TRUE (Line 146). Based on Table 1, it would seem that this script is not faster than the previous script (objective3.R), at least in my example. The reason is that the cluster sizes are rather small in my dataset, and the number of missing patterns is often not much smaller than the number of observations within a cluster. However, even for larger cluster sizes, this script is not much faster than the previous one.
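The precomputation of the missing-pattern information can be sketched as follows. The MP list below is a hypothetical stand-in for the one built in main.R (the element names pat and freq are illustrative): for each cluster, it stores the unique response patterns together with how many observations share each pattern.

```r
set.seed(4)
Y <- matrix(rnorm(20 * 3), 20, 3)
Y[sample(length(Y), 10)] <- NA          # inject some missing values
cluster <- rep(1:4, each = 5)           # 4 clusters of size 5

MP <- lapply(split(seq_len(nrow(Y)), cluster), function(case.idx) {
  miss <- is.na(Y[case.idx, , drop = FALSE])
  pat.str <- apply(miss, 1, paste, collapse = "")   # one string per row
  tab <- table(pat.str)                             # frequency per pattern
  first <- match(names(tab), pat.str)               # one representative row
  list(pat  = miss[first, , drop = FALSE],          # TRUE = missing entry
       freq = as.integer(tab))                      # rows sharing the pattern
})
```

Looping over MP[[j]]$pat (weighted by MP[[j]]$freq) then replaces the loop over individual observations.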

The Objective5.R File
In order to further decrease the computation time, it is necessary to decrease the number of matrix inversions. In the objective4.R script, Σ w must be inverted (and its determinant computed) for every missing pattern in each cluster. Across clusters, however, the same missing pattern often emerges. In this script, listed in Appendix C.6, I sought to invert Σ w only once for each missing pattern in the entire level-1 dataset. In addition, missing patterns are now also used for the between-level data. The price to pay is that more housekeeping becomes necessary. At the between level, a list SIGMA.B.Z (Line 39) is created, where the Σ j(b.z) matrix is filled in for each missing pattern at the between level (Lines 62-63). At the same time, the J × 1 vector ZPAT2J keeps track of the missing pattern that applies to each cluster. Later, when the IBZA.j matrix must be constructed for a given cluster, the correct Σ j(b.z) matrix can then be "selected" from the list in SIGMA.B.Z (Line 145). Similarly, a J × P dimensional matrix GJ is created, where the g j vectors are filled in for each cluster. The g j vector is computed only once per missing pattern and is then filled in for all rows of GJ that share the same missing pattern at the between level (Line 64). For this purpose, I make use of j.idx (Line 46), which keeps track of the cluster indices that share the same missing pattern. Some care is needed for the clusters where the observed data vector is completely empty (i.e., all values are missing). They are not part of the missing patterns in Zp and need to be handled separately (Lines 79-85). For the within level, I use a similar strategy, although the situation is more complex. The J × P dimensional matrix PJ and the list ALIST (of length J) are used to store the p j vectors and the A j matrices for all clusters, respectively. However, these containers must be filled in while processing the missing data patterns of the level-1 data.
To better understand how this is accomplished, consider missing pattern number 2 in Mp. This missing pattern (with a missing value for the y1 variable only) occurs 123 times in the entire dataset (so freq = 123). The cluster numbers where this pattern is observed are stored in j.idx. Most cluster numbers are unique, but sometimes this pattern occurs multiple times within the same cluster. These frequencies are stored in npatj (Line 100), and the unique cluster numbers are stored in j1.idx. Recall that A j = Q j ′ Λ j −1 Q j is just the sum (over all observations within a cluster) of the inverse of Σ w , where (for each observation) the rows and columns corresponding to the missing values have been removed before the inverse is taken; the result is then plugged back into a matrix where these rows and columns are replaced by zeroes. Each time the (updated) inverse of Σ w is computed, the contribution of this pattern can be added to all the A j matrices where j is in j1.idx (Lines 118-122). After all missing patterns in Mp are processed, the completed A j matrices will be available in ALIST, so they can be extracted per cluster when it becomes necessary to compute the IBZA.j matrix (Line 146). In a similar fashion, the PJ matrix is gradually updated while running over the missing patterns. After all patterns have been processed, each row in PJ will contain the p j vector for the corresponding cluster, to be used in the computation of q.yy.b (Line 145 or 154). This scheme brings down the number of times it is necessary to invert (a submatrix of) Σ w to a minimum. The same is true for Σ zz at the between level. Unfortunately, the IBZA.j matrix still needs to be inverted for every cluster and remains the bottleneck (in terms of computation time) of this script. A last small improvement in this script is that I replaced every occurrence of solve(A) %*% B by solve(A, B).
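The pooling scheme can be illustrated with a small stand-alone sketch. It inverts the observed part of Σ w once per dataset-wide missing pattern and adds the (zero-padded) result to the A j matrix of every cluster in which the pattern occurs, weighted by the within-cluster frequency npatj. The container names follow the text (ALIST, npatj), but the data are simulated and this is not the actual objective5.R script.

```r
set.seed(5)
P <- 3; J <- 4
Sigma.w <- crossprod(matrix(rnorm(P * P), P)) + diag(P)
Y <- matrix(rnorm(12 * P), 12, P)
Y[sample(12, 5), 1] <- NA              # missing values in y1 only
Y[sample(12, 4), 2] <- NA              # and in y2 (y3 is always observed)
cluster <- rep(1:J, each = 3)

pat.str <- apply(is.na(Y), 1, paste, collapse = "")
ALIST <- replicate(J, matrix(0, P, P), simplify = FALSE)
for (s in unique(pat.str)) {
  case.idx <- which(pat.str == s)
  keep <- which(!is.na(Y[case.idx[1], ]))           # observed variables
  tmp <- matrix(0, P, P)                            # zero-padded container
  tmp[keep, keep] <- solve(Sigma.w[keep, keep, drop = FALSE])  # ONCE per pattern
  npatj <- table(cluster[case.idx])                 # pattern count per cluster
  for (j in as.integer(names(npatj)))
    ALIST[[j]] <- ALIST[[j]] + npatj[[as.character(j)]] * tmp
}
```

After the loop, ALIST[[j]] equals the A j matrix that objective4.R would have accumulated observation by observation.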
Finally, some generic function calls (e.g., solve()) were replaced by more specific function calls (e.g., solve.default()) to avoid the (small) overhead that comes with method dispatching, which is part of R's S3 system.
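The two micro-optimizations can be shown side by side. Both lines below compute the same matrix, but the second avoids forming an explicit inverse (one LAPACK call instead of two) and bypasses S3 method dispatch by calling solve.default() directly:

```r
set.seed(6)
A <- crossprod(matrix(rnorm(9), 3)) + diag(3)   # nonsingular 3 x 3 matrix
B <- matrix(rnorm(6), 3, 2)
x1 <- solve(A) %*% B          # explicit inverse, then multiply
x2 <- solve.default(A, B)     # one solve, no dispatch overhead
all.equal(x1, x2)             # TRUE
```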
Based on Table 1, this script runs significantly faster than the previous scripts. There is no doubt that, given more thought, even more improvements are possible. Nevertheless, objective5.R represents the current state of affairs at the time of this writing, and this script is used (with some cosmetic changes) in lavaan 0.6-9.

Stochastic Versus Fixed Covariates
In the scripts, I have made no distinction between endogenous ("y") and exogenous ("x") observed variables. In lavaan terminology, this corresponds to fixed.x = FALSE, implying that the "x" variables are treated as stochastic (assuming multivariate normality). This was done only for ease of presentation. By default, lavaan uses fixed.x = TRUE, and the "x" variables are treated as fixed constants. The advantage is that distributional assumptions are no longer needed for the "x" variables. The price, however, is that when fixed.x = TRUE, the "x" variables must not have missing values. As a result, all observations with missing values in any of the level-1 "x" variables are deleted from the dataset before the analysis. In addition, if missing values are present in any of the level-2 "x" variables, the entire cluster is removed from the dataset before the analysis.
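These deletion rules can be mimicked on a toy data frame (lavaan applies the equivalent filtering internally; the variable names here are illustrative):

```r
Data <- data.frame(cluster = rep(1:3, each = 2),
                   y  = c(1, NA, 3, 4, 5, 6),      # endogenous: NA is allowed
                   x1 = c(1, 2, NA, 4, 5, 6),      # level-1 covariate
                   x2 = c(10, 10, 20, 20, NA, NA)) # level-2 covariate

# rule 1: drop observations with a missing level-1 "x"
Data1 <- Data[!is.na(Data$x1), ]
# rule 2: drop entire clusters with a missing level-2 "x"
bad.cl <- unique(Data1$cluster[is.na(Data1$x2)])
Data2  <- Data1[!Data1$cluster %in% bad.cl, ]
nrow(Data2)   # 3 rows remain; the NA in "y" survives (handled by FIML)
```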

Conclusions
In this paper, I focused on the evaluation of the (observed) log-likelihood function for a two-level SEM in the presence of missing data. This makes it possible to obtain ML estimates of the model parameters using all available data. I have discussed two approaches (the McDonald solution and the du Toit and du Toit solution), which have been published in the literature, and have added a small extension to handle the case where Σ b is singular. Finally, I presented a sequence of R scripts to evaluate this observed log-likelihood, starting from a naive (and slow) version and ending with a more refined (and faster) version. This sequence was not written with this paper in mind, but it truly reflects how the R code has evolved over time (over a period of many months) in order to prepare the code for inclusion in lavaan. Of course, the objective function is just a small part of the optimization machinery. Another essential ingredient is the gradient of the objective function. This gradient can be approximated numerically, but this is slow, especially when the number of free parameters is rather large. Therefore, I also derived an analytic expression for this gradient. These derivations are not shown in this paper, but the corresponding code can be found in the file lav_mvnorm_cluster_missing.R in the lavaan source code. Another limitation of this paper is that I only considered the random-intercept setting. No random slopes were allowed. I refer interested readers to [9] for a detailed description of how the log-likelihood function can be expressed to handle random slopes in the complete-data setting. As noted by Rockwood (see the Appendix of [9]), the adaptation to the missing data setting is straightforward. Translating this to efficient R code, however, is not so straightforward. In the near future, the plan is to incorporate the Rockwood approach into lavaan, so that both missing data and random slopes can be handled.
It would also be interesting to compare the lavaan approach with Rampart, which is implemented in OpenMx [15]. This algorithm was designed to evaluate multilevel SEMs with many levels in a computationally efficient way. Some ideas of Rampart (in particular the orthogonal rotation) could perhaps be integrated into the lavaan code. As a final outlook, the plan is also to construct an expectation-maximization (EM) algorithm for the two-level SEM setting with missing data and random slopes. This is clearly doable, as such an algorithm has been available in a commercial SEM package for almost two decades. Unfortunately, to the best of my knowledge, there is no (published) technical literature on this algorithm. In addition, it is not clear whether this EM algorithm can be adapted to handle the case where Σ b is singular.

Conflicts of Interest:
The author declares no conflict of interest.

Appendix A. Derivation of the du Toit and du Toit Solution
The steps to derive the du Toit and du Toit solution are similar to those used to derive the McDonald solution, except that the starting point is now the first identity (A20) in Appendix B.1. Before turning to the individual blocks of V j , I first derive some intermediate results. The first is an expression for V j(yy) −1 , which is different from V j yy . Using the notation Λ j for the block diagonal matrix with blocks Q ji Σ w Q ji ′ (i = 1, . . . , n j ), V j(yy) can be written as the sum of Λ j and a low-rank term, so that the (classic) Woodbury identity in (A25) can be used to obtain its inverse. The second result is an expression for V j(yy) −1 Q j , which calls for an application of the Woodbury push-through right identity in (A27). Similarly, a third result is for Q j ′ V j(yy) −1 , for which the Woodbury push-through left identity in (A29) is needed. It follows that Q j ′ V j(yy) −1 Q j can also be written in closed form. I can now proceed with finding an expression for V j zz . By using (A20), V j zz = V j(z.y) −1 , where V j(z.y) can be expanded using (A5) to get from the second to the third line. With suitable notation, V j zz and the quadratic form q j zz can then be written compactly. Next, I seek an expression for V j yz , again starting from (A20), wherein the result in (A3) was used to get from the first to the second line. The quadratic form q j zy then follows, making use of the vector g̃ j . The tilde on g̃ j should alert readers that this expression is similar, but not identical, to g j as defined in (67). Finally, for the V j yy term, I use (A20) once more. The first line is the inverse of (A1) and can be rewritten as in (A2). The second line equals (minus) the expression for V j yz in (A10). The last line corresponds to the result in (A4), premultiplied by G j Σ zy .
Applying these results, (A13) may be rewritten in a more convenient form. As a side note, du Toit and du Toit [3] further simplify the last expression by writing the term within the square brackets as −H j , so that V j yy can be written more compactly. The quadratic form q j yy then follows directly. To derive an expression for the determinant of V j , this time (A23) is used, together with the determinant identity in (A32). The only "large" matrix for which the determinant must be computed is Λ j , but that is a block diagonal matrix, so its determinant is easily obtained as in (A35). The final results are summarized in Section 3.4.

Appendix B. Some Useful Formulas of Matrix Algebra
Appendix B.1. Inverse and Determinant of a Partitioned Matrix
There are two different but equivalent identities for the inverse of a partitioned matrix; Identity 1 and Identity 2 differ in which diagonal block is inverted directly. Similarly, the determinant of a partitioned matrix can be computed in two different ways.
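The two identities presumably intended here are the standard partitioned-inverse results based on the two Schur complements; a reconstruction in generic notation follows (the original equation numbers (A20)-(A24) are not reproduced):

```latex
% Partition M into four blocks:
M = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
% Identity 1, with S = A_{22} - A_{21} A_{11}^{-1} A_{12}:
M^{-1} = \begin{pmatrix}
A_{11}^{-1} + A_{11}^{-1} A_{12} S^{-1} A_{21} A_{11}^{-1} & -A_{11}^{-1} A_{12} S^{-1} \\
-S^{-1} A_{21} A_{11}^{-1} & S^{-1}
\end{pmatrix}
% Identity 2, with T = A_{11} - A_{12} A_{22}^{-1} A_{21}:
M^{-1} = \begin{pmatrix}
T^{-1} & -T^{-1} A_{12} A_{22}^{-1} \\
-A_{22}^{-1} A_{21} T^{-1} & A_{22}^{-1} + A_{22}^{-1} A_{21} T^{-1} A_{12} A_{22}^{-1}
\end{pmatrix}
% The two corresponding determinant expressions:
|M| = |A_{11}|\,|A_{22} - A_{21} A_{11}^{-1} A_{12}|
    = |A_{22}|\,|A_{11} - A_{12} A_{22}^{-1} A_{21}|
```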

Appendix B.2. The Woodbury Identity
The Woodbury identity gives an expression for the inverse of a matrix of the form A + UBV, where A (p × p) is square and nonsingular, and U (p × r), B (r × s), and V (s × p) can be rectangular (or square). Several variants and special cases exist; an overview can be found in [16]. I only provide two identities here. The first is the classic identity that is widely used in the literature; it assumes that B is square and nonsingular, so that its inverse B −1 exists. The second identity is less known, but is given in [16] for the general case and in [17] for the symmetric case where V = U ′. This version is more general and allows B to be singular. The following identities give expressions for the setting where the original term is either postmultiplied by U or premultiplied by V. When postmultiplied, the expressions are known as Woodbury push-through "right" identities; two versions are given, where the first requires B to be nonsingular but the second does not. When premultiplied, the expressions are known as Woodbury push-through "left" identities; again, two versions are given, and only the first requires B to be nonsingular.
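A reconstruction of the identities referred to in this subsection, in generic notation and without the original equation numbers; each can be verified by multiplying with A + UBV:

```latex
% Classic Woodbury identity (B nonsingular):
(A + UBV)^{-1} = A^{-1} - A^{-1} U (B^{-1} + V A^{-1} U)^{-1} V A^{-1}
% General version (B may be singular):
(A + UBV)^{-1} = A^{-1} - A^{-1} U B (I + V A^{-1} U B)^{-1} V A^{-1}
% Push-through "right" identities (first form needs B nonsingular):
(A + UBV)^{-1} U = A^{-1} U (B^{-1} + V A^{-1} U)^{-1} B^{-1}
                 = A^{-1} U (I + B V A^{-1} U)^{-1}
% Push-through "left" identities (first form needs B nonsingular):
V (A + UBV)^{-1} = B^{-1} (B^{-1} + V A^{-1} U)^{-1} V A^{-1}
                 = (I + V A^{-1} U B)^{-1} V A^{-1}
```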

Appendix B.3. Determinant Identities
A well-known determinant identity (e.g., [18], p. 420) is related to the Woodbury identity: where A (p × p) and B (r × r) are square and nonsingular, whereas U (p × r) and V (r × p) can be rectangular or square. Because |AB| = |A| · |B|, this identity can also be rewritten as follows: The latter expression has the advantage that it can be applied in settings where B is singular.
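In generic notation, the two forms of this determinant identity read (a reconstruction, without the original equation numbers):

```latex
% Determinant identity related to the Woodbury identity (B nonsingular):
|A + UBV| = |A|\,|B|\,|B^{-1} + V A^{-1} U|
% Rewritten using |B| |B^{-1} + V A^{-1} U| = |I + B V A^{-1} U|,
% which remains valid when B is singular:
|A + UBV| = |A|\,|I + B V A^{-1} U|
```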

Appendix B.4. Block Diagonal Matrices
A block diagonal matrix is similar to a diagonal (square) matrix, but every diagonal element is itself a (square) matrix. All other elements are zero. For example, a block diagonal matrix with K blocks has the following form: The inverse of a block diagonal matrix is again a block diagonal matrix where the diagonal elements are the inverted blocks (assuming their inverses exist): The determinant of a block diagonal matrix is simply the product of the determinants of the individual blocks:
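The three properties can be illustrated numerically with a small bdiag() helper, which plays the role of lavaan's lav_matrix_bdiag() in the scripts (the helper below is a generic sketch, not lavaan's implementation):

```r
# assemble a block diagonal matrix from a list of square blocks
bdiag <- function(blocks) {
  sizes <- vapply(blocks, nrow, integer(1))
  out <- matrix(0, sum(sizes), sum(sizes))
  at <- 0L
  for (b in blocks) {
    idx <- at + seq_len(nrow(b))
    out[idx, idx] <- b
    at <- at + nrow(b)
  }
  out
}

set.seed(7)
blocks <- lapply(c(2, 3), function(k) crossprod(matrix(rnorm(k * k), k)) + diag(k))
M <- bdiag(blocks)

all.equal(solve(M), bdiag(lapply(blocks, solve)))         # block-wise inverse
all.equal(det(M), prod(vapply(blocks, det, numeric(1))))  # product of determinants
```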