Between-Item Multidimensional IRT: How Far Can the Estimation Methods Go?

Abstract: Multidimensional item response models are known to be difficult to estimate, and a variety of estimation and modeling strategies have been proposed to handle the difficulties. While some previous studies have considered the performance of these estimation methods, they typically include only one or two methods, or a small number of factors. In this paper, we report on a large simulation study of between-item multidimensional IRT estimation methods, considering five different methods, a variety of sample sizes, and up to eight factors. This study provides a comprehensive picture of the methods' relative performance, as well as each individual method's strengths and weaknesses. The study results lead us to make recommendations for applied research regarding which estimation methods should be used under various scenarios.


Introduction
Given the known difficulties with estimating multidimensional item response theory (MIRT) models [1], the goal of this paper was to compare a variety of estimation methods across a large range of sample size and model size conditions in a simulation study. Our objective was to provide recommendations to applied researchers regarding which estimation method is better suited to different types of research contexts.
For dichotomous indicators, unidimensional IRT models express the probability of endorsing an item (e.g., giving a correct response) as a function of a common underlying/unobserved factor and a set of item parameters. There are multiple IRT models for dichotomous items that differ in terms of their item parameters. The two-parameter logistic model (2PL), which is the model of focus in this paper, is one of the most common IRT models for dichotomous responses. The probability of endorsing item j for each subject i can be expressed as

P(u_{ij} = 1 | \theta_i) = \frac{\exp[a_j(\theta_i - b_j)]}{1 + \exp[a_j(\theta_i - b_j)]}. (1)

In Equation (1), u_{ij} is the item response (0 or 1), and \theta_i represents the underlying factor score for each subject i. The 2PL model includes a discrimination/slope parameter a_j for each item j, indicating that items can have different degrees of relation with \theta. The b_j parameter represents the location/threshold for each item j, which tells us at what point on the latent variable continuum the item is most discriminating [1,2].
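As a concrete illustration, Equation (1) can be computed directly. The sketch below is in Python (the study itself used R), with hypothetical parameter values.

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL item response function: P(u = 1 | theta) for slope a and threshold b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Hypothetical item with slope a = 2 and threshold b = 0: the probability is
# exactly 0.5 when theta equals the threshold, and the curve is steepest there.
p_2pl(0.0, 2.0, 0.0)   # 0.5
p_2pl(1.0, 2.0, 0.0)   # about 0.88
```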
IRT models were initially developed under the assumption of unidimensionality, meaning that a single latent variable underlies the probability of item responses. This assumption has been questioned as inappropriate for complex phenomena, and is perhaps an unrealistic representation of item responses. In fact, some have argued that every test is multidimensional to some degree [3], necessitating multidimensional IRT (MIRT) models (e.g., [4]).
There are two types of multidimensionality commonly discussed within the MIRT literature. Between-item multidimensionality occurs when each item measures a single construct, but multiple constructs are measured by the set of items. This is in contrast to within-item multidimensionality, which occurs when a single item reflects variation on more than one underlying construct [5]. In the current project, we focused solely on between-item multidimensional models, which is relevant to applications in which there is a theoretical basis for distinct factors. Within-item multidimensional models, while also an important area of research, are not considered here.

MIRT Model
For this project, we focus on the 2PL IRT model for dichotomous items. The 2PL MIRT is a straightforward extension of the unidimensional model (Equation (1)) and can be represented with two common parameterizations: slope-intercept and slope-threshold. Here, we work with the slope-threshold formulation,

P(u_{ij} = 1 | \theta_i) = \frac{\exp[a_j'(\theta_i - b_j)]}{1 + \exp[a_j'(\theta_i - b_j)]}, (2)

where P is the predicted probability of answering 1, u_{ij} is the item response (1 for endorsement, and 0 otherwise) for subject i on item j, and \theta_i is the vector of factor scores for subject i. Additionally, a_j and b_j can be conceptualized as the slope/discrimination and threshold parameters for item j, matching the form of the unidimensional model. Each slope parameter indicates the magnitude of the relation between item j and the respective factor, while the threshold indicates at what level of the respective latent variable \theta each item's slope is the steepest. In the between-item multidimensional model, each item j has one free parameter in the a_j vector, while the within-item model involves multiple free parameters in a_j representing the relationships between item j and each factor that defines it. For multidimensional models, the formulation in Equation (2) includes redundant parameters that help us see similarities to the unidimensional model, but that are not identified. Instead of estimating multiple b_j parameters per item, we simply estimate a scalar parameter that equals a_j'b_j. We further identify the model by fixing the factor means to 0 and the factor variances to 1.
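The between-item structure described above can be sketched numerically (Python, hypothetical values): the item's slope vector has a single free entry, so the item relates to only one factor.

```python
import numpy as np

def p_mirt(theta, a, b):
    """Multidimensional 2PL, slope-threshold form: P(u = 1 | theta) = logistic(a'(theta - b))."""
    z = a @ (theta - b)
    return 1.0 / (1.0 + np.exp(-z))

# Between-item structure: item j loads on factor 1 only, so its slope vector
# has a single nonzero entry; values here are hypothetical.
a = np.array([1.5, 0.0])
b = np.array([0.3, 0.0])          # only the scalar a'b is identified, not each b
theta = np.array([0.8, -1.2])     # subject i's factor scores
p_mirt(theta, a, b)               # logistic(1.5 * (0.8 - 0.3)), about 0.68
```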
Commonly, these IRT models are estimated with marginal maximum likelihood (MML) estimation based on the expectation maximization (EM) algorithm. More recently, a variation of Markov chain Monte Carlo (MCMC) sampling for Bayesian inference has been used, as well as methods that combine the advantages of MCMC sampling with ML estimation, such as Quasi-Monte Carlo EM integration (QMCEM) and Metropolis-Hastings Robbins-Monro (MHRM).
Within a factor analytic framework, the latent variable models for dichotomous items are often conceptualized as item factor analysis (IFA) models. IFA is an adaptation of the factor analysis model for continuous items. These are usually limited information methods, as they involve the estimation of bivariate item relations such as tetrachoric or polychoric correlations [6]. This is in contrast to IRT estimation methods, which use all the available information in the item responses and not just summary statistics (i.e., the item covariance matrix) [1,7].
Factor analysis parameter estimation typically includes the use of weighted least squares (WLS) in combination with tetrachoric or polychoric correlation matrices to estimate SEM models based on limited information. Estimation methods have been compared for unidimensional models [8], as well as for multidimensional models with few factors and large sample size [9]. In general, previous studies comparing estimation methods have either compared only a handful of estimation methods, or when more estimation methods are compared, they focused on few simulation conditions with large sample sizes [6,[8][9][10][11][12][13][14][15]. To our knowledge, there is no clear information about the comparison between these estimation methods for highly multidimensional models and a large range of sample sizes.
The goal of the present research was to compare different estimation methods with a Monte Carlo simulation. Specifically, we were interested in how the methods perform across different numbers of factors and sample sizes for between-item multidimensional 2PL models. The results can help provide guidelines for researchers who wish to fit MIRT models.
In the following pages, we first provide an overview of the estimation methods that we compared in the simulation study. We then describe the design of the simulation study and the estimation methods that will be evaluated. Then, we present the results of the simulation study, comparing each estimation method. Lastly, we provide a discussion of the results, focusing on recommendations for applied researchers and future directions for related research.

Estimation Methods
This project focuses on five estimation methods. The first four are frequentist, including three IRT full information methods (EM, QMCEM, and MHRM) and one limited information method (SEM-WLSMV). The last method is Markov chain Monte Carlo (MCMC) sampling for Bayesian inference. A description of the methods is provided below.

Expectation-Maximization (EM)
Marginal maximum likelihood (MML) was developed by [16] as a two-step approach for full information parameter estimation. The model likelihood defines the joint probability of the item response pattern given the subject parameters. In MML, subjects are treated as random effects, and the marginal probability of observing each response pattern is derived by integrating the subject effect out of the joint likelihood. In doing so, the subject and item parameters are separated. MML estimates item parameters with the expectation-maximization (EM) algorithm, and the subject parameters can be estimated afterward from the item parameters.
The EM algorithm is an iterative procedure useful for approximating MML estimates in incomplete data problems. Define y as a vector of observed data, u as a vector of unobserved data, and \xi as a vector of parameters. Then f(y, u; \xi) is the joint density of the complete data (y, u). The likelihood of interest marginalizes out the unobserved data, L(\xi; y) = \int f(y, u; \xi) \, du. At each tth iteration of the algorithm, the E-step computes the conditional expectation of the complete data log-likelihood, conditional on the observed data and the current parameter value \xi^t. The M-step then maximizes this expected complete data log-likelihood over all \xi in the parameter space to obtain the updated parameters \xi^{t+1}. The M-step is commonly implemented with numerical methods like Newton-Raphson [17].
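The two steps can be illustrated with a deliberately simple incomplete-data problem (a Python sketch, not the IRT case): a two-component normal mixture in which component membership u is unobserved and only the mixing weight \xi = \pi is estimated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy incomplete-data problem: mixture of N(0, 1) and N(3, 1) with true
# mixing weight 0.4; component membership is the unobserved data u.
x = np.concatenate([rng.normal(0, 1, 600), rng.normal(3, 1, 400)])

def em_mixing_weight(x, pi=0.5, iters=50):
    for _ in range(iters):
        # E-step: expected membership given observed data and current pi
        # (normalizing constants cancel because both components have sd 1).
        num = pi * np.exp(-0.5 * (x - 3.0) ** 2)
        den = num + (1.0 - pi) * np.exp(-0.5 * x ** 2)
        resp = num / den
        # M-step: maximize the expected complete-data log-likelihood over pi.
        pi = resp.mean()
    return pi

em_mixing_weight(x)   # close to the true weight 0.4
```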

Quasi-Monte Carlo EM (QMCEM)
With the EM algorithm, the E-step can involve the estimation of a high dimensional integral that is analytically intractable. As a solution, the Monte Carlo EM (MCEM) algorithm was proposed, which estimates the intractable integral with an empirical average based on simulated data [18]. Most commonly, the simulated data are generated from random draws of the conditional distribution required by the E-step. By the law of large numbers, this estimated integral can be made highly accurate by increasing the number of simulations. This method usually requires a large number of iterations [17].
The efficiency of MCEM can be improved with simulation strategies that produce high accuracy while requiring fewer simulations. Purely random simulations are inefficient because they fail to explore the sample space evenly [19]. These limitations have led to the development of deterministic methods known as Quasi-Monte Carlo (QMC).
Monte Carlo sampling selects random uniformly distributed values across the sample space and then approximates the integral with the empirical average. Quasi-Monte Carlo methods instead select the points deterministically, producing a sequence of integration points that provides the best possible spread in the sample space. These sequences are referred to as low-discrepancy sequences [17].
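For instance, the base-2 Halton (van der Corput) sequence is a classic low-discrepancy sequence (offered here as a generic illustration; the study does not state which sequence its software uses). The sketch below shows how its points fill the unit interval far more evenly than the same number of random uniform draws.

```python
def halton(n, base):
    """First n points of the one-dimensional Halton low-discrepancy sequence."""
    seq = []
    for i in range(1, n + 1):
        f, r = 1.0, 0.0
        # Reverse the base-`base` digits of i across the radix point.
        while i > 0:
            f /= base
            r += f * (i % base)
            i //= base
        seq.append(r)
    return seq

# Successive points repeatedly bisect the largest remaining gap in (0, 1).
halton(8, 2)   # [0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875, 0.0625]
```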
In QMCEM, the analytically intractable expectation from EM is replaced by the empirical average: the QMCEM updated estimates maximize the data log-likelihood conditional on the observed data and the current parameter values. Laplace importance sampling is used to find an importance sampling distribution with mean (\mu(\xi)) and variance (\Sigma(\xi)) that match the curvature of the density of the unobserved data, conditional on the observed data and estimated parameters. This Laplace sampling distribution is then sampled based on the chosen low-discrepancy sequence, and each of these draws is shifted and scaled by \mu(\xi) and \Sigma(\xi), resulting in independent randomized sequences [17]. QMCEM overcomes the EM estimation problems associated with increasing dimensionality by replacing the quadrature points with the sampling distribution generated by the QMC process.

Metropolis-Hastings Robbins-Monro (MHRM)
The Robbins-Monro (RM) algorithm is a root-finding algorithm for noise-corrupted regression functions. MHRM extends it to multiparameter problems with stochastic augmentation of missing data. Each iteration follows three stages: (a) stochastic imputation, in which m sets of complete data are generated with the Metropolis-Hastings (MH) sampler; (b) stochastic approximation, in which the approximation of the conditional expectation of the complete data information matrix is computed; and (c) Robbins-Monro update, in which the new parameters are obtained with the RM algorithm. MHRM avoids the curse of dimensionality, as the stochastic imputation replaces the deterministic Gaussian quadrature from the EM algorithm [16,20].
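The RM update at the heart of stage (c) can be sketched for a one-parameter problem: finding the root of a regression function that can only be evaluated with noise (a hypothetical target, not the MIRT likelihood).

```python
import numpy as np

rng = np.random.default_rng(0)

def robbins_monro(noisy_g, x0=0.0, iters=5000):
    """Find the root of g(x) = E[noisy_g(x)] using only noisy evaluations."""
    x = x0
    for t in range(1, iters + 1):
        gain = 1.0 / t          # decreasing gains: sum diverges, sum of squares converges
        x = x - gain * noisy_g(x)
    return x

# g(x) = x - 2 observed with standard normal noise; the root is x = 2.
noisy_g = lambda x: (x - 2.0) + rng.normal(0.0, 1.0)
robbins_monro(noisy_g)   # approaches 2
```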
Previous research has shown that MML has issues estimating multidimensional IRT models, as it becomes computationally intensive and often intractable with a large number of factors [20,21,23]. MHRM was designed as a better alternative to MML and can provide parameter estimates for high dimensional models, models with a large number of items, or models with missing data.

Structural Equation Modeling (SEM) Approach
Structural equation modeling (SEM) approaches traditionally use limited information estimation methods. While there are several options, here we describe the most commonly used method when the items are categorical. When the indicators are continuous, the product-moment-based variance-covariance matrix serves as the input data, as the model intends to replicate the indicator relations represented by this matrix. For categorical indicators, this is made possible by the assumption that each ordinal observed variable arises from an unobserved continuous distribution, the latent response distribution (LRD, y*) [24]. The relation between an observed categorical variable, y, and its respective latent response variable is represented by

y = c \quad \text{if} \quad \tau_c < y^* \leq \tau_{c+1}. (3)

As the latent value increases to the point that a discrete threshold (\tau) is crossed on the latent response variable (y^*), the observed discrete categorical value of y changes. Once this relation has been established, SEM models follow the same general structure as with continuous indicators. For example, the model-implied covariance matrix for the models considered here can be written as

\Sigma^*_{xx}(\xi) = \Lambda^*_x \Phi^*_{\zeta\zeta} \Lambda^{*\prime}_x + \Theta^*_{\delta\delta}, (4)

which is similar to that used in confirmatory factor analysis. Specifically, the implied variance-covariance matrix (\Sigma^*_{xx}(\xi)) is a function of the factor loading matrix (\Lambda^*_x), the latent variances and covariances (\Phi^*_{\zeta\zeta}), and the diagonal matrix of unique variances (\Theta^*_{\delta\delta}). When the indicators are binary, Equation (4) is equivalent to the multidimensional 2PL IRT model in Equation (2).
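The model-implied covariance structure can be made concrete with a small numerical sketch (hypothetical loadings for a two-factor, between-item structure; unit latent response variances are assumed).

```python
import numpy as np

# Hypothetical two-factor between-item model: items 1-2 load on factor 1,
# items 3-4 on factor 2, with factor correlation 0.4.
Lam = np.array([[0.7, 0.0],
                [0.6, 0.0],
                [0.0, 0.8],
                [0.0, 0.5]])
Phi = np.array([[1.0, 0.4],
                [0.4, 1.0]])
common = Lam @ Phi @ Lam.T
Theta = np.diag(1.0 - np.diag(common))   # unique variances giving unit y* variances
Sigma = common + Theta                   # model-implied covariance of the y*
```

With standardized latent responses, the off-diagonal entries of `Sigma` are the model-implied polychoric correlations (e.g., items 1 and 2, on the same factor, correlate 0.7 * 0.6 = 0.42).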
As the product-moment variance-covariance matrix is not appropriate for categorical indicators, a common alternative is to use the polychoric correlation matrix, which captures the linear association between two continuous latent response variables underlying the categorical observed variables [24]. Even with the polychoric variance-covariance matrix, which represents the relations among the latent response variables, traditional ML estimation is not appropriate. The alternative has been to use a distribution-free estimator from the weighted least squares (WLS) family, which incorporates kurtosis into the estimation [6,24,25].
The WLS estimator still presents problems as sample size decreases and model complexity increases [6], requiring the use of robust methods for it to work efficiently. The common robust WLS methods are referred to as diagonally weighted least squares (DWLS). Rather than inverting the full weight matrix, these methods invert only the diagonal elements of the weight matrix. Computation of the asymptotic covariance matrix of the estimated parameters uses both the diagonal and the full asymptotic variance-covariance matrix of the polychoric input matrix, but does not use the inverse of the full matrix [6,12]. Of the different robust DWLS methods, we chose to include the mean- and variance-adjusted WLS (WLSMV), which uses DWLS for parameter estimation and adjusts the fit indices and standard errors by the full weight matrix. This method provides robust standard errors, parameter estimates, and fit indices, and it has been shown to recover population parameters in previous research [6].
As previously discussed, this method presents results in the SEM parameterization, as in Equation (4), while the other estimation techniques considered represent the model in the IRT parameterization. Further, the SEM is estimated with a probit link function, while the IRT models use the logit link function. To set the parameters in the same metric as in Equation (2), we transformed the SEM parameters into the IRT metric, where \lambda_j represents the factor loading, \tau_j represents the estimated threshold for item j, and a_j and b_j represent the IRT parameters from Equation (2), noting that each item has only one free slope and intercept because each item "loads" on only one factor; e.g., [7,26].
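A standard version of this conversion, sketched below under the assumption of standardized (unit-variance) latent responses, is a_j = \lambda_j / \sqrt{1 - \lambda_j^2} and b_j = \tau_j / \lambda_j in the probit metric; a scaling constant of roughly 1.7 is often applied to approximate the logit metric. The exact expressions used in the study are at its OSF site, so treat this as an illustrative sketch.

```python
import math

def sem_to_irt(lam, tau, logistic=False):
    """Convert a standardized SEM loading/threshold pair to IRT discrimination/
    difficulty, assuming unit latent response variance (probit metric by default)."""
    a = lam / math.sqrt(1.0 - lam ** 2)
    b = tau / lam
    if logistic:
        a *= 1.7   # common scaling constant to approximate the logit metric
    return a, b

sem_to_irt(0.6, 0.3)   # approximately (0.75, 0.5)
```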

Markov Chain Monte Carlo (MCMC)
Bayesian inference is commonly performed with Markov chain Monte Carlo (MCMC) simulation. MCMC is a general method based on drawing values of all parameters \xi from sampling distributions and adjusting the draws to approximate the target posterior distribution (p(\xi|y)) [27]. The method samples sequentially from a Markov chain, where the distribution of each draw depends on the last value drawn. MCMC is used when it is not possible to sample \xi directly from p(\xi|y); instead, MCMC samples iteratively, such that at each step the draws come from distributions closer to p(\xi|y). With MCMC estimation, a Markov process is created whose stationary distribution is p(\xi|y), and the simulation must run long enough for the distribution of the current draws to be close enough to this stationary distribution [27].
There are different MCMC methods used to sample from p(ξ|y), such as the Gibbs sampler, the Metropolis-Hastings algorithm, and the Hamiltonian Monte Carlo (HMC) algorithm [27][28][29][30]. Here, we expand on the Hamiltonian No-U-Turn sampler (NUTS), as this is the one implemented by Stan [31].
The No-U-Turn sampler (NUTS) is an extension of the Hamiltonian Monte Carlo (HMC) sampler. HMC improves posterior sampling by adding a momentum variable to the MCMC process: for each parameter \xi_j, there is a momentum variable r_j, and during the sampling process both \xi_j and r_j are updated simultaneously. The momentum variable r_j defines the jumping behavior of the sampling. If the Gibbs sampler is characterized by a random walk, HMC can be described as a hurdles runner: it runs when it can and jumps when it finds an obstacle or running is not enough. This jumping behavior allows HMC to explore the parameter space in fewer iterations than the Gibbs sampler.
A limitation of HMC is that it requires tuning of three parameters for the momentum to work optimally. The No-U-Turn sampler eliminates the need to hand-tune the HMC parameters: NUTS adjusts them algorithmically to maximize parameter space exploration and efficiency, performing as well as fine-tuned HMC without the need for user-specified parameters. For details about the NUTS sampler, see [30].

BMIRT Specification
The Bayesian MIRT (BMIRT) model specification for Equation (2) requires us to define the respective priors. The factors follow a multivariate normal distribution, \theta_i \sim N(0, \Sigma_\theta), where the factor means are fixed at 0 and the variances at 1. With the model identified by fixing the factor means and variances, the matrix \Sigma_\theta estimates the correlations between factors. The correlations had the prior \Sigma_\theta \sim LKJ(1), which is uniform over the space of permissible correlation matrices [32]. Each item's threshold followed a standard normal prior, b_j \sim N(0, 1). The slopes had a standard normal prior on their log transformation, \log(a_j) \sim N(0, 1). Factor indeterminacy was addressed by constraining the slope for the first indicator of each factor to be positive (\log(a_{1k}) > 0), which establishes the direction of the factor scores (example Stan syntax can be found at the project OSF site https://osf.io/9drne/, accessed on 1 August 2021).

Previous Research Comparing Estimation of MIRT Models
Previous research has compared methods with real data examples [10]; however, larger simulation studies are limited. Some have considered only SEM approaches [6,11,12], whereas others have evaluated the performance of only one MCMC (Gibbs) sampler [8]. Studies considering a larger number of estimation methods tend to include only a small number of factors [9,13], while some research focuses on comparing estimation when the factors are non-normal [14,15]. These previous studies either considered a small number of estimation methods or, when a larger number of estimation methods was considered, the number of simulation conditions was small, with few factors and mostly large samples. We see the same pattern in previous research on estimation methods in factor analysis for categorical indicators, which compares a small number of estimation methods across a small number of factor models [6,11,[33][34][35][36][37]].

Simulation Study
The simulation compared the estimation methods across sample sizes (N = 100, 300, 500, 1000), numbers of factors (D = 1, 2, 4, 6, 8), and numbers of items per factor (n_it = 5, 10). This resulted in a total of 40 conditions, which were run for 500 replications each, so a total of 20,000 data sets were analyzed with the five compared estimation methods. Our simulation design thus compared a variety of estimation methods across a large range of sample sizes and model sizes (number of factors and items per factor).
The population values were randomly drawn from probability distributions. The correlations between factors were drawn from a uniform distribution between 0.2 and 0.6 (\rho_\theta \sim U(0.2, 0.6)). The item slopes were drawn from a uniform distribution between 0.4 and 3 (a \sim U(0.4, 3)), and the threshold parameters were drawn from a standard normal distribution (b \sim N(0, 1)).
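A sketch of this data-generating process follows (in Python rather than the R used in the study; the `generate_data` helper and its exchangeable pairwise-correlation construction are our illustrative simplifications).

```python
import numpy as np

rng = np.random.default_rng(2021)

def generate_data(N, D, n_it):
    """Simulate between-item 2PL data with the population distributions described above."""
    J = D * n_it
    # Draw pairwise factor correlations; a matrix built this way is not
    # guaranteed positive definite for many factors (fine for this small sketch).
    R = np.eye(D)
    for k in range(D):
        for l in range(k + 1, D):
            R[k, l] = R[l, k] = rng.uniform(0.2, 0.6)
    theta = rng.multivariate_normal(np.zeros(D), R, size=N)
    a = rng.uniform(0.4, 3.0, J)
    b = rng.normal(0.0, 1.0, J)
    item_factor = np.repeat(np.arange(D), n_it)   # between-item: one factor per item
    z = a * (theta[:, item_factor] - b)
    u = (rng.uniform(size=(N, J)) < 1.0 / (1.0 + np.exp(-z))).astype(int)
    return u, R, a, b

u, R, a, b = generate_data(200, 2, 5)   # smallest N, two factors, five items each
```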
The estimation methods were evaluated by convergence rate, bias, absolute bias, variability, coverage of the 95% CI, and the proportion of explained variance in bias by the simulation conditions. Convergence rates allow us to identify under which conditions each method fails to achieve proper solutions. Bias (\hat{\xi} - \xi) evaluates the mean distance between the estimated parameter (\hat{\xi}) and the true population value (\xi). Absolute bias (|\hat{\xi} - \xi|) considers the absolute value of bias, not allowing positive and negative bias to cancel out. Given that the population values were drawn from random distributions at each replication, we also evaluated bias at low, medium, and high levels of each parameter; these levels were defined by dividing the density distributions into three quantiles. The variability of the estimated parameters was measured by the standard deviation of the bias, looking at the variability of each estimation method around the true population values. CI coverage measures the proportion of replications in which the 95% CI includes the respective population value. Finally, to examine condition effects, we evaluated the proportion of explained variance (\eta^2) in bias across our simulation conditions. For \eta^2, we considered a condition to have a relevant effect on bias if \eta^2 \geq 0.1, indicating that at least 10% of the variance is due to the respective condition.
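These criteria are straightforward to compute per replication; the sketch below (a hypothetical helper assuming symmetric Wald-style CIs) shows their conceptual form.

```python
import numpy as np

def evaluate(est, true, se, z=1.96):
    """Bias, absolute bias, variability, and 95% CI coverage for one replication."""
    err = est - true
    bias = err.mean()                          # positive and negative errors cancel
    abs_bias = np.abs(err).mean()              # ... here they do not
    variability = err.std(ddof=1)              # SD of estimates around true values
    lo, hi = est - z * se, est + z * se
    coverage = ((lo <= true) & (true <= hi)).mean()
    return bias, abs_bias, variability, coverage
```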
The frequentist IRT methods (EM, QMCEM, and MHRM) were estimated with the R [38] package mirt [23], the frequentist SEM method (WLSMV) was estimated with the package lavaan [39], and the MCMC-NUTS method was estimated with the general Bayesian software Stan through the package rstan [31,40] (the Stan code is available at the OSF site for this project: https://osf.io/9drne/, accessed on 1 August 2021).
For convergence, the NUTS method was allowed to increase the number of iterations until the potential scale reduction factor (PSRF), also known as the univariate \hat{R} [41], was less than 1.05 [42]. Even if the total number of iterations differed, 5000 samples per chain were always saved, and inferences were made from posterior distributions of 15,000 samples. The frequentist methods were allowed to run for a greater number of iterations than their software defaults, to give them a better chance of finding a converged solution according to their respective algorithms. For all methods, we considered a result an outlier when the bias for the b parameter was larger than 10.
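The PSRF criterion compares within- and between-chain variance; a minimal sketch follows (basic Gelman-Rubin form, without the split-chain refinement that Stan applies).

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor (univariate R-hat) for an (m, n) array
    of m chains with n draws each (basic Gelman-Rubin form, no splitting)."""
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_plus = (n - 1) / n * W + B / n         # pooled posterior variance estimate
    return np.sqrt(var_plus / W)

rng = np.random.default_rng(3)
chains = rng.normal(size=(3, 4000))            # three well-mixed chains
psrf(chains)                                   # near 1, below the 1.05 cutoff
```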

Results
In this section we present the overall results from the simulation. The simulation was run in a Linux-based High-Performance Cluster for computing efficiency, which took 13.4 years of computing time.

Convergence
When looking at the ability of each method to find a final solution across simulation conditions (Table 1), we found that the NUTS MCMC method converged for every data set (\hat{R} < 1.05), without any outliers. It is important to point out that the Bayesian model constrains the parameter estimates to admissible values, so it did not present improper solutions. For WLSMV, 99.5% of the data sets converged on a solution, but several data sets presented outliers; these outliers were more common in conditions with smaller sample sizes and a larger number of dimensions. Finally, 1476 data sets presented Heywood cases (improper solutions) [43,44].

Note. Convergence = percentage of data sets that converged; Outliers b = percentage of converged data sets that presented outliers for the b parameter; Outliers a = percentage of converged data sets that presented outliers for the a parameter; Improper Solutions = percentage of converged data sets that presented improper solutions (Heywood cases); No SE = percentage of converged data sets that failed to calculate standard errors; Included = percentage of data sets included in the following analyses, after excluding non-convergence and outlier results.
In the case of the EM algorithm, we found that the model converged for 57.6% of the data sets. The models with six and eight factors had 0% convergence and the models with one and two factors had 100% convergence, which was expected, as the EM algorithm is more likely to fail with higher dimensionality. For the four-factor model, 87.9% of the data sets converged. Across sample size, the convergence rate increased from 53.8% for N = 100 to 60.0% for N = 1000. In terms of the number of items, with 5 and 10 indicators per factor convergence was 55.2% and 60.0%, respectively. From the converged results, 13 data sets produced outliers, which were more common in the data sets with smaller sample size. Lastly, parameters could be estimated for all data sets, but standard errors (SEs) could not be computed for three of the data sets due to non-positive definite information matrices. This occurred in the condition with five items per factor, four factors, and N = 100.
For QMCEM, 96.7% of the data sets converged; the data sets that presented outliers more commonly had a smaller sample size and a greater number of factors. Convergence rates were 100% for the one-, two-, and four-factor models, 99.2% for the six-factor model, and 84.3% for the eight-factor model. Across sample sizes, the convergence rate was consistently between 95.8% and 97.4%. Across number of items, convergence was 100% for 5 indicators per factor and 93.4% for 10 indicators. Lastly, standard errors could not be computed for 11.0% of the converged data sets. The number of factors affected SEs, as they could not be computed for 53.2% of the eight-factor models, followed by 7.1% of the six-factor models, 1.5% of the four-factor models, and 0% of the one- and two-factor models. Sample size was also related to problems with the computation of SEs, as they could not be computed for 18.7% of data sets with N = 100; when N = 1000, this was true for only 6.9% of data sets. Unsurprisingly, the crossed condition where SEs could not be computed for the highest proportion of data sets was eight factors with N = 100, where SEs could be computed only 78.4% of the time.
For the final estimation method, MHRM, models converged for 97.3% of the data sets; outliers were more common in data sets with a smaller sample size and a greater number of factors. There was no difference in convergence rate across the number of items per factor (5 items: 97.8%, 10 items: 96.7%). For sample size, we see a convergence rate of 89.1% for N = 100, 99.9% for N = 300 and N = 500, and 100% for N = 1000. The same pattern holds across the number of factors: for eight factors we see 89.0% convergence, for six and four factors 97.5% and 99.9%, respectively, and for one and two factors 100%. This leaves a convergence rate without outliers of 97.2%. SEs could not be computed for 13.7% of the converged data sets. Across number of items, there were problems computing the SEs for 11.3% of the 10-item models and 16.0% of the 5-item models. Across number of factors, this increases from 4.9% for the one-factor model to 22.0% for the eight-factor model. Finally, across sample size, SEs could not be computed for 17.2% of models when N = 100, and around 12.5% when N = 300, 500, or 1000.
For the following results, when discussing parameter estimates, we used all the converged results without outliers (column Included in Table 1), but included the models where SEs could not be computed. For results related to variability and coverage of the estimates, we also excluded the data sets in which there were problems computing the SEs.

Effect of Conditions on Bias
When evaluating the effect of the simulation conditions on parameter bias (Table 2), we see that for NUTS only the r parameter bias is affected by sample size (N). Further, N also affects the a parameter bias for EM, QMCEM, and MHRM. Lastly, the only other condition that had an effect on bias was the number of dimensions (D), affecting the r parameter for EM. All other parameters across estimation methods show negligible effects of the simulation conditions. Given that D and N are the only simulation conditions that present a relevant impact on parameter bias, in the following tables we present the results disaggregated by these two conditions, without detail about n_it or the interactions between conditions.

Bias
For average bias (Table 3), the differences between estimation methods for the difficulty b parameter are small, as all of them achieved average bias of zero up to the first decimal (Figure 1). For the factor correlations r, we find that WLSMV results in the lowest bias, followed by NUTS, EM, and MHRM, consistently across conditions (Figure 2). Finally, for the discrimination a parameters, we see the largest bias differences between methods: NUTS and WLSMV result in the lowest bias, followed by EM, while QMCEM and MHRM show the highest average bias (Figure 3). Across simulation conditions, we see that, in general, bias decreases as sample size increases. MHRM presents lower bias as the number of factors increases.

When looking at absolute bias in Table 4, for the factor correlation r, most estimation methods present similar absolute bias (NUTS, WLSMV, EM, and MHRM), while QMCEM consistently presents higher absolute bias. As sample size increases, the bias decreases from around 0.09 to around 0.02. NUTS, WLSMV, and MHRM present consistent absolute bias as the number of factors increases, while QMCEM absolute bias increases with the number of factors.
For the difficulty b parameters, we see that NUTS results in the lowest absolute bias across N and D, while the other methods show similar absolute bias; the only exception is that MHRM presents absolute bias similar to NUTS when the number of factors is larger. As sample size increases, the bias decreases, and with N = 1000 the bias for NUTS and MHRM is similar, while the other three methods stay similar across conditions.

Lastly, for the discrimination a parameter, NUTS again shows the lowest absolute bias, followed by WLSMV and EM, with QMCEM and MHRM showing similar absolute bias. For all methods, bias decreases as sample size increases. With larger sample sizes, WLSMV and MHRM absolute bias becomes similar to NUTS, while EM and QMCEM consistently present larger bias. As the number of factors increases, MHRM presents lower bias, though NUTS presents the lowest bias across D (plots with the bias and the 90% CI for each method across simulation conditions are presented on the OSF site for this project: https://osf.io/9drne/, accessed on 1 August 2021).

As we simulated population parameters from continuous distributions, we looked further into the bias across different ranges of the parameter values. We split the population parameters into three quantiles, presenting bias across low, medium, and high ranges. As the discrimination parameter followed the distribution a \sim U(0.4, 3), the low range was values lower than 1.258, the medium range was from 1.258 to 2.116, and the high range was values greater than 2.116. As the difficulty parameter followed the distribution b \sim N(0, 1), the low range was values lower than -0.439, the medium range was between -0.439 and 0.412, and the high range was values greater than 0.412. As the factor correlations followed the distribution r \sim U(0.2, 0.6), the low range was values lower than 0.332, the medium range was between 0.332 and 0.464, and the high range was values greater than 0.464.
The average bias across ranges of population values and sample sizes is presented in Table 5. Similar patterns appeared as before: as sample size increases, bias decreases. With N = 100, for the a parameter, we see the IRT methods presenting higher bias at higher ranges of the population values. QMCEM and MHRM repeat this pattern as sample size increases, while EM presents similar results at larger sample sizes. WLSMV presents similar bias at different ranges of the population values, except with N = 100, where the medium range presents higher bias. NUTS presents similar bias across the different ranges of the population values.
For the b parameters, NUTS presents similar bias across the different ranges, with the medium range presenting slightly smaller bias. The other methods present higher bias at the low and high ranges and smaller bias at the medium range, and these differences decrease as sample size increases. For the factor correlations, NUTS, WLSMV, EM, and MHRM present similar bias across the different ranges, while QMCEM presents higher bias at higher ranges.
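As a concrete illustration of the accuracy summaries used in these results, the sketch below computes average (signed) bias, average absolute bias, and the standard deviation of the estimates around the population value for one parameter. The true value and estimates are made-up numbers, not output from the simulation.

```python
import math

# Hypothetical estimates of one parameter across five replications (illustrative only)
true_value = 1.5
estimates = [1.42, 1.61, 1.48, 1.55, 1.39]

errors = [est - true_value for est in estimates]

# Average (signed) bias: positive and negative errors cancel
avg_bias = sum(errors) / len(errors)

# Average absolute bias: magnitudes of the errors, no cancellation
abs_bias = sum(abs(e) for e in errors) / len(errors)

# Spread of the estimates around the population value (one common
# operationalization of variability around the true parameter)
sd_around_true = math.sqrt(sum(e**2 for e in errors) / len(errors))

print(avg_bias, abs_bias, sd_around_true)
```

Note how the signed average (−0.01 here) can be near zero even when individual errors are sizable, which is why absolute bias gives a complementary picture.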

Variability
When looking at the estimated parameter variability, Table 6 presents the standard deviation of the estimated parameters around their population values. In all cases, as sample size increases, variability decreases. Across sample sizes, we see that NUTS presents the smallest variability. For the factor correlations, the other methods present similar variability, except that QMCEM presents higher variability at larger sample sizes. For the a parameters, WLSMV presents lower variability than the IRT methods, but this difference decreases as N increases. For the b parameters, in contrast, the IRT methods present smaller variability than WLSMV.
Across the number of factors, for the factor correlations, NUTS, WLSMV, and EM present smaller variability in most cases; as the number of factors increases, MHRM's variability decreases, while QMCEM's variability increases. For the a parameter, NUTS presents the smallest variability, followed by WLSMV; EM and QMCEM have similar levels, and MHRM's variability goes from the highest in the one-factor model to being close to WLSMV with eight factors. Lastly, for the b parameters, NUTS presents the smallest variability, consistently across the different numbers of factors; the IRT methods present similar levels of variability, and WLSMV presents the largest variability (plots comparing the estimation methods' variability across simulation conditions are presented on the OSF site for this project: https://osf.io/9drne/, accessed on 1 August 2021).

We were also interested in whether each of the methods would lead users to the correct inference: that is, whether the parameters from the true population model fall within their respective 95% CI. Table 7 shows the average percentage of times that the respective 95% CI included the population value. Overall, NUTS has the highest coverage, with an average of at least 95.1%, and as high as 99.1%, across all sample sizes and numbers of factors. It is followed by WLSMV, with an average coverage of at least 90.9% and up to 98.5% across all sample sizes and numbers of factors. All the IRT methods presented low coverage for the b parameter, around 62% for small sample sizes, increasing to around 72% with larger samples; they consistently show similar coverage (around 70%) across different numbers of factors. For the a parameter, the IRT methods show similar coverage across sample sizes (around 92%), except that with large sample sizes QMCEM and EM present slightly lower coverage, at 85% and 88%, respectively.
Across different numbers of factors in the model, their coverage is high (around 93%) with one or two factors, but as the number of factors increases, coverage decreases, down to 79% and 86.8% for QMCEM and MHRM, respectively. For the r parameter, we see that EM and QMCEM start with high coverage at small sample sizes and decrease as sample size increases, while MHRM starts with low coverage at small sample sizes and increases as sample size increases. Across different numbers of factors, all IRT methods have high coverage with a small number of factors, and it decreases for all of them as the number of factors increases, going from around 95% to around 78%.
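The coverage figures reported in Table 7 amount to counting, over replications, how often each estimated interval contains the population value. A minimal sketch, using hypothetical intervals rather than simulation output, is:

```python
# Hypothetical 95% CIs (lower, upper) for one parameter across five replications
true_value = 0.40
intervals = [(0.35, 0.45), (0.38, 0.52), (0.30, 0.39), (0.41, 0.55), (0.33, 0.47)]

# Count the intervals that contain the population value
covered = sum(1 for lo, hi in intervals if lo <= true_value <= hi)
coverage = 100 * covered / len(intervals)

print(f"95% CI coverage: {coverage:.1f}%")  # 3 of 5 intervals cover the truth
```

A well-calibrated method should produce coverage near the nominal 95%; values well below that, as observed for the b parameter under the IRT methods, indicate intervals that are too narrow or mislocated.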

Discussion
In this paper, we conducted the largest between-item multidimensional IRT simulation of which we are aware, comparing popular estimation methods in terms of convergence and parameter recovery. The present project adds to the literature on estimation methods for latent variable models with binary indicators by studying the performance of a variety of estimation methods across a large number of simulation conditions, whereas previous research focused on either a small number of estimation methods or a small number of simulation conditions. Further, the results present a detailed comparison across sample sizes and numbers of factors in the model [6,8–11,13,33–37].
We found that the NUTS MCMC estimation method was the most effective, showing the highest rates of convergence across conditions and presenting no improper solutions or outliers. BMIRT models constrain the estimated parameters to admissible values, presenting an advantage over their frequentist counterparts. The frequentist estimation methods can lead to nonconvergence or improper solutions, most commonly with small sample sizes and large models.
For the IRT full information estimation methods, the EM algorithm works well when the models have a small number of factors (four or fewer), but it failed to converge with larger models. When the model includes more factors, the best-performing method was MHRM, as it overcomes the EM limitations by using an MCMC sampling method instead of fixed quadrature points. In some cases, however, MHRM resulted in outliers or was unable to compute standard errors.
The limited information method (WLSMV) showed the second-highest rate of convergence, and it resulted in outliers or improper solutions (Heywood cases) in some cases; these improper solutions and outliers occurred more commonly with small samples and large models. WLSMV's performance was otherwise not affected by either sample size or model size, indicating that when a converged model does not present an improper solution, WLSMV is a reliable estimation method.
The NUTS MCMC estimation method for Bayesian inference presented the best convergence rate, and when allowed to increase iterations it always found a proper solution (under the simulation conditions considered). This is expected behavior given the sampler's ability to handle complex and large parameter spaces [30]. The use of priors allowed the model to avoid improper solutions and outliers, as they limit the parameter space based on the model characteristics rather than the sample characteristics (Equation (2)). For example, the difficulty b parameter was on the same metric as the underlying factors (N(0, 1)), the discrimination a parameter was on the log-normal metric, which is not expected to reach high values, and the factor correlations had a limited parameter space (uniform). In this way, NUTS takes advantage of this known information to facilitate estimation.
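Schematically, a prior specification of the kind described (the exact specification is given in Equation (2); the log-normal hyperparameters $\mu_a, \sigma_a$ and the uniform bounds $l, u$ below are unspecified placeholders, not values from the paper) takes the form:

```latex
b_j \sim \mathcal{N}(0, 1), \qquad
a_j \sim \mathrm{LogNormal}(\mu_a, \sigma_a), \qquad
\rho_{kl} \sim \mathrm{Uniform}(l, u),
```

so that each prior restricts its parameter to a plausible region: difficulties on the factor metric, discriminations positive with low mass at extreme values, and correlations within fixed bounds.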
Regarding parameter bias, the NUTS method consistently presented the lowest bias across the different parameters, followed by WLSMV. The EM algorithm presented the lowest bias among the IRT full information methods but failed to converge for high-dimensional models. While MHRM presented bias similar to NUTS and WLSMV for the b parameter and factor correlations, it showed higher bias for the a parameter. When looking at the inferences that would be drawn from the results of each estimation method, we find that NUTS and WLSMV have the highest CI coverage of the true population value, above 90% for all parameters. While the IRT methods present high coverage for the a and r parameters, all of them present low CI coverage for the b parameter.

Limitations and Future Research
As with any simulation study, we could not include all possible conditions. We focused on between-item MIRT models, but many applied studies will also be interested in within-item MIRT models; more research is needed to evaluate whether the estimation methods behave similarly to what we present in this project. Another limitation is that we focused on the estimation methods as implemented in the mirt R package for the frequentist IRT models, the lavaan R package for the WLSMV model, and the general Bayesian program Stan for the NUTS MCMC sampler. Other IRT software may have better-performing optimizers, leading to higher rates of convergence; however, we would expect the bias and overall performance to be similar across software programs.
Based on what researchers often see in practice, we evaluated the estimation methods with up to eight-factor models, considering this a large number of factors for applied research in general. It would be useful to extend this research to even larger models and find what limit (if any) there is on the number of factors that can be estimated with the best-behaving methods from this project (NUTS, WLSMV, and MHRM).
The present research focused on between-item compensatory multidimensional IRT models; further research is needed to evaluate the estimation methods with within-item and noncompensatory MIRT models.

Recommendations and Conclusions
In summary, for models with few factors (and a large enough sample size), the estimation methods examined in this paper perform similarly. As the number of factors increases, the NUTS MCMC method is the most recommended, as it achieves convergence without yielding improper results and presents consistent results across different sample sizes. Within the frequentist framework, WLSMV was the best-performing method, but researchers should be aware of possible Heywood cases. Among the full information IRT estimation methods, MHRM is the best-performing method, but researchers should consider its limitations related to the bias of the a parameters and the coverage of the b parameters. We think these results are informative, given the increasing ease with which researchers can develop large assessments and collect large datasets.