Stochastic Expectation Maximization Algorithm for Linear Mixed-Effects Model with Interactions in the Presence of Incomplete Data

The purpose of this paper is to propose a new algorithm based on stochastic expectation maximization (SEM) to deal with the problem of unobserved values when multiple interactions in a linear mixed-effects model (LMEM) are present. We test the effectiveness of the proposed algorithm with the stochastic approximation expectation maximization (SAEM) and Monte Carlo Markov chain (MCMC) algorithms. This comparison is implemented to highlight the importance of including the maximum effects that can affect the model. The applications are made on both simulated psychological and real data. The findings demonstrate that our proposed SEM algorithm is highly preferable to the other competitor algorithms.


Introduction
The time between the presentation of a stimulus and a participant's motor response is the oldest and most widely used measure for exploring the functioning of the human mind. In 1869, ref. [1] theorized this duration, called reaction time (RT), as involving three sets of activities: perceptual mechanisms, cognitive processing and motor preparation. Based on the assumption that the first and last sets of processing can be considered as having virtually identical duration for the same task, any change in RTs between two experimental conditions is then interpreted as indicating a change in the duration of cognitive processing. RT is then considered by psychologists as a tool to explore cognitive processing mechanisms ( [2]).
Psycholinguistics research on the cognitive mechanisms involved in language recognition or production frequently uses RT as a measure of behavior. Like all scientific disciplines, psycholinguistics relies on hypothesis testing to support its theoretical propositions. Since the early 2000s, researchers have taken up linear mixed-effects models (LMEMs). As described by [3], a LMEM allows for the proper consideration of one of the characteristics of psycholinguistic experiments: the presence of two random-effect variables. Indeed, the experimental structure of the experiments conducted in this field involves having a group of participants process a set of stimuli (see hereafter). Thus, the statistical analysis must allow the inclusion of these two random-effect variables, i.e., participants and items, in the structure of the model. The introduction of LMEM was thus an important methodological advance for psycholinguistics.
There is, however, one point that has not received much attention from psycholinguists. Experimentation with human beings is often subject to many vagaries. Imagine researchers whose goal is to understand how an adult retrieves the spelling of a word from their memory ( [4]). They ask a group of thirty participants to write down the names of 150 images presented on a computer screen. A device allows them to measure the time between the appearance of the image on the screen and the first writing gesture of the participant. These researchers can potentially collect 4500 RT values. However, between trials lost for technical reasons, spelling errors, and certain habits during data analysis (e.g., censoring of data greater than two standard deviations from the mean, right-censoring data), a non-negligible number of data are removed. For example, ref. [4] reported removing just over 20% of the data.
The issue of missing and censored data has received relatively little attention in psycholinguistics (see, however, [5,6]). This is especially true since the introduction of LMEMs. Psycholinguists assume that these models can be run on a sample of data with "holes". The analysis strategy is then of the "keep the case empty" type, ignoring the bias that this introduces in the estimation of the parameters of the model and thus of the decisions taken. The objective of this work is thus to develop algorithms to manage the presence of missing and censored data in these psycholinguistic experiments. Two points are to be taken into account. On one hand, experimental designs in scientific psychology may involve assumptions about interactions between two or more fixed-effect variables. On the other hand, researchers suggest, for theoretical reasons, that all interactions between fixed-effect variables and random-effect variables made possible by the model should be included ( [7]). The potential presence of these two types of interactions (fixed-fixed variables; random-fixed variables) are constraints on the development of the missing data procedure.
Let us first recall that the LMEM ( [8]) is an extension of the simple linear model that allows both fixed and random effects to be represented. In 1861, the LMEM was introduced under the name of a one-way random-effects model ( [9]), that is, a model with one random variable and without any fixed variable. From 1990 and onward, ref. [9] underlined that LMEMs became popular in many research applications including economics, sociology, insurance, agronomy, epidemiology, genetics, etc. They are used in longitudinal data analysis, multilevel modeling and small estimations. The analysis of this type of models is presented in [10]; for more literature about the methodology, theoretical results and software, see the books [11][12][13][14]. LMEMs are fitted and analyzed in R by using the package lme4 or lmertest ( [15]).
In [16], the author presented the fixed interactions between two factors in an LMEM, using the maximum product interaction F-test. Ref. [17] was interested in the sample size of an LMEM. He proposed a formula to estimate the sample size based on testing a mixed model that contained fixed interactions. In 2019, ref. [18] proposed an estimation method to recover the principal effects and interactions, because the existing method did not allow the integration of these effects in a mixed data frame. They approved that the proposed method gave optimal results to their applications' conditions. On the same side, ref. [19] presented the estimation of the fixed interactions. These interactions are normally introduced by the product of the variables, but the algebraic transformations revealed that this technique did not produce a within-unit estimator. A Monte Carlo method confirmed that the FE (fixed-effects) estimator of (x, z) was biased, if one of these variables was correlated with another one. In order to present the interaction between x and z, it is possible to use the current syntax x * z. This consideration is called "double-demeaned"; it is less efficient than the standard FE and only works with T < 2. For the application in R, the Markov chain Monte Carlo method is applied using the package mcmc. Ref. [20] introduced the MCMCpack package that contains functions to perform Bayesian inference using a posterior simulation for a number of statistical models (https://CRAN.R-project.org/package=MCMCpack (accessed on 14 Jun 2011)), while [21] presented the package MCMCglmm (https://cran.rproject.org/web/packages/MCMCglmm/index.html (accessed on 2 February 2010)) for the MCMC method to fit the generalized linear mixed-models for multiple response vectors. This method was also used by [22] to identify unknown parameters in the biological field, such as detecting the concentration of target molecules, because of the importance of this method for extracting the information. The MCMC method was applied in the work of [23] to solve mechanical problems by using a Bayesian method.
Ref. [24] presented five types of LMEMs by introducing two random variables (see Appendix A.1), where the models were compared to show the importance of including the maximum effects that could affect the model.
One of the solutions that handles the missing data can be the expectation-maximization (EM) algorithm. It is a very useful algorithm for the estimation of the maximum likelihood function. This method can be a solution when the only data available do not allow the estimation of the parameters ( [25]), and/or the expression of the likelihood is analytically impossible to maximize.
In other words, it aims to provide an estimator when the problem comes from the presence of missing data. When the data are incomplete, the knowledge of these values would make it possible to estimate the parameters. The EM algorithm takes its name from the fact that at each iteration, it operates two distinct steps: Step E: This step is called the expectation (E) step, where we are interested in finding the expected value of the unobserved or unknown variables given the observed data and the value of the parameters.
Step M: This step is called the maximization (M) step; in this step, we maximize the expected log-likelihood by using the estimation of the unknown data carried out in the previous step. These parameter estimates are then used to determine the distribution of the unknown variables in the next iteration.
At some points, the expectation or the maximization steps are impossible to apply directly ( [26]); from there, the use of an extension form of EM is useful, such as MCEM, where the (E) step is replaced by a Monte Carlo simulation, or SAEM, where the (E) step is replaced by a stochastic approximation. SAEM was a solution of nonlinear mixed-effects models (NLMEM). Ref. [27] proposed a new methodology for maximum likelihood estimation in mixtures of nonlinear mixed-effects models. The resulting MSAEM (mixture SAEM) algorithm is now implemented in the Monolix software tool.
The aim of this paper was to perform the SEM algorithm, under an LMEM by including two types of incomplete data (the censored and the MAR types) and by taking into consideration for the first time the interactions, where our proposed model contains the interactions between fixed variables and fixed-random variables. This document is organized as follows: In Section 2, we present three algorithms based on the expectation-maximization (EM) method, the first one is called the SAEM algorithm, the second is our proposed SEM algorithm and the third one is based on the MCMC method. We also present the incomplete data types, divided into missing data and censored one. In Section 3, we define the proposed model with some specific cases. In Section 4, we introduce a method to achieve the convergence. In Section 5, we compare the results obtained from simulated psychological and real data. In Section 6, we conclude the proposed study with some perspectives and future directions.

EM Algorithm
As previously said, the EM algorithm is a widely used algorithm in the case of incomplete data; in this situation, the maximum likelihood function is difficult or impossible to use to estimate the parameter vector θ of the considered model. We formalize directly an iteration from which we can understand clearly how this algorithm works: − Let y = (y 1 , . . . , y n ) be the independent and identically distributed (i.i.d.) observations of likelihood p(y|θ). − The maximization of log p(y|θ) is impossible. − We consider hidden data z = (z 1 , . . . , z n ) which make the maximization of the likelihood of the complete data possible when known.
− As we do not know these data z, we estimate the likelihood of the complete data by taking into account all the known information so the estimator is given as follows (E step): where θ k−1 is the vector of the parameters at iteration k − 1. − Finally, we maximize this estimated likelihood to determine the new value of the parameter (M step). Thus, the transition from iteration k − 1 to iteration k in the algorithm consists in determining the parameters vector at iteration k, θ k : where θ 0 is chosen arbitrarily. When one of these two steps are impossible to complete, we can consider an extension form of the EM algorithm such as the SAEM, SEM or MCEM algorithms. In the next subsection, we derive the SAEM algorithm into the formula of the EM algorithm.

SAEM Algorithm
The stochastic approximation expectation maximization (SAEM) algorithm was proposed by [26], in which the E step was replaced by a stochastic approximation. The stochastic approximation algorithm was first introduced by [28] and also used by [29], where the algorithm generates iterates of the form: where γ k is a sequence of positive step sizes, h is a function of θ, and k is a constant such that k = E[h(θ k )]. From this form, the SAEM algorithm is obtained where the E step of the EM algorithm is divided into two steps; consider the iteration k: First, we sample a realization z k of the latent variable from the conditional distribution (p θ k−1 (z|y)) of z given y, using the value of the parameter θ k−1 at iteration k − 1.
Second, by using the realization z k from the first step, we update the value of Q k (θ|θ k−1 ) (see, (1)) through a stochastic approximation procedure.
Then, the algorithm continues as follows: Initialization step: Initialize θ 0 in a fixed compact set.
Then, for all k ≥ 1, the kth iteration consists in three steps: Simulation step: simulate z k from the conditional distribution p θ k−1 (.|y).

Stochastic approximation step: compute the quantity
where m k is the number of simulations at each iteration.
Maximization step: update the parameter value according to θ k = arg max θ Q k (θ).
The SAEM algorithm is more efficient compared to the MCEM, where, at each iteration, the simulation of the missing values is repeated and the ones obtained previously are not used. In the SAEM, all the simulated missing values contribute to the expectation step and that is the advantage of this algorithm where the maximization step is cheaper than the simulation one. Next, we consider the SEM algorithm to show how it differs from an SAEM.

SEM Algorithm
The SEM approach is a specific case of the SAEM algorithm. In the SEM algorithm, the step size is set to zero γ k = 0 and the number of simulations by iteration is constant (usually, m k = 1).
This method is an extension of the EM algorithm and is a solution when the maximum likelihood is impossible to complete. This algorithm was introduced by [30], where between the steps E and M, they used a stochastic step that consisted in generating a complete sample containing latent variables from the conditional density, based on the observed data. Starting with an initial position θ 0 , an iteration k of the SEM algorithm, where, to each θ k , a θ k+1 is associated, is defined as follows: Step E: Compute the conditional density f (y|x; θ k ); Step S: Draw z k from the conditional distribution, then obtain the complete sample y k = (x, z k ); Step M: Update the parameters θ k+1 by maximizing the likelihood function based on the complete vector y k .
In the next subsection, we consider the MCEM algorithm to show how a Monte Carlo simulation is used in the EM algorithm.

MCEM Algorithm
In 1990, ref. [31] proposed to replace the expectation step to compute Q(θ|θ k−1 ) by a Monte Carlo integration, hence the name MCEM algorithm (Monte Carlo EM). At iteration k, the E step is replaced by the following procedure: Simulation step (S-step): generate m(k) realizations z k (j) (with j = 1, . . . , m(k))) of the missing data vector under the distribution function p(z; θ k ) Monte Carlo integration: compute the approximation of Q(θ|θ k−1 ) according to: The maximization step remains unchanged.
As said before, these algorithms can be used in such a way as to handle the unobserved values of the data that are defined in this next subsection.

Incomplete Data
We can talk about incomplete data when the values in our response vector are not all observed for many reasons; we can also say it is a question without an answer.
These types of data are a common problem recognized by statisticians. Ref. [32] presented many solutions to handle this problem based on simple and multiple imputations. Ref. [33] proposed to handle the incomplete data in a hierarchical model by using the SEM algorithm.
In this work, we are interested in data that contain two types of incomplete values: censored data denoted by Y c and missed data Y miss . In the next subsection, the censored data are considered.

Censoring
Censored data are any data for which we do not know the exact event time. There are two types of censored data: right-censored and left-censored.
Right-censored data: these are data where the event has not yet been achieved when the study is finished.
Left-censored data: these are data when the event is achieved before the study starts. We consider in this paper both censoring types of (left-and right-censoring) data with four percentages (0%, 5%, 10% and 20%). Let t be the censoring level and T the set of censored indices that can be written as ∈ T} the vector of observed response variables and Y c = {Y i 1 ,...,i p+k ,j , (i 1 , . . . , i p+k ) ∈ T} defining the vector of censored response variables. The second type is the missing data, defined in the next subsection.

Missing Data
In general, missing values are produced when a value in the data is not represented for a given variable, for many reasons that can be linked to the objective of the study (for example, participants do not answer the questions). This appears in many research studies, particularly when collecting data and where participants are studied over a period of time.
At first, studies were developed assuming no missing values. In the late 1980s, with the advancement in technology, that problem attracted the attention of many researchers who wished to study several techniques for handling it. Depending on the reasons for their absence, these values could be divided into three types: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR).
In Appendix A.2, we define the types of missing value and the methods to generate these types in a simulation study.
In this paper, we are interested in imputing the MAR type of missing data with four percentages (0%, 5%, 10% and 20%), which are crossed with four other percentages of censoring, resulting in 16 cases of incomplete data. For the missing data, we denote by Y o the vector of observed response variable and Y miss the missed one, so Y = (Y o , Y miss ). In the next section, we present the main results of this work.

The Proposed Model
Motivated by psychological data, in which we typically have two random variables, participants designed by S and items designed by I, we consider a model that contains p covariates {x 1 ; x 2 ; . . . ; x p } and two random variables. The predicted variable, also named the variable of interest, is denoted by Y. Then, we consider the following linear mixed model: with r the residual of the model that follows a normal distribution. We give: By a replacement of these considerations in Equation (2), our model can be rewritten as follows: By taking and denoting by T the transpose matrix, and e jl = e j .
This leads us to rewrite our model as follows: Cov(u, e T ) = 0. Now, by including all the interactions, the general model can be written as: where we take: interaction between fixed-random variables , and e = (S 0,s + I 0,i ) + r residuals .
We set k, the number of interactions, as k = 2 p − p − 1 and let j, l ∈ [1, p + k].
Our random participants variable is given as follows: and the variance-covariance matrix of the participants is written as We consider that the items random variable also follows a normal distribution: then, the variance-covariance matrix of the items is given by: For j, l ∈ [1, p + k], the variance-covariance matrix G is equal to: The goal is to predict our model by determining the parameters δ and u; we propose to use Henderson's linear system ( [34]) based on the maximum likelihood function. This approach is presented in Appendix A.3. The matrix R has the form: where n represents the total number of observations. Therefore, with Z = (z 1 , . . . , z p+k ), the variance of Y is equal to: The linear system equations of Henderson applied in this study is also presented in Appendix A.4. In the next subsection, we present some specific cases.

Specific Cases
In this section, we show how the model can be simplified if we take the following two cases: first, only the fixed-fixed interaction part is presented and second, there is no interaction.

Case 1: Fixed-Fixed Interaction
In this case, we consider the interactions between the fixed variables. Then, in the random part, the u vector is equal to 0: therefore, the V matrix is equal to: For more mathematical development, see Appendix A.4.

Case 2: No Interactions
In this case, there are no interactions between the variables; our model is simplified, where the random part with the fixed-fixed interaction is ignored: The V matrix is reduced to: See the simplified system in Appendix A.4. Next, we present how the SEM algorithm can be applied in this work.

SEM Algorithm
In this paper, we propose to handle missing data and censored problems in the presence of interactions by using the SEM algorithm, proposed by [33]. The stochastic expectation maximization (SEM) algorithm is a method used to estimate the parameters when it is complicated to use the EM algorithm; it is a particular case of multiple imputations (MI).
To understand MI ( [35]), we need to know the idea behind a simple imputation (SI) in which the nonobserved data {Y no } (missed or censored) are replaced by a value and then the parameters are estimated using known methods to maximize the likelihood function. To obtain a robust estimate, the simple imputation can be repeated several times with different values of {Y no } and the results can be combined. This method is called multiple imputations.
We applied the algorithm after crossing the MAR values with the censored ones (see Algorithm 1). In the SEM algorithm ( [30]), the values of {Y no } are drawn from the conditional distribution of the nonobserved data given the observed ones using the current values of the parameters. We generated samples from P Θ (Y no |Y o ) where this distribution was calculated using Gibbs's sampling; the procedure is underlined in Algorithm 1 (the SEM method using Gibbs's sampling was developed in [33]). For more information about the expectation maximization algorithm, see [36]. In the next subsection, we present the extension form of the SEM that leads us to the SAEM algorithm.

SAEM Algorithm
In the SEM algorithm, Θ j+1 (the parameter at state j + 1) depends only on Θ j and Y j . By taking T the operator of the EM algorithm and M, the operator which associates Θ j+1 with Y j via step M, the updated parameter Θ j+1 can be written as follows (see [30]): The SAEM algorithm (see also Algorithm 2) is an extension form of the SEM algorithm. Starting with an initial position Θ 0 , the iteration of the SAEM algorithm, to which for each Θ j , a Θ j+1 is associated, is defined by the equation: where γ j is the step size, which is a decreasing sequence of positive real numbers starting with γ 0 = 1, γ j → 0 when j → ∞. The theoretical and practical performance values of the SAEM algorithm are strongly dependent on the speed of convergence towards 0 and the regular decreasing of the sequence γ j , which explains the importance of choosing the step size.
Ref. [37] performed many tests to see which sequence was more efficient, and they obtained two types of sequences that were significant, one of a slow mode: the other type was of a linear mode: Ref. [38] chose the decreasing sequence from their previous experiments [39]; they took that sequence as That sequence was also chosen by [40,41] in a nonlinear mixed model as 1 at the first iteration and (j − N) −1 during the last iterations.
1 − e (g) )) 7: end for output: Y no obtained from sampled (u (G) ,e (G) ) 8: When applying these two algorithms, we were confronted with a problem of convergence caused by the presence of the interaction between the fixed and random variables. Therefore, in the next section, we present how to handle this problem by using the Hamiltonian Monte Carlo algorithm.

Convergence of Parameters
While introducing the fixed-random effect in the application section, we faced a convergence problem in the SEM and SAEM algorithms when using the standard form of Monte Carlo (MC). In order to solve this problem, we propose to use the hybrid MC algorithm ( [42]), which uses the symmetric Metropolis-Hastings algorithm to accept or reject a proposal based on the Hamiltonian function, hence the name of the algorithm, the Hamiltonian Monte Carlo algorithm (HMC). In the next subsection, we present how to implement the HMC algorithm.

Implementation
The Hamiltonian function H(p, q) is written in term of the joint density π(p, q): which decomposes into two terms: The first term corresponds to the density of the target distribution, the second one is determined by the target distribution when K(p, q) is unconstrained and must be specified by the implementation. The acceptance probability of moving from state (p 0 , q 0 ) to (p i , q i ) is determined using the Metropolis-Hastings algorithm ( [43,44]), with Q(p i , q i |p 0 , q 0 ) being the probability density defining each proposal: Referring to the symmetric Metropolis ( [45]), the acceptance probability is simplified to the form: In this study, we considered that we had evidence of the convergence towards stationarity due to the Markov Chain (see, [45,46]). In the next section, we present some numerical applied results based on a psychological simulation and then real data.

Numerical Experiment
This section aims to compare the SEM algorithm proposed in this paper with other methods in the presence of incomplete data in a complete LMEM. The method was first applied to simulated data, then to real ones.

Simulated Data
The simulation data were created from an experimentally obtained database. In order to explore the cognitive process of retrieval in memory of the spelling of French words, 30 participants had to handwrite the label of 150 drawings of objects, constituting a database of 4500 RTs ( [4]). After removing the RTs corresponding to technical errors, spelling errors, and censoring data greater than 2.5 standard deviations from the mean, 3434 values remained. Ref. [4] performed a linear regression analysis involving nine fixed-effects variables. In the present work, we retained two of them. The first was an estimate of the age at which the image name was acquired in childhood (age of acquisition, AoA hereafter). This factor is one of the most important predictors of picture-naming RTs ( [47]. The second fixed-effects variable corresponded to the number of letters in a picture label (Lett_cat). We performed a median split to obtain a categorical variable. Thus, image labels with a number of letters lower than the sample average were considered as short words and those with a higher number of letters were considered as long words. The interaction between these two fixed-effects variables was included in the model. Finally, there were two random effect variables: items, i.e., the 150 image labels produced by the 30 participants, and the participants themselves. The fixed-random interaction between AoA and the participants was also included in the LMEM. Therefore, the model could be written as follows: In this section, we illustrate the proposed algorithm by comparing the results to the SAEM and MCMC algorithms using different proportions of incomplete data. The comparison was performed by computing the mean absolute error (MAE), which measured the errors between the reference and the predicted vectors (with Θ as the reference value andΘ as the predicted one); the standard equation of the MAE is given by: We also computed the linear correlation between the parameters LCor = Cov(Θ i ,Θ i ) σ(Θ i )σ(Θ i ) and the Spearman rank correlation RCor = 1 − 6 ∑ d 2 n(n 2 −1) , where d 2 is the square of the difference in the ranks of the two coordinates .
In this numerical section, we provide the 16 cases to see the performance of the methods in the absence of one of these two considered types of unobserved values (MAR type/censored data) or in the presence of a low or high percentage. In addition, we used the simulated data of 4500 observations after testing our proposed methods on a larger number of observations. We obtained the same results as those for 4500 values.
After the implementation of our algorithm, we checked the convergence by illustrating the values of the estimated parameters obtained at each iteration; this is presented in Figure 1.
In Figure 2, we plotted the normalized residuals of each case. We observed a similar dispersion between the standard simulated data and the SEM algorithm, while we noticed a small but visible scattering by implementing the SAEM algorithm, and finally, we observed that when we used the MCMC method, the residuals were dispersed far from the normal line.
Our results are visualized quantitatively in Table 1. Based on the MAE value, we can observe that the SEM gave the lowest values in all cases except for the case of 5% MAR and 20% censoring, where the SAEM (8.108) beat the SEM (8.118) by a very narrow margin. By observing the full vectors, the SEM gave the best parameter values in two cases: (0%, 10%) and (20%, 5%). In the other cases, we observed that with a high percentage of MAR (20%), b 2 and b 3 in the SAEM method gave the closest values while for the other parameters, it was the SEM algorithm. In all cases, we see that the MCMC was not a good choice either for the parameters values or for the other measurements. Moreover, we can see that for LCor (respectively, Rcor), the SEM and the SAEM algorithms gave approximately the same results 0.999 (respectively, 1) in all cases. This showed that these two methods had a small difference except for the case of (20%, 5%) where the RCor of the SAEM algorithm decreased from 1 to 0.9285. The MCMC had an RCor ranging between a maximum value of 1 and a minimum value of 0.738; that maximum value was obtained with 5% of MAR and the absence of censored data.
(a) (b) Figure 1. Value of the parameter estimates at each the SEM iteration with a burn-in period M = 10, we separate the parameters according to their values in (a), we plot b0, b1, sig1 and sig2 and in (b), we plot b2, b3, sig3 and sig4.  We marked in bold for each case the closest value to the reference among the three considered methods (the reference vector here was the simulated one that did not contain any missed value).

Real Data
In this section, we applied the three algorithms to the database used to create the simulated data. Of the 4500 RTs recorded during the experiment (30 participants producing the names of 150 items), 115 were removed by right censoring (2.56% of MNAR). Ref. [4] also removed 951 values, creating 21.13% of missing MCAR/MAR data. By applying the three proposed algorithms with some missing data to solve the problem in the presence of fixed and fixed-random interactions, we obtained the results presented in Table 2. The results were compared with respect to the initial vector where the missed values were treated by keeping the cases empty (KE). Interestingly, this is how missing and censored data are handled in psycholinguistic studies.

Conclusions
In this study, we proposed an algorithm based on the stochastic expectation maximization (SEM) to estimate the parameters under a linear mixed-effects model (LMEM) that contained fixed-fixed and fixed-random interactions in the presence of different percentages of incomplete data.
The simulated and real data showed that the proposed SEM algorithm gave the best results compared to other competitors, where the bias of the parameters was smaller than the bias in the SAEM and MCMC algorithms.
The simulation results were obtained by using statistical software R (see Appendix A.5 for some source code).
We plan to extend this approach by considering a more complicated level of interaction such as random-random effects and the interactions between more than two variables. Another direction could be to consider an extension of the present work by considering a generalized model under the logistic regression by taking one for the NA values and zero for the observed ones.
Author Contributions: Conceptualization, C.P. and Y.S.; methodology, Y.S.; software, A.Z.; validation, A.Z., C.P. and Y.S.; formal analysis, A.Z.; investigation, C.P. and Y.S resources, C.P.; data curation, C.P.; writing-original draft preparation, A.Z.; writing-review and editing, Y.S.; visualization, C.P. and Y.S.; supervision, C.P. and Y.S; project administration, C.P and Y.S.; funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript. In the third model, they reduced Equation (A1) by ignoring the items' random slopes (the interaction between the items' fixed and random effects), where w 11 was set to 0: c,i,s ∼ (0, σ 2 ). The fourth model kept the random items' slope and excluded that of the subject, where , c,i,s ∼ (0, σ 2 ). The fifth model excluded the two random slopes (for items and subject), where τ 11 = w 00 = 0: In the next section, we present the different types of missing data and how to generate them using different methods.

Appendix A.2. Missing Data Types and Imputations
Let M be the missing data indicator that defines what is known and what is missing. The response variable is denoted by Y in the complete data, where Y = {Y o , Y no } with Y o is the observed part and Y no is the nonobserved one. Ref. [48] denoted the missing part by Y miss . Consider a dataset of p variables and n subjects, with Y = {Y 1 , . . . , Y p }. Y should look like a matrix of n rows and p columns. In the literature on missing data, see also [49,50].
We show in this section the three types of missing values and how to generate them in a simulation study based on a rule in each case ( [51]).
We start with the first type named MCAR.
The missing completely at random type (MCAR) is a special case of the MAR type that is presented later. In this type, the value is missed due to chance and the absence is not related to the subject, so the distribution of M does not depend on Y o or on Y miss and is identical for all the observations: The statistical advantage of MCAR data is that the analysis remains unbiased, none of the variables is affected more than another.
Suppose that each subject has a probability π of being missed from a variable, the missing data rule is given by P(M = 1) = π. From this rule, we can determine various properties associated with the MCAR data such as the expected percentage of missing values and the expected number of missing data patterns in the MCAR data.
If we are working with univariate patterns, let n be the number of subjects in the data and K the random variable indicating the number of subjects with missing values in the data that follows a binomial distribution given by K ∼ B(n, π), where 0 ≤ π ≤ 1. Since E(K) = nπ and V(K) = nπ(1 − π), the expected percentage of missing values is: where Π = K n is the random variable that defines the estimated percentage of missing values in a sample. the variance of this estimated percentage is given by: In this case, we have two missing data patterns, pattern 1 includes subjects with complete data; pattern 2 includes subjects with missing data. Let I j be the indicator variable of the event, where pattern j is present in at least one subject in the sample with j ∈ {1, 2}, The probability that pattern 1 is present in at least one subject is P( The probability that pattern 2 is present in at least one subject is P( Let D be the number of distinct missing data patterns, D = ∑ 2 j=1 I j ; we can determine the expected value: Therefore, in MCAR data that contain more than two missing data patterns, the expected number of distinct pattern in a sample is given by: with η 1 , . . . , η m are the corresponding probabilities for patterns 1, . . . , m, and depends only on the probability of missing values in each variable i (i.e., π 1 , . . . , π l ). m = 2 l is the number of patterns and l is the total number of variables. Generally, the most used method to implement the MCAR type is randomly deleting the desired percentage of missing values.
The second type is MAR.
Missing at random is the most classic case; Rubin and D.B (1976) defined missing data to be MAR if the distribution of missing data did not depend on Y miss : MAR data occur when the absence is not random but can be explained by the variables for which complete information exists. More generally, we are in the MAR type if the cause of missing values is not related to the goal of the study. If Y 1 has missing values, then it is regressed on other variables Y 2 to Y l . The missing values in Y 1 are then replaced by the obtained predictive values. Similarly, if Y 2 has missing values, then the Y 1 , Y 3 to Y p variables are used in the prediction model as independent variables. Thus, the missing values are replaced with the predicted ones. The missing data rules for MAR data can be organized into several categories: single-cutoff method, multiple-cutoff method, percentile method and logistic regression method.
Here, we define these three rules. Single-cutoff method: Consider Y 1 as the variable with missing value and Y 2 is the missing data predictor with a cutoff point a. If a subject has Y 2 ≥ a, then its probability of being missing from Y 1 is π 1 , and if Y 2 < a, then its probability of being missing from Y 1 is π 2 . Let U be an indicator that takes the value 1 (U = 1) when Y 2 ≥ a and U = 0 when Y < a. The missing data rule is given as: where the parameters associated with this rule are π 1 and π 2 , and π 0 is the probability that Y 2 is equal to or greater than the cutoff: P(Y 2 ≥ a) = P(U = 1) = π 0 . To find the expected percentage of missing values, we first calculate the unconditional probability of a subject being missing from Y 1 : The expected percentage of missing values in a sample is given by: and the variance of this estimated percentage of missing values is written as: Multiple-cutoff method: One advantage of this method is that it can be used to create a nonlinear relationship between the missing data indicator and the missing data predictor. To create a nonlinear relationship between the indicator and the predictor, we need to fix an upper cutoff and a lower cutoff. Suppose we fix the two cutoff points a and −a, U = 1 when Y 2 ≥ a or Y 2 ≤ −a, and U = 0 when −a < Y 2 < a. The missing data rule can be written as: and we can see that this missing data rule is the same as the missing data rule in the single-cutoff method. In the case of a linear relationship, we need to specify two or more cutoff points in the missing data predictor. Suppose we take the quartile points (Q 1 , Q 2 and Q 3 ) as a cutoff, let V be a discrete uniform random variable created based on the values of Y 2 and that takes the following values: Then, for the linear relation case, the missing data rule is given by: Let n be the total number of subjects and n j be the number of subjects in each quartile group. The variance of the estimated π 0 is: and the variance of the estimated π j where j ∈ {1, 2, 3, 4} is defined as follows: We can calculate each subject's probability of being missing by calculating the marginal probability of M = 1: π miss = P(M = 1) = π 1 π 0 + π 2 π 0 + π 3 π 0 + π 4 π 0 , and the expected percentage of missing values can be written as: E(Π) = π miss = π 1 π 0 + π 2 π 0 + π 3 π 0 + π 4 π 0 ; thus, the variance of this estimated percentage is given by: To implement the missing data rule in (A4), researchers typically delete π 1 subjects with Y 2 < Q 1 , π 2 subjects with Q 1 ≤ Y 2 < Q 2 , and so on. Percentile method: This case is an extension of the multiple-cutoff method, where each subject has a probability of being missing that depends on its percentile rank in the missing data predictor.
If there is a direct relation between the missing data indicator and the predictor, then the missing data rule is defined by: where q k is the Y 2 value corresponding to its kth percentile. If there is an indirect relationship, then the missing data rule is defined as: Logistic regression method: When generating MAR data by using the logistic regression method, the logistic regression model is considered as the missing data rule and the population regression coefficients of the model are the parameters associated with the missing data rule. Then, the logistic regression model for subject i is written as: where y 2,i is the subject i's value on Y 2 . The probability of being missing for each subject is given by: Because the above function is continuous, it means the probability of being missing for Y 1 gradually increases or decreases as the value of Y 2 increases. With the logistic regression, there is no simple formula for calculating the expected percentage of missing data, so we estimate the expected percentage of missing data by calculating the mean of the probabilities in a sample with a large sample size: The third and last type is the MNAR data: In missing not at random data, the cause of missing data is related to the variable of interest. In this type, the probability that an observation is missing depends on the observed Y o and the missing data Y miss . If Y 1 is neither MCAR nor MAR, then it is MNAR. Generating MNAR data is the same as generating MAR data; the only difference is that in MNAR data the probability of missing values depends on the variable's own value and not on the observed values of the other variables. Therefore, we can change the missing data predictor to the variable with missing values, then we use one of the methods presented above to generate MAR data. For example, to change rule (A3) that generates MAR data to one that generates MNAR data, we only need to replace the variable Y 2 by Y 1 ; therefore, the corresponding missing data rule for generating MNAR data is when Y 1 value is above a cutoff point a. Y 1 has a π 1 probability of being missing, otherwise, Y 1 has a π 2 probability of being missing.
Next, we define the linear system of Henderson to deal with the parameters' estimation.