Model Selection with Missing Data Embedded in Missing-at-Random Data

Abstract: When models are built with missing data, an information criterion is needed to select the best model among the various candidates. Using a conventional information criterion for missing data may lead to the selection of the wrong model when data are not missing at random. Conventional information criteria implicitly assume that any subset of missing-at-random data is also missing at random, and thus the maximum likelihood estimator is assumed to be consistent; that is, it is assumed that the estimator will converge to the true value. However, this assumption may not be practical. In this paper, we develop an information criterion that works even for not-missing-at-random data, so long as the largest missing data set is missing at random. Simulations are performed to show the superiority of the proposed information criterion over conventional criteria.


Introduction
In data analysis, not all variables are equally important in the final model; some variables are more important than others. Thus, what matters in practice is model selection. For model selection, several variants of the Kullback-Leibler (KL) information criteria have been developed. In the case of complete data, the criteria include Akaike's information criterion (AIC; [1]) and Takeuchi's information criterion (TIC; [2]). These all measure the "distance" from the true distribution to the model distribution in the sense of Kullback-Leibler information and select the model that minimizes the information.
Applying these information criteria to real data is not straightforward, as nearly all real data have missing values. In practice, missing values are an obstacle to applying statistical methods, since the methods implicitly assume the availability of complete data. To handle such missing values with information criteria, modifications have been proposed by authors such as [3][4][5][6][7]. The first modification of the information criterion is attributed to [3]. The paper [4] derived the information criterion for missing data in a direct way, and the paper [5] used the symmetric KL divergence for the derivation. The paper [6] used the Hermite expansion with the Markov chain Monte Carlo method to approximate the H-function of the EM algorithm, and hence it requires substantial computation in application. The paper [7] also used the EM algorithm to develop an information criterion for settings with missing covariates.
However, despite these modifications, the information criteria still face a major problem in their practical application. The common problem is their implicit or explicit assumption that any set of data selected by them produces consistent parameter estimates in the model. This assumption is simply not realistic. In reality, these criteria can exclude a variable that causes missingness, resulting in inconsistent parameter estimates, because the remaining data are then not missing at random (NMAR; [8]). For the parameter estimates to be consistent, the missing-data mechanism must be modeled [9]. However, this is practically difficult, since the data necessary for modeling the mechanism are missing.
Stats 2023, 6

To overcome the problems described above, we propose a new information criterion for missing data. The primary advantage of our information criterion is that it can handle situations where a subset of the data is NMAR, so long as the largest set of missing data is missing at random (MAR). In computing our criterion, we estimate the parameters with the largest data set and extract the parameters necessary for the model under consideration. These two steps allow us to obtain an estimator that is consistent for the true value under any candidate model. Since the existing criteria have no such feature, they are unable to produce a proper value for the information criterion when the data are NMAR.
The remainder of the paper is organized as follows. In the next section, we develop the asymptotic theory for a subset of MAR data. In Section 3, we derive our information criterion for a subset of MAR data. In Section 4, the results of two simulations are presented to demonstrate the effectiveness of the proposed information criterion in practical situations. We conclude the paper with suggestions for future research.

Inference of Maximum Likelihood Estimator with Missing Data
We will represent the variables of the entire data set as a vector x and partition it into (y, z), where y contains the variables of interest and z the rest. The model density functions for these variables are all written as f(·); they are distinguished by context. If the values of x were observed completely, we would denote the data in matrix form as X_c. However, since x contains missing values, X_c consists of an observed part, X_o, and a missing part, X_m. For Y_c, the parts Y_o and Y_m are defined in the same way. We denote the parameters of the distributions of x and y as θ and α, respectively. The relationship between θ and α is assumed to be α = Sθ, where S is a matrix that extracts the necessary part of the parameter θ; writing the rest of θ as β, we have θ = (α′, β′)′. Note that although we could theoretically allow α = g(θ) for a nonlinear function g, in this paper we confine our focus to linear functions that can be expressed as α = Sθ for simplicity.
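As a minimal numerical illustration of the extraction α = Sθ (the parameter values and the two-of-four split below are hypothetical, chosen only for the sketch):

```python
import numpy as np

# Hypothetical parameter vector theta = (alpha', beta')' with alpha of length 2.
theta = np.array([0.5, 1.2, -0.3, 2.0])

# S extracts the components of theta that belong to the model of interest.
S = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])

alpha = S @ theta
print(alpha)  # the alpha part of theta
```

The same S applied to the full-data estimator θ̂ yields α̂ = Sθ̂, the construction used throughout the paper.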
In real data analysis, we often encounter a situation in which a subset of data is embedded in MAR data. With the MAR mechanism, the missingness of data is caused by observed values for variables included in the analysis; on the other hand, for NMAR data (i.e., data that are not MAR), the missingness is caused by variables not included in the analysis and/or by missing values of the included variables [8]. It follows that using more variables in the analysis makes it more likely that the missing data will be MAR (or close to MAR). One reason for this is that variables added into the analysis are likely to include those variables causing missingness. Another reason is that more variables are likely to include some variables related to missing values, recovering information lost by the missingness. These phenomena have been partly confirmed in simulations by [10][11][12]. Thus, MAR is a more plausible missing-data mechanism for data with a larger set of variables.
An example is given in [9] in which a subset of the missing data is NMAR but the entire data set is MAR. An investigation aims to estimate the correlation between self-esteem and sexual behavior among teenagers. However, the question on sexual behavior is so sensitive that it is asked only of those over 15 years of age. The resulting data are missing at random (MAR), since the question on sexual behavior is missing precisely for those under 15, and age is observed. As described before, our interest is the relationship between self-esteem and sexual behavior for teenagers. If the variables for analysis are limited to self-esteem and sexual behavior, the data are NMAR, since the data set excludes age, which causes the missingness, and thus the correlation estimate is biased. On the other hand, if age is used as an additional variable, the data become MAR, and likelihood estimation using the data on these three variables yields a consistent maximum likelihood estimator of the correlation coefficient between self-esteem and sexual behavior.
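The example above can be sketched numerically. The snippet below is a toy illustration, not the paper's procedure: the population correlations, the age threshold, and the use of the monotone-missing (regression) factorization for the MAR maximum likelihood estimate are all assumptions made for the sketch. The complete-case correlation of (esteem, behavior) is biased because those two variables alone are NMAR, while ML that also uses the fully observed age variable recovers the true value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical standardized population: (age, esteem, behavior).
cov = np.array([[1.0, 0.3, 0.5],
                [0.3, 1.0, 0.4],
                [0.5, 0.4, 1.0]])
age, esteem, behav = rng.multivariate_normal(np.zeros(3), cov, size=n).T

# MAR mechanism: behavior is observed only when age exceeds a threshold.
obs = age > 0

# Complete-case correlation: biased, because selection on age truncates
# the joint distribution of (esteem, behavior).
r_cc = np.corrcoef(esteem[obs], behav[obs])[0, 1]

# ML under MAR via the monotone-missing factorization: fit the regression
# of behavior on (esteem, age) in the complete cases, then combine it with
# the all-cases covariance of (esteem, age).
X = np.column_stack([np.ones(obs.sum()), esteem[obs], age[obs]])
b, *_ = np.linalg.lstsq(X, behav[obs], rcond=None)
s2 = ((behav[obs] - X @ b) ** 2).mean()   # residual variance (ML)
S_ea = np.cov(esteem, age)                # uses ALL cases, not just complete ones
beta = b[1:]
var_b = beta @ S_ea @ beta + s2           # implied Var(behavior)
cov_eb = S_ea[0] @ beta                   # implied Cov(esteem, behavior)
r_ml = cov_eb / np.sqrt(S_ea[0, 0] * var_b)

print(f"true r = 0.40, complete-case r = {r_cc:.2f}, MAR-ML r = {r_ml:.2f}")
```

With a large sample, the complete-case estimate settles noticeably below the true 0.40, while the MAR-ML estimate settles at it, mirroring the bias described above.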
For missing data, an exact distribution, such as the t-distribution available for complete data, is rarely obtained outside of special cases such as monotone missing data [13,14]. Hence, asymptotic theory is preferable for deriving the distribution of an estimator with missing data. In asymptotic theory, the most important properties of an estimator are consistency with respect to the true value and asymptotic normality. Of these two properties, the former is more important in practice. That an estimator is consistent means that the estimated parameter converges to the true value of the assumed model, and is thus close to the true value in a large sample. For an analyst, this is a highly desirable property. In fact, an estimator without consistency is essentially useless to the analyst as a means of approximating the parameter that he/she wishes to estimate. However, even if an estimator lacks consistency, asymptotic normality may still hold [15]. Notably, it has been shown that the maximum likelihood (ML) estimator under MAR is a good estimator in this sense, since it has both consistency and asymptotic normality [16].
We can now provide more detail regarding the estimation method when a subset of MAR data is used to produce the ML estimator. It has been shown that the ML estimator based on the observed-data likelihood has consistency [16]. Expressing this in the form of an equation, we have

θ̂ = argmax_θ n^{-1} log f(X_o|θ) → θ_0,

where n is the sample size, and the arrow indicates that the left-hand side converges almost surely to the right-hand side as n increases to infinity. What matters here is that the ML estimator θ̂ is based on the missing data X_o, and it is a function of the data and the corresponding missing-data indicators. This ML estimator converges to the value θ_0 which maximizes E[log f(x|θ)] (where the expectation is taken with respect to x), the same convergence point as that of the maximizer of n^{-1} log f(X_c|θ), which, in actuality, is not available in the presence of missing data. The asymptotic distribution is

√n(θ̂ − θ_0) → N(0, V_{X_o}(θ_0)) in distribution,

where ∇ is the first-derivative operator with respect to the parameter θ and V_{X_o}(θ_0) is the asymptotic variance matrix constructed from ∇ log f(X_o|θ_0). On the other hand, the ML estimator based only on the data for the model of interest would generally not be guaranteed to converge to the true value, α_0 in θ_0, since the set of data corresponding to α might be NMAR. Hence, to estimate α, we construct an estimator by extracting the part corresponding to α from the ML estimator θ̂, and set α̂ = Sθ̂.

Information Criterion for Missing Data
In this section, we propose an information criterion (IC) for missing data that are MAR as a whole but might be NMAR as a subset. In selecting a model among multiple candidate models, the standard IC is based on the discrepancy from the true distribution g(y) to the model f(y|θ). This is measured using the KL information, which is defined as

KL(g; f) = −E_{g(y)}[log f(y|θ)] + E_{g(y)}[log g(y)].

Since the second term is constant, the first term determines the KL information and works as a measure of a model's goodness. When we estimate the parameter α using the ML method and obtain its estimate α̂, a naive measure of goodness of fit is

E_{g(y)}[log f(y|α)]|_{α=α̂}.

The problem is that the true distribution g is unknown. If the complete data are available and are a random sample, we can use n^{-1} ∑_{j=1}^{n} log f(y_j|α̂) as an estimate of this measure, where y_j (j = 1, ..., n) are the data used to estimate α̂. However, the problem with using this empirical likelihood is that the same data enter both the estimate α̂ and the evaluation of the likelihood, and hence n^{-1} ∑_{j=1}^{n} log f(y_j|α̂) is an asymptotically biased estimator of E_{g(y)}[log f(y|α̂)]. The likelihood with the corrected bias is known as the IC. Some ICs have special names, such as AIC [1], TIC [2], and GIC [17]. In the correction and derivation of the IC, asymptotic properties play a central role. Of the various desirable properties of an ML estimator, the most important are asymptotic normality and consistency with respect to the true value. See also Konishi and Kitagawa [17] for the technical details.
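The asymptotic bias of the empirical log-likelihood can be checked with a small Monte Carlo experiment. The sketch below uses a complete-data normal model with p = 2 parameters (mean and variance); the sample size, repetition count, and use of a large fresh sample to approximate the expectation are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 4000
gap = []
for _ in range(reps):
    y = rng.normal(size=n)
    mu, s2 = y.mean(), y.var()          # ML estimates of mean and variance

    def avg_ll(v):                      # average log-likelihood at (mu, s2)
        return -0.5 * (np.log(2 * np.pi * s2) + ((v - mu) ** 2).mean() / s2)

    y_new = rng.normal(size=200_000)    # fresh draws from the true distribution
    gap.append(avg_ll(y) - avg_ll(y_new))

# The in-sample average log-likelihood overshoots the out-of-sample one
# by roughly p/n on average -- the bias that AIC-type corrections remove.
print(f"mean optimism = {np.mean(gap):.3f}, p/n = {2 / n:.3f}")
```

The measured mean "optimism" lands near p/n = 0.04, the per-observation bias whose doubled, n-scaled version is the familiar 2p penalty.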
For missing data, an IC can be computed in a similar spirit. However, considerable caution must be exercised in constructing the criterion, as different subsets of the missing data may have different types of missing-data mechanisms. Even if a subset of the missing data, call it Y_o, is MAR, so that a naive ML estimator based on log f(Y_o|α) would have consistency, a smaller subset of Y_o might be NMAR because the variables causing missingness are excluded from the model under consideration, in which case the ML estimator would not have consistency. This means that a straightforward extension of the conventional IC cannot be effectively applied to such missing-data cases.
To derive the proposed IC for cases with missing data, we will assume that the data as a whole, represented as X_o, are MAR. As shown in [16], the ML estimator based on X_o has consistency and asymptotic normality. However, a subset Y_o of the MAR data X_o might be NMAR, and thus the ML estimator based on Y_o would be asymptotically biased. To ensure that the ML estimator is consistent (i.e., converges to the true value), it is necessary to perform ML estimation with the entire data set, even when we are interested in models for the subset Y_o. Let the ML estimator based on the entire data set X_o be θ̂, and let the part of θ associated with the model of interest be α. In this case, there is a selection matrix S such that Sθ = α, and the ML estimate of α can be written as α̂ = Sθ̂. In this setting, we can evaluate the bias.
We will use log f(Y_o|θ) to denote the log-likelihood function. Under the regularity conditions and the MAR assumption, the ML estimator is proved to have consistency and asymptotic normality [16].
We can now evaluate the bias of log f(Y_o|α̂). The bias is given as

b = E[ log f(Y_o|α̂) − n E_{g(y)}[log f(y|α)]|_{α=α̂} ].

Decomposing the bias, we have

b = E[ log f(Y_o|α̂) − log f(Y_o|α_0) ]
  + E[ log f(Y_o|α_0) − n E_{g(y)}[log f(y|α_0)] ]
  + E[ n E_{g(y)}[log f(y|α_0)] − n E_{g(y)}[log f(y|α)]|_{α=α̂} ].

Let the terms be, in order, D_1, D_2, and D_3.
In this paper, we have assumed that X_o is MAR and that θ̂ is asymptotically normally distributed. First, we evaluate D_1; ∇ and ∇_α indicate the first-derivative operators with respect to θ and to α, respectively. Assume that the ML estimator θ̂ takes the form

θ̂ = θ_0 + J(θ_0)^{-1} n^{-1} ∇ log f(X_o|θ_0) + o_p(n^{-1/2}),

where θ_0 is the true value of the parameter, and J(θ_0) = −E[ n^{-1} ∇∇′ log f(X_o|θ_0) ]. In addition, assume that √n(θ̂ − θ_0) is asymptotically distributed as normal with mean zero and variance V_{X_o}(θ_0).
The expected first term in Equation (4) can be written as the trace, tr{}, of a matrix formed from the covariance between ∇_α log f(Y_o|α_0) and α̂ − α_0; the first term in this trace is zero. The expected second term in Equation (4) is then calculated using the asymptotic variance V_{X_o}(θ_0) of θ̂. In summary, taking the expectation of both sides and ignoring the term o_p(1), we obtain an unbiased estimator for E[log f(y|α̂)], given in Equation (5). The bias itself cannot be computed directly, since unknown quantities such as the true value of α are involved. We can, however, construct an estimated version of the IC given in Equation (5). By replacing the unknown parameters with the consistent ML estimator and eliminating the expectation from the second term in Equation (5), we substitute the first two terms with their natural sample counterparts. The penalty term is estimated by replacing θ_0 with its ML estimator, which is consistent under the assumption that the data as a whole are MAR. As a result, Equation (6) is obtained. As is conventional in deriving the AIC, we double the bias to obtain the corrected criterion, where Q_Y(α̂|α̂) is the Q-function of the EM algorithm based on log f(Y_c|α) evaluated at α = α̂, and V̂_• is the sample version of V_• for each suffix •. Under the regularity conditions assumed in [16], the estimators of the variance-covariance matrices in (6) converge to their population versions. This IC is an extension of the AIC to missing data: it reduces to the original AIC [1] when X_o = X_c = Y_o = Y_c and θ = α, that is, when there are no missing data. Similar to the original AIC, our IC shares the starting point E[log f(y|θ)] with [3,4]. However, the first and second terms of our IC differ from those of [3]. Our IC shares its first term with those of [4] and [6], but not its second term. This difference arises from the way in which the bias is approximated: [4] approximated the bias under the setting that all subsets are MAR, whereas [6] approximated it using the Hermite expansion and Markov chain Monte Carlo methods.

Simulations
To confirm the validity of our proposed IC, two simulations were performed.

Selection of the Parameter Structure
We will confirm through simulations that our IC selects the correct model when there are missing values in the variables (y_1, y_2) of x = (y_1, y_2, z)′. In so doing, we compare our IC with a slightly modified AIC, since the original AIC proposed by [1] cannot deal with missing data:

AIC = −2 log f(Y_o|α̂) + 2p,

where p is the dimension of the vector α̂. This criterion has been used as a competitor to the information criterion proposed by [4]. As an additional competitor, we use AIC_cd, which Cavanaugh and Shumway [4] developed for missing data. Four candidate models were considered, all of which are based on a trivariate normal distribution for the variables x = (y_1, y_2, z). The respective models restrict the means of (y_1, y_2), and/or the variances of (y_1, y_2), or impose no restrictions at all. Model 1 restricts the two expectations to be equal and the two variances to be equal. Model 2 restricts only the expectations to be equal. Model 3 restricts only the two variances to be equal. Model 4 has no restrictions. In the simulations, one of the four models is the true data-generating model, and a model is selected according to each of the three information criteria: AIC, AIC_cd, and our proposed IC. With respect to model selection, it should be noted that all the criteria assume that the true model is included as a special case of each candidate model. Model 1, for example, is a special case of the other three candidate models. On the other hand, Model 2 is not a special case of Model 3, since Model 2 has no restrictions on the variances. Theoretically, Models 2, 3, and 4 cannot all be compared with one another using an information criterion. For each true model, the candidate models that can or cannot be theoretically compared are shown in Table 1; a number in parentheses indicates a model that cannot be theoretically compared.
Nevertheless, we have included these theoretically non-comparable models in the set of candidate models, because such comparisons are often conducted in practice, as pointed out in [18].

Table 1. True models and the corresponding candidate models. A number in parentheses indicates a model that is theoretically inappropriate to use for comparison but is nonetheless compared in practice.

True Model    Candidate Models
1             1, 2, 3, 4
2             (1), 2, (3), 4
3             (1), (2), 3, 4
4             (1), (2), (3), 4

The simulation was conducted using the following procedure. First, a complete data set of a given size was generated from a trivariate normal distribution N(µ, Σ) for the variables (y_1, y_2, z)′, where µ = (µ_1, µ_2, µ_3)′ and Σ = [σ_ab], a, b = 1, 2, 3. The means and variances used for data generation are shown in Table 2, and the covariances are all set to 0.7. The model that is assumed to be true is varied.

Table 2. True values of the parameters. The covariances are common.

Second, denoting the missing-data indicators for (y_1, y_2) as (r_1, r_2), we generated missing values according to two missing-data mechanisms. Under the first mechanism, for i = 1, 2, the value of y_i is always missing when z exceeds the threshold 0 and always observed otherwise; this is a case where the missing-data mechanism is non-smooth. We set ψ_1 at 0.3 and ψ_2 at 0.7. The second mechanism used the binomial distribution: for i = 1, 2, the missingness of y_i is stochastically determined, with an observation probability that is a smooth function of z based on the inverse of the hyperbolic cosine. We set ω_1 at 0.3 and ω_2 at 0.7. Notice that under both mechanisms the missing data for (y_1, y_2, z) are MAR, whereas those for (y_1, y_2) alone are NMAR, since z, which is not included in the models, causes the missingness of y_1 and y_2. Third, for each generated missing data set, the model that gives the minimum value is selected for each of the three ICs. The procedure is repeated 1000 times, and the number of times each of the four models is selected is recorded.
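The data-generating step and the first (deterministic) mechanism can be sketched as follows. The means and variances below are placeholders standing in for the Table 2 values; only the 0.7 covariances and the z = 0 threshold come from the text:

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder trivariate normal for (y1, y2, z): unit variances assumed,
# covariances all 0.7 as stated in the text.
mu = np.zeros(3)
Sigma = np.full((3, 3), 0.7)
np.fill_diagonal(Sigma, 1.0)
y1, y2, z = rng.multivariate_normal(mu, Sigma, size=500).T

# First mechanism: y1, y2 are missing whenever z exceeds the threshold 0.
r = z <= 0                        # missing-data indicator (True = observed)
y1_obs = np.where(r, y1, np.nan)
y2_obs = np.where(r, y2, np.nan)

# (y1, y2, z) jointly are MAR: missingness depends only on the observed z.
# (y1, y2) alone are NMAR: z, the cause of missingness, has been dropped.
print(f"observed fraction: {r.mean():.2f}")
```

Fitting the four candidate models to (y1_obs, y2_obs, z) and comparing criterion values then reproduces one replication of the procedure described above.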
The results for the first missing-data mechanism are summarized in Table 3. As can be seen in the table, our IC is generally superior to its two competitors. In the cases where Models 1 to 3 are true, the proposed IC outperforms the other criteria for all sample sizes. In the case where Model 4 is true, the proposed IC underperforms the other criteria when the sample size is small; however, as the sample size increases from 200 to 800, the proposed IC comes to select the correct model as often as the other criteria. In addition, the AIC is found to be more robust than AIC_cd against this violation of the assumptions on the missing-data mechanism.
The results for the second missing-data mechanism are given in Table 4. Here, again, our IC is shown to be generally superior to its competitors. While all the information criteria become more capable of selecting the true model as the sample size increases, our IC is the most likely to select it. For any true model, our IC outperforms the other information criteria except when Model 4 is true; even in that case, our IC is equivalent to the others. In sum, our IC nearly consistently outperforms its major competitors and selects the correct model in all of the cases.

Table 3. Performance comparison under the first missing-data mechanism.

Table 4. Performance comparison under the missing-data mechanism (8).

The second simulation concerns variable selection in the linear regression model

y = β_0 + β_1 x_1 + β_2 x_2 + ε,

where ε has mean zero and a finite variance conditional on (x_1, x_2). The four models considered are Model 5 with β_1 = β_2 = 0; Model 6, with β_1 nonzero and β_2 = 0, which is the true model; Model 7 with β_1 = 0 and β_2 nonzero; and Model 8 with both β_1 and β_2 nonzero. For example, the data-generating model of Model 6 with (β_0, β_1, β_2) = (1, 1/√2, 0) specifies that the joint distribution of (y, x_1, x_2) is trivariate normal with mean [0, 0, 0]′ and the covariance matrix implied by the regression. These four models do not all stand in a nested/nesting relationship. Model 5 is a special case of Models 6 to 8. Models 6 and 7 are special cases of Model 8, but they do not nest one another. Thus, theoretically speaking, these four models cannot all be compared with AIC-type information criteria. However, in practice, the AIC criterion has commonly been used for the comparison of non-nested models [18].
Thus, we compare the four models according to this convention. The missingness of y and x_1 depends only on x_2. To represent the missing-data mechanism more formally, we introduce missing-data indicators (r_y, r_{x_1}) for (y, x_1), each of which takes the value 1 when the corresponding variable is observed and 0 when it is not. The missing-data mechanisms used depend only on the observed x_2, with ω_y set at 0.7 and ω_{x_1} at 0.3. The simulation was conducted according to the following procedure. First, a sample without missing values is created by drawing a random sample of size n from the joint normal distribution of (y, x_1, x_2) specified above. Second, missing values are created in the sample through the missing-data mechanisms for (y, x_1) introduced above. Finally, the proposed IC and the competitor ICs are computed with such a sample for Models 5 to 8, and each criterion selects the model with the minimum value. We repeated this procedure 1000 times, counting the number of times that each model is selected. The counts are used as an indicator of how well each of the ICs works in selecting the best model.
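For intuition, the complete-data version of this model-selection step can be sketched with the ordinary AIC and OLS fits. The coefficients, noise level, and sample size below are illustrative; the paper's criteria additionally handle the missing values, which this sketch omits:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
x1, x2 = rng.normal(size=(2, n))
# Illustrative true model (Model 6): y depends on x1 only.
y = 1.0 + (1 / np.sqrt(2)) * x1 + rng.normal(scale=np.sqrt(0.5), size=n)

def aic(y, X):
    """Gaussian AIC: -2 * maximized log-likelihood + 2 * (number of parameters)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2 = ((y - X @ b) ** 2).mean()                 # ML error variance
    ll = -0.5 * n * (np.log(2 * np.pi * s2) + 1)
    return -2 * ll + 2 * (X.shape[1] + 1)          # +1 for the variance

ones = np.ones(n)
designs = {5: [ones], 6: [ones, x1], 7: [ones, x2], 8: [ones, x1, x2]}
scores = {m: aic(y, np.column_stack(cols)) for m, cols in designs.items()}
print("selected model:", min(scores, key=scores.get))
```

Models 5 and 7, which omit the truly active covariate x1, receive clearly worse scores; the contest the criteria must settle is between the true Model 6 and the overparameterized Model 8.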
The results of the simulation are shown in Table 5. Regardless of which model is the true model, our proposed IC appears to work well. In the case where Model 5 is true, our proposed IC outperforms the two competitors for any sample size; here, AIC_cd works much better than AIC but falls behind our IC. In the case where either Model 6 or Model 7 is true, AIC_cd shows the best performance for any sample size, and our IC is second best. AIC selects the wrong model, Model 8, more than half the time. Interestingly, when Model 6 (7) is true, none of the three ICs selects Model 5, which is in a nesting relationship with the true model, or Model 7 (6), which is not. In the case in which Model 8 is true, AIC performs best. However, while AIC selects the true model more often than our proposed IC for a sample size of 200, our proposed IC is comparable for sample sizes of more than 200. AIC_cd always selects the wrong models. In summary, our proposed IC performed best or, at worst, second best throughout the simulation, and even in the second-best cases its performance is very close to that of the higher-performing competitor.
Note that even for different values of the parameters in the missing-data mechanisms, we observed almost the same trend.