1. Introduction
In data analysis, not all variables are equally important in the final model; some matter more than others. Thus, model selection is what matters in practice. For model selection, several variants of Kullback-Leibler (KL) information criteria have been developed. In the case of complete data, these include Akaike’s information criterion (AIC; [1]) and Takeuchi’s information criterion (TIC; [2]). They all measure the “distance” from the true distribution to the model distribution in the sense of KL information and select the model that minimizes this information.
Applying these information criteria to real data is not straightforward, as nearly all real data have missing values. In practice, missing values are an obstacle to applying statistical methods, since the methods implicitly assume the availability of complete data. To handle missing values with information criteria, modifications have been proposed by authors such as [3,4,5,6,7]. The first modification of an information criterion is attributed to [3]. The paper [4] derived an information criterion for missing data in a straightforward way, and the paper [5] used the symmetric KL divergence for the derivation. The paper [6] used the Hermite expansion together with the Markov chain Monte Carlo method to approximate the H-function of the EM algorithm, and hence requires substantial computation in application. The paper [7] also used the EM algorithm to develop an information criterion for settings with missing covariates.
However, despite these modifications, the existing information criteria still face a major problem in practical application: they implicitly or explicitly assume that any set of data they select produces consistent parameter estimates in the model. This assumption is simply not realistic. In reality, these criteria can exclude a variable that causes missingness, in which case the missing data are not missing at random (NMAR; [8]) and the resulting parameter estimates are inconsistent. For the parameters to be consistent, the missing-data mechanism must be modeled [9]. However, this is practically difficult, since the data necessary for such modeling are themselves missing.
To overcome the problems described above, we propose a new information criterion for missing data. The primary advantage of our information criterion is that it can handle situations where a subset of the data is NMAR, so long as the largest set of missing data is missing at random (MAR). In computing our criterion, we estimate the parameters with the largest data set and extract the parameters needed for the model under consideration. These two steps allow us to obtain an estimator that is consistent for the true value under any candidate model. Since the existing criteria lack this feature, they cannot produce a proper value of the information criterion when the data are NMAR.
The remainder of the paper is organized as follows. In the next section, we develop the asymptotic theory for a subset of MAR data. In Section 3, we derive our information criterion for a subset of MAR data. In Section 4, the results of two simulations are presented to demonstrate the effectiveness of the proposed information criterion in practical situations. We conclude the paper with suggestions for future research.
2. Inference of Maximum Likelihood Estimator with Missing Data
We will represent the variables of the entire data set as a vector $v$ and partition it as $v = (x^{\top}, z^{\top})^{\top}$, where $x$ is the variable of interest and $z$ is not. The model density functions with respect to these variables are all written as $f$, as is common; they can be distinguished by context. If the values of $x$ were observed completely, we would denote them as $X$ in matrix form. However, since $X$ contains missing values, $X$ consists of an observed part, $X_{\mathrm{obs}}$, and a missing part, $X_{\mathrm{mis}}$. For $Z$, $Z_{\mathrm{obs}}$ and $Z_{\mathrm{mis}}$ are defined in the same way. We denote the parameters of the distributions of $x$ and of the entire vector $v$ as $\theta_{x}$ and $\theta$, respectively. The relationship between $\theta_{x}$ and $\theta$ is assumed to be $\theta_{x} = S\theta$, where $S$ is a matrix that extracts the necessary part of the parameter $\theta$. The rest of $\theta$ is written as $\theta_{z}$. Note that although the theory could be developed for a nonlinear mapping $\theta_{x} = s(\theta)$, in this paper we confine our focus to the linear mapping $\theta_{x} = S\theta$ for simplicity.
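To make the extraction map concrete, here is a minimal sketch in Python with NumPy; the numbers and entry positions are ours, purely for illustration. A selection matrix $S$ picks out of the full parameter vector $\theta$ the entries that parameterize the model for $x$.

```python
import numpy as np

# Full parameter vector theta for the entire data set, e.g.,
# (mu_x, mu_z, sigma_x^2, sigma_z^2, sigma_xz); the layout is illustrative.
theta = np.array([0.5, 1.2, 2.0, 1.5, 0.3])

# S extracts the part of theta that parameterizes the model for x alone:
# here the mean and variance of x (rows select entries 0 and 2).
S = np.zeros((2, 5))
S[0, 0] = 1.0  # pick mu_x
S[1, 2] = 1.0  # pick sigma_x^2

theta_x = S @ theta  # theta_x = S theta, the linear extraction map
print(theta_x)       # [0.5 2. ]
```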
In real data analysis, we often encounter a situation in which a subset of data is embedded in MAR data. With the MAR mechanism, the missingness of data is caused by observed values of variables included in the analysis; for NMAR data (i.e., data that are not MAR), the missingness is caused by variables not included in the analysis and/or by missing values of the included variables [8]. It follows that using more variables in the analysis makes it more likely that the missing data will be MAR (or close to MAR). One reason is that the variables added to the analysis are likely to include those causing missingness. Another is that more variables are likely to include some variables related to the missing values, recovering information lost to the missingness. These phenomena have been partly confirmed in simulations by [10,11,12]. Thus, MAR is a more plausible missing-data mechanism for data with a larger set of variables.
An example in which a subset of the missing data is NMAR but the entire data set is MAR is given in [9]. An investigation is conducted with the aim of estimating the correlation between self-esteem and sexual behavior among teenagers. However, the question on sexual behavior is so sensitive that it is asked only of respondents over 15 years of age. The resulting data are missing at random (MAR), since the sexual-behavior item is missing for those under 15 and the missingness is determined by the fully observed age variable. As described above, our interest is in the relationship between self-esteem and sexual behavior for teenagers. If the variables for analysis are limited to self-esteem and sexual behavior, the data are NMAR, since the data set excludes “age,” which causes the missingness, and thus the correlation estimate is biased. On the other hand, if age is used as an additional variable, the data become MAR, and likelihood estimation using the data on all three variables yields a consistent maximum likelihood estimator of the correlation coefficient between self-esteem and sexual behavior.
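The flavor of this example is easy to reproduce numerically. The following sketch is our own construction, not taken from [9]; the coefficients and age range are invented. It shows that the correlation computed from the complete cases alone, i.e., ignoring age, is biased, which is precisely the NMAR problem described above.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
age = rng.uniform(13, 19, size=n)
# Self-esteem and sexual-behavior scores both depend on age (coefficients ours).
esteem = 0.5 * age + rng.normal(size=n)
behavior = 0.8 * age + 0.4 * esteem + rng.normal(size=n)

asked = age > 15  # the behavior question is asked only of those over 15

full_corr = np.corrcoef(esteem, behavior)[0, 1]
cc_corr = np.corrcoef(esteem[asked], behavior[asked])[0, 1]
print(full_corr, cc_corr)  # the complete-case estimate is attenuated
```

Correcting this bias by maximum likelihood with age as an additional variable is exactly the strategy formalized in the rest of this section.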
For missing data, an exact distribution such as the t-distribution for complete data is rarely obtained, except in special cases such as monotone missing data [13,14]. Hence, asymptotic theory is preferable for deriving the distribution of an estimator with missing data. In asymptotic theory, the most important properties of an estimator are consistency for the true value and asymptotic normality. Of these two properties, the former is more important in practice. Consistency means that the estimated parameter converges to the true value of the assumed model and is therefore close to the true value in a large sample. For an analyst, this is a highly desirable property; indeed, an estimator without consistency is essentially useless as a means of approximating the parameter that he/she wishes to estimate. However, even if an estimator lacks consistency, asymptotic normality might still hold [15]. Notably, it has been shown that the maximum likelihood (ML) estimator under MAR is a good estimator, since it has both consistency and asymptotic normality [16].
We can now provide more detail regarding the estimation method when a subset of MAR data is used to produce the ML estimator. It has been shown that the ML estimator based on the entire observed data has consistency [16]. Expressing this in the form of an equation, we have

$$ \hat{\theta} \xrightarrow{\ \mathrm{a.s.}\ } \theta_{0}, $$

where $n$ is the sample size, and the arrow indicates that the left-hand side converges almost surely to the right-hand side as $n$ increases to infinity. What matters here is that the ML estimator $\hat{\theta}$ is based on the missing data $(X_{\mathrm{obs}}, Z_{\mathrm{obs}})$, and it is a function of the data and the corresponding missing-data indicators. This ML estimator converges to the value $\theta_{0}$ that maximizes $\mathrm{E}[\log f(x_{\mathrm{obs}}, z_{\mathrm{obs}}; \theta)]$ (where the expectation is taken with respect to the true distribution), which under MAR is the same convergence point as the maximizer of $\mathrm{E}[\log f(v; \theta)]$, which, in actuality, is not available in the presence of missing data. The asymptotic distribution of $\sqrt{n}\,(\hat{\theta} - \theta_{0})$ is normal, with mean zero and variance $J^{-1} I J^{-1}$, where $J = -\lim_{n \to \infty} n^{-1}\, \mathrm{E}[\nabla \nabla^{\top} \ell(\theta_{0})]$, $I = \lim_{n \to \infty} n^{-1}\, \mathrm{E}[\nabla \ell(\theta_{0})\, \nabla^{\top} \ell(\theta_{0})]$, $\ell$ is the observed-data log-likelihood defined in Section 3, and $\nabla$ is the first-derivative operator with respect to the parameter $\theta$. On the other hand, the ML estimator based only on the data for $x$ would generally not be guaranteed to converge to the true value $\theta_{x0} = S\theta_{0}$, since the set of data corresponding to $x$ might be NMAR. Hence, to estimate $\theta_{x0}$, we construct an estimator by extracting the part corresponding to $\theta_{x}$ from the ML estimator $\hat{\theta}$, and set $\hat{\theta}_{x} = S\hat{\theta}$.
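The two-step procedure can be illustrated in the simplest case we can think of: a bivariate normal for $v = (x, z)$ in which $x$ is MAR given the always-observed $z$. The sketch below (our own; the function names, threshold, and parameter values are not from the paper) maximizes the observed-data log-likelihood for the entire data and then extracts the parameters of the marginal model for $x$.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
n = 2000
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
xz = rng.multivariate_normal([0.0, 0.0], cov, size=n)
x, z = xz[:, 0], xz[:, 1]
obs = z < 0.5  # x is observed iff z < 0.5: MAR given z, NMAR once z is dropped

def neg_loglik(par):
    # theta = (mu_x, mu_z, log sd_x, log sd_z, atanh-correlation)
    mx, mz, lsx, lsz, a = par
    sx, sz, r = np.exp(lsx), np.exp(lsz), np.tanh(a)
    c = np.array([[sx**2, r * sx * sz], [r * sx * sz, sz**2]])
    # Complete rows contribute the bivariate density...
    ll = multivariate_normal([mx, mz], c).logpdf(
        np.column_stack([x[obs], z[obs]])).sum()
    # ...rows with x missing contribute only the marginal density of z (MAR).
    ll += norm(mz, sz).logpdf(z[~obs]).sum()
    return -ll

theta_hat = minimize(neg_loglik, np.zeros(5), method="BFGS").x
# Extraction step: theta_x = S theta picks the sub-parameters (mu_x, sd_x).
print("mu_x, sd_x:", theta_hat[0], np.exp(theta_hat[2]))  # near 0 and 1
```

A naive fit using the observed values of $x$ alone would, under an NMAR subset, converge to the wrong value; here the estimate of $(\mu_{x}, \sigma_{x})$ stays consistent because the full observed data enter the likelihood.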
3. Information Criterion for Missing Data
In this section, we propose an information criterion (IC) for missing data that are MAR as a whole but might be NMAR as a subset. In selecting a model among multiple candidate models, the standard IC is based on the discrepancy of the model $f(\cdot\,; \theta)$ from the true distribution $g$. This is measured using the KL information, which is defined as

$$ \mathrm{KL}\big(g \,\Vert\, f(\cdot\,; \theta)\big) = -\int g(u) \log f(u; \theta)\, du + \int g(u) \log g(u)\, du . $$
Since the second term is constant, the first term determines the KL information and works as a measure of a model’s goodness. When we estimate the parameter $\theta$ using the ML method and obtain its estimate $\hat{\theta}$, a naive measure of goodness of fit is $-\mathrm{E}_{g}[\log f(u; \hat{\theta})]$. The problem is that the true distribution $g$ is unknown. If the complete data are available and are a random sample, we can use

$$ \frac{1}{n} \sum_{i=1}^{n} \log f(u_{i}; \hat{\theta}) $$

as an estimated measure of closeness to the true distribution, where $u_{i}$ ($i = 1, \ldots, n$) are the data used to estimate $\theta$. However, the problem with using this empirical likelihood is that the estimate depends on the data, and hence $n^{-1} \sum_{i=1}^{n} \log f(u_{i}; \hat{\theta})$ is an asymptotically biased estimator of $\mathrm{E}_{g}[\log f(u; \hat{\theta})]$. The likelihood with the corrected bias is known as the IC. Some ICs have special names, such as AIC [1], TIC [2], and GIC [17]. In the correction and derivation of an IC, asymptotic properties play a central role. Of the various desirable properties of an ML estimator, the most important are asymptotic normality and consistency with respect to the true value. See also Konishi and Kitagawa [17] for the technical details.
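The bias that an IC corrects can be seen directly by simulation. The following Monte Carlo sketch (ours, for illustration only) uses a univariate normal model with complete data, where the optimism of the in-sample log-likelihood is known to be approximately the number of estimated parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 50, 5000
in_sample, out_sample = [], []

def loglik(data, mu, var):
    # Gaussian log-likelihood of `data` at (mu, var).
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (data - mu) ** 2 / var)

for _ in range(reps):
    d = rng.normal(size=n)
    mu_hat, var_hat = d.mean(), d.var()   # ML estimates (2 parameters)
    in_sample.append(loglik(d, mu_hat, var_hat))
    fresh = rng.normal(size=n)            # independent copy from the truth
    out_sample.append(loglik(fresh, mu_hat, var_hat))

# The gap approximates the number of estimated parameters (here 2),
# which is exactly the bias that the AIC correction removes.
print(np.mean(in_sample) - np.mean(out_sample))  # roughly 2
```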
For missing data, an IC can be computed in a similar spirit. However, considerable caution must be exercised in constructing the criterion, as different subsets of the missing data may be subject to different types of missing-data mechanisms. Even if a set of missing data, call it $D$, is MAR, so that a naive ML estimator based on $D$ is consistent, a smaller subset of $D$ might be NMAR because the variables causing missingness are excluded from the model under consideration, in which case the ML estimator would not be consistent. This means that a straightforward extension of the conventional IC cannot be effectively applied to such missing-data cases.
To derive the proposed IC for cases with missing data, we will assume that the data as a whole, represented as $(X_{\mathrm{obs}}, Z_{\mathrm{obs}})$, are MAR. As shown in [16], the ML estimator based on $(X_{\mathrm{obs}}, Z_{\mathrm{obs}})$ has consistency and asymptotic normality. However, a subset of the MAR data, namely $X_{\mathrm{obs}}$, might be NMAR, and thus the ML estimator based on $X_{\mathrm{obs}}$ alone would be asymptotically biased. To ensure that the ML estimator is consistent (i.e., converges to the true value), it is necessary to perform ML estimation with the entire data set, even when we are interested in models for the subset $x$ of $v$. Let the ML estimator based on the entire data set $(X_{\mathrm{obs}}, Z_{\mathrm{obs}})$ be $\hat{\theta}$, and let the part of $\theta$ associated with the model of interest be $\theta_{x}$. In this normality case, there is a selection matrix $S$ such that $\theta_{x} = S\theta$. The ML estimate of $\theta_{x}$ can be written as $\hat{\theta}_{x} = S\hat{\theta}$. In this setting, we can evaluate the bias.
We will use $\ell(\theta) = \log f(X_{\mathrm{obs}}, Z_{\mathrm{obs}}; \theta)$ to denote the log-likelihood function, and $\ell_{x}(\theta_{x}) = \log f(X_{\mathrm{obs}}; \theta_{x})$ for the model of interest. Under the regularity conditions and the MAR assumption, the ML estimator is proved to have consistency and asymptotic normality [16].
We can now evaluate the bias of $\ell_{x}(\hat{\theta}_{x})$ as an estimator of the expected complete-data log-likelihood. The bias is given as

$$ b = \mathrm{E}\big[\ell_{x}(\hat{\theta}_{x})\big] - n\, \mathrm{E}\big[\mathrm{E}_{g}[\log f(x; \theta_{x})]\big|_{\theta_{x} = \hat{\theta}_{x}}\big]. $$

Decomposing the bias, we have

$$ b = \mathrm{E}\big[\ell_{x}(\hat{\theta}_{x}) - \ell_{x}(\theta_{x0})\big] + \Big(\mathrm{E}\big[\ell_{x}(\theta_{x0})\big] - n\, \mathrm{E}_{g}[\log f(x; \theta_{x0})]\Big) + \Big(n\, \mathrm{E}_{g}[\log f(x; \theta_{x0})] - n\, \mathrm{E}\big[\mathrm{E}_{g}[\log f(x; \theta_{x})]\big|_{\theta_{x} = \hat{\theta}_{x}}\big]\Big). $$

Let the terms be, in order, $b_{1}$, $b_{2}$, and $b_{3}$.
In this paper, we have assumed that $(X_{\mathrm{obs}}, Z_{\mathrm{obs}})$ is MAR and that $\hat{\theta}$ is asymptotically normally distributed. First, we evaluate $b_{1}$. Here, $\nabla$ and $\nabla_{x}$ indicate the first-derivative operators with respect to $\theta$ and to $\theta_{x}$, respectively. Assume that the ML estimator $\hat{\theta}$ takes the form

$$ \sqrt{n}\,(\hat{\theta} - \theta_{0}) = J^{-1} \frac{1}{\sqrt{n}}\, \nabla \ell(\theta_{0}) + o_{p}(1), $$

where $\theta_{0}$ is the true value of the parameter, and $J = -\lim_{n \to \infty} n^{-1}\, \mathrm{E}[\nabla \nabla^{\top} \ell(\theta_{0})]$ is the variance defined in Section 2. In addition, assume that $n^{-1/2}\, \nabla \ell(\theta_{0})$ is asymptotically distributed as normal with mean zero and variance $I = \lim_{n \to \infty} n^{-1}\, \mathrm{E}[\nabla \ell(\theta_{0})\, \nabla^{\top} \ell(\theta_{0})]$. By a Taylor expansion of $\ell_{x}$ around $\theta_{x0}$, with $\hat{\theta}_{x} - \theta_{x0} = S(\hat{\theta} - \theta_{0})$, we obtain

$$ \ell_{x}(\hat{\theta}_{x}) - \ell_{x}(\theta_{x0}) = \nabla_{x} \ell_{x}(\theta_{x0})^{\top} (\hat{\theta}_{x} - \theta_{x0}) + \frac{1}{2}\, (\hat{\theta}_{x} - \theta_{x0})^{\top} \nabla_{x} \nabla_{x}^{\top} \ell_{x}(\theta_{x0})\, (\hat{\theta}_{x} - \theta_{x0}) + o_{p}(1). \quad (4) $$

The expected first term in Equation (4) can be written as

$$ \mathrm{E}\big[\nabla_{x} \ell_{x}(\theta_{x0})^{\top} (\hat{\theta}_{x} - \theta_{x0})\big] = \mathrm{tr}\big\{\mathrm{E}\big[(\hat{\theta}_{x} - \theta_{x0})\, \nabla_{x}^{\top} \ell_{x}(\theta_{x0})\big]\big\}, $$

where $\mathrm{tr}\{\cdot\}$ is the trace of the matrix in $\{\cdot\}$. Because $\mathrm{E}[\nabla \ell(\theta_{0})] = 0$, the centering term in this trace is zero. Therefore, substituting the expansion of $\hat{\theta} - \theta_{0}$, the expected first term is $\mathrm{tr}(S J^{-1} I_{x}) + o(1)$, where $I_{x} = \lim_{n \to \infty} n^{-1}\, \mathrm{E}[\nabla \ell(\theta_{0})\, \nabla_{x}^{\top} \ell_{x}(\theta_{x0})]$. Next, we calculate the expected second term in Equation (4) as

$$ -\frac{1}{2}\, \mathrm{tr}\big(J_{x}\, S J^{-1} I J^{-1} S^{\top}\big) + o(1), \qquad J_{x} = -\lim_{n \to \infty} n^{-1}\, \mathrm{E}\big[\nabla_{x} \nabla_{x}^{\top} \ell_{x}(\theta_{x0})\big]. $$

In summary, $b_{1} = \mathrm{tr}(S J^{-1} I_{x}) - \frac{1}{2}\, \mathrm{tr}(J_{x} S J^{-1} I J^{-1} S^{\top}) + o(1)$.
Second, we evaluate $b_{2}$. Using the decomposition of the observed-data log-likelihood into the Q- and H-functions of the EM algorithm, $\ell_{x}(\theta_{x}) = Q(\theta_{x} \mid \theta_{x}') - H(\theta_{x} \mid \theta_{x}')$, and noting that $\mathrm{E}[Q(\theta_{x0} \mid \theta_{x0})] = n\, \mathrm{E}_{g}[\log f(x; \theta_{x0})]$, we have

$$ b_{2} = \mathrm{E}\big[\ell_{x}(\theta_{x0})\big] - n\, \mathrm{E}_{g}[\log f(x; \theta_{x0})] = -\, \mathrm{E}\big[H(\theta_{x0} \mid \theta_{x0})\big], $$

where $H(\theta_{x0} \mid \theta_{x0})$ is the H-function of the EM algorithm based on $X_{\mathrm{obs}}$ evaluated at $\theta_{x0}$.
Finally, we evaluate $b_{3}$. Define $\eta(\theta_{x}) = \mathrm{E}_{g}[\log f(x; \theta_{x})]$. Since $\nabla_{x}\, \eta(\theta_{x0}) = 0$, a Taylor expansion gives

$$ n\, \eta(\theta_{x0}) - n\, \eta(\hat{\theta}_{x}) = \frac{n}{2}\, (\hat{\theta}_{x} - \theta_{x0})^{\top} J_{x}^{*}\, (\hat{\theta}_{x} - \theta_{x0}) + o_{p}(1), $$

where $J_{x}^{*} = -\nabla_{x} \nabla_{x}^{\top}\, \eta(\theta_{x0})$. Taking the expectation of both sides, we have

$$ b_{3} = \frac{1}{2}\, \mathrm{tr}\big(J_{x}^{*}\, S J^{-1} I J^{-1} S^{\top}\big) + o(1). $$

Therefore, ignoring the $o(1)$ terms, we obtain an unbiased estimator for $n\, \mathrm{E}\big[\mathrm{E}_{g}[\log f(x; \theta_{x})]\big|_{\theta_{x} = \hat{\theta}_{x}}\big]$ as

$$ \ell_{x}(\hat{\theta}_{x}) - b_{1} - b_{2} - b_{3}. \quad (5) $$

The bias itself cannot be computed, since unknown quantities such as $\theta_{0}$ are involved.
We can construct an estimated version of the IC given in Equation (5). By replacing the unknown parameters with the consistent ML estimator and eliminating the expectation from the second term in Equation (5), we can substitute the first two terms with

$$ Q(\hat{\theta}_{x} \mid \hat{\theta}_{x}) - \hat{H}(\hat{\theta}_{x} \mid \hat{\theta}_{x}) - \hat{b}_{1}. $$

This is a natural estimator of the first two terms in Equation (5). The penalty term is estimated by replacing $\theta_{0}$ with its ML estimator, which can be consistently estimated under the assumption that the data as a whole are MAR. As a result, the following Equation (6) is obtained. As is conventional in deriving the AIC, we double the bias, obtaining the corrected criterion as

$$ \mathrm{IC} = -2\, \big\{Q(\hat{\theta}_{x} \mid \hat{\theta}_{x}) - \hat{H}(\hat{\theta}_{x} \mid \hat{\theta}_{x})\big\} + 2\, \big(\hat{b}_{1} + \hat{b}_{2} + \hat{b}_{3}\big), \quad (6) $$

where $Q(\hat{\theta}_{x} \mid \hat{\theta}_{x})$ is the Q-function of the EM algorithm based on $X_{\mathrm{obs}}$ evaluated at $\hat{\theta}_{x}$, $\hat{H}(\hat{\theta}_{x} \mid \hat{\theta}_{x})$ is the corresponding estimate of the H-function, and each $\hat{b}_{\bullet}$ is obtained from $b_{\bullet}$ by replacing $I$, $I_{x}$, $J$, $J_{x}$, and $J_{x}^{*}$ with the sample variance of the score contributions or the sample average of the second derivatives for each suffix $\bullet$. Under the regularity conditions assumed in [16], the estimators of the variance-covariance matrices in (6) converge to their population versions (e.g., $\hat{J}^{-1} \hat{I}$ converges to $J^{-1} I$ in probability as $n \to \infty$).
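The trace penalties in Equation (6) are built from per-observation score contributions. As a hedged illustration, the sketch below computes the generic TIC-type ingredient $\mathrm{tr}(\hat{I}\hat{J}^{-1})$ for a normal location model; the function name and example are ours, and the full criterion would additionally require the Q- and H-functions and the selection matrix $S$.

```python
import numpy as np

def tic_penalty(score_contribs, hessian):
    """TIC-type penalty tr(I_hat J_hat^{-1}) from per-observation score
    contributions (n x p) and the averaged negative Hessian (p x p).
    This is the sample-variance-of-scores ingredient of the trace terms;
    it is not the complete criterion in Equation (6)."""
    n = score_contribs.shape[0]
    I_hat = score_contribs.T @ score_contribs / n   # sample variance of scores
    return np.trace(I_hat @ np.linalg.inv(hessian))

# Example: normal location model, score_i = (x_i - mu)/var, J = 1/var.
rng = np.random.default_rng(2)
x = rng.normal(size=500)
mu_hat, var_hat = x.mean(), x.var()
scores = ((x - mu_hat) / var_hat)[:, None]
J_hat = np.array([[1.0 / var_hat]])
print(tic_penalty(scores, J_hat))  # close to 1 (one parameter, model correct)
```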
This IC is an extension of the AIC to missing data. It reduces to the original AIC [1] when $S = I$ and the data contain no missing values. Similar to the original AIC, our IC shares its starting point with [3,4]. However, the first and second terms of our IC differ from those of [3]. Our IC shares the first term with those of [4] and [6], but not the second term. This difference arises from the way in which the bias is approximated. The former [4] approximated the bias under the setting that all subsets are MAR. The latter [6] approximated the bias by using the Hermite expansion and Markov chain Monte Carlo methods.
4. Simulations
To confirm the validity of our proposed IC, two simulations were performed.
4.1. Selection of the Parameter Structure
We will confirm through simulations that our IC selects the correct model when there are missing values in the variables $x_{1}$ and $x_{2}$ of $v = (x_{1}, x_{2}, z)^{\top}$. In so doing, we compare our IC with a slightly modified AIC, since the original AIC proposed by [1] cannot deal with missing data:

$$ \mathrm{AIC} = -2\, \ell_{x}(\hat{\theta}_{x}) + 2p, $$

where $p$ is the dimension of the parameter vector $\theta_{x}$ and $\ell_{x}$ is the observed-data log-likelihood. This modified AIC has been used as a competitor to the information criterion proposed by [4]. As an additional competitor, we use $\mathrm{AIC}_{cd}$, which Cavanaugh and Shumway [4] developed for missing data:

$$ \mathrm{AIC}_{cd} = -2\, Q(\hat{\theta}_{x} \mid \hat{\theta}_{x}) + 2\, \mathrm{tr}\big\{\hat{I}_{c}(\hat{\theta}_{x})\, \hat{I}_{o}(\hat{\theta}_{x})^{-1}\big\}, $$

where $\hat{I}_{c}$ and $\hat{I}_{o}$ are estimates of the complete-data and observed-data information matrices, respectively.
Four candidate models were considered, all of which are based on a trivariate normal distribution with variables $(x_{1}, x_{2}, z)$. The respective models restrict the means of $x_{1}$ and $x_{2}$, and/or the variances of $x_{1}$ and $x_{2}$, or impose no restrictions at all. Model 1 restricts the two expectations to be equal and the two variances to be equal. Model 2 restricts only the expectations to be equal. Model 3 restricts only the two variances to be equal. Model 4 has no restrictions. In the simulations, one of the four models is the true data-generating model, and a model is selected according to each of the three information criteria: AIC, $\mathrm{AIC}_{cd}$, and our proposed IC.
With respect to model selection, it should be noted that all the criteria assume that the true model is included as a special case of each candidate model. Model 1, for example, is a special case of the other three candidate models. On the other hand, Model 2 is not a special case of Model 3, since Model 2 has no restrictions on the variances. Theoretically, Models 2, 3, and 4 cannot all be compared using an information criterion. For each model, the models that can or cannot be theoretically compared are shown in Table 1, where a number in parentheses indicates a model that cannot be theoretically compared. Nevertheless, we have included these theoretically non-comparable models in the set of candidates, because such a comparison is often conducted in practice, as pointed out in [18].
The simulation was conducted using the following procedure. First, a complete data set of a given size was generated from a trivariate normal distribution $N(\mu, \Sigma)$ for the variables $(x_{1}, x_{2}, z)$, where $\mu$ is the mean vector and $\Sigma$ is the covariance matrix. The means and variances used for the data generation are shown in Table 2, where the covariances are all set to a common value. The model that is assumed to be true is varied.
Second, denoting the missing indicators for $(x_{1}, x_{2})$ as $(r_{1}, r_{2})$, where $r_{j} = 1$ indicates that $x_{j}$ is observed, we generated missing values according to the following two missing-data mechanisms. The first mechanism is, for $j = 1, 2$,

$$ r_{j} = \begin{cases} 1 & (z \le c_{j}) \\ 0 & (z > c_{j}). \end{cases} $$

This is a case where the missing-data mechanism is non-smooth. The thresholds $c_{1}$ and $c_{2}$ were set to fixed constants. The second of the two missing-data mechanisms used the binomial distribution, with the observation probability given as, for $j = 1, 2$,

$$ P(r_{j} = 1 \mid z) = \frac{1}{\cosh(a_{j} z)}. $$

This is a smooth missing-data mechanism, and the coefficients $a_{1}$ and $a_{2}$ were set to fixed constants. With the first missing-data mechanism, the value of $x_{j}$ is always missing when $z$ is beyond the threshold $c_{j}$, and always observed when $z$ is below it. In the second mechanism, the missingness of $x_{j}$ is stochastically determined based on the inverse of the hyperbolic cosine. Notice that the missing data for $(x_{1}, x_{2}, z)$ are MAR under both mechanisms and that those for $(x_{1}, x_{2})$ alone are NMAR, since $z$, which is not included in the models, causes the missingness of $x_{1}$ and $x_{2}$.
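The two mechanisms can be sketched as follows. This is our reading of the description, and the default thresholds and coefficients in the code are placeholders rather than the values used in the simulations.

```python
import numpy as np

rng = np.random.default_rng(3)

def threshold_indicators(z, c1=0.0, c2=0.0):
    # Non-smooth mechanism: x_j is observed iff z is at or below a threshold.
    # c1 and c2 are placeholders, not the paper's settings.
    return z <= c1, z <= c2

def smooth_indicators(z, a1=1.0, a2=1.0):
    # Smooth mechanism: x_j is observed with probability 1/cosh(a_j * z),
    # which lies in (0, 1]; a1 and a2 are placeholder coefficients.
    p1, p2 = 1.0 / np.cosh(a1 * z), 1.0 / np.cosh(a2 * z)
    return rng.binomial(1, p1).astype(bool), rng.binomial(1, p2).astype(bool)

# Both mechanisms depend only on the always-observed z, so (x1, x2, z) is
# MAR, while (x1, x2) without z is NMAR.
x1, x2, z = rng.multivariate_normal(np.zeros(3), np.eye(3), size=1000).T
r1, r2 = smooth_indicators(z)
print(r1.mean(), r2.mean())  # observed fractions for x1 and x2
```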
Third, for the missing data, the model that gives the minimum IC value is selected for each of the three ICs. The procedure is repeated 1000 times, and the number of times that each of the four models is selected is recorded.
The results for the first missing-data mechanism are summarized in Table 3. As can be seen in the table, our IC is generally superior to its two competitors. In the cases where Models 1 to 3 are true, the proposed IC outperforms the other criteria for all sample sizes. In the case where Model 4 is true, the proposed IC underperforms the other criteria when the sample size is small. However, as the sample size increases from 200 to 800, the proposed IC comes to select the correct model as often as the other criteria. In addition to the above findings, the AIC is found to be more robust against violation of the assumed missing-data mechanism than $\mathrm{AIC}_{cd}$.
The results for the second missing-data mechanism are given in Table 4. Here, again, our IC is shown to be generally superior to its competitors. While all the information criteria become more capable of selecting the true model as the sample size increases, our IC is the most likely to select the true model. For any true model, our IC outperforms the other information criteria except when Model 4 is true; even in that case, our IC is equivalent to the others. In sum, our IC nearly consistently outperforms its major competitors and selects the correct model across the cases considered.
4.2. Selection of Linear Regression Models
In this simulation, we demonstrate how our IC works for linear regression models. Since the derivation given above does not consider covariates, our IC does not directly apply to a regression model. However, it is precisely in the practical analysis of regression models that information criteria are needed to select among candidate models. Thus, we naively applied our IC in a simulation and monitored its performance.
In this simulation, we compared the performance of our proposed IC with that of the AIC and $\mathrm{AIC}_{cd}$ criteria. Four regression models were used in the comparison. From here on, for simplicity, we will use more common notation. Let $y$ be the dependent variable and $(x_{1}, x_{2})$ be the independent variables. The regression model is

$$ y = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \varepsilon, $$

where $\varepsilon$ has mean zero and a finite variance conditional on $(x_{1}, x_{2})$. The four models considered are Model 5 with $\beta_{1} = \beta_{2} = 0$; Model 6 with $\beta_{1}$ nonzero and $\beta_{2} = 0$; Model 7 with $\beta_{1} = 0$ and $\beta_{2}$ nonzero; and Model 8 with both $\beta_{1}$ and $\beta_{2}$ nonzero. For example, the data-generating model of Model 6 with a given $\beta_{1}$ is that the joint distribution of $(y, x_{1}, x_{2})$ is trivariate normal with the mean vector and covariance matrix implied by the regression equation. These four models do not necessarily have a nested/nesting relationship. Model 5 is a special case of Models 6 to 8. Models 6 and 7 are special cases of Model 8, but they are not in a nesting relationship with one another. Thus, theoretically speaking, not all of these four models can be compared with AIC-type information criteria. However, in practice, the AIC has commonly been used for the comparison of nonnested/nonnesting models [18]. Thus, we compare the four models according to this convention.
The missingness of $y$ and $x_{1}$ depends only on $x_{2}$. To represent the missing-data mechanism more formally, we introduce missing-data indicators $(r_{y}, r_{1}, r_{2})$ for $(y, x_{1}, x_{2})$, each of which takes the value 1 when the corresponding variable is observed and 0 when it is not. The missing-data mechanisms that were used are

$$ r_{y} = \begin{cases} 1 & (x_{2} \le c_{y}) \\ 0 & (x_{2} > c_{y}), \end{cases} \qquad r_{1} = \begin{cases} 1 & (x_{2} \le c_{1}) \\ 0 & (x_{2} > c_{1}), \end{cases} \qquad r_{2} = 1, $$

with the thresholds $c_{y}$ and $c_{1}$ set to fixed constants.
The simulation was conducted according to the following procedure. First, a sample without missing values is created by drawing a random sample of size n from the joint normal distribution of $(y, x_{1}, x_{2})$ specified above. Second, the missing values in the sample are created through the missing-data mechanisms for $(y, x_{1})$ introduced above. Finally, the proposed IC and the competitor ICs are computed with such a sample for Models 5 to 8, and each is used to select the model corresponding to the minimum value. We repeated this procedure 1000 times, counting the number of times each model is selected. The counts are used as an indicator of how well each of the ICs works in selecting the best model.
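As an illustration of this procedure, the following skeleton (our own) draws samples, imposes the MAR mechanism, and tallies selections over 1000 replications. A naive complete-case Gaussian AIC stands in for the three criteria; faithful implementations would follow Section 3 and [4], and all coefficients and thresholds here are placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)

def draw_sample(n, b1=1.0, b2=0.0):
    # Data implied by y = b1*x1 + b2*x2 + e; the coefficients are ours.
    x1, x2 = rng.normal(size=n), rng.normal(size=n)
    y = b1 * x1 + b2 * x2 + rng.normal(size=n)
    return y, x1, x2

def make_missing(y, x1, x2, c=0.0):
    # y and x1 are missing when the always-observed x2 exceeds a placeholder
    # threshold c, so the data as a whole are MAR.
    r = x2 <= c
    return np.where(r, y, np.nan), np.where(r, x1, np.nan), x2

def cc_aic(y, x1, x2, use_x1, use_x2):
    # Naive complete-case Gaussian AIC for one candidate regression model;
    # a stand-in for AIC, AIC_cd, or the proposed IC in this skeleton.
    cols = [np.ones_like(y)] + ([x1] if use_x1 else []) + ([x2] if use_x2 else [])
    X = np.column_stack(cols)
    keep = ~np.isnan(y) & ~np.isnan(x1)
    yk, Xk = y[keep], X[keep]
    beta, *_ = np.linalg.lstsq(Xk, yk, rcond=None)
    s2 = np.mean((yk - Xk @ beta) ** 2)
    ll = -0.5 * len(yk) * (np.log(2 * np.pi * s2) + 1)
    return -2 * ll + 2 * (X.shape[1] + 1)

models = [(False, False), (True, False), (False, True), (True, True)]  # 5-8
counts = np.zeros(4, dtype=int)
for _ in range(1000):
    data = make_missing(*draw_sample(200))
    counts[int(np.argmin([cc_aic(*data, u1, u2) for u1, u2 in models]))] += 1
print(counts)  # how often each of Models 5-8 attains the minimum criterion
```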
The results of the simulation are shown in Table 5. Regardless of which model is the actual true model, our proposed IC appears to work well. In the case where Model 5 is true, our proposed IC outperforms the other two competitors for any sample size. Here, $\mathrm{AIC}_{cd}$ works much better than AIC but falls behind our IC. In the case where either Model 6 or Model 7 is true, $\mathrm{AIC}_{cd}$ shows the best performance for any sample size; the second best is our IC, while AIC selects the wrong model, Model 8, more than half the time. Interestingly, when Model 6 (7) is true, all three ICs select neither Model 5, which is in a nesting relationship with it, nor Model 7 (6), which is not. In the case in which Model 8 is true, AIC performs best. However, while AIC selects the true model more often than our proposed IC for a sample size of 200, our proposed IC is comparable for sample sizes greater than 200. $\mathrm{AIC}_{cd}$ always selects the wrong models. In summary, our proposed IC performed best or, at worst, second best throughout the simulation. Even in the second-best case, the performance of our proposed IC is very close to that of its higher-performing competitor.
Note that even for different values of the parameters in the missing-data mechanisms, we observed almost the same trend.
5. Conclusions
This paper considered the practical situation in which a set of missing data may not be MAR and developed a new information criterion to handle such a situation. This contrasts with previous ICs, which implicitly consider only the unrealistic case in which the data are MAR as a whole and subsets of the data are MAR as well. Our proposed IC uses the largest data set to estimate model parameters and circumvents the problem that the conventional ICs ignore. Using numerical simulations, it was shown that our new IC works better than, or at worst, equivalently to, its competitors.
The study of missing-data information criteria is far from complete, and the present paper requires refinement. Although we applied our IC to regression, this was done without a rigorous basis, as noted in the body of the paper; a solid foundation for the comparison of regression models needs to be developed. Furthermore, application of the proposed IC can be extended to cases in which the parameters of the model are not a linear function of the parameters for the entire data. The approach used to develop our IC need not be limited to regression models; rather, it can be applied to other conventional statistical models, such as generalized linear regression and the exploratory factor model. Further refinement of our IC is also possible. Currently, our IC uses a simple estimator of the H-function $H(\hat{\theta}_{x} \mid \hat{\theta}_{x})$; implementing the MCMC-based estimator of the H-function presented in [6] in our IC could improve the performance of model selection.
Further, the framework of our proposed IC might be applied to data collected under a two-phase design. Under the two-phase design, efficient estimation of regression parameters has been extensively developed [19,20]. In a two-phase design, the first-phase variables are observed for all subjects, while the second-phase variables are observed only for the subset of subjects selected on the basis of the first-phase variables, in order to avoid high cost. Since the missingness of the second-phase variables is caused by the values of the first-phase variables, the entire data set is MAR. However, the variables of interest vary from analysis to analysis, and in some cases the subset of the data used in the model becomes NMAR. Hence, variable selection becomes necessary. In such cases, our proposed IC might contribute to selecting variables. Extending our IC to deal with such two-phase designs remains a task for future research.