Having described the 24-model structure of missing data mechanisms, it is reasonable to ask whether it can be used to help identify the different missing data mechanisms. Evidence for the missing data mechanism gathered directly from the data would be of great value to analysts. Accordingly, the following sections examine the possibility of testing the data directly to gather evidence for the missing data mechanism present.
3.1. The Inability to Test Directly for MCAR or MNAR Missing Data
On the face of it, the 24-model structure suggests the possibility of a direct test for MCAR missingness. This would require that the probability of missingness, Pr(R = 1), where R is the missing data indicator (1 = missing, 0 = observed), be constant irrespective of the variables in the conditional form Pr(R = 1 | Y, X, Z), i.e., for every combination of Y, X, and Z, as listed in the final column of Table 4, we have the following:
Pr(R = 1 | Y, X, Z) = Pr(R = 1)
As outlined in
Section 1, however, MCAR missingness is driven by a random process. The inability to identify randomness as such and, indeed, the temptation to assign patterns to randomness is a well-understood phenomenon that has been extensively explored in the literature, for example, as described in [
16,
17,
18]. This presents a significant barrier to the identification of MCAR missingness from the data alone. In addition, there is the risk that the set of variables under examination fails to include the independent variable that drives the missingness. In this way, model misspecification can be confused with an indicator of MCAR missingness. This is aggravated by the fact that the missingness may be driven by a spatial or temporal component that is not included in the dataset.
This is why the most common method of identifying MCAR involves comparing the means and distributions of the data with and without any imputed (missing) data; the mechanism is considered MCAR if no statistically significant difference is detected between the two sets of means and distributions. However, it has been shown in [
13], as described in
Section 1, that the equality of means and distributions may be due to other causes and not driven by an MCAR mechanism.
Similarly, there are formidable barriers to directly identifying an MNAR mechanism from the data. By definition, an MNAR dataset is missing the very components that drive the MNAR mechanism. This increases the likelihood of both type I and type II errors (identifying non-MNAR missingness as MNAR, and
vice versa). Additionally, as can be seen from
Section 2, there are far more MNAR models than MCAR and MAR models combined. This is compounded by the complexity of many of the MNAR models. Overall, these characteristics make the successful direct identification of MNAR missingness from the data highly unlikely.
3.3. Simulating Missing Data
In order to assess the feasibility of direct diagnosis for MAR missingness, it is necessary to create simulated data for which the missing data mechanisms are known. This section describes how the simulated data are created for each of the missing data types (MCAR, MAR, and MNAR). The next sections look at the direct testing for these mechanisms and conclude with an examination of the experimental results. Data creation and analysis were carried out in R version 4.1.0.
The approach used to generate missing data is based on that of Ref. [
19] as described in Appendix A of that paper. It has, however, been modified to allow for a greater range of individual forms of the missing data mechanisms, as described in
Section 2. The following core data variables are created:
Y, the dependent variable,
X, the independent variable, which may also have missing data, and
Z, an independent variable that is always completely observed. There is no predetermined relationship between the dependent variable
Y and the independent variables
X and
Z, as all three are the products of random draws from a statistical distribution. This is to avoid possibly introducing bias into the diagnostic development. In addition, each
X and
Y variable has a correlated partner variable (denoted X* and Y*, respectively), which is used to generate the different missing data indicators, as described in the following sections. In combination, these 5 data variables generate 24 different models (as outlined in
Section 2) that produce missing data. In each case, the different combinations of these variables,
Y, Y*, X, X*, and Z, are used to generate a missing data indicator, which has an index number that shows the type of missing data mechanism (MCAR, MAR, or MNAR) as well as the specific functional form, as described in
Section 2 and detailed in
Table 4. The missing data indicator is then used to delete the missing data from the
Y variable (where 1 = missing data, 0 = observed data). Conversion to the terminology used elsewhere in this document is shown in
Table 6.
In order to identify the X variable missing data, it is necessary to create an additional missing data indicator series. This series is generated by random draws from a binomial distribution with a probability of 0.2.
An additional amendment to the original Gomer and Yuan approach [
19] involves adding draws from a random normal distribution to the
Y, Y*, X, X*, and Z variables. This is to add a random error element to the original data series (these series are referred to as ‘dirty’ data) to replicate more realistic data conditions. Three such random error series are created in each case. For the normal distribution dataset, for instance, these comprise draws from N(0, 0.1), N(0, 0.5), and N(0, 1.0) distributions. A separate draw is created for each of the Y, Y*, X, X*, and Z variables to avoid introducing an unwanted element of correlation between the variables.
The Z variable is created by generating 10,000 random draws from the same distribution used for Y, Y*, X, and X*.
In order to generate the Y* and X* variables, it is first necessary to create a correlation matrix that correlates Y to Y* and X to X*. The Y, Y*, X, and X* variables are generated using multinomial distributions, as in the original Gomer and Yuan paper. The random error variables for ‘dirtying’ the Y, Y*, X, X*, and Z series are generated by random draws from a normal distribution with zero mean and an appropriate standard deviation. These random error factors are then added to the Y, Y*, X, X*, and Z series. As the random error approaches the same magnitude as the original series, the correlation coefficients may need to be adjusted to maintain the (approximate) 0.3 correlation between the Y and Y* and the X and X* series.
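As an illustration of this construction, the following R sketch generates the five core series for the Normal(0, 1) case. It uses MASS::mvrnorm for the correlated pairs rather than the paper's exact generation code, and the object names (Y, Ystar, X, Xstar, Z) and the 0.5 error standard deviation are illustrative assumptions.

```r
# Minimal sketch (not the paper's exact code): five core series for the
# Normal(0, 1) case, with Y*/X* as correlated partners (target r of about 0.3)
# and a separate 'dirtying' error draw added to each series.
library(MASS)

n     <- 10000
sigma <- matrix(c(1, 0.3,
                  0.3, 1), nrow = 2)           # 0.3 correlation between a variable and its partner

yy <- mvrnorm(n, mu = c(0, 0), Sigma = sigma)  # Y and Y*
xx <- mvrnorm(n, mu = c(0, 0), Sigma = sigma)  # X and X*
Y  <- yy[, 1]; Ystar <- yy[, 2]
X  <- xx[, 1]; Xstar <- xx[, 2]
Z  <- rnorm(n)                                 # always completely observed

# 'Dirty' versions: an independent error draw for each series (e.g., N(0, 0.5))
err_sd  <- 0.5
Y_dirty <- Y     + rnorm(n, 0, err_sd)
Ystar_d <- Ystar + rnorm(n, 0, err_sd)
X_dirty <- X     + rnorm(n, 0, err_sd)
Xstar_d <- Xstar + rnorm(n, 0, err_sd)
Z_dirty <- Z     + rnorm(n, 0, err_sd)
```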
The missing data mechanism indicator variable denotes which Y series data points need to be removed to create the relevant missing data. In contrast, the missing X data indicator, as described above, is used across all models to remove missing X data from the X series.
To avoid confusion, the term case will refer to the six different model classes tested, i.e., base case clean data; base case dirty data; clean missing data; dirty missing data 1; dirty missing data 2; and dirty missing data 3. The term model will refer to the 24 different functional forms tested in each case. The first two cases—base case clean data and base case dirty data—are only used to check the model construction and are not used in any of the analyses.
Three different distributions were used to generate the simulated data for the six different model cases, as shown in
Table 7. These were chosen to represent distributions commonly encountered in real-world datasets [
20], with the normal distribution representing continuous data and the Poisson distributions representing discrete data.
To create the MCAR data, a series of 10,000 draws from a binomial distribution with a probability of 0.2 is generated. This gives a series (denoted as MCARi) that randomly allocates 20% of the Y values as missing data. The Y data then have the entries that correspond to the missing data indicator (MCARi = 1) deleted.
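A minimal R sketch of this step might look as follows; the object names are illustrative rather than the paper's.

```r
# Minimal sketch of the MCAR step: a binomial missing data indicator with
# p = 0.2, then deletion of the flagged Y entries.
n      <- 10000
Y      <- rnorm(n)                          # stand-in for the Y series generated above
MCARi  <- rbinom(n, size = 1, prob = 0.2)   # 1 = missing, 0 = observed
Y_mcar <- Y
Y_mcar[MCARi == 1] <- NA                    # roughly 20% of Y deleted completely at random
```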
MAR missing data mechanism indicators are generated by applying the functional forms listed in
Table 4, using the same index numbers for identification. This generates three missing data indicators that have one variable each, three that have two variables each, and one that is based on three variables. The details are given in
Table 8.
The Y data then have the entries that correspond to the appropriate missing data indicator deleted.
Having more than one variable in the functional form generating the missing data means that the initial (one-variable) threshold levels used by Gomer and Yuan [
19] in generating their simulated data no longer yield 20% missing data. As a result, the two- and three-variable functional forms use lower threshold levels that yield approximately 20% missing data indicators; these were found through a trial-and-error process. The same applies to the threshold levels used to create MNAR models with more than one variable in the functional form.
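As an illustration, the sketch below checks that a candidate threshold for a hypothetical two-variable form yields roughly 20% missing data. The functional form used here (Y* + Z) and the 0.8 quantile starting point are assumptions for illustration, not forms taken from Table 8.

```r
# Illustrative check of a candidate threshold for a two-variable functional form.
n       <- 10000
Ystar   <- rnorm(n)                    # stand-ins for the series generated earlier
Z       <- rnorm(n)
score   <- Ystar + Z                   # hypothetical two-variable form
cutoff  <- quantile(score, probs = 0.8)
mar_ind <- as.integer(score > cutoff)  # 1 = missing, 0 = observed
mean(mar_ind)                          # proportion flagged; should be close to 0.20
```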
MNAR missing data mechanism indicators are also generated by applying the functional forms listed in
Table 4 and using the same index numbers for identification as in these tables. This generates two missing data mechanism indicators with one variable each, six with two variables each, six with three variables each, and two with four variables each. The details are provided in
Table 9.
This time, both the X and the Y data have the entries that correspond to the appropriate missing data indicator deleted.
To create a more realistic dataset, a separate error series is generated for each of Y, Y*, X, X*, and Z, and in each case, the error is added to the original series before any missing data deletions are carried out.
3.4. Testing for MAR Missingness
As the missing data can be represented as a dummy series (where 1 represents missing data and 0 represents observed data), a logical approach is to use a binary statistical test. In this case, a generalized linear model (GLM) with a logit link function is used as the test, and the output comprises the indicators of statistical fit as well as an ‘area under the curve’ (AUC) plot [
21]. The missing data indicators generated during the simulated data creation process serve as the dummy variables. In a sense, this is just a reversal of the normal process of indicating missing data; for real data, the presence of missing data generates the dummy 1/0 missing data indicator series, whereas in this case, the missing data indicators generate the gaps in the data.
As previously mentioned, the analysis was carried out on four cases for each simulated data generation, as listed in
Table 10, where “DD1”, “DD2”, and “DD3” refer to the different levels of random errors used with the distributions.
As mentioned in
Section 3.3, the first two cases—“Clean CC base case” and “Dirty CC base case”—are solely to check model construction. As these cases have complete data (even if one has a random error added), they should provide the strongest signals to identify the variables used to generate the missing data indicators.
For all tests, the statistical significance level was taken to be 0.001 to reduce the likelihood of random noise being taken as a signal. Any results significant only at the 0.01 level were taken to be a partial signal and are distinguished in the tables in
Appendix A and
Appendix B by being represented by lowercase letters in parentheses. The Akaike information criterion (AIC) and area under the curve (AUC) diagnostics were also used to assess the returns, but these tables were omitted from the paper owing to the volume of material involved.
Although the tests were carried out on all 24 functional forms of the same case at the same time, the tests and results will be presented by missing data type.
Importantly, GLM tests cannot be carried out on data with missing values. As the missing data indicators are directly generated from the variables, it follows that deleting incomplete rows also eliminates the associated missing data indicators and makes testing impossible. To avoid this, it is necessary to fill in the missing data with substitute values. The purpose of this substitution is not to find values that have similar magnitudes to the original (missing) data but to fill in the gap without introducing unnecessary variation into the process. One approach is to use the arithmetic means of the
Y and
X variables, respectively, as substitutes for their missing values. This preserves the overall variable mean, and the impact of potentially reducing the variable’s standard deviation is acceptable as the least worst substitution option. An alternative is to use random values drawn from the same distribution as the models under test. Most other potential substitute values risk skewing the results or adding no new information (a normal distribution, for example, has a mean, median, and mode of 0, thereby eliminating three alternative substitutes). The results of the random-based tests are used to assess the accuracy of the single mean imputation results for cases 4, 5, and 6 (there is no missing data deletion for cases 1 and 2, and data without a random error are unlikely to be encountered with real data). The results of using these different substitutions are compared and discussed in
Section 3.5.5.
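A brief sketch of the two substitution approaches, for the Normal(0, 1) case, is given below; the object names are illustrative.

```r
# Sketch of the two gap-filling approaches compared in Section 3.5.5.
n      <- 10000
Y      <- rnorm(n)
Y_miss <- Y
Y_miss[rbinom(n, 1, 0.2) == 1] <- NA                 # stand-in for a Y series with deletions

# (a) single mean substitution: preserves the mean, shrinks the standard deviation
Y_mean_fill <- Y_miss
Y_mean_fill[is.na(Y_mean_fill)] <- mean(Y_miss, na.rm = TRUE)

# (b) random substitution: draws from the same distribution as the model under test
Y_rand_fill <- Y_miss
Y_rand_fill[is.na(Y_rand_fill)] <- rnorm(sum(is.na(Y_miss)), 0, 1)
```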
MAR: For all six cases and seven MAR models, the GLM test was carried out using the relevant missing data indicator (MARii to MARviii) as the dependent variable and the five remaining variables (Y, Y*, X, X*, and Z) as independent variables. A logit function was used as the GLM link function. For the test to provide evidence for the MAR missing data mechanism when using these data, the only statistically significant items should be the intercept and the variables used to generate the missing data indicator.
MCAR: For all six cases, the GLM test was carried out using MCARi (the random binomial draws) as the dependent variable and the five remaining variables (Y, Y*, X, X*, and Z) as independent variables. A logit function was used as the GLM link function. For the test to provide evidence for the MCAR missing data mechanism with this dataset, the only statistically significant item at the 0.001 level should be the intercept.
MNAR: For all six cases and sixteen MNAR models, the GLM test was carried out using the relevant missing data indicator (MNARix to MNARxxiv) as the dependent variable and the five remaining variables (Y, Y*, X, X*, and Z) as independent variables. A logit function was used as the GLM link function. For the test to provide evidence for the MNAR missing data mechanism with this dataset, the only statistically significant items should be the intercept and the variables used to generate the missing data indicator.
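A sketch of the test itself, in R, might look as follows. The pROC package is used for the AUC purely as an illustration, and the MAR-style indicator and gap-filled series are constructed here only so that the example is self-contained; the object names are assumptions rather than the paper's.

```r
# Sketch of the diagnostic test: a logit GLM with a missing data indicator as
# the dependent variable and the five series as predictors, followed by an AUC.
library(pROC)

n     <- 10000
Y     <- rnorm(n); Ystar <- 0.3 * Y + sqrt(1 - 0.3^2) * rnorm(n)
X     <- rnorm(n); Xstar <- 0.3 * X + sqrt(1 - 0.3^2) * rnorm(n)
Z     <- rnorm(n)

score   <- Ystar + Z
mar_ind <- as.integer(score > quantile(score, 0.8))   # illustrative MAR-style indicator

Y_fill <- Y
Y_fill[mar_ind == 1] <- mean(Y[mar_ind == 0])          # delete, then single-mean substitute

dat <- data.frame(ind = mar_ind, Y = Y_fill, Ystar = Ystar,
                  X = X, Xstar = Xstar, Z = Z)

fit <- glm(ind ~ Y + Ystar + X + Xstar + Z,
           family = binomial(link = "logit"), data = dat)

summary(fit)                       # z-scores and p-values (0.001 threshold in the text)
auc(roc(dat$ind, fitted(fit)))     # area under the ROC curve
```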
3.5. Experimental Results from Testing with Simulated Data
This section focuses on examining the possibility of diagnosing MAR missingness directly from the data using this simple test. In order to do so successfully, it will also be necessary to contrast the results found for the MCAR and MNAR data to identify possibly confusing results. As mentioned in
Section 3.4, this analysis is carried out using the single mean substitution data; the impact of changing this approach to random (from the same underlying distribution) infilling is assessed in
Section 3.5.5.
There are results for 24 models for each of the four cases; this gives a total of 96 separate tests for each set of simulated data. Unsurprisingly, the two base cases—clean complete case data and dirty complete case data—produced highly accurate results and will not be discussed further.
The results for each distribution are given in tables in Appendix A. These comprise
Table A1,
Table A3 and
Table A5, which list the variables found significant for the Normal(0,1), Poisson(30), and Poisson(5) simulated data, respectively. In each case, the tables also indicate the relevant missing data mechanism and the functional form for each model. The data in the output test summaries from each of the model runs are also important for interpreting the results, but, owing to the large volume of output, these are excluded from this paper. Following the structure used in the rest of this document, the results will be discussed by missing data mechanism type.
3.5.1. MAR Results
As the number of variables in each functional form affects the fit of the GLM test, they will be discussed separately.
For all three distributions and four missing data cases, the tests on the one-variable functions (
Y*, X*, or Z; that is, models MARii, MARiii, and MARiv) resulted in the failure of the test and the warning messages ‘1: glm.fit: algorithm did not converge’ and ‘2: glm.fit: fitted probabilities numerically 0 or 1 occurred’. In each case, the AUC had a value of 1, indicating a perfect fit. In these cases, the Akaike information criterion (AIC), a diagnostic statistic for assessing model fit (see Ref. [
22]), is very low (12).
Both of the two-variable functions containing Y* (models MARv and MARvii) appear sensitive to the missing data; this was shown by the Y variable also being found significant. In contrast, this sensitivity was not found for the X* variable: in both combinations that used X* to generate the missing data mechanism indicator (models MARvi and MARvii), only the X* (not the X) variable was found to be statistically significant. The Z variable was correctly identified in all cases in which it was part of the model. The AUC values for all three two-variable functional forms were in the region of 0.95, ranging from 0.950 to 0.957. An examination of the model test summaries found that the correctly identified variables always had a positive z-score, and the intercept always had a negative z-score. In the cases where Y was incorrectly identified as a significant variable, it always had a negative z-score.
As with the two-variable models, once the missing data were deleted, the three-variable model (model MARviii) also found the Y variable significant. Again, this sensitivity was not found for the X* variable, and the Z variable was only identified as significant when present in the model. The AUC values for the three-variable functional form were in the region of 0.92, ranging from 0.915 to 0.931. As with the two-variable models, the test summaries indicated that correctly identified variables always had positive z-scores, the intercept always had a negative z-score, and if Y was incorrectly identified as a significant variable, it always had a negative z-score.
The overall results for the MAR model tests across all four missing data cases and three distributions suggest that it may be possible to use these results to acquire evidence for MAR missingness. This would involve combining the variables found to be significant with the signs of their corresponding z-scores. This result appears robust, even in the presence of a substantial level of random data error.
Before these results can be fully accepted, however, it is necessary to contrast them with the results obtained for MCAR and MNAR data.
3.5.2. MCAR Results
For all three distributions and all four missing data cases, the GLM output for the MCAR model (MCARi) shows that the only statistically significant part of the fitted model is the intercept. This implies that Pr(R = 1 | Y, X, Z) = Pr(R = 1), which is the expected result for MCAR missing data. The AUC results support this finding, in that all of the distributions and cases yield a value close to 0.5. This is a value that is consistent with a randomly generated data distribution.
These results do not overlap with any of the MAR results and so are unlikely to cause confusion. As discussed in
Section 3.1, however, these results cannot be extrapolated beyond this simulated data.
3.5.3. MNAR Results
Again, as the number of variables in each functional form affects the fit of the GLM test, they will be discussed separately. The focus of the discussion is whether there is a possibility of confusion between the MAR and MNAR results.
One-variable models: In the X-based models, both X and X* are found to be statistically significant, unlike in the MAR models, where X is never found to be significant. The Y-based models find both Y and Y* to be significant, with Y consistently having a positive z-score. In contrast, the z-score for Y in the MAR models is always negative. These characteristics help distinguish MNAR models from MAR models.
Two-variable models: Z is found to be statistically significant only when it is present in a model. If X is in the model, X is always returned as significant, often along with X*. If Y is included, it is usually returned with Y*, and Y has a positive z-score. If only Y* is in the model, the Y z-score is negative. These MNAR models can be distinguished from MAR models because none of the MAR models found X to be significant, and Y in the MAR models has a negative z-score when found significant.
Three-variable models: Z is found statistically significant only when present in the model. Y and Y* are often returned together, but if Y is present in the model, it has a positive z-score; if it is not, it has a negative z-score. X is always correctly returned as significant, but it is sometimes accompanied by X*, and vice versa. As before, it appears that the MNAR models can be distinguished from the MAR models because none of the MAR models find X to be significant, and Y in the MAR models has a negative z-score when it is found to be significant.
Four-variable models: All five variables are returned as significant in both cases. They can be distinguished by the Y z-score, that is, positive if Y is in the model, negative if it is not. X and Y with positive z-scores are never found in MAR models; this can be used to distinguish between MAR and MNAR models.
In summary, tests on the single-variable MAR models failed to identify any significant variables, but the AUC of “1” suggested a perfect fit was present. As this was not found with any of the MCAR or MNAR models, this is a unique return for these MAR models. It may, however, be an artifact of the simulated data and unlikely to be encountered with real data. MCAR returns were distinctive but may not be found with real data.
It may also be possible to distinguish between MAR and MNAR missingness. For MAR, the X variable is never returned as statistically significant, whereas if the Y variable is returned as statistically significant, it has a negative z-score. In contrast, MNAR returns X as statistically significant if present in the model, and if Y is correctly returned as statistically significant, it always has a positive z-score. Z is always returned as statistically significant if present.
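These distinguishing features can be restated as an informal decision rule, sketched in R below. This is only a restatement of the patterns observed with the single mean substitution data, not a validated test, and, as Section 3.5.5 shows, it breaks down when random-value infilling is used.

```r
# Informal restatement of the decision rule suggested above. 'sig' is the set
# of predictors significant at the 0.001 level and 'z' a named vector of their
# z-scores from the GLM summary; names are illustrative.
classify_mechanism <- function(sig, z, converged = TRUE, auc = NA) {
  if (!converged && isTRUE(auc == 1)) return("MAR (one-variable functional form)")
  if (length(sig) == 0)               return("consistent with MCAR")
  if ("X" %in% sig)                   return("MNAR")
  if ("Y" %in% sig && z["Y"] > 0)     return("MNAR")
  "MAR"
}

# Example: X returned as significant points to MNAR under this rule
classify_mechanism(sig = c("X", "Xstar", "Z"),
                   z   = c(X = 4.2, Xstar = 3.5, Z = 5.1))
```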
3.5.4. Out-of-Sample Testing
Having developed a series of tests to identify missing data mechanisms, the next step is to evaluate them against an out-of-sample dataset. The test data were created by following exactly the same procedure as used for the original simulation data (as described in
Section 3.3) but using a different probability distribution and random error generators. This time, the Beta(1, 2) distribution was used with the three levels of random errors generated by N(1, 0.1), N(0, 0.2), and N(0, 0.3) random draws for cases 4, 5, and 6, respectively. The same six cases were created as before, and the results will be evaluated on their ability to identify the correct missing data mechanism. The variables found significant at the 0.001 level are given in
Table A7, found in
Appendix B. Again, the first two cases (which have no missing data) are omitted from this evaluation.
MAR: The test for missing data under a one-variable MAR mechanism results in model failure, as indicated by the warning messages ‘1: glm.fit: algorithm did not converge’ and ‘2: glm.fit: fitted probabilities numerically 0 or 1 occurred’. In each case, the AUC had a value of 1, indicating a perfect fit. This is exactly what is found with the Beta(1, 2) data in all cases and is found in no other models. For the two-variable MAR models, the test is considered successful when the correct variables are returned as significant; these may also show Y as significant but with a negative z-score. For the Beta(1, 2) data, the only times Y as well as Y* are returned as statistically significant, the associated Y z-scores are negative. All other variables identified as significant are correctly selected. For the three-variable MAR model, the expected result is that the correct variables are returned as significant; these may also be accompanied by Y but with a negative z-score. The Beta(1, 2) data return the correct variables as significant in all cases, and when Y is also returned as significant, it has a negative z-score.
MCAR: Apart from case 3 (missing data with no random errors, which found no significant variables), only the intercept was found significant, and the AUC was close to or at 0.5. These results, however, were also returned for one of the MNAR models. This finding underlines the difficulty of directly identifying MCAR in the data. More significantly, none of these results can be confused with the MAR results.
MNAR: These results are again grouped by the number of variables in each model and assessed in terms of their likelihood of being mistaken for MAR results.
One-variable models: For the X-based model, the test always returns X and X* as significant. For the Y-based model, both Y and Y* may be returned as significant, but the z-score for the Y variable is always positive. As noted in the MCAR discussion, the Beta(1, 2) results sometimes resemble those of MCAR. However, in no case do they resemble MAR results, either because of the specific variables identified as significant or the sign of the z-scores.
Two-variable models: X is always returned as significant if correct, X* is only returned as significant if correct, Y is returned as significant with a positive z-score if correctly identified, and Z is always correctly identified.
Three-variable models: X is always returned as significant if correct, Y is always returned as significant with a positive z-score if correct, and Z is always correctly returned as significant where present in the model.
Four-variable models: All five variables are returned as statistically significant, with the models distinguishable by the sign of the Y z-score.
Overall, it appears that the tests were successful in distinguishing between MAR and MNAR, as the MNAR results were different from those found for MAR.
3.5.5. Comparing the Impact of Using Random Value Missing Data Infilling on the Test and Its Results
The results so far suggest that using this diagnostic test to diagnose MAR missingness is a realistic option. Repeating the testing with a data series that uses random substitution, however, casts doubt on these results. As can be seen from
Table A2,
Table A4 and
Table A6, the apparently clear signals for MAR broke down when faced with less orderly data. While the single-variable MAR models also failed with the same warning messages, the two- and three-variable models were not sufficiently differentiated from MNAR results to allow a clear distinction. For the two-variable models, both
Y and
Y* were often returned, irrespective of whether Y* was actually in the model. In addition, when
Y was returned, it could have either a positive or a negative sign. This indicated that the clean
Y-sign signal from the single mean replacement approach could no longer be taken as a reliable indicator of the missing data mechanism type. For the three-variable models, the correct model variables were returned, but
Y was also identified as significant. In these model returns, the
Y sign could also be either positive or negative. Finally, the MCAR and several MNAR returns also looked similar to the MAR returns, further reducing the likelihood of correctly identifying MAR missingness.
Taken together, these findings raise significant doubts about the ability of this test to diagnose MAR missingness directly from the data. Along with this is the possibility that there may be a high correlation between some of the variables within the dataset. This correlation can be the result of random chance, a direct relationship between the variables, or the presence of a confounder that influences two or more of the variables. Such a correlation could distort the test results and appear to identify MAR when it is not present.
There is also the possibility of the dataset containing mixed missingness. If more than one type of missing data mechanism is present, the impact on the test becomes unpredictable; it could generate either type I or type II errors, i.e., incorrectly identifying MAR when it is not present, or vice versa.