# An Overview and Evaluation of Recent Machine Learning Imputation Methods Using Cardiac Imaging Data


## Abstract


## 1. Introduction

## 2. Background: Review of Current Machine Learning Imputation Methods

#### 2.1. Introduction and the Nature of Missingness

#### 2.2. Nearest Neighbors
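As a minimal sketch of the general technique (not the authors' implementation; the `knn_impute` helper, its root-mean-square partial distance over jointly observed features, and the `k = 3` default are all assumptions of this illustration), k-NN imputation fills each missing entry with the mean of that feature over the k nearest neighbors:

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill NaN entries with the mean of the k nearest neighbors.

    Distance between two samples is computed only over the features
    both samples have observed (a common "partial distance" choice).
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, row in enumerate(X):
        missing = np.isnan(row)
        if not missing.any():
            continue
        # Rank all other samples by partial distance to sample i.
        dists = []
        for j, other in enumerate(X):
            if j == i:
                continue
            shared = ~np.isnan(row) & ~np.isnan(other)
            if not shared.any():
                continue
            d = np.sqrt(np.mean((row[shared] - other[shared]) ** 2))
            dists.append((d, j))
        dists.sort()
        # Impute each missing feature from the k nearest neighbors
        # that actually observed that feature.
        for col in np.where(missing)[0]:
            vals = [X[j, col] for _, j in dists
                    if not np.isnan(X[j, col])][:k]
            if vals:
                filled[i, col] = np.mean(vals)
    return filled
```

With ties in distance, neighbors are taken in index order; a production implementation would also need a policy for samples that share no observed features with any neighbor.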

#### 2.3. Self-Organizing Maps

#### 2.4. Decision Trees

#### 2.5. Bayesian Networks

#### 2.6. Past Performance of Machine Learning Imputation Methods

#### 2.7. Dealing with Missing Not at Random

## 3. Materials and Methods

pr_post for each rule fired by a test sample. For samples that triggered negative rules, 1 − pr_post was used instead, so that the probabilities could be interpreted as the probability of predicting positive. This probability was then thresholded to produce an AUC using trapezoidal approximation. Like sensitivity and specificity, the AUC was averaged over five cross-validation runs.
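The threshold-sweep-plus-trapezoid computation described above can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the function name and the choice to threshold at each distinct predicted probability are assumptions of the sketch.

```python
import numpy as np

def roc_auc_trapezoidal(y_true, probs):
    """Sweep thresholds over predicted probabilities of the positive
    class, collect (FPR, TPR) points, and integrate the resulting
    ROC curve with the trapezoidal rule."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs, dtype=float)
    pos = np.sum(y_true == 1)
    neg = np.sum(y_true == 0)
    # Threshold at each distinct probability, highest first, so the
    # curve runs monotonically from (0, 0) to (1, 1).
    fpr, tpr = [0.0], [0.0]
    for t in np.sort(np.unique(probs))[::-1]:
        pred = probs >= t
        tpr.append(np.sum(pred & (y_true == 1)) / pos)
        fpr.append(np.sum(pred & (y_true == 0)) / neg)
    fpr.append(1.0)
    tpr.append(1.0)
    # Trapezoidal approximation of the area under the curve.
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))
```

A perfect ranking yields 1.0, a perfectly inverted one 0.0, and a classifier no better than chance hovers near 0.5.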

## 4. Results

## 5. Discussion

## 6. Conclusions

## Supplementary Materials

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References


0. If (EF is between 0.395 and 0.480) and (IVSd z-score is Normal) then (MRI will be Non-Positive). Posterior odds = 3.3, posterior probability = 0.767.
1. If (EF is greater than 0.480) and (IVSd z-score is Normal) then (MRI will be Non-Positive). Posterior odds = 3.3, posterior probability = 0.767.
2. If (EF is less than 0.395) then (MRI will be Positive). Posterior odds = 10.0, posterior probability = 0.909.
3. If (EF is greater than 0.480) and (IVSd z-score is High) then (MRI will be Positive). Posterior odds = 10.0, posterior probability = 0.909.
4. If (EF is between 0.395 and 0.480) and (IVSd z-score is High) then (MRI will be Positive). Posterior odds = 10.0, posterior probability = 0.909.
5. If (EF is greater than 0.480) and (IVSd z-score is Low) then (MRI will be Positive). Posterior odds = 10.0, posterior probability = 0.909.
6. If (EF is between 0.395 and 0.480) and (IVSd z-score is Low) then (MRI will be Positive). Posterior odds = 10.0, posterior probability = 0.909.

**Figure 4.** Accuracy and agreement of decision-tree-imputed values in 10-fold cross-validation. Accuracy was calculated from the imputed values for the samples that had observed values, while agreement was calculated from the imputed values for the samples that did not.

**Figure 5.** Imputed values for four representative variables: (**a**) ejection fraction (EF); (**b**) interventricular septum thickness z-score (IVSdZScore); (**c**) tricuspid regurgitation max pressure gradient (TR Max PG); and (**d**) tricuspid regurgitation max velocity (TR Max vel). Observed values for the positive class are shown as black circles and observed values for the negative class are shown as black X's. Imputed values for mean, k-NN, and SOM imputation are shown as red, green, and blue dots, respectively. Because decision tree (DT) imputation requires discretized values, its imputed values are reported as a discretized range.

**Figure 6.** Performance of imputation-augmented rulesets compared to unaugmented rulesets: (**a**) sensitivity vs. specificity of 14-variable models evaluated on complete vs. imputed data; and (**b**) average receiver operating characteristic (ROC) curves of 14-variable models evaluated on complete data.

**Figure 7.** Performance of 27-variable rulesets compared to 14-variable rulesets: (**a**) sensitivity vs. specificity of 27-variable models compared to 14-variable models evaluated on complete vs. imputed data; and (**b**) average ROC curves of 27-variable models evaluated on complete data.

**Table 1.** Variable definitions for the 14-variable and 27-variable sets, and the percentage of each variable that was missing in the positive (+) versus the non-positive (−) MRI group.

Variables in 14-Variable Set | Definition | Percentage Missing (+) | Percentage Missing (−)
---|---|---|---
BSA | Body surface area | 3.2% | 8.8%
EDV index | End diastolic volume index | 38.7% | 38.6%
ESV index | End systolic volume index | 38.7% | 38.6%
SV index | Stroke volume index | 38.7% | 38.6%
FS | Fractional shortening | 3.2% | 5.3%
EF | Ejection fraction | 32.3% | 35.1%
Ao V2 max | Aortic V2 max | 3.2% | 1.8%
Ao max PG | Aortic max pressure gradient | 3.2% | 1.8%
MV E/A | Mitral valve E/A ratio | 16.1% | 1.7%
IVSd z-score | Interventricular septum thickness measured in diastole, z-score | 3.2% | 10.5%
LVIDd z-score | Left ventricular internal dimension measured in diastole, z-score | 3.2% | 10.5%
LVIDs z-score | Left ventricular internal dimension measured in systole, z-score | 3.2% | 12.3%
LVPWd z-score | Left ventricular posterior wall thickness measured in diastole, z-score | 3.2% | 10.5%
LV mass z-score | Left ventricular mass measured in diastole, z-score | 3.2% | 10.5%

Additional Variables in 27-Variable Set | Definition | Percentage Missing (+) | Percentage Missing (−)
---|---|---|---
Age | Age at scan | 0% | 0%
Height | Height at scan | 3.2% | 8.8%
Weight | Weight at scan | 0% | 1.8%
Ao root diam | Aortic root diameter | 35.5% | 22.8%
MV A max | Mitral valve A wave max (max atrial filling velocity) | 35.5% | 24.6%
MV E max | Mitral valve E wave max (max early filling velocity) | 38.7% | 22.8%
PA V2 max | Pulmonary artery V2 max | 12.9% | 3.5%
PA max PG | Pulmonary artery max pressure gradient | 12.9% | 3.5%
TR max PG | Tricuspid regurgitation max pressure gradient | 35.5% | 50.9%
TR max vel | Tricuspid regurgitation max velocity | 38.7% | 52.6%
TV A max | Tricuspid valve A wave max (max atrial filling velocity) | 25.8% | 14.0%
TV E max | Tricuspid valve E wave max (max early filling velocity) | 19.4% | 7.0%
TV E/A | Tricuspid valve E/A ratio | 64.5% | 64.9%

**Table 2.** Sensitivity, specificity, accuracy, and AUC of BRL rules learned on 14 variables using imputation-augmented data versus unaugmented data, evaluated on complete data only and averaged over five 10-fold cross-validations. Each performance metric is tested against that of the unaugmented model. After Bonferroni correction for multiple comparisons, α = 0.0125 is the significance threshold (significant values denoted by *).

Method | Sensitivity | Specificity | Accuracy | AUC
---|---|---|---|---
Unaugmented model | 44.7 ± 4.7 | 88.7 ± 5.1 | 73.5 ± 3.4 | 59.2 ± 6.5
Mean imputation | 38.9 ± 6.1 (p = 0.17) | 87.5 ± 4.0 (p = 0.71) | 70.0 ± 1.8 (p = 0.11) | 60.8 ± 2.9 (p = 0.66)
Decision tree imputation | 42.2 ± 7.5 (p = 0.61) | 81.9 ± 5.4 (p = 0.10) | 67.6 ± 4.5 (p = 0.07) | 65.8 ± 4.8 (p = 0.14)
k-NN imputation | 35.6 ± 5.7 (p = 0.04) | 86.9 ± 5.0 (p = 0.61) | 68.4 ± 3.25 (p = 0.07) | 57.6 ± 1.2 (p = 0.66)
SOM imputation | 38.9 ± 3.5 (p = 0.08) | 85.6 ± 4.7 (p = 0.39) | 68.8 ± 3.3 (p = 0.08) | 57.8 ± 1.9 (p = 0.70)

Method | Number of Rules Learned | Variables Used
---|---|---
Unaugmented model (14 variables) | 7 | EF, IVSd z-score (2)
Mean imputation (14 variables) | 15 | IVSd z-score, LVIDd z-score, LV mass z-score, BSA (4)
Decision tree imputation (14 variables) | 183 | IVSd z-score, LVIDd z-score, LVIDs z-score, LV mass z-score, EF, EDV index, SV index, MV E/A, Ao max PG (9)
k-NN imputation (14 variables) | 15 | IVSd z-score, LVIDd z-score, LV mass z-score, BSA (4)
SOM imputation (14 variables) | 15 | IVSd z-score, LVIDd z-score, LV mass z-score, BSA (4)
Mean imputation (27 variables) | 43 | IVSd z-score, LVPWd z-score, LVIDs z-score, MV A max, LV mass z-score, SV index, FS, TV A max, TV E max, height (10)
Decision tree imputation (27 variables) | 255 | Ao V2 max, EF, EDV index, FS, MV A max, PA V2 max, TR max vel, TV E/A, SV index, IVSd z-score, height, weight (12)
k-NN imputation (27 variables) | 35 | IVSd z-score, LV mass z-score, SV index, LVIDs z-score, MV A max, TV A max, TV E max, height (8)
SOM imputation (27 variables) | 27 | IVSd z-score, LV mass z-score, SV index, Ao root diam, LVIDs z-score, TV A max, TV E max, height (8)

© 2017 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Liu, Y.; Gopalakrishnan, V.
An Overview and Evaluation of Recent Machine Learning Imputation Methods Using Cardiac Imaging Data. *Data* **2017**, *2*, 8.
https://doi.org/10.3390/data2010008
