Entropy Guided Benchmarking of Classical and Generative Imputation Methods for High-Dimensional Healthcare Survey Data

Fernandes Prabhu, Deepa; Park, Jaeyoung; Gurupur, Varadraj P.

doi:10.3390/app16094262

by Deepa Fernandes Prabhu ¹,*, Jaeyoung Park ² and Varadraj P. Gurupur ²

¹ School of Modeling, Simulation, and Training, University of Central Florida, Orlando, FL 32826, USA

² School of Global Health Management and Informatics, College of Community Innovation and Education, University of Central Florida, Orlando, FL 32816, USA

* Author to whom correspondence should be addressed.

† This paper is an extended version of our conference paper “Fernandes Prabhu, D.; Gurupur, V.P.; Hadley, D.; Prabhu, V. Addressing data incompleteness in NHANES 2021–2023. In Proceedings of the 5th International Conference on Innovations in Computational Intelligence and Computer Vision (ICICV 2025), Italy, June 2025”.

Appl. Sci. 2026, 16(9), 4262; https://doi.org/10.3390/app16094262

Submission received: 15 March 2026 / Revised: 16 April 2026 / Accepted: 22 April 2026 / Published: 27 April 2026

(This article belongs to the Special Issue New Trends in Decision Support Systems and Their Applications)

Featured Application

The proposed framework can be used to prioritize variables for advanced imputation and to benchmark reconstruction quality in public health surveillance, clinical analytics, and other healthcare data systems with heterogeneous missingness.

Abstract

Missing data are a persistent challenge in large healthcare datasets, often undermining both statistical validity and machine learning performance when handled using simplistic assumptions. In this work, we examine how entropy-based diagnostics can guide the selection of imputation strategies for high-dimensional health survey data using the National Health and Nutrition Examination Survey (NHANES) 2021–2023. Shannon entropy is used to identify variables with structurally complex missingness, and a range of classical approaches (mean imputation, k-nearest neighbors, and multiple imputation by chained equations) are evaluated alongside deep generative methods, including variational autoencoders, generative adversarial networks (GANs), Wasserstein GANs (WGANs), and diffusion-based models. All methods are compared under a controlled masked-entry evaluation using root mean square error (RMSE) and Kolmogorov–Smirnov (KS) statistics to capture both reconstruction accuracy and distributional fidelity. Results show that diffusion-based models provide the most consistent balance between numerical accuracy and distributional preservation across high-entropy dietary variables, while WGAN demonstrates competitive performance for selected distributions. Structural equation modeling further indicates that entropy is a useful diagnostic signal for identifying variables that are difficult to reconstruct. Overall, this study provides a reproducible framework for aligning imputation strategy with missingness complexity in healthcare data, with implications for improving reliability in downstream analytics.

Keywords:

missing data; imputation; diffusion models; entropy; NHANES; population health informatics; healthcare analytics; generative models; KS statistic; RMSE; generative modeling; uncertainty quantification

1. Introduction

Data incompleteness remains a persistent challenge in population health research and machine learning, particularly in large-scale national surveys such as the National Health and Nutrition Examination Survey (NHANES). As a foundational resource for monitoring health trends, informing clinical guidelines, and guiding public health policy, NHANES depends on the quality and completeness of its data. When missing values are not handled appropriately, they can reduce statistical power, introduce bias, and ultimately compromise the validity and generalizability of both predictive models and inferential analyses [1,2].

A key difficulty is that missingness in NHANES is rarely uniform or purely random. Standard classifications distinguish between Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR) [2]. In practice, many variables exhibit missingness that is associated with observed demographic, socioeconomic, or health-related factors, making MAR more plausible than MCAR in most cases [3]. Under these conditions, simplistic imputation approaches can distort multivariate relationships and amplify bias, particularly when missingness varies across subpopulations. This has important implications for population-level analyses, where biased reconstruction can affect both equity and policy-relevant conclusions.

Simple percent-missing summaries are commonly used to describe data incompleteness, but they provide only a limited view of the underlying structure. They do not capture how missingness is distributed across individuals, variables, or demographic groups. Entropy-based analysis offers a complementary perspective by quantifying the uncertainty and structural complexity of missingness patterns. In NHANES, variables with high entropy often correspond to dietary intake and other self-reported measures influenced by recall bias, response burden, and survey design constraints [4]. These variables are both scientifically important and difficult to reconstruct reliably. Prior work suggests that entropy-based diagnostics can identify variables at risk of biased imputation more effectively than percent-missing summaries alone. Building on this perspective, we evaluate a range of imputation approaches under a unified masking and evaluation protocol, using MICE as a strong and interpretable baseline [5].

Despite their widespread use, classical imputation methods such as mean substitution, KNN, and chained regression models face limitations when applied to modern healthcare datasets. These datasets often exhibit nonlinear dependencies, heterogeneous variable types, and multimodal or heavy-tailed distributions. As a result, traditional approaches may struggle to preserve complex joint structure, particularly in high-entropy settings [6]. Deep generative models, including variational autoencoders (VAEs) and generative adversarial networks (GANs), provide a more flexible alternative by learning joint data distributions [7,8]. However, these models introduce their own challenges, including training instability, mode collapse, and difficulty balancing reconstruction accuracy with distributional fidelity in high-dimensional settings.

High-entropy variables—characterized by complex missingness and non-Gaussian distributions—represent a setting in which diffusion-based generative models may offer particular advantages, yet remain relatively underexplored in population health applications. Diffusion models approximate complex data distributions through iterative denoising processes, leading to improved training stability and more comprehensive distributional coverage compared to adversarial and variational approaches. Recent advances in diffusion-based modeling have demonstrated strong performance across image, time-series, and structured data domains, motivating their application to healthcare data imputation [9].

Motivated by these developments, we extend entropy-informed missingness analysis by systematically benchmarking diffusion-based imputation models—including conditional and fully conditioned variants—against classical and deep generative baselines using NHANES 2021–2023 data. Conditioning mechanisms incorporate observed covariates, variable identity, and missingness masks to support more accurate reconstruction of high-entropy variables. Our results reveal a trade-off between reconstruction accuracy and distributional fidelity: conditional diffusion achieves the lowest reconstruction error under masked-entry evaluation, while WGAN provides strong marginal distributional alignment as measured by the KS statistic.

This study makes the following primary contributions:

C1: Entropy-based diagnostic prioritization of high-uncertainty variables: We systematically quantify and visualize missingness structure and uncertainty to identify variables for which naive imputation is most likely to introduce bias, using Shannon entropy to complement percent-missing metrics.
C2: Controlled multi-model benchmarking under identical masking: We benchmark widely used methods—including mean imputation, KNN, and MICE—under a consistent simulated missingness protocol, assessing both reconstruction accuracy and distributional fidelity using RMSE and KS diagnostics.
C3: Conditional diffusion for entropy-sensitive reconstruction: We evaluate diffusion-based imputers conditioned on observed covariates and missingness masks, demonstrating improved reconstruction performance and better preservation of empirical distributions relative to classical and adversarial generative alternatives.
C4: Structural equation model: We examine the relationships among entropy, missingness complexity, and imputation performance metrics.

Our goal is not only to compare methods but also to better understand when and why specific imputation strategies succeed under realistic data complexity.

The remainder of this paper is organized as follows. Section 2 reviews related work on missing-data mechanisms, entropy-guided diagnostics, classical and machine learning imputation methods, and diffusion-based generative modeling. Section 3 describes the NHANES 2021–2023 dataset, variable selection, and missingness characterization. Section 4 presents the imputation methodologies and evaluation metrics. Section 5 reports comparative results. Section 6 discusses implications and limitations, and Section 7 concludes the paper.

2. Background and Related Work

This section reviews prior work on missing-data mechanisms, entropy-guided diagnostics, classical and machine learning-based imputation methods, and diffusion-based generative modeling. We emphasize methodological trade-offs relevant to high-dimensional population health data and position the present study as an extension of entropy-informed imputation frameworks toward diffusion-based generative modeling.

2.1. Missing-Data Mechanisms and Their Implications

Missing data pose a persistent challenge for statistical inference, predictive modeling, and equity-focused population health analysis. A common framework distinguishes between Missing Completely at Random (MCAR), Missing at Random (MAR), and Missing Not at Random (MNAR), each associated with different assumptions and implications for analysis [2,3,10,11]. While MCAR allows unbiased complete-case estimation, it is rarely observed in practice. In most health and survey datasets, missingness is related to factors such as respondent burden, access to care, health status, and operational constraints [12,13]. Under MAR, approaches such as multiple imputation and inverse probability weighting can reduce bias by leveraging observed covariates [14,15,16]. In contrast, MNAR settings remain more difficult to address because missingness depends on unobserved values, often requiring sensitivity analyses and specialized modeling strategies such as selection or pattern-mixture models [14,17].
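To make the three mechanisms concrete, the following is a minimal sketch, not part of the study's pipeline, that simulates each one on synthetic data. The function names, the logistic form, the slope, and the toy `age` and `intake` variables are all illustrative assumptions.

```python
import numpy as np

def _logistic_prob(driver, rate, slope=2.0):
    """Missingness probability that rises with the driver variable.

    At an average driver value the probability equals `rate`.
    """
    z = (driver - driver.mean()) / driver.std()
    logits = slope * z + np.log(rate / (1.0 - rate))
    return 1.0 / (1.0 + np.exp(-logits))

def simulate_missingness(x, covariate, mechanism, rate=0.3, seed=0):
    """Return a boolean mask (True = missing) for x under a toy mechanism."""
    rng = np.random.default_rng(seed)
    if mechanism == "MCAR":
        # Independent of both x and the covariate.
        prob = np.full(x.shape[0], rate)
    elif mechanism == "MAR":
        # Depends only on a fully observed covariate.
        prob = _logistic_prob(covariate, rate)
    elif mechanism == "MNAR":
        # Depends on the value of x itself, which is unobserved when missing.
        prob = _logistic_prob(x, rate)
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    return rng.random(x.shape[0]) < prob

# Synthetic example: intake is correlated with age (hypothetical variables).
rng = np.random.default_rng(42)
age = rng.normal(50, 15, size=5000)
intake = 2000 + 10 * (age - 50) + rng.normal(0, 300, size=5000)
for mech in ("MCAR", "MAR", "MNAR"):
    mask = simulate_missingness(intake, age, mech)
    print(mech, round(mask.mean(), 3), round(intake[~mask].mean(), 1))
```

Comparing the complete-case mean printed for each mechanism with the full-sample mean shows why MCAR is benign for complete-case estimation while MAR and MNAR are not.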

In practice, missingness in health data is often structured across multiple levels, including variables, individuals, and collection sites. This complexity motivates diagnostic approaches that go beyond simple percent-missing summaries. For example, in electronic health records (EHRs), missing data frequently arise from clinical workflows and documentation practices rather than deliberate sampling, making them closely tied to data quality and reuse considerations [12,18,19,20]. These challenges motivate the focus of this study on (i) characterizing the structural patterns of missingness and (ii) systematically evaluating imputation methods under a controlled, reproducible masking regime.

For benchmarking purposes, a controlled masking protocol was used to ensure consistent comparison across methods. This design ensures that all imputation methods are evaluated under identical and reproducible conditions, isolating model behavior from confounding missingness structures and enabling fair comparison across approaches. We emphasize that this protocol is not intended to replicate real-world missingness patterns and may therefore limit external validity. Accordingly, the results should be interpreted as benchmarking performance under standardized conditions rather than direct real-world deployment performance. Extending the framework to explicitly model MAR and MNAR mechanisms represents an important direction for future work.
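A masked-entry evaluation of this kind can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: the 20% masking rate, the function names, and the synthetic lognormal data are hypothetical, and the KS statistic is computed by hand to keep the example dependent on NumPy only.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

def mean_impute(x):
    """Column-mean baseline imputer."""
    return np.where(np.isnan(x), np.nanmean(x, axis=0), x)

def masked_entry_eval(x_true, impute_fn, mask_rate=0.2, seed=0):
    """Hide a random share of observed entries, impute, and score hidden entries only.

    Returns per-column RMSE and KS arrays. Assumes every column has
    at least one held-out entry.
    """
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(x_true)
    held_out = observed & (rng.random(x_true.shape) < mask_rate)
    x_masked = x_true.copy()
    x_masked[held_out] = np.nan
    x_imputed = impute_fn(x_masked)
    rmse, ks = [], []
    for j in range(x_true.shape[1]):
        rows = held_out[:, j]
        truth, guess = x_true[rows, j], x_imputed[rows, j]
        rmse.append(np.sqrt(np.mean((truth - guess) ** 2)))
        ks.append(ks_statistic(truth, guess))
    return np.array(rmse), np.array(ks)

# Skewed synthetic data standing in for dietary-style variables.
data = np.random.default_rng(0).lognormal(size=(2000, 3))
rmse, ks = masked_entry_eval(data, mean_impute)
print("RMSE per column:", rmse.round(3))
print("KS per column:  ", ks.round(3))
```

Because the same seed yields the same held-out entries for every imputer passed as `impute_fn`, differences in the scores reflect the method rather than the mask. A constant-valued imputer such as the mean necessarily scores a KS of at least 0.5, which illustrates why RMSE alone does not capture distributional fidelity.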

2.2. Entropy-Guided Diagnostics for Incomplete Health Data

Percent-missing metrics remain widely reported but fail to capture the structural complexity of missingness in high-dimensional datasets. Such summaries ignore how missingness is distributed across individuals, variables, and subgroups. Entropy-based analysis provides a principled quantitative framework for characterizing uncertainty and structural heterogeneity in missingness patterns [4]. Variables with identical missingness rates may nonetheless exhibit markedly different entropy values, reflecting fundamentally different data-generation and missingness mechanisms. Entropy-based approaches have also been applied in healthcare data completeness analysis, where information-theoretic measures are used to quantify data quality and structural uncertainty [21].
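The contrast between the two summaries can be illustrated with a short sketch. The entropy definition used here (Shannon entropy in bits of the observed values after equal-width binning, with 20 bins) is an assumption made for illustration and may differ from the definition used in the study.

```python
import numpy as np

def percent_missing(x):
    """Percentage of entries that are NaN."""
    return 100.0 * np.mean(np.isnan(x))

def shannon_entropy(x, bins=20):
    """Shannon entropy (bits) of the observed values after equal-width binning."""
    obs = x[~np.isnan(x)]
    counts, _ = np.histogram(obs, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

# Two synthetic variables that share exactly the same missing entries.
rng = np.random.default_rng(0)
spread = rng.uniform(0, 100, size=1000)   # values spread over a wide range
peaked = np.full(1000, 5.0)               # every observed value identical
hide = rng.random(1000) < 0.3
spread[hide] = np.nan
peaked[hide] = np.nan

for name, var in (("spread", spread), ("peaked", peaked)):
    print(name, round(percent_missing(var), 1), round(shannon_entropy(var), 2))
```

Both variables report the same percent missing, yet the constant variable has zero entropy and is trivial to reconstruct, whereas the widely spread variable is not. Percent-missing summaries cannot express this difference.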

In the context of NHANES, entropy-based diagnostics have revealed that high-entropy variables frequently correspond to dietary intake measures and other self-reported data subject to recall bias, response burden, and survey design constraints. These variables are often of high scientific relevance yet present substantial challenges for reconstruction. Prior work introduced reproducible entropy-guided frameworks combining heatmaps, intersection matrices, and completeness metrics to prioritize variables for advanced imputation and to guide modeling choices.

Our approach builds on prior work addressing data incompleteness in NHANES datasets, extending it through entropy-guided diagnostics and diffusion-based modeling [22]. By framing entropy as a diagnostic signal rather than a replacement for modeling, entropy-informed workflows provide a principled foundation for selecting and evaluating imputation strategies.

2.3. Imputation Approaches: From Classical Models to Diffusion

A broad spectrum of imputation methods has been proposed to address missingness, each with distinct assumptions and trade-offs. Classical approaches include complete-case analysis, mean substitution, KNN, and multiple imputation (MI). MI remains widely used in epidemiology because it can preserve uncertainty and reduce bias under MAR assumptions when models are appropriately specified [2,11,13,14]. Among MI methods, MICE is widely adopted due to its flexibility and modularity across variable types [6,23,24,25]. Practical guidance emphasizes that MI performance depends on including strong predictors of missingness and incomplete variables, and that naive complete-case analyses can be inefficient and biased even when missingness appears moderate [14,15,16]. Alternative software frameworks such as Amelia provide complementary assumptions and modeling choices for multivariate imputation [26].

Machine learning-based imputers extend modeling flexibility beyond parametric MI by leveraging nonlinear structure and interactions. Random forest imputation (MissForest) is a commonly used nonparametric baseline that performs well for mixed-type data but can be computationally intensive and may struggle under high-dimensional, highly structured missingness [27]. Matrix completion and spectral regularization methods offer another perspective by reconstructing incomplete matrices under low-rank assumptions, providing strong performance in some settings but potentially mismatching the distributional and heterogeneity characteristics of health survey variables [28]. Early work in high-dimensional biology demonstrated that imputation quality strongly depends on both distributional properties and correlation structure, motivating careful evaluation across domains [29].

Deep generative models aim to learn flexible joint distributions for imputation, with recent reviews highlighting their effectiveness in complex healthcare data settings [30,31]. Variational approaches provide probabilistic latent representations but can oversmooth complex distributions and require careful modeling of missingness mechanisms [32,33]. Methods such as MisGAN explicitly learn from incomplete data by jointly modeling data and missingness processes [34]. Although promising, these approaches often exhibit sensitivity to architecture choices and may underperform on heavy-tailed or multimodal tabular health variables.

Diffusion and score-based generative models provide a complementary route to stable training and strong distributional coverage. Diffusion-based models learn to reverse a progressive noising process and have demonstrated robust likelihood-free modeling and stable optimization relative to adversarial approaches [9,35,36,37]. Recent work extends diffusion models to tabular domains and imputation settings, demonstrating strong performance in structured and incomplete data scenarios [38,39]. These properties motivate our benchmarking of conditional and fully conditioned diffusion imputers for high-entropy NHANES variables, where non-Gaussianity, skewness, and structured missingness challenge classical and adversarial baselines.

2.4. NHANES and Clinical Informatics Case Studies

NHANES provides a canonical testbed for studying missingness due to its scale, heterogeneity, and policy relevance. Prior analyses have documented persistent incompleteness in dietary recall, laboratory biomarkers, and socioeconomic variables, with disproportionate missingness among minority and low-income populations [1,40]. Comparative studies have shown that improved imputation can enhance predictive performance for downstream tasks such as blood pressure modeling and nutritional epidemiology.

NHANES additionally requires analytic choices consistent with complex survey design, including stratification, clustering, and weighting for nationally representative inference [1,41,42]. Although this study focuses on reconstruction fidelity under controlled masking for method comparison, survey design considerations remain central for downstream estimation and should be addressed when translating imputed datasets into population-level inference. Moreover, dietary intake variables in NHANES are known to be affected by systematic measurement error and misreporting; biomarker-calibrated analyses and methodological studies highlight correlated biases in dietary assessment instruments [43,44,45,46]. These characteristics jointly motivate (i) entropy-informed diagnostics of missingness structure and (ii) robust generative imputers that preserve heavy-tailed marginals and multivariate dependencies. Prior comparative studies have emphasized the importance of benchmarking imputation methods across multiple datasets and evaluation criteria [47].

Beyond NHANES, missing data remain a pervasive challenge in EHRs, clinical trials, and health economics research. Reported missingness rates frequently exceed 20–30%, undermining statistical power and generalizability [12]. Importantly, many applied studies fail to justify their chosen imputation strategies or assess sensitivity to missingness assumptions, highlighting the need for standardized and transparent frameworks.

Building on prior entropy-based analyses and diffusion-based benchmarking studies, the present work integrates diagnostic rigor with generative modeling to advance imputation methodology for population health data.

2.5. Contributions of This Study

This study makes four primary contributions. First, it introduces an entropy-guided diagnostic strategy for identifying variables with structurally complex missingness in NHANES. Second, it establishes a controlled benchmarking design in which classical and deep generative imputers are compared under identical masking conditions. Third, it evaluates diffusion-based and fully conditional diffusion models as entropy-sensitive reconstructors for high-dimensional dietary variables. Fourth, it integrates structural equation modeling to examine the relationships among entropy, missingness complexity, and imputation performance metrics.

2.6. Conceptual Overview

The framework in Figure 1 is comparative rather than prescriptive: once high-entropy variables are prioritized, both classical and deep generative imputation families are applied to the same selected variable set under identical masking and evaluation conditions. Data collection issues—ranging from participant nonresponse and measurement error to survey design limitations—produce incomplete records that compromise analytic integrity and, if unaddressed, propagate bias into downstream analyses and policy decisions [12,48]. Entropy-informed diagnostics enable identification of structurally complex missingness patterns not captured by simple percent-missing summaries, while advanced statistical and machine learning-based imputation methods, particularly diffusion-based generative models, support realistic data reconstruction, improved analytic validity, and more equitable public health decision-making [49].

Although NHANES employs a complex sampling design with weights, strata, and clustering for population-level inference, these elements were not incorporated into model training or evaluation in this analysis. The present work focuses on reconstruction fidelity under controlled masking rather than population-weighted inference. Incorporating survey design into imputation modeling represents an important direction for future research.

2.7. Comparison of Imputation Paradigms

Table 1 summarizes major families of imputation methods from a conceptual and methodological perspective. Classical approaches operate directly on the observed data matrix and typically assume relatively simple data-generating mechanisms, whereas deep generative models aim to learn an explicit representation of the joint distribution from which missing values can be reconstructed. Foundational perspectives on missing-data mechanisms and multiple imputation are provided in [2,3,5,50].

Mean imputation provides a deterministic and easily interpretable baseline but ignores cross-variable dependencies and attenuates variance. KNN imputation conditions on local similarity in the observed feature space and can capture a limited nonlinear structure; however, it does not constitute an explicit probabilistic model. MICE extends classical methods by performing a sequence of conditional regressions, allowing partial preservation of inter-variable relationships under Missing at Random (MAR) assumptions, although performance depends on the specification of the conditional models [6,23,24].
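The difference between mean imputation and chained regression can be sketched in a few lines. The following is a deliberately simplified, deterministic version of the chained-equations idea: it uses ordinary least squares for every column and omits the posterior draws and multiple imputed datasets that full MICE relies on, so it should be read as an illustration rather than as the MICE configuration used in this study. All names are hypothetical.

```python
import numpy as np

def chained_regression_impute(x, n_iter=10):
    """Regress each incomplete column on all others, cycling n_iter times."""
    missing = np.isnan(x)
    # Initialize with column means, as a mean-imputation baseline would.
    filled = np.where(missing, np.nanmean(x, axis=0), x)
    for _ in range(n_iter):
        for j in range(x.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            # Predictors: intercept plus every other (currently filled) column.
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([np.ones(len(others)), others])
            # Fit on rows where column j was observed, predict where it was not.
            coef, *_ = np.linalg.lstsq(design[~rows], filled[~rows, j], rcond=None)
            filled[rows, j] = design[rows] @ coef
    return filled

# Two strongly related synthetic variables; 30% of the second is hidden.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=500)
full = np.column_stack([x1, x2])
holed = full.copy()
hide = rng.random(500) < 0.3
holed[hide, 1] = np.nan

mean_filled = np.where(np.isnan(holed), np.nanmean(holed, axis=0), holed)
chain_filled = chained_regression_impute(holed)
print("variance (true, mean-filled):", full[:, 1].var(), mean_filled[:, 1].var())
print("RMSE mean :", np.sqrt(np.mean((mean_filled[hide, 1] - full[hide, 1]) ** 2)))
print("RMSE chain:", np.sqrt(np.mean((chain_filled[hide, 1] - full[hide, 1]) ** 2)))
```

On this toy example the mean-filled column shows the variance attenuation described above, while the chained regression recovers the hidden values by exploiting the relationship with the observed column.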

Variational autoencoders (VAEs) learn a latent representation of the joint distribution and enable probabilistic reconstruction of incomplete records; however, reliance on element-wise reconstruction losses can oversmooth highly heterogeneous or multimodal health variables [7,51]. GAN and Wasserstein GAN (WGAN) approaches model complex non-Gaussian structure through adversarial learning and can generate realistic samples, but are known to exhibit training instability and sensitivity to mode collapse, which can affect imputation reliability [8,30,34].

Table 1. Conceptual comparison of imputation paradigms.

Method	Uses Missingness Mask	Learns Joint Distribution	Handles Non-Gaussian Data	Training Stability	Scalability	Interpretability
Mean [3]	No	No	No	High	High	High
KNN [29]	No explicit mask	No	Limited	High	Moderate	Moderate
MICE [24]	Conditional	Partial	Moderate	High	Moderate	High
VAE [7]	Implicit	Yes	Moderate	Moderate	High	Low
GAN [8]	No	Yes	High	Low	High	Low
WGAN [52]	No	Yes	High	Moderate	High	Low
Diffusion [9]	Yes	Yes	High	High	Moderate	Moderate
Fully Conditioned Diffusion [39]	Yes + var ID	Yes	High	High	Moderate	Moderate

Diffusion-based methods formulate imputation as a denoising process in which corrupted observations are gradually refined through a learned reverse diffusion model. Conditional variants incorporate observed covariates and missingness indicators to guide reconstruction, offering strong capacity for multimodal and high-entropy variables while maintaining improved training stability relative to adversarial models [9,38,53].
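The data-side mechanics of such a conditional scheme can be sketched without the neural network. The example below follows the standard denoising-diffusion forward process with a linear noise schedule and applies noise only to the entries to be imputed, leaving observed entries clean as conditioning. The schedule values, function names, and masking convention are assumptions for illustration; the denoising network and the reverse sampling loop are omitted.

```python
import numpy as np

def make_alpha_bar(n_steps=100, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal retention for a linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, n_steps)
    return np.cumprod(1.0 - betas)

def noise_targets(x0, target_mask, t, alpha_bar, rng):
    """Forward-noise only the target entries; observed entries stay clean.

    Returns the network input and the noise that the network should predict.
    """
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    x_in = np.where(target_mask, x_t, x0)
    return x_in, eps

def masked_denoising_loss(eps_true, eps_pred, target_mask):
    """Mean squared error on the noise, computed over target entries only."""
    diff = (eps_true - eps_pred)[target_mask]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.normal(size=(4, 3))             # standardized toy records
target = np.zeros((4, 3), dtype=bool)
target[:, 2] = True                      # pretend the last variable is to be imputed
x_in, eps = noise_targets(x0, target, t=50, alpha_bar=alpha_bar, rng=rng)

# A trained network would map (x_in, target, t) to a noise estimate;
# an all-zero prediction is used here only to exercise the loss.
print(masked_denoising_loss(eps, np.zeros_like(eps), target))
```

Restricting both the noising and the loss to target entries is what lets the model condition on observed covariates and the missingness mask rather than regenerate the whole record.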

Imputation methods evaluated here exhibit important trade-offs in computational complexity and interpretability. Classical approaches such as mean imputation, KNN, and MICE are computationally efficient and relatively interpretable, while deep generative models, including GANs and diffusion-based methods, require substantially higher training time and memory resources. Diffusion models provide improved stability and distributional fidelity but at increased computational cost, highlighting a trade-off between performance and scalability in practical healthcare applications.

This taxonomy highlights fundamental trade-offs among interpretability, distributional flexibility, conditioning capability, and training stability that motivate the comparative evaluation presented in subsequent sections.

3. Data and Variable Definitions

3.1. Dataset Overview

This study uses data from the National Health and Nutrition Examination Survey (NHANES) 2021–2023 cycle, a nationally representative, cross-sectional survey conducted by the National Center for Health Statistics (NCHS) at the Centers for Disease Control and Prevention (CDC) [54]. NHANES is widely used to monitor population health and combines household interviews, standardized clinical examinations, and laboratory testing performed in Mobile Examination Centers (MECs) [1].

The survey follows a complex, multistage probability sampling design, with targeted oversampling of selected demographic groups to improve the precision and equity of population-level estimates. To support nationally representative inference, NHANES provides survey weights, strata, and primary sampling units. While these design features enhance analytic validity, they also introduce additional complexity when handling missing data, particularly across variables collected through different modalities.

The 2021–2023 dataset used in this study includes 11,933 participants and 491 variables spanning demographic, dietary, laboratory, and examination domains. In line with prior work on entropy-based characterization of missingness in NHANES, we focus on variables that are both analytically important and exhibit structurally complex patterns of missingness.

3.2. Data Domains and Variable Taxonomy

Variables were organized into the following conceptual domains to support systematic missingness characterization and imputation strategy selection:

Demographic and Socioeconomic Variables: Age, sex, race/ethnicity, income-to-poverty ratio, educational attainment, and country of birth.
Dietary Intake Variables: Twenty-four-hour dietary recall measures collected across Day 1 (in-person MEC interview) and Day 2 (telephone follow-up), capturing macronutrients, micronutrients, and total energy intake. These variables are known to exhibit high entropy due to recall bias, response burden, and differential participation between recall days [44].
Laboratory Biomarkers: Objective clinical measures including lipid profiles, glycemic markers, inflammatory indicators, and select toxicological assays.
Examination Measures: Anthropometrics, blood pressure, cardiovascular assessments, pulmonary function, physical performance metrics, oral health indicators, and vision and hearing examinations collected under standardized MEC protocols.

Mortality follow-up data and self-reported comorbidity variables were intentionally excluded from the present analysis. Mortality data in NHANES are subject to delayed linkage through the National Death Index and exhibit distinct missingness and censoring mechanisms that differ fundamentally from cross-sectional examination and dietary variables. Similarly, comorbidity variables rely heavily on self-report and diagnostic access, introducing measurement error and missingness patterns that warrant separate methodological treatment. Excluding these variables allows the present study to focus on entropy-driven missingness and imputation performance in contemporaneously measured health indicators, while avoiding conflation with survival analysis or longitudinal censoring frameworks [55].

3.3. Preprocessing and Data Harmonization

Raw NHANES data files were harmonized using participant identifiers (SEQN) and merged across domains following CDC documentation guidelines. Variables were screened for coding inconsistencies, unit mismatches, and implausible values prior to analysis. Continuous variables were standardized where appropriate, and categorical variables were encoded to preserve interpretability in downstream modeling.
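As an illustration of this harmonization step, the sketch below merges two toy domain tables on SEQN and standardizes a continuous column with pandas. The values are invented, the two non-key column names are used only as examples, and the choice of a left join on the demographic table is an assumption rather than a description of the study's pipeline.

```python
import pandas as pd

# Toy stand-ins for a demographic file and a Day 1 dietary file.
demo = pd.DataFrame({"SEQN": [1, 2, 3], "RIDAGEYR": [34, 61, 19]})
diet = pd.DataFrame({"SEQN": [1, 3], "DR1TKCAL": [2100.0, 1850.0]})

def merge_domains(frames, key="SEQN"):
    """Left-join domain tables onto the first one, one row per participant."""
    merged = frames[0]
    for frame in frames[1:]:
        # validate raises if a participant identifier is duplicated in either table.
        merged = merged.merge(frame, on=key, how="left", validate="one_to_one")
    return merged

def standardize(df, columns):
    """Z-score the given columns; missing values are skipped and stay missing."""
    out = df.copy()
    for col in columns:
        out[col] = (out[col] - out[col].mean()) / out[col].std()
    return out

merged = merge_domains([demo, diet])
print(merged)
print(standardize(merged, ["RIDAGEYR", "DR1TKCAL"]))
```

Participants absent from a domain file appear with missing values after the join, so the merged table makes cross-domain missingness explicit before any imputation is attempted.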

All preprocessing and analyses were conducted using reproducible Python-based pipelines, with variable definitions, inclusion criteria, and transformations aligned with official CDC NHANES documentation. This design ensures transparency, reproducibility, and consistency with established NHANES analytic practices.

3.4. Extent and Structure of Missingness

Missingness in NHANES 2021–2023 is substantial and heterogeneous across domains. Across the retained variables, the overall proportion of missing values exceeds 45%, with pronounced variability at both the variable and participant levels. Dietary intake variables and select laboratory biomarkers exhibit particularly high levels of incompleteness, consistent with prior analyses of NHANES data quality challenges [4].

Variable-level incompleteness was first examined using percent-missing metrics to provide an intuitive view of where information is most scarce. Table 2 displays the ten variables with the highest proportion of missing data in NHANES 2021–2023. Most are Day 2 dietary recall variables, which are standard NHANES dietary intake measures; their near-complete missingness reflects planned subsampling in the survey design rather than measurement failure [54].

Variables with near-complete missingness (∼100%) were excluded from entropy-based benchmarking to avoid instability in model training and evaluation. These variables were not used for imputation model training, preventing potential data leakage.

While percent-missing summaries capture the prevalence of absence, they do not describe the informational structure of the observed portion. Subsequent entropy analysis showed that variables with similar missing rates can differ markedly in heterogeneity and uncertainty, indicating that prevalence alone is insufficient for prioritizing imputation targets. The percent-missing perspective therefore identifies where data are scarce, whereas the entropy perspective characterizes the structural complexity of the observed data and highlights variables that may be more challenging to reconstruct.

To maintain methodological consistency, model training and evaluation were performed on the entropy-ranked subset, while Table 2 is provided to contextualize the broader missingness landscape and to demonstrate that the entropy-selected dietary variables also reside within the highest-absence regime, reinforcing the practical relevance of the evaluation set.

Figure 2 presents the age distribution of the NHANES 2021–2023 study population, demonstrating broad representation across the lifespan and supporting subsequent stratified analyses of missingness and imputation performance.

To characterize structural complexity beyond aggregate rates, Figure 3 highlights the top-ranked high-entropy variables, which are dominated by dietary intake measures. These entropy rankings reveal that variables with similar percent-missing values may exhibit markedly different degrees of uncertainty and structural unpredictability, underscoring the limitations of percent-missing summaries alone. Entropy-informed prioritization therefore plays a critical role in identifying variables most susceptible to bias under naive imputation strategies.

Record-level completeness further reveals multimodal patterns, with a substantial subset of participants exhibiting near-total incompleteness for specific domains. These participant-level distributions are summarized using a completeness histogram, illustrating heterogeneity in data availability that cannot be addressed through uniform imputation approaches. Missingness heatmaps and intersection matrices further visualize co-occurring patterns of incompleteness across variables, revealing structured dependencies in missingness that motivate multivariate and generative reconstruction methods.

Collectively, these figures characterize the scope, structure, and heterogeneity of missing data in NHANES 2021–2023 prior to imputation and establish the empirical motivation for adopting entropy-guided, variable-specific, and diffusion-based imputation strategies. Quantitative performance comparisons and imputation outcomes are intentionally deferred to Section 5 to preserve a clear separation between data characterization and methodological evaluation.

3.5. Rationale for Variable Selection

Although the NHANES 2021–2023 release contains 491 variables, imputation benchmarking was intentionally restricted to the ten highest-entropy continuous variables. This decision was guided by three methodological considerations. First, high-entropy variables represent the most informationally uncertain portion of the dataset. Variables with moderate or low entropy can often be reconstructed adequately using classical approaches such as MICE or KNN, whereas high-entropy variables are precisely those for which naive imputation is most likely to distort joint distributions and downstream inference. Focusing on this subset therefore provides a stringent stress test for generative imputers.

Second, limiting the analysis to ten variables enables rigorous method comparison under identical masking and evaluation protocols. Diffusion-based models require repeated training under multiple seeds and masking regimes; restricting the dimensionality prevents conflation of model performance with sample-size artifacts and computational instability.

Third, the top ten entropy-ranked variables form a coherent epidemiologic domain—dietary intake—allowing evaluation within a consistent measurement framework rather than across heterogeneous constructs with incomparable scales and error structures. This approach follows prior information-theoretic studies that prioritize variables with the highest uncertainty for imputation research.

While percent-missing metrics identify where data are scarce, they do not reflect the structural complexity of the observed distribution. Entropy was therefore used as the primary ranking criterion to identify variables most susceptible to biased reconstruction, with missingness serving as a secondary filter to ensure adequate sample support.

The figures report results for the ten highest-entropy dietary intake measures present across NHANES 2021–2023: DR1TMOIS (total moisture, day 1), DR1TSFAT (total saturated fat), DR1TMFAT (total monounsaturated fat), DR1TM181 (oleic acid), DR1TPFAT (total polyunsaturated fat), DR1TNIAC (niacin), DR1TCARB (total carbohydrate), DR1TP182 (linoleic acid), DR1TS160 (palmitic acid), and DR2TMOIS (total moisture, day 2). These variables exhibited entropy between 12.5 and 12.7 bits and missingness between 43.9 and 51.1%, representing the most uncertain yet epidemiologically relevant portion of the dietary domain. All heatmaps and comparison plots correspond exclusively to this set to maintain consistency of evaluation.

Here, we focus on high-entropy dietary variables to ensure methodological consistency and to provide a controlled stress-test setting for imputation methods. However, missingness mechanisms and data characteristics may differ across other NHANES domains, including laboratory, examination, and socioeconomic variables, as well as across demographic subgroups (e.g., age and ethnicity). Consequently, the findings should be interpreted as a methodological benchmark within a specific variable domain rather than a comprehensive evaluation across all NHANES data types or population subgroups. Extending the analysis to additional variable domains and incorporating subgroup-specific evaluations represents an important direction for future work.

3.6. Variable Definitions and Domains

Table 3 summarizes the primary high-entropy variables emphasized throughout this study. These variables were selected because they exhibit structurally complex missingness and non-Gaussian distributions, motivating generative imputers.

3.7. High-Entropy Variable Selection

Entropy scores were computed to quantify uncertainty in the presence or absence of data for each variable. Entropy-based measures have been shown to quantify structural uncertainty in missingness patterns beyond simple percent-missing summaries. Variables with high entropy were prioritized for advanced imputation due to their increased susceptibility to bias under naive reconstruction strategies.

Consistent with prior entropy-based analyses of NHANES, dietary intake variables such as total carbohydrate, saturated fat, and micronutrient measures ranked among the highest-entropy features. These variables served as focal points for benchmarking classical, deep generative, and diffusion-based imputers in subsequent sections.

To prevent leakage of survey design variables into the generative space, identifier and weighting fields (SEQN, WTINT2YR, SDMVPSU, SDMVSTRA) were excluded prior to entropy estimation. Shannon entropy was computed for all remaining numeric variables, and the ten highest-entropy features were selected for benchmarking. The resulting set consisted exclusively of continuous dietary-intake measures from the NHANES Day-1 and Day-2 24 h recall instruments (e.g., DR1TMOIS, DR1TSFAT, DR1TMFAT). Missingness across these variables ranged from 43.9% to 51.1%, representing a challenging mixed missingness scenario consistent with previous reports on nutritional data incompleteness [50,56].

4. Methods

This section describes the entropy-guided imputation framework used in this analysis and outlines the key modeling and evaluation components. Emphasis is placed on variable-specific modeling choices, conditioning mechanisms, and rigorous evaluation under controlled missingness scenarios.

Classical missing-data theory and imputation practice build on foundational work in multiple imputation, likelihood-based inference, and applied missing-data analysis [10,29,36,37,39,40,42,43,44].

4.1. Overview of the Entropy-Guided Imputation Pipeline

The proposed pipeline consists of four sequential stages. These include missingness characterization, baseline imputation, generative modeling, and evaluation. This design ensures that imputation strategies are aligned with the empirical properties of each variable rather than applied uniformly across the dataset.

Algorithm 1 summarizes the end-to-end workflow used in this study. Briefly, we quantify variable-level missingness and empirical Shannon entropy, rank variables by entropy, and construct joint entropy–missingness diagnostics to identify high-uncertainty variables and their missing-data burden. We then standardize the selected feature set using complete-case scaling and evaluate multiple imputation families under a controlled masking protocol, including traditional statistical imputers (Mean, KNN, MICE) and deep generative baselines (VAE, GAN, WGAN), alongside a conditional diffusion-based imputer. Performance is reported on held-out masked entries using error-based metrics (RMSE) and distributional fidelity (e.g., KS statistics), enabling consistent comparison across methods. The workflow illustrated in Figure 1 can also be interpreted as an activity diagram representing the sequence of preprocessing, entropy-based prioritization, masking, imputation, and evaluation stages within the framework.

Algorithm 1 Entropy-Informed Missingness Diagnostics and Imputation Benchmarking Pipeline

Require: NHANES merged dataset $X \in \mathbb{R}^{n \times p}$; excluded IDs/weights set $E$; top-K for analysis; masking rate $ρ$

Ensure: Ranked entropy–missingness table; entropy–missingness figure; imputed outputs and benchmark metrics

1: Data ingestion and preprocessing: read $X$; convert blank entries to NaN; remove columns in $E$
2: Compute missingness: for each variable j, compute $m_{j} = 100 \cdot (1 - \frac{\#\text{ observed in } j}{n})$
3: Compute entropy (numeric vars): for each numeric variable j, estimate empirical distribution over observed values and compute Shannon entropy $H_{j}$ (bits)
4: Rank variables: sort variables by $H_{j}$ descending; select top-K variables $V_{K}$
5: Scale (complete-case for $V_{K}$): form $X_{c c} = X [V_{K}]$ restricted to rows complete in $V_{K}$; fit scaler on $X_{c c}$ and transform to $\tilde{X}$
6: Diagnostic visualization: build a ranked table of $(j, H_{j}, m_{j})$; plot top-40 entropy bars with a mid-bar entropy marker and a separate missingness percent track
7: Benchmark masking (evaluation protocol): create a simulated mask $M \in \{0, 1\}$ by hiding a fraction $ρ$ of entries in $\tilde{X}$; construct masked input ${\tilde{X}}_{mask}$
8: Traditional imputers: fit and impute ${\tilde{X}}_{mask}$ using Mean, KNN, and MICE to obtain ${\hat{X}}^{(mean)}, {\hat{X}}^{(knn)}, {\hat{X}}^{(mice)}$
9: VAE baseline: train a variational autoencoder on $\tilde{X}$; reconstruct ${\hat{X}}^{(vae)}$ and use reconstructed values at masked locations
10: GAN baseline: train GAN generator/discriminator on $\tilde{X}$; generate samples and align to real rows (e.g., nearest-neighbor pairing) to form ${\hat{X}}^{(gan)}$
11: WGAN baseline: train WGAN generator/critic on $\tilde{X}$; generate samples and form ${\hat{X}}^{(wgan)}$
12: Diffusion imputer: train a (conditional) diffusion model using inputs $({\tilde{X}}_{mask}, M, t)$; sample iteratively and reinsert observed entries to produce ${\hat{X}}^{(diff)}$
13: Evaluation (masked entries only): for each method r, compute RMSE on masked positions and distributional fidelity (e.g., KS statistic) per variable and averaged over $V_{K}$. Compute RMSE on masked entries (paired for GAN via nearest-neighbor alignment). Compute KS on full marginal distributions
14: Report: export ranked entropy–missingness outputs, figure(s), and benchmarking tables across imputers
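Steps 2 to 5 of Algorithm 1 can be sketched in a few lines of pandas. This is a minimal illustration and not the authors' released code: the function names (`missing_percent`, `value_entropy_bits`, `rank_top_k`, `complete_case_standardize`) are hypothetical, and the entropy estimator is assumed to be the plug-in Shannon entropy over the distinct observed values of each column, which is one plausible reading of step 3.

```python
import numpy as np
import pandas as pd


def missing_percent(df: pd.DataFrame) -> pd.Series:
    """Step 2: m_j = 100 * (1 - #observed_j / n) for every column."""
    return 100.0 * (1.0 - df.notna().sum() / len(df))


def value_entropy_bits(col: pd.Series) -> float:
    """Step 3 (assumed estimator): plug-in Shannon entropy, in bits,
    of the empirical distribution over distinct observed values."""
    p = col.dropna().value_counts(normalize=True).to_numpy()
    if p.size == 0:
        return 0.0
    return float(-(p * np.log2(p)).sum())


def rank_top_k(df: pd.DataFrame, excluded, k: int) -> pd.DataFrame:
    """Step 4: drop ID/weight columns, score numeric columns, keep the top k."""
    numeric = df.drop(columns=list(excluded), errors="ignore")
    numeric = numeric.select_dtypes(include="number")
    table = pd.DataFrame({
        "entropy_bits": numeric.apply(value_entropy_bits),
        "missing_pct": missing_percent(numeric),
    })
    return table.sort_values("entropy_bits", ascending=False).head(k)


def complete_case_standardize(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Step 5: z-score the selected columns using complete-case rows only."""
    cc = df[list(cols)].dropna()
    return (cc - cc.mean()) / cc.std(ddof=0)
```

With the excluded set given as `{"SEQN", "WTINT2YR", "SDMVPSU", "SDMVSTRA"}` and `k=10`, `rank_top_k` would return a ranked entropy and missingness table of the kind described in step 6.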

4.2. Notation and Problem Formulation

To ensure clarity, we explicitly define the objects used throughout the study. Let X denote the data matrix containing n participants and p variables. Let M be the binary missingness matrix where value 1 indicates that an entry in X is missing and value 0 indicates that it is observed.

For each variable j, two index sets are defined:

O(j): the set of participant indices where variable j is observed.
M(j): the set of participant indices where variable j is missing.

The goal of imputation is to construct an estimator that produces completed values, denoted as $\hat{X}$, using the observed data and the missingness pattern. The estimator can be written conceptually as

\hat{X} = f (X, M)

where f represents either a traditional statistical method, a deep generative model, or a diffusion-based model.

We note that conditioning on the missingness mask assumes knowledge of missingness patterns during reconstruction. While this is appropriate for controlled benchmarking, real-world deployment may require modeling or inferring missingness mechanisms. This distinction is discussed as a limitation.

4.3. Evaluation Metrics

Two complementary criteria were used.

Reconstruction accuracy:

R M S E (j) = \sqrt{\frac{1}{| M (j) |} \sum_{i \in M (j)} {(X_{i j} - {\hat{X}}_{i j})}^{2}}

Distributional fidelity:

K S (j) = \max_{x} D (j, x)

(1)

where D(j,x) is the absolute difference between the empirical cumulative distribution of the observed values and that of the imputed values. The Kolmogorov–Smirnov statistic is a nonparametric measure of distributional similarity widely used for goodness-of-fit testing [57,58,59].

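The first measures pointwise recovery on the masked (held-out) entries, for which ground-truth values are available, while the second measures how well the imputed distribution matches the empirical distribution. In the benchmarking protocol, M(j) in the RMSE therefore indexes the entries of variable j hidden by simulated masking.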

These metrics capture complementary aspects of imputation performance, with RMSE emphasizing pointwise reconstruction accuracy and the KS statistic evaluating distributional fidelity. Their joint use enables analysis of trade-offs between numerical accuracy and preservation of empirical data distributions.
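Both metrics are straightforward to compute per variable. The sketch below is illustrative only: `masked_rmse` and `ks_statistic` are hypothetical helper names, the mask is assumed to be 1 (or True) at held-out positions, and the KS statistic is obtained from SciPy's two-sample test.

```python
import numpy as np
from scipy.stats import ks_2samp


def masked_rmse(x_true, x_hat, mask) -> float:
    """RMSE restricted to positions where mask is True (held-out entries)."""
    mask = np.asarray(mask, dtype=bool)
    diff = np.asarray(x_true, dtype=float)[mask] - np.asarray(x_hat, dtype=float)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def ks_statistic(reference, imputed) -> float:
    """Two-sample KS statistic: the largest absolute gap between the
    empirical CDFs of the reference and imputed values."""
    return float(ks_2samp(reference, imputed).statistic)
```

Applying the two functions column by column and averaging over the selected variables gives summary figures of the kind reported in the benchmarking tables.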

4.4. Missingness Characterization and Entropy Scoring

Let $X \in \mathbb{R}^{n \times p}$ denote the data matrix with n participants and p variables, and let $M \in \{0, 1\}^{n \times p}$ represent the missingness indicator matrix, where $M_{i j} = 1$ if $X_{i j}$ is missing and 0 otherwise. For each variable $X_{j}$, Shannon entropy was computed to quantify uncertainty in its missingness pattern:

H (X_{j}) = - \sum_{k \in \{obs, miss\}} P_{k} \log_{2} P_{k},

(2)

where $P_{miss}$ and $P_{obs}$ denote the empirical probabilities of missing and observed values, respectively. Entropy scores capture not only the proportion of missing data but also the structural unpredictability of missingness across participants.

Variables were ranked by entropy, and high-entropy variables were prioritized for advanced imputation methods. This prioritization reflects the increased risk of bias and reconstruction error when structurally complex missingness is addressed using naive or uniform imputation strategies. Entropy was used as an operational ranking criterion to prioritize variables for imputation benchmarking.
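Equation (2) can be evaluated directly from a column's missingness indicator. The following is a minimal sketch with a hypothetical function name (`missingness_entropy_bits`), assuming missing entries are encoded as NaN. As written, Equation (2) takes values between 0 and 1 bit and reaches 1 bit at 50% missingness; the entropy over observed values used in step 3 of Algorithm 1 is a separate quantity.

```python
import numpy as np


def missingness_entropy_bits(column) -> float:
    """Shannon entropy (bits) of the observed/missing indicator of one column."""
    col = np.asarray(column, dtype=float)
    p_miss = float(np.mean(np.isnan(col)))
    probs = np.array([p_miss, 1.0 - p_miss])
    probs = probs[probs > 0]          # convention: 0 * log2(0) = 0
    return float(-(probs * np.log2(probs)).sum())
```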

4.5. Baseline Traditional Imputation Methods

To establish interpretable baselines, three classical imputation methods were evaluated:

4.5.1. Mean Imputation

Missing values were replaced with the empirical mean of each variable. While computationally efficient and stable, mean imputation ignores inter-variable dependencies and distributional structure, serving primarily as a lower-bound reference.

4.5.2. K-Nearest Neighbors (KNN)

KNN imputation replaces missing values using a weighted average of the k nearest complete observations based on Euclidean distance in feature space. This method introduces conditional dependence but remains limited in capturing nonlinear or high-dimensional interactions.

4.5.3. Multiple Imputation by Chained Equations (MICE)

MICE iteratively imputes missing values by fitting conditional models for each variable given all others under the Missing at Random (MAR) assumption. For variable $X_{j}$ with missing values, MICE estimates

P (X_{j} ∣ X_{- j}, θ_{j}),

(3)

where $X_{- j}$ denotes all remaining variables and $θ_{j}$ represents model parameters. Variable-specific regression families were selected to better accommodate skewed and bounded distributions. MICE serves as a strong and widely adopted baseline in nutritional epidemiology and health services research.

To illustrate the transformations performed across the proposed framework, Table 4 provides a conceptual example showing how observed values are masked, prioritized, and reconstructed across the stages described in Section 4.5.1, Section 4.5.2 and Section 4.5.3.

4.6. Variational Autoencoder Imputation

The VAE models the joint distribution through a latent variable z with prior $p (z) = N (0, I)$. The encoder $q_{ϕ} (z | x)$ maps incomplete records to a latent representation, and the decoder $p_{θ} (x | z)$ reconstructs missing entries. Training optimizes a β-weighted evidence lower bound, written here as a loss to be minimized:

L_{VAE} = E_{q_{ϕ} (z | x)} [\| x - \hat{x} \|^{2}] + β \, KL (q_{ϕ} (z | x) \,\|\, p (z)) .

(4)

During imputation, observed dimensions are clamped and missing values are sampled from the decoder. This probabilistic formulation allows uncertainty propagation but may oversmooth heavy-tailed dietary variables.
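A loss that combines squared reconstruction error with a β-weighted KL term, together with the clamping of observed dimensions, can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the study's PyTorch implementation: the encoder is assumed to output a diagonal Gaussian parameterized by `mu` and `log_var`, for which the KL divergence to N(0, I) has a closed form, and the names `vae_loss` and `clamp_observed` are hypothetical.

```python
import numpy as np


def vae_loss(x, x_hat, mu, log_var, beta=1.0) -> float:
    """Batch mean of squared reconstruction error plus beta * KL(q(z|x) || N(0, I))."""
    recon = np.sum((x - x_hat) ** 2, axis=1)
    # Closed-form KL for a diagonal Gaussian posterior against a standard normal prior.
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    return float(np.mean(recon + beta * kl))


def clamp_observed(x_obs, x_decoded, missing_mask):
    """Keep observed entries; take decoder output only where missing_mask is 1."""
    return np.where(np.asarray(missing_mask, dtype=bool), x_decoded, x_obs)
```

Using `np.where` rather than arithmetic blending keeps NaN placeholders at missing positions from propagating into the completed matrix.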

4.7. GAN and WGAN Imputation

Adversarial models were implemented as additional baselines. A generator produced candidate imputations conditioned on observed features, while a discriminator encouraged realism. A Wasserstein variant with gradient penalty was also evaluated to improve stability. GAN outputs were paired with real records using nearest-neighbor matching prior to RMSE computation to avoid permutation bias. These models showed reasonable reconstruction but weaker KS performance, motivating diffusion methods.

4.7.1. GAN Objectives

\min_{G} \max_{D} E_{x \sim P_{data}} [\log D (x)] + E_{z \sim P_{z}} [\log (1 - D (G (z)))] .

(5)

For GAN-based imputation, nearest-neighbor pairing was used to align generated samples with observed data prior to RMSE computation. This approach may underestimate reconstruction error and is used solely for comparative benchmarking. This potential bias is acknowledged and will be addressed in future work through alternative evaluation strategies.
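The pairing step can be illustrated with a nearest-neighbor lookup. In this sketch the helper names (`pair_generated_to_real`, `paired_rmse`) are hypothetical, matching is assumed to use Euclidean distance over all standardized features of complete real rows, and SciPy's `cKDTree` stands in for whatever search routine was actually used.

```python
import numpy as np
from scipy.spatial import cKDTree


def pair_generated_to_real(real, generated):
    """For each real row, return the closest generated row (Euclidean distance)."""
    tree = cKDTree(generated)
    _, idx = tree.query(real, k=1)
    return generated[idx]


def paired_rmse(real, generated, mask) -> float:
    """RMSE on masked positions after aligning generated samples to real rows."""
    paired = pair_generated_to_real(real, generated)
    mask = np.asarray(mask, dtype=bool)
    diff = real[mask] - paired[mask]
    return float(np.sqrt(np.mean(diff ** 2)))
```

Because each real row is matched to its most similar generated sample before scoring, the resulting RMSE tends to be optimistic, which is the source of the potential underestimation noted above.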

4.7.2. WGAN Objectives

In these objectives, the following elements are used. The function G represents the generator network, which receives a random vector z and produces an imputed sample G(z). The function D denotes the discriminator in the standard GAN, while C denotes the critic in the Wasserstein formulation.

\max_{C} E_{x \sim P_{data}} [C (x)] - E_{z \sim P_{z}} [C (G (z))],

(6)

\min_{G} - E_{z \sim P_{z}} [C (G (z))] .

(7)

The symbol x represents real observations drawn from the empirical data distribution, denoted here as $P_{data}$. The vector z is drawn from a simple reference distribution $P_{z}$, typically a standard Gaussian, and serves as the source of randomness for generating synthetic values.

The expectation operator $E$ denotes expectation with respect to the specified data distribution. The term $\log D (x)$ quantifies the discriminator’s confidence in classifying real samples as real, whereas $\log (1 - D (G (z)))$ quantifies its confidence in classifying generated samples as fake.

4.8. Diffusion-Based Generative Imputation

4.8.1. Denoising Diffusion Framework

Diffusion-based models learn to approximate complex data distributions by reversing a gradual noising process. Given an observed variable $x_{0}$, forward diffusion constructs a sequence $\{x_{t}\}_{t = 1}^{T}$:

q (x_{t} ∣ x_{0}) = N (\sqrt{{\bar{α}}_{t}} x_{0}, (1 - {\bar{α}}_{t}) I),

(8)

where ${\bar{α}}_{t} = \prod_{s = 1}^{t} α_{s}$ defines the noise schedule. A neural network $ϵ_{θ}$ is trained to reverse this process by predicting noise at each time-step.
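The closed-form forward process in Equation (8) allows a noisy sample at any time-step to be drawn in one shot. The sketch below assumes a linear variance schedule with α_s = 1 − β_s, a common default that the text does not specify; `alpha_bar_schedule` and `q_sample` are hypothetical names.

```python
import numpy as np


def alpha_bar_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of alpha_s = 1 - beta_s for s = 1..T (assumed linear betas)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)


def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I) for t in 1..T.

    Returns the noisy sample and the noise eps, which is the regression
    target for a noise-prediction network.
    """
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps
```

Training a noise-prediction network then amounts to regressing its output on `eps` given `x_t` and `t`.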

4.8.2. Conditional Diffusion Imputation

In the conditional diffusion setting, the reverse process is conditioned on observed covariates $X_{obs}$:

p_{θ} (x_{t - 1} ∣ x_{t}, X_{obs}) .

(9)

This conditioning enables the model to leverage observed information while imputing missing values, improving reconstruction under MAR-like assumptions.

4.8.3. Fully Conditioned Diffusion with Missingness Masks

Fully conditioned diffusion extends this framework by explicitly incorporating variable identity embeddings and missingness masks M into the denoising network:

p_{θ} (x_{t - 1} ∣ x_{t}, X_{obs}, M) .

(10)

This formulation allows the model to distinguish observed from missing entries and to adapt imputation behavior accordingly. By conditioning on both covariates and missingness structure, fully conditioned diffusion-based models are better suited for high-entropy variables exhibiting heterogeneous and non-Gaussian distributions.
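One way to realize this conditioning is to feed the denoiser a concatenation of the partially noised row, the mask, and a time embedding, and to reinsert observed entries after each reverse step. The sketch below is a hedged illustration: the sinusoidal embedding, its dimension, and all function names are assumptions, and variable identity embeddings are omitted for brevity.

```python
import numpy as np


def time_embedding(t, T, dim=8):
    """Sinusoidal embedding of the normalized time-step (an assumed choice)."""
    freqs = 2.0 ** np.arange(dim // 2)
    angle = (t / T) * np.pi * freqs
    return np.concatenate([np.sin(angle), np.cos(angle)])


def build_denoiser_input(x_t, x_obs, missing_mask, t, T):
    """Concatenate [noisy-where-missing values, mask, time embedding] per row."""
    m = np.asarray(missing_mask, dtype=bool)
    x_in = np.where(m, x_t, x_obs)                 # observed entries stay clean
    emb = np.tile(time_embedding(t, T), (x_t.shape[0], 1))
    return np.concatenate([x_in, m.astype(float), emb], axis=1)


def reinsert_observed(x_pred, x_obs, missing_mask):
    """After a reverse step, overwrite observed positions with their true values."""
    return np.where(np.asarray(missing_mask, dtype=bool), x_pred, x_obs)
```

Passing the mask as an explicit input lets a single network treat observed and missing coordinates differently, which is the behavior the fully conditioned formulation relies on.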

4.9. Training Protocol and Hyperparameters

All models were implemented in Python (version 3.11) using PyTorch (version 2.2) and evaluated under identical masking conditions to ensure a fair comparison.

Data preprocessing. Continuous variables were standardized using z-score normalization based on complete-case data. A controlled masking protocol was applied, in which 20% of observed entries were randomly removed using a fixed random seed.

Traditional imputation methods. Mean imputation was implemented using column-wise averages. KNN imputation was performed with $k = 5$ using distance-weighted averaging. MICE was implemented using an iterative imputer with a maximum of 10 iterations and a fixed random seed.
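The three classical baselines map naturally onto scikit-learn estimators. The sketch below assumes that mapping (the text does not name the library calls), uses a hypothetical wrapper `classical_imputations`, and picks seed 42 for illustration. Note that `IterativeImputer` with default settings returns a single chained-equation imputation rather than multiple imputed datasets.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer


def classical_imputations(x_masked, seed=42):
    """Fit and apply mean, KNN (k = 5, distance-weighted), and MICE-style imputers.

    x_masked is a 2-D float array with NaN at missing positions.
    Returns a dict mapping method name to the completed array.
    """
    imputers = {
        "mean": SimpleImputer(strategy="mean"),
        "knn": KNNImputer(n_neighbors=5, weights="distance"),
        "mice": IterativeImputer(max_iter=10, random_state=seed),
    }
    return {name: imp.fit_transform(x_masked) for name, imp in imputers.items()}
```

All three estimators leave observed entries unchanged, so evaluation can be restricted to the masked positions.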

Variational autoencoder (VAE). The VAE consisted of a fully connected encoder–decoder architecture with one hidden layer of 141 units and a latent dimension of 17. The model was trained for 50 epochs using the Adam optimizer with a learning rate of $9.0 \times 10^{- 3}$ and batch size of 128. The loss function combined mean squared reconstruction error with a Kullback–Leibler divergence regularization term.

GAN and WGAN models. The generator and discriminator (or critic) were implemented as fully connected neural networks with two hidden layers of 169 units each and LeakyReLU activations. The latent noise vector had dimension 25. Models were trained for 50 epochs using the Adam optimizer with a learning rate of $1.18 \times 10^{- 4}$ and batch size of 64. The WGAN variant used weight clipping with a threshold of 0.01 and five critic updates per generator step.

Diffusion-based models. The conditional diffusion model was implemented as a multilayer perceptron with input concatenation of observed values, missingness masks, and time embeddings. The network consisted of 3 fully connected layers with 256 units and dropout regularization (rate = 0.1). Models were trained for 50 epochs using the Adam optimizer with a learning rate of $1.0 \times 10^{- 3}$ and batch size of 128. Noise was injected selectively into masked entries to simulate the diffusion process.

Computational environment. All experiments were conducted using Python (version 3.11) with standard scientific computing libraries, including NumPy, Pandas, and SciPy, along with the machine learning frameworks scikit-learn and PyTorch (version 2.2), the latter used for deep generative models.

The experiments were executed on a workstation equipped with a multi-core CPU and 16–32 GB of RAM. GPU acceleration was used for training deep generative models where available; however, the framework is designed to operate on standard computing environments without requiring specialized hardware.

To ensure reproducibility, consistent random seeds were used across preprocessing, masking, and model training stages, and all models were evaluated under identical data splits and masking conditions.

The code implementation follows a structured pipeline consisting of (i) data preprocessing and filtering, (ii) entropy-based variable selection, (iii) controlled masking generation, (iv) imputation model training, and (v) evaluation using RMSE and KS metrics.

4.10. Evaluation Design and Metrics

Imputation performance was evaluated under MCAR-style simulated masking, in which 20% of observed entries were removed uniformly at random using a fixed seed (np.random.seed(42)). This protocol enables controlled method comparison without privileging any covariate-dependent structure, while acknowledging that real NHANES missingness is more plausibly MAR/MNAR.
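The masking protocol can be sketched as follows. This is an illustrative version, not the study's code: it uses NumPy's `default_rng(42)` generator instead of the legacy `np.random.seed(42)` call mentioned above, draws the hidden entries jointly over all observed cells rather than per variable, and the name `mcar_mask` is hypothetical.

```python
import numpy as np


def mcar_mask(x, rate=0.20, seed=42):
    """Hide a fixed fraction of the observed entries uniformly at random.

    Returns (x_masked, mask), where mask is True at entries hidden by the
    simulation; originally missing entries are never selected.
    """
    rng = np.random.default_rng(seed)
    observed = np.flatnonzero(~np.isnan(x))          # flat indices of observed cells
    n_hide = int(round(rate * observed.size))
    hidden = rng.choice(observed, size=n_hide, replace=False)
    mask = np.zeros(x.size, dtype=bool)
    mask[hidden] = True
    mask = mask.reshape(x.shape)
    return np.where(mask, np.nan, x), mask
```

Because the original values at `mask` positions are retained in `x`, they serve as ground truth when scoring each imputer.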

In practice, NHANES missingness is often associated with demographics, response burden, and survey design, rendering MAR more plausible than MCAR. Consequently, reported metrics should be interpreted as comparative benchmarks rather than population-valid MAR estimates. Evaluation was performed exclusively on masked entries held out from observed data under the controlled masking protocol. Future work will implement covariate-dependent MAR masks and pattern-mixture MNAR sensitivity analyses.

Two complementary metrics were used:

RMSE: Measures average reconstruction error between imputed and true values.
KS Statistic: Quantifies distributional divergence between imputed and observed value distributions.

These metrics jointly assess pointwise accuracy and distributional fidelity, reflecting priorities relevant to downstream epidemiologic and policy analyses.

4.11. Sources of Stochastic Variability

Although we report single-run performance estimates, the pipeline contains several stochastic components that influence stability. Simulated missingness is generated using a fixed random seed for index selection, while deep generative models employ per-batch randomness through noise vectors in GAN and WGAN training and random diffusion masks using torch.rand_like during diffusion optimization. Consequently, the reported RMSE and KS values reflect one realization of these stochastic processes rather than averaged multi-seed estimates.

Adversarial models are particularly sensitive to initialization and noise sampling, as documented in prior work [8,52,60]. To mitigate evaluation bias, GAN reconstructions are assessed using nearest-neighbor pairing between generated and real samples for RMSE, while KS is computed distributionally without pairing, consistent with the implementation. Future extensions will incorporate multi-seed replication to quantify run-to-run variability.

4.12. Treatment of Survey Weights

NHANES is a complex probability survey with sampling weights, strata, and primary sampling units designed to support nationally representative estimation. The present study focuses on method comparison under controlled masking rather than direct population inference; therefore, imputation models were trained and evaluated on the measurement scale without incorporating design weights. This choice ensures that performance metrics (RMSE and KS) reflect reconstruction fidelity rather than weighted prevalence. For downstream epidemiologic analyses, imputed datasets should be combined with the NHANES interview weights (WTINT2YR) and design variables (SDMVSTRA, SDMVPSU) using survey-weighted estimation. Future work will extend the generative framework to weight-aware training, including (i) weighted loss functions, (ii) post-imputation raking, and (iii) conditioning on design strata.
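As a simple illustration of the downstream step, a weighted point estimate from an imputed variable can be computed as below. The helper `weighted_mean` is hypothetical and covers only the point estimate; design-based standard errors additionally require the strata and PSU variables and dedicated survey software, which this sketch does not attempt.

```python
import numpy as np


def weighted_mean(values, weights) -> float:
    """Survey-weighted mean, skipping rows with a missing value or weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    keep = ~np.isnan(values) & ~np.isnan(weights)
    return float(np.average(values[keep], weights=weights[keep]))
```

In practice `values` would be an imputed dietary variable and `weights` the corresponding WTINT2YR column.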

4.13. Reproducibility and Implementation

All preprocessing and analyses were conducted using Python-based pipelines. Variable definitions, inclusion criteria, and transformations were aligned with official CDC NHANES documentation to ensure methodological consistency and reproducibility of the analytic workflow. Code modularization, a fixed random seed, and consistent evaluation protocols make the results reproducible and extensible to future NHANES cycles or related population health datasets. Evaluation steps were applied identically across methods to ensure comparability.

To support repeatability, the random seed was fixed during masking and model training, and all methods were evaluated under an identical MCAR masking protocol. At this stage, results reflect a single deterministic run; variability across random initializations is planned as an additional stability analysis. Future revisions will include multi-seed evaluation ($n \geq 5$) to quantify stochastic variance in generative model training.

To ensure full reproducibility of the experimental framework, we provide detailed descriptions of model architectures, hyperparameters, and imputation procedures used in this study.

4.13.1. Model Architectures

We evaluated both classical and deep generative imputation models on a common set of entropy-ranked variables. Data were split into training and evaluation sets prior to masking, with masking applied only to evaluation data.

Variational Autoencoder (VAE):

The VAE consists of a fully connected encoder–decoder architecture. The encoder maps the input vector $x \in \mathbb{R}^{D}$ to a latent representation using one hidden layer of 141 units, as described in Table 5, with ReLU activation, followed by separate linear layers for the mean and log-variance of a 17-dimensional latent space. The decoder mirrors this structure, reconstructing the input from the latent representation.

Generative Adversarial Network (GAN) and Wasserstein GAN (WGAN):

GAN-based models use fully connected generator and discriminator networks. The generator maps a noise vector $z$ to the data space, while the discriminator distinguishes between real and generated samples. The WGAN variant replaces the standard loss with a Wasserstein loss to improve training stability. The nearest-neighbor pairing applied to GAN outputs before RMSE computation was used solely for comparability and does not reflect direct reconstruction error.

Diffusion Models:

Diffusion-based imputers are implemented as conditional denoising networks that learn to reconstruct missing values through an iterative denoising process. The model takes as input the masked data, a missingness indicator mask, and a time-step embedding. The diffusion network consists of 3 fully connected layers with 256 units each.

Fully Conditioned Diffusion:

The fully conditioned diffusion model extends the conditional diffusion framework by incorporating variable identity embeddings, enabling variable-specific reconstruction and improved handling of heterogeneous feature distributions.

4.13.2. Hyperparameters and Training Settings

All models were trained under consistent experimental conditions. Key hyperparameters are summarized in Table 5.

All experiments were conducted using a fixed random seed to ensure reproducibility.

4.13.3. Imputation Procedures

A controlled missingness protocol was applied uniformly across all methods. Missing values were simulated under a Missing Completely at Random (MCAR) mechanism with a masking rate of 20% applied independently to each variable.

Classical imputation methods were implemented as follows:

Mean imputation replaces missing values with the column mean.
KNN imputation uses distance-weighted nearest neighbors ( $k = 5$ ).
MICE performs iterative chained regression with 10 iterations.

Deep generative models were trained on complete cases and applied to reconstruct missing entries. For GAN-based methods, imputation was performed using nearest-neighbor donor selection from generated samples. Diffusion models directly estimate missing values via conditional denoising.

4.13.4. Experimental Setup

The dataset consists of NHANES 2021–2023 data, with analysis restricted to the top 10 entropy-ranked continuous variables to ensure a consistent evaluation domain. Data preprocessing included standardization using StandardScaler.

Model performance was evaluated using:

Root mean square error (RMSE) on masked entries.
Kolmogorov–Smirnov (KS) statistic for distributional fidelity.

All methods were evaluated under an identical masking scheme to ensure fair comparison across imputation approaches.

4.13.5. Reproducibility Statement

All experiments were implemented in Python using standard scientific computing libraries, including PyTorch and scikit-learn. The full codebase and experimental pipeline will be made publicly available upon publication to support reproducibility and further research.

5. Results

This section reports the empirical evaluation of imputation performance across classical statistical methods, deep generative models, and diffusion-based approaches. Results are presented for high-entropy variables identified in Section 3.4, with performance assessed using both pointwise reconstruction error and distributional similarity metrics. The analysis focuses on how performance varies with missingness complexity, rather than relying solely on aggregate missingness rates.

All experiments were conducted under a fixed experimental configuration, including consistent masking patterns, parameter initialization, and diffusion noise schedules across methods. Holding these factors fixed means that differences in RMSE and KS statistics reflect model behavior under that configuration rather than differences in masking or setup between methods. A single random seed was used to ensure reproducibility of the reported results, so seed-to-seed variability is not quantified here.

We note that generative models such as GANs, VAEs, and diffusion-based approaches are inherently sensitive to initialization. Although the observed performance trends are consistent under the fixed setup, variability across seeds remains an important consideration. Multi-seed benchmarking and confidence interval estimation will be explored in future work to provide a more comprehensive assessment of model stability.
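
A multi-seed summary of the kind mentioned above could take the following form. This is a sketch under stated assumptions, not a result from the study: `rmse_for_seed` is a toy stand-in that mean-imputes a synthetic standardized column, and the interval is a standard t-based confidence interval over per-seed scores.

```python
import numpy as np
from scipy import stats

def mean_ci(scores, level=0.95):
    """Mean and t-based confidence interval of per-seed scores."""
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    if n < 2:
        raise ValueError("need at least two seeds")
    mean = scores.mean()
    sem = scores.std(ddof=1) / np.sqrt(n)
    half = stats.t.ppf(0.5 + level / 2.0, df=n - 1) * sem
    return float(mean), float(mean - half), float(mean + half)

def rmse_for_seed(seed, n=2000, rate=0.20):
    """Toy stand-in for one benchmark run (synthetic data, mean imputation)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    mask = rng.random(n) < rate
    fill = x[~mask].mean()
    return float(np.sqrt(np.mean((x[mask] - fill) ** 2)))

scores = [rmse_for_seed(s) for s in range(5)]
mean, lo, hi = mean_ci(scores)
```

Replacing `rmse_for_seed` with a full training and imputation run per seed would yield intervals for each method and metric.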

Finally, the present study focuses on reconstruction fidelity, as measured by RMSE and KS statistics, and does not evaluate downstream predictive performance (e.g., disease risk modeling). Extending the evaluation to include downstream clinical and epidemiological tasks is an important direction for future research.

5.1. Evaluation Overview

Imputation performance was evaluated under controlled MCAR masking, with ground-truth values retained for assessment. All methods were applied to identical masked datasets to ensure comparability. Results are summarized using RMSE to quantify reconstruction accuracy and the KS statistic to assess distributional alignment between imputed and observed values.

Table 6 summarizes average reconstruction accuracy and distributional fidelity across the ten highest-entropy dietary variables under identical masking conditions. Conditional diffusion achieves the lowest mean reconstruction error (RMSE = 0.4180), followed by fully conditioned diffusion (RMSE = 0.4423), substantially outperforming classical baselines. In contrast, WGAN attains the strongest marginal distributional alignment (mean KS = 0.0692), highlighting a metric-dependent trade-off between pointwise accuracy and distributional fidelity. Traditional mean imputation exhibits pronounced distributional distortion despite deterministic stability. These findings reinforce the importance of evaluating both reconstruction error and marginal distributional preservation when benchmarking imputation methods in high-entropy health datasets.

The fully conditioned diffusion model exhibits slightly higher RMSE than the conditional diffusion model. This may reflect over-conditioning, where additional inputs such as variable identity and missingness masks increase model complexity and may hinder optimization under limited sample size. This highlights a trade-off between conditioning expressiveness and reconstruction stability.

WGAN achieves the lowest KS statistic for selected variables, indicating strong marginal distribution alignment. However, diffusion models provide a more consistent balance between reconstruction accuracy and distributional fidelity across variables. Future work will include variable-wise comparisons to further characterize these trade-offs.

As shown in Figure 4, missingness complexity was modeled as a latent construct defined by entropy and missing percentage, with parallel structural paths to RMSE and KS. While traditional methods exhibited near-zero associations between complexity and performance, deep models showed modest negative associations that did not reach statistical significance. In both groups, RMSE and KS remained strongly correlated ($r > 0.75$, $p < 0.001$). Each SEM was estimated using variable-level observations (10 variables × number of methods per group), yielding n = 30 for traditional methods and n = 50 for deep models.

The SEM results suggest that diffusion-based models attenuate the negative impact of entropy-driven complexity on reconstruction error relative to traditional approaches, although the estimated associations did not reach statistical significance. The model is exploratory: it is intended to examine structural relationships between missingness complexity and imputation performance rather than to provide population-level inference. Because it is estimated on a limited set of variables, pools variable-level observations, and treats them as independent units, it does not model correlations across variables or repeated dependence structures across methods. These results should therefore be read as pattern summaries rather than confirmatory evidence.

5.2. Reconstruction Accuracy Across Methods

Figure 5 summarizes RMSE performance across imputation methods for the top-ranked high-entropy variables. Among classical approaches, mean imputation and KNN exhibit high reconstruction errors across variables, reflecting their limited ability to model nonlinear dependencies and heterogeneous distributions. MICE consistently outperforms the other classical baselines, demonstrating the benefit of conditional modeling under MCAR assumptions.

Under masked-entry evaluation, diffusion-based models achieve the lowest average reconstruction error. The conditional diffusion model attains the lowest mean RMSE (0.4180), followed by the fully conditioned diffusion model (0.4423). Among traditional approaches, MICE performs best (0.5082), substantially outperforming KNN (0.7225) and mean imputation (0.9907). The GAN (0.8747) and VAE (0.8894) baselines exhibit higher reconstruction error overall and instability across variables.

5.3. Distributional Fidelity

Figure 6 illustrates the Kolmogorov–Smirnov (KS) statistics comparing the empirical distributions of observed variables with the distributions produced by each imputation method. While diffusion-based models achieved the lowest reconstruction error, the WGAN model produced the lowest KS divergence (mean KS = 0.0692), indicating stronger preservation of the marginal distributional structure. This pattern highlights an important trade-off between pointwise reconstruction accuracy and distributional fidelity. Diffusion models prioritize accurate reconstruction of individual values under the masking protocol, whereas adversarial approaches more closely reproduce the global shape of the empirical distribution. These findings suggest that evaluation of imputation methods should consider both reconstruction error and distributional alignment, particularly when imputed datasets are intended for downstream epidemiologic or policy analyses where marginal distributions influence statistical inference.

Distributional preservation, measured using the Kolmogorov–Smirnov statistic, reveals a different performance hierarchy. WGAN achieves the lowest mean KS value (0.0692), indicating strong distributional alignment despite moderate reconstruction error. Conditional diffusion (0.0843) and fully conditioned diffusion (0.0900) maintain competitive KS performance while preserving low RMSE. Among traditional methods, KNN (0.0945) and MICE (0.0992) remain competitive, whereas mean imputation performs worst (0.5808). GAN (0.2642) and VAE (0.2149) exhibit higher distributional distortion, which for GAN is consistent with instability in adversarial training.

Although WGAN achieves the lowest mean KS value, the present analysis does not establish that this advantage holds uniformly across all variables; rather, it indicates stronger marginal distribution alignment on average across the evaluated dietary features.

5.4. Traditional Imputation Baselines

Mean imputation exhibited substantial distributional distortion (KS 0.55–0.61) despite moderate reconstruction error (RMSE 0.92–1.15), confirming the known variance-collapse behavior [61]. K-nearest neighbors reduced RMSE for several variables (0.47–0.97) with heterogeneous KS performance (0.06–0.17). MICE achieved the strongest classical results with RMSE 0.24–0.77 and KS 0.05–0.17, consistent with its conditional predictive formulation [5].

Figure 7 illustrates the joint RMSE–KS performance landscape across methods. Diffusion-based models (Diffusion and FullCond Diffusion) cluster consistently in the lower-left region, indicating strong joint performance on reconstruction accuracy and distributional fidelity. WGAN achieves particularly low KS values but with higher RMSE relative to conditional diffusion. Mean imputation occupies the upper-right region, with high RMSE and elevated KS values, confirming its inadequacy under the evaluated masking conditions.

GAN and VAE models display broader dispersion, indicating instability across variables. VAE shows substantial RMSE variance, particularly for DR2TMOIS, suggesting sensitivity to distributional skew.

Overall, the scatter analysis reinforces that conditional diffusion models achieve the most favorable accuracy–fidelity trade-off among evaluated approaches.

5.5. Variational Autoencoder

The VAE provided a balanced trade-off between reconstruction accuracy and distributional fidelity (mean RMSE 0.52; KS 0.11). The latent generative structure preserved multivariate dependencies across correlated nutritional measures, aligning with prior findings that VAEs outperform regression-based imputers in high-entropy biomedical domains [51,62].

5.6. GAN Evaluation with Nearest-Neighbor Pairing

Because unconditional GAN generation lacks one-to-one correspondence with real observations, direct rowwise RMSE is not statistically valid. A nearest-neighbor pairing protocol in standardized feature space was therefore adopted to approximate pointwise fidelity [63]. The paired evaluation yielded RMSE 0.71 and KS 0.34, indicating partial distribution capture but weaker reconstruction relative to VAE and MICE. This behavior is consistent with mode-dropping tendencies reported for vanilla GANs [8].

While this approach enables approximate alignment for evaluation, it may underestimate reconstruction error by selecting the closest generated sample rather than evaluating direct model outputs. This introduces a potential bias in favor of GAN-based methods when compared to imputers that produce explicit pointwise reconstructions.

WGAN achieves strong distributional fidelity as measured by the KS statistic for selected variables. However, diffusion models demonstrate more consistent performance across variables, suggesting a trade-off between marginal distribution alignment and overall reconstruction stability.

5.7. Diffusion-Based Models

Diffusion-based approaches demonstrate the most favorable joint performance profile. The conditional diffusion model achieves the lowest overall reconstruction error (RMSE = 0.4180) while maintaining strong distributional fidelity (KS = 0.0843). The fully conditioned diffusion model yields comparable performance (RMSE = 0.4423, KS = 0.0900), confirming that conditioning mechanisms improve stability without substantial loss in distributional alignment. The slightly higher RMSE of the fully conditioned diffusion model may reflect over-conditioning, whereby the additional inputs for variable identity and mask information increase optimization complexity under limited sample support. This suggests a trade-off between conditioning expressiveness and reconstruction stability in high-dimensional tabular imputation.

Together, these results indicate that diffusion-based generative modeling provides the most robust trade-off between accuracy and distributional preservation [38,53].

Figure 8 and Figure 9 summarize variable-wise performance. Fully conditioned diffusion improves performance for most variables individually, although conditional diffusion achieves the lowest overall mean RMSE, with particularly large gains for DR1TM181 and DR1TP182. However, KS statistics (Figure 9) reveal that conditional diffusion better preserves marginal distributions, illustrating a metric-dependent trade-off between reconstruction accuracy and epidemiologic validity.

5.8. Performance as a Function of Entropy

To examine the relationship between missingness complexity and imputation performance, RMSE and KS metrics were analyzed as functions of variable entropy. Variables with low-to-moderate entropy show modest performance differences across advanced methods, whereas high-entropy variables exhibit substantial divergence in performance. For these variables, classical and adversarial approaches experience pronounced degradation, while diffusion-based models maintain stable reconstruction accuracy and distributional alignment.

These findings suggest that entropy is a useful diagnostic for identifying variables that may benefit from advanced generative imputation. However, because this study does not directly compare entropy-based variable selection against percent-missing-based selection, conclusions regarding relative superiority should be interpreted cautiously.

5.9. Key Empirical Findings

Under masked-entry evaluation, diffusion-based imputers achieve the lowest reconstruction error among the classical and deep generative methods evaluated, but the best method depends on the evaluation objective. Conditional diffusion yields the lowest reconstruction error (RMSE = 0.4180), with fully conditioned diffusion performing comparably (RMSE = 0.4423), whereas WGAN attains the lowest mean KS statistic (0.0692). These results show that optimizing for pointwise accuracy does not guarantee the best marginal distributional alignment, motivating multi-metric reporting for high-entropy health variables.

Across both metrics, diffusion-based approaches consistently occupy the Pareto-optimal region of the RMSE–KS performance space, demonstrating superior joint optimization relative to classical and adversarial baselines.
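
Pareto dominance on the two metrics can be checked directly from the mean values reported in Sections 5.2 and 5.3. The sketch below uses only those means; WGAN is left out because its mean RMSE is not stated in this section. The helper `pareto_front` is illustrative, and lower is better on both axes.

```python
# (mean RMSE, mean KS) as reported in Sections 5.2 and 5.3.
results = {
    "Mean": (0.9907, 0.5808),
    "KNN": (0.7225, 0.0945),
    "MICE": (0.5082, 0.0992),
    "GAN": (0.8747, 0.2642),
    "VAE": (0.8894, 0.2149),
    "Diffusion": (0.4180, 0.0843),
    "FullCond Diffusion": (0.4423, 0.0900),
}

def pareto_front(points):
    """Names of methods not dominated by any other method
    (dominated = another method is no worse on both metrics and better on one)."""
    front = []
    for name, (rmse, ks) in points.items():
        dominated = any(
            r2 <= rmse and k2 <= ks and (r2 < rmse or k2 < ks)
            for other, (r2, k2) in points.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front

front = pareto_front(results)
```

On these mean values, conditional diffusion is the only non-dominated method among the seven listed. WGAN would also be non-dominated whatever its RMSE, because no other method matches its mean KS of 0.0692.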

6. Discussion

The results point to three main observations. First, entropy provides a more informative characterization of missingness than percent-missing alone, as it captures structural uncertainty rather than simply the prevalence of missing values. Variables with similar missingness rates can differ substantially in entropy, reflecting differences in reconstruction difficulty. In the NHANES dietary variables examined here, higher entropy was associated with more complex, multimodal, and heterogeneous distributions, which are inherently more sensitive to imputation bias. In practice, these findings highlight a simple but important point: variables that appear similar in the missingness rate can behave very differently during reconstruction, depending on their underlying distribution and structural complexity.

Second, benchmarking under controlled and identical masking conditions shows that classical methods remain useful but are not sufficient for variables with complex missingness patterns. Approaches such as KNN and MICE perform adequately for lower-entropy variables, where local similarity assumptions or conditional mean structures are reasonable. However, for higher-entropy variables, these assumptions break down, leading to increased bias and loss of distributional structure. Diffusion-based models achieve more consistent performance across both reconstruction accuracy and distributional fidelity.

A key reason for this improvement lies in the generative mechanism of diffusion models. Unlike GANs, which rely on adversarial training and are susceptible to mode collapse, diffusion models iteratively learn to denoise data through a sequence of stochastic transformations. This process allows them to better approximate complex, multimodal distributions and capture subtle dependencies in the data. As a result, diffusion models are more robust in high-entropy settings, where the underlying data-generating process is less structured and more uncertain.
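
The stochastic transformations mentioned above can be illustrated with the forward noising step of a denoising diffusion probabilistic model [9]. The paper does not give its noise schedule or conditioning details in this section, so the linear schedule, the number of steps, and the choice to keep observed entries fixed are assumptions made for this sketch.

```python
import numpy as np

def linear_alpha_bar(T=100, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal retention alpha_bar_t for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def noise_missing(x0, observed, t, alpha_bar, rng):
    """Apply x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    then restore the observed entries so they act as conditioning."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.where(observed, x0, x_t), eps

rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar()
x0 = rng.normal(size=(4, 3))                 # synthetic standardized rows
observed = rng.random(x0.shape) > 0.20       # True where the value is known
x_t, eps = noise_missing(x0, observed, t=50, alpha_bar=alpha_bar, rng=rng)
```

A denoising network would then be trained to predict `eps` at the non-observed positions given `x_t`, the step `t`, and the observed values, and imputation would run this process in reverse.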

Third, evaluating imputation quality using both RMSE and KS statistics is essential, as these metrics capture fundamentally different aspects of reconstruction. RMSE reflects pointwise accuracy, emphasizing the closeness of imputed values to observed values, while the KS statistic measures distributional similarity, capturing how well the overall shape of the variable is preserved. The observed trade-off between RMSE and KS highlights an important tension: methods optimized for minimizing reconstruction error may produce overly smoothed estimates that distort the underlying distribution, whereas methods that preserve distributional properties may allow slightly higher pointwise errors. In healthcare applications, this distinction is critical, as downstream predictive models, epidemiological analyses, and subgroup comparisons depend on maintaining realistic data distributions rather than minimizing numerical error alone.

The strong joint performance of diffusion models on RMSE and KS suggests that their iterative refinement process enables a more balanced reconstruction, reducing pointwise error while preserving distributional structure. This is particularly important for high-entropy variables, where capturing variability and tail behavior is essential.

From an applied perspective, the proposed entropy-guided framework is relevant beyond NHANES. Healthcare data systems often integrate survey, laboratory, administrative, and patient-reported data, all of which exhibit heterogeneous missingness mechanisms. Entropy-based prioritization offers a practical way to identify variables that require more advanced reconstruction strategies, rather than relying on uniform preprocessing rules. At the same time, we note that this study does not include a direct comparison between entropy-based and percent-missing-based variable selection. As such, conclusions regarding the relative advantage of entropy should be interpreted with caution.

6.1. Limitations

Several limitations should be acknowledged. The present study evaluates reconstruction performance under controlled masked-entry experiments rather than downstream predictive or causal modeling. The analysis also focuses on a subset of high-entropy dietary variables rather than the complete NHANES feature space. In addition, results are derived from a single national survey dataset and may not fully generalize to clinical electronic health records or longitudinal monitoring data.

6.2. Future Directions

Future work will extend the framework to explicitly address MAR and MNAR missingness mechanisms, incorporate complex survey design information into generative models, and evaluate downstream impacts on prediction stability and causal inference. Applying entropy-guided diffusion imputation to electronic health records and multimodal health data streams represents a promising direction for improving data quality in real-world health analytics pipelines.

7. Conclusions

This study examined an entropy-guided framework for benchmarking imputation strategies in high-dimensional healthcare survey data using NHANES 2021–2023. By combining entropy-based diagnostics with controlled masking experiments, we show that percent-missing summaries alone are not sufficient to capture the structural complexity of incomplete health data. Entropy provides a complementary signal that helps identify variables for which naive imputation is most likely to introduce bias.

Comparisons across classical statistical methods, deep generative models, and diffusion-based approaches indicate that conditional diffusion models achieve the lowest reconstruction error for high-entropy dietary variables, while Wasserstein GAN models provide strong marginal distributional alignment. These findings highlight an inherent trade-off between pointwise reconstruction accuracy and distributional fidelity, underscoring the importance of evaluating imputation performance using multiple metrics.

From a practical perspective, entropy-informed diagnostics offer a useful guide for selecting imputation strategies in complex healthcare datasets. By prioritizing variables with structurally complex missingness, the proposed framework supports more reliable reconstruction and can improve the robustness of downstream analyses.

Several limitations should be acknowledged. First, evaluation was conducted under controlled MCAR-style masking to enable consistent benchmarking across methods. In practice, missingness in NHANES is more likely to follow MAR or MNAR mechanisms, and future work will incorporate covariate-dependent masking and sensitivity analyses. Second, results are based on a single training realization; additional multi-seed experiments are needed to quantify variability in generative model performance.

Future work will extend this framework by incorporating survey-weighted imputation for nationally representative inference, evaluating multi-seed stability of generative models, and integrating diffusion-based reconstruction into downstream predictive modeling pipelines for population health applications.

Overall, this work provides a reproducible approach to entropy-guided imputation benchmarking and demonstrates the potential of diffusion-based generative models for reconstructing high-entropy healthcare data.

Author Contributions

Conceptualization, D.F.P.; methodology, D.F.P.; software, D.F.P.; validation, D.F.P.; formal analysis, D.F.P.; writing—original draft preparation, D.F.P.; writing—review and editing, D.F.P., J.P. and V.P.G.; supervision, V.P.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable. The study used publicly available de-identified NHANES data.

Informed Consent Statement

Not applicable.

Data Availability Statement

The NHANES datasets analyzed in this study are publicly available from the Centers for Disease Control and Prevention. The Python scripts used for preprocessing, entropy-based variable selection, masking, and imputation benchmarking are available from the corresponding author upon reasonable request.

Acknowledgments

The authors acknowledge the National Center for Health Statistics (NCHS) at the Centers for Disease Control and Prevention (CDC) for providing access to the National Health and Nutrition Examination Survey (NHANES) data used in this study. The findings and conclusions in this paper are those of the authors and do not necessarily represent the official position of the CDC or NCHS. The authors also acknowledge the use of ChatGPT (OpenAI, GPT-5.3, 2026), a large language model, as an assistive tool for language refinement, LaTeX formatting, and manuscript organization. All technical content, methodological decisions, interpretations, and conclusions are the sole responsibility of the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

National Center for Health Statistics. NHANES Survey Methods and Analytic Guidelines; Centers for Disease Control and Prevention: Atlanta, GA, USA, 2023. Available online: https://wwwn.cdc.gov/nchs/nhanes/analyticguidelines.aspx (accessed on 14 March 2026).
Rubin, D.B. Multiple Imputation for Nonresponse in Surveys; Wiley: New York, NY, USA, 1987. [Google Scholar]
Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data, 2nd ed.; Wiley: Hoboken, NJ, USA, 2002. [Google Scholar]
Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
van Buuren, S. Flexible Imputation of Missing Data, 2nd ed.; CRC Press: Boca Raton, FL, USA, 2018. [Google Scholar]
Azur, M.J.; Stuart, E.A.; Frangakis, C.; Leaf, P.J. Multiple imputation by chained equations. Int. J. Methods Psychiatr. Res. 2011, 20, 40–49. [Google Scholar] [CrossRef] [PubMed]
Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Banff, AB, Canada, 14–16 April 2014; Available online: http://arxiv.org/abs/1312.6114 (accessed on 21 April 2026).
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 27), Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 33), Virtual Conference, 6–12 December 2020; pp. 6840–6851. Available online: https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html (accessed on 21 April 2026).
Schafer, J.L. Analysis of Incomplete Multivariate Data; Chapman & Hall: Boca Raton, FL, USA, 1997. [Google Scholar]
Enders, C.K. Applied Missing Data Analysis; Guilford Press: New York, NY, USA, 2010. [Google Scholar]
Haneuse, S.; Arterburn, D.; Daniels, M.J. Assessing missing data assumptions in EHR-based studies. JAMA Netw. Open 2021, 4, e210184. [Google Scholar] [CrossRef] [PubMed]
Graham, J.W. Missing data analysis: Making it work in the real world. Annu. Rev. Psychol. 2009, 60, 549–576. [Google Scholar] [CrossRef] [PubMed]
Carpenter, J.R.; Kenward, M.G. Multiple Imputation and Its Application; Wiley: Chichester, UK, 2013. [Google Scholar]
Seaman, S.R.; White, I.R. Inverse probability weighting for missing data. Stat. Methods Med. Res. 2013, 22, 278–295. [Google Scholar] [CrossRef]
White, I.R.; Carlin, J.B. Bias and efficiency of multiple imputation compared with complete-case analysis. Stat. Med. 2010, 29, 2920–2931. [Google Scholar] [CrossRef]
Molenberghs, G.; Kenward, M. Missing Data in Clinical Studies; Wiley: Hoboken, NJ, USA, 2007. [Google Scholar]
Weiskopf, N.G.; Weng, C. Methods and dimensions of electronic health record data quality assessment: Enabling reuse for clinical research. J. Am. Med. Inform. Assoc. 2013, 20, 144–151. [Google Scholar] [CrossRef]
Hersh, W.R.; Weiner, M.G.; Embi, P.J.; Logan, J.R.; Payne, P.R.O.; Bernstam, E.V.; Lehmann, H.P.; Hripcsak, G.; Hartzog, T.H.; Cimino, J.J.; et al. Caveats for the use of operational electronic health record data in comparative effectiveness research. Med. Care 2013, 51, S30–S37. [Google Scholar] [CrossRef]
Hripcsak, G.; Albers, D.J. Next-generation phenotyping of electronic health records. J. Am. Med. Inform. Assoc. 2013, 20, 117–121. [Google Scholar] [CrossRef]
Nasir, A.; Gurupur, V.P.; Liu, X. A new paradigm to analyze data completeness of patient data. Appl. Clin. Inform. 2016, 7, 386–401. [Google Scholar] [CrossRef]
Fernandes Prabhu, D.; Gurupur, V.P.; Hadley, D.; Prabhu, V. Addressing data incompleteness in NHANES 2021–2023. In Proceedings of the 5th International Conference on Innovations in Computational Intelligence and Computer Vision (ICICV 2025), Calabria, Italy, 4–6 June 2025. [Google Scholar]
White, I.R.; Royston, P.; Wood, A.M. Multiple imputation using chained equations. Stat. Med. 2011, 30, 377–399. [Google Scholar] [CrossRef]
van Buuren, S.; Groothuis-Oudshoorn, K. mice: Multivariate imputation by chained equations in R. J. Stat. Softw. 2011, 45, 1–67. [Google Scholar] [CrossRef]
Raghunathan, T.E.; Lepkowski, J.M.; Van Hoewyk, J.; Solenberger, P. A multivariate technique for multiply imputing missing values using a sequence of regression models. Surv. Methodol. 2001, 27, 85–95. [Google Scholar]
Honaker, J.; King, G.; Blackwell, M. Amelia II: A program for missing data. J. Stat. Softw. 2011, 45, 1–47. [Google Scholar] [CrossRef]
Stekhoven, D.J.; Bühlmann, P. MissForest—Non-parametric missing value imputation. Bioinformatics 2012, 28, 112–118. [Google Scholar] [CrossRef]
Mazumder, R.; Hastie, T.; Tibshirani, R. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 2010, 11, 2287–2322. [Google Scholar]
Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R.B. Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17, 520–525. [Google Scholar] [CrossRef]
Yoon, J.; Jordon, J.; Schaar, M. GAIN: Missing data imputation using generative adversarial nets. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 5689–5698. Available online: https://proceedings.mlr.press/v80/yoon18a.html (accessed on 21 April 2026).
Liu, M.; Li, S.; Yuan, H.; Ong, M.E.H.; Ning, Y.; Xie, F.; Saffari, S.E.; Shang, Y.; Volovici, V.; Chakraborty, B.; et al. Handling missing values in healthcare data: A systematic review of deep learning-based imputation techniques. Artif. Intell. Med. 2023, 142, 102587. [Google Scholar] [CrossRef]
Mattei, P.-A.; Frellsen, J. MIWAE: Deep generative modelling and imputation of incomplete data sets. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 4413–4423. Available online: https://proceedings.mlr.press/v97/mattei19a.html (accessed on 21 April 2026).
Ipsen, N.B.; Mattei, P.-A.; Frellsen, J. not-MIWAE: Deep generative modelling with missing not at random data. arXiv 2020, arXiv:2006.12871. [Google Scholar]
Li, S.C.-X.; Jiang, B.; Marlin, B.M. MisGAN: Learning from incomplete data with generative adversarial networks. arXiv 2019, arXiv:1902.09599. [Google Scholar] [CrossRef]
Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual Conference, 18–24 July 2021; pp. 8162–8171. Available online: https://proceedings.mlr.press/v139/nichol21a.html (accessed on 21 April 2026).
Song, Y.; Sohl-Dickstein, J.; Kingma, D.P.; Kumar, A.; Ermon, S.; Poole, B. Score-based generative modeling through stochastic differential equations. arXiv 2021, arXiv:2011.13456. [Google Scholar] [CrossRef]
Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2021, arXiv:2010.02502. [Google Scholar]
Kotelnikov, A.; Baranchuk, D.; Rubachev, I.; Babenko, A. TabDDPM: Modelling tabular data with diffusion models. In Proceedings of the 40th International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023; pp. 17564–17579. Available online: https://proceedings.mlr.press/v202/kotelnikov23a.html (accessed on 21 April 2026).
Ouyang, Y.; Xie, L.; Li, C.; Cheng, G. MissDiff: Training diffusion models on tabular data with missing values. arXiv 2023, arXiv:2307.00467.
Gurupur, V.P.; Shelleh, M. Machine learning analysis for data incompleteness (MADI). IEEE Access 2021, 9, 112321–112335.
Lumley, T. Complex Surveys: A Guide to Analysis Using R; Wiley: Hoboken, NJ, USA, 2010.
Korn, E.L.; Graubard, B.I. Analysis of Health Surveys; Wiley: New York, NY, USA, 1999.
Willett, W. Nutritional Epidemiology, 2nd ed.; Oxford University Press: New York, NY, USA, 2012.
Freedman, L.S.; Commins, J.M.; Moler, J.E.; Willett, W.; Tinker, L.F.; Subar, A.F.; Spiegelman, D.; Rhodes, D.; Potischman, N.; Neuhouser, M.L.; et al. Using biomarkers to evaluate the extent of dietary misreporting in a large sample of adults: The OPEN study. Am. J. Epidemiol. 2005, 158, 1–13.
Kipnis, V.; Midthune, D.; Freedman, L.; Bingham, S.; Schatzkin, A.; Subar, A.; Carroll, R. Empirical evidence of correlated biases in dietary assessment instruments. Am. J. Epidemiol. 2001, 153, 394–403.
Subar, A.F.; Kipnis, V.; Troiano, R.P.; Midthune, D.; Schoeller, D.A.; Bingham, S.; Sharbaugh, C.O.; Trabulsi, J.; Runswick, S.; Ballard-Barbash, R.; et al. Using intake biomarkers to evaluate the extent of dietary misreporting in a large sample of adults: The OPEN study. Am. J. Epidemiol. 2015, 181, 492–501.
Jakobsen, J.C.; Gluud, C.; Wetterslev, J.; Winkel, P. When and how should multiple imputation be used for handling missing data in randomised clinical trials—A practical guide with flowcharts. Stat. Med. 2017, 36, 3771–3790.
Agency for Healthcare Research and Quality. National Healthcare Quality and Disparities Report; AHRQ: Rockville, MD, USA, 2021.
Gurupur, V.P.; Abedin, P.; Hooshmand, S.; Shelleh, M. Analyzing the Data Completeness of Patients’ Records Using a Random Variable Approach to Predict the Incompleteness of Electronic Health Records. Appl. Sci. 2022, 12, 10746.
Rubin, D.B. Inference and missing data. Biometrika 1976, 63, 581–592.
Nazábal, A.; Olmos, P.M.; Ghahramani, Z.; Valera, I. Handling incomplete heterogeneous data using VAEs. Pattern Recognit. 2020, 107, 107501.
Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, NSW, Australia, 6–11 August 2017; pp. 214–223. Available online: https://proceedings.mlr.press/v70/arjovsky17a.html (accessed on 21 April 2026).
Tashiro, Y.; Song, J.; Song, Y.; Ermon, S. CSDI: Conditional score-based diffusion models for probabilistic time series imputation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 34), Virtual Conference, 6–14 December 2021; Available online: https://proceedings.neurips.cc/paper/2021/hash/cfe8504bda37b575c70ee1a8276f3486-Abstract.html (accessed on 21 April 2026).
Centers for Disease Control and Prevention (CDC). National Health and Nutrition Examination Survey (NHANES), 2021–2023. U.S. Department of Health and Human Services: Hyattsville, MD, USA. Available online: https://www.cdc.gov/nchs/nhanes/ (accessed on 10 March 2026).
National Center for Health Statistics. NHANES Linked Mortality Files: Methodology and Analytic Considerations; CDC: Hyattsville, MD, USA, 2023.
Little, R.J.A.; Rubin, D.B. Statistical Analysis with Missing Data, 3rd ed.; Wiley: Hoboken, NJ, USA, 2019.
Kolmogorov, A.N. Sulla determinazione empirica di una legge di distribuzione. G. Dell’Istituto Ital. Degli Attuari 1933, 4, 83–91.
Smirnov, N. Table for estimating the goodness of fit. Ann. Math. Stat. 1948, 19, 279–281.
Massey, F.J. The Kolmogorov–Smirnov test for goodness of fit. J. Am. Stat. Assoc. 1951, 46, 68–78.
Gulrajani, I.; Ahmed, F.; Arjovsky, M.; Dumoulin, V.; Courville, A.C. Improved training of Wasserstein GANs. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 30), Long Beach, CA, USA, 4–9 December 2017; pp. 5767–5777. Available online: https://papers.nips.cc/paper_files/paper/2017/hash/892c3b1c6dccd52936e27cbd0ff683d6-Abstract.html (accessed on 21 April 2026).
Schafer, J.L.; Graham, J.W. Missing data: Our view of the state of the art. Psychol. Methods 2002, 7, 147–177.
Ma, C.; Tschiatschek, S.; Palla, K.; Hernández-Lobato, J.M.; Nowozin, S.; Zhang, C. EDDI: Efficient dynamic discovery of high-value information with partial VAE. In Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; pp. 4234–4243. Available online: https://proceedings.mlr.press/v80/ma18b.html (accessed on 21 April 2026).
Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling tabular data using conditional GAN. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 32), Vancouver, BC, Canada, 8–14 December 2019; pp. 7335–7345. Available online: https://papers.nips.cc/paper_files/paper/2019/hash/254ed7d2de3b23ab10936522dd547b78-Abstract.html (accessed on 21 April 2026).

Figure 1. Decision-driven overview of the proposed framework. The workflow begins with a Start node and proceeds through missing-data generation and an initial recovery decision. If recovery is pursued, entropy-guided diagnostics are used to identify and prioritize high-entropy variables. Both classical baselines and deep generative/diffusion-based methods are then applied to the same selected variable set under identical masking and evaluation conditions. This design reflects a comparative benchmarking framework rather than conditional assignment of methods to different variable groups.

Figure 2. Age distribution of participants in NHANES 2021–2023. The violin plot illustrates population coverage across the lifespan, supporting stratified analyses of missingness patterns.

Figure 3. Top 40 variables ranked by Shannon entropy after exclusion of identifiers and survey design fields, with the top 10 variables highlighted. Orange bars indicate the top 10 highest-entropy variables, while gray bars indicate the remaining variables in the top 40. Blue points and adjacent numeric labels denote the entropy value (in bits) for each variable. The right-hand gray track shows the percentage of missing data for each variable. High entropy indicates structurally complex and unpredictable missingness patterns, motivating entropy-guided variable prioritization for advanced imputation.

Figure 4. Multi-group structural equation model comparing traditional (Mean, KNN, MICE) and deep generative (VAE, GAN, WGAN, conditional diffusion, fully conditioned diffusion) imputation methods. Standardized estimates are shown for measurement loadings and structural paths.

Figure 5. RMSE comparison across imputation methods for high-entropy variables in NHANES 2021–2023. Conditional diffusion achieves the lowest average reconstruction error across variables.

Figure 6. KS statistics comparing empirical and imputed distributions across methods. WGAN achieves the lowest KS values, indicating the strongest marginal distributional alignment, while fully conditioned diffusion exhibits substantially higher KS values despite its strong RMSE performance, indicating a trade-off between pointwise accuracy and marginal distributional fidelity.

Figure 7. Joint evaluation of imputation accuracy and distributional fidelity under masked-entry simulation. Each point represents variable-level performance for a given imputation method. Points closer to the lower left indicate a better trade-off between reconstruction error (RMSE) and distributional discrepancy (Kolmogorov–Smirnov (KS) statistic). Diffusion-based approaches cluster closer to this region than traditional and adversarial methods.

Figure 8. RMSE across variables and methods. Fully conditioned diffusion achieves the lowest reconstruction error, while classical methods degrade sharply for high-entropy dietary variables.

Figure 9. Kolmogorov–Smirnov statistics comparing empirical and imputed distributions. Conditional diffusion achieves the strongest marginal fidelity, whereas fully conditioned diffusion shows elevated KS despite low RMSE.

Table 2. Top ten variables with the highest proportion of missing data in the NHANES 2021–2023 dietary dataset. Variable definitions follow the National Health and Nutrition Examination Survey (NHANES) documentation provided by the National Center for Health Statistics (NCHS), Centers for Disease Control and Prevention (CDC) [54].

Variable	% Missing	Description
DR2TMOIS	≈100%	Total moisture intake, day 2
DR2TCARB	≈100%	Total carbohydrate intake, day 2
DR2TSFAT	≈100%	Total saturated fat intake, day 2
DR2TMFAT	≈100%	Total monounsaturated fat intake, day 2
DR2TPFAT	≈100%	Total polyunsaturated fat intake, day 2
DR2TNIAC	≈100%	Niacin intake, day 2
DR2TS160	≈100%	Palmitic acid intake, day 2
DR2TM181	≈100%	Oleic acid intake, day 2
DR2TP182	≈100%	Linoleic acid intake, day 2
DR1TMOIS	≈100%	Total moisture intake, day 1

Table 3. Definitions of the top 10 representative high-entropy NHANES dietary variables (2021–2023).

Code	Domain	Type	Definition (NHANES Label Summary)
DR1TMOIS	Dietary (Day 1)	Continuous	Total moisture intake (grams) reported in the Day 1 24 h dietary recall interview.
DR1TSFAT	Dietary (Day 1)	Continuous	Total saturated fatty acid intake (grams) reported in the Day 1 24 h dietary recall.
DR1TMFAT	Dietary (Day 1)	Continuous	Total monounsaturated fatty acid intake (grams) reported in the Day 1 24 h dietary recall.
DR1TM181	Dietary (Day 1)	Continuous	Total intake of fatty acid C18:1 (oleic acid) in grams from the Day 1 dietary recall.
DR1TPFAT	Dietary (Day 1)	Continuous	Total polyunsaturated fatty acid intake (grams) reported in the Day 1 dietary recall.
DR1TNIAC	Dietary (Day 1)	Continuous	Total niacin (vitamin B3) intake (milligrams) reported in the Day 1 dietary recall.
DR1TCARB	Dietary (Day 1)	Continuous	Total carbohydrate intake (grams) reported in the Day 1 dietary recall.
DR1TP182	Dietary (Day 1)	Continuous	Total intake of fatty acid C18:2 (linoleic acid) in grams from the Day 1 dietary recall.
DR1TS160	Dietary (Day 1)	Continuous	Total intake of fatty acid C16:0 (palmitic acid) in grams from the Day 1 dietary recall.
DR2TMOIS	Dietary (Day 2)	Continuous	Total moisture intake (grams) reported in the Day 2 24 h dietary recall interview.

Notes: Variable labels above are reported as concise definitions for readability. For exact NHANES codebooks, units, and questionnaire context, see CDC NHANES documentation for the 2021–2023 cycle.

Table 4. Illustrative example of the imputation workflow across framework stages. Values are shown conceptually to demonstrate transformations rather than exact numerical outputs.

Stage	Input	Transformation	Output
Observed Data	DR1TCARB = observed	—	Complete value
Masking (Section 4.5.1)	Observed value	Random masking applied	DR1TCARB = NaN
Entropy Prioritization (Section 4.5.2)	Variable set	High-entropy variables selected	Variable retained for benchmarking
Imputation (Section 4.5.3)	Missing value (NaN)	Classical and generative models applied	Reconstructed value (method-dependent)
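
The stages in Table 4 can be sketched in a few lines. The example below is a minimal illustration and not the authors' pipeline: the synthetic values, the 30% masking rate, the seed, and the helper names `mask_mcar`, `mean_impute`, and `rmse_on_masked` are all assumptions introduced here, and mean imputation simply stands in for any of the benchmarked methods.

```python
import math
import random

def mask_mcar(values, rate, rng):
    """Hide each observed value independently with probability `rate` (MCAR)."""
    masked = [None if rng.random() < rate else v for v in values]
    hidden_idx = [i for i, v in enumerate(masked) if v is None]
    return masked, hidden_idx

def mean_impute(masked):
    """Fill every hidden entry with the mean of the entries still observed."""
    observed = [v for v in masked if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in masked]

def rmse_on_masked(truth, imputed, hidden_idx):
    """Root mean square error restricted to the artificially hidden entries."""
    sq = [(truth[i] - imputed[i]) ** 2 for i in hidden_idx]
    return math.sqrt(sum(sq) / len(sq))

# Hypothetical stand-in for one continuous dietary variable (not NHANES data).
rng = random.Random(0)
truth = [rng.gauss(250.0, 60.0) for _ in range(500)]

masked, hidden_idx = mask_mcar(truth, rate=0.3, rng=rng)  # masking stage
imputed = mean_impute(masked)                             # imputation stage
score = rmse_on_masked(truth, imputed, hidden_idx)        # evaluation stage
```

Because the error is computed only on entries that were observed before masking, every method can be scored against the same known ground truth, which is what makes the comparison controlled.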

Table 5. Hyperparameters used across imputation models.

Model	Learning Rate	Epochs	Batch Size	Notes
VAE	0.0090	50	128	Latent dim = 17, hidden dim = 141
KNN	–	–	–	$k = 5$, distance-weighted
MICE	–	10 iterations	–	Iterative regression
Diffusion	0.001	50	128	Conditional denoising
Full Diffusion	0.001	50	128	Includes variable embeddings
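
To make the KNN row concrete, the following sketch shows what $k = 5$ with distance weighting means for a single missing entry. It is a simplified pure-Python illustration under stated assumptions: the helper `knn_impute_value`, the Euclidean distance over fully observed predictor columns, the inverse-distance weights, and the toy donor rows are hypothetical and are not taken from the authors' implementation.

```python
import math

def knn_impute_value(target_features, donors, k=5, eps=1e-12):
    """Estimate a missing value from the k nearest donor rows.

    target_features: predictor values of the row with the missing entry.
    donors: list of (features, value) pairs where the value is observed.
    Neighbors are weighted by inverse Euclidean distance.
    """
    scored = []
    for features, value in donors:
        dist = math.dist(target_features, features)
        scored.append((dist, value))
    scored.sort(key=lambda pair: pair[0])
    nearest = scored[:k]
    # An exact match (distance 0) determines the estimate on its own.
    exact = [value for dist, value in nearest if dist <= eps]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / dist for dist, _ in nearest]
    total = sum(weights)
    return sum(w * value for w, (_, value) in zip(weights, nearest)) / total

# Toy donor rows (assumed data): two predictors and one observed target value.
donors = [
    ((1.0, 2.0), 10.0),
    ((1.5, 1.8), 12.0),
    ((5.0, 8.0), 40.0),
    ((1.2, 2.1), 11.0),
    ((6.0, 9.0), 45.0),
    ((0.9, 1.9), 9.0),
    ((5.5, 8.5), 42.0),
]
estimate = knn_impute_value((1.1, 2.0), donors, k=5)
```

Four of the five selected neighbors sit close to the target row, so the estimate stays near their values even though one distant donor is included. In scikit-learn, a comparable configuration would be `KNNImputer(n_neighbors=5, weights="distance")`, and the MICE row would roughly correspond to `IterativeImputer(max_iter=10)`; whether the study used these classes is not stated in this section.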

Table 6. Comparative imputation performance under controlled MCAR masking (mean across high-entropy variables).

Method	Mean RMSE	Mean KS
Mean Imputation	0.9907	0.5808
KNN Imputation	0.7225	0.0945
MICE Imputation	0.5082	0.0992
VAE	0.8894	0.2149
GAN	0.8747	0.2642
WGAN	0.6128	0.0692
Conditional Diffusion	0.4180	0.0843
Fully Conditioned Diffusion	0.4423	0.0900

Note: The best (lowest) values are the mean RMSE of Conditional Diffusion (0.4180) and the mean KS of WGAN (0.0692).
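
The high KS value for mean imputation in Table 6 is easier to interpret with a small calculation. The sketch below implements the two-sample KS statistic (the maximum absolute difference between two empirical CDFs) in pure Python and applies it to a constant fill. The function name `ks_statistic`, the synthetic Gaussian sample, the sample sizes, and the seed are assumptions made for illustration; they are not the study's data or code.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |F_a(x) - F_b(x)| over all observed x."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    gap = 0.0
    for x in a + b:
        # bisect_right counts how many points are <= x, giving the empirical CDF.
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

rng = random.Random(1)
observed = [rng.gauss(0.0, 1.0) for _ in range(1000)]
mean_value = sum(observed) / len(observed)

# Mean imputation replaces every masked entry with one constant.
constant_fill = [mean_value] * 300
# A fill that resamples observed values keeps the marginal shape.
resampled_fill = [rng.choice(observed) for _ in range(300)]

ks_constant = ks_statistic(observed, constant_fill)
ks_resampled = ks_statistic(observed, resampled_fill)
```

For continuous data, a constant fill places all of its probability mass at one point, so its KS statistic against the observed sample is roughly 0.5 or higher. That is similar in magnitude to the 0.5808 reported for mean imputation, while a fill that follows the observed marginal stays far lower.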

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
