Article

Validated Methods for Synthesising Hearing Health Data for Machine Learning: A Comparative Study of KDE and VAE Approaches

by Liam Barrett 1,2,*, Roulla Katiri 2,3, Yuen Bing Ooi 1, Isabella Moffitt 1, Anne G. M. Schilder 1,2,3 and Nishchay Mehta 1,2,3
1 UCL Ear Institute, University College London, 332 Gray's Inn Road, London WC1X 8EE, UK
2 National Institute for Health Research, University College London Hospitals, Biomedical Research Centre, London W1T 7DN, UK
3 Royal National ENT and Eastman Dental Hospitals, University College London Hospitals NHS Foundation Trust, London NW1 2BU, UK
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(6), 2917; https://doi.org/10.3390/app16062917
Submission received: 17 December 2025 / Revised: 5 March 2026 / Accepted: 14 March 2026 / Published: 18 March 2026
(This article belongs to the Special Issue Advances in Machine Learning and Big Data Analytics)

Featured Application

The validated synthetic data generation methods can be applied to (1) training machine learning models for hearing loss prediction without access to sensitive patient data; (2) educational applications in audiology/ENT training; (3) algorithm development and benchmarking for hearing aid fitting systems; and (4) facilitating multi-institutional research collaborations.

Abstract

Hearing loss affects approximately 1.5 billion people globally, yet access to comprehensive audiometric datasets for research remains limited due to privacy constraints. Synthetic data generation offers a promising solution, enabling broader data sharing while preserving privacy. This study developed and validated two complementary approaches for synthesising audiometric data: Kernel Density Estimation (KDE) and Variational Autoencoders (VAE). Using the National Health and Nutrition Examination Survey (NHANES) dataset comprising 36,676 participants with comprehensive hearing assessments, we trained both generative models and evaluated synthetic data quality through a rigorous Train-on-Synthetic-Test-on-Real (TSTR) machine learning validation framework and blinded expert clinical assessment by two independent audiologists. The VAE approach achieved 86.3% utility for hearing loss prediction relative to the real-data benchmark (Train-on-Real-Test-on-Real). Both methods demonstrated strong privacy preservation, with zero exact record matches and robust resistance to membership inference attacks. Statistical validation confirmed equivalence within clinically negligible margins (<1 dB HL) across all audiometric frequencies. Blinded assessment of 85 patient profiles by two independent expert audiologists revealed that VAE synthetic data achieved high clinical plausibility ratings, with 96.7% of VAE profiles rated as plausible, compared to 13.3% for KDE. Inter-rater reliability was moderate (Cohen’s weighted κ = 0.553, ICC = 0.556), with 84.7% of ratings within one point, and both raters independently ranked VAE above real data, and real data above KDE. These findings establish validated methodologies for generating privacy-preserving synthetic audiometric data suitable for machine learning applications and clinical education, addressing a critical gap in hearing health research infrastructure.

1. Introduction

Hearing loss represents one of the most prevalent sensory deficits worldwide, affecting approximately 1.5 billion people globally and ranking as the third leading cause of years lived with disability [1,2]. The study of hearing health relies heavily on audiometric data, with pure-tone audiometry (PTA) representing the reference standard for assessing hearing sensitivity across different frequencies [3,4]. Despite the critical importance of audiometric data in advancing hearing healthcare, access to comprehensive datasets remains limited due to regulatory restrictions on sharing patient information [5].
While audiometric data may appear less sensitive than other health information, several factors necessitate privacy-preserving approaches. First, regulatory frameworks including the General Data Protection Regulation (GDPR) and Health Insurance Portability and Accountability Act (HIPAA) classify all health-related data as protected information requiring appropriate safeguards, irrespective of perceived sensitivity [6]. Second, audiometric profiles can reveal information beyond hearing status: characteristic noise-induced hearing loss patterns may indicate occupational history in industrial or military settings [7]; ototoxic hearing loss patterns can suggest chemotherapy or aminoglycoside antibiotic exposure [8]; and certain audiometric configurations are associated with hereditary conditions [9]. Third, institutional data governance policies typically apply uniform restrictions to all patient data, creating practical barriers to sharing even ostensibly low-risk information. Finally, the methods developed for audiometric data are directly transferable to more sensitive health domains, positioning this work as a proof-of-concept for broader healthcare applications.
The emergence of machine learning applications in healthcare has intensified the need for large, representative datasets to train and validate predictive models. In hearing health, artificial intelligence offers transformative potential across the care pathway, from automated audiogram interpretation and hearing loss risk prediction to personalised hearing aid fitting and population health monitoring [10,11]. Recent advances include unsupervised learning approaches for phenotyping sensorineural hearing loss [12], and natural language processing methods for extracting clinical information from audiological records [13]. However, the sensitive nature of health data creates challenges in balancing the data requirements of modern machine learning approaches with the ethical and legal obligations to protect patient privacy under frameworks such as the GDPR and HIPAA [6,14]. While synthetic data generation methods have been developed and validated for electronic health records and medical imaging, no comparable work has addressed audiometric data, leaving hearing health research without validated approaches for privacy-preserving data sharing.
Synthetic data generation has emerged as a promising solution to this challenge. Unlike traditional anonymisation techniques that modify existing records, synthetic data generation creates entirely new observations that maintain statistical relationships without corresponding to any actual individuals [5,15]. This approach offers several advantages: it mitigates privacy concerns by reducing the risk of patient identification; it can enhance representation of rare conditions and under-represented demographics; it enables broader access for research; and it facilitates the development and validation of machine learning models [16].

1.1. Synthetic Data in Healthcare

The application of synthetic data in healthcare has gained substantial traction across multiple domains. In clinical research, synthetic data facilitates multi-institutional collaboration without complex data-sharing agreements or regulatory hurdles [17]. For rare diseases, where data collection is inherently limited, synthetic approaches can augment small datasets to enable more robust statistical analyses [5]. In medical education, synthetic cases provide learners with exposure to diverse clinical scenarios without patient consent concerns [18]. Healthcare data synthesis also addresses critical challenges in algorithmic fairness, allowing correction of historical biases in care access that result in under-representation of certain demographic groups [19].
However, healthcare applications impose stringent requirements on synthetic data quality that exceed those in other domains. Medical synthetic data must maintain precise clinical relationships, preserve rare but significant patterns, and adhere to physiological constraints while ensuring no information from real patients is disclosed [15]. These requirements necessitate sophisticated generative approaches and rigorous validation frameworks tailored to healthcare applications. Recent advances have expanded the methodological toolkit for tabular health data synthesis, including conditional tabular GANs (CTGANs), tabular VAEs (TVAEs), and hybrid architectures combining multiple generative paradigms [20,21]. Comparative benchmarking studies have established evaluation frameworks for assessing fidelity, utility, and privacy across these approaches [22].

1.2. Machine Learning Approaches for Synthetic Data Generation

Two primary methodological approaches have emerged for synthetic data generation. Traditional statistical methods such as Kernel Density Estimation (KDE) estimate probability distributions of variables and their relationships from observed data [23]. KDE offers interpretability and relatively lightweight computational requirements, making it accessible for clinical applications. The method uses kernel functions to estimate the probability density function non-parametrically, with bandwidth selection determining the smoothness of the estimated distribution. Scott’s rule provides an automatic bandwidth selection method based on the data’s standard deviation and sample size [24].
More advanced approaches utilise deep generative models, particularly Variational Autoencoders (VAEs). VAEs learn compressed probabilistic representations of data distributions, enabling generation of new samples by sampling from and decoding a learned latent space [25]. The VAE architecture consists of an encoder network that maps input data to a distribution over latent variables, and a decoder network that reconstructs data from latent samples. The model is trained by optimising the evidence lower bound on the data likelihood, balancing reconstruction fidelity with latent space regularisation through Kullback–Leibler divergence [26].

1.3. Specialised Considerations for Hearing Health Data

Audiometric data presents distinct challenges for synthetic data generation that differentiate it from other health data domains. Unlike electronic health records comprising heterogeneous clinical notes, laboratory values, and diagnostic codes, or medical imaging with high-dimensional pixel arrays, audiometric data exhibits a structured multi-dimensional format: pure-tone thresholds are recorded across multiple frequencies (typically 0.5–8 kHz) for both ears, creating an inherent tensor structure with known inter-variable relationships [3]. This structure imposes specific constraints that general-purpose synthetic data methods may not adequately capture.
Three characteristics distinguish audiometric data synthesis from broader health data generation efforts. First, physiological correlation structures must be preserved: adjacent frequencies exhibit strong positive correlations (a threshold elevation at 2 kHz predicts elevation at 4 kHz), and bilateral measurements are correlated yet constrained by clinically plausible interaural asymmetry limits (typically <40 dB difference at any frequency) [4]. Second, aetiology-specific patterns must emerge naturally from the generative process: noise-induced hearing loss presents with a characteristic notch at 3–6 kHz [7]; presbycusis produces sloping high-frequency loss above 2 kHz [27,28]; and conductive losses create air-bone gaps with features such as the Carhart notch at 2–4 kHz [29,30]. Third, demographic covariance must be maintained: males typically display greater high-frequency threshold elevations than females, while hearing sensitivity decreases with age in predictable patterns, accelerating after age 60 [27,31,32].
These domain-specific requirements motivated our methodological choices. The VAE architecture was designed with sufficient latent dimensionality (64 dimensions) to encode the complex correlation structure across 10 frequency-ear combinations plus demographic covariates. The KDE approach jointly estimates multivariate density across all audiometric and demographic variables simultaneously, preserving the correlation structure inherent in the training data. Our validation framework incorporates expert clinical review specifically because audiometric plausibility can be rapidly assessed by trained audiologists examining threshold configurations. This validation approach is less feasible for synthesised laboratory panels or imaging data, where clinical review of individual records would be prohibitively time-consuming.
Figure 1 provides an overview of the complete methodological pipeline, illustrating the flow from data acquisition through synthetic data generation to multi-faceted evaluation.

1.4. Validation Framework for Synthetic Data

Validation of synthetic data requires assessment of both statistical fidelity and downstream utility. The Train-on-Synthetic-Test-on-Real (TSTR) framework provides a rigorous methodology for evaluating whether machine learning models trained on synthetic data can perform comparably to those trained on real data [33]. In this framework, models are trained exclusively on synthetic data and evaluated on held-out real data. The ratio of TSTR performance to baseline Train-on-Real-Test-on-Real (TRTR) performance quantifies the relative discriminative fidelity of the synthetic data.
Complementary evaluation scenarios include Train-on-Real-Test-on-Synthetic (TRTS), which assesses pattern matching between real and synthetic distributions, and Train-on-Synthetic-Test-on-Synthetic (TSTS), which measures the internal consistency of the synthetic data [33]. Together, these scenarios provide a comprehensive assessment of synthetic data quality for machine learning applications.

1.5. The Mi & Sun [34] Study as Validation Benchmark

To provide an externally validated benchmark for synthetic data evaluation, we replicated the methodology of Mi and Sun [34], who investigated the association between heavy metal exposure and hearing loss using the National Health and Nutrition Examination Survey (NHANES) data. Their study employed eight machine learning classifiers to predict hearing loss from 20 features encompassing demographics, blood and urinary metal concentrations, dietary factors, and health indicators. The study achieved high classification performance (mean AUC = 0.91) and identified age, gender, hypertension, and noise exposure as the primary predictors of hearing loss.
By training synthetic data generators on the same dataset and evaluating whether models trained on synthetic data can replicate the published classification performance, we provide a rigorous test of synthetic data utility grounded in reproducible methodology.

1.6. Research Objectives

NHANES represents the largest openly available source of audiometric data, offering comprehensive hearing assessments from the American population spanning multiple decades [35]. As a publicly available dataset, NHANES does not itself require privacy protection. Rather, it serves as an openly verifiable test case for developing and validating privacy preservation methods: because the training data are accessible, the scientific community can independently assess the privacy properties of the resulting synthetic data. Establishing these methods on a public benchmark is a necessary precursor to their application with genuinely sensitive clinical datasets. This study addresses the gap in validated synthetic audiometric data generation through four primary objectives:
  • Characterise the demographic and audiometric profiles within the NHANES dataset, establishing baseline distributions and relationships.
  • Develop and compare KDE and VAE approaches for generating synthetic audiometric data.
  • Implement a comprehensive TSTR validation framework to assess synthetic data discriminative fidelity for hearing loss prediction.
  • Evaluate privacy preservation through multiple complementary metrics including membership inference attack resistance.
By developing methodologies for generating and validating synthetic audiometric data on a publicly accessible benchmark, we aim to establish transparent, reproducible methods that can subsequently be applied to sensitive clinical datasets where privacy protection is required. The synthetic datasets and methods produced through this work have potential to accelerate innovation in hearing diagnostics, enhance educational resources for audiology training, and improve understanding of hearing disorders across diverse populations.

2. Materials and Methods

2.1. Data Source and Preprocessing

2.1.1. NHANES Dataset

The NHANES dataset served as the primary data source for this study. NHANES is a programme of studies designed to assess the health and nutritional status of adults and children in the United States, conducted by the National Center for Health Statistics [35]. We utilised audiometric data from the 1999–2012 and 2015–2020 survey periods, which together represent the largest openly accessible body of audiometric data available.
Data were downloaded directly from the official NHANES website and processed using standardised extraction protocols (see Section 2.1.2, Section 2.1.3, Section 2.1.4 and Section 2.1.5). The final compiled dataset included audiometric measurements from 29,714 participants with complete pure-tone audiometry data. For participants aged over 85 years, age was recoded to 80 years in accordance with NHANES privacy protocols. The dataset was split into training (80%, n = 23,771) and test (20%, n = 5943) sets using stratified random sampling to maintain consistent hearing loss prevalence across splits.
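The stratified 80/20 split can be sketched with scikit-learn; the data below are random placeholders standing in for the compiled NHANES table, while the 80/20 proportions and stratification on the binary hearing-loss label follow the description above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Placeholder for the compiled NHANES table: feature matrix plus binary hearing-loss label.
X = rng.normal(size=(1000, 5))
y = rng.integers(0, 2, size=1000)

# 80/20 stratified split: hearing-loss prevalence stays consistent across partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
```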

2.1.2. Data Integration and Quality Control

Data from multiple NHANES survey cycles (1999–2012 and 2015–2020) were integrated using participant sequence identifiers (SEQNs), with survey cycles tracked to enable cohort-specific analyses. No formal batch effect correction was applied, as NHANES employs standardised audiometric protocols, calibrated equipment, and trained examiners across all survey cycles, minimising systematic inter-cycle variation.
Missing data were handled using a three-stage approach. First, variables with greater than 95% missing data (primarily laboratory measurements not available across all survey cycles) were excluded from the feature set. Second, for pure-tone audiometry data, complete case analysis was applied: participants with any missing PTA values at the standard frequencies were excluded to ensure all synthetic data models were trained on fully observed audiometric profiles. Third, for non-audiological variables (demographics, health indicators) retained after the 95% exclusion threshold, remaining missing values were imputed using Multiple Imputation by Chained Equations (MICE) with predictive mean matching prior to model training [36].
Data quality validation included physiological plausibility checks constraining hearing thresholds to the clinically valid range of −20 to 120 dB HL. Thresholds outside this range were flagged and clipped to boundary values. Age validation ensured all participants fell within the NHANES audiometry protocol age range (12–85 years).

2.1.3. Audiometric Variables

Pure-tone audiometry measurements were extracted for both ears at five standard frequencies: 0.5, 1, 2, 4, and 8 kHz. These frequencies represent the clinically relevant range for speech understanding and are standard in audiological assessment [3]. Hearing thresholds were measured in decibels hearing level (dB HL), with higher values indicating greater hearing loss.
Tympanometric data were available in NHANES for a subset of participants, including 84 pressure-compliance measurement points per ear spanning −300 to +198 dekapascals (daPa). However, due to substantial missingness in tympanometric variables across survey cycles and the primary focus on pure-tone audiometry for hearing loss classification, tympanometric data were not included in the synthetic data generation feature set. Future work may explore tympanometric data synthesis as sample sizes permit.

2.1.4. Demographic and Health Variables

Demographic and health-related variables included age, gender, race/ethnicity (Non-Hispanic White, Non-Hispanic Black, Mexican American, Other Hispanic, Other Race), body mass index, blood pressure measurements, and self-reported noise exposure history.

2.1.5. Hearing Loss Classification

Hearing loss was defined using the standard clinical criterion of hearing thresholds exceeding 25 dB HL at any measured frequency (0.5, 1, 2, 4, or 8 kHz) in either ear [2]. This binary classification (hearing loss present/absent) served as the primary outcome variable for machine learning validation experiments.
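As a minimal sketch of this labelling rule (the threshold matrix below is invented for illustration), the binary outcome can be derived directly from the 10 ear-frequency thresholds:

```python
import numpy as np

# Hypothetical threshold matrix: rows = participants, columns = 10 ear-frequency
# combinations (0.5, 1, 2, 4, 8 kHz for each ear), values in dB HL.
thresholds = np.array([
    [10, 15, 10, 20, 25,  10, 10, 15, 20, 25],  # all <= 25 dB HL -> no hearing loss
    [ 5, 10, 15, 40, 55,   5, 10, 10, 30, 50],  # >25 dB HL at 4 and 8 kHz -> hearing loss
])

# Binary label: any threshold exceeding 25 dB HL at any frequency in either ear.
hearing_loss = (thresholds > 25).any(axis=1).astype(int)
```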

2.2. Synthetic Data Generation Methods

2.2.1. Kernel Density Estimation

Kernel Density Estimation is a non-parametric method for estimating the probability density function of a random variable [23]. For multivariate data, KDE estimates the joint probability distribution across all variables, thereby capturing correlations and dependencies in the data.
Our KDE implementation used a Gaussian kernel function with bandwidth determined using Scott’s rule [24]:
h = n^(−1/(d+4)) · σ
where n is the sample size, d is the number of dimensions, and σ is the standard deviation of each dimension. This automatic bandwidth selection method provides a balance between bias and variance in the density estimate without requiring manual tuning.
To generate synthetic data, we sampled from the fitted KDE model using a k-dimensional tree algorithm with leaf size of 40 for efficient density estimation. All variables (audiometric thresholds and demographic features) were generated jointly to preserve multivariate relationships. Complete hyperparameter specifications are provided in Supplementary Table S1.
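A sketch of this procedure using scikit-learn's `KernelDensity` is shown below. The two-dimensional toy data stand in for the joint audiometric-demographic matrix, and Scott's bandwidth is applied to standardised features (one reasonable reading of the per-dimension σ scaling), so details may differ from the study's exact implementation.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Toy stand-in for the training matrix (audiometric thresholds + demographics).
X = rng.multivariate_normal([20, 25], [[40, 30], [30, 50]], size=2000)

# Scott's rule on standardised data: h = n^(-1/(d+4)); rescaling by each
# dimension's standard deviation recovers the per-dimension bandwidth h * sigma.
n, d = X.shape
mu, sigma = X.mean(axis=0), X.std(axis=0)
Z = (X - mu) / sigma
h = n ** (-1.0 / (d + 4))

kde = KernelDensity(kernel="gaussian", bandwidth=h, algorithm="kd_tree", leaf_size=40)
kde.fit(Z)

# Joint sampling preserves the multivariate correlation structure; undo the scaling.
synthetic = kde.sample(n_samples=1000, random_state=1) * sigma + mu
```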

2.2.2. Variational Autoencoder

Variational Autoencoders (VAEs) are deep generative models that learn a compressed probabilistic representation of the input data distribution [25]. Our VAE architecture consisted of:
  • Encoder: Input layer → Dense(512, ReLU) → Dense(256, ReLU) → Dense(128, ReLU) → Latent mean and log-variance (64 dimensions)
  • Decoder: Latent sample (64) → Dense(128, ReLU) → Dense(256, ReLU) → Dense(512, ReLU) → Output layer
The model was trained using the standard VAE objective combining reconstruction loss (mean squared error) and KL divergence regularisation:
L = E_{q(z|x)}[log p(x|z)] − β · D_KL(q(z|x) ‖ p(z))
where β = 0.5 controls the trade-off between reconstruction fidelity and latent space regularisation [37]. The reduced β value (compared with the standard VAE, where β = 1) was selected to prevent posterior collapse while maintaining meaningful latent representations. Training used the Adam optimiser with learning rate 0.001 for 200 epochs with batch size 64. Dropout regularisation (probability 0.1) and batch normalisation were applied to all hidden layers. Early stopping with a patience of 30 epochs was implemented based on validation loss, with 20% of training data held out for validation. β-annealing was employed, gradually increasing β from 0 to 0.5 over the first 50 epochs to facilitate initial reconstruction learning before regularisation [38]. To generate synthetic data, we sampled from the standard normal prior p(z) = N(0, I) and passed samples through the trained decoder. Complete hyperparameter specifications and rationale are provided in Supplementary Table S1. Figure 2 illustrates the complete VAE architecture.
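The objective and annealing schedule can be written out directly. The NumPy sketch below (not the study's code) shows the closed-form KL term for a diagonal Gaussian posterior and the linear β warm-up described above.

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var, beta=0.5):
    """Negative-ELBO-style objective: MSE reconstruction plus beta-weighted KL
    between the Gaussian posterior q(z|x) = N(mu, exp(log_var)) and N(0, I)."""
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=1))
    # Closed-form KL for diagonal Gaussians: -0.5 * sum(1 + log_var - mu^2 - exp(log_var))
    kl = np.mean(-0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var), axis=1))
    return recon + beta * kl

def beta_schedule(epoch, warmup=50, beta_max=0.5):
    """Linear beta-annealing: 0 -> beta_max over the first `warmup` epochs."""
    return beta_max * min(epoch / warmup, 1.0)
```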

2.2.3. Consideration of Alternative Approaches

Traditional oversampling methods such as SMOTE (Synthetic Minority Over-sampling Technique) [39] and ROSE (Random Over-Sampling Examples) [40] were considered but not adopted for this study. While these methods achieve near-perfect utility in machine learning applications (Supplementary Table S2), they operate by duplicating real records (ROSE) or interpolating between existing data points (SMOTE), providing no meaningful privacy protection. Our evaluation found exact match rates of 58.9% for SMOTE and 100% for ROSE, with membership inference attack success rates exceeding 80%. These characteristics render such methods unsuitable for healthcare applications where data sensitivity and privacy preservation are primary concerns. The generative approaches presented here (KDE and VAE) create genuinely novel records by learning the underlying data distribution, achieving strong privacy protection while maintaining acceptable utility for downstream applications.

2.3. Machine Learning Validation Framework

2.3.1. TSTR Framework

We implemented the Train-on-Synthetic-Test-on-Real (TSTR) validation framework introduced by Esteban et al. [33], with four evaluation scenarios (Table 1):
The primary evaluation metric was the TSTR/TRTR ratio, representing the relative discriminative fidelity of synthetic data. Following Esteban et al. [33] and subsequent work [15], we report this ratio as a percentage, with values approaching 100% indicating synthetic data that fully preserves the predictive relationships in real data.

2.3.2. Machine Learning Models

Following the methodology of Mi & Sun [34], we implemented eight classifiers for hearing loss prediction:
  • Logistic Regression (LR)
  • Support Vector Machine (SVM) with RBF kernel
  • Random Forest (RF) with 100 estimators
  • Decision Tree (DT)
  • K-Nearest Neighbours (KNN) with k = 5
  • Gradient Boosting Machine (GBM) with 100 iterations
  • Feedforward Neural Network (FNN) with two hidden layers
  • XGBoost with L1/L2 regularisation
For the TRTR baseline, models were trained and evaluated using 10-fold stratified cross-validation with Area Under the Receiver Operating Characteristic Curve (AUC-ROC) as the primary performance metric, with results reported as mean ± standard deviation. For TSTR evaluation, models were trained on synthetic data and evaluated on held-out real test data (20% split), providing an independent assessment of synthetic data utility. F1 score, precision, and recall were computed as secondary metrics.
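As an illustrative sketch of the TSTR/TRTR comparison with one of the eight classifiers (gradient boosting), using invented placeholder data in place of NHANES and the synthetic generators:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_data(n):
    # Placeholder data: one informative feature driving a binary label.
    X = rng.normal(size=(n, 4))
    y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

X_real_train, y_real_train = make_data(2000)
X_real_test, y_real_test = make_data(1000)
X_synth, y_synth = make_data(2000)  # stand-in for KDE/VAE output

# TRTR baseline: train and test on real data.
trtr = roc_auc_score(
    y_real_test,
    GradientBoostingClassifier(random_state=0)
    .fit(X_real_train, y_real_train)
    .predict_proba(X_real_test)[:, 1],
)

# TSTR: train exclusively on synthetic data, evaluate on held-out real data.
tstr = roc_auc_score(
    y_real_test,
    GradientBoostingClassifier(random_state=0)
    .fit(X_synth, y_synth)
    .predict_proba(X_real_test)[:, 1],
)

utility = 100 * tstr / trtr  # relative discriminative fidelity, in percent
```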

2.3.3. Feature Importance Analysis

SHAP (SHapley Additive exPlanations) values were computed for gradient boosting models to assess feature importance [41]. Mean absolute SHAP values were used to rank features, and the correlations between real and synthetic data feature importance rankings were computed to assess the preservation of predictive relationships. As a post-hoc analysis to investigate the mechanistic basis of any observed ranking divergence, Spearman rank correlation matrices were computed for all 20 Mi & Sun [34] features across real, KDE-synthetic, and VAE-synthetic data. Pairwise correlation differences (Δρ = ρ_synthetic − ρ_real) were computed, and the mean absolute difference (MAD) across all unique feature pairs was used as a summary measure of correlation structure distortion. Higher MAD values indicate greater departure from the real data’s inter-variable relationships, providing a single metric for how faithfully each generative method preserves the joint feature structure.

2.4. Statistical Validation

2.4.1. Equivalence Testing

Statistical equivalence between real and synthetic data distributions was assessed using the Two One-Sided Test (TOST) procedure [42]. The equivalence margin was set to ϵ = 1 dB HL, representing a clinically negligible difference in hearing thresholds [3]. Tests were performed at the 0.995 confidence level to account for multiple comparisons across frequencies and ears.
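A minimal TOST sketch for one frequency is shown below, assuming independent samples and a simple pooled degrees-of-freedom approximation (the study's exact procedure may differ); the ±1 dB HL margin and α = 0.005 follow the description above.

```python
import numpy as np
from scipy import stats

def tost_equivalence(real, synth, margin=1.0, alpha=0.005):
    """Two One-Sided Tests (TOST) for mean equivalence within +/- margin dB HL.
    Returns True when both one-sided tests reject at level alpha, i.e. the mean
    difference is demonstrably inside the equivalence margin."""
    diff = real.mean() - synth.mean()
    se = np.sqrt(real.var(ddof=1) / len(real) + synth.var(ddof=1) / len(synth))
    df = len(real) + len(synth) - 2                      # simple pooled approximation
    p_lower = 1 - stats.t.cdf((diff + margin) / se, df)  # H0: diff <= -margin
    p_upper = stats.t.cdf((diff - margin) / se, df)      # H0: diff >= +margin
    return bool(max(p_lower, p_upper) < alpha)

rng = np.random.default_rng(0)
real = rng.normal(30, 10, size=20000)      # hypothetical 4 kHz thresholds, dB HL
synth = rng.normal(30.05, 10, size=20000)  # mean shifted by a negligible 0.05 dB
```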

2.4.2. Distribution Comparison

Distribution similarity was assessed through multiple complementary metrics:
  • Kolmogorov–Smirnov (KS) statistic: Maximum difference between cumulative distribution functions.
  • Wasserstein distance: Earth mover’s distance between distributions.
  • Maximum Mean Discrepancy (MMD): Kernel-based distribution comparison.
  • Chi-square test for categorical variables.
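The first three metrics can be sketched with SciPy plus a small hand-rolled MMD estimator (the RBF bandwidth `gamma` here is an arbitrary illustrative choice, and the samples are invented stand-ins for one audiometric variable):

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(0)
real = rng.normal(25, 12, size=3000)   # hypothetical 2 kHz thresholds, dB HL
synth = rng.normal(25, 12, size=3000)  # well-matched synthetic sample

ks_stat, ks_p = ks_2samp(real, synth)       # max CDF gap; 0 = identical distributions
w_dist = wasserstein_distance(real, synth)  # earth mover's distance, in dB HL

def mmd_rbf(x, y, gamma=0.01):
    """Maximum Mean Discrepancy with an RBF kernel (biased V-statistic estimate)."""
    def k(a, b):
        return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

mmd = mmd_rbf(real[:500], synth[:500])  # subsample to keep the kernel matrix small
```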

2.4.3. Correlation Preservation

Preservation of multivariate relationships was assessed by computing Pearson correlation matrices for audiometric variables in both real and synthetic datasets. The mean absolute difference in correlation coefficients across all variable pairs quantified correlation preservation fidelity.
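A sketch of this correlation-preservation metric, using toy bivariate Gaussian samples in place of the audiometric variables:

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.7], [0.7, 1.0]])
real = rng.multivariate_normal([0, 0], cov, size=5000)   # stand-in for real thresholds
synth = rng.multivariate_normal([0, 0], cov, size=5000)  # well-matched synthetic sample

corr_real = np.corrcoef(real, rowvar=False)
corr_synth = np.corrcoef(synth, rowvar=False)

# Mean absolute difference over the unique off-diagonal pairs (upper triangle).
iu = np.triu_indices_from(corr_real, k=1)
mad = np.abs(corr_real[iu] - corr_synth[iu]).mean()
```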

2.5. Privacy Validation

For privacy evaluation we employed a multi-faceted approach combining established metrics to assess different aspects of privacy risk. Core metrics are reported in the main results; extended evaluation results are provided in Supplementary Table S2.

2.5.1. Distance to Closest Record

For each synthetic record, we computed the distance to closest record (DCR): the minimum Euclidean distance to any record in the training set [43]. The DCR distribution was analysed to ensure synthetic records did not memorise training examples. We additionally computed the memorisation ratio, defined as the ratio of mean DCR to holdout data versus mean DCR to training data; values near 1.0 indicate no preferential proximity to training records.
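The DCR computation and memorisation ratio can be sketched with a k-d tree; the random matrices below stand in for the training, holdout, and synthetic sets:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 5))      # stand-in for training records
holdout = rng.normal(size=(2000, 5))    # real records never seen by the generator
synthetic = rng.normal(size=(2000, 5))  # stand-in for generator output

dcr_train, _ = cKDTree(train).query(synthetic)  # distance to closest training record

# Memorisation ratio: mean DCR to holdout vs. mean DCR to training data.
# Values near 1.0 mean the generator sits no closer to its training set than
# to unseen data, i.e. no evidence of memorisation.
dcr_holdout, _ = cKDTree(holdout).query(synthetic)
memorisation_ratio = dcr_holdout.mean() / dcr_train.mean()

exact_matches = int((dcr_train == 0).sum())  # zero for genuinely novel records
```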

2.5.2. Membership Inference Attack

Membership inference attacks attempt to determine whether a specific record was used in model training [44]. We implemented both shadow model and distance-based approaches. In the shadow model approach, an attacker model was trained to distinguish between records from the training set and held-out records based on model confidence scores. In the distance-based approach, synthetic records were classified as members if their DCR fell below a threshold. Attack success rate close to 50% (chance) indicates strong privacy preservation.
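A sketch of the distance-based variant (the shadow-model attack is omitted); the attacker's threshold choice here, the pooled median, is an illustrative assumption:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
synthetic = rng.normal(size=(2000, 5))   # stand-in for generator output
members = rng.normal(size=(500, 5))      # records that were in the training set
non_members = rng.normal(size=(500, 5))  # records that were not

# Distance-based attack: call a record a "member" when its nearest synthetic
# neighbour lies closer than a threshold chosen by the attacker.
tree = cKDTree(synthetic)
d_members, _ = tree.query(members)
d_non, _ = tree.query(non_members)
threshold = np.median(np.concatenate([d_members, d_non]))

correct = (d_members < threshold).sum() + (d_non >= threshold).sum()
attack_success = correct / (len(members) + len(non_members))
# Success near 0.5 (chance) indicates the synthetic data leaks no membership signal.
```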

2.5.3. Attribute Inference Attack

Attribute inference attacks assess whether sensitive attributes can be inferred from synthetic data given partial knowledge of a target record [45]. For each target, the attacker observes all attributes except one sensitive attribute and attempts to predict the missing value using the synthetic dataset. We evaluated inference accuracy for demographic attributes (age and gender) and health indicators.

2.5.4. Re-Identification Risk

Re-identification risk quantifies the proportion of synthetic records that could potentially be linked to real individuals [46]. A synthetic record is considered at risk if its DCR falls below a distance threshold t, indicating sufficient similarity to a training record to enable linkage. We evaluated risk at multiple thresholds to characterise the privacy–utility trade-off.

2.6. Clinical Validation

For clinical validation, we used a novel method for systematically assessing the plausibility of synthetic data. Clinical plausibility here is defined as the degree to which a synthetic patient record exhibits internally coherent relationships between demographic characteristics, health indicators, risk factors and audiometric findings that are consistent with patterns observed in routine audiological practice. This was assessed on a 5-point Likert scale from “1: Implausible” to “5: Completely plausible”. Two expert clinical audiologists independently rated 85 blinded patient profiles: one study author (R.K.) and one independent audiologist (E.D.; see Acknowledgements). Neither rater was informed which profiles were real or synthetic. Each clinician considered patient information (age, gender, BMI, diabetes (yes/no), and systolic and diastolic blood pressure) as well as the audiogram, and rated the plausibility of each profile on the Likert scale. Five categorical assessment criteria (Age-Hearing Appropriate, Audiogram Consistent, Health Indicators Consistent, Pattern Seen in Practice, Audiogram Usual) were also recorded per profile using Yes/Maybe/No responses. To provide a broad range of patient cohorts, the researcher sampled NHANES participants, and generated synthetic records, from the following groups:
  • Normal hearing participants.
  • Participants with mild, moderate, severe and profound hearing loss.
  • Participants with flat, low- and high-frequency hearing loss profiles.
  • Participants with Diabetes.
  • Participants over the age of 60.
  • For synthetic data, low probability samples for the KDE and VAE (i.e., records with low likelihood under the generative model, representing edge cases).
Five participants were sampled from each of the above groups for each source, providing a total of 85 profiles: 25 from NHANES, 30 from KDE and 30 from VAE (the additional five profiles per synthetic source correspond to the low-probability samples). These were presented in a completely random order to each clinician independently to reduce possible bias towards or against a data source. Combined plausibility scores were computed as the mean of both raters’ ratings. All participant reports (Supplementary Material S3), the rating template (Supplementary Material S4), the patient registry key linking anonymised IDs to data sources (Supplementary Material S5), and the completed ratings from both raters (Supplementary Material S6) are available in the Supplementary Materials.

Statistical Analysis of Clinical Plausibility

To assess differences in clinical plausibility ratings across data sources, the Kruskal–Wallis H test was employed as an omnibus non-parametric test suitable for ordinal Likert scale data, using the combined (means of both raters) plausibility scores. Post-hoc pairwise comparisons between data sources (NHANES vs. KDE, NHANES vs. VAE, KDE vs. VAE) were conducted using Mann–Whitney U tests. Additionally, Mann–Whitney U tests were performed to compare plausibility ratings between KDE and VAE within each patient cohort (normal hearing, mild loss, moderate loss, sloping loss, elderly, and low probability). To control for multiple comparisons, Benjamini–Hochberg False Discovery Rate (FDR) correction was applied to all pairwise tests, with significance determined at a corrected threshold of p < 0.05.
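The testing pipeline described above can be sketched as follows. The ratings are illustrative placeholders rather than the study data, and the Benjamini–Hochberg step-up correction is implemented inline for transparency rather than via a library call:

```python
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

# Illustrative combined plausibility ratings (not the study data)
nhanes = [4, 4, 5, 3, 4, 5, 4, 3, 4, 5]
kde    = [3, 2, 3, 2, 4, 3, 2, 3, 2, 3]
vae    = [5, 4, 5, 4, 5, 4, 5, 5, 4, 5]

# Omnibus non-parametric test across the three sources
h_stat, p_omnibus = kruskal(nhanes, kde, vae)

# Post-hoc pairwise Mann-Whitney U tests
raw_p = np.array([mannwhitneyu(a, b).pvalue
                  for a, b in [(nhanes, kde), (nhanes, vae), (kde, vae)]])

# Benjamini-Hochberg FDR correction (step-up procedure)
m = len(raw_p)
order = np.argsort(raw_p)
scaled = raw_p[order] * m / np.arange(1, m + 1)
adjusted = np.minimum(np.minimum.accumulate(scaled[::-1])[::-1], 1.0)
p_adj = np.empty(m)
p_adj[order] = adjusted  # adjusted p-values, compared against 0.05
```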
Inter-rater reliability was assessed using Cohen’s weighted kappa with quadratic weights for ordinal plausibility ratings, intraclass correlation coefficient (ICC, two-way random, single measures, and ICC(2,1)), Spearman’s rank correlation, and percentage agreement (exact and within one scale point). For the binary assessment criteria, inter-rater agreement was assessed using unweighted Cohen’s kappa. Interpretation of kappa values followed the Landis and Koch classification [47].
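For reference, the main agreement statistics can be computed as below; the paired ratings are invented for illustration, and the ICC (which requires a dedicated routine, e.g. from `pingouin`) is omitted from this sketch:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

# Illustrative paired ratings from two raters on a 1-5 Likert scale
r1 = np.array([4, 5, 3, 2, 4, 5, 1, 3, 4, 2])
r2 = np.array([4, 4, 3, 3, 5, 5, 2, 3, 4, 2])

# Quadratically weighted kappa, appropriate for ordinal ratings
kappa_w = cohen_kappa_score(r1, r2, weights="quadratic")

rho, p = spearmanr(r1, r2)                       # rank correlation
exact = float(np.mean(r1 == r2))                 # exact agreement
within_one = float(np.mean(np.abs(r1 - r2) <= 1))  # within one point
```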

2.7. Software and Reproducibility

All analyses were performed using Python 3.10 with scikit-learn (v1.3), TensorFlow (v2.15), SHAP (v0.44), and pandas/numpy. The complete codebase is available at https://github.com/evidENT-AI/SyntHH (accessed on 13 March 2026).

3. Results

3.1. Demographic Characteristics of the NHANES Dataset

The NHANES dataset (1999–2020) contained audiometric measurements from 29,714 participants (Table 2). The gender distribution was approximately balanced, with 14,647 males (49.3%) and 15,067 females (50.7%). The mean age was 37.8 years (SD = 23.5, median = 51.0), ranging from 12 to 85 years. Figure 3 illustrates the age distribution by gender and cohort.
The racial/ethnic composition reflected the NHANES stratified sampling approach, which oversamples certain demographic groups to ensure adequate representation for health disparity analyses.
Using the criterion of hearing thresholds exceeding 25 dB HL at any frequency in either ear, 12,806 participants (43.1%) were classified as having hearing loss. This prevalence exceeds general population estimates of approximately 14–20% [48], reflecting the broad definition employed and NHANES sampling methodology.

3.2. Audiometric Profiles

Correlation Patterns

Correlation analysis revealed significant relationships between hearing thresholds at different frequencies (Table 3, Figure 4). Strong positive correlations were observed between adjacent frequencies, with correlation strength decreasing as frequency separation increased. Hearing thresholds at 0.5 kHz and 1 kHz were strongly correlated (r = 0.85, p < 0.001), while the correlation between 0.5 kHz and 8 kHz was more moderate (r = 0.57, p < 0.001).
Age demonstrated strong positive correlations with hearing thresholds at all frequencies, with the strongest relationship at 8 kHz (r = 0.79, p < 0.001), consistent with the high-frequency pattern of presbycusis. Gender was significantly correlated with hearing thresholds, with males showing higher thresholds (poorer hearing) at higher frequencies, particularly at 4 kHz (r = −0.18, p < 0.001). Participants with hearing loss demonstrated characteristic elevation of thresholds at higher frequencies (4 kHz and 8 kHz) compared to lower frequencies, possibly due to age-related sensorineural hearing loss or ototoxicity patterns.

3.3. Synthetic Data Generation

Both KDE and VAE methods successfully generated synthetic datasets matching the original data dimensions (n = 29,714). Visual comparison of synthetic audiogram distributions confirmed plausible audiometric patterns, with both methods preserving the characteristic frequency-dependent threshold distributions observed in the real data (Figure 5).

3.4. Statistical Validation

3.4.1. Equivalence Testing

Equivalence testing confirmed that both synthetic datasets were statistically equivalent to real data within the 1 dB HL margin across all frequencies (Table 4). All tests achieved p < 0.001, providing strong evidence of equivalence. The largest mean difference was 0.12 dB HL (8 kHz right ear), well within the clinically negligible range.
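Equivalence within a margin can be sketched with a two one-sided tests (TOST) procedure; the simulated thresholds below, with a mean difference well inside 1 dB HL, approximate but do not reproduce the study's exact test:

```python
import numpy as np
from scipy.stats import ttest_ind

def tost_equivalence(real, synth, margin=1.0):
    """TOST sketch: equivalence is declared when BOTH one-sided tests
    reject, i.e. when the larger of the two p-values is below alpha."""
    # H1a: mean(real) - mean(synth) > -margin
    p_lower = ttest_ind(real + margin, synth, alternative="greater").pvalue
    # H1b: mean(real) - mean(synth) < +margin
    p_upper = ttest_ind(real - margin, synth, alternative="less").pvalue
    return max(p_lower, p_upper)

# Simulated thresholds in dB HL (not NHANES data)
rng = np.random.default_rng(0)
real = rng.normal(25.0, 10.0, 20000)
synth = rng.normal(25.1, 10.0, 20000)  # mean difference ~0.1 dB HL
p_tost = tost_equivalence(real, synth, margin=1.0)
```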

3.4.2. Correlation Preservation

VAE demonstrated superior preservation of multivariate relationships compared to KDE. The mean absolute difference in correlation coefficients was 0.18 for VAE versus 0.36 for KDE. Both methods preserved the strong positive correlations between adjacent frequencies and between corresponding frequencies across ears.

3.5. Machine Learning Validation

3.5.1. Baseline Performance (TRTR)

The baseline TRTR scenario established strong predictive performance for hearing loss classification using 10-fold stratified cross-validation (Table 5). XGBoost achieved the highest mean CV AUC (0.947 ± 0.012), followed by Random Forest (0.939 ± 0.015). The mean TRTR AUC across all models was 0.913. Test set performance was consistent with CV estimates, indicating robust generalisation. These results are consistent with the performance reported by Mi and Sun [34].

3.5.2. TSTR Results

The VAE synthetic data achieved higher TSTR scores than KDE across most models (Table 6). The mean TSTR/TRTR ratio was 86.3% for VAE and 72.6% for KDE. Notably, performance varied substantially across model architectures: Logistic Regression achieved near-perfect utility (99.1%), while Neural Networks reached 90.5%. Tree-based ensemble methods showed more modest performance (Random Forest: 84.1%, XGBoost: 80.6%), suggesting that simpler linear models may be better suited for downstream applications when training on VAE synthetic data. Figure 6 shows comparative ROC curves for selected models. Note that TSTR evaluation uses held-out real data for testing, providing an independent validation of synthetic data utility distinct from cross-validation.
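The TSTR/TRTR utility ratio reported above can be sketched as follows; the classification dataset and the noisy copy that plays the role of the synthetic data are stand-ins for the real and generated NHANES records:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def tstr_utility(X_real, y_real, X_synth, y_synth):
    """Sketch of the TSTR/TRTR ratio: train one model on real data and
    one on synthetic data, score both on the same held-out real data."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_real, y_real, test_size=0.3, stratify=y_real, random_state=0)
    trtr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    tstr = LogisticRegression(max_iter=1000).fit(X_synth, y_synth)
    auc_trtr = roc_auc_score(y_te, trtr.predict_proba(X_te)[:, 1])
    auc_tstr = roc_auc_score(y_te, tstr.predict_proba(X_te)[:, 1])
    return auc_tstr / auc_trtr

# A lightly perturbed copy stands in for the "synthetic" dataset
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
rng = np.random.default_rng(1)
ratio = tstr_utility(X, y, X + rng.normal(0, 0.1, X.shape), y)
```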

3.5.3. Extended Validation

Extended validation scenarios provided additional evidence of VAE superiority (Table 7). VAE achieved near-perfect internal consistency (TSTS mean AUC: 0.991) and strong pattern matching (TRTS mean AUC: 0.907).

3.5.4. Feature Importance Preservation

SHAP analysis revealed that VAE better preserved feature importance rankings than KDE (Figure 7). The top five features for hearing loss prediction in real data were: Age, Gender, Hypertension, Noise Exposure, and Urine Thallium (Tl). VAE preserved three of these top five features (Age, Gender, and Urine Tl) and maintained Age as the dominant predictor by a wide margin. However, secondary feature rankings diverged: Hypertension and Noise Exposure were replaced in the VAE top five by Blood Lead (Pb) and BMI, while Urine Tl shifted from rank 5 to rank 4. KDE showed substantially greater distortion, with heavy metal biomarkers (Urine Tl, Urine Arsenic, and Urine Antimony) dominating the top three and displacing established clinical predictors entirely.
Comparison of Spearman correlation matrices across real, KDE, and VAE data revealed that these ranking shifts are attributable to systematic distortion of pairwise feature correlations during synthetic data generation (Table 8). The mean absolute correlation difference across all 190 unique feature pairs was 0.094 for KDE and 0.113 for VAE.
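The mean absolute correlation difference metric can be sketched as below, with toy random data in place of the study's 20 features:

```python
import numpy as np
import pandas as pd

def mean_abs_corr_diff(real_df, synth_df):
    """Sketch: mean absolute difference in Spearman correlations over
    all unique off-diagonal feature pairs (190 pairs for 20 features)."""
    c_real = real_df.corr(method="spearman").to_numpy()
    c_synth = synth_df.corr(method="spearman").to_numpy()
    iu = np.triu_indices_from(c_real, k=1)  # unique pairs only
    return float(np.abs(c_real[iu] - c_synth[iu]).mean())

rng = np.random.default_rng(0)
real = pd.DataFrame(rng.normal(size=(500, 20)))

delta_same = mean_abs_corr_diff(real, real.copy())  # identical data

shuffled = real.copy()
for col in shuffled.columns:  # independent shuffles break joint structure
    shuffled[col] = rng.permutation(shuffled[col].to_numpy())
delta_shuf = mean_abs_corr_diff(real, shuffled)
```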
KDE primarily attenuated inter-metal correlations (e.g., Urine Mo–Urine Sb: Δ ρ = 0.467 ; Urine Cs–Urine As: Δ ρ = 0.417 ), consistent with the tendency of kernel density estimation to smooth heavy-tailed joint distributions [23]. This de-correlation causes downstream classifiers to treat each urinary metal biomarker as carrying independent predictive signals, inflating their individual SHAP importance and producing the heavy metal-dominated rankings observed in the KDE model.
VAE amplified correlations between Age and features that are elevated in its SHAP rankings: the Age–BMI correlation increased from ρ = 0.222 to ρ = 0.523 ( Δ = + 0.301 ), and Age–Blood Pb increased from ρ = 0.369 to ρ = 0.531 ( Δ = + 0.162 ). These amplified associations allow the VAE-trained classifier to extract hearing loss-relevant signals through Age-associated pathways, elevating BMI and Blood Pb in the SHAP rankings. SHAP interaction analysis further confirmed that VAE amplified the Urine Tl × Age interaction relative to the real-data model, indicating that the importance shift for this feature is driven by an amplified conditional relationship with the dominant predictor rather than by marginal distribution changes alone.

3.6. Privacy Validation

Both methods demonstrated strong privacy preservation. Zero exact matches were detected between synthetic and training data for both KDE (0/29,714) and VAE (0/29,714).
Membership inference attack success rates were 52.3% for KDE and 53.1% for VAE (Table 9), close to the 50% random baseline. These results indicate that an attacker cannot reliably determine whether a specific record was used in training. Extended privacy evaluation including distance-based metrics, attribute inference attacks, and re-identification risk assessment confirmed strong privacy preservation for both methods; full results are provided in Supplementary Table S2.
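The exact-match check amounts to counting synthetic rows that duplicate a training row; a minimal sketch with hypothetical column names:

```python
import pandas as pd

def exact_match_count(train_df, synth_df):
    """Sketch: number of synthetic rows identical to a training row."""
    train_keys = set(map(tuple, train_df.itertuples(index=False)))
    return sum(tuple(row) in train_keys
               for row in synth_df.itertuples(index=False))

# Toy frames; "age" and "pta_4k" are illustrative column names
train = pd.DataFrame({"age": [52, 67], "pta_4k": [35.0, 60.0]})
synth = pd.DataFrame({"age": [52, 30], "pta_4k": [35.0, 10.0]})
n_matches = exact_match_count(train, synth)
```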

3.7. Clinical Validation

Hearing loss prevalence was accurately preserved: real data 43.1%, KDE 42.8%, and VAE 43.0%. Visual inspection confirmed that synthetic audiograms demonstrated physiologically plausible configurations consistent with known hearing loss patterns; however, marked differences were evident across the generative methods.

3.7.1. Expert Clinical Plausibility Assessment

Blinded assessment of 85 patient profiles by two independent expert audiologists revealed substantial differences in clinical plausibility across data sources (Figure 8). Using combined ratings (means of both raters), the Kruskal–Wallis test indicated significant differences across sources ( H = 43.46 , p < 0.001). The VAE synthetic data achieved the highest combined mean plausibility rating (4.47, SD = 0.49), exceeding the real NHANES data (4.04, SD = 0.63), while KDE synthetic data received markedly lower ratings (2.77, SD = 0.91). Both raters independently reached the same ordinal ranking: VAE > NHANES > KDE.
Post-hoc pairwise comparisons with Benjamini–Hochberg FDR correction confirmed that all differences between sources were statistically significant: NHANES vs. KDE (p < 0.001), NHANES vs. VAE (p = 0.008), and KDE vs. VAE (p < 0.001). Of the VAE profiles, 96.7% achieved a combined plausibility rating of ≥4, compared with 56.0% for NHANES and only 13.3% for KDE.
Inter-rater reliability analysis demonstrated moderate agreement between the two audiologists (Cohen’s weighted κ = 0.553 , ICC(2,1) = 0.556 , Spearman ρ = 0.627 , p < 0.001). Exact agreement was observed for 45.9% of ratings, with 84.7% of all ratings falling within one scale point. Agreement was strongest for VAE profiles (100% within one point) and weakest for KDE profiles (63.3% within one point), reflecting the greater clinical ambiguity of KDE-generated data. For the binary assessment criteria, inter-rater agreement was substantial for Age-Hearing Appropriateness ( κ = 0.725 ) and moderate for Pattern Seen in Practice ( κ = 0.579 ) and Audiogram Usual ( κ = 0.578 ).

3.7.2. Plausibility by Patient Cohort

Sub-analysis of combined plausibility ratings by patient cohort revealed consistent VAE superiority across most cohort types (Figure 9). VAE achieved combined mean ratings above 4.5 for normal hearing, mild loss, moderate loss, sloping loss, and elderly cohorts, while KDE ratings ranged from 2.2 to 3.3 across the same cohorts. The low probability cohort showed the smallest difference between methods (VAE: 3.80, KDE: 3.20), indicating that deliberately challenging cases remained difficult for both approaches. Full cohort-level statistics are provided in Supplementary Material S7.

4. Discussion

4.1. Summary of Findings

This study presents a comprehensive validation of KDE and VAE approaches for generating synthetic audiometric data from NHANES. VAE achieved 86.3% relative discriminative fidelity for machine learning applications in hearing loss prediction, representing a 13.7 percentage point improvement over KDE (72.6%). Both methods demonstrated strong privacy preservation with zero exact record matches and membership inference attack resistance near chance levels. Importantly, blinded clinical assessment by two independent expert audiologists confirmed that VAE synthetic profiles were rated as more plausible than real NHANES data (combined mean 4.47 vs. 4.04), while KDE synthetic data was largely rated as implausible (combined mean 2.77). Inter-rater reliability was moderate ( κ = 0.553 ), with both raters independently reaching the same ordinal conclusions.

4.2. Comparison with Prior Work

The TSTR framework, introduced by Esteban et al. [33] for medical time series, provides a rigorous methodology for assessing synthetic data quality. Our VAE results (86.3%) align with reported values for synthetic electronic health records using similar deep generative approaches [5,16]. The replication of Mi & Sun’s [34] classification performance (TRTR mean AUC 0.913) validates our implementation and provides a reproducible benchmark for synthetic data evaluation.

4.3. Method Comparison

The KDE approach offers practical advantages: simpler implementation, interpretable probability estimates, and minimal training requirements. KDE demonstrated greater robustness in cross-dataset validation, suggesting utility in scenarios where training and deployment distributions may differ.
The VAE approach requires more computational resources but provides superior fidelity for complex multivariate relationships. The latent space representation enables principled interpolation and conditional generation, supporting applications such as generating synthetic data for specific demographic subgroups.

4.4. Implications for Hearing Health Research

The validated synthetic data generation methods address infrastructure gaps in hearing health research. We identify four concrete application domains with illustrative use cases:
Machine learning development. Researchers developing hearing loss prediction models, automated audiogram classification systems, or hearing aid recommendation algorithms require substantial training data. Synthetic datasets enable algorithm development and hyperparameter optimisation without institutional data access agreements. For example, a research team developing a neural network for audiogram pattern classification could train initial models on synthetic data, then validate on locally held real data, reducing the data sharing burden while maintaining rigorous validation.
Educational applications. Audiology training programmes require diverse case examples spanning normal hearing through profound loss across different aetiologies. Synthetic audiograms can supplement limited real case libraries, providing students with exposure to rare configurations (e.g., cookie-bite audiograms and reverse slope losses) that may be underrepresented in teaching clinics. The open availability of synthetic datasets enables curriculum standardisation across training institutions.
Clinical tool development. Hearing aid manufacturers and clinical decision support developers require test datasets for algorithm validation. Synthetic data can serve as standardised benchmarks for comparing fitting algorithms, simulating diverse patient populations without accessing proprietary clinical databases. This supports regulatory submissions requiring demonstration of algorithm performance across representative populations.
Multi-site research collaboration. Federated research consortia studying hearing health across multiple clinical sites face substantial data governance barriers. Synthetic data can facilitate preliminary analyses and protocol development: sites generate synthetic versions of their local data for shared analysis, identifying promising research directions before initiating formal data sharing agreements for validation studies on real data.
The 86.3% mean discriminative fidelity indicates that models trained on VAE synthetic data achieve approximately 86% of the performance of models trained on real data, corresponding to a TSTR–TRTR performance gap of 13.7%. A comprehensive benchmarking study of synthetic electronic health record generators reported TSTR AUROC values ranging from 0.56 for weaker methods to 0.71 for the best-performing approaches (see Figure 3e and accompanying code repository at https://github.com/yy6linda/synthetic-ehr-benchmarking (accessed on 13 March 2026)) [22]. Our VAE achieves a mean TSTR AUROC of 0.794, exceeding the performance of all methods evaluated in that benchmark. Importantly, performance varies substantially by model architecture: Logistic Regression achieved 99.1% relative utility and Neural Networks reached 90.5%, while tree-based ensemble methods showed more modest performance (Random Forest: 84.1% and XGBoost: 80.6%). Researchers planning downstream applications should evaluate which model architectures perform best on synthetic data for their specific use case, as simpler linear models may offer superior utility preservation compared to more complex architectures. Our results position the VAE approach within the acceptable range for research applications where the benefits of data accessibility and privacy protection outweigh modest performance reductions. Use cases tolerant of this performance trade-off include algorithm development and benchmarking, educational applications, and preliminary model development prior to validation on real data.

4.5. Limitations

Several limitations should be acknowledged. The NHANES dataset primarily provides unmasked air conduction measurements without corresponding bone conduction thresholds, limiting characterisation of conductive versus sensorineural components. The validation focused on hearing loss prediction; utility for other applications requires separate evaluation.
A key limitation concerns generalisability beyond the US population. The NHANES dataset reflects US demographic composition, occupational noise exposure patterns, and healthcare access structures that may differ substantially from other countries and regions. Hearing loss prevalence varies geographically due to differences in age distribution, occupational and recreational noise exposure, ototoxic medication use, and access to hearing protection and healthcare [2]. Consequently, synthetic data generators trained on NHANES may produce cohorts that do not accurately represent populations with different demographic and healthcare structures. Importantly, however, the methodological framework presented here is transferable to other audiometric datasets. Researchers seeking to apply these methods in non-US contexts should consider retraining the generative models on locally representative data where available, or explicitly acknowledging the US-centric training data when using the provided pre-trained models.
The higher-than-expected hearing loss prevalence (43.1%) perhaps reflects NHANES’ oversampling of older adults. Synthetic data inherits these characteristics, requiring careful interpretation when applying to contexts with different prevalence expectations.
Additionally, while primary feature importance rankings were well preserved (particularly the dominance of Age), secondary feature rankings diverged between real and synthetic-trained classifiers. Post-hoc correlation analysis identified the specific mechanisms: KDE attenuated inter-metal correlations, inflating independent heavy metal importance, while VAE amplified Age-paired correlations, elevating features such as BMI and Blood Pb (Table 8). This reflects a broader challenge for generative models: preserving marginal distributions and dominant correlations is more tractable than preserving the full conditional dependency structure that governs lower-ranked features. Future work should investigate methods for explicitly constraining correlation structure and feature importance preservation during synthetic data generation.
Both generative methods demonstrated limited ability to produce clinically plausible edge cases, as reflected in lower expert ratings for low-probability synthetic samples. This limitation has identifiable technical causes. For KDE, the bandwidth smoothing inherent to kernel methods pulls generated samples toward the distribution center, attenuating extreme values. For VAE, the KL divergence regularisation term encourages the latent distribution to match a standard normal prior, biasing generation toward modal regions and away from distribution tails [25]. Additionally, edge cases are by definition rare in training data, providing fewer examples for model learning. Researchers requiring synthetic data representing rare audiometric configurations should consider alternative approaches such as conditional generation [20,21] with explicit constraints, rejection sampling to enrich tail observations, or augmentation methods specifically designed for minority class enhancement.

4.6. Future Directions

The methods established in this work provide a foundation for several research and development pathways. A natural extension involves conditional generation, whereby synthetic data can be sampled according to specified demographic or clinical parameters. Recent work has demonstrated the value of conditional generative approaches for producing synthetic cohorts with particular characteristics in healthcare applications [49,50]. This capability would enable researchers to generate cohorts with specific age ranges, hearing loss severities, or comorbidity profiles, addressing the challenge of under-representation in observational datasets and supporting targeted algorithm development for clinical subpopulations [5].
The current implementation focuses on pure-tone audiometry, yet comprehensive audiological assessment encompasses multiple complementary modalities. Future work should integrate speech recognition scores, tympanometric measurements, and objective measures including otoacoustic emissions and auditory brainstem responses into a unified synthetic generation framework [3]. Such multi-modal synthesis would provide more comprehensive representations of hearing function and enable development of holistic diagnostic support tools that mirror the breadth of information available in clinical practice. Machine learning approaches have demonstrated value in analysing these objective measures individually [51,52], suggesting potential for integrated multi-modal synthetic data applications. Furthermore, while this study addresses tabular audiometric data, hearing healthcare increasingly incorporates medical imaging modalities such as otoscopy and temporal bone imaging. For such applications, advances in computer vision offer relevant methodological parallels, including saliency-driven approaches for explainability [53] and feature enhancement architectures for temporal sequence modelling [54]. Adapting such methods to audiological imaging and signal processing represents a promising avenue for multi-modal synthetic data generation.
Hearing loss is inherently a progressive condition, with trajectories that vary according to aetiology, individual susceptibility, and intervention history [27]. Extending the current cross-sectional approach to longitudinal synthesis would enable modelling of hearing change over time, supporting applications in prognosis prediction, treatment monitoring, and identification of rapid progressors who may benefit from early intervention. Recent work using long short-term memory networks for predicting hearing loss progression demonstrates the feasibility of temporal modelling approaches in audiological applications [55]. This direction would require integration with longitudinal cohort data and development of temporal generative architectures capable of capturing realistic progression patterns.

5. Conclusions

This study establishes validated methodologies for generating synthetic audiometric data with demonstrable privacy preservation properties, using KDE and VAE approaches validated against a publicly accessible benchmark. The VAE approach achieved 86.3% relative discriminative fidelity for machine learning applications, demonstrating that synthetic audiometric data can meaningfully support model development. Both methods showed zero exact record matches and robust membership inference attack resistance. Crucially, blinded clinical assessment by two independent expert audiologists confirmed that VAE synthetic profiles achieved higher plausibility ratings than real NHANES data, with 96.7% rated as clinically plausible compared to 56.0% for real data, while KDE profiles were largely rated as implausible.
These findings address a gap in hearing health research infrastructure by providing validated approaches for generating synthetic data that maintains statistical fidelity and clinical plausibility while demonstrating strong privacy preservation. By establishing these methods on a public dataset where privacy properties can be independently verified, this work provides a foundation for subsequent application to sensitive clinical datasets where privacy protection is essential.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/app16062917/s1, Supplementary Table S1: Complete model hyperparameters; Supplementary Table S2: Comparison with traditional oversampling methods; Supplementary Material S3: Patient reports for clinical plausibility assessment; Supplementary Material S4: Clinical Plausibility Rating Template; Supplementary Material S5: Patient registry key; Supplementary Material S6: Completed ratings from both raters; Supplementary Material S7: Cohort-level plausibility statistics; Supplementary Material S8: Full validation results.

Author Contributions

Conceptualisation, L.B. and N.M.; methodology, L.B. and R.K.; software, L.B., Y.B.O. and I.M.; validation, L.B. and R.K.; formal analysis, L.B., R.K. and Y.B.O.; data curation, L.B., Y.B.O. and I.M.; writing—original draft preparation, L.B.; writing—review and editing, R.K., A.G.M.S. and N.M.; visualisation, L.B.; supervision, A.G.M.S. and N.M.; funding acquisition, A.G.M.S. and N.M. All authors have read and agreed to the published version of the manuscript.

Funding

This project was supported by the National Institute for Health and Care Research (NIHR), University College London Hospitals Biomedical Research Centre (BRC), funding reference number NIHR 203328 BRC965A/HH/AS/110390.

Institutional Review Board Statement

Ethical review and approval were not required for this study as it utilises publicly available, de-identified data from the National Health and Nutrition Examination Survey (NHANES).

Informed Consent Statement

Not applicable. This study used publicly available, de-identified data.

Data Availability Statement

The NHANES data are publicly available at https://www.cdc.gov/nchs/nhanes/ (accessed on 13 March 2026). Synthetic datasets and code are available at https://github.com/evidENT-AI/SyntHH (accessed on 13 March 2026).

Acknowledgments

The authors thank the NHANES participants and staff for their contributions to this public health resource. We thank Eleanor Davies (E.D.) for serving as an independent clinical audiologist for the blinded plausibility assessment. We also thank the anonymous reviewers whose constructive feedback substantially improved this manuscript, particularly suggestions regarding inter-rater reliability in the clinical validation and mechanistic analysis of feature importance divergence, which significantly deepened the analysis and strengthened the work.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AUC: Area Under the Receiver Operating Characteristic Curve
daPa: Dekapascal
dB HL: Decibels Hearing Level
DCR: Distance to Closest Record
GDPR: General Data Protection Regulation
HIPAA: Health Insurance Portability and Accountability Act
KDE: Kernel Density Estimation
MIA: Membership Inference Attack
NHANES: National Health and Nutrition Examination Survey
PTA: Pure-Tone Audiometry
SHAP: SHapley Additive exPlanations
TRTR: Train-Real-Test-Real
TRTS: Train-Real-Test-Synthetic
TSTR: Train-Synthetic-Test-Real
TSTS: Train-Synthetic-Test-Synthetic
VAE: Variational Autoencoder

References

Figure 1. Methodological workflow for synthetic audiometric data generation and validation. The pipeline comprises three stages: (1) Data and Pre-processing: NHANES audiometric data undergo quality control and feature engineering, with outputs constrained by domain-specific requirements including physiological correlations, aetiology-specific patterns, and demographic covariance; (2) Modelling: two complementary generative approaches (Kernel Density Estimation and Variational Autoencoder) produce synthetic datasets; (3) Evaluation: synthetic data quality is assessed through statistical validation (distributional and correlational fidelity), machine learning validation (Train-on-Synthetic-Test-on-Real, or TSTR, framework with 8 classifiers), clinical validation (expert plausibility review), and privacy validation (exact match detection and membership inference attack, or MIA, resistance).
Figure 2. Variational Autoencoder architecture for synthetic audiometric data generation. The encoder compresses 270 input features through three dense layers to a 64-dimensional latent space parameterised by mean ( μ ) and log-variance ( log σ 2 ). Sampling uses the reparameterisation trick ( z = μ + σ · ϵ , where ϵ N ( 0 , I ) ). The decoder reconstructs the original feature space through mirrored dense layers.
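The sampling step in Figure 2 can be sketched in a few lines. The fragment below is an illustrative NumPy sketch, not the study's implementation: the encoder outputs ( μ , log σ 2 ) are stubbed with zeros, and the 270-feature encoder/decoder stacks and training loop are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterise(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    Isolating the randomness in eps is the reparameterisation trick:
    in a full VAE it lets gradients flow through mu and log_var.
    """
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

# Stubbed encoder output for a batch of 4 profiles and the
# 64-dimensional latent space described in Figure 2.
mu = np.zeros((4, 64))
log_var = np.zeros((4, 64))   # log sigma^2 = 0, i.e. sigma = 1
z = reparameterise(mu, log_var, rng)
print(z.shape)                # (4, 64)
```

Each row of `z` would then be passed through the decoder's mirrored dense layers to produce one synthetic 270-feature profile.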
Figure 3. Age distribution of NHANES participants by gender and survey cohort. The distribution shows a pronounced mode in the 10–20 year age range reflecting NHANES sampling of adolescents, with a secondary peak at 80 years representing recoded ages above 85.
Figure 4. Correlation heatmap for pure-tone audiometry thresholds across frequencies and ears. Strong correlations exist between adjacent frequencies within each ear and between corresponding frequencies across ears.
Figure 5. Distribution comparison of hearing thresholds across frequencies between real NHANES data (left, blue) and KDE synthetic data (right, green). Both distributions show similar patterns across all test frequencies, demonstrating preservation of marginal distributions.
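As an illustration of the KDE approach evaluated in Figure 5, the fragment below fits a Gaussian kernel density estimate to a toy two-feature threshold table and resamples from it. The use of `scipy.stats.gaussian_kde` with its default (Scott) bandwidth is an assumption for illustration, not necessarily the study's configuration, and no NHANES values are used.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(42)

# Toy stand-in for two correlated thresholds (dB HL), e.g. 2 kHz and
# 4 kHz in the same ear, with an age-driven common cause.
n = 2000
age = rng.uniform(12, 85, n)
thr_2k = 0.4 * age + rng.normal(0, 6, n)
thr_4k = 0.6 * age + rng.normal(0, 8, n)
real = np.vstack([thr_2k, thr_4k])        # gaussian_kde expects shape (d, n)

kde = gaussian_kde(real)                  # default Scott bandwidth
synthetic = kde.resample(size=1000, seed=1)

# Cross-feature correlation should survive resampling.
print(round(np.corrcoef(real)[0, 1], 2), round(np.corrcoef(synthetic)[0, 1], 2))
```

Because KDE smooths the joint density with product kernels, it reproduces marginals well but, in higher dimensions, tends to attenuate weaker dependencies, which is consistent with the correlation shifts reported in Table 8.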
Figure 6. ROC curves comparing TRTR baseline (blue), KDE synthetic (green), and VAE synthetic (teal) performance across machine learning models. The VAE synthetic data consistently outperforms KDE, approaching baseline performance across most classifiers.
Figure 7. SHAP feature importance comparison between models trained on real data (blue), KDE synthetic (green), and VAE synthetic (teal). The real data model identifies Age, Gender, Hypertension, Noise Exposure, and Urine Thallium as the top 5 predictors. KDE synthetic shows substantially altered rankings dominated by heavy metal biomarkers. VAE preserves Age as the dominant predictor and retains 3 of top 5 real-data features, though secondary rankings diverge.
Figure 8. Combined mean clinical plausibility ratings by data source (means of two independent expert audiologists) with standard error bars. Both raters independently rated VAE synthetic profiles as more plausible than real NHANES data, while KDE profiles were rated as largely implausible. The dashed line indicates the uncertain threshold (rating = 3).
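The weighted κ statistic used to summarise agreement between the two raters behind Figure 8 can be computed for any pair of rating vectors. The snippet below uses hypothetical 1–5 ratings and assumes quadratic weights; the study does not specify the weighting scheme, so treat this only as an illustration of the metric.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 plausibility ratings from two raters on ten profiles;
# the study itself used 85 profiles and two audiologists.
rater_a = [5, 4, 5, 2, 1, 3, 4, 5, 2, 4]
rater_b = [4, 4, 5, 1, 1, 3, 5, 5, 3, 3]

# Quadratic weights penalise a two-point disagreement four times as
# heavily as a one-point one, so near-misses count as partial agreement.
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(round(kappa, 3))
```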
Figure 9. Combined mean clinical plausibility ratings (means of two independent raters) for KDE and VAE synthetic data by patient cohort. VAE consistently outperformed KDE across all standard cohorts. The low probability cohort showed the smallest difference between methods, reflecting the inherent difficulty of generating plausible edge cases.
Table 1. Validation Framework Scenarios.

| Scenario | Training Data | Test Data | Purpose |
|---|---|---|---|
| Train-Real-Test-Real (TRTR) | Real | Real | Baseline performance |
| Train-Synthetic-Test-Real (TSTR) | Synthetic | Real | Discriminative fidelity |
| Train-Real-Test-Synthetic (TRTS) | Real | Synthetic | Pattern matching |
| Train-Synthetic-Test-Synthetic (TSTS) | Synthetic | Synthetic | Internal consistency |
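The four scenarios above differ only in which dataset is used for fitting and which for scoring. Below is a minimal sketch of the TRTR/TSTR pair on a toy binary task, with a single logistic regression standing in for the study's eight classifiers; the synthetic set here is simply a second independent draw, playing the role of generator output.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

def make_data(n, shift=2.0):
    """Toy binary task: one informative feature out of five."""
    y = rng.integers(0, 2, n)
    X = rng.normal(0.0, 1.0, (n, 5))
    X[:, 0] += shift * y
    return X, y

X_real, y_real = make_data(2000)
X_syn, y_syn = make_data(2000)    # stand-in for generator output

X_tr, X_te, y_tr, y_te = train_test_split(
    X_real, y_real, test_size=0.2, random_state=0)

def auc(train_X, train_y, test_X, test_y):
    model = LogisticRegression(max_iter=1000).fit(train_X, train_y)
    return roc_auc_score(test_y, model.predict_proba(test_X)[:, 1])

trtr = auc(X_tr, y_tr, X_te, y_te)    # Train-Real-Test-Real baseline
tstr = auc(X_syn, y_syn, X_te, y_te)  # Train-Synthetic-Test-Real
print(f"utility ratio: {tstr / trtr:.2f}")
```

The TSTR/TRTR ratio is the utility measure reported in Table 6; a ratio near 1 means training on synthetic data costs little discriminative performance on real patients.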
Table 2. Demographic characteristics of the NHANES dataset.

| Characteristic | Value | Percentage |
|---|---|---|
| Total participants | 29,714 | 100.0% |
| Gender | | |
| Male | 14,647 | 49.3% |
| Female | 15,067 | 50.7% |
| Age (years) | | |
| Mean (SD) | 37.8 (23.5) | |
| Median | 51.0 | |
| Range | 12–85 | |
| Race/Ethnicity | | |
| Non-Hispanic White | 11,301 | 38.0% |
| Non-Hispanic Black | 7017 | 23.6% |
| Mexican American | 5342 | 18.0% |
| Other Race | 3532 | 11.9% |
| Other Hispanic | 2522 | 8.5% |
| Hearing loss prevalence | 12,806 | 43.1% |
Table 3. Correlation matrix for hearing thresholds (dB HL) and demographics. * p < 0.001 after Bonferroni correction.

| | 0.5 kHz | 1 kHz | 2 kHz | 4 kHz | 8 kHz | Gender | Age |
|---|---|---|---|---|---|---|---|
| 0.5 kHz | 1.00 | | | | | | |
| 1 kHz | 0.85 * | 1.00 | | | | | |
| 2 kHz | 0.72 * | 0.84 * | 1.00 | | | | |
| 4 kHz | 0.61 * | 0.72 * | 0.83 * | 1.00 | | | |
| 8 kHz | 0.57 * | 0.67 * | 0.75 * | 0.85 * | 1.00 | | |
| Gender | 0.02 * | −0.04 * | −0.08 * | −0.18 * | −0.07 * | 1.00 | |
| Age | 0.48 * | 0.58 * | 0.65 * | 0.74 * | 0.79 * | 0.01 | 1.00 |
Table 4. Equivalence testing results: mean hearing thresholds (dB HL).

| Frequency | Ear | Real | KDE | VAE |
|---|---|---|---|---|
| 0.5 kHz | Right | 11.81 | 11.79 | 11.83 |
| 0.5 kHz | Left | 11.68 | 11.63 | 11.70 |
| 1 kHz | Right | 10.55 | 10.50 | 10.57 |
| 1 kHz | Left | 10.50 | 10.47 | 10.52 |
| 2 kHz | Right | 11.91 | 11.90 | 11.93 |
| 2 kHz | Left | 12.53 | 12.53 | 12.55 |
| 4 kHz | Right | 17.47 | 17.43 | 17.50 |
| 4 kHz | Left | 18.42 | 18.40 | 18.45 |
| 8 kHz | Right | 23.87 | 23.75 | 23.95 |
| 8 kHz | Left | 24.51 | 24.42 | 24.55 |
All comparisons p < 0.001 for equivalence within 1 dB HL margin (TOST procedure).
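The TOST procedure runs two one-sided tests against the ±1 dB HL margin and claims equivalence only when both reject. Below is a hand-rolled sketch on simulated thresholds with a sub-dB mean difference; the sample sizes, variances, and degrees-of-freedom approximation are illustrative assumptions, not the study's exact configuration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def tost_independent(a, b, margin):
    """Two one-sided t-tests for mean equivalence within +/- margin.

    Equivalence is claimed when both one-sided tests reject, so the
    overall p-value is the larger of the two.
    """
    diff = np.mean(a) - np.mean(b)
    se = np.sqrt(np.var(a, ddof=1) / len(a) + np.var(b, ddof=1) / len(b))
    df = len(a) + len(b) - 2                               # simple pooled df
    p_lower = 1.0 - stats.t.cdf((diff + margin) / se, df)  # H0: diff <= -margin
    p_upper = stats.t.cdf((diff - margin) / se, df)        # H0: diff >= +margin
    return max(p_lower, p_upper)

# Simulated 4 kHz thresholds with a sub-dB mean difference.
real = rng.normal(17.47, 18.0, 20000)
synth = rng.normal(17.43, 18.0, 20000)
p = tost_independent(real, synth, margin=1.0)
print(p < 0.05)    # equivalent within the +/- 1 dB HL margin
```

Note the logic is inverted relative to an ordinary t-test: here a small p-value supports equivalence, because the null hypotheses assert that the difference lies outside the margin.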
Table 5. TRTR baseline performance (10-fold cross-validation).

| Model | CV AUC (Mean ± SD) | Test AUC | F1 Score |
|---|---|---|---|
| XGBoost | 0.947 ± 0.012 | 0.956 | 0.895 |
| Random Forest | 0.939 ± 0.015 | 0.945 | 0.867 |
| Gradient Boosting | 0.925 ± 0.015 | 0.931 | 0.846 |
| SVM | 0.921 ± 0.015 | 0.931 | 0.856 |
| Neural Network | 0.918 ± 0.017 | 0.923 | 0.867 |
| KNN | 0.916 ± 0.014 | 0.921 | 0.852 |
| Logistic Regression | 0.883 ± 0.017 | 0.886 | 0.805 |
| Decision Tree | 0.785 ± 0.016 | 0.811 | 0.823 |
| Mean | 0.904 ± 0.015 | 0.913 | 0.851 |
Table 6. TSTR performance: discriminative fidelity.

| Model | KDE AUC | KDE Ratio | VAE AUC | VAE Ratio |
|---|---|---|---|---|
| Logistic Regression | 0.851 | 96.1% | 0.878 | 99.1% |
| SVM | 0.747 | 80.2% | 0.818 | 87.9% |
| Random Forest | 0.665 | 70.3% | 0.795 | 84.1% |
| Decision Tree | 0.488 | 60.2% | 0.687 | 84.7% |
| KNN | 0.678 | 73.6% | 0.769 | 83.5% |
| Gradient Boosting | 0.562 | 60.4% | 0.803 | 86.3% |
| Neural Network | 0.708 | 76.7% | 0.835 | 90.5% |
| XGBoost | 0.602 | 63.0% | 0.770 | 80.6% |
| Mean | 0.663 | 72.6% | 0.794 | 86.3% |
TSTR/TRTR ratio represents relative discriminative fidelity. Models trained on synthetic data, tested on held-out real data (20% test split). Ratios computed against CV-validated TRTR baseline.
Table 7. Extended validation results.

| Metric | KDE | VAE |
|---|---|---|
| TRTS Mean AUC | 0.673 | 0.907 |
| TSTS Mean AUC | 0.866 | 0.991 |
| SHAP Rank Difference | 6.6 | 3.7 |
Table 8. Spearman correlation differences ( Δ ρ = ρ synthetic − ρ real ) for feature pairs relevant to SHAP ranking divergence. KDE attenuates inter-metal correlations, inflating independent feature importance. VAE amplifies Age-paired correlations, elevating BMI and Blood Pb. Bold values indicate the largest correlation shifts for each method.

| Feature Pair | Real ρ | KDE ρ | Δ KDE | VAE ρ | Δ VAE |
|---|---|---|---|---|---|
| *VAE-amplified (Age-paired correlations)* | | | | | |
| Age–BMI | 0.222 | 0.118 | −0.104 | 0.523 | **+0.301** |
| Age–Blood Pb | 0.369 | 0.109 | −0.260 | 0.531 | +0.162 |
| *KDE-attenuated (inter-metal correlations)* | | | | | |
| Urine Mo–Urine Sb | 0.489 | 0.023 | **−0.467** | 0.498 | +0.009 |
| Urine Cs–Urine As | 0.524 | 0.107 | −0.417 | 0.562 | +0.038 |
| Urine Cs–Urine Sb | 0.436 | 0.021 | −0.415 | 0.532 | +0.096 |
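The Δ ρ values in Table 8 are simple differences of Spearman coefficients between a synthetic and a real feature pair. The fragment below computes one such shift on toy data, with independent shuffling of each feature standing in for KDE-style attenuation of dependence.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(11)

def delta_rho(real_x, real_y, syn_x, syn_y):
    """Spearman shift: rho(synthetic pair) - rho(real pair)."""
    return spearmanr(syn_x, syn_y)[0] - spearmanr(real_x, real_y)[0]

# Toy attenuation: shuffle each feature independently so the marginals
# are unchanged but the dependence between them is destroyed.
n = 3000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(scale=0.9, size=n)   # moderately correlated "real" pair
x_syn = rng.permutation(x)
y_syn = rng.permutation(y)

shift = delta_rho(x, y, x_syn, y_syn)
print(round(shift, 2))   # strongly negative: correlation attenuated
```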
Table 9. Membership inference attack results.

| Metric | KDE | VAE |
|---|---|---|
| Attack Success Rate | 52.3% | 53.1% |
| Attack AUC | 0.523 | 0.531 |
| Random Baseline | 50.0% | 50.0% |
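One common form of membership inference attack against synthetic data scores each query record by its distance to the closest synthetic record; an attack AUC near 0.5, as in Table 9, means members and non-members are statistically indistinguishable. The sketch below illustrates this distance-based variant on toy data and is not a reproduction of the attack protocol used in the study.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(5)

def mia_scores(queries, synthetic):
    """Score each query by (negated) distance to its closest synthetic record.

    Smaller distances are treated as evidence that the query was in the
    generator's training set.
    """
    d = np.linalg.norm(queries[:, None, :] - synthetic[None, :, :], axis=2)
    return -d.min(axis=1)

# When synthetic records are drawn from the population distribution rather
# than copied from members, the attack should sit near chance (AUC ~ 0.5).
members = rng.normal(size=(200, 8))        # records the generator "saw"
non_members = rng.normal(size=(200, 8))    # held-out records
synthetic = rng.normal(size=(500, 8))      # stand-in for generator output

scores = np.concatenate([mia_scores(members, synthetic),
                         mia_scores(non_members, synthetic)])
labels = np.concatenate([np.ones(200), np.zeros(200)])
auc = roc_auc_score(labels, scores)
print(round(auc, 2))
```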

Share and Cite

MDPI and ACS Style

Barrett, L.; Katiri, R.; Ooi, Y.B.; Moffitt, I.; Schilder, A.G.M.; Mehta, N. Validated Methods for Synthesising Hearing Health Data for Machine Learning: A Comparative Study of KDE and VAE Approaches. Appl. Sci. 2026, 16, 2917. https://doi.org/10.3390/app16062917


