Author Contributions
Conceptualisation, L.B. and N.M.; methodology, L.B. and R.K.; software, L.B., Y.B.O. and I.M.; validation, L.B. and R.K.; formal analysis, L.B., R.K. and Y.B.O.; data curation, L.B., Y.B.O. and I.M.; writing—original draft preparation, L.B.; writing—review and editing, R.K., A.G.M.S. and N.M.; visualisation, L.B.; supervision, A.G.M.S. and N.M.; funding acquisition, A.G.M.S. and N.M. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Methodological workflow for synthetic audiometric data generation and validation. The pipeline comprises three stages: (1) Data and Pre-processing: NHANES audiometric data undergo quality control and feature engineering, with outputs constrained by domain-specific requirements including physiological correlations, aetiology-specific patterns, and demographic covariance; (2) Modelling: two complementary generative approaches (Kernel Density Estimation and Variational Autoencoder) produce synthetic datasets; (3) Evaluation: synthetic data quality is assessed through statistical validation (distributional and correlational fidelity), machine learning validation (Train-on-Synthetic-Test-on-Real, or TSTR, framework with 8 classifiers), clinical validation (expert plausibility review), and privacy validation (exact match detection and membership inference attack, or MIA, resistance).
Figure 2.
Variational Autoencoder architecture for synthetic audiometric data generation. The encoder compresses 270 input features through three dense layers to a 64-dimensional latent space parameterised by mean (μ) and log-variance (log σ²). Sampling uses the reparameterisation trick (z = μ + σ ⊙ ε, where ε ∼ N(0, I)). The decoder reconstructs the original feature space through mirrored dense layers.
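The sampling step described in the caption can be sketched in a few lines. This is a minimal NumPy illustration of the reparameterisation trick, not the authors' implementation; the 64-dimensional latent size comes from the caption, while the all-zero mean and log-variance are placeholder values.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterise(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    Moving the randomness into eps keeps z differentiable with
    respect to mu and log_var, which is what allows the encoder
    to be trained by backpropagation.
    """
    sigma = np.exp(0.5 * log_var)       # log-variance -> standard deviation
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

# Placeholder encoder output for one audiometric profile
# (64-dimensional latent space, as in the caption)
mu = np.zeros(64)
log_var = np.zeros(64)                  # sigma = 1 everywhere
z = reparameterise(mu, log_var, rng)
```

During training, the decoder would map `z` back to the 270-dimensional feature space; at generation time, `z` is instead drawn directly from the N(0, I) prior.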
Figure 3.
Age distribution of NHANES participants by gender and survey cohort. The distribution shows a pronounced mode in the 10–20 year age range reflecting NHANES sampling of adolescents, with a secondary peak at 85 years representing recoded ages of 85 and above.
Figure 4.
Correlation heatmap for pure-tone audiometry thresholds across frequencies and ears. Strong correlations exist between adjacent frequencies within each ear and between corresponding frequencies across ears.
Figure 5.
Distribution comparison of hearing thresholds across frequencies between real NHANES data (left, blue) and KDE synthetic data (right, green). Both distributions show similar patterns across all test frequencies, demonstrating preservation of marginal distributions.
Figure 6.
ROC curves comparing TRTR baseline (blue), KDE synthetic (green), and VAE synthetic (teal) performance across machine learning models. The VAE synthetic data consistently outperforms KDE, approaching baseline performance across most classifiers.
Figure 7.
SHAP feature importance comparison between models trained on real data (blue), KDE synthetic (green), and VAE synthetic (teal). The real data model identifies Age, Gender, Hypertension, Noise Exposure, and Urine Thallium as the top 5 predictors. KDE synthetic shows substantially altered rankings dominated by heavy metal biomarkers. VAE preserves Age as the dominant predictor and retains 3 of the top 5 real-data features, though secondary rankings diverge.
Figure 8.
Combined mean clinical plausibility ratings by data source (means of two independent expert audiologists) with standard error bars. Both raters independently rated VAE synthetic profiles as more plausible than real NHANES data, while KDE profiles were rated as largely implausible. The dashed line indicates the uncertain threshold (rating = 3).
Figure 9.
Combined mean clinical plausibility ratings (means of two independent raters) for KDE and VAE synthetic data by patient cohort. VAE consistently outperformed KDE across all standard cohorts. The low probability cohort showed the smallest difference between methods, reflecting the inherent difficulty of generating plausible edge cases.
Table 1.
Validation Framework Scenarios.
| Scenario | Training Data | Test Data | Purpose |
|---|---|---|---|
| Train-Real-Test-Real (TRTR) | Real | Real | Baseline performance |
| Train-Synthetic-Test-Real (TSTR) | Synthetic | Real | Discriminative fidelity |
| Train-Real-Test-Synthetic (TRTS) | Real | Synthetic | Pattern matching |
| Train-Synthetic-Test-Synthetic (TSTS) | Synthetic | Synthetic | Internal consistency |
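The four scenarios above amount to one loop over the cross product of training and test sources. A minimal scikit-learn sketch, with `make_classification` standing in for the NHANES split and a generated sample, and logistic regression standing in for the full classifier set:

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Stand-ins for the real NHANES data and one synthetic dataset
X_real, y_real = make_classification(n_samples=400, random_state=0)
X_syn, y_syn = make_classification(n_samples=400, random_state=1)

sources = {"Real": (X_real, y_real), "Synthetic": (X_syn, y_syn)}
results = {}
for (train_name, (X_tr, y_tr)), (test_name, (X_te, y_te)) in product(
    sources.items(), sources.items()
):
    # In the full framework each source would be split into held-out
    # folds; fitting and scoring on the same arrays here only keeps
    # the sketch short.
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
    results[f"Train-{train_name}-Test-{test_name}"] = auc
```

The resulting dictionary holds one AUC per scenario (TRTR, TSTR, TRTS, TSTS), which is the shape of the comparisons reported in Tables 5–7.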
Table 2.
Demographic characteristics of the NHANES dataset.
| Characteristic | Value | Percentage |
|---|---|---|
| Total participants | 29,714 | 100.0% |
| Gender | | |
| Male | 14,647 | 49.3% |
| Female | 15,067 | 50.7% |
| Age (years) | | |
| Mean (SD) | 37.8 (23.5) | — |
| Median | 51.0 | — |
| Range | 12–85 | — |
| Race/Ethnicity | | |
| Non-Hispanic White | 11,301 | 38.0% |
| Non-Hispanic Black | 7,017 | 23.6% |
| Mexican American | 5,342 | 18.0% |
| Other Race | 3,532 | 11.9% |
| Other Hispanic | 2,522 | 8.5% |
| Hearing loss prevalence | 12,806 | 43.1% |
Table 3.
Correlation matrix for hearing thresholds (dB HL) and demographics. * p < 0.001 after Bonferroni correction.
| | 0.5 kHz | 1 kHz | 2 kHz | 4 kHz | 8 kHz | Gender | Age |
|---|---|---|---|---|---|---|---|
| 0.5 kHz | 1.00 | | | | | | |
| 1 kHz | 0.85 * | 1.00 | | | | | |
| 2 kHz | 0.72 * | 0.84 * | 1.00 | | | | |
| 4 kHz | 0.61 * | 0.72 * | 0.83 * | 1.00 | | | |
| 8 kHz | 0.57 * | 0.67 * | 0.75 * | 0.85 * | 1.00 | | |
| Gender | 0.02 * | −0.04 * | −0.08 * | −0.18 * | −0.07 * | 1.00 | |
| Age | 0.48 * | 0.58 * | 0.65 * | 0.74 * | 0.79 * | 0.01 | 1.00 |
Table 4.
Equivalence testing results: mean hearing thresholds (dB HL).
| Frequency | Ear | Real | KDE | VAE |
|---|---|---|---|---|
| 0.5 kHz | Right | 11.81 | 11.79 | 11.83 |
| 0.5 kHz | Left | 11.68 | 11.63 | 11.70 |
| 1 kHz | Right | 10.55 | 10.50 | 10.57 |
| 1 kHz | Left | 10.50 | 10.47 | 10.52 |
| 2 kHz | Right | 11.91 | 11.90 | 11.93 |
| 2 kHz | Left | 12.53 | 12.53 | 12.55 |
| 4 kHz | Right | 17.47 | 17.43 | 17.50 |
| 4 kHz | Left | 18.42 | 18.40 | 18.45 |
| 8 kHz | Right | 23.87 | 23.75 | 23.95 |
| 8 kHz | Left | 24.51 | 24.42 | 24.55 |
Table 5.
TRTR baseline performance (10-fold cross-validation).
| Model | CV AUC (Mean ± SD) | Test AUC | F1 Score |
|---|---|---|---|
| XGBoost | 0.947 ± 0.012 | 0.956 | 0.895 |
| Random Forest | 0.939 ± 0.015 | 0.945 | 0.867 |
| Gradient Boosting | 0.925 ± 0.015 | 0.931 | 0.846 |
| SVM | 0.921 ± 0.015 | 0.931 | 0.856 |
| Neural Network | 0.918 ± 0.017 | 0.923 | 0.867 |
| KNN | 0.916 ± 0.014 | 0.921 | 0.852 |
| Logistic Regression | 0.883 ± 0.017 | 0.886 | 0.805 |
| Decision Tree | 0.785 ± 0.016 | 0.811 | 0.823 |
| Mean | 0.904 ± 0.015 | 0.913 | 0.851 |
Table 6.
TSTR performance: discriminative fidelity.
| Model | KDE AUC | KDE Ratio | VAE AUC | VAE Ratio |
|---|---|---|---|---|
| Logistic Regression | 0.851 | 96.1% | 0.878 | 99.1% |
| SVM | 0.747 | 80.2% | 0.818 | 87.9% |
| Random Forest | 0.665 | 70.3% | 0.795 | 84.1% |
| Decision Tree | 0.488 | 60.2% | 0.687 | 84.7% |
| KNN | 0.678 | 73.6% | 0.769 | 83.5% |
| Gradient Boosting | 0.562 | 60.4% | 0.803 | 86.3% |
| Neural Network | 0.708 | 76.7% | 0.835 | 90.5% |
| XGBoost | 0.602 | 63.0% | 0.770 | 80.6% |
| Mean | 0.663 | 72.6% | 0.794 | 86.3% |
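The ratio columns appear to express each TSTR AUC as a percentage of the corresponding TRTR test AUC in Table 5; tiny discrepancies against the printed percentages presumably come from the unrounded AUCs used in the paper. A one-line check for the logistic regression row:

```python
def utility_ratio(tstr_auc, trtr_auc):
    """TSTR AUC as a percentage of the TRTR baseline test AUC."""
    return 100 * tstr_auc / trtr_auc

# Logistic regression: KDE TSTR AUC 0.851 vs TRTR test AUC 0.886 (Table 5)
kde_ratio = utility_ratio(0.851, 0.886)   # about 96%, matching the table
```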
Table 7.
Extended validation results.
| Metric | KDE | VAE |
|---|---|---|
| TRTS Mean AUC | 0.673 | 0.907 |
| TSTS Mean AUC | 0.866 | 0.991 |
| SHAP Rank Difference | 6.6 | 3.7 |
Table 8.
Spearman correlation differences (Δρ, synthetic minus real) for feature pairs relevant to SHAP ranking divergence. KDE attenuates inter-metal correlations, inflating independent feature importance. VAE amplifies Age-paired correlations, elevating BMI and Blood Pb. Bold values indicate the largest correlation shifts for each method.
| Feature Pair | Real ρ | KDE ρ | KDE Δρ | VAE ρ | VAE Δρ |
|---|---|---|---|---|---|
| **VAE-amplified (Age-paired correlations)** | | | | | |
| Age–BMI | 0.222 | 0.118 | −0.104 | 0.523 | **+0.301** |
| Age–Blood Pb | 0.369 | 0.109 | −0.260 | 0.531 | +0.162 |
| **KDE-attenuated (inter-metal correlations)** | | | | | |
| Urine Mo–Urine Sb | 0.489 | 0.023 | **−0.467** | 0.498 | +0.009 |
| Urine Cs–Urine As | 0.524 | 0.107 | −0.417 | 0.562 | +0.038 |
| Urine Cs–Urine Sb | 0.436 | 0.021 | −0.415 | 0.532 | +0.096 |
Table 9.
Membership inference attack results.
| Metric | KDE | VAE |
|---|---|---|
| Attack Success Rate | 52.3% | 53.1% |
| Attack AUC | 0.523 | 0.531 |
| Random Baseline | 50.0% | 50.0% |
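A distance-based membership inference attack of the kind summarised above can be sketched as follows. This is an illustrative reconstruction, not the attack used in the study: all three datasets are drawn from the same Gaussian, so the attack performs near the 50% random baseline by construction, which is the outcome the table reports for both generators.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Illustrative data: "members" stand in for the generator's training
# records, "non_members" for records the generator never saw.
members = rng.normal(size=(100, 10))
non_members = rng.normal(size=(100, 10))
synthetic = rng.normal(size=(500, 10))

def nearest_distance(records, reference):
    """Euclidean distance from each record to its nearest reference row."""
    d = np.linalg.norm(records[:, None, :] - reference[None, :, :], axis=-1)
    return d.min(axis=1)

# Attack score: records closer to the synthetic data are guessed members
scores = -np.concatenate(
    [nearest_distance(members, synthetic),
     nearest_distance(non_members, synthetic)]
)
labels = np.concatenate([np.ones(100), np.zeros(100)])
attack_auc = roc_auc_score(labels, scores)  # near 0.5: no membership leakage
```

An attack AUC close to 0.5, as in Table 9, means the adversary cannot distinguish training members from non-members better than chance.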