1. Introduction
The use of Machine Learning (ML) in healthcare is increasingly dependent on the availability of high-quality data. However, access to sufficiently large and representative datasets remains a persistent challenge due to privacy constraints, ethical considerations, and the high cost of medical data collection. Moreover, healthcare data often suffers from class imbalance, where rare but clinically critical outcomes are underrepresented, making it difficult for ML models to learn robust and generalizable decision rules. These challenges highlight the need for alternative approaches that enable the development of reliable predictive models while preserving patient confidentiality and ensuring statistical validity.
Synthetic data generation has emerged as a powerful solution to these limitations. By creating artificial data that replicates the statistical properties of real-world datasets, synthetic data offers a way to overcome scarcity, mitigate class imbalance, and reduce the risk of overfitting. In contrast to simple data augmentation, synthetic data generation maintains both the marginal distributions of variables and the complex interdependencies between them, which are crucial for healthcare applications. Preserving these relationships ensures that ML models can capture meaningful patterns without compromising the integrity of predictions. Recent advances in generative modelling, including probabilistic frameworks, Bayesian networks, and neural-based approaches, have further strengthened the ability to produce synthetic data that is both realistic and diverse [1,2].
In the specific context of patient triage, these challenges are particularly pronounced. Triage datasets are often small, as they rely on structured questionnaires or medical assessments collected under resource-constrained conditions. For this study, the initial dataset comprised 2125 responses from a triage questionnaire aimed at assessing patient care priority. While valuable, this dataset was insufficient to train models capable of generalizing to diverse real-world scenarios. To address this, we employed synthetic data generation as a central step in our methodology, ensuring the expansion of the dataset while maintaining its statistical integrity.
The proposed approach is guided by three key objectives. First, preservation of complex relationships between variables such as age, marital status, education level, perceived health, and beliefs about triage criteria, as these interdependencies critically affect predictive accuracy. Second, mitigation of overfitting, by expanding the dataset and providing greater variability in training samples, thereby enhancing generalization capacity. Third, support for model robustness, enabling ML models to handle diverse patient profiles and improve decision reliability in clinical practice. Together, these objectives ensure that the synthetic dataset not only mimics real-world variability but also serves as a reliable foundation for predictive modelling in sensitive healthcare applications.
To achieve these goals, we implemented the Synthetic Data Vault (SDV) framework, which combines advanced probabilistic modelling and ML to generate synthetic data that faithfully reflects the statistical structure of the original dataset [3,4,5]. The SDV approach was selected for its ability to automatically detect variable types, capture complex correlations, and scale efficiently to large volumes of synthetic data. By leveraging Gaussian Copula models within SDV, we ensured that the generated data preserved both distributions and dependencies, enabling the training of robust models for binary and multi-class triage classification.
Finally, to address the problem of class imbalance, we explicitly integrated data balancing into the generation process. Two classification schemes were considered: a binary model distinguishing between low- and high-priority patients, and a multi-class model offering finer stratification across four priority levels. In both cases, synthetic data was generated to balance class distributions, reducing bias and enhancing predictive performance. The statistical similarity between real and synthetic data was evaluated using a comprehensive set of descriptive measures (e.g., mean, standard deviation, skewness, kurtosis), complemented by visual inspection of distributions. Iterative refinement ensured that the synthetic dataset aligned closely with the original data, thereby providing a statistically consistent yet diversified foundation for predictive modelling.
This work proposes and evaluates a synthetic data generation methodology tailored to healthcare triage applications. By combining SDV-based generative models with explicit class balancing, we demonstrate how synthetic data can enhance the development of robust ML models in both binary and multi-class classification tasks. The contributions of this study are threefold: (i) a systematic methodology for generating and validating synthetic healthcare data that preserves statistical properties and inter-variable relationships; (ii) the application of SDV and Gaussian Copula modelling to expand small triage datasets while addressing class imbalance; and (iii) a demonstration of the effectiveness of synthetic data in supporting reliable ML-based decision making for patient prioritization.
2. Related Work
In healthcare, clinical datasets are often imbalanced, and societal data eliciting support for the criteria that should guide patient prioritization are frequently imbalanced and biased as well. Synthetic data generation therefore offers a promising alternative: it preserves the statistical patterns of real data without exposing individual records [6]. In this context, synthetic datasets accurately mimic joint distributions and structural dependencies, enabling simulation, behavioural modelling and algorithm development when real samples are scarce or legally restricted [7,8]. Synthetic data are used to address (i) data scarcity; and (ii) privacy and regulation.
Modern ML and deep learning algorithms require large amounts of data to achieve robust performance. In medicine, the rarity of certain conditions and logistical or ethical hurdles in data collection limit sample size. Synthetic generation enlarges training corpora while retaining key statistical properties, boosting predictive accuracy even in small-sample scenarios [8]. In turn, health data are highly sensitive and regulated [9]. Even with anonymisation, re-identification risks remain. High-fidelity synthetic data, decoupled from real patients, facilitate research and innovation within ethical and legal boundaries [6,8].
Access to high-quality and diverse healthcare data is often limited due to privacy concerns, regulatory constraints, and the rarity of certain clinical scenarios. To address these challenges, synthetic data generation has emerged as a valuable technique for augmenting datasets, enabling model training, testing, and validation without compromising patient confidentiality. Early work relied on classical statistical techniques; recent years have seen a shift toward ML-based generative models capable of learning complex variable dependencies and producing novel combinations (“sampled zeros”) [6,7,10]. These techniques aim to produce realistic and representative data that preserve the statistical properties of the original dataset, thereby supporting robust and generalizable ML models for patient prioritization and other healthcare tasks.
Bayesian Networks (BN) encode conditional relationships through directed acyclic graphs. Sampling from the learned joint distribution allows the generation of synthetic records. Despite their interpretability, they become computationally infeasible as dimensionality increases, and they struggle to model non-linear dependencies, compromising both data quality and privacy [10,11]. Limitations include (i) exponential growth in learning complexity with the number of variables [3]; (ii) poor representation of highly non-linear relationships [11]; and (iii) a tendency to memorize real data, putting privacy at risk [11].
Neural generative models such as Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN) have gained prominence. VAEs map data to a latent space through variational inference and are effective with continuous variables, including longitudinal and eye-tracking datasets [6,7]. However, they struggle with categorical attributes and require significant computational resources for large datasets [4,11]. GANs, on the other hand, combine a generator and a discriminator in an adversarial training dynamic. For tabular data, variants such as the Conditional Tabular GAN (CTGAN) apply conditional techniques to handle categorical variables and class imbalance, improving synthetic fidelity and predictive performance when combined with real data [4,8].
Despite their potential, neural models face significant challenges in sensitive domains such as healthcare. They often operate as black boxes, making it difficult to trace and justify the generation process, which harms transparency and regulatory compliance [11]. Additionally, problems such as “mode collapse”, where the generator produces low-diversity data, can result in synthetic datasets that omit rare but clinically significant patterns [11]. These models impose a considerable computational burden, require careful hyperparameter tuning, and are sensitive to training conditions, leading to long training times and an increased risk of runtime errors when working with large datasets [3,4,11].
Synthia is an open-source multidimensional synthetic data generator implemented in Python (version 3.13) [12]. It uses copula-based models, which capture the statistical properties of the observed data in terms of both individual variable behaviour and the interdependencies between variables [12]. Synthia is further supported by Functional Principal Component Analysis (FPCA), an extension of principal component analysis in which the data consist of functions rather than vectors [12]. Synthia is a powerful tool for generating multidimensional synthetic data while preserving complex relationships. However, its applicability is limited to offline processing, and it may have shortcomings in accurately replicating the variable types and distribution functions of the original data [13].
SDV emerges as a robust and modular solution based on copula theory, specifically designed for tabular data. A Gaussian copula is a function that couples a multivariate distribution function to its univariate margins, describing the dependency structure through a multivariate normal distribution applied to transformed uniform variables [10,14]. Univariate marginals are modelled with Gaussian Mixture Models (GMM), probabilistic models composed of multiple Gaussian components; the marginals are then linked through a multivariate normal distribution after the data are transformed using Empirical Cumulative Distribution Functions (ECDF) [10,14]. Sklar’s Theorem ensures that any multivariate distribution can be decomposed into its marginal distributions and a copula, enabling the reuse of marginals across domains [6,14].
The SDV framework offers several advantages that make it particularly suitable for generating healthcare data. First, it demonstrates high statistical fidelity and clinical realism by preserving the structural and statistical properties of the original dataset, including relationships between tables and columns [3,4]. Additionally, SDV ensures the preservation of variability and the inclusion of rare cases, and it eliminates the need to choose a specific copula family, providing flexibility to capture complex dependencies [15].
Given the limitations pointed out in traditional synthetic data generation methods, SDV emerges as a robust and highly adaptable solution. From a privacy perspective, SDV offers enhanced protection, since its probabilistic approach prevents memorization of real data points, promoting an implicit form of differential privacy [4,10]. Furthermore, SDV is recognized for its computational efficiency and scalability, processing large datasets with reduced runtime and greater stability compared to models like GAN and VAE [3,4]. Lastly, among the models included in the SDV library, CopulaGAN and GaussianCopulaSynthesizer are particularly noteworthy for their capacity to model complex dependencies in tabular data. CopulaGAN combines the adversarial training dynamics of GANs with copula-based statistical modelling [5,10,15]. Copulas are multivariate functions that separate marginal distributions from the dependency structure, enabling the model to better capture non-linear relationships between variables, an essential feature when working with heterogeneous healthcare data [16]. The GaussianCopulaSynthesizer, on the other hand, simplifies this approach by assuming a Gaussian copula structure: it transforms variables into a standard normal space using probability integral transforms and models their joint distribution with a multivariate Gaussian copula. This model has demonstrated high effectiveness in reproducing realistic synthetic datasets while maintaining the integrity of variable correlations and supporting both continuous and categorical data [3,5,15].
Although GAN, VAE, Synthia, and BN have merits in specific domains, SDV represents a more robust, explainable, and efficient alternative for synthetic data generation in sensitive contexts such as healthcare [6]. Within the SDV framework, the GaussianCopulaSynthesizer was selected over CopulaGAN due to its stronger statistical grounding, greater transparency, and lower computational complexity [5]. While CopulaGAN leverages adversarial training to capture complex non-linear dependencies, it also introduces challenges such as increased training instability and reduced interpretability. In contrast, the GaussianCopulaSynthesizer, by assuming a Gaussian copula structure, provides a simpler yet effective model that preserves variable correlations and supports mixed data types with high fidelity [3,4,5]. This balance of rigour, explainability, and scalability makes SDV, and specifically the GaussianCopulaSynthesizer, the ideal choice for this study, enabling the generation of realistic, privacy-preserving synthetic data suitable for healthcare predictive modelling [3,4,5].
4. Experiments and Results
To support the development of robust and generalizable ML models, synthetic data was generated to expand the training dataset, mitigate overfitting, and address the imbalance across classes. Several generation strategies were explored, each aiming to refine the synthetic dataset in a way that preserved the statistical properties of the original data while improving class balance.
Fidelity metrics, including mean, standard deviation, skewness, and kurtosis, were computed separately for each variable in both the real and synthetic datasets. The absolute differences between corresponding metrics were then averaged across all variables to obtain a global fidelity score. This method allows for a fine-grained evaluation of the statistical similarity between real and synthetic data distributions.
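The fidelity score just described can be sketched as follows. The helper names are ours, and the two-column example dataset is hypothetical; a real evaluation would run over every shared variable of the real and synthetic tables.

```python
import statistics

def moments(values):
    """Mean, standard deviation, skewness, and excess kurtosis of one variable."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    skew = sum(((v - mean) / sd) ** 3 for v in values) / n
    kurt = sum(((v - mean) / sd) ** 4 for v in values) / n - 3
    return mean, sd, skew, kurt

def fidelity_score(real, synthetic):
    """Average absolute difference of the four moments across all shared variables.
    Lower is better; 0 means identical moments."""
    diffs = []
    for var in real:
        diffs.extend(abs(r - s)
                     for r, s in zip(moments(real[var]), moments(synthetic[var])))
    return sum(diffs) / len(diffs)

# Hypothetical example: identical columns yield a perfect (zero) score.
real = {"age": [30, 40, 50, 60, 70], "dec1": [1, 2, 2, 3, 5]}
print(fidelity_score(real, real))  # 0.0
```

Averaging per-variable moment differences into one number keeps the evaluation fine-grained (each variable contributes its own deviations) while still producing a single global figure to compare generation strategies.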
4.3. Binary Scenario
Refined Synthetic Data. The final approach focused on precisely addressing the inherent class imbalance by controlling the number of synthetic instances generated per class. Unlike previous strategies, this method did not simply replicate the global distribution of the original dataset or apply balancing as a preliminary step. Instead, it adopted a targeted strategy, determining the exact number of synthetic cases needed for each class so that, when combined with the original real data, the final dataset would total 100,000 instances with fully balanced class representation.
This refined approach was applied to both the binary and multi-class classification settings, as it proved to be the most effective and robust strategy for producing high quality synthetic data. Its goal was not only to correct imbalances but also to provide a strong, generalizable foundation for training ML models with fair exposure to all class scenarios.
Data Distribution. A refined synthetic data generation strategy was implemented that explicitly considered class balance while still being grounded in the original imbalanced real data; it was applied to both the binary and multi-class cases. The goal was to ensure that the generated dataset reflected the real-world distribution while also providing a sufficient number of cases for each class to train the models effectively. A crucial aspect of validating the synthetic data is comparing its distribution with that of the real data. To ensure that the generated data reflects the statistical characteristics of the real dataset, a thorough comparative analysis was conducted in which each variable was compared using statistical metrics.
Table 3 compares the distribution of all variables between the original and synthetic datasets for the third approach. In this strategy, different amounts of synthetic data were generated per class to counteract the real dataset’s imbalance, ensuring that after combining both sources, each class was equally represented.
Under this refined approach, real and synthetic data show high alignment across most variables. This method successfully maintains the structural integrity of all variables while introducing a controlled level of variability that strengthens the dataset’s robustness. For instance, the age distribution in the synthetic data, although slightly shifted towards older individuals, broadens the variability without distorting the overall population profile. This is particularly valuable, as it ensures representation of a wider range of patient scenarios, enriching model generalization.
Similarly, gender, marital status, and employment status show strong consistency between real and synthetic data, with only marginal statistical deviations that fall within acceptable tolerance. The education level variable also exhibits nearly perfect alignment, showing that categorical structures are reliably replicated. The minor differences observed reduce the risk of overfitting to narrow real-world patterns while preserving the realism needed for credible decision support models. Ranges remain consistent, and categorical levels are fully preserved, ensuring no loss of semantic meaning or data integrity.
In summary, the synthetic dataset successfully replicates the key statistical characteristics of the original data while enhancing balance and diversity.
Table 4 compares the distributions of the decision-related variables (DEC1 to DEC5) between the real dataset and the synthetic data generated under the refined synthetic data approach. This approach focuses on generating synthetic samples in a way that, when combined with the original data, results in balanced class representation across the target variable while maintaining the distributional characteristics of the predictors.
The statistical comparison across the five decision variables indicates that the synthetic data captures key distributional characteristics present in the real dataset. For all variables, the medians and inter-quartile ranges are preserved, reflecting consistency in the core distribution shape. However, the synthetic data exhibits slightly lower means across all variables, coupled with higher standard deviations. This indicates a broader spread and a modest shift in central tendency, potentially reflecting the increased representation of lower values in the synthetic samples. Despite these variations, the alignment in median and quartile values supports the conclusion that the synthetic data remains representative of the real data’s structure. This validates the suitability of the generated samples for downstream tasks, while fulfilling the goal of class balancing imposed by the refined synthetic data strategy.
Data Balancing. The method began by analyzing the number of real cases in each class (275 in class 0 and 2800 in class 1). Based on this, the appropriate number of synthetic samples was generated individually for each class to ensure that both reached equal representation in the final dataset. This means generating proportionally more data for the underrepresented class (class 0) and fewer for the dominant class (class 1), correcting the existing bias without distorting the statistical properties of the original data.
A total of 49,725 synthetic instances were generated for class 0 and 47,200 for class 1. When combined with the original real data, each class reached exactly 50,000 instances, resulting in a perfectly balanced final dataset. This equal class distribution provides a solid foundation for developing classification models, enabling the algorithm to learn how to predict patient priority levels without being affected by the structural bias present in the original dataset. In addition to improving class balance, this controlled generation ensured that each synthetic record adhered to the learned statistical relationships and logical consistency of the original dataset. By doing so, the model avoided common pitfalls associated with naive oversampling, such as redundancy or data artefacts.
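The per-class arithmetic described above is simple enough to state as code. This is a sketch of the computation only, with function names of our choosing; the class counts are the figures reported in the text.

```python
def synthetic_counts(real_counts, target_total):
    """How many synthetic instances to generate per class so that, combined with
    the real data, every class holds an equal share of target_total."""
    per_class = target_total // len(real_counts)
    return {cls: per_class - n_real for cls, n_real in real_counts.items()}

# Binary scenario figures from the text: 275 real cases in class 0, 2800 in class 1.
need = synthetic_counts({0: 275, 1: 2800}, target_total=100_000)
print(need)  # {0: 49725, 1: 47200}
```

Each class then totals exactly 50,000 instances once the real records are added back, matching the balanced dataset described above.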
4.4. Multi-Class Scenario
Refined Synthetic Data. Following the success of the refined balancing strategy in the binary classification scenario, the same methodology was extended to the multi-class configuration. Given the more complex distribution of classes in this scenario, with varying levels of under-representation, a careful and systematic approach was adopted to ensure an even class distribution while preserving the statistical integrity of the original data. Rather than applying a generic oversampling method or maintaining the natural imbalance, this refined approach aimed to generate the exact number of synthetic instances needed per class. The objective was to ensure that, when combined with the real dataset, each of the four classes would be equally represented, resulting in a final dataset composed of 100,000 balanced instances, 25,000 per class.
Data Distribution. This scenario involved four distinct priority levels: non-priority, low priority, priority, and high priority. The synthetic data was carefully generated to ensure equal representation across these categories, with each class totalling 25,000 instances after combining real and synthetic records. Before moving forward with the evaluation of individual variables, it was essential to validate the overall distribution of the synthetic data. A comparative distribution analysis was performed between the real and synthetic data across all variables, highlighting how well the synthetic data replicates the structure of the real dataset in this multi-class setting. This analysis served as a critical step in verifying the fidelity and reliability of the synthetic dataset. Particular attention was given to variables that influence classification performance, ensuring that the synthetic data maintained both statistical consistency and semantic realism.
Table 5 presents a comparison of the distribution for the variables between the real dataset and the synthetic data generated under the final multi-class balancing approach. The table provides an overview of key distributional statistics to assess the fidelity of the synthetic data in replicating the structure of the original dataset.
The refined synthetic dataset under the multi-class scenario shows strong alignment with the real data while introducing small variations that enhance representativeness. The age variable presents a slightly higher mean and skewness, capturing a broader spread of cases and improving balance across groups.
Gender, marital status, and education level remain almost identical to the real data, preserving structural fidelity. For employment status, the synthetic data introduces slightly higher variability, which is beneficial for generalization. Overall, this refined approach ensures both realism and controlled heterogeneity, making it especially suited for robust multi-class modelling.
Table 6 presents a comparative summary of the statistical distribution of five decision-related variables (DEC1–DEC5) between the real dataset and the synthetically generated dataset, both derived under the final multi-class balancing strategy. This evaluation is crucial to assess whether the synthetic data can reliably mimic the key distributional characteristics of decision-making variables found in the original dataset, which are central to downstream modelling and interpretation.
The descriptive statistics demonstrate that the synthetic data maintains a close approximation to the real dataset across all five decision variables. The mean and median values are generally consistent, with DEC1 and DEC2 showing slightly higher central tendency in the real data, while the synthetic data exhibits marginally increased dispersion as reflected in higher standard deviations for all variables.
The general shape and symmetry of the distributions are preserved, and the synthetic dataset offers a faithful representation of the real decision variables, supporting its adequacy for subsequent analytical tasks within the multi-class framework.
The statistical comparison confirmed that the synthetic data is a highly reliable representation of the real dataset. The preservation of means, medians, and quartiles of the variables highlights the model’s ability to faithfully replicate central tendency and variability, ensuring that the synthetic data reflects the essential statistical structure of the real dataset. While minor differences exist, they do not significantly impact the overall validity of the dataset. The high degree of similarity in statistical metrics suggests that the synthetic data can be confidently used for analysis and modelling, maintaining the essential characteristics of the original real-world data.
Data Balancing. The process began with an analysis of the real data distribution across the four priority levels: class 0 (non-priority), class 1 (low priority), class 2 (priority), and class 3 (high priority). Based on the number of real instances in each class, synthetic data was generated in a targeted manner to complement the deficit in each category: 24,955 synthetic samples for class 0, 24,253 for class 1, 23,071 for class 2, and 24,637 for class 3. When combined with the respective real data, these additions brought each class to exactly 25,000 instances. This resulted in a fully balanced multi-class dataset with a total of 100,000 examples, ensuring fair and equal representation of all priority levels for model training.
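As a consistency check, the real class counts implied by these figures can be recovered directly (25,000 per class minus the synthetic counts reported above); the dictionary below only restates the numbers given in the text.

```python
# Synthetic instances generated per class, as reported in the text.
synthetic = {0: 24_955, 1: 24_253, 2: 23_071, 3: 24_637}
target_per_class = 25_000

# Real instances per class implied by the stated per-class totals.
real = {cls: target_per_class - n for cls, n in synthetic.items()}
print(real)  # {0: 45, 1: 747, 2: 1929, 3: 363}
print(sum(real.values()) + sum(synthetic.values()))  # 100000
```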
In addition to achieving balance, special care was taken to ensure that the synthetic records remained statistically and semantically coherent. The average values of key decision variables within each class stayed within the expected ranges defined for their respective categories. Variables such as age, political orientation, and service usage patterns remained within realistic and medically plausible boundaries. By applying this refined and controlled synthesis approach, the resulting multi-class dataset provides a robust and fair foundation for machine learning applications, enhancing the model’s ability to learn from all classes equally and generalize effectively across different patient profiles.
Given this overall performance, this approach was selected as the basis for developing the algorithm integrated into the patient triage application, as it proved to be the most effective in terms of distributional fidelity and data balancing. The next step will be to apply machine learning algorithms to this dataset to build and validate the predictive components of the system.
4.6. Discussion
The experimental results demonstrate that the proposed synthetic data generation approach, based on the SDV Gaussian Copula Synthesizer, effectively reproduces the statistical structure of the original healthcare dataset while ensuring data privacy. The analysis of distributional metrics, including mean, standard deviation, skewness, and kurtosis, confirms a high degree of similarity between real and synthetic data, indicating that the generated samples maintain the integrity and variability of the original variables. These findings suggest that the probabilistic, statistically grounded nature of SDV provides a stable and explainable alternative to neural generative models such as GANs and VAEs, which often suffer from training instability and overfitting.
When compared with previous studies (Table 7), the proposed approach extends existing work in several dimensions. Unlike prior GAN-based studies that focused primarily on binary classification and small datasets [3,8], this study successfully handles both binary and multi-class healthcare triage problems. Moreover, the SDV framework enables the expansion of the dataset from 2125 real records to 100,000 total records without compromising statistical fidelity. This scalability highlights the robustness of the probabilistic copula-based modelling approach and its ability to support applications that require larger and more diverse datasets.
From a practical standpoint, these results have important implications for data-driven decision making in healthcare. The ability to generate realistic, privacy-preserving synthetic data enables hospitals, research institutions, and policymakers to conduct advanced modelling and triage simulations without exposing sensitive patient information. This can facilitate the development of predictive tools, improve ethical resource allocation, and support compliance with data protection regulations.
However, several limitations should be acknowledged. First, the present study focused on statistical fidelity and did not include a full evaluation of model performance metrics or statistical distance measures such as the Jensen–Shannon divergence or the Kolmogorov–Smirnov statistic. Second, the dataset used contained primarily demographic and decision-related attributes, which may not fully represent the complexity of clinical data. Looking ahead, the next stage of this research will focus on developing an explainable prioritization platform that leverages the generated synthetic data to support transparent and interpretable triage decision making. This platform will combine data-driven prioritization models with explainability mechanisms, enabling healthcare professionals to better understand and justify the recommendations produced by AI-assisted systems.
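One of the distance measures named above, the two-sample Kolmogorov–Smirnov statistic, is straightforward to add in future evaluations. The sketch below is illustrative only (it is not part of the present study's evaluation) and uses only the standard library.

```python
def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the two
    empirical CDFs. 0 means identical samples; 1 means fully separated ones."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        t = min(a[i], b[j])
        # Advance past all values <= t in both samples so ties are handled together.
        while i < len(a) and a[i] <= t:
            i += 1
        while j < len(b) and b[j] <= t:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0
print(ks_statistic([1, 2, 3], [10, 11, 12]))     # 1.0
```

Applied per variable to the real and synthetic columns, this would complement the moment-based fidelity score with a distribution-wide distance.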
In summary, the proposed SDV-based approach demonstrates that statistically grounded models can generate scalable, high-fidelity, and privacy-respecting synthetic data suitable for healthcare triage applications. Ongoing work will extend this research to include more complex data modalities, quantitative similarity metrics, and access to the implementation upon request, thus reinforcing transparency and reproducibility in synthetic data generation research.
From a privacy perspective, the generation of synthetic data inherently supports data protection principles. Because the SDV Gaussian Copula Synthesizer reproduces only the statistical relationships among variables and not the actual records, no individual information from the original dataset is retained. This mechanism effectively prevents data re-identification, enabling the use of realistic datasets for analysis and model training in compliance with privacy regulations.