Article

Conditional Tabular Generative Adversarial Network Based Clinical Data Augmentation for Enhanced Predictive Modeling in Chronic Kidney Disease Diagnosis

1
Department of Computer Science and Engineering (AI & ML), Dayananda Sagar University, Ramanagara, Bengaluru 562112, Karnataka, India
2
Department of AI & ML, HCA Healthcare, Frederick, MD 21701, USA
3
Department of CSE, Manipal University Jaipur, Jaipur 303007, Rajasthan, India
4
Faculty of Data Science and IT, INTI International University, Nilai 71800, Malaysia
5
Department of Electronics and Communication Engineering, Dayananda Sagar University, Ramanagara, Bengaluru 562112, Karnataka, India
6
Department of Mechatronics Engineering, Manipal University Jaipur, Jaipur 303007, Rajasthan, India
7
Applied AI & Data Science, Brown University, Providence, RI 02912, USA
8
College of Business & Management, Colorado Technical University, Colorado Springs, CO 80907, USA
9
Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal 576104, Karnataka, India
*
Author to whom correspondence should be addressed.
BioMedInformatics 2026, 6(1), 6; https://doi.org/10.3390/biomedinformatics6010006
Submission received: 28 November 2025 / Revised: 30 December 2025 / Accepted: 9 January 2026 / Published: 22 January 2026

Abstract

The lack of clinical data for chronic kidney disease (CKD) prediction frequently results in model overfitting and inadequate generalization to novel samples. This research mitigates this constraint by using a Conditional Tabular Generative Adversarial Network (CTGAN) to augment a small CKD dataset sourced from the University of California, Irvine (UCI) Machine Learning Repository. The CTGAN model was trained to produce realistic synthetic samples that preserve the statistical and feature distributions of the original dataset. Multiple machine learning models, such as AdaBoost, Random Forest, Gradient Boosting, and K-Nearest Neighbors (KNN), were assessed on both the original and augmented datasets with incrementally increasing degrees of synthetic data dilution. AdaBoost attained 100% accuracy on the original dataset, signifying considerable overfitting; however, the model exhibited enhanced generalization and stability with the CTGAN-augmented data. The occurrence of 100% test accuracy in several models should not be interpreted as realistic clinical performance. Instead, it reflects the limited size, clean structure, and highly separable feature distributions of the UCI CKD dataset. Similar behavior has been reported in multiple previous studies using this dataset. Such perfect accuracy is a strong indication of overfitting and limited generalizability, rather than feature or label leakage. This observation directly motivates the need for controlled data augmentation to introduce variability and improve model robustness. The dataset with the greatest dilution, comprising 2000 synthetic cases, attained a test accuracy of 95.27% using a stochastic gradient boosting approach. Ensemble learning techniques, particularly gradient boosting and random forest, consistently surpassed conventional models like KNN in terms of predictive accuracy and robustness. 
The results demonstrate that CTGAN-based data augmentation introduces critical variability, diminishes model bias, and serves as an effective regularization technique. This method provides a viable alternative for reducing overfitting and improving predictive modeling accuracy in data-deficient medical fields, such as chronic kidney disease diagnosis.

1. Introduction

Chronic Kidney Disease (CKD) is a progressive and potentially fatal disorder marked by a slow and permanent decline in renal function. The kidneys are essential biological filters that eliminate toxins and waste from the bloodstream. Their dysfunction results in the buildup of blood toxicity, potentially triggering a slew of other lethal health consequences. The incidence of chronic kidney disease constitutes a major public health issue in India. A recent study conducted by the All-India Institute of Medical Sciences (AIIMS) estimated that 9.94% of the Indian population is afflicted by the condition. India, as a swiftly advancing economy, has a concerning rise in lifestyle-related disorders, many of which directly jeopardize kidney health. The principal risk factors for chronic kidney disease include diabetes, hypertension, and obesity, all of which are becoming progressively prevalent. A meta-analysis conducted by [1] emphasized this association, determining that 27% of persons with hypertension and 31% of adults with diabetes also exhibit chronic kidney disease. The asymptomatic characteristics of chronic kidney disease, especially in its initial phases, pose significant pathological problems. By the time symptoms appear, the disease has frequently advanced to a critical stage. Consequently, routine screening for these lifestyle disorders is essential for early identification. The creation of prediction models through machine learning and data analysis provides an essential solution to this issue, facilitating identification prior to the disease advancing to a severe state. Chronic kidney disease progresses through five stages. Stages 1 and 2, and often early stage 3, are typically asymptomatic and may not cause major clinical complications. Because symptoms usually appear only in later stages, CKD often remains undiagnosed until significant kidney damage has occurred. This delayed detection highlights the importance of early risk identification and monitoring. 
Dialysis is a medical therapy that artificially replicates the vital activities of a healthy kidney, encompassing the filtration of waste, excess fluid, and toxins from the blood, as well as the regulation of electrolyte balance [2].
Individuals with end-stage kidney failure (CKD Stage 5) confront a bleak prognosis, frequently necessitating dialysis several times weekly. This treatment, although life-sustaining, is expensive and significantly reduces the patient’s quality of life. Kidney transplantation, the sole final therapy, is a complex and mostly inaccessible alternative for most patients due to substantial expenses and lengthy waiting lists. Notwithstanding medical breakthroughs, chronic kidney disease has exacerbated the increasing mortality rates in recent years. Although the risk factors and clinical diagnosis of CKD are well established, early-stage CKD is often overlooked because patients usually do not show clear symptoms and may not undergo complete screening. In such cases, machine learning models can assist by combining multiple clinical indicators to identify individuals at higher risk at an earlier stage. The goal of this approach is not to replace clinical judgment but to support early screening and risk stratification, particularly in large-scale or resource-limited healthcare settings. Recent studies have successfully created machine learning models that can diagnose chronic kidney disease using several physiological data, including blood glucose, blood urea, and red blood cell counts. Nonetheless, the effective and extensive implementation of these models is considerably hindered by a critical constraint: the lack of big, demographically varied datasets. Numerous modern studies, like this one, utilize the CKD dataset from the UCI Machine Learning Repository [3], which has merely 400 samples. The clearly defined distribution of this limited dataset facilitates high accuracy in research contexts; nevertheless, it presents a significant difficulty when similar models are implemented in real-world clinical settings, where new data may display divergent trends and greater variability. 
Data augmentation is an effective regularization method employed to mitigate data scarcity in machine learning [4]. It entails producing new data samples by applying transformations to existing data. This methodology is extensively employed in applications like image classification, where new training examples are generated by applying rotations, shears, zooms, and blurring to the original images. This study aligns with the United Nations Sustainable Development Goal 3 (Good Health and Well-Being) by advancing data-driven methods for early detection and risk stratification of chronic kidney disease, thereby supporting timely intervention and improved patient outcomes. Additionally, by addressing data scarcity through synthetic data generation, the proposed approach contributes to SDG 9 (Industry, Innovation, and Infrastructure) by promoting innovative and scalable artificial intelligence solutions for resilient healthcare systems.
The present study employed a methodology to enhance the UCI-CKD dataset, increasing its size from 400 to 2400 samples. The Conditional Tabular Generative Adversarial Network (CTGAN) technique [5] was employed to produce high-quality synthetic tabular data. This study evaluates the effectiveness of the data augmentation approach and compares the performance of different classification models on supplemented datasets. This study aimed to identify the most appropriate models for analyzing CTGAN-synthesized data and to illustrate that this method can alleviate overfitting problems common in data-deficient medical fields.

2. Literature Review

The primary challenge to developing strong models, especially in machine learning and data science, is the arduous data collection procedure. This covers the time-consuming processes of obtaining, cleaning, and organizing the data for analysis. The challenge arises from multiple aspects, including the need to combine data from diverse and heterogeneous sources, as well as intrinsic data quality problems, such as missing values, inconsistencies, and noise. A recent study highlighted that data scientists spend up to 80% of their time on data preparation tasks, illustrating the gravity of this bottleneck. Moreover, many applications like medical imaging or natural language processing require rigorous human annotation and labeling, a procedure that is both expensive and susceptible to inter-annotator discrepancies [6].
Recently, several studies have developed effective models for the accurate prediction of CKD using a variety of techniques. These include PCA [7] and SMOTE [8], which achieved accuracies of 99.2% and 98.86%, respectively. The best-performing model in the former study was XGBoost, whereas it was SVM for the latter. Other techniques include soft voting among ensemble learners [9], achieving an accuracy of 99%. Several of these studies have been conducted using the CKD dataset provided by the UCI Machine Learning Repository [3]. This dataset only has 400 samples, which has been a limitation in all studies that have utilized it. Ref. [9] proposed extending the dataset to improve the resilience of current models, which is the primary focus of this study. Ref. [10] conducted a distinctive investigation in which long-term medical records of patients were amassed and utilized to train diverse sequence models that exhibit proficiency in time-series data analysis. This work, despite the challenging data collection methods, achieved a commendable accuracy of 98% utilizing a combination of multiple sequence and CNN models. Conversely, other experiments utilizing the UCI dataset have achieved accuracy scores of 100%, suggesting the potential for complete overfitting. Refs. [11,12] attained flawless accuracy on this dataset, necessitating regularization to ensure the models derived from this dataset have real-world applicability. Refs. [12,13] performed a comprehensive feature selection procedure to identify the optimal set of features for the predictive task. Various studies have been performed on distinct datasets [13,14,15], with the most recent investigations utilizing patient records from Tawam Hospital in Abu Dhabi. The studies demonstrated good outcomes, with accuracies of 93.29% and 95%, respectively, utilizing the XGBoost model, as illustrated in Table 1. 
These investigations were constrained by the restricted sample sizes in the dataset. The authors of this study contend that several of the analyzed studies may have improved by employing a comprehensive data augmentation process. This may have enhanced the applicability of the models in practical clinical environments. The above discussion indicates that researchers have identified significant findings about model performance, data limitations, overfitting issues, alternative data and methodologies, and other datasets.
The research by [6,8] attained accuracies of 99.2% and 98.86%, respectively, employing XGBoost and SVM. Ref. [9] attained 99% accuracy through soft voting among ensemble learners. The primary constraint identified in most studies is the limited size of the dataset, especially in the UCI CKD dataset. Refs. [6,8] proposed data augmentation as a viable method for improving model resilience. Refs. [11,12] attained 100% accuracy, prompting concerns regarding overfitting and the necessity for regularization to guarantee the applicability of models in real-world contexts. Ref. [10] employed an innovative methodology by aggregating long-term patient medical records to train sequence models, achieving a notable accuracy of 98%. This study, despite the challenges of rigorous data collection, underscores the promise of employing time-series data. Refs. [13,14,15,16,17] utilized a distinct dataset from Tawam Hospital in Abu Dhabi and encountered sample restrictions, although they attained commendable accuracies of 93.29% and 95% with the XGBoost model. Despite numerous studies achieving excellent accuracy in CKD prediction models, their practical use is frequently constrained by insufficient datasets and the consequent risk of overfitting. The UCI CKD dataset, comprising only 400 samples, exemplifies this issue effectively [18,19,20]. This study employs synthetic data extension through CTGAN to address data scarcity, in contrast to earlier works constrained by tiny, fixed data sets. To address this issue, researchers must investigate alternate data sources and, crucially, employ data augmentation techniques like SMOTE to enhance dataset size and improve model generalizability [20,21,22,23,24].
Table 1. Summary of studies from literature and their critical observations on chronic kidney disease (CKD).
| Study | Best Performing Model | Dataset | Accuracy | Limitations |
| --- | --- | --- | --- | --- |
| [6] | XGBoost | UCI’s CKD dataset | 99.2% | Black box nature of tree-based models, limited dataset |
| [8] | Linear Support Vector Machines | UCI’s CKD dataset, TCIA dataset | 98.86% | Small datasets, limited exploration of different models |
| [9] | Soft Voting | UCI’s CKD dataset | 99% | Limited dataset |
| [10] | Ensemble of CNN-Adamax, LSTM-Adam, LSTM-BLSTM | Taiwan’s NHIRD database | 98% | Strenuous data collection process, computational complexity because of numerous features |
| [11] | Random Forest, AdaBoost | UCI’s CKD dataset | 100% | Black box nature of tree-based models, limited dataset, perfect overfitting |
| [12] | Random Forest, Decision Tree, Gradient Boost, XGBoost | UCI’s CKD dataset | 100% | Black box nature of tree-based models, limited dataset, perfect overfitting |
| [13] | XGBoost | Tawam Hospital’s medical records | 93.29% | Limited dataset |
| [14] | XGBoost | Tawam Hospital’s medical records | 95% | Limited dataset |
| [22] | Random Forest Classifier | UCI’s CKD dataset | 96% | Black box nature of tree-based models, limited dataset |
| [25] | Random Forest | UCI’s CKD dataset | 99.16% | Black box nature of tree-based models, limited dataset |
| [26] | Multiclass Decision Forest | UCI’s CKD dataset | 99.1% | Black box nature of tree-based models, limited dataset |
| [27] | Decision Tree | UCI’s CKD dataset | 91% | Black box nature of tree-based models, limited dataset |
| [28] | Support Vector Machines | UCI’s CKD dataset | 99.3% | Limited dataset |

3. Methodology

This research outlines a systematic approach for a chronic kidney disease (CKD) investigation, as depicted in Figure 1. This procedure aims to tackle the difficulties associated with a constrained dataset and emphasizes the mitigation of overfitting to guarantee the model’s practical relevance. The process commences with a comprehensive data examination and sanitization of the dataset acquired from the UCI Machine Learning Repository. This dataset is a publicly available benchmark dataset provided by the University of California, Irvine (UCI) Machine Learning Repository. It contains clinical and laboratory information collected from patients for the purpose of CKD classification. The dataset includes 400 patient records with demographic details, blood and urine test results, and a binary class label indicating the presence or absence of chronic kidney disease. Due to its limited size and clean structure, this dataset is widely used to evaluate and compare machine learning models for CKD prediction. The dataset comprised 400 samples, 24 characteristics (13 categorical and 11 numerical), and exhibited an uneven target feature distribution (250 positive cases for CKD and 150 negative cases).
The dataset utilized for this investigation, as presented in Table 2, was sourced from the UCI Machine Learning Repository [3] and is labeled “chronic kidney disease.” The dataset had 400 instances, encompassing 24 attributes and one target attribute. Of the 24 features, 13 were categorical (discrete) and 11 were numerical (continuous). Out of 400 samples, 250 tested positive for chronic kidney disease, while 150 tested negative. Multiple preprocessing strategies were employed on this dataset to address null and missing values, which are elaborated upon in Section 3.1. Prior research has examined the efficacy of selecting a subset of the 24 variables that contributes most to predictive accuracy. That approach saves time and resources and reduces the need for comprehensive model testing. In this study, all features were retained so that the augmented dataset would be more robust and less prone to overfitting than the original dataset. Future research may explore the feasibility of selecting a small subset of the feature set.

3.1. Data Processing

Table 3 highlights the substantial occurrence of absent values in the original dataset. The straightforward method of eliminating these cases was dismissed due to its potential to further diminish the already constrained dataset size and significantly jeopardise the retention of crucial distributional attributes of the data, as illustrated in Figure 2. This loss may have resulted in simplistic models with restricted practical applications. A hybrid imputation technique integrating mean/mode and random sampling was employed to address this issue, consistent with standard practices in tabular data preprocessing. After imputation, all features were converted to guarantee they have the requisite data types for modeling [29].
All numerical features were transformed into the float data type to facilitate precise calculations and mitigate the potential loss of numerical accuracy in the following modeling phases. Missing values in numerical columns were rectified by random sampling, wherein a random existing value from the same feature was chosen to substitute the null item.
Categorical features were addressed through a two-step methodology: features exhibiting a significant number of missing values, namely red_blood_cells and pus_cells, were imputed via random sampling, whereas the other categorical features were filled using mode imputation, substituting nulls with the most prevalent value in that feature. Given that all categorical variables were binary, the LabelEncoder from the sklearn library was utilized to transform each category into a numerical representation of either 0 or 1. Figure 2 illustrates a graphical representation of these preprocessing operations.
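The imputation and encoding steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the toy values, the helper names `random_sample_impute` and `mode_impute`, and the random seed are assumptions, and only three columns are shown.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def random_sample_impute(series: pd.Series, rng: np.random.Generator) -> pd.Series:
    """Replace each null with a value drawn at random from the observed values."""
    filled = series.copy()
    missing = filled.isna()
    observed = filled[~missing].to_numpy()
    filled[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled


def mode_impute(series: pd.Series) -> pd.Series:
    """Replace nulls with the most frequent observed value."""
    return series.fillna(series.mode().iloc[0])


# Toy frame: column names follow the UCI CKD naming, values are made up.
df = pd.DataFrame({
    "hemoglobin": [15.4, np.nan, 9.6, 11.2, np.nan, 12.1],
    "red_blood_cells": ["normal", np.nan, "abnormal", np.nan, "normal", "normal"],
    "hypertension": ["yes", "no", np.nan, "no", "no", "yes"],
})

rng = np.random.default_rng(42)  # hypothetical seed, for reproducibility only

# Numerical feature: cast to float, then random-sample imputation.
df["hemoglobin"] = random_sample_impute(df["hemoglobin"].astype(float), rng)
# Categorical feature with many nulls: random-sample imputation.
df["red_blood_cells"] = random_sample_impute(df["red_blood_cells"], rng)
# Remaining categorical features: mode imputation.
df["hypertension"] = mode_impute(df["hypertension"])

# Binary categorical features become 0/1 (classes are sorted alphabetically).
for col in ["red_blood_cells", "hypertension"]:
    df[col] = LabelEncoder().fit_transform(df[col])
```

Random sampling draws replacements from the observed values, so the imputed column keeps roughly the same empirical distribution, whereas mode imputation concentrates mass on one value. That is presumably why the columns with many nulls were handled by sampling.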

3.2. Data Augmentation

Data augmentation denotes the procedure of creating novel data samples from a pre-existing dataset. This strategy is utilized when accessible data are deficient for constructing an effective model. This topic is the focus of the study, making it a critical component of the technique. Generative Adversarial Networks (GAN), initially introduced by [16], are a category of machine learning models grounded in neural networks. GANs utilize an adversarial training methodology that involves two competing neural networks, the Generator and the Discriminator, engaged in a minimax game framework, where a victory for one network results in a defeat for the other. The models underwent training in an unsupervised learning framework through multiple iterations until the generator achieved the ability to produce samples that deceived the discriminator 50% of the time. Generative Adversarial Networks (GANs) are commonly employed alongside convolutional neural networks for image production purposes. The GAN was conceived as a generative model for the creation of realistic visuals by assimilating the characteristics of its training dataset. The authors [17] trained the model using a combination of datasets, encompassing handwritten numbers, human faces, and animal photos. Since their introduction, GANs have been repurposed for diverse applications, including the modeling of clinical data, which is pertinent to our work [20]. This was accomplished using an upgrade of GANs referred to as the Conditional Tabular GAN. Conditional Tabular Generative Adversarial Networks (CTGAN) are a type of generative model designed specifically for tabular data that contain both numerical and categorical variables. Unlike standard GANs, CTGAN can handle imbalanced classes and complex feature distributions commonly found in clinical datasets. 
By conditioning the data generation process on selected variables, such as the target class label, CTGAN is able to generate realistic synthetic samples that preserve important statistical relationships present in the original data. This makes CTGAN particularly suitable for medical datasets, where data are often limited, heterogeneous, and sensitive to class imbalance. A concise overview of GAN and its constituent components is provided below and illustrated in Figure 3.
  • Generator: The generator network, typically composed of multiple neural network layers, is tasked with generating fake samples that resemble the original data as closely as possible. More formally, Generator learns a distribution pg over the training data x. The generator initially takes random noise z as input sampled from a prior distribution pz(z). It then learns to model this noise into samples that closely resemble the distribution of training data.
  • Discriminator: The discriminator is a binary classifier that distinguishes between real and synthetic data. As training progresses, the discriminator becomes better at distinguishing between samples from the original dataset and those synthesized by the generator. Formally, D(x) represents the probability that x came from the training data rather than from pg. D is trained to assign low probabilities to samples from pg and high probabilities to samples from the real data.

Adversarial Process

During training, both the generator (G) and discriminator (D) iteratively refined their performance. A GAN is considered effective when the generator can produce synthesized samples that the discriminator classifies as genuine with a probability approaching 50%. This equilibrium point signifies that G generates data that are indistinguishable from the real dataset. Training is frequently terminated at this point using techniques such as early stopping. Formally, the Generator and Discriminator engage in the following minimax game defined by the value function $V(D, G)$, as shown in Equation (1):

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{1}$$
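To make the value function concrete, the sketch below estimates it from a handful of discriminator outputs. The probabilities are invented for illustration, and `gan_value` is a hypothetical helper rather than part of any library.

```python
import numpy as np


def gan_value(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Sample estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples.
    d_fake: discriminator outputs D(G(z)) on generated samples.
    """
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))


# A discriminator that separates real from generated samples well.
confident = gan_value(np.array([0.90, 0.95, 0.85]), np.array([0.10, 0.05, 0.20]))

# The equilibrium described in the text: D outputs 0.5 for every sample.
equilibrium = gan_value(np.full(3, 0.5), np.full(3, 0.5))

print(confident, equilibrium)
```

At the 50% equilibrium the value equals log(0.5) + log(0.5) = −log 4 ≈ −1.386. The discriminator tries to push the value above this level, while the generator tries to pull it back down.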
To augment the CKD dataset using the GAN model, a prebuilt Python 3.12 implementation of the model was used from the SDV open-source library. This model was chosen due to its built-in quality checks and guardrails, which ensure the generation of high-quality synthetic data. The choice pool also consisted of TabGAN [18]. However, SDV implementation was chosen because of its ease of use and other advantages mentioned previously. The traditional implementation of GANs is designed for generative tasks on images. When a normal GAN is adapted for modeling tabular data, it presents various limitations. These limitations are mixed data types, non-Gaussian distributions, multimodal distributions, learning from sparse one-hot encoded vectors, and imbalanced categorical columns. These limitations are discussed in [5,19]. In the same study, the authors discussed advancement over the traditional GAN architecture to model these limitations. Figure 4 shows the feature correlation matrix of the UCI-CKD dataset and the relationship across clinical variables.
They presented a Conditional Tabular GAN (CTGAN) model that builds upon the traditional approach of a GAN by implementing a conditional generator and training-by-sampling, among other modifications, to deal with imbalanced discrete columns. Some salient features of CTGANs are briefly discussed below.
  • Conditional Generation: CTGAN conditions the generated data according to specific classes with their respective distributions. This enables CTGANs to model relationships across the distributions of various classes and, consequently, generate more realistic synthetic data.
  • Mode-Specific Normalization: In the approach, for each continuous-valued column, all its modes are computed along with their probability densities. Then, for a specific row, the mode with the highest probability distribution is determined for that numerical feature and is represented using a concatenation of a one-hot vector that indicates which mode the row corresponds to and a scalar that is determined by normalizing the value of the feature using the chosen mode. Therefore, each row is represented as a concatenation of scalars and one-hot vectors, because categorical features are already represented using one-hot vectors.
  • Training-by-sampling: Owing to the imbalances presented in the categorical columns, an accurate representation of the minority classes in the generator’s distribution becomes a challenge. To tackle this, CTGAN employs a method called training-by-sampling, which involves sampling from both the real and conditional generator distributions so that the discriminator becomes adept at effectively estimating the distance between the two samples.
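Mode-specific normalization, as described above, can be illustrated with the following simplified sketch. It is not the SDV implementation: it fits an ordinary Gaussian mixture from scikit-learn, assumes a fixed number of modes, and picks the most probable mode for each value as the text describes. The scaling by four standard deviations follows the CTGAN paper; the function name and the synthetic "blood urea-like" column are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def mode_specific_normalize(values, n_modes=2, seed=0):
    """Encode a continuous column as [scalar, one-hot mode indicator].

    Simplified sketch: a plain Gaussian mixture with a fixed number of
    modes stands in for the mixture model used by CTGAN.
    """
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_modes, random_state=seed).fit(x)

    means = gmm.means_.ravel()                  # one mean per mode
    stds = np.sqrt(gmm.covariances_).ravel()    # one std per mode

    # Pick, for every row, the mode with the highest posterior probability.
    mode = gmm.predict_proba(x).argmax(axis=1)

    # Normalize the value within its chosen mode.
    scalar = (x.ravel() - means[mode]) / (4.0 * stds[mode])
    one_hot = np.eye(n_modes)[mode]
    return np.column_stack([scalar, one_hot])


# Synthetic bimodal column (made-up numbers, not CKD data).
rng = np.random.default_rng(0)
blood_urea_like = np.concatenate([rng.normal(40, 5, 200), rng.normal(150, 20, 100)])
encoded = mode_specific_normalize(blood_urea_like, n_modes=2)
print(encoded[:3])
```

Each row becomes one scalar plus a one-hot vector, so values from very different modes end up on a comparable scale. This is the property that lets the generator model multimodal clinical features.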
Figure 5 illustrates the distribution of key clinical characteristics across the CKD-positive (Class 0) and CKD-negative (Class 1) cohorts, including (a) red blood cell count, (b) specific gravity, (c) blood urea, (d) random blood glucose, (e) hemoglobin, and (f) packed cell volume.

3.3. Model Training for Classification

For any classification task, it is essential to choose models that can effectively model the distribution of the training dataset. This study works with several models that operate on different principles. This was performed to investigate which type of model performs well with synthesized datasets using CTGAN. This is an important aspect of this study. The models that are chosen for the binary classification task are K-Neighbors Classification, Decision Trees, Random Forest Classification, AdaBoost, Gradient Boosting, Stochastic Gradient Boosting, XGBoost, Categorical Boosting, Extra Trees, LGBM, and Support Vector Machines. Presented below are some foundational principles of these models used for classification in this study.
  • K-Nearest Neighbors (KNN): It classifies a data point by computing its distance to the training samples, finding the k closest ones, and assigning the class label that is most common among them. The Manhattan distance between two points p and q with n features is given in Equation (2).

$$\text{Manhattan distance} = \sum_{i=1}^{n} \left| q_i - p_i \right| \tag{2}$$
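As a small illustration of Equation (2) in a nearest-neighbor vote, the sketch below classifies one query point. The two features (hemoglobin and specific gravity), the toy values, and k = 3 are assumptions. Labels follow the document's coding of 0 for CKD and 1 for non-CKD.

```python
from collections import Counter


def manhattan(p, q):
    """Sum of absolute coordinate differences between points p and q."""
    return sum(abs(qi - pi) for pi, qi in zip(p, q))


def knn_predict(train_X, train_y, query, k=3):
    """Majority label among the k training points closest to the query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: manhattan(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training set: (hemoglobin, specific gravity); values are made up.
train_X = [(15.0, 1.020), (14.2, 1.025), (13.8, 1.020),
           (9.1, 1.010), (8.4, 1.010), (10.2, 1.015)]
train_y = [1, 1, 1, 0, 0, 0]

print(manhattan((1, 2), (4, 6)))                      # 7
print(knn_predict(train_X, train_y, (9.5, 1.010)))    # 0
```

Because the distance adds raw differences, a feature with a large numeric range (hemoglobin here) dominates one with a small range (specific gravity). Distance-based models such as KNN are therefore usually evaluated on scaled features.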
  • Decision Tree Classifier (DTC): The Decision Tree Classifier (DTC) constructs a hierarchical tree structure where each internal node represents a decision based on a feature, leading to the assignment of class labels at the leaf nodes in Equations (3) and (4).
Information Gain:

$$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \,\in\, \text{all values of } A} \frac{|S_v|}{|S|}\, \text{Entropy}(S_v) \tag{3}$$

where:
  • $\text{Gain}(S, A)$ is the information gain achieved by splitting set $S$ based on attribute $A$
  • $\text{Entropy}(S)$ is the entropy of set $S$ (a measure of uncertainty)
  • $S_v$ is the subset of $S$ containing data points with value $v$ for attribute $A$

Gini Impurity:

$$\text{Gini}(S) = 1 - \sum_{i \,\in\, \text{all classes}} p_i^2 \tag{4}$$

where:
  • $\text{Gini}(S)$ is the Gini impurity of set $S$
  • $p_i$ is the proportion of data points in class $i$ within set $S$
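Equations (3) and (4) can be checked by hand on the class balance of the dataset (250 CKD, 150 non-CKD). The sketch below is a plain-Python illustration. The function names and the hypothetical attribute that separates the classes perfectly are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(labels, attribute_values):
    """Entropy of the full set minus the weighted entropy of each subset S_v."""
    n = len(labels)
    subsets = {}
    for label, value in zip(labels, attribute_values):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


labels = ["ckd"] * 250 + ["notckd"] * 150

print(gini(labels))      # 1 - (0.625**2 + 0.375**2) = 0.46875
print(entropy(labels))   # about 0.954 bits

# Hypothetical attribute that splits the two classes perfectly.
perfect_attribute = ["low"] * 250 + ["high"] * 150
print(information_gain(labels, perfect_attribute))  # equals entropy(labels)
```

A split that isolates each class leaves zero entropy in every subset, so its gain equals the entropy of the parent set. Highly separable features of this kind are what allow shallow trees to reach perfect accuracy on a small dataset.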
  • Gradient Boosting: Gradient boosting is a powerful ensemble method that sequentially builds a series of decision trees to correct the errors of preceding trees. By minimizing a predefined loss function, gradient boosting iteratively adds shallow trees to the ensemble, optimizing performance. Gradient boosting relies on specific loss functions depending on the task (classification or regression) in Equation (5).
Logarithmic loss (for binary classification):

$$L(y, f(x)) = -\sum_{i=1}^{n} \left[ y_i \log f(x_i) + (1 - y_i) \log\left(1 - f(x_i)\right) \right] \tag{5}$$

where:
  • $L(y, f(x))$ is the binary cross-entropy loss function
  • $y_i$ is the true value for data point $i$
  • $f(x_i)$ is the predicted value for data point $i$ by the ensemble model.
All the gradient boosting models are based on these loss functions with some modifications according to their respective approaches towards optimizing the loss function.
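The logarithmic loss in Equation (5) can be evaluated directly, as in the short sketch below. The predicted probabilities are invented, and the clipping constant `eps` is an assumption added to avoid taking the logarithm of zero.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy summed over all data points, as in Equation (5)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total


y_true = [1, 0, 1]
good = log_loss(y_true, [0.9, 0.1, 0.8])   # confident and correct
bad = log_loss(y_true, [0.1, 0.9, 0.2])    # confident and wrong
uninformed = log_loss([1, 0], [0.5, 0.5])  # 2 * log(2)

print(good, bad, uninformed)
```

The loss grows sharply for confident wrong predictions, which is what each new tree in a boosting ensemble is fitted to reduce.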
  • Support Vector Machine (SVM): Support Vector Machine (SVM) works by finding the hyperplane that best separates classes in the feature space in Equations (6) and (7).
For linearly separable data, the decision function is:

$$f(x) = \operatorname{sign}(w \cdot x + b) \tag{6}$$

For non-linearly separable data, the decision function is:

$$f(x) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i y_i K(x_i, x) + b \right) \tag{7}$$

where:
  • $f(x)$ is the decision function that predicts the class of input $x$.
  • $w$ is the weight vector.
  • $x$ is the input vector.
  • $b$ is the bias term.
  • $\alpha_i$ are the Lagrange multipliers.
  • $y_i$ are the class labels.
  • $K(x_i, x)$ is the kernel function that computes the inner product of vectors $x_i$ and $x$ in the transformed feature space.
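As a check on Equation (7), the sketch below recomputes the kernel decision function by hand from a fitted scikit-learn `SVC` and compares it with the library output. The synthetic two-cluster data, the RBF kernel, and the values of `gamma` and `C` are assumptions chosen for illustration. In scikit-learn, `dual_coef_` stores the products α_i y_i for the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic two-class data (made up, not the CKD dataset).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.6, size=(20, 2)),
               rng.normal(1.0, 0.6, size=(20, 2))])
y = np.array([0] * 20 + [1] * 20)

gamma = 0.5  # hypothetical kernel width, fixed so it can be reused below
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)


def rbf_kernel(a, b, gamma):
    """K(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2) for all pairs."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dist)


# sum_i alpha_i * y_i * K(x_i, x) + b, with the sum over support vectors.
K = rbf_kernel(clf.support_vectors_, X, gamma)          # (n_sv, n_samples)
manual_scores = (clf.dual_coef_ @ K + clf.intercept_).ravel()

# The sign of the score selects the class.
manual_pred = clf.classes_[(manual_scores > 0).astype(int)]
print(np.allclose(manual_scores, clf.decision_function(X)))
```

Only the support vectors contribute to the sum, because every other training point has a Lagrange multiplier of zero.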
Each model was trained and cross-validated using a 5-fold cross-validation mechanism to prevent overfitting. Each model was also tuned using GridSearchCV from sklearn.model_selection library to find the best set of hyperparameters for each dataset, thereby producing the best evaluation metrics for each dataset.
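A minimal sketch of this tuning setup is shown below. The hyperparameter grid, the random forest estimator, the stratified folds, and the synthetic stand-in data (400 samples, 24 features, roughly a 150/250 class split) are all assumptions, since the grids used in the study are not listed here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in with the same shape as the UCI CKD dataset.
X, y = make_classification(
    n_samples=400, n_features=24, n_informative=8,
    weights=[0.375, 0.625], random_state=0,
)

# Hypothetical grid; the values actually searched may differ.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# 5-fold cross-validation, stratified to keep the class ratio in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`best_score_` is the mean validation accuracy across the five folds for the best parameter combination. It is therefore a less optimistic estimate than accuracy measured on the training data.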

Accuracy Metrics

A default implementation of each model was first used to establish baseline performance on the input dataset. At this stage, the dataset had been imputed to remove null values, and its categorical features had been binary encoded. Each model was then tuned with the GridSearchCV utility from the Scikit-Learn library to find the hyperparameter values that gave the best scores on the metrics defined in Equations (8)–(11). The baseline accuracy is presented in the following table, with separate columns for original and scaled data.
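$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \tag{8}$$

$$\text{Precision} = \frac{TP}{TP + FP} \times 100\% \tag{9}$$

$$\text{Recall} = \frac{TP}{TP + FN} \times 100\% \tag{10}$$

$$\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{11}$$

where:
  • TP = True Positives
  • TN = True Negatives
  • FN = False Negatives
  • FP = False Positives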
When evaluating classification models for serious medical conditions like chronic kidney disease, dependence on accuracy alone can be deceptive, particularly because of the prevalent problem of class imbalance in medical datasets. A model may seem to perform effectively by predominantly predicting the majority class, thereby exaggerating accuracy while neglecting to accurately detect instances of the minority class. For a more thorough and clinically significant assessment, precision, recall, and the F1 score are necessary. Precision quantifies the ratio of true positive predictions to the total positive predictions, emphasizing the reduction in false positives, which is essential to prevent erroneous patient diagnoses. Recall, or sensitivity, evaluates the model’s capacity to identify all true positive cases, a critical consideration in clinical settings where overlooking a CKD diagnosis (false negatives) may result in severe health repercussions. The F1 score integrates precision and recall into a single statistic, balancing the trade-offs between false positives and false negatives to provide a comprehensive assessment of performance.
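The point about accuracy can be illustrated with a small worked example. The confusion-matrix counts below are hypothetical (a 100-patient screening set with 10 positive cases) and are not results from this study. The helper name and the convention of returning 0 when a ratio is undefined are assumptions.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 score, all expressed in percent."""
    total = tp + tn + fp + fn
    accuracy = 100.0 * (tp + tn) / total
    # Undefined ratios (no positive predictions or no positive cases) map to 0.
    precision = 100.0 * tp / (tp + fp) if (tp + fp) else 0.0
    recall = 100.0 * tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical model A: always predicts the majority (negative) class.
majority = classification_metrics(tp=0, tn=90, fp=0, fn=10)

# Hypothetical model B: finds 8 of the 10 positives with 5 false alarms.
screening = classification_metrics(tp=8, tn=85, fp=5, fn=2)

print(majority)
print(screening)
```

Model A reaches 90% accuracy while detecting no positive case at all (recall 0%). Model B is only slightly more accurate (93%) but has 80% recall, a difference that accuracy alone hides.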
The feature correlation matrix provides insight into the relationships among features and their correlation with CKD, as shown in Figure 4. A strong positive correlation between two features signifies that they tend to rise or fall together, while a strong negative correlation implies an inverse relationship. The binary target variable “class” (indicating the presence or absence of CKD) has strong positive correlations with key physiological markers: hemoglobin (0.77), packed cell volume (0.74), and specific gravity (0.73), highlighting their predictive value. In contrast, albumin exhibits a notable negative correlation (−0.63), underscoring its importance in disease prediction. Notable inter-feature correlations include a strong positive association between packed cell volume and red blood cell count (0.79), a strong association between hemoglobin and packed cell volume (0.74), and a moderate positive correlation between serum creatinine and blood urea (0.59). Negative correlations, such as that between serum creatinine and sodium (−0.69), reflect further interdependencies within the clinical data. Other features, such as age, blood pressure, and potassium, correlate only weakly with chronic kidney disease, indicating limited individual predictive capacity. Understanding these relationships improves predictive models and increases their clinical relevance for the early and precise diagnosis of CKD.
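The entries of such a correlation matrix can be illustrated with a short sketch. It assumes the Pearson coefficient (the text does not name the coefficient used) and uses toy values that are not drawn from the UCI dataset. A positive result against the class label means higher feature values go with Class 1.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy values, purely illustrative: a feature that is higher in Class 1.
hemoglobin = [9.5, 10.8, 11.2, 14.1, 15.0, 16.2]
label = [0, 0, 0, 1, 1, 1]
r = pearson_r(hemoglobin, label)
print(f"correlation with class label: {r:.2f}")
```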

3.4. Exploratory Data Analysis

Violin plots were used to illustrate the distribution of continuous variables across the classification groups. These plots combine the benefits of box plots, which show quartiles and medians, with kernel density estimation, which indicates the probability density of the data over a range of values. This dual representation offers a more comprehensive view of feature distributions by depicting the central tendency, variability, shape, modality, and skewness of the data. This study used violin plots for variables strongly associated with chronic kidney disease, contrasting their distributions between two groups: individuals diagnosed with CKD (Class 0) and those without the condition (Class 1). The plots showed pronounced differences in key markers, including red blood cell count, specific gravity, hemoglobin, and packed cell volume, with CKD patients exhibiting a strong downward shift in distribution, indicating that lower values are strongly associated with the disease. In contrast, blood urea and random blood glucose were elevated in CKD patients, suggesting that these markers may act as significant indicators of the disease. These visually distinct distribution patterns highlight the importance of these features for classification and the need to preserve their statistical properties during data processing to ensure model correctness and clinical validity.
The k-Nearest Neighbors (KNN) classifier was tuned by setting the neighborhood size to k = 1. With this setting, the class of a new test point is determined solely by the class of its nearest neighbor, measured by the minimum Manhattan distance. This value was chosen because a comprehensive search over k = 1 to k = 100 produced the highest test accuracy at k = 1. However, k = 1 carries a well-documented risk of overfitting to noise and outliers, since the classification decision relies on a single instance, and in datasets with substantial noise this may reduce generalization and produce erroneous predictions. The result is reported to highlight the tendency of distance-based models to overfit when trained on small and highly structured datasets. For real-world applications and improved generalization, more stable values of k in the range of 3–7 (commonly 3 to 5) are recommended. The accompanying table reports accuracy as the primary performance measure. Other measures, namely precision, recall, and F1-score, were omitted for conciseness as they closely paralleled the reported accuracy values across virtually all models. This consistency indicates that the original dataset did not exhibit significant class imbalance, so accuracy serves as an adequately representative performance metric for the models being compared, as illustrated in Table 4.
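The sensitivity of k = 1 to a single noisy neighbor can be seen in a small sketch of a Manhattan-distance KNN classifier. The helper names and the toy points (including one deliberately mislabeled sample) are hypothetical, and this is not the Scikit-Learn implementation used in the study.

```python
from collections import Counter


def manhattan(a, b):
    """Manhattan (L1) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_X, train_y, query, k=1):
    """Majority vote among the k training points closest to the query."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: manhattan(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data: three class-0 points, two class-1 points, and one mislabeled
# "noise" point (last entry) sitting inside the class-0 region.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (1.0, 1.0), (0.9, 0.8), (0.15, 0.15)]
train_y = [0, 0, 0, 1, 1, 1]
query = (0.16, 0.14)

pred_k1 = knn_predict(train_X, train_y, query, k=1)  # follows the noisy point
pred_k5 = knn_predict(train_X, train_y, query, k=5)  # outvoted by neighbors
print(pred_k1, pred_k5)
```

With k = 1 the prediction copies the mislabeled neighbor, whereas with k = 5 the three surrounding class-0 points outvote it.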

Synthetic Data Generation

To thoroughly evaluate the effects of data augmentation, five separate augmented datasets were generated by varying the number of synthetic samples, as illustrated in Table 5. The numbers of synthetic samples introduced were 200, 500, 800, 1000, and 2000. This incremental approach was selected to assess the impact of increasing the relative proportion of synthetic data on model generalization and resilience. Synthetic data generation was carried out with the open-source CTGAN implementation in the Synthetic Data Vault (SDV) package [21,23]. The pre-processed original data, stored as a CSV file, was imported to train an instance of the CTGANSynthesizer class. To prevent data leakage and ensure fair model evaluation, the CTGAN synthesizer was trained exclusively on the training split of the dataset after the train–test partition. The test data were never used during CTGAN training or synthetic sample generation. All augmented datasets were formed by combining synthetic samples only with the training data, while the original test set remained unchanged for final evaluation. Training the synthesizer required a metadata object, which was automatically derived from the pandas DataFrame via the SingleTableMetadata class. This object stores key dataset details, including the metadata version, column names, data types (categorical or numerical), and the primary key. The CTGAN model was trained for 15,000 epochs based on empirical convergence behavior observed during training. Quality metrics provided by the SDV framework, including column shape similarity and column pair trend scores, were monitored to assess the stability and realism of the generated data. Preliminary experiments with fewer epochs resulted in lower quality scores, while additional training beyond 15,000 epochs provided no noticeable improvement.
Although early stopping was considered, it was not applied due to the unsupervised and adversarial nature of CTGAN training, where a clear convergence criterion is difficult to define. The epoch count was established through experimentation to guarantee that the synthesizer accurately represented the intricate distributions of the original dataset. To prevent the synthetic data from significantly diverging from the original data’s attributes, restrictions were imposed during training to conform to the minimum and maximum limits of each column. Following the conclusion of training, the model was employed to sample five novel synthetic datasets. The synthetic datasets were later integrated with the original data to produce five enhanced datasets, each characterized by a distinct ratio of original to synthetic samples. During synthetic data generation, CTGAN was conditioned on the target class label to preserve the original proportion of CKD and non-CKD samples. This ensured that class balance was maintained across all augmented datasets. As a result, no artificial class imbalance was introduced, and performance metrics such as precision and recall remained consistent with those observed on the original dataset. Figure 6 quantifies the quality of the created synthetic data in relation to the actual dataset through measures like column shapes and column pair trends. The scores were calculated using the Quality Report tool from the SDV library, which conducts a systematic comparison between the synthesized and original data distributions.
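The leakage-avoiding order of operations described above (split first, fit the generator on the training split only, bound synthetic values by the per-column minimum and maximum, then append synthetic rows to the training data) can be sketched as follows. This is an illustration under stated assumptions: `toy_sampler` is a hypothetical stand-in that jitters training rows and is not CTGAN, the 70/30 split is assumed, and the two-column data are randomly generated rather than clinical.

```python
import random


def split_train_test(rows, test_fraction, seed):
    """Shuffle with a fixed seed and split into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def column_bounds(rows):
    """Per-column (min, max) computed from the given rows only."""
    return [(min(col), max(col)) for col in zip(*rows)]


def clip_row(row, bounds):
    """Clamp every value of a row into its column's [min, max] range."""
    return tuple(min(max(v, lo), hi) for v, (lo, hi) in zip(row, bounds))


def toy_sampler(train_rows, n, rng):
    """Hypothetical stand-in for a trained generator (NOT CTGAN):
    picks random training rows and applies small multiplicative jitter."""
    return [tuple(v * (1.0 + rng.gauss(0.0, 0.05)) for v in rng.choice(train_rows))
            for _ in range(n)]


rng = random.Random(0)
# Randomly generated two-column data standing in for 400 original records.
original = [(rng.uniform(3.0, 17.0), rng.uniform(1.005, 1.025)) for _ in range(400)]

train, test = split_train_test(original, test_fraction=0.3, seed=0)
bounds = column_bounds(train)                 # bounds come from training data only
synthetic = [clip_row(r, bounds) for r in toy_sampler(train, 800, rng)]
augmented_train = train + synthetic           # the test set is left untouched
print(len(train), len(test), len(augmented_train))
```

The point of the sketch is the ordering: nothing derived from `test` ever reaches the sampler or the bounds, so the held-out evaluation stays independent of the augmentation.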
Figure 6 illustrates that the overall quality scores for the synthesized datasets remain within a narrow range, showing a slight increase as more synthetic samples are added. This modest improvement is attributed to the CTGAN model generating a larger number of samples that closely conform to the underlying distribution of the original dataset. The improvement is limited by the concurrent production of a few additional instances that markedly diverge from the established data distribution. As the number of synthetic samples increases beyond 800, a partial separation between the original and synthetic data becomes visible in the t-SNE and PCA plots. This behavior is expected when the synthetic dataset grows much larger than the original dataset, as the generator introduces additional variability to cover the underlying data distribution. Rather than indicating poor data quality, this separation reflects the natural trade-off between increasing dataset diversity and maintaining close overlap with the original samples. At moderate augmentation levels (around 800 samples), the balance between overlap and variability appears most favorable. The magnitude of this divergence and its effect on the data structure are further analyzed using dimensionality reduction. Two dimensionality reduction techniques were used to visualize the relationship between the original and augmented datasets in a reduced space:
  • t-distributed Stochastic Neighbor Embedding (t-SNE): This is an unsupervised, non-linear method for dimensionality reduction. t-SNE is proficient at modeling non-linear relationships and is employed here to condense the high-dimensional feature space into two dimensions for display purposes. This is especially useful for mapping data clusters, offering insight into whether the synthetic data maintains the local structure of the original dataset.
  • Principal Component Analysis (PCA): This technique emphasizes the modeling of linear relationships in the data by converting highly correlated aspects into a reduced set of uncorrelated features, referred to as principal components. PCA, although less effective than t-SNE for intricate, non-linear structures, is proficient at detecting global outliers and verifying whether the synthetic dataset significantly diverges from the primary linear trends of the original data.
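To make the PCA idea concrete, the sketch below finds the first principal component of a small two-feature dataset by power iteration on the covariance matrix. It covers PCA only (t-SNE is not sketched here), uses toy values, and is not the implementation used to produce the plots in the study.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, projected scores) of the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the dominant eigenvector of the covariance.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Toy data lying close to the line y = 2x (illustrative only).
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
direction, scores = first_principal_component(rows)
print([round(x, 3) for x in direction])
```

Projecting original and synthetic rows onto the same direction and comparing the resulting scores is one simple way to check whether synthetic samples drift away from the main linear trend.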
The comparative distributions of the original dataset and those enhanced with 1000 and 2000 synthetic cases are clearly illustrated in the t-SNE plots (Figure 6a–j), validating the nuanced dispersion generated by the augmentation technique. Table 6 presents the relevant t-SNE and PCA plots for all supplemented datasets, facilitating a direct comparison with the original data in two dimensions.
The t-SNE plots show a consistent trend: as the number of synthetic examples increases, the synthetic clusters increasingly align with the clusters formed by the original data points. In the smaller augmented datasets (200 and 500 examples), the synthetic clusters typically lie between the original data clusters, and the synthetic points closely follow the approximate polynomial curve traced by the original clusters. In the larger augmented datasets (1000 and 2000 examples), the synthetic clusters move further from the central trend line and closer to the boundaries of the original data clusters. However, these larger datasets also produce separate clusters of synthetic data that diverge markedly from the primary trends of the original data. The local configuration and alignment of these distinct clusters continue to reflect the patterns of the original data, demonstrating the unavoidable inclusion of variation when the synthesized sample size significantly exceeds the original dataset size (N = 400). The PCA plots exhibit a strong linear correspondence between the synthesized and original datasets. The prevailing trend suggests that an increase in synthetic data volume leads to a complete overlap of the augmented dataset with the original data distribution along the first principal component. Most of the remaining synthetic cases are clustered at somewhat higher values of the second principal component. The synthetic data also preserves outliers at high values of the first principal component, in accordance with the actual dataset. Visual inspection of the t-SNE and PCA plots indicates that the dataset augmented with 800 synthetic samples exhibits the most favorable overlap and integration with the original data structure.
This indicates that a 33% dilution rate (800 synthetic points to 400 original points) strikes a suitable equilibrium between augmenting the dataset and maintaining the integrity of the original data distribution. The identification of approximately 800 synthetic samples (33% dilution) as an optimal augmentation level is based on empirical performance trends observed across multiple classifiers rather than formal statistical hypothesis testing. The experiments were conducted using a fixed random seed to ensure reproducibility. Confidence intervals, variance estimates, and statistical significance testing were not evaluated in this study and are considered a limitation. Future work will involve repeated experiments with different random seeds and statistical analysis to validate the robustness of the observed performance differences.
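The arithmetic behind the dilution levels can be checked with a short sketch. It assumes one reading of the term, namely that the 33% figure is the share of original samples in the combined dataset (400 of 1200), which corresponds to a synthetic-to-original ratio of 200%. The text does not define the term formally, so the key names below are illustrative.

```python
N_ORIGINAL = 400
SYNTHETIC_COUNTS = [200, 500, 800, 1000, 2000]


def mix_summary(n_original, n_synthetic):
    """Share of original data and synthetic-to-original ratio, in percent."""
    total = n_original + n_synthetic
    return {
        "total": total,
        "original_share_pct": 100.0 * n_original / total,
        "synthetic_to_original_pct": 100.0 * n_synthetic / n_original,
    }


summaries = {n: mix_summary(N_ORIGINAL, n) for n in SYNTHETIC_COUNTS}
for n, s in summaries.items():
    print(f"{n:>5} synthetic: original share {s['original_share_pct']:.1f}%, "
          f"synthetic/original {s['synthetic_to_original_pct']:.0f}%")
```

Under this reading, the 800-sample setting sits at the upper end of a 100–200% synthetic-to-original range.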

4. Results and Discussion

The initial model training on the original dataset with optimized hyperparameters created a robust baseline for comparison. Nearly all classifiers exhibited strong performance, validating the dataset’s consistency and predictive capability. The AdaBoost classifier attained the maximum training and testing accuracy at 100%, whilst the KNN classifier exhibited the lowest test accuracy at 79.16%. Given that distance-based models like KNN and SVM are affected by feature scaling, MinMax normalization was employed on the numerical data utilizing the Scikit-learn module. Following scaling, both distance-based models exhibited significant enhancement: KNN test accuracy rose to 99.17% with k = 1, albeit increasing the risk of overfitting, while SVM similarly attained 99.17% accuracy. The tree-based algorithms (Decision Tree, Random Forest, AdaBoost, and XGBoost) exhibited negligible variation, as they rely on feature thresholds instead of distances. A minor decline in accuracy was observed for AdaBoost and XGBoost, likely due to normalization somewhat altering the original data distribution, which influenced the models’ identification of optimal split points.
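Why scaling matters for distance-based models can be illustrated with a minimal min-max normalization sketch. The function is a plain-Python stand-in for illustration, not the Scikit-learn scaler used in the study, and the feature values are toy numbers.

```python
def min_max_scale(column):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Toy values on very different scales (illustrative only).
blood_urea = [18.0, 36.0, 53.0, 107.0, 391.0]
specific_gravity = [1.005, 1.010, 1.015, 1.020, 1.025]

scaled_urea = min_max_scale(blood_urea)
scaled_sg = min_max_scale(specific_gravity)
print([round(v, 3) for v in scaled_urea])
print([round(v, 3) for v in scaled_sg])
```

Before scaling, a distance between two records is dominated by blood urea, whose range spans hundreds of units, while specific gravity varies only in the second decimal place. After scaling, both features contribute on the same 0 to 1 range.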
Figure 6, Figure 7 and Figure 8 provide a comprehensive summary of model performance across both the original and supplemented datasets. Figure 6 illustrates the evolution in test accuracy as the proportion of synthetic data grows. Most models exhibit great accuracy, typically over 95%; however, KNN demonstrates a significant drop, highlighting its susceptibility to distributional noise from CTGAN-generated data. Figure 7 illustrates the distribution of training accuracy, indicating that the majority of models exhibit consistently high accuracy, while a few outliers indicate sporadic inferior synthetic examples.
Figure 8 and Figure 9 illustrate the distribution of test accuracy, further confirming that ensemble models remain stable, whereas KNN shows greater variability. As the quantity of CTGAN-generated synthetic data in the training datasets increased, the accuracy of most classifiers declined gradually. This pattern indicates that the synthetic data did not perfectly reproduce the statistical structure of the original dataset, introducing minor variance or noise. Nevertheless, all models consistently performed well, sustaining test accuracy above 95%, while the F1-scores closely tracked accuracy, thereby confirming balanced class distributions. At low dilution levels (200 synthetic cases), accuracy decreased only slightly and remained consistent. Notably, KNN performed particularly well at this level, since the initial synthetic samples closely followed the actual data trend, thereby improving neighborhood predictions. The ensemble models, particularly AdaBoost and XGBoost, consistently exhibited superior performance, attaining maximum accuracy at the 33% dilution level (about 800 synthetic samples). Furthermore, model accuracy diminished as synthetic noise increased, indicating that the optimal synthetic-to-original ratio is between 100% and 200%.
The performance trend at various dilution levels indicates that modest augmentation enhances model generalization through controlled variability, whereas excessive augmentation diminishes it by introducing random deviations. The decline at elevated dilution levels suggests that the CTGAN model, despite its robustness, occasionally produced redundant or statistically inconsistent samples over prolonged training. This also illustrates the difficulties in replicating intricate clinical data distributions through generative models. Ensemble models exhibited relative stability; however, distance-based models such as KNN and SVM were significantly impacted by this synthetic variability. The findings indicate that normalization affects model behavior variably: it enhances distance-based models while potentially impeding ensemble models by modifying the inherent data linkages.
Subsequent enhancements can further fortify the model and data creation procedure. The CTGAN architecture can be optimized by modifying hyperparameters, including embedding dimension, learning rate, and generator–discriminator ratio, to yield more representative samples. Ensemble generation—integrating outputs from various CTGAN models trained with distinct random seeds—may enhance the diversity of synthetic data and mitigate the impact of noise. Figure 8a,b shows the comparative analysis of training and testing accuracy of the model performance across different dataset sizes, illustrating learning behavior and generalization capability as the amount of training data increases.
Post-processing techniques, like statistical filtering and similarity scoring, can effectively eliminate implausible synthetic data before retraining. Moreover, integrating CTGAN with other data production frameworks like Variational Autoencoders or diffusion models might augment data authenticity. Retraining the CTGAN on an augmented dataset, when additional clinical data emerges, can yield more intricate and broadly applicable patterns. Ultimately, explainability methods such as SHAP or LIME can guarantee that the model’s feature significance corresponds with clinical comprehension. Figure 9 shows the test accuracy heatmap showing model performance across datasets.
The research demonstrates that boosting-based ensemble models, including AdaBoost and XGBoost, are the most effective and dependable classifiers for predicting chronic kidney disease. They exhibit high accuracy and robustness on both real and synthetic datasets, demonstrating resistance to overfitting and fluctuations in scale. Distance-based algorithms like KNN and SVM, while capable of achieving high accuracy after scaling, exhibit increased sensitivity to noise and data variability. Moderate application of controlled synthetic augmentation by CTGAN can enhance model robustness, with the ideal synthetic data ratio ranging from 100% to 200% of the original dataset. From a methodological point of view, this study shows that CTGAN-based data augmentation is a practical solution when clinical datasets are small. The approach helps reduce overfitting, adds controlled variability, and improves the stability of machine learning models trained on limited data.
However, the results should not be directly interpreted as real-world CKD prediction performance. The experiments are based on a small and well-structured public dataset and do not fully represent the complexity, variability, and noise present in real clinical data. Therefore, this work should be viewed as a methodological study rather than a clinical decision-making system. The medical importance of this study lies in its ability to support early risk screening and research development when clinical data are limited. In many healthcare settings, especially in low-resource or large-population environments, complete and well-curated datasets are not always available [30,31]. The proposed CTGAN-based augmentation approach helps improve model reliability under such constraints, enabling researchers and clinicians to build more stable predictive tools that may assist in identifying individuals who require further clinical evaluation.

5. Conclusions

This work yielded important findings on the efficacy of different classifiers and the potential advantages of synthetic data augmentation. It demonstrates how data augmentation can enhance the development of CKD risk prediction models in data-limited clinical and research settings. According to the study, gradient boosting models performed better and more reliably on both real and augmented datasets than KNN and SVM. Given their robustness, these models suggest that it may be possible to build a reliable system for screening chronic kidney disease based solely on essential features, while also handling missing data efficiently through suitable imputation methods. In particular, the study highlights the improved dataset robustness and representativeness that can be achieved by applying the CTGAN approach once a substantial amount of real data has been gathered, and the importance of using data augmentation in moderation. When synthetic data was increased to 100–200% of the initial dataset size, the best results were seen for datasets with subtle or unclear patterns. Additional expansion might be possible for datasets with more pronounced patterns. However, further study is needed to investigate the use of CTGAN scaling on less complex datasets.

Author Contributions

P.R.—Conceptualization, Methodology, Writing—Original Draft, Project Administration; V.N.J.—Supervision, Validation, Formal Analysis, Writing—Review and Editing; K.P.—Data Curation, Investigation, Visualization, Software, Writing—Original Draft; G.K.K.—Conceptualization, Methodology, Writing—Original Draft Preparation; M.B.—Resources, Validation, Writing—Review and Editing; S.N.P.—Validation, Formal Analysis, Software; M.R.—Formal Analysis, Software, Writing—Original Draft; K.V.—Investigation, Data Curation, Writing—Review and Editing; N.N.—Conceptualization, Funding Acquisition, Supervision, Writing—Review and Editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no competing interests.

References

  1. Shrestha, N.; Gautam, S.; Mishra, S.R.; Virani, S.S.; Dhungana, R.R. Burden of chronic kidney disease in the general population and high-risk groups in South Asia: A systematic review and meta-analysis. PLoS ONE 2021, 16, e0258494.
  2. Ma, X.; Liu, R.; Xi, X.; Zhuo, H.; Gu, Y. Global burden of chronic kidney disease due to diabetes mellitus, 1990–2021, and projections to 2050. Front. Endocrinol. 2025, 16, 1513008.
  3. Rubini, L.S.P.; Eswaran, P.; UCI Machine Learning Repository. Chronic Kidney Disease. 2015. Available online: https://archive.ics.uci.edu/dataset/336/chronic+kidney+disease (accessed on 27 November 2025).
  4. Waskom, M.L. Seaborn: Statistical data visualization. J. Open Source Softw. 2021, 6, 3021.
  5. Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling tabular data using conditional GAN. In Advances in Neural Information Processing Systems 32; Curran Associates: Red Hook, NY, USA, 2019; pp. 7335–7345.
  6. Islam, M.A.; Majumder, M.Z.H.; Hussein, M.A. Chronic kidney disease prediction based on machine learning algorithms. J. Pathol. Inform. 2023, 14, 100189.
  7. Chowdhury, M.N.H.; Reaz, M.B.I.; Ali, S.H.M.; Crespo, M.L.; Ahmad, S.; Salim, G.M.; Haque, F.; Ordóñez, L.G.G.; Islam, J.; Mahdee, T.M.; et al. Deep learning for early detection of chronic kidney disease stages in diabetes patients: A TabNet approach. Artif. Intell. Med. 2025, 166, 103153.
  8. Chittora, P.; Chaurasia, S.; Chakrabarti, P.; Kumawat, G.; Chakrabarti, T.; Leonowicz, Z.; Jasinski, M.; Jasinski, L.; Gono, R.; Jasinska, E.; et al. Prediction of chronic kidney disease—A machine learning perspective. IEEE Access 2021, 9, 17312–17334.
  9. Dritsas, E.; Trigka, M. Machine learning techniques for chronic kidney disease risk prediction. Big Data Cogn. Comput. 2022, 6, 98.
  10. Saif, D.; Sarhan, A.M.; Elshennawy, N.M. Early prediction of chronic kidney disease based on an ensemble of deep learning models and optimizers. J. Electr. Syst. Inf. Technol. 2024, 11, 17.
  11. Halder, R.K.; Uddin, M.N.; Uddin, A.M.; Aryal, S.; Saha, S.; Hossen, R.; Ahmed, S.; Rony, M.A.T.; Akter, F. ML-CKDP: Machine learning-based chronic kidney disease prediction with smart web application. J. Pathol. Inform. 2024, 15, 100371.
  12. Hema, K.; Meena, K.; Pandian, R. Analyze the impact of feature selection techniques in the early prediction of CKD. Int. J. Cogn. Comput. Eng. 2024, 5, 66–77.
  13. Ghosh, S.K.; Khandoker, A.H. Investigation on explainable machine learning models to predict chronic kidney diseases. Sci. Rep. 2024, 14, 3687.
  14. Zheng, J.X.; Li, X.; Zhu, J.; Guan, S.Y.; Zhang, S.X.; Wang, W.M. Interpretable machine learning for predicting chronic kidney disease progression risk. Digit. Health 2024, 10, 20552076231224225.
  15. Pedregosa, F.; Michel, V.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; VanderPlas, J.; Cournapeau, D.; Varoquaux, G.; Gramfort, A.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  16. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. arXiv 2014, arXiv:1406.2661.
  17. Binu, S.K.; Devi, R. Adaptive synthetic sampling with generative adversarial networks (AS-GAN) for predicting chronic kidney disease on unbalanced data. In Proceedings of the 4th International Conference on Mobile Networks and Wireless Communications (ICMNWC 2024), Tumkuru, India, 4–5 December 2024; pp. 1–6.
  18. Cascella, M.; Scarpati, G.; Bignami, E.G.; Cuomo, A.; Vittori, A.; Di Gennaro, P.; Crispo, A.; Coluccia, S. Utilizing an artificial intelligence framework (conditional generative adversarial network) to enhance telemedicine strategies for cancer pain management. J. Anesth. Analg. Crit. Care 2023, 3, 19.
  19. Rao, P.K.; Chatterjee, S. TabNet to identify risks in chronic kidney disease using GAN’s synthetic data. In Proceedings of the 2nd International Conference on Technological Advancements in Computational Sciences (ICTACS 2022), Tashkent, Uzbekistan, 10–12 October 2022; pp. 209–215.
  20. Tian, G.; Rehman, A.; Xing, H.; Feng, L.; Gulzar, N.; Hussain, A. Automatic intelligent chronic kidney disease detection in Healthcare 5.0. In Proceedings of the IEEE 22nd International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom 2023), Exeter, UK, 1–3 November 2023; pp. 2134–2140.
  21. Kannan, M.; Umamaheswari, D.; Manimekala, B.; Mary, I.P.S.; Savitha, P.M.; Rozario, J. An enhancement of machine learning model performance in disease prediction with synthetic data generation. Sci. Rep. 2025, 15, 33482.
  22. Kaur, C.; Kumar, M.S.; Anjum, A.; Binda, M.B.; Mallu, M.R.; Ansari, M.S.A. Chronic kidney disease prediction using machine learning. J. Adv. Inf. Technol. 2023, 14, 384–391.
  23. Kuo, N.I.; Gallego, B.; Jorm, L. Attention-based synthetic data generation for calibration-enhanced survival analysis: A case study for chronic kidney disease using electronic health records. arXiv 2025, arXiv:2503.06096.
  24. Liu, K.; Altman, R.B. Conditional generative models for synthetic tabular data: Applications for precision medicine and diverse representations. Annu. Rev. Biomed. Data Sci. 2025, 8, 21–49.
  25. Revathy, S.; Bharathi, B.; Jeyanthi, P.; Ramesh, M. Chronic kidney disease prediction using machine learning models. Int. J. Eng. Adv. Technol. 2019, 9, 6364–6367.
  26. Gunarathne, W.H.S.D.; Perera, K.D.M.; Kahandawaarachchi, K.A.D.C.P. Performance evaluation on machine learning classification techniques for disease classification and forecasting through data analytics for chronic kidney disease (CKD). In Proceedings of the IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE 2017), Washington, DC, USA, 23–25 October 2017; pp. 291–296.
  27. Anantha Padmanaban, K.R.; Parthiban, G. Applying machine learning techniques for predicting the risk of chronic kidney disease. Indian J. Sci. Technol. 2016, 9.
  28. Swain, D.; Mehta, U.; Bhatt, A.; Patel, H.; Patel, K.; Mehta, D.; Acharya, B.; Gerogiannis, V.C.; Kanavos, A.; Manika, S. A robust chronic kidney disease classifier using machine learning. Electronics 2023, 12, 212.
  29. Irianto, S.Y.; Karnila, S.; Hasibuan, M.S.; Dewi, D.A.; Kurniawan, T.B.; Kurniawan, H. Progressive massive fibrosis detection using generative adversarial networks and long short-term memory. J. Appl. Data Sci. 2025, 6, 2298–2311.
  30. Zafar, R.; Rehman, I.U.; Shah, Y.; Ming, L.C.; Goh, K.W.; Suleiman, A.K.; Khan, T.M. Impact of pharmacist-led intervention for reducing drug-related problems and improving quality of life among chronic kidney disease patients: A randomized controlled trial. PLoS ONE 2025, 20, e0317734.
  31. Towards Data Science. GANs for Tabular Data. Available online: https://towardsdatascience.com/review-of-gans-for-tabular-data-a30a2199342 (accessed on 31 March 2024).
Figure 1. Flowchart showing the methods employed in the current investigation.
Figure 2. Data imputation and transformation process showing how missing categorical and numerical values are handled, followed by encoding and conversion for model readiness.
Figure 3. CTGAN training process shows how the generator and discriminator interact through conditional sampling and adversarial learning to produce realistic synthetic data.
Figure 4. The feature correlation matrix of the UCI-CKD dataset illustrates the relationships across clinical variables, with pronounced positive and negative correlations highlighting critical biomarkers associated with CKD progression.
Figure 5. Distribution of major clinical characteristics across CKD-positive (Class 0) and CKD-negative (Class 1) cohorts: (a) red blood cell count, (b) specific gravity, (c) blood urea, (d) random blood glucose, (e) hemoglobin, and (f) packed cell volume.
Figure 6. t-SNE and PCA comparison of original and CTGAN-generated datasets. Subfigures (a–j) show dimensionality-reduced projections.
Figure 7. Trends in test accuracy across varying dataset sizes for multiple machine learning models.
Figure 8. Comparison of accuracy across datasets, showing model performance stability during training and slight variability during testing as the dataset size increases: (a) training and (b) testing.
Figure 9. Test accuracy heatmap showing model performance across datasets of varying sizes, highlighting consistent accuracy for ensemble methods and variability in simpler models.
Table 2. Snapshot of the Dataset Description.
| Attribute | Meaning | Category | Scale | Missing |
| --- | --- | --- | --- | --- |
| age | Age | Numerical | Years | 9 |
| bp | Blood pressure | Numerical | mmHg | 12 |
| sg | Specific gravity | Nominal | 1.005 to 1.025 | 47 |
| al | Albumin | Nominal | 0 to 5 | 46 |
| su | Sugar | Nominal | 0 to 5 | 49 |
| rbc | Red blood cells | Nominal | Abnormal, Normal | 152 |
| pc | Pus cell | Nominal | Abnormal, Normal | 65 |
| pcc | Pus cell clumps | Nominal | Not present, Present | 4 |
| ba | Bacteria | Nominal | Not present, Present | 4 |
| bgr | Blood glucose random | Numerical | mg/dL | 44 |
| bu | Blood urea | Numerical | mg/dL | 19 |
| sc | Serum creatinine | Numerical | mg/dL | 17 |
| sod | Sodium | Numerical | mEq/L | 87 |
| pot | Potassium | Numerical | mEq/L | 88 |
| hemo | Hemoglobin | Numerical | gms | 52 |
| pcv | Packed cell volume | Numerical | P cv | 71 |
| wc | White blood cell count | Numerical | cells/cumm | 106 |
| rc | Red blood cell count | Nominal | millions/cmm | 131 |
| htn | Hypertension | Nominal | No, Yes | 2 |
| dm | Diabetes mellitus | Nominal | No, Yes | 2 |
| cad | Coronary artery disease | Nominal | No, Yes | 2 |
| appet | Appetite | Nominal | Poor, Good | 1 |
| pe | Pedal edema | Nominal | No, Yes | 1 |
| ane | Anemia | Nominal | No, Yes | 1 |
| Classification | Class | Nominal | Not CKD, CKD | 0 |
Table 3. Data Preprocessing Method.
| Feature Type | Imputation Method | Specific Application |
| --- | --- | --- |
| Numerical | Random Sampling | Null values were replaced with a randomly selected existing value from the same feature, preserving the feature's distributional shape. |
| Categorical | Random Sampling | Applied to columns with a higher proportion of missing values (e.g., red_blood_cells, pus_cells) to avoid skewing the category distribution toward the mode. |
| Categorical | Mode Substitution | Applied to the remaining categorical features; null values were filled with the most frequent category. |
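The two imputation strategies in Table 3 can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' implementation: the column values, the function names `impute_random_sample` and `impute_mode`, and the fixed seed are all hypothetical.

```python
import random
from collections import Counter


def impute_random_sample(values, rng):
    """Fill each missing entry (None) with a value drawn at random
    from the observed entries of the same column."""
    observed = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(observed) for v in values]


def impute_mode(values):
    """Fill each missing entry (None) with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [v if v is not None else mode for v in values]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical columns; None marks a missing value.
hemoglobin = [15.4, None, 11.3, 9.6, None, 12.1]
appetite = ["good", "good", None, "poor", "good"]

hemoglobin_filled = impute_random_sample(hemoglobin, rng)
appetite_filled = impute_mode(appetite)
```

Random sampling only ever reuses values that already occur in the column, which is why it leaves the empirical distribution roughly intact, whereas mode substitution shifts mass toward the most common category and is therefore better suited to columns with few missing entries.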
Table 4. Baseline Accuracy Scores.
| Model | Split | Original Data (%) | Scaled Data (%) |
| --- | --- | --- | --- |
| K-Nearest Neighbors | Train | 100 (k = 1) | 99.64 (k = 2) |
| | Test | 79.16 (k = 1) | 99.167 (k = 2) |
| Decision Tree | Train | 100 | 100 |
| | Test | 96.67 | 96.67 |
| Decision Tree with Tuning | Train | 98.21 | 98.21 |
| | Test | 97.5 | 97.5 |
| Random Forest | Train | 100 | 100 |
| | Test | 98.33 | 98.33 |
| AdaBoost | Train | 100 | 98.93 |
| | Test | 100 | 99.167 |
| Gradient Boosting | Train | 100 | 100 |
| | Test | 97.5 | 97.5 |
| Stochastic Gradient Boosting | Train | 100 | 100 |
| | Test | 96.67 | 96.67 |
| XGBoost | Train | 100 | 100 |
| | Test | 99.167 | 96.67 |
| Categorical Boost | Train | 100 | 100 |
| | Test | 97.5 | 97.5 |
| Extra Trees Classifier | Train | 97.86 | 97.86 |
| | Test | 99.167 | 99.167 |
| LGBM | Train | 100 | 100 |
| | Test | 99.167 | 99.167 |
| SVM | Train | 96.07 | 98.57 |
| | Test | 96.67 | 99.167 |
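In Table 4, KNN is the model most affected by scaling: its test accuracy rises from 79.16% to 99.167%, while the tree-based models barely change. One plausible reason, offered here as an illustration rather than as the authors' analysis, is that Euclidean distance is dominated by whichever feature has the largest numeric range. The toy sketch below uses hypothetical (specific gravity, white blood cell count) pairs and min-max scaling; the scaler actually used in the study is not stated in this section.

```python
import math


def make_min_max_scaler(rows):
    """Return a function that maps each feature to [0, 1] using the ranges in rows."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return lambda row: tuple((v - lo) / sp for v, lo, sp in zip(row, lows, spans))


def predict_1nn(train_x, train_y, query):
    """Return the label of the training point closest to query (Euclidean distance)."""
    distances = [math.dist(x, query) for x in train_x]
    return train_y[distances.index(min(distances))]


# Hypothetical data: (specific gravity, white blood cell count).
# Class 0 = CKD, class 1 = not CKD; only specific gravity separates the classes.
train_x = [(1.005, 8000.0), (1.010, 6000.0), (1.020, 8100.0), (1.025, 6100.0)]
train_y = [0, 0, 1, 1]
query = (1.024, 7950.0)  # specific gravity says class 1

raw_prediction = predict_1nn(train_x, train_y, query)

scale = make_min_max_scaler(train_x)
scaled_prediction = predict_1nn([scale(x) for x in train_x], train_y, scale(query))
```

On the raw features the nearest neighbor is chosen almost entirely by white blood cell count, so the query is assigned to class 0; after scaling, specific gravity contributes on an equal footing and the query is assigned to class 1.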
Table 5. Quality scores of synthetic datasets.
| Synthetic Sample Size | Column Shapes | Column Pair Trends | Overall Score |
| --- | --- | --- | --- |
| 200 | 92.09% | 84.23% | 88.16% |
| 500 | 91.82% | 85.3% | 88.56% |
| 800 | 92.67% | 87.56% | 90.11% |
| 1000 | 92.33% | 87.49% | 89.91% |
| 2000 | 92.48% | 88.56% | 90.52% |
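In every row of Table 5, the Overall Score equals the arithmetic mean of the Column Shapes and Column Pair Trends scores. The short check below reproduces that arithmetic from the tabulated values; that the evaluation tool aggregates its scores this way is an inference from the numbers, and the helper name `overall_score` is hypothetical.

```python
# Values copied from Table 5:
# synthetic sample size -> (column shapes %, column pair trends %, reported overall %)
quality_scores = {
    200: (92.09, 84.23, 88.16),
    500: (91.82, 85.30, 88.56),
    800: (92.67, 87.56, 90.11),
    1000: (92.33, 87.49, 89.91),
    2000: (92.48, 88.56, 90.52),
}


def overall_score(column_shapes, column_pair_trends):
    """Assumed aggregation: simple mean of the two component scores."""
    return (column_shapes + column_pair_trends) / 2


# Largest gap between the recomputed mean and the reported overall score.
max_deviation = max(
    abs(overall_score(shapes, trends) - reported)
    for shapes, trends, reported in quality_scores.values()
)
```

The largest deviation is about 0.005 percentage points (the 800-sample row), which is within the rounding of the reported figures.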
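Table 6. Results on all Synthetic Datasets with Different Machine Learning Models.
| Dataset | Model | Training Accuracy (%) | Test Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | Support | Confusion Matrix |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 600 Samples (67% dilution) | KNN | 86.9 | 85.55 | 86 | 86 | 85 | 180 | [[119, 4], [22, 35]] |
| | DTC | 100 | 94.44 | 94 | 94 | 94 | 180 | [[120, 3], [7, 50]] |
| | DTC (tuned) | 99.76 | 95.56 | 96 | 96 | 96 | 180 | [[120, 3], [5, 52]] |
| | Random Forest | 99.76 | 97.78 | 98 | 98 | 98 | 180 | [[123, 0], [4, 53]] |
| | Ada Boost | 100 | 96.67 | 97 | 97 | 97 | 180 | [[123, 0], [6, 51]] |
| | Gradient Boosting | 100 | 97.78 | 98 | 98 | 98 | 180 | [[123, 0], [4, 53]] |
| | Stochastic Gradient Boosting | 100 | 97.78 | 98 | 98 | 98 | 180 | [[123, 0], [4, 53]] |
| | XGBoost | 100 | 97.78 | 98 | 98 | 98 | 180 | [[123, 0], [4, 53]] |
| | Categorical Boosting | 100 | 97.78 | 98 | 98 | 98 | 180 | [[123, 0], [4, 53]] |
| | Extra Trees | 98.57 | 97.22 | 97 | 97 | 97 | 180 | [[122, 1], [4, 53]] |
| | LGBM | 100 | 97.78 | 98 | 98 | 98 | 180 | [[123, 0], [4, 53]] |
| | SVM | 94.52 | 92.22 | 92 | 92 | 92 | 180 | [[114, 9], [5, 52]] |
| 900 Samples (44% dilution) | KNN | 80.32 | 81.11 | 82 | 81 | 81 | 270 | [[154, 29], [22, 65]] |
| | DTC | 100 | 94.07 | 94 | 94 | 94 | 270 | [[176, 7], [9, 78]] |
| | DTC (tuned) | 96.35 | 92.22 | 93 | 92 | 92 | 270 | [[169, 14], [7, 80]] |
| | Random Forest | 100 | 96 | 96 | 96 | 96 | 270 | [[178, 5], [6, 81]] |
| | Ada Boost | 100 | 94.44 | 94 | 94 | 94 | 270 | [[175, 8], [7, 80]] |
| | Gradient Boosting | 100 | 96.29 | 96 | 96 | 96 | 270 | [[176, 7], [3, 84]] |
| | Stochastic Gradient Boosting | 100 | 95.55 | 96 | 96 | 96 | 270 | [[176, 7], [5, 82]] |
| | XGBoost | 100 | 97.04 | 97 | 97 | 97 | 270 | [[179, 4], [4, 83]] |
| | Categorical Boosting | 99.84 | 96.3 | 96 | 96 | 96 | 270 | [[177, 6], [4, 83]] |
| | Extra Trees | 97.14 | 95.2 | 95 | 95 | 95 | 270 | [[173, 10], [3, 84]] |
| | LGBM | 100 | 97.04 | 97 | 97 | 97 | 270 | [[177, 6], [2, 85]] |
| | SVM | 94.44 | 95.55 | 96 | 96 | 96 | 270 | [[173, 10], [2, 85]] |
| 1200 Samples (33% dilution) | KNN | 82.38 | 79.44 | 80 | 79 | 80 | 360 | [[197, 42], [32, 89]] |
| | DTC | 100 | 91.11 | 91 | 91 | 91 | 360 | [[218, 21], [11, 110]] |
| | DTC (tuned) | 94.3 | 93.33 | 93 | 93 | 93 | 360 | [[225, 14], [10, 111]] |
| | Random Forest | 99.76 | 96.94 | 97 | 97 | 97 | 360 | [[235, 4], [7, 114]] |
| | Ada Boost | 100 | 96.11 | 96 | 96 | 96 | 360 | [[230, 9], [5, 116]] |
| | Gradient Boosting | 100 | 96.94 | 97 | 97 | 97 | 360 | [[233, 6], [5, 116]] |
| | Stochastic Gradient Boosting | 100 | 96.67 | 97 | 97 | 97 | 360 | [[233, 6], [6, 115]] |
| | XGBoost | 100 | 96.39 | 96 | 96 | 96 | 360 | [[231, 8], [5, 116]] |
| | Categorical Boosting | 98.81 | 96.11 | 96 | 96 | 96 | 360 | [[231, 8], [6, 115]] |
| | Extra Trees | 95.83 | 95.56 | 96 | 96 | 96 | 360 | [[229, 10], [6, 115]] |
| | LGBM | 100 | 96.39 | 96 | 96 | 96 | 360 | [[230, 9], [4, 117]] |
| | SVM | 95.24 | 94.72 | 95 | 95 | 95 | 360 | [[227, 12], [7, 114]] |
| 1400 Samples (29% dilution) | KNN | 79.59 | 80.47 | 80 | 80 | 80 | 420 | [[273, 26], [56, 65]] |
| | DTC | 100 | 93.81 | 94 | 94 | 94 | 420 | [[284, 15], [11, 110]] |
| | DTC (tuned) | 96.22 | 94.29 | 94 | 94 | 94 | 420 | [[289, 10], [14, 107]] |
| | Random Forest | 99.89 | 95.48 | 95 | 95 | 95 | 420 | [[290, 9], [10, 111]] |
| | Ada Boost | 99.8 | 94.76 | 95 | 95 | 95 | 420 | [[286, 13], [9, 112]] |
| | Gradient Boosting | 100 | 95.47 | 95 | 95 | 95 | 420 | [[289, 10], [9, 112]] |
| | Stochastic Gradient Boosting | 100 | 95.24 | 95 | 95 | 95 | 420 | [[288, 11], [9, 112]] |
| | XGBoost | 100 | 95 | 95 | 95 | 95 | 420 | [[287, 12], [9, 112]] |
| | Categorical Boosting | 98.88 | 95.48 | 96 | 95 | 96 | 420 | [[287, 12], [7, 114]] |
| | Extra Trees | 96.22 | 94.52 | 95 | 95 | 95 | 420 | [[285, 14], [9, 112]] |
| | LGBM | 100 | 95.95 | 96 | 96 | 96 | 420 | [[290, 9], [8, 113]] |
| | SVM | 93.88 | 94.52 | 95 | 95 | 95 | 420 | [[286, 13], [10, 111]] |
| 2400 Samples (17% dilution) | KNN | 79.94 | 77.64 | 77 | 78 | 77 | 720 | [[449, 55], [106, 110]] |
| | DTC | 100 | 92.08 | 92 | 92 | 92 | 720 | [[475, 29], [28, 188]] |
| | DTC (tuned) | 94.64 | 93.47 | 94 | 93 | 93 | 720 | [[478, 26], [21, 195]] |
| | Random Forest | 98.57 | 95.42 | 95 | 95 | 95 | 720 | [[486, 18], [15, 201]] |
| | Ada Boost | 95.24 | 95.69 | 96 | 96 | 96 | 720 | [[485, 19], [12, 204]] |
| | Gradient Boosting | 97.26 | 95.14 | 95 | 95 | 95 | 720 | [[483, 21], [14, 202]] |
| | Stochastic Gradient Boosting | 100 | 95.27 | 95 | 95 | 95 | 720 | [[485, 19], [15, 201]] |
| | XGBoost | 100 | 94.72 | 95 | 95 | 95 | 720 | [[481, 23], [15, 201]] |
| | Categorical Boosting | 96.96 | 94.86 | 95 | 95 | 95 | 720 | [[482, 22], [15, 201]] |
| | Extra Trees | 94.17 | 95 | 95 | 95 | 95 | 720 | [[481, 23], [13, 203]] |
| | LGBM | 100 | 94.58 | 95 | 95 | 95 | 720 | [[487, 17], [22, 194]] |
| | SVM | 93.45 | 93.9 | 94 | 94 | 94 | 720 | [[479, 25], [19, 197]] |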
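The test accuracy and support columns of Table 6 can be recomputed from the confusion matrices, which gives a quick consistency check on the reported figures. The sketch below does this for a few rows and also reproduces the dilution percentages. Two assumptions are made and are not stated in this section: the dilution figure is taken to be the share of original records in the combined dataset, and the original dataset is taken to contain 400 records, which is what the reported percentages imply. The helper names are hypothetical.

```python
ORIGINAL_RECORDS = 400  # assumption: implied by the reported dilution percentages


def dilution_percent(total_samples):
    """Share of original records in a combined dataset, as a rounded percentage."""
    return round(100 * ORIGINAL_RECORDS / total_samples)


def support_and_accuracy(confusion_matrix):
    """Return (support, accuracy %) for a 2x2 confusion matrix.
    Accuracy is the sum of the diagonal divided by the sum of all cells."""
    (a, b), (c, d) = confusion_matrix
    support = a + b + c + d
    return support, 100 * (a + d) / support


# (label, confusion matrix, reported support, reported test accuracy %) from Table 6
rows = [
    ("600 / Random Forest", [[123, 0], [4, 53]], 180, 97.78),
    ("900 / XGBoost", [[179, 4], [4, 83]], 270, 97.04),
    ("1400 / LGBM", [[290, 9], [8, 113]], 420, 95.95),
    ("2400 / Stochastic Gradient Boosting", [[485, 19], [15, 201]], 720, 95.27),
]

checks = []
for label, matrix, reported_support, reported_accuracy in rows:
    support, accuracy = support_and_accuracy(matrix)
    consistent = support == reported_support and abs(accuracy - reported_accuracy) <= 0.01
    checks.append((label, consistent))

dilutions = {n: dilution_percent(n) for n in (600, 900, 1200, 1400, 2400)}
```

Each support value is 30% of the corresponding dataset size (for example, 180 of 600), which is consistent with a 70/30 train-test split.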
Randhawa, P.; Jasthi, V.N.; Piyush, K.; Kaushik, G.K.; Batamulay, M.; Prasad, S.N.; Rawat, M.; Veernapu, K.; Naik, N. Conditional Tabular Generative Adversarial Network Based Clinical Data Augmentation for Enhanced Predictive Modeling in Chronic Kidney Disease Diagnosis. BioMedInformatics 2026, 6, 6. https://doi.org/10.3390/biomedinformatics6010006