Diagnosis of Cervical Cancer Based on a Hybrid Strategy with CTGAN

Tang, Mengdi; Chen, Hua; Lv, Zongjian; Cai, Guangxing

doi:10.3390/electronics14061140


School of Science, Hubei University of Technology, Wuhan 430068, China

* Author to whom correspondence should be addressed.

Electronics 2025, 14(6), 1140; https://doi.org/10.3390/electronics14061140

Submission received: 11 February 2025 / Revised: 9 March 2025 / Accepted: 11 March 2025 / Published: 14 March 2025


Abstract

Cervical cancer remains a significant global public health challenge, particularly in low- and middle-income countries where invasive diagnostic methods are underutilized due to limited medical resources. Machine learning has provided a new pathway to address this challenge, but existing machine learning prediction methods face three major challenges: feature redundancy, class imbalance, and sample scarcity. To address these issues, this study proposes a hybrid data processing strategy with Conditional Tabular Generative Adversarial Networks (CTGAN) and machine learning to construct a more accurate and efficient auxiliary diagnostic model for cervical cancer. The hybrid strategy first employs the Minimal Redundancy Maximal Relevance (mRMR) algorithm and XGBoost-based Recursive Feature Elimination (RFE) for secondary feature screening. Subsequently, the SMOTE-ENN combination sampling method is applied to handle extreme class imbalance, and CTGAN is utilized to augment the dataset, thereby mitigating data scarcity. Experimental validation on the Risk Factors of Cervical Cancer (RFCC) dataset from a Venezuelan hospital demonstrates that, after processing with the proposed hybrid strategy, the Logistic Regression (LR) model achieves the best overall prediction results, with accuracy, precision, recall, and F1-score reaching 99.00%, 99.28%, 98.77%, and 99.02%, respectively, outperforming existing methods.

Keywords:

cervical cancer; feature selection; sample balancing; CTGAN; hybrid strategy

1. Introduction

Cervical cancer, a leading cause of mortality among women of reproductive age (15–49 years), exhibits significant regional disparities in disease burden. According to the Global Burden of Disease (GBD) database, cervical cancer ranked among the deadliest diseases for women in 204 countries and territories between 1990 and 2021 [1]. The World Health Organization (WHO) reports approximately 570,000 new cases and 310,000 deaths annually, with 85% occurring in low- and middle-income countries [2].

Cervical cancer, also known as cervical carcinoma, is a malignant tumor caused by the uncontrolled growth and proliferation of tissue cells in the cervix due to abnormal cell division mechanisms [3]. Early detection is critical for improving survival rates, enhancing recovery possibilities, and safeguarding human lives. Although the current “gold standard” diagnostic method—cervical biopsy—is highly accurate, its invasive nature may lead to complications such as pain, bleeding, and infection, requires recovery time, and requires specialized pathological equipment and personnel, resulting in high costs. Consequently, its adoption rate in low-income regions remains below 5% [4,5]. This highlights the urgent need to develop non-invasive, low-cost screening methods.

In recent years, machine learning has provided a new pathway to address this challenge by leveraging the synergistic predictive value of behavioral factors (e.g., age at first sexual intercourse, smoking history) and non-invasive tests (e.g., Hinselmann colposcopy, Schiller’s iodine test, cytology). This approach is pivotal for lowering screening barriers and enabling large-scale, equitable diagnostics. Existing studies have made progress in certain technical aspects: Newaz et al. [6] proposed a hybrid resampling technique combining oversampling and undersampling to establish class balance, achieving a diagnostic accuracy of 94.47% using genetic algorithms (GA) to identify key risk factors. Kaushik et al. [7] employed visualization and XGBoost for cervical cancer prediction, attaining 96.5% accuracy. Aloss et al. [8] utilized the Crow Search Algorithm (CSA) for feature selection and tested multiple classifiers, with SVM achieving the highest accuracy of 97.70%. Tanimu et al. [9] developed a decision tree (DT) algorithm based on risk factors, integrating feature selection and SMOTE-Tomek sampling to address data imbalance, resulting in 98.72% accuracy. They also demonstrated that reducing input dimensions and increasing sample size significantly improved model performance. Chadaga et al. [10] established a custom stacked ensemble machine learning approach to predict risk factors for contracting cervical cancer. First, univariate and multivariate statistics were used to perform in-depth exploratory analyses. Then, one-way analysis of variance (ANOVA), mutual information (MI), and Pearson’s correlation techniques were utilized for feature selection. Finally, the Borderline-SMOTE technique was employed to balance the data, leading to an accuracy of 98%. Kumawat et al. [11] compared six classification algorithms (e.g., artificial neural networks, Bayesian networks, SVM, XGBoost) and found that XGBoost without feature selection achieved the highest accuracy, at 94.94%. 
Priya et al. [12] applied SVM, gradient boosting machines (GBM), and perceptron learning to achieve 96.33% accuracy. Bhavani et al. [13] proposed a stacked ensemble technique with heterogeneous base learners and meta-learners, combined with SMOTE and RFE, achieving 97.10% accuracy. Shakil et al. [14] used SMOTE and adaptive synthetic sampling to address class imbalance, along with Chi-square and LASSO regression for feature selection, achieving 97.60% accuracy with DT. Ali et al. [15] designed an ensemble classifier integrating random forest, SVM, Gaussian naive Bayes, and DT, achieving 98.06% accuracy. Despite these advancements, existing studies fail to systematically address three critical challenges: feature redundancy, class imbalance, and data scarcity in cervical cancer prediction.

In summary, while machine learning has significantly advanced cervical cancer diagnosis, three common limitations persist, as follows: (1) how to eliminate redundant features and identify optimal feature subsets through more scientific feature engineering to enhance prediction accuracy; (2) how to address the pervasive issue of extreme class imbalance in medical datasets; (3) how to overcome the scarcity of samples in cervical cancer datasets, as existing methods rely on limited raw data (e.g., the UCI dataset with only 858 samples), hindering model generalization.

To address these challenges, this study proposes a hybrid strategy with CTGAN for cervical cancer prediction. For the first time, we systematically integrate three modules—feature engineering, sample balancing, and data augmentation—into a unified framework, as follows:

For the first challenge, we employ the mRMR algorithm for initial feature screening, followed by XGBoost-based RFE for secondary feature selection to identify the optimal feature subset;
For the second challenge, we apply the SMOTE-ENN combined sampling method to address extreme class imbalance in medical data;
For the third challenge, we utilize CTGAN to augment the dataset, overcoming the generalization bottleneck caused by limited training samples.

The research workflow is illustrated in Figure 1.

2. Materials and Methods

2.1. Cervical Cancer Dataset

2.1.1. Data Source and Description

The experimental dataset is the Risk Factors of Cervical Cancer (RFCC) dataset from the UCI Machine Learning Repository [16], which is publicly available at https://archive.ics.uci.edu/dataset/383/cervical+cancer+risk+factors (accessed on 17 July 2021). The dataset structure is shown in Table 1.

The cervical cancer dataset, provided by a Venezuelan hospital, comprises 858 samples and 36 attributes categorized into behavioral factors (e.g., number of sexual partners, age at first sexual intercourse, smoking habits, sexually transmitted diseases), non-invasive tests (e.g., Hinselmann, Schiller, cytology), and invasive tests (e.g., biopsy). The baseline characteristics of the dataset are detailed in Table 2.

2.1.2. Data Cleaning

During the data cleaning phase, a hierarchical strategy is adopted to address missing values. As shown in Table 2, the features “STDs: time since first diagnosis” and “STDs: time since last diagnosis” exhibited missing rates exceeding 90%, significantly higher than other features. Following Allison’s principles for handling missing data [17], variables with missing rates over 70% are removed to avoid estimation bias. Consequently, these two features are discarded, reducing the feature dimensions from 35 to 33.

For remaining features with missing rates ≤15%, a type-specific imputation strategy is applied—numerical features (e.g., age, smoking years) are filled with mean values, while categorical features (e.g., smoking, sexually transmitted diseases) are imputed using mode values [18].
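The type-specific imputation rule can be illustrated with a minimal sketch. It uses only the Python standard library; the helper name `impute_column`, the use of `None` for missing entries, and the toy values are assumptions made for illustration, not the authors' code.

```python
from collections import Counter
from statistics import mean

def impute_column(values, kind):
    """Fill None entries with the mean (numerical) or the mode (categorical)."""
    observed = [v for v in values if v is not None]
    if kind == "numerical":
        fill = mean(observed)
    else:
        # most_common(1) returns the most frequent observed category
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

# Toy columns (made-up values): one numerical, one categorical
ages = impute_column([20, None, 30, 40], "numerical")
smokes = impute_column([0, 0, None, 1], "categorical")
```

In practice the same rule would be applied column by column after the two high-missing-rate features have been dropped.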

2.1.3. Data Standardization

Because the features differ in units and scale, analyzing the raw feature values without preprocessing may amplify the influence of features with larger magnitudes. The raw data must therefore be rescaled to guarantee the reliability of the results. Standardization and normalization are the two common approaches. Normalization scales data into a predefined range, which facilitates the comparison and weighting of features with different orders of magnitude. Standardization, in contrast, usually refers to transforming data so that they have a mean of 0 and a standard deviation of 1. This transformation does not change the shape of the distribution but unifies the scale of the data, which is convenient for comparison and calculation across features. Because the dataset in this study contains a considerable number of outliers, normalization is not appropriate, whereas standardization is a suitable approach [19].

Based on the attributes of the RFCC dataset, we select the Z-score standardization method for data preprocessing in this paper. The transformation function is as follows:

x_{new} = \frac{x - μ}{σ}

(1)

where μ is the mean of each feature column and σ is the standard deviation of each feature column.
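As a small worked instance of Equation (1), the sketch below standardizes one column with the standard library. The function name `z_score` and the sample values are illustrative assumptions. The population standard deviation is used here, which matches the convention of scikit-learn's StandardScaler; the paper does not state which variant was applied.

```python
from statistics import mean, pstdev

def z_score(column):
    """Apply x_new = (x - mu) / sigma to every value of one feature column."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (assumption)
    if sigma == 0:
        # A constant column carries no scale information
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Toy column with mean 5 and standard deviation 2
scaled = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

After the transformation the column has mean 0 and standard deviation 1, while the relative spacing of the values is unchanged.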

2.2. Proposed Hybrid Strategy with CTGAN

2.2.1. Feature Selection

The effectiveness of the features is the key to determining the accuracy of the model. The effectiveness of cancer diagnostic models can be enhanced by performing feature selection on the data. This approach allows for the removal of irrelevant or redundant features, thereby reducing the dataset size and simplifying the model’s complexity. In this paper, we apply a two-stage feature selection process to identify the optimal subset of features. The initial step is to use mRMR to select features with the top 24 scores. In the second stage, recursive feature elimination utilizing the XGBoost algorithm is employed to determine the final set of features.

1. mRMR method for initial screening

The mRMR [20] algorithm is a heuristic feature selection algorithm based on mutual information [21]. Its evaluation function measures both the correlation between each feature and the output and the redundancy among features, yielding a feature set that is highly correlated with the output and contains few redundant features. Algorithm 1 outlines the detailed steps involved in the mRMR algorithm.

Algorithm 1: mRMR Method

Step 1: Calculate the mutual information between individual features and results

Step 2: Sort the results according to the values

Step 3: Remove the features with zero mutual information from the subset of features to
be selected by using two-way search and retain the features with the highest
correlation

Step 4: Choose the weakly correlated features for classifier validation until the results are
satisfactory

Step 5: Compute the mutual information difference (MID) and mutual information quotient
(MIQ) for the selected feature subset

Step 6: Sort the MID or MIQ and verify the features with maximum values

Step 7: Repeat step 3 to step 6 until the results are satisfactory

Step 8: Add a weight element to the MID or MIQ

Step 9: Find the combination of the features that corresponds to the optimal classification
indicators through several cycles of calculation and verification of the classifier

Step 10: Select the optimal feature set on the feature sets obtained from the two search
methods
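The core idea behind the MID criterion can be sketched as a greedy loop: at each step, pick the feature whose relevance to the target minus its average redundancy with the already selected features is largest. This is a simplified illustration for discrete features using only the standard library, not the pymrmr implementation used in the paper; the function names and the toy data are assumptions.

```python
from collections import Counter
from math import log

def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * log((c / n) / ((pa[x] / n) * (pb[y] / n)))
        for (x, y), c in pab.items()
    )

def mrmr_mid(features, target, k):
    """Greedy mRMR with the MID criterion: relevance minus mean redundancy."""
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    selected = [max(relevance, key=relevance.get)]
    while len(selected) < k:
        def score(name):
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy
        candidates = [name for name in features if name not in selected]
        selected.append(max(candidates, key=score))
    return selected

# Toy data: "b" duplicates "a", while "c" carries independent information about y
toy = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "b": [0, 0, 1, 1, 0, 0, 1, 1],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
}
y = [0, 1, 1, 1, 0, 1, 1, 1]  # y = a OR c
picked = mrmr_mid(toy, y, 2)
```

On this toy input the redundant copy "b" is never chosen alongside "a", which is the behavior the redundancy term is meant to produce. The MIQ variant would divide relevance by redundancy instead of subtracting.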

2. Recursive feature elimination method for secondary screening

The recursive feature elimination (RFE) method [22] is employed to reduce the feature dimensions and identify the optimal feature subset. The specific steps of the RFE algorithm are shown in Algorithm 2.

Algorithm 2: Recursive Feature Elimination Method

Step 1: Initialize the XGBoost classifier with the features pre-selected via the mRMR
algorithm as the baseline feature subset

Step 2: Assess the feature importance metrics using the average information gain

Step 3: Evaluate the classification accuracy through the cross-validation technique

Step 4: Eliminate the lowest-ranked feature from the current subset to generate a revised
feature subset

Step 5: Determine the significance of each feature within the updated feature subset

Step 6: Reassess the classification accuracy of the updated feature subset

Step 7: Repeat step 4 to step 6 until no more features are rejected

Step 8: Choose the feature subset that yields the highest classification accuracy from a set
of K distinct feature subsets

The recursive feature elimination method screened 12 optimal features including Schiller, first sexual intercourse, and other features. The importance scores for the optimal feature set are displayed in Table 3.
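The elimination loop of Algorithm 2 can be sketched independently of any particular model. In the sketch below, `importance_fn` and `score_fn` are hypothetical callables standing in for XGBoost's gain-based importances and the cross-validated accuracy; the toy weights are invented so that the example runs without external libraries.

```python
def recursive_feature_elimination(features, importance_fn, score_fn, min_features=1):
    """Drop the least important feature each round and keep the best-scoring subset."""
    current = list(features)
    best_subset, best_score = list(current), score_fn(current)
    while len(current) > min_features:
        importances = importance_fn(current)  # {feature: importance}
        weakest = min(current, key=lambda f: importances[f])
        current.remove(weakest)
        score = score_fn(current)
        if score >= best_score:  # prefer the smaller subset on ties
            best_subset, best_score = list(current), score
    return best_subset, best_score

# Invented weights: two informative features and two noise features
weights = {"Schiller": 0.5, "First sexual intercourse": 0.3, "noise_1": 0.05, "noise_2": 0.01}

def toy_importance(subset):
    return {f: weights[f] for f in subset}

def toy_score(subset):
    # Informative features raise the score; every feature adds a small complexity penalty
    signal = sum(weights[f] for f in subset if not f.startswith("noise"))
    return signal - 0.02 * len(subset)

best_subset, best_score = recursive_feature_elimination(weights, toy_importance, toy_score)
```

With a real classifier, `importance_fn` would refit the model on the current subset at every round, as Steps 4 to 6 describe.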

2.2.2. SMOTE-ENN for Sample Balancing

In medical datasets, the numbers of positive and negative samples often differ significantly, leading to an uneven class distribution. Because of this imbalance, the model cannot fully learn the characteristics and patterns of the minority class and therefore performs poorly on it, resulting in low recall, low accuracy, and poor precision. To solve this problem, this paper uses the SMOTE-ENN combined sampling method to balance the samples. It first uses SMOTE to oversample the minority class so that the class ratio approaches equilibrium, then uses ENN to undersample the majority class and remove noisy boundary samples, finally obtaining a cleaner, more balanced, and more discriminative dataset.

The specific steps of the SMOTE-ENN algorithm are shown in Algorithm 3.

Algorithm 3: SMOTE-ENN Method

Step 1: Divide the unbalanced dataset into minority class S_min and majority class S_maj

Step 2: Determine the K closest neighbors for each sample in the minority class

Step 3: Calculate the number of new samples required for every minority class sample
based on the unbalanced proportion of the dataset

Step 4: Randomly select N nearest neighbors from the K nearest neighbors of each
minority class sample

Step 5: For each selected neighbor x_n of a minority sample x, construct a new sample
using the equation x_new = x + rand(0,1)⋅(x_n − x)

The variable “x” represents a sample in the cervical cancer dataset, i.e., a row of data in the table.
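The interpolation step x_new = x + rand(0,1)⋅(x_n − x) can be sketched as follows. This covers only the SMOTE oversampling part; the ENN cleaning stage is not shown, and the paper's experiments rely on imbalanced-learn rather than on this code. The function name, the parameter defaults, and the toy points are assumptions.

```python
import math
import random

def smote_samples(minority, k=2, n_new=4, seed=0):
    """Generate n_new synthetic points by interpolating toward a random near neighbor."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # K nearest minority neighbors of x (excluding x itself)
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        x_n = rng.choice(neighbors)
        gap = rng.random()  # rand(0, 1)
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, x_n)])
    return synthetic

# Toy minority class: the four corners of a unit square
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_points = smote_samples(minority)
```

Every synthetic point lies on a line segment between two real minority samples, so it stays inside the region already occupied by the minority class.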

2.2.3. CTGAN for Sample Expanding

Due to the sensitive nature and privacy of medical datasets, their sample sizes are often not large enough to meet the sample size requirements of machine learning models. The RFCC cervical cancer dataset contains only 858 sets of experimental samples, so CTGAN [23] is further introduced to expand the size of the dataset to improve the performance of the machine learning model.

CTGAN is a generative adversarial network designed to learn the data distribution of a tabular dataset and then generate new samples with similar statistical characteristics. The model processes structured data in tabular form and can be applied to tasks such as data synthesis, data augmentation, and the simulation of real-world data distributions. CTGAN thus offers a way to address data scarcity or access constraints while helping to protect the privacy of the original data.

CTGAN mainly consists of a generator and a discriminator. The generator takes random noise z ∼ ρ_z as input and generates synthetic data G(z) similar to the distribution of real data. The discriminator accepts both the real data x ∼ ρ_r and the output G(z) of the generator, and determines whether the input is real data or fake data. During the training process, adversarial optimization is used to improve the performance of both the generator and the discriminator. The generator constantly enhances the quality of the generated data to deceive the discriminator, while the discriminator enhances its ability to distinguish between real and fake data.

The specific network structure of CTGAN is shown in Figure 2.

The specific steps of CTGAN are shown in Algorithm 4.

Algorithm 4: CTGAN Method

Step 1: Estimate the number of modes m_i for each continuous column C_i using a
variational Gaussian mixture model

Step 2: Fit a Gaussian mixture distribution

Step 3: Calculate the probability of each value c_ij in C_i belonging to each mode

Step 4: Sample a mode from the resulting probability density and normalize the value
using the sampled mode
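Steps 3 and 4 can be illustrated for a single value once the mixture has been fitted. In the sketch below the mixture weights, means, and standard deviations are made-up numbers standing in for the output of Steps 1 and 2, and the scaling by four standard deviations follows the mode-specific normalization described in the original CTGAN paper; treat it as an illustration under those assumptions rather than the library's internal code.

```python
import math
import random

def mode_specific_normalize(value, weights, means, stds, rng):
    """Return (scalar alpha, one-hot mode indicator beta) for one continuous value."""
    # Step 3: weighted Gaussian density of the value under each mode
    densities = [
        w * math.exp(-0.5 * ((value - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(weights, means, stds)
    ]
    total = sum(densities)
    probs = [d / total for d in densities]
    # Step 4: sample a mode, then normalize the value within that mode
    k = rng.choices(range(len(probs)), weights=probs)[0]
    alpha = (value - means[k]) / (4 * stds[k])
    beta = [1 if i == k else 0 for i in range(len(probs))]
    return alpha, beta

rng = random.Random(0)
# Hypothetical two-mode column: a value of 9.0 sits close to the second mode (mean 10)
alpha, beta = mode_specific_normalize(9.0, [0.7, 0.3], [1.0, 10.0], [1.0, 2.0], rng)
```

Each continuous value is thus represented by a one-hot mode indicator plus a scalar offset within that mode, which lets the network handle multimodal columns.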

The distribution of samples in the cervical cancer dataset after using SMOTE-ENN as well as SMOTE-ENN and CTGAN is shown in Table 4, in which the 30,000 samples expanded by CTGAN are only involved in the training process of the model.

2.3. Classifiers

2.3.1. Logistic Regression

Logistic Regression (LR) is a widely adopted classification algorithm in clinical research, used to assist clinicians in determining the patient’s condition, formulating treatment plans, and assessing health risks [24]. This supervised learning method addresses binary classification problems by transforming linear regression outputs through a sigmoid activation function, thereby constraining predictions within the probabilistic range of [0, 1]. The fundamental mathematical expression of the sigmoid function is

σ(z) = \frac{1}{1 + e^{-z}}

(2)

where z = w^{T} \cdot x, w is the learnable weight vector, x is the feature vector, and σ(z) denotes the predicted probability.

The fitting function H_{θ}(x) for Logistic Regression is

H_{θ}(x) = g(θ^{T} x) = \frac{1}{1 + e^{-θ^{T} x}}

(3)

where g is the sigmoid function of Equation (2) and θ is the parameter vector.

The loss function J(θ) for Logistic Regression, computed over m training samples, is

J(θ) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(H_{θ}(x^{(i)})) + (1 - y^{(i)}) \log(1 - H_{θ}(x^{(i)}))]

(4)

The maximum likelihood method is the most common method used for estimating parameters in Logistic Regression algorithms [25].
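The sigmoid, the fitting function, and the cross-entropy loss translate directly into code. The sketch below is a plain standard-library illustration with the sum taken over both terms for every sample; the function names and toy data are assumptions.

```python
import math

def sigmoid(z):
    """Sigmoid function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """Fitting function H_theta(x) = sigmoid(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def log_loss(theta, X, y):
    """Average cross-entropy loss J(theta) over the m samples in X."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        h = predict_proba(theta, xi)
        total += yi * math.log(h) + (1 - yi) * math.log(1 - h)
    return -total / m

# Toy data: the first entry of each row acts as the bias term
X = [[1.0, 0.0], [1.0, 2.0]]
y = [0, 1]
loss_at_zero = log_loss([0.0, 0.0], X, y)
loss_better = log_loss([-2.0, 2.0], X, y)
```

With all parameters at zero every prediction is 0.5 and the loss equals log 2; parameters that separate the two toy samples give a lower loss, which is what maximum likelihood estimation searches for.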

2.3.2. K-Nearest Neighbor

K-Nearest Neighbor (KNN) [26] is a basic machine learning algorithm that classifies unlabeled instances by identifying their K most proximate labeled neighbors. The fundamental principle behind KNN is as follows. First, calculate the distance between the target instance and all training samples. Next, pick the K training samples nearest to the target instance. Finally, assign the target instance to the category that occurs most frequently among these K nearest neighbors.

Selecting the K value and the distance metrics are the two main basic elements involved in building the KNN model. The accuracy of the model’s prediction is greatly dependent on the K-value; a low K-value can easily result in model overfitting, while a high K-value can easily result in model underfitting. Furthermore, the discrimination capability is substantially determined by the employed distance function, with common implementations including Euclidean distance, Manhattan distance, and cosine distance [27].
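A minimal sketch of the three KNN steps with Euclidean distance is shown below. The function name, the default K, and the toy points are illustrative assumptions.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training samples."""
    # Rank training samples by Euclidean distance to the query
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy training set with two well-separated classes
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
label_near_one = knn_predict(train_X, train_y, [4.5, 5.0])
label_near_zero = knn_predict(train_X, train_y, [0.2, 0.1])
```

Swapping `math.dist` for a Manhattan or cosine distance changes only the ranking key, which is why the distance metric is treated as a tunable element alongside K.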

2.3.3. Decision Tree

Decision tree (DT) [28], a supervised machine learning algorithm, addresses both classification and regression tasks through hierarchical rule-based partitioning. The tree structure of the DT depicts the decision-making process. Internal nodes denote a test of a feature or attribute, and branches denote the outcome of those tests. Leaf nodes denote a category or value. Choosing the best features for partitioning is necessary when building a decision tree. Common methods include information theoretical metrics and other approaches. ID3, C4.5, and CART [29,30] are all decision tree algorithms based on information theory.

The ID3 algorithm employs information gain to choose features for dataset segmentation and create decision tree models. However, this method exhibits critical limitations, including heightened sensitivity to high-dimensional feature spaces that may lead to suboptimal generalization performance. C4.5 enhances ID3 by adopting the information gain ratio as its splitting rule, which counteracts the bias of ID3. Although the ID3 and C4.5 algorithms are designed to gain as much information as possible from the training samples, the decision trees they produce are large. To improve the efficiency of generating decision trees, the CART algorithm selects features based on the Gini index and constructs the decision tree recursively.
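The three splitting criteria mentioned here can be compared on a toy split. The sketch below computes information gain (ID3), gain ratio (C4.5), and the Gini index (CART) with the standard library; the function names and labels are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index used by CART."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction achieved by splitting labels into groups (ID3)."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

def gain_ratio(labels, groups):
    """Information gain divided by the split information (C4.5)."""
    n = len(labels)
    split_info = -sum(len(g) / n * log2(len(g) / n) for g in groups)
    return information_gain(labels, groups) / split_info

parent = [1, 1, 0, 0]
perfect_split = [[1, 1], [0, 0]]
ig = information_gain(parent, perfect_split)
gr = gain_ratio(parent, perfect_split)
```

For this perfect two-way split both the information gain and the gain ratio equal 1 bit, and the Gini index of the parent node is 0.5.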

2.3.4. Support Vector Machine

The Support Vector Machine (SVM) [31], a classical supervised learning algorithm, is applicable to both binary and multi-class classification tasks by identifying a maximal-margin hyperplane in a kernel-induced feature space. By constructing a kernel function, SVM maps data that are linearly inseparable in a low-dimensional space into a high-dimensional space where they become linearly separable, which solves the problem of linear inseparability in low-dimensional spaces [32].

In practice, choosing a suitable kernel function is critical to SVM implementation. Widely adopted kernels include linear, polynomial, radial basis function, and sigmoid formulations [33].

2.4. Performance Metrics of the Model

Cancer prediction in this paper is a binary classification problem, and model performance is quantified through diagnostic evaluation metrics derived from the confusion matrix shown in Table 5 [34]. This matrix categorizes prediction outcomes into four mutually exclusive classes, as follows: (i) true positive (TP), the correct identification of malignant cases; (ii) false negative (FN), malignant instances erroneously classified as benign; (iii) false positive (FP), benign cases incorrectly predicted as malignant; (iv) true negative (TN), the accurate recognition of non-cancerous samples.

Accuracy, precision, recall, and F1 score are evaluation indexes of the classification model that can be obtained from the confusion matrix, and their specific meanings are described below [35].

1. Accuracy, defined as the proportion of all samples that are predicted correctly. The formula is as follows:

Accuracy = \frac{TP + TN}{TP + FN + FP + TN}

(5)

A high accuracy rate in cervical cancer classification prediction implies that the model correctly classifies a greater percentage of all samples, including both those with and without cervical cancer. The accuracy is reasonable as it gives an assessment of overall classification accuracy and can aid in determining the overall effectiveness of the model for cervical cancer prediction.

2. Precision, defined as the percentage of positive samples predicted by the model that actually turn out to be positive samples. The formula is as follows:

Precision = \frac{TP}{TP + FP}

(6)

3. Recall, defined as the percentage of true positive samples that are predicted to be positive. The formula is as follows:

Recall = \frac{TP}{TP + FN}

(7)

The high recall rate in this study suggests that a higher percentage of cervical cancer patients were correctly diagnosed. In the medical field, it is crucial for cancer patients to receive cancer diagnosis at the earliest possible time, otherwise it could delay treatment. Therefore, the recall is an important indicator that cannot be ignored in disease diagnosis.

4. F1-score, defined as the harmonic mean of the precision and recall of the model. The formula is as follows:

F1-score = \frac{2TP}{2TP + FP + FN}

(8)

F1-score is a comprehensive evaluation metric of the model’s precision and recall. It is also a good choice for cancer classification prediction because it balances the model’s ability to correctly classify cervical cancer patients and non-cervical cancer patients.

All performance indices are bounded within 0 and 1, where higher values correspond to superior discriminatory performance in binary classification tasks.
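The four metrics can be computed directly from the confusion matrix counts, as in the sketch below. The counts passed in are made-up numbers chosen for easy arithmetic, not results from the paper.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall, and F1-score from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, precision, recall, f1

# Hypothetical imbalanced test set: 10 positives and 90 negatives
acc, prec, rec, f1 = classification_metrics(tp=8, fn=2, fp=1, tn=89)
```

In this imbalanced example the accuracy is 0.97 while the recall is only 0.80, which shows why recall and F1-score are reported alongside accuracy for disease diagnosis.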

3. Results

3.1. Experimental Environment

In this experiment, the operating system is Windows 11, the processor is an Intel(R) Core(TM) i5-1135G7 @ 2.40 GHz 2.42 GHz, and the memory capacity is 16.00 GB. The development environment is Python 3.10.13, and the Python package versions are pymrmr (0.1.11), scikit-learn (1.2.1), XGBoost (2.0.3), imbalanced-learn (0.11.0), and CTGAN (0.9.0). The experimental data come from the RFCC dataset, and the LR, KNN, DT, and SVM models are used to predict the presence of cervical cancer. We compare the performance of each model and the effects of the various data preprocessing strategies on cervical cancer prediction, and then compare the best results of this paper with the best results of other studies.

3.2. Experimental Parameter Setting

In this experiment, the sample data are split into two parts, the training set and the test set, according to the ratio of 70% and 30%. The grid search method [36] is adopted to determine the optimal parameter for all models. The principle of the grid search method is to conduct a cyclic traversal of all possible values of the parameters. For each parameter combination, a detailed comparison and analysis of its performance in the model training process is carried out. Finally, the parameter combination that achieves the highest training performance is selected to improve the prediction accuracy of the model, as depicted in Table 6.
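The exhaustive traversal behind grid search can be sketched with `itertools.product`. Here `evaluate` is a hypothetical callable standing in for the cross-validated score of a model trained with the given parameters, and the parameter names and toy scoring rule are invented for illustration; the actual grids are those listed in Table 6.

```python
from itertools import product

def grid_search(param_grid, evaluate):
    """Try every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical scoring rule standing in for cross-validated accuracy
def toy_score(params):
    return -abs(params["C"] - 1.0) - (0.1 if params["penalty"] == "l1" else 0.0)

best, score = grid_search({"C": [0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}, toy_score)
```

The cost grows with the product of the grid sizes, so six combinations are evaluated in this toy example.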

The parameters of the mRMR, RFE, SMOTE-ENN, CTGAN, and other methods used in the hybrid strategy are shown in Table 7.

To make the test results more reliable and persuasive, this study employs 10-fold cross-validation [37]. The training samples are evenly divided into 10 parts. In each round, one part is reserved for validating the model, while the other nine parts are used for training. The process is repeated 10 times so that each part serves once as the validation set, and the final evaluation result is the average of the 10 results.
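The fold construction can be sketched as follows. This minimal version splits indices in order without shuffling or stratification, which is an assumption made for brevity; the function name and the sample count are illustrative.

```python
def k_fold_indices(n_samples, k=10):
    """Split sample indices into k (train, validation) pairs of near-equal size."""
    indices = list(range(n_samples))
    # The first n_samples % k folds receive one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, val))
        start += size
    return folds

folds = k_fold_indices(25, k=10)
```

Each sample appears in exactly one validation fold, and the reported metric is the mean of the 10 per-fold scores.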

3.3. Performance Optimization Links of Hybrid Strategy

3.3.1. Comparative Experiments on Feature Selection

To validate the superiority of hybrid feature selection over individual feature selection methods, this study designs four comparative experiments:

Group 1 (Baseline)—No feature selection implemented;

Group 2 (mRMR only)—Feature selection using solely the mRMR algorithm;

Group 3 (RFE only)—Feature selection exclusively through XGBoost-based RFE;

Group 4 (mRMR + RFE)—Hybrid feature selection combining mRMR for initial screening followed by XGBoost-based RFE for secondary refinement.

The experiments are conducted on a standardized UCI cervical cancer dataset, with four classifiers (LR, KNN, DT, and SVM) being employed for performance evaluation. Model assessment metrics included accuracy, precision, recall, and F1-score, with detailed results presented in Table 8.

As shown in Table 8, across all models, the baseline group (Group 1) exhibited inferior performance metrics compared to the feature-selected groups (Groups 2–4), confirming the necessity of feature engineering for enhancing classification performance. The hybrid strategy (Group 4, mRMR + RFE) demonstrated the best overall performance, specifically for LR and SVM, where Group 4 outperformed other groups across all metrics. For KNN, although the precision of Group 4 (53.55%) was 0.35 percentage points lower than that of Group 3 (53.90%), its accuracy (94.05%), recall (35.00%), and F1-score (39.60%) remained superior. For DT, Group 4 and Group 3 showed comparable overall performances. While the differences in accuracy, precision, and F1-score between the two groups were minimal (0.02%, 0.53%, and 0.41%), Group 4 achieved a 1.33% higher recall than Group 3.

In summary, feature engineering is a critical step for improving classification performance. The hybrid feature selection method outperformed single-stage feature engineering approaches on the cervical cancer dataset. Therefore, we selected the hybrid strategy for feature engineering in this study.

3.3.2. Comparative Experiments on Sample Balancing

After applying hybrid feature selection to the cervical cancer dataset, this study experimentally compares the following four sampling methods to determine the optimal sample balancing strategy:

SMOTE [38]—Synthetic Minority Over-sampling Technique, which generates synthetic samples for the minority class through oversampling;

SMOTE-Tomek [39]—Combines SMOTE oversampling with Tomek Links to remove boundary noise between classes;

Borderline-SMOTE [40]—Focuses on generating synthetic samples in borderline regions of the minority class;

SMOTE-ENN—Integrates SMOTE oversampling with Edited Nearest Neighbors (ENN) to clean noise in the majority class.

The performances of these methods have been evaluated using four classifiers, LR, KNN, DT, and SVM, with results summarized in Table 9.

As can be seen from Table 9, the accuracy, precision, and F1-score of SMOTE-ENN are significantly higher than those of the other methods on LR, KNN, DT, and SVM. Its recall on KNN (96.95%) is only 0.34 percentage points lower than that of Borderline-SMOTE (97.29%), and its recall is the highest on the other three models. Therefore, SMOTE-ENN was selected for the sample balancing of the cervical cancer dataset.

3.3.3. CTGAN Synthetic Data Quality Analysis

After sample balancing, the samples of the cervical cancer dataset are still scarce, and we use CTGAN to augment the dataset. To verify the quality of the data generated by CTGAN, this study evaluates the distribution consistency and statistical characteristics via two dimensions.

1. Quantitative characteristics

The Kolmogorov–Smirnov (KS) test [41] is used to verify the consistency of the distributions of the synthetic data and the real data on the quantitative features among the optimal features after feature screening (H_0: the two distributions are the same; significance level α = 0.05). The test results are as follows:

As can be seen from Table 10, the p values of STDs: number of diagnosis, smoking (packs/year), IUD (years), and first sexual intercourse (1.00, 0.71, 0.45, and 0.85, respectively) were all greater than 0.05, indicating that the synthetic data were consistent with the real data on these characteristics.

However, the p value of hormonal contraceptives (years) is less than 0.05. To find the cause, we plotted boxplots [42] of the synthetic data and the real data for this feature, as shown in Figure 3.

As can be seen from Figure 3, the p value of hormonal contraceptives (years) being less than 0.05 may be due to the long-tail nature of the distribution of this feature (e.g., a small number of patients have used hormonal contraceptives for more than 10 years). However, if the boxplot distribution of the synthetic data of this feature is close to the real data, the generated data can still be regarded as valid. Therefore, the distribution of quantitative characteristics between synthetic data and real data is basically the same.
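The statistic underlying the two-sample KS test is the largest gap between the two empirical cumulative distribution functions. The sketch below computes only that statistic with the standard library; the p value, which the paper reports, would normally come from a statistics package. The sample values are made up.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for v in sorted(set(a + b)):
        cdf_a = sum(1 for x in a if x <= v) / len(a)
        cdf_b = sum(1 for x in b if x <= v) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

real = [1, 2, 3, 4, 5]
synthetic_close = [1, 2, 3, 4, 6]
synthetic_shifted = [11, 12, 13, 14, 15]

d_same = ks_statistic(real, real)
d_close = ks_statistic(real, synthetic_close)
d_shifted = ks_statistic(real, synthetic_shifted)
```

A statistic near 0 means the two samples have nearly identical empirical distributions, while a value of 1 means they do not overlap at all. Because the statistic takes the maximum gap anywhere along the distribution, a long-tailed feature can fail the test even when the bulk of the two distributions agrees.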

2. Qualitative characteristics

For qualitative features, we used a chi-square ($χ^{2}$) test [43] to verify that the category distributions match ($H_{0}$: no significant difference in category frequency; significance level $α = 0.05$).

As shown in Table 11, the p values of the categorical features are all greater than 0.05 (0.06 to 1.00), indicating that the frequency distributions of the synthetic data on the categorical features are basically consistent with those of the real data.
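For a binary feature, the chi-square comparison of real and synthetic category frequencies reduces to a 2 × 2 contingency table. The sketch below is a minimal standard-library illustration under stated assumptions: the function name `chi_square_2x2` and the counts are hypothetical, no continuity correction is applied (the paper does not say whether one was used), and every category is assumed to occur at least once so that no expected count is zero.

```python
from math import erfc, sqrt


def chi_square_2x2(real_yes, real_no, syn_yes, syn_no):
    """Pearson chi-square statistic and p value for a 2x2 table (df = 1)."""
    table = [[real_yes, real_no], [syn_yes, syn_no]]
    total = sum(map(sum, table))
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, erfc(sqrt(chi2 / 2.0))


# Hypothetical counts of "yes"/"no" for one binary feature.
similar = chi_square_2x2(100, 900, 1010, 8990)     # close proportions
different = chi_square_2x2(100, 900, 3000, 7000)   # very different proportions

for name, (chi2, p) in [("similar", similar), ("different", different)]:
    verdict = "reject H0" if p < 0.05 else "do not reject H0"
    print(f"{name}: chi2={chi2:.4f}, p={p:.4g}, {verdict}")
```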

In conclusion, the synthetic data generated by CTGAN passed the distribution consistency tests for all features except hormonal contraceptives (years), whose boxplot nevertheless remains close to that of the real data. Their overall quality meets the requirements of medical data analysis and can effectively support the training of cervical cancer prediction models.

3.4. Comparative Analysis of Data Processing Strategies

We compared the data processing strategies introduced above in sequence, assessing the effectiveness of each one by its classification performance.

Control Group: The dataset after data cleaning and standardization only.

Strategy I: Hybrid feature selection that combines the mRMR method with XGBoost-based RFE. Features that have little relevance to the cervical cancer classification problem or that carry redundant information are eliminated step by step, which reduces the feature dimension of the dataset and provides a sounder basis for the subsequent analysis and modeling.

Strategy II: Building on Strategy I, the SMOTE-ENN combined sampling method is introduced to deal with the extreme class imbalance that is common in medical data. This reduces the over-reliance of the model on the prior class proportions during learning, thereby improving its generalization ability and classification accuracy.

Hybrid strategy: Building on Strategies I and II, CTGAN is used to expand the data. Because the added samples are synthetic, the sample size of the dataset can be increased substantially while limiting the exposure of individual patient records, which alleviates the problems caused by insufficient sample size and provides richer and more diverse data for model training.

To verify the validity and reliability of the proposed method, one control group and three experimental groups are set up. The prediction results on the cervical cancer dataset under each strategy are shown in Table 12.

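As can be seen from Table 12, the accuracies of LR, KNN, DT, and SVM in the control group reach 95.37%, 93.42%, 94.29%, and 93.71%, respectively, but their precision, recall, and F1-score are all below 70%. This gap reflects the class imbalance: with 803 negative and only 55 positive samples, a model can reach high accuracy while still missing many positive cases.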
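The gap between accuracy and the other metrics on imbalanced data can be reproduced from confusion-matrix counts alone. The sketch below is illustrative: the function name `classification_metrics` and the two classifiers are hypothetical, and returning 0.0 when a ratio is undefined is a convention chosen here. Only the class sizes (55 positive, 803 negative) come from the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Undefined ratios (zero denominator) are reported as 0.0 by convention.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Hypothetical classifier A labels every sample negative.
all_negative = classification_metrics(tp=0, fp=0, fn=55, tn=803)
# Hypothetical classifier B finds 35 of the 55 positives with 20 false alarms.
partial = classification_metrics(tp=35, fp=20, fn=20, tn=783)

for name, result in [("all negative", all_negative), ("partial", partial)]:
    print(name, [f"{100 * value:.2f}%" for value in result])
```

Classifier A never detects a positive case yet is correct on 803 of 858 samples, which is why accuracy alone is a weak indicator on this dataset.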

For Strategy I, all the metrics of each model are higher than in the control group. In terms of accuracy, precision, recall, and F1-score, LR improves by 0.53, 3.86, 4.40, and 3.91 percentage points, KNN by 0.63, 13.68, 18.75, and 18.20, DT by 0.28, 1.58, 5.26, and 3.20, and SVM by 1.06, 12.52, 10.71, and 12.04, respectively. The reason is that hybrid feature selection eliminates redundant variables and focuses the model on the most valuable features, which improves prediction performance to a certain extent. At this point, however, the precision, recall, and F1-score are all still below 70%.
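The improvements quoted for LR, KNN, and DT are plain differences between the Strategy I and control group rows of Table 12. A short sketch of that arithmetic follows. The values are copied from Table 12, the SVM rows are left out, and the helper name `gains` is hypothetical.

```python
# Values copied from Table 12: (accuracy, precision, recall, F1-score) in %.
control = {
    "LR": (95.37, 63.87, 65.09, 62.19),
    "KNN": (93.42, 39.87, 16.25, 21.40),
    "DT": (94.29, 55.11, 53.05, 51.74),
}
strategy_1 = {
    "LR": (95.90, 67.73, 69.49, 66.10),
    "KNN": (94.05, 53.55, 35.00, 39.60),
    "DT": (94.57, 56.69, 58.31, 54.94),
}


def gains(before, after):
    """Difference in percentage points, rounded to two decimals."""
    return tuple(round(a - b, 2) for b, a in zip(before, after))


for model in control:
    print(model, gains(control[model], strategy_1[model]))
```

These are absolute differences in percentage points. In relative terms the change can be much larger: KNN recall rises from 16.25% to 35.00%, which is more than double.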

For Strategy II, the precision, recall, and F1-score of each model improve substantially. The precision, recall, and F1-score of LR improve to 98.70%, 98.43%, and 98.56%, KNN to 99.01%, 96.95%, and 97.95%, DT to 97.97%, 98.69%, and 98.32%, and SVM to 97.39%, 92.57%, and 94.89%, respectively. The reason is that SMOTE-ENN increases the number of minority-class samples, which prevents the models from being biased toward the majority class and thus improves their ability to predict the minority class.

For the hybrid strategy, all four models achieve their best results on all metrics. In terms of accuracy, precision, recall, and F1-score, LR improves to 99.00%, 99.28%, 98.77%, and 99.02%, KNN to 98.16%, 99.31%, 97.14%, and 98.20%, DT to 98.40%, 98.16%, 98.73%, and 98.44%, and SVM to 98.10%, 97.90%, 98.43%, and 98.15%, respectively. The reason is that the strategy not only combines the advantages of Strategies I and II but also uses CTGAN to expand the dataset, which alleviates the problem of sample scarcity in the medical dataset and further improves the generalization ability of the model.

To summarize, of all the strategies compared, the hybrid strategy with CTGAN proposed in this paper yields the best classification performance for cervical cancer diagnosis.

3.5. Comparative Analysis of Different Classifiers

To identify the optimal cervical cancer prediction model under the proposed hybrid strategy, a comparative analysis was conducted between a control group and an experimental group. The control group used the original data (858 samples) after cleaning and standardization only, while the experimental group used the data processed with the CTGAN-enhanced hybrid strategy (feature selection, SMOTE-ENN, and CTGAN augmentation). Four classifiers were evaluated in both groups: Logistic Regression (LR), K-Nearest Neighbors (KNN), decision tree (DT), and Support Vector Machine (SVM). The results are summarized in Table 13 and Figure 4.

Based on the statistics in Table 13 and the visualization in Figure 4, all four models in the experimental group outperform their control group counterparts. The improvement in precision, recall, and F1-score is particularly large. A high recall rate means that a greater percentage of cervical cancer patients are correctly diagnosed. In the medical field, failing to diagnose a cancer patient promptly has far more serious consequences than misdiagnosing a person who does not have the disease. Therefore, recall is an indicator that cannot be neglected in disease diagnosis. In addition, KNN achieves the highest precision in the experimental group (99.31%), while LR achieves the highest accuracy, recall, and F1-score (99.00%, 98.77%, and 99.02%, respectively). In conclusion, the cervical cancer risk prediction model that combines LR with the hybrid strategy developed in this study demonstrates better overall predictive performance than the other three classification models.

3.6. Comparative Analysis with Other Studies

To assess our CTGAN-enhanced hybrid framework for cervical cancer prediction, this section compares the best prediction results obtained with this strategy against the results reported in the existing literature.

Since other studies use accuracy as the main evaluation index, this paper compares the accuracy of cervical cancer prediction uniformly, as shown in Table 14.

Comparative analysis shows that the results of this paper are superior to those of existing methods on the RFCC dataset, with the prediction accuracy for cervical cancer improved to 99.00%. The reason is that the proposed hybrid data processing strategy not only selects the optimal feature subset but also alleviates the problems of sample imbalance and insufficient sample size in medical datasets.

4. Discussion

Cervical cancer, a leading cause of premature mortality among women of reproductive age (15–49), demands optimized early diagnosis for global public health impact. While cervical biopsy (the gold-standard diagnostic) achieves high accuracy, its invasive nature risks pain, bleeding, and infection, coupled with requirements for specialized pathology infrastructure and personnel. Limited accessibility in low-income countries results in 85% of deaths occurring in low-resource settings. This disparity underscores the urgent need for non-invasive, low-cost screening technologies, where machine learning offers a promising solution.

This study proposes a hybrid strategy with CTGAN, pioneering the systematic integration of three modules—feature engineering, sample balancing, and data augmentation—to construct an innovative framework for cervical cancer prediction. The following breakthroughs were achieved.

4.1. Technical Advantages of the Hybrid Strategy

All models exhibited improved performance metrics after hybrid feature selection. The mRMR algorithm initially eliminated highly redundant features, while XGBoost-RFE further refined the feature subset to retain the most discriminative attributes. This two-stage approach enhanced model generalizability and computational efficiency, providing a robust solution for high-dimensional cervical cancer data.
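The first stage of this two-stage selection can be sketched as a greedy mRMR loop using the MID score listed in Table 7 (relevance minus mean redundancy, both measured by mutual information). This is a simplified illustration, not the authors' implementation: it assumes discrete features (numeric features would need discretization or another mutual information estimator), and the names `mutual_information`, `mrmr_mid`, and the toy data are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)))
               for (a, b), c in pxy.items())


def mrmr_mid(features, target, n_select):
    """Greedy mRMR with the MID score: relevance minus mean redundancy."""
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    selected = []
    while len(selected) < n_select:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        remaining = [name for name in features if name not in selected]
        selected.append(max(remaining, key=score))
    return selected


# Hypothetical toy data: the target depends on two independent bits a and b;
# a_copy duplicates a and therefore adds no new information.
features = {
    "a":      [0, 0, 1, 1, 0, 0, 1, 1],
    "a_copy": [0, 0, 1, 1, 0, 0, 1, 1],
    "b":      [0, 1, 0, 1, 0, 1, 0, 1],
}
target = [a + b for a, b in zip(features["a"], features["b"])]
print(mrmr_mid(features, target, n_select=2))
```

All three toy features are equally relevant to the target, so ranking by relevance alone could keep both `a` and `a_copy`. The redundancy term is what steers the second pick toward `b`.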

The SMOTE-ENN combined sampling method significantly improved recall rates across all models. SMOTE first oversampled the minority class to balance the class distribution (Table 9 lists 803 positive and 803 negative cases after SMOTE alone). ENN then removed noisy samples near the class boundary. After both steps, positive cases had increased from 55 to 766 and negative cases had decreased from 803 to 717. This resulted in a cleaner, balanced dataset, enabling models to better learn the features of the minority class and reducing missed diagnoses.
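The two steps can be sketched in plain Python on toy data. This is a simplified illustration under stated assumptions, not the implementation used in the paper: SMOTE is reduced to linear interpolation between a minority sample and one of its nearest minority neighbours, ENN uses a majority vote over k = 3 neighbours (library implementations may use a stricter rule), and all names and data points are hypothetical.

```python
import random
from math import dist


def k_nearest(index, points, k):
    """Indices of the k points closest to points[index], excluding itself."""
    others = (j for j in range(len(points)) if j != index)
    return sorted(others, key=lambda j: dist(points[index], points[j]))[:k]


def smote(minority, n_new, k=3, seed=7):
    """Interpolate n_new synthetic points between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(k_nearest(i, minority, k))
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return synthetic


def enn(points, labels, k=3):
    """Keep samples whose label matches the majority of their k neighbours."""
    kept = [i for i in range(len(points))
            if 2 * sum(labels[j] == labels[i]
                       for j in k_nearest(i, points, k)) > k]
    return [points[i] for i in kept], [labels[i] for i in kept]


# Hypothetical 2-D toy data: 36 negatives on a grid, 5 positives in a
# separate cluster, and one noisy negative inside the positive cluster.
negatives = [(0.2 * x, 0.2 * y) for x in range(6) for y in range(6)]
negatives.append((5.0, 5.0))
positives = [(5.0, 5.2), (5.2, 5.0), (4.8, 5.0), (5.0, 4.8), (5.2, 5.2)]

# Step 1: oversample the positives until both classes have the same size.
positives += smote(positives, n_new=len(negatives) - len(positives))
points = negatives + positives
labels = [0] * len(negatives) + [1] * len(positives)

# Step 2: remove samples that disagree with their neighbourhood.
clean_points, clean_labels = enn(points, labels)
print(clean_labels.count(1), clean_labels.count(0))
```

In this toy run the only sample ENN removes is the noisy negative that sits inside the positive cluster, so the cleaned set is balanced but not exactly equal in size, as with the 766/717 split reported above.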

CTGAN-generated synthetic data enhanced model generalizability by expanding the training set while preserving the original data distribution. This approach reduced overfitting risks and improved the Logistic Regression (LR) model’s performance to 99.00% accuracy, 99.28% precision, 98.77% recall, and 99.02% F1-score, surpassing existing methods.

In addition, 10-fold cross-validation was employed in this cervical cancer prediction study to make full use of the data and reduce the evaluation bias caused by different data partitions. The SMOTE-ENN and CTGAN methods were used to generate enhanced training data, which not only helps models capture more complex patterns but also helps mitigate overfitting risks.
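The partitioning behind 10-fold cross-validation can be sketched with the standard library as follows. The paper does not state whether the folds were stratified, so stratification is an assumption made here to keep the class proportions similar across folds. The function name `stratified_k_fold` and the seed are hypothetical, and only the class sizes of the original dataset (55 positive, 803 negative) come from the paper.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


labels = [1] * 55 + [0] * 803
folds = stratified_k_fold(labels, k=10)

# Each fold serves once as the test set; the other nine form the training set.
for test_fold in folds:
    train = [i for fold in folds if fold is not test_fold for i in fold]
    n_positive = sum(labels[i] for i in test_fold)
    print(len(train), len(test_fold), n_positive)
```

Every sample is used for testing exactly once, and the reported metric is typically the average over the ten test folds.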

4.2. Clinical and Public Health Implications

The proposed hybrid strategy achieved remarkable diagnostic metrics—99.00% accuracy, 99.28% precision, 98.77% recall, and 99.02% F1-score—with profound clinical significance. High recall (98.77%) minimizes missed diagnoses, enabling timely treatment, while high precision (99.28%) reduces unnecessary interventions. By integrating behavioral factors (e.g., smoking history) with non-invasive tests (e.g., Hinselmann colposcopy), this framework offers a scalable, cost-effective alternative to invasive biopsies, particularly in resource-limited settings.

4.3. Limitations and Future Directions

Despite the strong results of the hybrid strategy, the following challenges remain.

Regarding computational complexity and training efficacy, the three-stage optimization process incurs computational overhead. However, the significant gains in diagnostic accuracy justify this trade-off, and synthetic data can be reused for long-term efficiency.

For medical data privacy concerns, the CTGAN-generated synthetic data in this study contain no real individual information, and synthetic samples were created through distributional simulation rather than direct replication, thereby partially mitigating privacy leakage risks. Nevertheless, potential privacy vulnerabilities inherent in machine learning training processes cannot be entirely eliminated, necessitating the further exploration of advanced privacy-preserving technologies. Future work will focus on innovating and applying privacy-enhancing techniques such as homomorphic encryption, differential privacy, and federated learning in healthcare scenarios. Our goal is to develop optimized frameworks that balance diagnostic accuracy with robust patient privacy protection.

In clinical practice, this study primarily focused on theoretical research and methodological innovation. The proposed framework has not yet undergone clinical validation. Future efforts will prioritize collaborations with medical institutions to evaluate the generalization capability of our hybrid strategy using clinical datasets, assess its performance in real-world diagnostic workflows, and validate model robustness in authentic clinical environments. These objectives constitute key priorities for our research group moving forward.

5. Conclusions

This study presents a hybrid strategy integrating feature selection (mRMR-RFE), sample balancing (SMOTE-ENN), and CTGAN-based data augmentation for cervical cancer prediction. The framework achieved state-of-the-art performance (99.00% accuracy) on the UCI cervical cancer dataset. The key clinical contributions include high recall (98.77%) to reduce missed diagnoses and high precision (99.28%) to avoid overtreatment, offering a viable alternative to invasive screening. Moreover, CTGAN-generated synthetic data circumvent the direct utilization of sensitive raw data, partially preserving patient privacy. This study’s limitations include high computational complexity and residual privacy risks. Future work will explore privacy-preserving techniques (notably homomorphic encryption, differential privacy, and federated learning) to optimize the trade-off between diagnostic performance and data security. Additionally, the proposed methodology will undergo clinical validation in real-world healthcare scenarios to enhance its practical applicability.

Author Contributions

Conceptualization, M.T. and H.C.; methodology, M.T.; software, M.T.; validation, M.T. and Z.L.; investigation, M.T.; writing—original draft preparation, M.T.; writing—review and editing, M.T., H.C., Z.L. and G.C.; supervision, H.C. and G.C.; project administration, H.C.; funding acquisition, H.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Hubei Provincial Department of Science and Technology Joint Fund, Construction of a Risk Model for Predicting the Pregnancy of Cesarean Section Scar Pregnancy Based on Multiple High-risk Factors, grant number JCZRLH202500083.

Data Availability Statement

The dataset used in this research is publicly available from the UCI Machine Learning Repository at https://archive.ics.uci.edu/dataset/383/cervical+cancer+risk+factors (accessed on 17 July 2021).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CTGAN	Conditional tabular generative adversarial networks
GBD	Global Burden of Disease
WHO	World Health Organization
SMOTE	Synthetic Minority Over-Sampling Technique
ENN	Edited Nearest Neighbors
mRMR	Minimal redundancy maximal relevance
RFE	Recursive feature elimination
XGBoost	Extreme gradient boosting
RFCC	Risk factors of cervical cancer
GA	Genetic algorithm
CSA	Crow search algorithm
RF	Random forest
SVM	Support Vector Machine
NB	Naive Bayes
LR	Logistic Regression
KNN	K-Nearest Neighbor
DT	Decision tree
ANOVA	One-way analysis of variance
MI	Mutual information
ANN	Artificial neural networks
GBM	Gradient-boosting machine
LASSO	Least absolute shrinkage and selection operator
STDs	Sexually transmitted diseases
IUD	Intrauterine device
Dx	Digital Radiography
HPV	Human papilloma virus
CIN	Cervical intraepithelial neoplasia

References

1. Sun, P.; Yu, C.; Yin, L.; Chen, Y.; Sun, Z.; Zhang, T.; Shuai, P.; Zeng, K.; Yao, X.; Chen, J. Global, regional, and national burden of female cancers in women of child-bearing age, 1990–2021: Analysis of data from the global burden of disease study 2021. EClinicalMedicine 2024, 74, 102713.
2. Bray, F.; Ferlay, J.; Soerjomataram, I.; Siegel, R.L.; Torre, L.A.; Jemal, A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2018, 68, 394–424.
3. Marván, M.; López-Vázquez, E. The Anthropocene: Politik–Economics–Society–Science: Preventing Health and Environmental Risks in Latin America; Springer: Berlin/Heidelberg, Germany, 2017.
4. Mezei, A.K.; Armstrong, H.L.; Pedersen, H.N.; Campos, N.G.; Mitchell, S.M.; Sekikubo, M.; Byamugisha, J.K.; Kim, J.J.; Bryan, S.; Ogilvie, G.S. Cost-effectiveness of cervical cancer screening methods in low-and middle-income countries: A systematic review. Int. J. Cancer 2017, 141, 437–446.
5. Web Annex, A. WHO Guideline for Screening and Treatment of Cervical Pre-Cancer Lesions for Cervical Cancer Prevention; World Health Organization: Geneva, Switzerland, 2021.
6. Newaz, A.; Muhtadi, S.; Haq, F.S. An intelligent decision support system for the accurate diagnosis of cervical cancer. Knowl.-Based Syst. 2022, 245, 108634.
7. Kaushik, K.; Bhardwaj, A.; Bharany, S.; Alsharabi, N.; Rehman, A.U.; Eldin, E.T.; Ghamry, N.A. A machine learning-based framework for the prediction of cervical cancer risk in women. Sustainability 2022, 14, 11947.
8. Aloss, A.; Sahu, B.; Deeb, H.; Mishra, D. A crow search algorithm-based machine learning model for heart disease and cervical cancer diagnosis. In Electronic Systems and Intelligent Computing: Proceedings of ESIC 2021; Springer: Berlin/Heidelberg, Germany, 2022; pp. 303–311.
9. Tanimu, J.J.; Hamada, M.; Hassan, M.; Kakudi, H.; Abiodun, J.O. A machine learning method for classification of cervical cancer. Electronics 2022, 11, 463.
10. Chadaga, K.; Prabhu, S.; Sampathila, N.; Chadaga, R.; KS, S.; Sengupta, S. Predicting cervical cancer biopsy results using demographic and epidemiological parameters: A custom stacked ensemble machine learning approach. Cogent Eng. 2022, 9, 2143040.
11. Kumawat, G.; Vishwakarma, S.K.; Chakrabarti, P.; Chittora, P.; Chakrabarti, T.; Lin, J.C.-W. Prognosis of cervical cancer disease by applying machine learning techniques. J. Circuits Syst. Comput. 2023, 32, 2350019.
12. Priya, S.; Karthikeyan, N.; Palanikkumar, D. Pre Screening of Cervical Cancer Through Gradient Boosting Ensemble Learning Method. Intell. Autom. Soft Comput. 2023, 35, 2673–2685.
13. Bhavani, C.; Govardhan, A. Cervical cancer prediction using stacked ensemble algorithm with SMOTE and RFERF. Mater. Today Proc. 2023, 80, 3451–3457.
14. Shakil, R.; Islam, S.; Akter, B. A precise machine learning model: Detecting cervical cancer using feature selection and explainable AI. J. Pathol. Inform. 2024, 15, 100398.
15. Ali, M.S.; Hossain, M.M.; Kona, M.A.; Nowrin, K.R.; Islam, M.K. An ensemble classification approach for cervical cancer prediction using behavioral risk factors. Health Anal. 2024, 5, 100324.
16. Fernandes, K.; Cardoso, J.S.; Fernandes, J. Transfer learning with partial observability applied to cervical cancer screening. In Proceedings of the Pattern Recognition and Image Analysis: 8th Iberian Conference, IbPRIA 2017, Faro, Portugal, 20–23 June 2017; Proceedings 8. Springer: Berlin/Heidelberg, Germany, 2017; pp. 243–250.
17. Allison, P.D. Missing data. In The SAGE Handbook of Quantitative Methods in Psychology; Sage Publications Ltd.: Thousand Oaks, CA, USA, 2009; Volume 23, pp. 72–89.
18. Zheng, A.; Casari, A. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2018.
19. Tang, X.; Cai, L.; Meng, Y.; Gu, C.; Yang, J.; Yang, J. A novel hybrid feature selection and ensemble learning framework for unbalanced cancer data diagnosis with transcriptome and functional proteomic. IEEE Access 2021, 9, 51659–51668.
20. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
21. Rakesh, D.K.; Jana, P.K. A general framework for class label specific mutual information feature selection method. IEEE Trans. Inf. Theory 2022, 68, 7996–8014.
22. Jeon, H.; Oh, S. Hybrid-recursive feature elimination for efficient feature selection. Appl. Sci. 2020, 10, 3211.
23. Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling tabular data using conditional gan. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019.
24. Christodoulou, E.; Ma, J.; Collins, G.S.; Steyerberg, E.W.; Verbakel, J.Y.; Van Calster, B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J. Clin. Epidemiol. 2019, 110, 12–22.
25. Czepiel, S.A. Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation. 2002. Available online: https://www.stat.cmu.edu/~brian/valerie/617-2022/617-2021/week07/resources/mlelr.pdf (accessed on 16 June 2013).
26. Zhang, S.; Li, X.; Zong, M.; Zhu, X.; Cheng, D. Learning k for knn classification. ACM Trans. Intell. Syst. Technol. 2017, 8, 1–19.
27. Abu Alfeilat, H.A.; Hassanat, A.B.; Lasassmeh, O.; Tarawneh, A.S.; Alhasanat, M.B.; Eyal Salman, H.S.; Prasath, V.S. Effects of distance measure choice on k-nearest neighbor classifier performance: A review. Big Data 2019, 7, 221–248.
28. Podgorelec, V.; Kokol, P.; Stiglic, B.; Rozman, I. Decision trees: An overview and their use in medicine. J. Med. Syst. 2002, 26, 445–463.
29. Gite, P.; Chouhan, K.; Krishna, K.M.; Nayak, C.K.; Soni, M.; Shrivastava, A. ML Based Intrusion Detection Scheme for various types of attacks in a WSN using C4.5 and CART classifiers. Mater. Today Proc. 2023, 80, 3769–3776.
30. Javed Mehedi Shamrat, F.; Ranjan, R.; Hasib, K.M.; Yadav, A.; Siddique, A.H. Performance evaluation among id3, c4.5, and cart decision tree algorithm. In Pervasive Computing and Social Networking: Proceedings of ICPCSN 2021; Springer: Singapore, 2022; pp. 127–142.
31. Jakkula, V. Tutorial on Support Vector Machine (SVM); Washington State University: Pullman, WA, USA, 2006; Volume 37, p. 3.
32. Xue, H.; Yang, Q.; Chen, S. SVM: Support vector machines. In The Top Ten Algorithms in Data Mining; Chapman and Hall/CRC: Boca Raton, FL, USA, 2009; pp. 51–74.
33. Kavzoglu, T.; Colkesen, I. A kernel functions analysis for support vector machines for land cover classification. Int. J. Appl. Earth Obs. Geoinf. 2009, 11, 352–359.
34. Chen, H.; Mei, K.; Zhou, Y.; Wang, N.; Cai, G. Auxiliary Diagnosis of Breast Cancer Based on Machine Learning and Hybrid Strategy. IEEE Access 2023, 11, 96374–96386.
35. Khalsan, M.; Machado, L.R.; Al-Shamery, E.S.; Ajit, S.; Anthony, K.; Mu, M.; Agyeman, M.O. A survey of machine learning approaches applied to gene expression analysis for cancer prediction. IEEE Access 2022, 10, 27522–27534.
36. Shekar, B.; Dagnew, G. Grid search-based hyperparameter tuning and classification of microarray cancer data. In Proceedings of the 2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP), Gangtok, India, 25–28 February 2019; pp. 1–8.
37. Fushiki, T. Estimation of prediction error by using K-fold cross-validation. Stat. Comput. 2011, 21, 137–146.
38. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
39. Zeng, M.; Zou, B.; Wei, F.; Liu, X.; Wang, L. Effective prediction of three common diseases by combining SMOTE with Tomek links technique for imbalanced medical data. In Proceedings of the 2016 IEEE International Conference of Online Analysis and Computing Science (ICOACS), Chongqing, China, 28–29 May 2016; pp. 225–228.
40. Han, H.; Wang, W.-Y.; Mao, B.-H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005; pp. 878–887.
41. Lopes, R.H.; Reid, I.; Hobson, P.R. The Two-Dimensional Kolmogorov-Smirnov Test. 2007. Available online: https://bura.brunel.ac.uk/handle/2438/1166 (accessed on 27 April 2007).
42. Kampstra, P. Beanplot: A boxplot alternative for visual comparison of distributions. J. Stat. Softw. 2008, 28, 1–9.
43. McHugh, M.L. The chi-square test of independence. Biochem. Med. 2013, 23, 143–149.

Figure 1. Workflow of cervical cancer diagnosis based on a hybrid strategy with CTGAN.

Figure 2. The network structure diagram of CTGAN.

Figure 3. The distribution boxplot of hormonal contraceptives (years).

Figure 4. Comparison of the results of the control and experimental groups: (a) accuracy; (b) precision; (c) recall; (d) F1 score.

Table 1. Structure of RFCC dataset.

Dataset	Positive Samples (n)	Negative Samples (n)	Behavioral Factors (n)	Non-Invasive Examinations (n)	Invasive Examination, Target (n)	Task
RFCC	55	803	32	3	1	Predict the presence of cervical cancer

Table 2. The baseline characteristics of the RFCC dataset.

No.	Feature Name	Type	Biopsy Positive (n = 55)	Biopsy Negative (n = 803)	Missing (n, %)
1	Age	Int ¹	28 (21.0–35) *	25 (20–32)	0 (0%)
2	Number of sexual partners	Int	2 (2–3)	2 (2–3)	26 (3.0%)
3	First sexual intercourse	Int	17 (15–18)	17 (15–18)	7 (0.8%)
4	Number of pregnancies	Int	2 (2–3)	2 (1–3)	56 (6.5%)
5	Smoking	Bool ²	10 (18.2%)/45 (81.8%) *	113 (14.1%)/690 (85.9%)	13 (1.5%)
6	Smoking (years)	Int	2.5 (1–9)	7 (2–12)	13 (1.5%)
7	Smoking (packs/year)	Int	0.5 (0.2–2)	1.35 (0.5–3)	13 (1.5%)
8	Hormonal contraceptives	Bool	36 (65.5%)/19 (34.5%)	553 (68.9%)/250 (31.1%)	108 (12.6%)
9	Hormonal contraceptives (years)	Int	2 (0.5–9)	2.25 (1–4)	108 (12.6%)
10	Intrauterine device (IUD)	Bool	9 (16.4%)/46 (83.6%)	74 (9.2%)/729 (90.8%)	117 (13.6%)
11	IUD (years)	Int	3 (2–6)	4 (1.5–7)	117 (13.6%)
12	Sexually transmitted diseases (STDs)	Bool	12 (21.8%)/43 (78.2%)	67 (8.3%)/736 (91.7%)	105 (12.2%)
13	STDs (number)	Int	2 (1–2)	2 (1–2)	105 (12.2%)
14	STDs: condylomatosis	Bool	7 (12.7%)/48 (87.3%)	37 (4.6%)/766 (95.4%)	105 (12.2%)
15	STDs: cervical	Bool	0 (0%)/55 (100%)	0 (0%)/803 (100%)	105 (12.2%)
16	STDs: vaginal	Bool	0 (0%)/55 (100%)	4 (0.5%)/799 (99.5%)	105 (12.2%)
17	STDs: vulvo-perineal condylomatosis	Bool	7 (12.7%)/48 (87.3%)	36 (4.5%)/767 (95.5%)	105 (12.2%)
18	STDs: syphilis	Bool	0 (0%)/55 (100%)	18 (2.2%)/785 (97.8%)	105 (12.2%)
19	STDs: pelvic inflammatory disease	Bool	0 (0%)/55 (100%)	1 (0.1%)/802 (99.9%)	105 (12.2%)
20	STDs: genital herpes	Bool	1 (1.8%)/54 (98.2%)	0 (0%)/803 (100%)	105 (12.2%)
21	STDs: molluscum contagiosum	Bool	0 (0%)/55 (100%)	1 (0.1%)/802 (99.9%)	105 (12.2%)
22	STDs: AIDS	Bool	0 (0%)/55 (100%)	0 (0%)/803 (100%)	105 (12.2%)
23	STDs: HIV	Bool	5 (9.1%)/50 (90.9%)	13 (1.6%)/790 (98.4%)	105 (12.2%)
24	STDs: Hepatitis B	Bool	0 (0%)/55 (100%)	1 (0.1%)/802 (99.9%)	105 (12.2%)
25	STDs: HPV	Bool	0 (0%)/55 (100%)	2 (0.2%)/801 (99.8%)	105 (12.2%)
26	STDs: number of diagnoses	Int	0 (0-0)	0 (0-0)	105 (12.2%)
27	STDs: time since first diagnosis	Int	--	--	787 (91.7%)
28	STDs: time since last diagnosis	Int	--	--	787 (91.7%)
29	Dx: cancer	Bool	6 (10.9%)/49 (89.1%)	12 (1.5%)/791 (98.5%)	0 (0%)
30	Dx: cervical intraepithelial neoplasia (CIN)	Bool	3 (5.5%)/52 (94.5%)	6 (0.7%)/797 (99.3%)	0 (0%)
31	Dx: human papilloma virus (HPV)	Bool	6 (10.9%)/49 (89.1%)	12 (1.5%)/791 (98.5%)	0 (0%)
32	Digital Radiography (Dx)	Bool	7 (12.7%)/48 (87.3%)	17 (2.1%)/786 (97.9%)	0 (0%)
33	Hinselmann	Bool	25 (45.5%)/30 (54.5%)	10 (1.2%)/793 (98.8%)	0 (0%)
34	Schiller	Bool	48 (87.3%)/7 (12.7%)	26 (3.2%)/777 (96.8%)	0 (0%)
35	Citology	Bool	18 (32.7%)/37 (67.3%)	26 (3.2%)/777 (96.8%)	0 (0%)

¹ The Int type indicates that the feature is numeric. ² The Bool type indicates that the feature is categorical. * Data are “median (interquartile range)” or “n of yes (%)/n of no (%)”.

Table 3. Optimal features subset of RFCC dataset.

No.	Feature Name	Feature Importance
1	Schiller	72.483283
2	Dx: CIN	9.848515
3	Dx: HPV	2.212579
4	First sexual intercourse	2.010904
5	Citology	1.982685
6	IUD (years)	1.982242
7	Hormonal Contraceptives (years)	1.943814
8	STDs: Number of diagnoses	1.90573
9	Smoking (packs/year)	1.761776
10	Hinselmann	1.714833
11	Dx	1.477720
12	Dx: Cancer	0.675923

Table 4. Datasets after sample balancing and expanding.

Dataset	Original Positive	Original Negative	After SMOTE-ENN Positive	After SMOTE-ENN Negative	After SMOTE-ENN and CTGAN Positive	After SMOTE-ENN and CTGAN Negative
RFCC	55	803	766	717	15,723 + 766	14,277 + 717

Table 5. Confusion matrix.

Label	Predicted Positive	Predicted Negative
Positive	TP	FN
Negative	FP	TN

Table 6. The parameters of the classifiers.

Model	Parameter	Meaning	Value
LR	penalty	penalty term	L2
LR	solver	optimization algorithm	liblinear
LR	C	inverse of regularization strength	1.0
KNN	n_neighbors	K value	2
KNN	weights	weights of the nearest neighbor samples	uniform
KNN	algorithm	nearest-neighbor search algorithm	auto
DT	splitter	strategy for choosing the split at each node	random
DT	max_depth	maximum depth of the tree	10
SVM	C	penalty coefficient	0.8
SVM	kernel	type of kernel function	rbf

Table 7. The parameters of the model used in the hybrid strategy.

Model	Parameter	Meaning	Value
mRMR	--	mutual information criterion	MID
mRMR	--	number of output features	24
RFE	estimator	classifier	XGBoost
RFE	score	evaluation indicator	accuracy
RFE	min_features_to_select	minimum number of output features	12
SMOTE-ENN	sampling_strategy	sampling strategy	auto
SMOTE-ENN	random_state	random seed	7
CTGAN	epochs	number of training epochs	50

Table 8. Comparison of prediction results under different feature filtering approaches.

Model	Preprocess Method	Accuracy	Precision	Recall	F1-Score
LR	Group 1 (Baseline)	95.37%	63.87%	65.09%	62.19%
	Group 2	95.82%	66.74%	68.48%	65.13%
	Group 3	95.42%	63.86%	68.24%	63.68%
	Group 4	95.90%	67.73%	69.49%	66.10%
KNN	Group 1 (Baseline)	93.42%	39.87%	16.25%	21.40%
	Group 2	93.75%	48.52%	24.58%	29.98%
	Group 3	93.82%	53.90%	24.63%	31.46%
	Group 4	94.05%	53.55%	35.00%	39.60%
DT	Group 1 (Baseline)	94.29%	55.11%	53.05%	51.74%
	Group 2	94.44%	55.35%	56.75%	53.76%
	Group 3	94.59%	57.22%	56.98%	54.53%
	Group 4	94.57%	56.69%	58.31%	54.94%
SVM	Group 1 (Baseline)	93.71%	50.88%	31.69%	35.64%
	Group 2	94.24%	58.22%	40.79%	43.74%
	Group 3	94.16%	56.96%	36.29%	41.34%
	Group 4	94.76%	62.97%	42.15%	47.35%

Bold values are the optimal results for each classifier under the different feature selection methods.

Table 9. Comparison of prediction results derived using different sampling techniques.

Model	Methods	Positive/Negative	Accuracy	Precision	Recall	F1-Score
LR	SMOTE	803/803	96.28%	95.96%	96.68%	96.30%
	SMOTE-Tomek	800/800	96.34%	96.08%	96.63%	96.34%
	Borderline-SMOTE	803/803	96.41%	96.19%	96.68%	96.41%
	SMOTE-ENN	766/717	98.52%	98.70%	98.43%	98.56%
KNN	SMOTE	803/803	96.31%	95.80%	96.92%	96.34%
	SMOTE-Tomek	800/800	96.06%	95.71%	96.48%	96.07%
	Borderline-SMOTE	803/803	96.61%	96.04%	97.29%	96.64%
	SMOTE-ENN	766/717	97.93%	99.01%	96.95%	97.95%
DT	SMOTE	803/803	97.27%	96.68%	97.91%	97.27%
	SMOTE-Tomek	800/800	97.40%	96.76%	98.12%	97.42%
	Borderline-SMOTE	803/803	97.15%	96.37%	98.02%	97.16%
	SMOTE-ENN	766/717	98.27%	97.97%	98.69%	98.32%
SVM	SMOTE	803/803	96.13%	95.41%	96.99%	96.17%
	SMOTE-Tomek	800/800	95.73%	95.26%	96.28%	95.74%
	Borderline-SMOTE	803/803	96.29%	95.79%	96.88%	96.31%
	SMOTE-ENN	766/717	97.36%	97.51%	97.38%	97.43%

Bold values are the optimal results for each classifier across the different sampling methods.

Table 10. The KS statistic of quantitative characteristics.

No.	Feature	KS Statistic	p-Value
1	STDs: Number of diagnosis	0.0000	1.00
2	Smoking (packs/year)	0.0198	0.71
3	IUD (years)	0.0303	0.45
4	First sexual intercourse	0.1399	0.85
5	Hormonal contraceptives (years)	0.1469	8.7 × 10⁻⁹
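
The KS statistic in Table 10 is the largest vertical distance between two empirical cumulative distribution functions. A minimal pure-Python sketch of the two-sample statistic follows. The function name `ks_statistic` is illustrative, and which two samples the authors compared (presumably the original and CTGAN-generated values of each feature) is an assumption. The p-value is omitted because it depends on sample sizes that this section does not give.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    distance = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so it is enough to evaluate the gap at those points.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        distance = max(distance, abs(cdf_a - cdf_b))
    return distance
```

A value of 0 means the two empirical distributions coincide, as reported for the first feature in Table 10, while a value of 1 means the samples do not overlap at all.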

Table 11. The Chi-square statistic of qualitative characteristics.

No.	Feature	$\chi^{2}$ Statistic	p-Value
1	Schiller	0.1135	0.74
2	Hinselmann	0.0000	1.00
3	Citology	0.0533	0.82
4	Dx	3.4176	0.06
5	Dx: HPV	0.0000	1.00
6	Dx: CIN	0.2882	0.59
7	Dx: Cancer	0.9092	0.34
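
The p-values in Table 11 can be related to the statistics with a short calculation. The sketch below assumes one degree of freedom, as for a 2 × 2 contingency table of a binary feature against two data groups. The degrees of freedom are not stated in this section. With one degree of freedom, the upper-tail probability of the chi-square distribution reduces to the complementary error function of the square root of half the statistic.

```python
import math


def chi2_p_value_df1(statistic):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    Uses P(Z**2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
    """
    if statistic < 0:
        raise ValueError("chi-square statistic must be non-negative")
    return math.erfc(math.sqrt(statistic / 2.0))


# Statistics copied from Table 11.
TABLE_11 = {
    "Schiller": 0.1135,
    "Hinselmann": 0.0000,
    "Citology": 0.0533,
    "Dx": 3.4176,
    "Dx: HPV": 0.0000,
    "Dx: CIN": 0.2882,
    "Dx: Cancer": 0.9092,
}

for feature, stat in TABLE_11.items():
    print(f"{feature}: chi2={stat:.4f}, p={chi2_p_value_df1(stat):.2f}")
```

Under the one-degree-of-freedom assumption, the computed values round to the p-values listed in Table 11.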

Table 12. Comparison of prediction results of cervical cancer dataset under different strategies.

Model	Strategy	Accuracy	Precision	Recall	F1-Score
LR	Control Group	95.37%	63.87%	65.09%	62.19%
	Strategy I	95.90%	67.73%	69.49%	66.10%
	Strategy II	98.52%	98.70%	98.43%	98.56%
	Hybrid Strategy	99.00%	99.28%	98.77%	99.02%
KNN	Control Group	93.42%	39.87%	16.25%	21.40%
	Strategy I	94.05%	53.55%	35.00%	39.60%
	Strategy II	97.93%	99.01%	96.95%	97.95%
	Hybrid Strategy	98.16%	99.31%	97.14%	98.20%
DT	Control Group	94.29%	55.11%	53.05%	51.74%
	Strategy I	94.57%	56.69%	58.31%	54.94%
	Strategy II	98.27%	97.97%	98.69%	98.32%
	Hybrid Strategy	98.40%	98.16%	98.73%	98.44%
SVM	Control Group	93.71%	50.88%	31.69%	35.64%
	Strategy I	94.76%	62.97%	42.15%	47.35%
	Strategy II	97.36%	97.51%	97.38%	97.43%
	Hybrid Strategy	97.65%	96.92%	98.60%	97.73%

Bold values are the optimal results for each classifier using the proposed hybrid strategy.

Table 13. Results for the control and experimental groups.

	Model	Accuracy	Precision	Recall	F1-Score
Control Group	LR	95.37%	63.87%	65.09%	62.19%
	KNN	93.42%	39.87%	16.25%	21.40%
	DT	94.29%	55.11%	53.05%	51.74%
	SVM	93.71%	50.88%	31.69%	35.64%
Experimental Group	LR	99.00%	99.28%	98.77%	99.02%
	KNN	98.16%	99.31%	97.14%	98.20%
	DT	98.40%	98.16%	98.73%	98.44%
	SVM	97.65%	96.92%	98.60%	97.73%

Table 14. Accuracy comparison of this study with previous studies in the RFCC dataset.

Reference	Data Processing Method	Classification Method	Year	Accuracy
[6]	HS + GA	RF	2022	94.47%
[7]	--	XGBoost	2022	96.50%
[8]	CSA	RF + SVM + NB + LR + KNN	2022	97.70%
[9]	RFE + SMOTETomek	DT	2022	98.72%
[10]	ANOVA + Pearson + Borderline-SMOTE	LR + DT + KNN + SVM + NB	2022	98.00%
[11]	--	XGBoost	2023	94.94%
[12]	--	MLP with GBM	2023	96.33%
[13]	SMOTE + RFE	SVM + RF + LR + BC + KNN	2023	97.10%
[14]	Chi-square + LASSO	DT	2024	97.60%
[15]	SMOTE	NB + RF + GBM + AdaBoost + LR + DT + SVM	2024	98.06%
This study	mRMR + RFE + SMOTE-ENN + CTGAN	LR	2025	99.00%

Bold values show that the proposed hybrid strategy achieved the highest accuracy among the compared methods.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Tang, M.; Chen, H.; Lv, Z.; Cai, G. Diagnosis of Cervical Cancer Based on a Hybrid Strategy with CTGAN. Electronics 2025, 14, 1140. https://doi.org/10.3390/electronics14061140

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
