Diagnosis of Cervical Cancer Based on a Hybrid Strategy with CTGAN
Abstract
1. Introduction
- For the first challenge, we employ the mRMR algorithm for initial feature screening, followed by XGBoost-based RFE for secondary feature selection to identify the optimal feature subset;
- For the second challenge, we apply the SMOTE-ENN combined sampling method to address extreme class imbalance in medical data;
- For the third challenge, we utilize CTGAN to augment the dataset, overcoming the generalization bottleneck caused by limited training samples.
2. Materials and Methods
2.1. Cervical Cancer Dataset
2.1.1. Data Source and Description
2.1.2. Data Cleaning
2.1.3. Data Standardization
2.2. Proposed Hybrid Strategy with CTGAN
2.2.1. Feature Selection
- mRMR method for initial screening
Algorithm 1: mRMR Method
Step 1: Calculate the mutual information between each feature and the outcome
Step 2: Sort the features by their mutual information values
Step 3: Using a two-way search, remove features with zero mutual information from the candidate subset and retain the features with the highest relevance
Step 4: Choose weakly correlated features for classifier validation until the results are satisfactory
Step 5: Compute the mutual information difference (MID) and mutual information quotient (MIQ) for the selected feature subset
Step 6: Sort by MID or MIQ and verify the features with the maximum values
Step 7: Repeat Steps 3 to 6 until the results are satisfactory
Step 8: Add a weighting term to the MID or MIQ
Step 9: Find the feature combination that yields the best classification metrics through repeated cycles of computation and classifier validation
Step 10: Select the optimal feature set from the feature sets produced by the two search methods
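As an illustrative sketch (not the authors' implementation), the two core quantities behind Algorithm 1 — the mutual information of a discrete feature with the outcome, and the MID criterion (relevance minus mean redundancy with already-selected features) — can be computed in plain Python. Function names are hypothetical:

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """I(X;Y) in nats for two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr_mid(candidate, selected, label, data):
    """MID score: relevance to the label minus mean redundancy
    with the features already selected."""
    relevance = mutual_information(data[candidate], label)
    if not selected:
        return relevance
    redundancy = sum(mutual_information(data[candidate], data[f])
                     for f in selected) / len(selected)
    return relevance - redundancy
```

A feature identical to the label scores I = log 2 ≈ 0.693 nats on balanced binary data, while an independent feature scores 0; MID then penalizes a candidate that duplicates an already-selected feature.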
- Recursive feature elimination method for secondary screening
Algorithm 2: Recursive Feature Elimination Method
Step 1: Initialize the XGBoost classifier with the mRMR-preselected features as the baseline feature subset
Step 2: Assess feature importance using the average information gain
Step 3: Evaluate classification accuracy via cross-validation
Step 4: Eliminate the lowest-ranked feature from the current subset to obtain a revised feature subset
Step 5: Recompute the importance of each feature in the updated subset
Step 6: Reassess the classification accuracy of the updated subset
Step 7: Repeat Steps 4 to 6 until no further features can be eliminated
Step 8: Choose the feature subset with the highest classification accuracy among the K candidate subsets
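The elimination loop in Algorithm 2 can be sketched generically. Here `importance_fn` stands in for XGBoost's average information gain and `score_fn` for cross-validated accuracy; both are placeholders supplied by the caller, not the study's actual estimator:

```python
def recursive_feature_elimination(features, importance_fn, score_fn,
                                  min_features=1):
    """Greedy RFE sketch: repeatedly drop the least-important feature,
    score each subset, and return the best-scoring one."""
    subset = list(features)
    best_subset, best_score = list(subset), score_fn(subset)
    while len(subset) > min_features:
        # Step 4: eliminate the lowest-ranked feature
        subset.remove(min(subset, key=importance_fn))
        # Step 6: reassess accuracy of the revised subset
        score = score_fn(subset)
        if score > best_score:
            best_subset, best_score = list(subset), score
    return best_subset, best_score
```

With a toy importance map and a score that rewards informative features but penalizes subset size, the loop recovers the subset at the accuracy peak rather than the last one standing.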
2.2.2. SMOTE-ENN for Sample Balancing
Algorithm 3: SMOTE-ENN Method
Step 1: Split the imbalanced dataset into the minority class S_min and the majority class S_maj
Step 2: Determine the K nearest neighbors of each minority-class sample
Step 3: Compute the number of new samples required per minority-class sample from the imbalance ratio of the dataset
Step 4: For each minority-class sample x, randomly select N of its K nearest neighbors
Step 5: Construct each new sample as x_new = x + rand(0,1)·(x_n − x), where x_n is a selected nearest neighbor of x
Step 6: Apply ENN cleaning: remove every sample whose class label disagrees with the majority label of its K nearest neighbors
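A minimal sketch of the two halves of Algorithm 3, assuming numeric feature vectors: SMOTE interpolation via x_new = x + rand(0,1)·(x_n − x), followed by ENN cleaning that drops samples whose K nearest neighbors mostly carry the other label. Function names are illustrative:

```python
import random
from math import dist  # Euclidean distance (Python 3.8+)

def smote_sample(x, neighbors, rng=random):
    """One synthetic sample: x_new = x + rand(0,1) * (x_n - x)."""
    xn = rng.choice(neighbors)
    gap = rng.random()
    return [xi + gap * (ni - xi) for xi, ni in zip(x, xn)]

def enn_filter(points, labels, k=3):
    """ENN cleaning: keep only points whose label agrees with the
    strict majority of their k nearest neighbors."""
    kept = []
    for i, p in enumerate(points):
        nearest = sorted((j for j in range(len(points)) if j != i),
                         key=lambda j: dist(p, points[j]))[:k]
        agree = sum(1 for j in nearest if labels[j] == labels[i])
        if agree * 2 > k:
            kept.append(i)
    return kept
```

The synthetic point always lies on the segment between the seed sample and the chosen neighbor, which is what keeps SMOTE inside the minority region; ENN then prunes samples (original or synthetic) that end up stranded among the opposite class.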
2.2.3. CTGAN for Sample Expanding
Algorithm 4: CTGAN Method
Step 1: Estimate the number of modes m_i for each continuous column C_i using a variational Gaussian mixture model
Step 2: Fit a Gaussian mixture distribution to each continuous column
Step 3: Compute, for each value c_ij in C_i, the probability of belonging to each mode
Step 4: Sample a mode from the resulting probability density and normalize the value with respect to the sampled mode
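The mode-specific normalization of Steps 3–4 can be illustrated as follows. In CTGAN the mode parameters come from the fitted variational Gaussian mixture and the mode is sampled stochastically; this sketch takes the parameters as given and picks the most probable mode deterministically, which is a simplification:

```python
from math import exp, pi, sqrt

def mode_specific_normalize(c, weights, means, stds):
    """Express one continuous value as (alpha, beta): alpha is the value
    normalized within its most probable Gaussian mode, beta is a one-hot
    mode indicator (CTGAN-style mode-specific normalization)."""
    def density(k):
        z = (c - means[k]) / stds[k]
        return weights[k] * exp(-0.5 * z * z) / (stds[k] * sqrt(2 * pi))
    k = max(range(len(means)), key=density)   # Step 3: most likely mode
    alpha = (c - means[k]) / (4 * stds[k])    # Step 4: normalize in-mode
    beta = [1 if i == k else 0 for i in range(len(means))]
    return alpha, beta
```

Dividing by four standard deviations keeps alpha roughly within [−1, 1] for values inside the mode, which is the convention used in the CTGAN paper's representation.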
2.3. Classifiers
2.3.1. Logistic Regression
2.3.2. K-Nearest Neighbor
2.3.3. Decision Tree
2.3.4. Support Vector Machine
2.4. Performance Metrics of the Model
1. Accuracy, defined as the proportion of all samples that are predicted correctly. The formula is as follows: Accuracy = (TP + TN)/(TP + TN + FP + FN)
2. Precision, defined as the proportion of samples predicted as positive that are actually positive. The formula is as follows: Precision = TP/(TP + FP)
3. Recall, defined as the proportion of truly positive samples that are predicted as positive. The formula is as follows: Recall = TP/(TP + FN)
4. F1-score, defined as the harmonic mean of precision and recall. The formula is as follows: F1 = 2 × Precision × Recall/(Precision + Recall)
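These four metrics follow directly from the confusion-matrix counts (TP, FP, FN, TN); a small helper with illustrative naming computes all of them:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1
```

Note that on imbalanced data accuracy alone is misleading (predicting "negative" for everyone on the RFCC data already scores about 94%), which is why precision, recall, and F1 are reported alongside it.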
3. Results
3.1. Experimental Environment
3.2. Experimental Parameter Setting
3.3. Performance Optimization Links of Hybrid Strategy
3.3.1. Comparative Experiments on Feature Selection
3.3.2. Comparative Experiments on Sample Balancing
3.3.3. CTGAN Synthetic Data Quality Analysis
- Quantitative characteristics
- Qualitative characteristics
3.4. Comparative Analysis of Data Processing Strategies
3.5. Comparative Analysis of Different Classifiers
3.6. Comparative Analysis with Other Studies
4. Discussion
4.1. Technical Advantages of the Hybrid Strategy
4.2. Clinical and Public Health Implications
4.3. Limitations and Future Directions
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| CTGAN | Conditional tabular generative adversarial network |
| GBD | Global Burden of Disease |
| WHO | World Health Organization |
| SMOTE | Synthetic minority over-sampling technique |
| ENN | Edited nearest neighbors |
| mRMR | Minimal redundancy maximal relevance |
| RFE | Recursive feature elimination |
| XGBoost | Extreme gradient boosting |
| RFCC | Risk factors of cervical cancer |
| GA | Genetic algorithm |
| CSA | Crow search algorithm |
| RF | Random forest |
| SVM | Support vector machine |
| NB | Naive Bayes |
| LR | Logistic regression |
| KNN | K-nearest neighbor |
| DT | Decision tree |
| ANOVA | One-way analysis of variance |
| MI | Mutual information |
| ANN | Artificial neural network |
| GBM | Gradient-boosting machine |
| LASSO | Least absolute shrinkage and selection operator |
| STDs | Sexually transmitted diseases |
| IUD | Intrauterine device |
| Dx | Digital radiography |
| HPV | Human papilloma virus |
| CIN | Cervical intraepithelial neoplasia |
References
- Sun, P.; Yu, C.; Yin, L.; Chen, Y.; Sun, Z.; Zhang, T.; Shuai, P.; Zeng, K.; Yao, X.; Chen, J. Global, regional, and national burden of female cancers in women of child-bearing age, 1990–2021: Analysis of data from the Global Burden of Disease Study 2021. EClinicalMedicine 2024, 74, 102713.
- Bray, F.; Ferlay, J.; Soerjomataram, I.; Siegel, R.L.; Torre, L.A.; Jemal, A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2018, 68, 394–424.
- Marván, M.; López-Vázquez, E. The Anthropocene: Politik–Economics–Society–Science: Preventing Health and Environmental Risks in Latin America; Springer: Berlin/Heidelberg, Germany, 2017.
- Mezei, A.K.; Armstrong, H.L.; Pedersen, H.N.; Campos, N.G.; Mitchell, S.M.; Sekikubo, M.; Byamugisha, J.K.; Kim, J.J.; Bryan, S.; Ogilvie, G.S. Cost-effectiveness of cervical cancer screening methods in low- and middle-income countries: A systematic review. Int. J. Cancer 2017, 141, 437–446.
- World Health Organization. WHO Guideline for Screening and Treatment of Cervical Pre-Cancer Lesions for Cervical Cancer Prevention, Web Annex A; World Health Organization: Geneva, Switzerland, 2021.
- Newaz, A.; Muhtadi, S.; Haq, F.S. An intelligent decision support system for the accurate diagnosis of cervical cancer. Knowl.-Based Syst. 2022, 245, 108634.
- Kaushik, K.; Bhardwaj, A.; Bharany, S.; Alsharabi, N.; Rehman, A.U.; Eldin, E.T.; Ghamry, N.A. A machine learning-based framework for the prediction of cervical cancer risk in women. Sustainability 2022, 14, 11947.
- Aloss, A.; Sahu, B.; Deeb, H.; Mishra, D. A crow search algorithm-based machine learning model for heart disease and cervical cancer diagnosis. In Electronic Systems and Intelligent Computing: Proceedings of ESIC 2021; Springer: Berlin/Heidelberg, Germany, 2022; pp. 303–311.
- Tanimu, J.J.; Hamada, M.; Hassan, M.; Kakudi, H.; Abiodun, J.O. A machine learning method for classification of cervical cancer. Electronics 2022, 11, 463.
- Chadaga, K.; Prabhu, S.; Sampathila, N.; Chadaga, R.; KS, S.; Sengupta, S. Predicting cervical cancer biopsy results using demographic and epidemiological parameters: A custom stacked ensemble machine learning approach. Cogent Eng. 2022, 9, 2143040.
- Kumawat, G.; Vishwakarma, S.K.; Chakrabarti, P.; Chittora, P.; Chakrabarti, T.; Lin, J.C.-W. Prognosis of cervical cancer disease by applying machine learning techniques. J. Circuits Syst. Comput. 2023, 32, 2350019.
- Priya, S.; Karthikeyan, N.; Palanikkumar, D. Pre-Screening of Cervical Cancer Through Gradient Boosting Ensemble Learning Method. Intell. Autom. Soft Comput. 2023, 35, 2673–2685.
- Bhavani, C.; Govardhan, A. Cervical cancer prediction using stacked ensemble algorithm with SMOTE and RFERF. Mater. Today Proc. 2023, 80, 3451–3457.
- Shakil, R.; Islam, S.; Akter, B. A precise machine learning model: Detecting cervical cancer using feature selection and explainable AI. J. Pathol. Inform. 2024, 15, 100398.
- Ali, M.S.; Hossain, M.M.; Kona, M.A.; Nowrin, K.R.; Islam, M.K. An ensemble classification approach for cervical cancer prediction using behavioral risk factors. Health Anal. 2024, 5, 100324.
- Fernandes, K.; Cardoso, J.S.; Fernandes, J. Transfer learning with partial observability applied to cervical cancer screening. In Proceedings of the Pattern Recognition and Image Analysis: 8th Iberian Conference, IbPRIA 2017, Faro, Portugal, 20–23 June 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 243–250.
- Allison, P.D. Missing data. In The SAGE Handbook of Quantitative Methods in Psychology; Sage Publications Ltd.: Thousand Oaks, CA, USA, 2009; Volume 23, pp. 72–89.
- Zheng, A.; Casari, A. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists; O'Reilly Media, Inc.: Sebastopol, CA, USA, 2018.
- Tang, X.; Cai, L.; Meng, Y.; Gu, C.; Yang, J.; Yang, J. A novel hybrid feature selection and ensemble learning framework for unbalanced cancer data diagnosis with transcriptome and functional proteomic. IEEE Access 2021, 9, 51659–51668.
- Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
- Rakesh, D.K.; Jana, P.K. A general framework for class label specific mutual information feature selection method. IEEE Trans. Inf. Theory 2022, 68, 7996–8014.
- Jeon, H.; Oh, S. Hybrid-recursive feature elimination for efficient feature selection. Appl. Sci. 2020, 10, 3211.
- Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling tabular data using conditional GAN. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019.
- Christodoulou, E.; Ma, J.; Collins, G.S.; Steyerberg, E.W.; Verbakel, J.Y.; Van Calster, B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J. Clin. Epidemiol. 2019, 110, 12–22.
- Czepiel, S.A. Maximum Likelihood Estimation of Logistic Regression Models: Theory and Implementation. 2002. Available online: https://www.stat.cmu.edu/~brian/valerie/617-2022/617-2021/week07/resources/mlelr.pdf (accessed on 16 June 2013).
- Zhang, S.; Li, X.; Zong, M.; Zhu, X.; Cheng, D. Learning k for kNN classification. ACM Trans. Intell. Syst. Technol. 2017, 8, 1–19.
- Abu Alfeilat, H.A.; Hassanat, A.B.; Lasassmeh, O.; Tarawneh, A.S.; Alhasanat, M.B.; Eyal Salman, H.S.; Prasath, V.S. Effects of distance measure choice on k-nearest neighbor classifier performance: A review. Big Data 2019, 7, 221–248.
- Podgorelec, V.; Kokol, P.; Stiglic, B.; Rozman, I. Decision trees: An overview and their use in medicine. J. Med. Syst. 2002, 26, 445–463.
- Gite, P.; Chouhan, K.; Krishna, K.M.; Nayak, C.K.; Soni, M.; Shrivastava, A. ML-based intrusion detection scheme for various types of attacks in a WSN using C4.5 and CART classifiers. Mater. Today Proc. 2023, 80, 3769–3776.
- Javed Mehedi Shamrat, F.; Ranjan, R.; Hasib, K.M.; Yadav, A.; Siddique, A.H. Performance evaluation among ID3, C4.5, and CART decision tree algorithms. In Pervasive Computing and Social Networking: Proceedings of ICPCSN 2021; Springer: Singapore, 2022; pp. 127–142.
- Jakkula, V. Tutorial on Support Vector Machine (SVM); Washington State University: Pullman, WA, USA, 2006; Volume 37, p. 3.
- Xue, H.; Yang, Q.; Chen, S. SVM: Support vector machines. In The Top Ten Algorithms in Data Mining; Chapman and Hall/CRC: Boca Raton, FL, USA, 2009; pp. 51–74.
- Kavzoglu, T.; Colkesen, I. A kernel functions analysis for support vector machines for land cover classification. Int. J. Appl. Earth Obs. Geoinf. 2009, 11, 352–359.
- Chen, H.; Mei, K.; Zhou, Y.; Wang, N.; Cai, G. Auxiliary diagnosis of breast cancer based on machine learning and hybrid strategy. IEEE Access 2023, 11, 96374–96386.
- Khalsan, M.; Machado, L.R.; Al-Shamery, E.S.; Ajit, S.; Anthony, K.; Mu, M.; Agyeman, M.O. A survey of machine learning approaches applied to gene expression analysis for cancer prediction. IEEE Access 2022, 10, 27522–27534.
- Shekar, B.; Dagnew, G. Grid search-based hyperparameter tuning and classification of microarray cancer data. In Proceedings of the 2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP), Gangtok, India, 25–28 February 2019; pp. 1–8.
- Fushiki, T. Estimation of prediction error by using K-fold cross-validation. Stat. Comput. 2011, 21, 137–146.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- Zeng, M.; Zou, B.; Wei, F.; Liu, X.; Wang, L. Effective prediction of three common diseases by combining SMOTE with Tomek links technique for imbalanced medical data. In Proceedings of the 2016 IEEE International Conference of Online Analysis and Computing Science (ICOACS), Chongqing, China, 28–29 May 2016; pp. 225–228.
- Han, H.; Wang, W.-Y.; Mao, B.-H. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning. In Proceedings of the International Conference on Intelligent Computing, Hefei, China, 23–26 August 2005; pp. 878–887.
- Lopes, R.H.; Reid, I.; Hobson, P.R. The Two-Dimensional Kolmogorov–Smirnov Test. 2007. Available online: https://bura.brunel.ac.uk/handle/2438/1166 (accessed on 27 April 2007).
- Kampstra, P. Beanplot: A boxplot alternative for visual comparison of distributions. J. Stat. Softw. 2008, 28, 1–9.
- McHugh, M.L. The chi-square test of independence. Biochem. Med. 2013, 23, 143–149.
| Dataset | Positive Samples (n) | Negative Samples (n) | Behavioral Features (n) | Non-Invasive Exam Features (n) | Invasive Exam Features (n) | Target |
|---|---|---|---|---|---|---|
| RFCC | 55 | 803 | 32 | 3 | 1 | Predict the presence of cervical cancer |
| No. | Feature Name | Type | Biopsy Positive (n = 55) | Biopsy Negative (n = 803) | Missing (n, %) |
|---|---|---|---|---|---|
1 | Age | Int 1 | 28 (21.0–35) * | 25 (20–32) | 0 (0%) |
2 | Number of sexual partners | Int | 2 (2–3) | 2 (2–3) | 26 (3.0%) |
3 | First sexual intercourse | Int | 17 (15–18) | 17 (15–18) | 7 (0.8%) |
4 | Number of pregnancies | Int | 2 (2–3) | 2 (1–3) | 56 (6.5%) |
5 | Smoking | Bool 2 | 10 (18.2%)/45 (81.8%) * | 113 (14.1%)/690 (85.9%) | 13 (1.5%) |
6 | Smoking (years) | Int | 2.5 (1–9) | 7 (2–12) | 13 (1.5%) |
7 | Smoking (packs/year) | Int | 0.5 (0.2–2) | 1.35 (0.5–3) | 13 (1.5%) |
8 | Hormonal contraceptives | Bool | 36 (65.5%)/19 (34.5%) | 553 (68.9%)/250 (31.1%) | 108 (12.6%) |
9 | Hormonal contraceptives (years) | Int | 2 (0.5–9) | 2.25 (1–4) | 108 (12.6%) |
10 | Intrauterine device (IUD) | Bool | 9 (16.4%)/46 (83.6%) | 74 (9.2%)/729 (90.8%) | 117 (13.6%) |
11 | IUD (years) | Int | 3 (2–6) | 4 (1.5–7) | 117 (13.6%) |
12 | Sexually transmitted diseases (STDs) | Bool | 12 (21.8%)/43 (78.2%) | 67 (8.3%)/736 (91.7%) | 105 (12.2%) |
13 | STDs (number) | Int | 2 (1–2) | 2 (1–2) | 105 (12.2%) |
14 | STDs: condylomatosis | Bool | 7 (12.7%)/48 (87.3%) | 37 (4.6%)/766 (95.4%) | 105 (12.2%) |
15 | STDs: cervical | Bool | 0 (0%)/55 (100%) | 0 (0%)/803 (100%) | 105 (12.2%) |
16 | STDs: vaginal | Bool | 0 (0%)/55 (100%) | 4 (0.5%)/799 (99.5%) | 105 (12.2%) |
17 | STDs: vulvo-perineal condylomatosis | Bool | 7 (12.7%)/48 (87.3%) | 36 (4.5%)/767 (95.5%) | 105 (12.2%) |
18 | STDs: syphilis | Bool | 0 (0%)/55 (100%) | 18 (2.2%)/785 (97.8%) | 105 (12.2%) |
19 | STDs: pelvic inflammatory disease | Bool | 0 (0%)/55 (100%) | 1 (0.1%)/802 (99.9%) | 105 (12.2%) |
20 | STDs: genital herpes | Bool | 1 (1.8%)/54 (98.2%) | 0 (0%)/803 (100%) | 105 (12.2%) |
21 | STDs: molluscum contagiosum | Bool | 0 (0%)/55 (100%) | 1 (0.1%)/802 (99.9%) | 105 (12.2%) |
22 | STDs: AIDS | Bool | 0 (0%)/55 (100%) | 0 (0%)/803 (100%) | 105 (12.2%) |
23 | STDs: HIV | Bool | 5 (9.1%)/50 (90.9%) | 13 (1.6%)/790 (98.4%) | 105 (12.2%) |
24 | STDs: Hepatitis B | Bool | 0 (0%)/55 (100%) | 1 (0.1%)/802 (99.9%) | 105 (12.2%) |
25 | STDs: HPV | Bool | 0 (0%)/55 (100%) | 2 (0.2%)/801 (99.8%) | 105 (12.2%) |
26 | STDs: number of diagnoses | Int | 0 (0-0) | 0 (0-0) | 105 (12.2%) |
27 | STDs: time since first diagnosis | Int | -- | -- | 787 (91.7%) |
28 | STDs: time since last diagnosis | Int | -- | -- | 787 (91.7%) |
29 | Dx: cancer | Bool | 6 (10.9%)/49 (89.1%) | 12 (1.5%)/791 (98.5%) | 0 (0%) |
30 | Dx: cervical intraepithelial neoplasia (CIN) | Bool | 3 (5.5%)/52 (94.5%) | 6 (0.7%)/797 (99.3%) | 0 (0%) |
31 | Dx: human papilloma virus (HPV) | Bool | 6 (10.9%)/49 (89.1%) | 12 (1.5%)/791 (98.5%) | 0 (0%) |
32 | Digital Radiography (Dx) | Bool | 7 (12.7%)/48 (87.3%) | 17 (2.1%)/786 (97.9%) | 0 (0%) |
33 | Hinselmann | Bool | 25 (45.5%)/30 (54.5%) | 10 (1.2%)/793 (98.8%) | 0 (0%) |
34 | Schiller | Bool | 48 (87.3%)/7 (12.7%) | 26 (3.2%)/777 (96.8%) | 0 (0%) |
35 | Citology | Bool | 18 (32.7%)/37 (67.3%) | 26 (3.2%)/777 (96.8%) | 0 (0%) |
No. | Feature Name | Feature Importance |
---|---|---|
1 | Schiller | 72.483283 |
2 | Dx: CIN | 9.848515 |
3 | Dx: HPV | 2.212579 |
4 | First sexual intercourse | 2.010904 |
5 | Citology | 1.982685 |
6 | IUD (years) | 1.982242 |
7 | Hormonal Contraceptives (years) | 1.943814 |
8 | STDs: Number of diagnoses | 1.90573 |
9 | Smoking (packs/year) | 1.761776 |
10 | Hinselmann | 1.714833 |
11 | Dx | 1.477720 |
12 | Dx: Cancer | 0.675923 |
| Dataset | Original Positive | Original Negative | After SMOTE-ENN Positive | After SMOTE-ENN Negative | After SMOTE-ENN + CTGAN Positive | After SMOTE-ENN + CTGAN Negative |
|---|---|---|---|---|---|---|
| RFCC | 55 | 803 | 766 | 717 | 15,723 + 766 | 14,277 + 717 |
Label | Predicted Positive | Predicted Negative |
---|---|---|
Positive | TP | FN |
Negative | FP | TN |
| Model | Parameter | Meaning | Value |
|---|---|---|---|
| LR | penalty | penalty term | L2 |
| LR | solver | optimization algorithm | liblinear |
| LR | C | inverse of the regularization strength | 1.0 |
| KNN | n_neighbors | K value | 2 |
| KNN | weights | weighting of the nearest-neighbor samples | uniform |
| KNN | algorithm | neighbor-search algorithm | auto |
| DT | splitter | split selection strategy | random |
| DT | max_depth | maximum depth of the tree | 10 |
| SVM | C | penalty coefficient | 0.8 |
| SVM | kernel | type of kernel function | rbf |
| Method | Parameter | Meaning | Value |
|---|---|---|---|
| mRMR | -- | mutual information criterion | MID |
| mRMR | -- | number of output features | 24 |
| RFE | estimator | base classifier | XGBoost |
| RFE | score | evaluation metric | accuracy |
| RFE | min_features_to_select | minimum number of output features | 12 |
| SMOTE-ENN | sampling_strategy | sampling strategy | auto |
| SMOTE-ENN | random_state | random seed | 7 |
| CTGAN | epochs | number of training epochs | 50 |
| Model | Preprocess Method | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| LR | Group 1 (Baseline) | 95.37% | 63.87% | 65.09% | 62.19% |
| LR | Group 2 | 95.82% | 66.74% | 68.48% | 65.13% |
| LR | Group 3 | 95.42% | 63.86% | 68.24% | 63.68% |
| LR | Group 4 | 95.90% | 67.73% | 69.49% | 66.10% |
| KNN | Group 1 (Baseline) | 93.42% | 39.87% | 16.25% | 21.40% |
| KNN | Group 2 | 93.75% | 48.52% | 24.58% | 29.98% |
| KNN | Group 3 | 93.82% | 53.90% | 24.63% | 31.46% |
| KNN | Group 4 | 94.05% | 53.55% | 35.00% | 39.60% |
| DT | Group 1 (Baseline) | 94.29% | 55.11% | 53.05% | 51.74% |
| DT | Group 2 | 94.44% | 55.35% | 56.75% | 53.76% |
| DT | Group 3 | 94.59% | 57.22% | 56.98% | 54.53% |
| DT | Group 4 | 94.57% | 56.69% | 58.31% | 54.94% |
| SVM | Group 1 (Baseline) | 93.71% | 50.88% | 31.69% | 35.64% |
| SVM | Group 2 | 94.24% | 58.22% | 40.79% | 43.74% |
| SVM | Group 3 | 94.16% | 56.96% | 36.29% | 41.34% |
| SVM | Group 4 | 94.76% | 62.97% | 42.15% | 47.35% |
| Model | Method | Positive/Negative | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|---|
| LR | SMOTE | 803/803 | 96.28% | 95.96% | 96.68% | 96.30% |
| LR | SMOTE-Tomek | 800/800 | 96.34% | 96.08% | 96.63% | 96.34% |
| LR | Borderline-SMOTE | 803/803 | 96.41% | 96.19% | 96.68% | 96.41% |
| LR | SMOTE-ENN | 766/717 | 98.52% | 98.70% | 98.43% | 98.56% |
| KNN | SMOTE | 803/803 | 96.31% | 95.80% | 96.92% | 96.34% |
| KNN | SMOTE-Tomek | 800/800 | 96.06% | 95.71% | 96.48% | 96.07% |
| KNN | Borderline-SMOTE | 803/803 | 96.61% | 96.04% | 97.29% | 96.64% |
| KNN | SMOTE-ENN | 766/717 | 97.93% | 99.01% | 96.95% | 97.95% |
| DT | SMOTE | 803/803 | 97.27% | 96.68% | 97.91% | 97.27% |
| DT | SMOTE-Tomek | 800/800 | 97.40% | 96.76% | 98.12% | 97.42% |
| DT | Borderline-SMOTE | 803/803 | 97.15% | 96.37% | 98.02% | 97.16% |
| DT | SMOTE-ENN | 766/717 | 98.27% | 97.97% | 98.69% | 98.32% |
| SVM | SMOTE | 803/803 | 96.13% | 95.41% | 96.99% | 96.17% |
| SVM | SMOTE-Tomek | 800/800 | 95.73% | 95.26% | 96.28% | 95.74% |
| SVM | Borderline-SMOTE | 803/803 | 96.29% | 95.79% | 96.88% | 96.31% |
| SVM | SMOTE-ENN | 766/717 | 97.36% | 97.51% | 97.38% | 97.43% |
| No. | Feature | KS Statistic | p-Value |
|---|---|---|---|
| 1 | STDs: number of diagnoses | 0.0000 | 1.00 |
| 2 | Smoking (packs/year) | 0.0198 | 0.71 |
| 3 | IUD (years) | 0.0303 | 0.45 |
| 4 | First sexual intercourse | 0.1399 | 0.85 |
| 5 | Hormonal contraceptives (years) | 0.1469 | 8.7 × 10⁻⁹ |
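The KS statistics above compare the real and synthetic marginal distributions of each continuous feature. A minimal two-sample Kolmogorov–Smirnov statistic (an illustrative sketch, not the implementation used in the study) is the largest gap between the two empirical CDFs:

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample KS statistic: the maximum vertical distance between
    the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    def ecdf(s, x):  # fraction of sorted sample s that is <= x
        return bisect_right(s, x) / len(s)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in sorted(set(a) | set(b)))
```

Identical samples yield 0, fully separated samples yield 1, so small values (as in the first rows of the table) indicate that the synthetic marginal closely tracks the real one.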
| No. | Feature | Chi-Square Statistic | p-Value |
|---|---|---|---|
| 1 | Schiller | 0.1135 | 0.74 |
| 2 | Hinselmann | 0.0000 | 1.00 |
| 3 | Citology | 0.0533 | 0.82 |
| 4 | Dx | 3.4176 | 0.06 |
| 5 | Dx: HPV | 0.0000 | 1.00 |
| 6 | Dx: CIN | 0.2882 | 0.59 |
| 7 | Dx: Cancer | 0.9092 | 0.34 |
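For categorical features, agreement between real and synthetic data is typically checked with the chi-square test of independence; a minimal sketch of the Pearson statistic for a contingency table (real vs. synthetic counts per category) follows:

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c contingency table,
    summing (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return sum((obs - row_totals[i] * col_totals[j] / total) ** 2
               / (row_totals[i] * col_totals[j] / total)
               for i, row in enumerate(table)
               for j, obs in enumerate(row))
```

A statistic near 0 (large p-value) means the category frequencies in the two samples are statistically indistinguishable, which is the pattern most rows of the table show.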
| Model | Strategy | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| LR | Control Group | 95.37% | 63.87% | 65.09% | 62.19% |
| LR | Strategy I | 95.90% | 67.73% | 69.49% | 66.10% |
| LR | Strategy II | 98.52% | 98.70% | 98.43% | 98.56% |
| LR | Hybrid Strategy | 99.00% | 99.28% | 98.77% | 99.02% |
| KNN | Control Group | 93.42% | 39.87% | 16.25% | 21.40% |
| KNN | Strategy I | 94.05% | 53.55% | 35.00% | 39.60% |
| KNN | Strategy II | 97.93% | 99.01% | 96.95% | 97.95% |
| KNN | Hybrid Strategy | 98.16% | 99.31% | 97.14% | 98.20% |
| DT | Control Group | 94.29% | 55.11% | 53.05% | 51.74% |
| DT | Strategy I | 94.57% | 56.69% | 58.31% | 54.94% |
| DT | Strategy II | 98.27% | 97.97% | 98.69% | 98.32% |
| DT | Hybrid Strategy | 98.40% | 98.16% | 98.73% | 98.44% |
| SVM | Control Group | 93.71% | 50.88% | 31.69% | 35.64% |
| SVM | Strategy I | 94.76% | 62.97% | 42.15% | 47.35% |
| SVM | Strategy II | 97.36% | 97.51% | 97.38% | 97.43% |
| SVM | Hybrid Strategy | 97.65% | 96.92% | 98.60% | 97.73% |
| Group | Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| Control Group | LR | 95.37% | 63.87% | 65.09% | 62.19% |
| Control Group | KNN | 93.42% | 39.87% | 16.25% | 21.40% |
| Control Group | DT | 94.29% | 55.11% | 53.05% | 51.74% |
| Control Group | SVM | 93.71% | 50.88% | 31.69% | 35.64% |
| Experimental Group | LR | 99.00% | 99.28% | 98.77% | 99.02% |
| Experimental Group | KNN | 98.16% | 99.31% | 97.14% | 98.20% |
| Experimental Group | DT | 98.40% | 98.16% | 98.73% | 98.44% |
| Experimental Group | SVM | 97.65% | 96.92% | 98.60% | 97.73% |
Literature | Methods of Data Processing | Methods of Classification | Year | Accuracy
---|---|---|---|---|
[6] | HS + GA | RF | 2022 | 94.47% |
[7] | -- | XGBoost | 2022 | 96.50% |
[8] | CSA | RF + SVM + NB + LR + KNN | 2022 | 97.70% |
[9] | RFE + SMOTETomek | DT | 2022 | 98.72% |
[10] | ANOVA + Pearson + Borderline-SMOTE | LR + DT + KNN + SVM + NB | 2022 | 98.00% |
[11] | -- | XGBoost | 2023 | 94.94% |
[12] | -- | MLP with GBM | 2023 | 96.33% |
[13] | SMOTE + RFE | SVM + RF + LR + BC + KNN | 2023 | 97.10% |
[14] | Chi-square + LASSO | DT | 2024 | 97.60% |
[15] | SMOTE | NB + RF + GBM + AdaBoost + LR + DT + SVM | 2024 | 98.06% |
This study | mRMR + RFE + SMOTE-ENN + CTGAN | LR | 2025 | 99.00% |
Share and Cite
Tang, M.; Chen, H.; Lv, Z.; Cai, G. Diagnosis of Cervical Cancer Based on a Hybrid Strategy with CTGAN. Electronics 2025, 14, 1140. https://doi.org/10.3390/electronics14061140