Prediction of Early Diagnosis in Ovarian Cancer Patients Using Machine Learning Approaches with Boruta and Advanced Feature Selection
Abstract
1. Introduction
2. Materials and Methods
2.1. Data Preprocessing
2.2. Feature Selection and Dimensionality Reduction
2.3. Data Splitting
2.4. Classification Algorithms
2.5. Hyperparameter Tuning
2.6. Evaluation Metrics
- AUC (Area Under the Curve): This metric assesses a model's ability to discriminate between classes and is particularly informative for imbalanced datasets [35].
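For concreteness, the evaluation step can be sketched with scikit-learn. This is a minimal illustration only: the synthetic data, the RandomForest settings, and the weighted averaging are assumptions, not the authors' exact configuration; it simply computes the same five metrics reported in the results table.

```python
# Minimal sketch of the metric computation, assuming scikit-learn conventions.
# Synthetic data, RandomForest settings, and weighted averaging are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

# Stand-in for the clinical feature matrix (49 predictors, binary outcome TYPE).
X, y = make_classification(n_samples=300, n_features=49, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = model.predict(X_test)              # hard class labels
y_prob = model.predict_proba(X_test)[:, 1]  # positive-class probabilities for AUC

print({
    "Accuracy":  accuracy_score(y_test, y_pred),
    "F1":        f1_score(y_test, y_pred, average="weighted"),
    "Precision": precision_score(y_test, y_pred, average="weighted"),
    "Recall":    recall_score(y_test, y_pred, average="weighted"),
    "AUC":       roc_auc_score(y_test, y_prob),  # threshold-free discrimination
})
```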
3. Results
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
PCA | Principal Component Analysis |
RFE | Recursive Feature Elimination |
MI | Mutual Information |
SVM | Support Vector Machine |
ANN | Artificial Neural Network |
AUC | Area Under the Curve |
TP | True Positive |
TN | True Negative |
FP | False Positive |
FN | False Negative |
References
- Globocan. Global Cancer Observatory: Cancer Today; International Agency for Research on Cancer: Lyon, France, 2020; Available online: https://gco.iarc.fr/today (accessed on 31 March 2025).
- Siegel, R.L.; Miller, K.D.; Jemal, A. Cancer statistics, 2020. CA Cancer J. Clin. 2020, 70, 7–30. [Google Scholar] [CrossRef] [PubMed]
- Torre, L.A.; Bray, F.; Siegel, R.L.; Ferlay, J.; Lortet-Tieulent, J.; Jemal, A. Global cancer statistics, 2012. CA Cancer J. Clin. 2015, 65, 87–108. [Google Scholar] [PubMed]
- Surveillance, Epidemiology, and End Results Cancer Stat Facts: Ovarian cancer. Available online: http://seer.cancer.gov/statfacts/html/ovary.html (accessed on 25 July 2017).
- American Cancer Society. Ovarian Cancer Survival Rates. Available online: https://www.cancer.org/cancer/ovarian-cancer/detection-diagnosis-staging/survival-rates.html (accessed on 31 March 2025).
- Fischerova, D.; Burgetova, A. Imaging techniques for the evaluation of ovarian cancer. Best. Pract. Res. Clin. Obstet. Gynaecol. 2014, 28, 697–720. [Google Scholar] [CrossRef] [PubMed]
- Wernick, M.N.; Yang, Y.; Brankov, J.G.; Yourganov, G.; Strother, S.C. Machine Learning in Medical Imaging. IEEE Signal Process. Mag. 2010, 27, 25–38. [Google Scholar] [CrossRef] [PubMed]
- Rajkomar, A.; Dean, J.; Kohane, I. Machine Learning in Medicine. N. Engl. J. Med. 2019, 380, 1347–1358. [Google Scholar] [CrossRef]
- Ayyoubzadeh, S.M.; Ahmadi, M.; Yazdipour, A.B.; Ghorbani-Bidkorpeh, F.; Ahmadi, M. Prediction of ovarian cancer using artificial intelligence tools. Health Sci. Rep. 2024, 7, e2203. [Google Scholar] [CrossRef]
- Lu, M.; Fan, Z.; Xu, B.; Chen, L.; Zheng, X.; Li, J.; Znati, T.; Mi, Q.; Jiang, J. Using machine learning to predict ovarian cancer. Int. J. Med. Inform. 2020, 141, 104195. [Google Scholar] [CrossRef]
- Little, R.; Rubin, D. Statistical Analysis with Missing Data, 3rd ed.; Wiley: Hoboken, NJ, USA, 2019. [Google Scholar] [CrossRef]
- Schafer, J.L. Analysis of Incomplete Multivariate Data, 1st ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 1997. [Google Scholar] [CrossRef]
- Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182. [Google Scholar]
- Liu, H.; Motoda, H. Computational Methods of Feature Selection, 1st ed.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2007. [Google Scholar] [CrossRef]
- Jolliffe, I.T. Principal Component Analysis, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2002. [Google Scholar] [CrossRef]
- Kursa, M.B.; Rudnicki, W.R. Feature selection with the Boruta package. J. Stat. Softw. 2010, 36, 1–13. [Google Scholar]
- Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V. Gene Selection for Cancer Classification using Support Vector Machines. Mach. Learn. 2002, 46, 389–422. [Google Scholar] [CrossRef]
- Beraha, M.; Metelli, A.; Papini, M.; Tirinzoni, A.; Restelli, M. Feature Selection via Mutual Information: New Theoretical Insights. In Proceedings of the 2019 International Joint Conference on Neural Networks (IJCNN), Budapest, Hungary, 14–19 July 2019; pp. 1–9. [Google Scholar]
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
- Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16); Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794. [Google Scholar] [CrossRef]
- Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased Boosting with Categorical Features. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montréal, QC, Canada, 3–8 December 2018; pp. 6639–6649. [Google Scholar]
- Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106. [Google Scholar] [CrossRef]
- Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27. [Google Scholar] [CrossRef]
- Zhang, H. The optimality of Naive Bayes. In Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference (FLAIRS), Miami Beach, FL, USA, 17–19 May 2004; pp. 562–567. [Google Scholar]
- Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar]
- Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar]
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
- Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13, 281–305. [Google Scholar]
- Hutter, F.; Babic, D.; Hoos, H.H.; Hu, A.J. Boosting verification by automatic tuning of decision procedures. In Proceedings of the Formal Methods in Computer Aided Design (FMCAD ’07); IEEE Computer Society: Washington, DC, USA, 2007; pp. 27–34. [Google Scholar]
- Hutter, F.; Hoos, H.H.; Leyton-Brown, K. Sequential Model-Based Optimization for General Algorithm Configuration. In Learning and Intelligent Optimization; Coello, C.A.C., Ed.; LION 2011. Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6683. [Google Scholar] [CrossRef]
- Bartz-Beielstein, T.; Markon, S. Tuning search algorithms for real-world applications: A regression tree based approach. In Proceedings of the 2004 Congress on Evolutionary Computation, Portland, OR, USA, 19–23 June 2004; pp. 1111–1118. [Google Scholar]
- Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Inf. Process. Manag. 2009, 45, 427–437. [Google Scholar] [CrossRef]
- Rainio, O.; Teuho, J.; Klén, R. Evaluation metrics and statistical tests for machine learning. Sci. Rep. 2024, 14, 6086. [Google Scholar] [CrossRef]
- Ferri, C.; Hernández-Orallo, J.; Modroiu, R. An experimental comparison of performance measures for classification. Pattern Recogn. Lett. 2009, 30, 27–38. [Google Scholar] [CrossRef]
- Fawcett, T. An introduction to ROC analysis. Pattern Recogn. Lett. 2006, 27, 861–874. [Google Scholar]
- Yang, L.R.; Yang, M.; Chen, L.L.; Shen, Y.L.; He, Y.; Meng, Z.T.; Wang, W.Q.; Li, F.; Liu, Z.J.; Li, L.H.; et al. Machine learning for epithelial ovarian cancer platinum resistance recurrence identification using routine clinical data. Front. Oncol. 2024, 14, 1457294. [Google Scholar] [CrossRef] [PubMed]
- Wu, M.; Gu, S.; Yang, J.; Zhao, Y.; Sheng, J.; Cheng, S.; Xu, S.; Wu, Y.; Ma, M.; Luo, X.; et al. Comprehensive machine learning-based preoperative blood features predict the prognosis for ovarian cancer. BMC Cancer 2024, 24, 267. [Google Scholar] [CrossRef]
- Sheela Lavanya, J.M.; Subbulakshmi, P. Innovative approach towards early prediction of ovarian cancer: Machine learning-enabled XAI techniques. Heliyon 2024, 10, e29197. [Google Scholar] [CrossRef]
- Amniouel, S.; Yalamanchili, K.; Sankararaman, S.; Jafri, M.S. Evaluating Ovarian Cancer Chemotherapy Response Using Gene Expression Data and Machine Learning. BioMedInformatics 2024, 4, 1396–1424. [Google Scholar] [CrossRef]
- Paik, E.S.; Lee, J.W.; Park, J.Y.; Kim, J.H.; Kim, M.; Kim, T.J.; Choi, C.H.; Kim, B.G.; Bae, D.S.; Seo, S.W. Prediction of survival outcomes in patients with epithelial ovarian cancer using machine learning methods. J. Gynecol. Oncol. 2019, 30, e65. [Google Scholar] [CrossRef]
- Gui, T.; Cao, D.; Yang, J.; Wei, Z.; Xie, J.; Wang, W.; Xiang, Y.; Peng, P. Early prediction and risk stratification of ovarian cancer based on clinical data using machine learning approaches. J. Gynecol. Oncol. 2024, 36, e53. [Google Scholar] [CrossRef]
- Chen, Z.; Ouyang, H.; Sun, B.; Ding, J.; Zhang, Y.; Li, X. Utilizing explainable machine learning for progression-free survival prediction in high-grade serous ovarian cancer: Insights from a prospective cohort study. Int. J. Surg. 2025. [Google Scholar] [CrossRef]
- Piedimonte, S.; Mohamed, M.; Rosa, G.; Gerstl, B.; Vicus, D. Predicting Response to Treatment and Survival in Advanced Ovarian Cancer Using Machine Learning and Radiomics: A Systematic Review. Cancers 2025, 17, 336. [Google Scholar] [CrossRef]
Variable | Description |
---|---|
AFP | Alpha-fetoprotein; a tumor marker primarily used to assess liver function and detect certain cancers. |
AG | Albumin/Globulin ratio; a diagnostic indicator of liver function and protein balance. |
Age | The age of the individual, typically used as a demographic variable. |
ALB | Albumin; a protein produced by the liver, used to assess liver function and nutritional status. |
ALP | Alkaline phosphatase; an enzyme related to liver, bone, and bile duct function. |
ALT | Alanine aminotransferase; an enzyme that helps assess liver damage. |
AST | Aspartate aminotransferase; an enzyme that is indicative of liver and heart function. |
BASO# | Absolute basophil count; basophils are a type of white blood cell involved in immune responses, including allergies. |
BASO% | Percentage of basophils among total white blood cells. |
BUN | Blood urea nitrogen; a marker used to evaluate kidney function. |
Ca | Calcium; a mineral essential for bone health, muscle function, and nerve signaling. |
CA125 | Cancer antigen 125; a biomarker used to assess ovarian cancer. |
CA19-9 | Cancer antigen 19-9; a marker used to assess pancreatic cancer. |
CA72-4 | Cancer antigen 72-4; a tumor marker mainly used for gastric cancer. |
CEA | Carcinoembryonic antigen; a protein often elevated in various cancers, particularly colorectal cancer. |
CL | Chloride; an electrolyte that helps maintain fluid balance and acid–base status. |
CO2CP | Carbon dioxide content; measures the blood’s bicarbonate concentration, important for assessing acid–base balance. |
CREA | Creatinine; a waste product of muscle metabolism, commonly used to assess kidney function. |
DBIL | Direct bilirubin; a form of bilirubin that is conjugated in the liver and used to assess liver function and jaundice. |
EO# | Absolute eosinophil count; eosinophils are white blood cells involved in allergic responses and parasitic infections. |
EO% | Percentage of eosinophils among total white blood cells. |
GGT | Gamma-glutamyl transferase; an enzyme used to evaluate liver and biliary system disorders. |
GLO | Globulin; a class of proteins that includes immunoglobulins, which play a role in immune function. |
GLU | Glucose; a key source of energy for cells, its levels are used to assess metabolic function and diabetes. |
HCT | Hematocrit; the proportion of blood that is composed of red blood cells, used to assess anemia or dehydration. |
HE4 | Human epididymis protein 4; a biomarker for ovarian cancer detection. |
HGB | Hemoglobin; a protein in red blood cells that carries oxygen from the lungs to the tissues. |
IBIL | Indirect bilirubin; the unconjugated form of bilirubin, elevated in liver dysfunction and hemolysis. |
K | Potassium; an essential electrolyte that regulates cell function, heart rhythm, and muscle contractions. |
LYM# | Absolute lymphocyte count; lymphocytes are a subset of white blood cells that are critical for immune function. |
LYM% | Percentage of lymphocytes among total white blood cells. |
MCH | Mean corpuscular hemoglobin; a measure of the average amount of hemoglobin per red blood cell. |
MCV | Mean corpuscular volume; the average volume of a red blood cell, used to classify anemia. |
Mg | Magnesium; a mineral important for muscle and nerve function and enzymatic processes. |
MONO# | Absolute monocyte count; monocytes are white blood cells involved in immune response and inflammation. |
MONO% | Percentage of monocytes among total white blood cells. |
MPV | Mean platelet volume; a measure of the size of platelets in the blood, used to assess platelet production and function. |
Na | Sodium; an electrolyte that helps regulate fluid balance, blood pressure, and nerve function. |
NEU | Neutrophils; the most abundant type of white blood cell, important for fighting bacterial infections. |
PCT | Procalcitonin; a biomarker used to detect bacterial infections and assess sepsis. |
PDW | Platelet distribution width; a measure of the variability in platelet size, useful for assessing platelet function. |
PHOS | Phosphate; a mineral important for bone health and cellular energy production. |
PLT | Platelets; cells involved in blood clotting and wound healing. |
RBC | Red blood cells; cells responsible for oxygen transport in the body. |
RDW | Red cell distribution width; a measure of the variability in red blood cell size, useful for diagnosing anemia. |
TBIL | Total bilirubin; a combination of direct and indirect bilirubin, used to assess liver function and jaundice. |
TP | Total protein; the sum of albumin and globulin in the blood, reflecting overall nutritional and liver status. |
UA | Uric acid; a waste product of purine metabolism; elevated levels can indicate kidney dysfunction or gout. |
Menopause | A binary categorical variable indicating whether the individual is postmenopausal. |
TYPE | A binary categorical variable indicating the diagnostic class; the target variable predicted by the models. |
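The study's title and abbreviations indicate Boruta alongside PCA, RFE, and mutual information for feature selection. As a rough illustration of how Boruta could be applied to a blood-test feature table like the one above, the sketch below uses the BorutaPy package with a random-forest base learner. The synthetic data, the chosen columns, and all parameter values are placeholders; the authors' actual implementation (for example, the R Boruta package) may differ.

```python
# Illustrative Boruta-style feature selection over a blood-test feature table.
# Synthetic data and all parameter values are placeholders, not the study's setup.
import numpy as np
import pandas as pd
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 300
# Synthetic stand-ins for a few of the predictors listed above.
X = pd.DataFrame({
    "Age":   rng.normal(55, 12, n),
    "CA125": rng.lognormal(3.0, 1.0, n),
    "HE4":   rng.lognormal(4.0, 0.8, n),
    "ALB":   rng.normal(42, 5, n),
    "NEU":   rng.normal(4.5, 1.5, n),
})
# Synthetic binary outcome (TYPE), loosely driven by the tumor markers.
logits = 0.02 * (X["CA125"] - 35) + 0.01 * (X["HE4"] - 70)
y = (logits + rng.normal(0, 1, n) > 0).astype(int)

rf = RandomForestClassifier(n_jobs=-1, class_weight="balanced", max_depth=5,
                            random_state=42)
selector = BorutaPy(rf, n_estimators="auto", max_iter=50, random_state=42)
selector.fit(X.values, y.values)  # BorutaPy expects NumPy arrays

print("Confirmed features:", list(X.columns[selector.support_]))
print("Tentative features:", list(X.columns[selector.support_weak_]))
```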
Model | Method | Accuracy | F1 | Precision | Recall | AUC |
---|---|---|---|---|---|---|
Random Forest | Boruta | 0.8667 | 0.8665 | 0.8689 | 0.8667 | 0.9426 |
Random Forest | PCA | 0.7238 | 0.7228 | 0.7282 | 0.7238 | 0.8146 |
Random Forest | RFE | 0.8952 | 0.8952 | 0.8954 | 0.8952 | 0.9505 |
Random Forest | Mutual Information | 0.8571 | 0.8569 | 0.8605 | 0.8571 | 0.9307 |
XGBoost | Boruta | 0.8381 | 0.8375 | 0.8444 | 0.8381 | 0.9216 |
XGBoost | PCA | 0.7429 | 0.7424 | 0.7453 | 0.7429 | 0.8320 |
XGBoost | RFE | 0.8381 | 0.8378 | 0.8413 | 0.8381 | 0.9508 |
XGBoost | Mutual Information | 0.8381 | 0.8378 | 0.8413 | 0.8381 | 0.9175 |
CatBoost | Boruta | 0.8952 | 0.8945 | 0.9073 | 0.8952 | 0.9502 |
CatBoost | PCA | 0.7524 | 0.7511 | 0.7587 | 0.7524 | 0.8396 |
CatBoost | RFE | 0.8857 | 0.8856 | 0.8881 | 0.8857 | 0.9430 |
CatBoost | Mutual Information | 0.8857 | 0.8854 | 0.8909 | 0.8857 | 0.9296 |
Decision Tree | Boruta | 0.7905 | 0.7904 | 0.7909 | 0.7905 | 0.7896 |
Decision Tree | PCA | 0.6190 | 0.6186 | 0.6200 | 0.6190 | 0.6540 |
Decision Tree | RFE | 0.8285 | 0.8284 | 0.8289 | 0.8285 | 0.8494 |
Decision Tree | Mutual Information | 0.7904 | 0.7902 | 0.7923 | 0.7904 | 0.7908 |
K-Nearest Neighbors | Boruta | 0.8190 | 0.8190 | 0.8191 | 0.8190 | 0.8534 |
K-Nearest Neighbors | PCA | 0.7428 | 0.7378 | 0.7650 | 0.7428 | 0.7821 |
K-Nearest Neighbors | RFE | 0.8285 | 0.8276 | 0.8366 | 0.8285 | 0.8798 |
K-Nearest Neighbors | Mutual Information | 0.8285 | 0.8283 | 0.8306 | 0.8285 | 0.8844 |
Naive Bayes | Boruta | 0.8000 | 0.7986 | 0.8093 | 0.8000 | 0.9023 |
Naive Bayes | PCA | 0.7142 | 0.7079 | 0.7369 | 0.7142 | 0.7634 |
Naive Bayes | RFE | 0.8190 | 0.8164 | 0.8402 | 0.8190 | 0.9296 |
Naive Bayes | Mutual Information | 0.8381 | 0.8365 | 0.8539 | 0.8381 | 0.9292 |
Gradient Boosting | Boruta | 0.8476 | 0.8472 | 0.8523 | 0.8476 | 0.9183 |
Gradient Boosting | PCA | 0.7333 | 0.7319 | 0.7392 | 0.7333 | 0.8004 |
Gradient Boosting | RFE | 0.8571 | 0.8565 | 0.8637 | 0.8571 | 0.9346 |
Gradient Boosting | Mutual Information | 0.8476 | 0.8472 | 0.8523 | 0.8476 | 0.9174 |
SVM | Boruta | 0.8476 | 0.8474 | 0.8497 | 0.8476 | 0.8762 |
SVM | PCA | 0.8000 | 0.7998 | 0.8010 | 0.8000 | 0.8534 |
SVM | RFE | 0.8761 | 0.8759 | 0.8797 | 0.8761 | 0.9154 |
SVM | Mutual Information | 0.8761 | 0.8753 | 0.8877 | 0.8761 | 0.8925 |
ANN | Boruta | 0.8476 | 0.8472 | 0.8524 | 0.8476 | 0.8828 |
ANN | PCA | 0.7905 | 0.7845 | 0.8297 | 0.7905 | 0.8842 |
ANN | RFE | 0.7905 | 0.7894 | 0.7977 | 0.7905 | 0.8157 |
ANN | Mutual Information | 0.8095 | 0.8093 | 0.8115 | 0.8095 | 0.8330 |
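To show how a comparison grid of this shape could be assembled, the following schematic pairs each feature-selection method with each classifier in a scikit-learn pipeline and scores it on a held-out split. Everything here is an assumption for illustration only: the data are synthetic, only a subset of the classifiers is shown, the RFE base estimator (logistic regression) and the component/feature counts are arbitrary, Boruta is omitted because it relies on the external BorutaPy package sketched earlier, and XGBoost and CatBoost would plug in through their scikit-learn-compatible classifiers. This is not the authors' pipeline.

```python
# Schematic reconstruction of the comparison grid: each feature-selection method
# is paired with each classifier and evaluated on a held-out split. All data,
# estimator choices, and hyperparameters below are illustrative placeholders.
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=350, n_features=49, n_informative=10,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y,
                                          random_state=42)

selectors = {
    "PCA": PCA(n_components=10),
    "RFE": RFE(LogisticRegression(max_iter=1000), n_features_to_select=10),
    "Mutual Information": SelectKBest(mutual_info_classif, k=10),
}
classifiers = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "SVM": SVC(probability=True, random_state=42),
}

for sel_name, selector in selectors.items():
    for clf_name, clf in classifiers.items():
        pipe = Pipeline([("scale", StandardScaler()),
                         ("select", clone(selector)),
                         ("clf", clone(clf))]).fit(X_tr, y_tr)
        acc = accuracy_score(y_te, pipe.predict(X_te))
        auc = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])
        print(f"{clf_name:20s} {sel_name:18s} acc={acc:.4f} auc={auc:.4f}")
```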
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Öznacar, T.; Güler, T. Prediction of Early Diagnosis in Ovarian Cancer Patients Using Machine Learning Approaches with Boruta and Advanced Feature Selection. Life 2025, 15, 594. https://doi.org/10.3390/life15040594