Evaluating the Feasibility of Machine-Learning-Based Predictive Models for Precancerous Cervical Lesions in Patients Referred for Colposcopy

Background: Colposcopy plays an essential role in cervical cancer control, but its performance remains unsatisfactory. This study evaluates the feasibility of machine learning (ML) models for predicting high-grade squamous intraepithelial lesions or worse (HSIL+) in patients referred for colposcopy by combining colposcopic findings with demographic and screening results. Methods: In total, 7485 patients who underwent colposcopy examination in seven hospitals in mainland China were used to train, internally validate, and externally validate six commonly used ML models, including logistic regression, decision tree, naïve bayes, support vector machine, random forest, and extreme gradient boosting. Nine variables, including age, gravidity, parity, menopause status, cytological results, high-risk human papillomavirus (HR-HPV) infection type, HR-HPV multi-infection, transformation zone (TZ) type, and colposcopic impression, were used for model construction. Results: Colposcopic impression, HR-HPV results, and cytology results were the top three variables that determined model performance among all included variables. In the internal validation set, six ML models that integrated demographics, screening results, and colposcopic impression showed significant improvements in the area under the curve (AUC) (0.067 to 0.099) and sensitivity (11.55% to 14.88%) compared with colposcopists. Greater increases in AUC (0.087 to 0.119) and sensitivity (17.17% to 22.08%) were observed in the six models with the external validation set. Conclusions: By incorporating demographics, screening results, and colposcopic impressions, ML improved the AUC and sensitivity for detecting HSIL+ in patients referred for colposcopy. Such models could transform the subjective experience into objective judgments to help clinicians make decisions at the time of colposcopy examinations.


Introduction
Although cervical cancer is a preventable disease, it remains one of the most common cancers and causes of cancer-related death among women worldwide [1]. Cervical cancer also reflects global inequities, as its incidence and mortality rates in low- and middle-income countries (LMICs) are nearly twice and three times those in high-income countries, respectively. The World Health Organization (WHO) set a target that by 2030, 70% of women will be screened with a high-performance test by 35 years of age and again by 45 years of age [2]. Colposcopy-guided biopsy is crucial for detecting cervical precancers and determining treatment or further observation, but it is also the main bottleneck limiting screening performance. The diagnostic ability of many colposcopists is not favorable, as up to 40% of high-grade squamous intraepithelial lesions or worse (HSIL+) cases are missed in LMICs [3]. The consistency rate between colposcopic impressions and histopathological results varies widely, ranging from 52% to 66% [4][5][6]. As the screening modality gradually changes from cytology to primary HPV screening, cervical abnormalities are likely to be mild and difficult to identify, which aggravates the difficulty of the colposcopy diagnosis.
The American Society for Colposcopy and Cervical Pathology (ASCCP) has recommended modifying colposcopy practice on the basis of cytology, human papillomavirus (HPV) genotyping, and colposcopic impression [7,8]. However, the guidelines do not cover all situations. For instance, there are no uniform standards to determine follow-up or biopsy or direct treatment when facing non-16/18 high-risk HPV infections with mild cytological abnormalities. Colposcopists need to integrate more clinical information, such as age and reproductive history, to make such decisions, which inevitably depend on subjective experience. As women with a wide range of underlying precancer risks are referred for colposcopy every day [9], experience alone is not a long-term solution, especially for junior colposcopists. Thus, it is particularly important to establish an objective model that can predict the risk of HSIL+.
Recently, the application of machine learning (ML) methods to healthcare has been rapidly growing because of the increasing availability of large databases and computing power [10][11][12][13]. ML algorithms are often used to process multi-dimensional data, bringing new hope to solving clinical dilemmas. Previous ML studies that used screening results and demographics to predict cervical precancers performed well in screening scenarios [14][15][16][17]. However, there has been a lack of research into integrating colposcopic impressions into the prediction of HSIL+ among patients referred for colposcopy. Furthermore, relevant predictive models have generally been developed on the basis of single-center data without external validation, which increases the risk of model overfitting.
Therefore, this study integrates demographics, screening results, and colposcopic impressions to evaluate the feasibility of different ML models for predicting HSIL+ in patients referred for colposcopy. We aim to transform the subjective experience into objective predictive models for clinical use.

Study Design and Population
This is a multicenter, retrospective diagnostic study. The clinical records of patients who underwent colposcopic examination at seven hospitals in mainland China between January 2019 and October 2021 were collected. Women referred for colposcopy owing to abnormal screening results, clinical symptoms, or concerns about their own health were included, and those aged 24 to 65 with histopathological results were selected for further analysis. Demographic and clinical data were obtained, including age, gravidity, parity, menopause status, cytological results, HPV results, transformation zone (TZ) type, colposcopic impression, and histological diagnosis.
This study was approved by the Institutional Review Board of the Chinese Academy of Medical Sciences and Peking Union Medical College (No. CAMS and PUMC-IEC-2022-022). Informed consent was waived owing to the retrospective nature of this study and the anonymization of patient information.

Screening Tests, Colposcopy, and Histology Diagnosis
Cytology results were classified into five categories in accordance with the Bethesda system, including negative for intraepithelial lesion or malignancy (NILM), atypical squamous cells of undetermined significance (ASC-US), low-grade squamous intraepithelial lesion (LSIL), atypical squamous cells-cannot exclude high-grade squamous intraepithelial lesion (ASC-H), and HSIL+. HPV results were categorized as HPV 16/18 positive, other HR-HPV positive, and HR-HPV negative.
Experienced colposcopists assessed the type of TZ and gave a colposcopic impression (normal, LSIL, HSIL, or cancer) following the 2011 IFCPC colposcopic terminology for the cervix [18]. All abnormalities were biopsied, and endocervical curettage was performed if necessary. Histological diagnosis performed by experienced pathologists from local hospitals was considered the gold standard. Histological results were classified as normal, LSIL, and HSIL+ in accordance with the Lower Anogenital Squamous Terminology (LAST) system. Any disagreements were resolved by discussion. The worst grade of the dysplasia present was taken as the final diagnosis.

Development and Validation of ML Models
Features were selected on the basis of literature reviews and expert discussion. Age, gravidity, and parity have been identified as risk factors for cervical cancer in previous studies and were included in our ML models [19,20]. The ASCCP guidelines recommend that colposcopy practice should be performed on the basis of cytology, HPV results, and colposcopic impression [7]. Thus, these variables were also used for model construction.
Although the influence of menopause status and TZ type on the development of HSIL+ is not consistent, these variables play important roles in colposcopy practice and are recommended by experts to be included in the model [6,21,22].
The complete data were divided into internal and external sets on the basis of hospital site. Individuals collected from five hospitals were used as the internal set. The synthetic minority oversampling technique (SMOTE) was adopted to eliminate the impact of imbalanced data. In the oversampling process, the number of minority samples was randomly filled to the same number as that of the majority samples. The internal set was then randomly divided into the training and internal validation sets at a ratio of 8:2. The other two hospitals were used as the external validation set to assess the generalization of the model. We also used five-fold cross-validation to test the accuracy of models on the whole dataset from seven hospitals.
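The oversampling and 8:2 split described above can be sketched in a few lines. This is a minimal illustration of the interpolation idea behind SMOTE on toy numeric data, not the authors' implementation; the function names (`smote_like_oversample`, `split_8_2`), the neighbour count `k`, and the toy rows are all assumptions made for the example, and the study's own variables are largely categorical.

```python
import random

def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between a minority row
    and one of its k nearest minority neighbours (the core SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by squared Euclidean distance to `base`.
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([a + gap * (b - a) for a, b in zip(base, mate)])
    return synthetic

def split_8_2(rows, seed=0):
    """Shuffle the rows and split them into training and validation parts (8:2)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy data: 20 majority rows and 5 minority rows.
majority = [[float(i), float(i % 3)] for i in range(20)]
minority = [[100.0 + i, 50.0 + i] for i in range(5)]

# Fill the minority class up to the size of the majority class.
n_missing = len(majority) - len(minority)
minority_balanced = minority + smote_like_oversample(minority, n_missing, k=3)

labelled = [(x, 0) for x in majority] + [(x, 1) for x in minority_balanced]
train, valid = split_8_2(labelled)
```

Each synthetic row lies on a line segment between two existing minority rows, so the balanced class stays inside the range of the observed minority data.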
Six commonly used ML models were selected in this study, including logistic regression (LR), decision tree (DT), naïve bayes (NB), support vector machine (SVM), random forest (RF), and extreme gradient boosting (XGBoost). As a classical statistical model, the LR model performs well in a variety of statistical prediction and regression scenarios. The DT model divides the search space into smaller parts and searches each part by asking yes or no questions. The NB model, based on Bayes' theorem, can demonstrate the importance of different variables in the prediction process from another perspective. SVM with a radial basis function (RBF) kernel maps input variables into a higher-dimensional feature space in which a separating hyperplane is sought, so it can classify both linearly and nonlinearly separable data. The RF is an ensemble learning algorithm based on bagging. It uses different random samples to train multiple decision trees and the voting method to obtain the final classification result. XGBoost is an efficient and extensible learning classifier that is based on boosting. The objective function of XGBoost is regularized, which helps to control overfitting and further improve model performance. The models were developed in a Jupyter Notebook in Anaconda with Python 3.8.5. We used the decision tree classifier function in sklearn.tree to develop the DT model. The NB model was developed using sklearn.naive_bayes, and the SVM model was built using the sklearn.svm.SVC function. The random forest model was established by the RandomForestClassifier function in the sklearn.ensemble package. Additionally, the XGBoost model was developed through the XGBClassifier function in the XGBoost package. The model parameters were optimized by comparing the average accuracy in a five-fold cross-validation. The prediction function was used to generate the probability predicted by each model. We chose the optimal cut-off point, with excellent sensitivity and good specificity, for every model.
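The text does not state which criterion defined the optimal cut-off. As one common possibility, the sketch below picks the threshold that maximizes Youden's J (sensitivity + specificity - 1). The choice of Youden's J, the function names, and the toy labels and probabilities are assumptions for illustration only.

```python
def sensitivity_specificity(y_true, y_prob, cutoff):
    """Sensitivity and specificity when probabilities >= cutoff are called positive."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= cutoff)
    fn = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p < cutoff)
    tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < cutoff)
    fp = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def youden_cutoff(y_true, y_prob):
    """Scan every observed probability and keep the one with the largest Youden's J."""
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(y_prob)):
        sens, spec = sensitivity_specificity(y_true, y_prob, cutoff)
        j = sens + spec - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j

# Hypothetical predicted probabilities for 5 HSIL+ (1) and 5 <HSIL (0) patients.
y_true = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]

cutoff, j = youden_cutoff(y_true, y_prob)
```

A criterion that weights sensitivity more heavily than specificity, which the wording "excellent sensitivity and good specificity" may imply, would only require changing the expression for `j`.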

Statistical Analysis
Data analysis was performed using SAS version 9.4 (Cary, NC, USA). The chi-square test was used to compare differences in the distribution of variables between HSIL+ and <HSIL cases in the training and validation datasets. A two-sided p < 0.05 was regarded as statistically significant. Sensitivity, specificity, accuracy, the Matthews correlation coefficient (MCC), receiver operating characteristic (ROC) curves, and the area under the curve (AUC) were used to evaluate the diagnostic performance of different ML models and colposcopists at detecting HSIL+. The importance of variables in the SVM and NB models was calculated by estimating the decrease in AUC when the variable was removed from the model. The importance of variables in the LR, DT, RF, and XGBoost models was reflected by coefficients in the models. The overall importance of each variable was obtained by averaging its ranking across the six models.

Results
Figure 1 shows the process of model development and validation. We identified 7583 patients who had undergone colposcopy during the study period, 125 of whom were excluded according to our predefined selection criteria. The data of 7485 patients were used to develop and validate the ML models. The training and internal validation set consisted of 1632 HSIL+ patients and 4932 <HSIL patients. The external validation set consisted of 163 HSIL+ patients and 731 <HSIL patients. Among the whole population, 4135 (55.4%) were aged between 30 and 45; most had one or two pregnancies (5079, 68.1%), one or two deliveries (6018, 80.7%), and no menopause (6287, 84.3%). There were 4073 (54.6%) patients with cytological ASC-US or above, 2479 (33.2%) with HPV 16/18 positivity, 3587 (48.1%) with other HR-HPV positivity; 38.2% (2013) of the women were infected by multiple HPV subtypes. In total, 2847 (38.2%) participants had type 3 TZ, while 3695 (49.5%) and 1629 (21.8%) had colposcopic impressions of LSIL and HSIL+, respectively.
The characteristics of the study population in different datasets are displayed in Table 1. Compared with <HSIL controls, HSIL+ patients were more likely to be between 30 and 45 years old (p = 0.003), to have more gravidities and parities, to be cytological LSIL+, HPV 16/18 positive, and infected by a single HR-HPV, and to have an HSIL+ colposcopic impression, and they were less likely to have type 3 TZ (all p < 0.001).
In the internal validation set, the DT model was slightly better than the other models in terms of accuracy, sensitivity, and MCC. In the external validation set, the AUC, accuracy, sensitivity, specificity, and MCC of colposcopists were 0.755 (95% CI: 0.708-0.803), 86.13% (95% CI: 83.86%-88.40%), 58.90% (95% CI: 51.34%-66.45%), 92.20% (95% CI: 90.26%-94.15%), and 0.524 (95% CI: 0.491-0.557), respectively. Greater increases in AUC (0.087 to 0.119) and sensitivity (17.17% to 22.08%) were observed among the six models than in the internal validation set. In general, the performance of each model was similar but inferior to that in the internal validation set. Among them, the NB model showed a slightly higher AUC (0.874 [95% CI: 0.843-0.905]) than the other models, while the DT model performed relatively poorly for all indicators. The RF model yielded a relatively higher accuracy, specificity, and MCC but lower sensitivity. Figure 2 shows the ROC curves of the different models in the internal and external validation sets.
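The colposcopist figures reported for the external validation set can be reproduced with a short calculation. The sketch below is an illustration, not the authors' SAS analysis. The four cell counts are back-calculated (an assumption) from the reported sensitivity and specificity and from the 163 HSIL+ and 731 <HSIL patients, and the Wald (normal-approximation) interval is likewise an assumption, since the text does not name the interval method.

```python
import math

def wald_ci(successes, total, z=1.96):
    """Proportion with a normal-approximation (Wald) 95% confidence interval."""
    p = successes / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, p - half_width, p + half_width

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from the four confusion-matrix cells."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator

# Assumed counts, back-calculated from the reported percentages:
# 96 of 163 HSIL+ detected (58.90%), 674 of 731 <HSIL correctly ruled out (92.20%).
tp, fn = 96, 67
tn, fp = 674, 57

sens = wald_ci(tp, tp + fn)
spec = wald_ci(tn, tn + fp)
acc = wald_ci(tp + tn, tp + fn + tn + fp)

# For a single binary call, the ROC curve has one operating point and the
# area under it reduces to the mean of sensitivity and specificity.
auc_binary = (sens[0] + spec[0]) / 2
mcc_value = mcc(tp, fp, tn, fn)
```

Under these assumptions, the sketch gives a sensitivity of 58.90% (51.34% to 66.45%), a specificity of 92.20% (90.26% to 94.15%), an accuracy of 86.13% (83.86% to 88.40%), an AUC of 0.755, and an MCC of 0.524, in agreement with the reported point estimates and intervals.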

Model Performance and Variable Ranking
The combined confusion matrix and performance of the six models for detecting HSIL+ based on five-fold cross-validation are provided in Tables S1 and S2. The average accuracy, balanced accuracy, and MCC ranged from 86.56% to 89.13%, 80.94% to 83.48%, and 0.628 to 0.693, respectively.
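A combined confusion matrix of this kind is obtained by summing the held-out predictions of all five folds. The sketch below shows the bookkeeping on toy data. The one-variable threshold rule is only a stand-in for the six models, and all names and data are hypothetical.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_threshold(xs, ys):
    """Toy stand-in for a model: midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def cross_validate(xs, ys, k=5, seed=0):
    """Train on k-1 folds, predict the held-out fold, and sum the confusion cells."""
    combined = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        threshold = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        for i in held_out:
            pred = 1 if xs[i] >= threshold else 0
            # "t"/"f" records whether the call was right, "p"/"n" what was called.
            key = ("t" if pred == ys[i] else "f") + ("p" if pred == 1 else "n")
            combined[key] += 1
    return combined

def balanced_accuracy(cm):
    """Mean of sensitivity and specificity, robust to class imbalance."""
    sensitivity = cm["tp"] / (cm["tp"] + cm["fn"])
    specificity = cm["tn"] / (cm["tn"] + cm["fp"])
    return (sensitivity + specificity) / 2

# Hypothetical imbalanced toy data: 30 negatives (scores 0-29), 10 positives (30-39).
xs = [float(i) for i in range(40)]
ys = [1 if i >= 30 else 0 for i in range(40)]

cm = cross_validate(xs, ys)
bal_acc = balanced_accuracy(cm)
```

Because every patient is held out exactly once, the four cells of the combined matrix add up to the full sample size.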
The rankings of variable importance are shown in Table 3. Overall, colposcopic impression was the most important predictor, followed by HR-HPV infection and cytology. TZ type, HR-HPV multi-infection, and age ranked fourth, fifth, and sixth, respectively. Gravidity, parity, and menopause status all ranked seventh. Colposcopic impression had the strongest predictive effect in all six models. HR-HPV and cytology results ranked second or third in all models except for the LR model. TZ type was important in the LR, DT, and RF models but ranked lower in the SVM and XGBoost models. HR-HPV multi-infection ranked higher than fifth in the LR and XGBoost models. Age ranked higher than sixth in the SVM and NB models.
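The overall ranking is described as the average of the per-model rankings, with tied variables sharing a position. A minimal sketch of that aggregation follows. The two models and their ranks are hypothetical placeholders, not the values in Table 3, and the tie-handling rule (tied variables share the lowest position) is an assumption consistent with three variables all ranking seventh.

```python
def overall_ranking(ranks_by_model):
    """Average each variable's rank across models, then rank the averages.
    Variables with identical average ranks share the same overall position."""
    variables = list(next(iter(ranks_by_model.values())))
    mean_rank = {
        v: sum(ranks[v] for ranks in ranks_by_model.values()) / len(ranks_by_model)
        for v in variables
    }
    ordered = sorted(variables, key=mean_rank.get)
    overall = {}
    for position, v in enumerate(ordered, start=1):
        previous = ordered[position - 2] if position > 1 else None
        if previous is not None and mean_rank[v] == mean_rank[previous]:
            overall[v] = overall[previous]  # tie: reuse the earlier position
        else:
            overall[v] = position
    return mean_rank, overall

# Hypothetical per-model ranks (1 = most important) for four of the nine variables.
ranks_by_model = {
    "model_a": {"colposcopic_impression": 1, "hr_hpv": 2, "cytology": 3, "age": 4},
    "model_b": {"colposcopic_impression": 1, "hr_hpv": 3, "cytology": 2, "age": 4},
}

mean_rank, overall = overall_ranking(ranks_by_model)
```

In this toy input, the two screening variables swap places between the models, so they tie on average rank and share second place, while the next variable is placed fourth.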
The number 1 indicates that the variable ranks first in terms of importance, and 9 indicates that the variable ranks last among all variables. † The overall-importance of the variable is obtained by averaging the variable ranking in six models. Abbreviations: HR-HPV, high-risk human papillomavirus; LR, logistic regression; SVM, support vector machine; DT, decision tree; NB, naïve bayes; RF, random forest; XGBoost, extreme gradient boosting.

Subgroup Analysis
Subgroup analysis was conducted to evaluate the predictive performance of different models in 487 HSIL+ patients who were diagnosed as <HSIL by colposcopists. The results are presented in Table 4, and Figure 3 shows the ROC curves of the six models in different datasets.

Discussion
In this study, we developed and validated ML-based models to predict HSIL+ in patients referred for colposcopy. All six ML models that integrated demographics, screening results, and colposcopic impressions had significantly improved AUC and sensitivity compared with colposcopists in both the internal and external validation sets. Colposcopic impressions, HR-HPV results, and cytology results were the top three variables for determining model performance.
Colposcopy plays a vital role in detecting HSIL+, but its overall performance remains unsatisfactory. The sensitivity and specificity of colposcopists for detecting HSIL+ in our study were 70.86% and 95.06%, respectively, which were similar to the sensitivity and specificity of 71.6% and 98.0%, respectively, achieved by two expert colposcopists in a previous study [6]. However, the sensitivity was only 54.7% in another study carried out in China owing to various degrees of colposcopists' clinical experience [23]. The primary task for colposcopists is to identify and treat high-grade diseases to reduce the risk of developing invasive cancer. However, several studies have shown that colposcopic diagnoses tend to underestimate rather than overestimate the pathology of biopsies [6,23,24]. This imprecision is partly because colposcopy largely depends upon the subjective experience of operators. The immaturity and ongoing evolution of colposcopy diagnostic criteria add to the confusion of junior colposcopists [25]. Additionally, some potential lesions, visible at molecular and cellular levels, are not visible under colposcopy, a problem that can be intensified by the change of screening modality from cytology to primary HPV screening. Therefore, more information, such as screening results and basic demographics, should be considered when determining follow-up, biopsy, or immediate treatment for patients referred for colposcopy.
Different ML models with various variables have been used for cervical cancer screening in previous studies. For example, Kahng et al. used an SVM model with age, cytology, and the presence of 15 HR-HPV genotypes to predict progression to cervical lesions, with an accuracy of 74.41% [14]. Karakitsos et al. developed a neural network classifier based on cytology, HPV, E6/E7 mRNA, and p16 immunostaining to identify cervical intraepithelial neoplasia grade 2+ (CIN2+), which obtained a significantly improved AUC (0.916) compared with cytological diagnosis alone (0.866) [16]. Another study reported that the valuable predictive factors were age, cytology, HR-HPV DNA/mRNA, E6 oncoprotein, HPV genotyping, and p16/Ki-67, which had an AUC of 0.92 for predicting CIN2+ [26]. Several studies have used epidemiologic risk factors and molecular markers to predict histologic grade or risk stratification on the basis of logistic regression results. For example, Rothberg et al. constructed a model using age, race, marital status, insurance, smoking history, income, and previous HPV results on around 100,000 women, obtaining an AUC of 0.81 for HSIL+ [17]. Another study used mRNA level, DNA index, parity, and age to achieve excellent discrimination for HSIL, with an AUC of 0.99 [15]. These studies further highlight the value of integrating clinical information and reflect the advantages of ML in processing multi-dimensional data. However, these studies were based on screening scenarios, and no model has been developed to further combine colposcopic impressions to predict HSIL+ under a diagnostic context.
All six ML models in our study showed improved AUC and sensitivity for detecting HSIL+ compared with colposcopists. However, the importance of colposcopy ranked first among all variables, no matter the model. This finding is consistent with previous research that showed that colposcopic impression had a stronger association with the final histological diagnosis of HSIL than pap smears or HPV 16/18 infections [27]. We also found that HR-HPV infection and cytology ranked second and third in overall variable importance, in line with colposcopy being a reason for referral, as proposed by the ASCCP [7]. In the 2017 ASCCP guidelines, serial cytology and HPV testing but not biopsy may be recommended when the risk of precancer is very low (no evidence of HPV 16/18, <HSIL cytology, and a completely normal colposcopic impression). In contrast, immediate treatment may be warranted for those at very high risk (at least two of the following: HPV 16/18 positive, HSIL cytology, and high-grade colposcopy impression). Silver et al. calculated pooled risk estimates for 32 strata on the basis of cytology, HPV 16/18 results, and colposcopy in a meta-analysis, and their results also support a risk-based approach to colposcopy and biopsy practice [28].
We found that TZ type and age ranked fourth and sixth among all variables, respectively. The proportion of HSIL+ in 30- to 45-year-olds was higher than in the other two age groups, which is consistent with the peak diagnosis age of HSIL being between 30 and 45 years [19]. The precise identification of the squamocolumnar junction (SCJ) and the evaluation of the TZ are crucial steps during colposcopy. In total, 38.2% of patients in our study had type 3 TZ. This proportion fluctuates greatly among different studied populations [6,29]. Previous studies have suggested that TZ type is associated with colposcopic accuracy, as higher accuracy and sensitivity for detecting HSIL+ were found in women with type 1/2 TZ compared with women with type 3 TZ [24]. In our study, the proportion of type 3 TZ in women with HSIL+ was less than that in women with <HSIL. This indicates that TZ type may not be directly related to developing HSIL+ but does affect the performance of colposcopy, which requires special attention in clinical practice to avoid the misdiagnosis of CINs.
Moreover, patients with increased gravidity or parity were more likely to have HSIL+. Previous studies have confirmed a direct relationship between reproductive factors and the risk of cervical cancer [20,30,31]. The ectopia of cervical columnar epithelium caused by delivery-related cervical traumas and high concentrations of estrogen and progesterone during pregnancy make the SCJ more susceptible to HPV infection. The hormonal profile and immunodepression caused by pregnancy may also favor infection or accelerate cervical carcinogenesis. Additionally, more pregnancies may represent insufficient condom use, and subsequently induced abortions may lead to joint infections of bacteria and viruses, which aggravates damage to the cervix. The impact of menopause status on the development of HSIL+ is complicated by hormone levels and immune status. Cervical atrophy, decreased cell detachment, and the contraction of the SCJ in postmenopausal women may result in the lower accuracy of colposcopy, which should be paid more attention to in clinical practice.
In total, 33.2% of HR-HPV-positive patients were infected with multiple HR-HPVs in our study, which is within the range of 26.3% to 43.2% that has been previously reported [32,33]. We found a higher incidence of HSIL+ in patients with single infections than in those with multiple infections. However, the association between multiple HPV infections and cervical lesions remains inconclusive. Some studies support multiple infections having no additive or synergistic effects on the development of HSIL+ [33,34], while other studies suggest that coinfection may act synergistically in cervical carcinogenesis [35]. Other data highlight the importance of the type of HPV infection. For example, a single infection with HPV 16 is more virulent than a multiple-infection pattern, while for HPVs 52, 53, 56, 51, 39, 66, 59, 68, and 35, a multiple-infection pattern is more likely to develop into HSIL+ than a single infection [32].
We found that the AUC of the DT model was significantly higher than that of the NB model in the internal validation set, whereas the NB model performed best in the external validation set, although this difference was not statistically significant. This finding is consistent with the properties of the algorithms: DT tends to overfit the training data, while NB is not prone to overfitting [36]. The DT is based on a hierarchical structure and has the advantage of good interpretability. However, it tends to overfit unless properly pruned, and it is not suitable for every application because it separates samples linearly. The NB model is easily implemented, highly scalable, and does not require many data entries to make proper classifications. Nevertheless, the independence assumption behind NB classifiers is quite bold and rarely holds in practice, leading to inaccuracies in the calculations of class probabilities. However, as long as the probability of the correct category is higher than those of the others, inaccuracies in the probability calculation will not affect outcomes. Both methods are fast and efficient, and in most cases, both will be tested before one is chosen. We also used two ensemble algorithms to construct the predictive model. RF is based on bagging, which fits many decision trees on different samples of the same dataset and averages the predictions. XGBoost is based on boosting, which sequentially adds ensemble members that correct the predictions made by prior models and outputs a weighted average of the predictions. These two models showed no significant improvement in predicting HSIL+ compared with the other base models in our study.
To our knowledge, few studies have developed predictive models for cervical precancers in Chinese women referred for colposcopy. We also conducted internal and external validation of the ML models in a large set of samples from multiple centers. All variables we used are clinically easy to obtain. However, the study has some limitations that must be considered. First, it was a retrospective study, so prospective follow-up is needed to explore the long-term predictive performance of our models. Second, some variables that may improve the predictions of HSIL+, such as HPV infection history, smoking status, and sexual behavior, were not collected. Nevertheless, we have made the best use of the information available from existing sources. Third, in our study, colposcopy examinations were conducted by senior colposcopists, which may not be generalizable to all primary medical institutions. A way to better extrapolate the models using colposcopic information from junior colposcopists needs to be evaluated. As colposcopic impression was the variable with the greatest impact on model performance, improving the diagnostic ability of colposcopists is still a priority in the future. New technology such as artificial intelligence (AI) colposcopy and innovative forms of colposcopy training could bring exciting changes to the current situation [37][38][39]. Our predictive models trained by non-imaging clinical data could also be integrated with the current image-based AI colposcopy to make colposcopy practice more intelligent, objective, and comprehensive.

Conclusions
This study demonstrates that ML can improve the AUC and sensitivity for detecting HSIL+ in patients referred for colposcopy by incorporating demographics, screening results, and colposcopic impressions. These models can transform a subjective experience into an objective judgment to help clinicians make decisions at the time of colposcopy examinations.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/diagnostics12123066/s1. Table S1: Combined confusion matrix of different models based on five-fold cross-validation. Table S2: Performance of different models for detecting HSIL+ based on five-fold cross-validation.
Author Contributions: M.C. and J.W. planned and designed the study. Q.L. and P.X. collected the data, and M.C. analyzed and interpreted the data. J.W. performed the statistical analysis. P.X., Y.J. and Y.Q. provided conceptual assistance. M.C. wrote the first draft, and Q.L., P.X. and Y.Q. revised the manuscript. All co-authors reviewed the manuscript, made corrections, and approved this publication. All authors have read and agreed to the published version of the manuscript.
Informed Consent Statement: Patient consent was waived because this was a retrospective study of pre-existing clinical archives and because all data were completely anonymized.

Data Availability Statement:
The datasets generated and/or analyzed during the current study are not publicly available because of personal information protection, patient privacy regulation, and medical institutional data regulatory policies but are available from the corresponding author upon reasonable request and with permission.

Conflicts of Interest:
The authors declare no conflict of interest.