A Clinical Prediction Model of Overall Survival for Patients with Cervical Cancer Aged 25–69 Years

Aims: This study aimed to develop a prediction tool for the overall survival of cervical cancer patients. Methods: We obtained data on 4116 female patients aged 25–69 years who were diagnosed with cervical cancer during 2008–2019 from the Surveillance, Epidemiology, and End Results Program. Overall survival between groups was illustrated by the Kaplan–Meier method and compared by a log-rank test adjusted by the Bonferroni–Holm method. We first performed multivariate Cox regression analysis to evaluate the predictive values of the variables. A prediction model was then created using Cox regression based on the training set and presented as a nomogram designed to predict the 1-year, 3-year, and 5-year overall survival of patients with cervical cancer. In addition to the C-index, time-dependent receiver operating characteristic curves and calibration curves were used to evaluate the accuracy of the nomogram at one year, three years, and five years. Results: With a median follow-up of 54 (28, 92) months, 1045 (25.39%) patients had died. Compared with surviving patients, those who died were significantly older and more likely to have a primary site of cervix uteri, a larger tumor size, a higher grade, and a higher combined summary stage (all p values < 0.001). In the multivariate Cox regression, age at diagnosis, race, tumor size, grade, combined summary stage, pathology, and surgical treatment were significantly associated with all-cause mortality in patients with cervical cancer. The proposed nomogram showed good performance with a C-index of 0.82 in the training set. The 1-year, 3-year, and 5-year areas under the receiver operating characteristic curves (with 95% confidence intervals) were 0.88 (0.84, 0.91), 0.84 (0.81, 0.87), and 0.83 (0.80, 0.86), respectively. Conclusions: This study developed a nomogram with good performance for predicting the overall survival of cervical cancer patients. Further studies are required to validate the prediction model.


Introduction
Cervical cancer is the fourth most common malignant tumor in women and poses a substantial health threat worldwide [1,2]. Owing to advances in prevention, diagnosis, and treatment, the incidence and mortality of cervical cancer have decreased by at least half in the past three decades in developed countries [1]. Still, the disease remains a significant health burden on a global scale. It was reported that 569,847 patients were newly diagnosed with cervical cancer, and 311,365 deaths were caused by cervical cancer worldwide in 2018.
Squamous cell carcinoma is the predominant histological subtype (about 70%), and adenocarcinoma is the second most common subtype, accounting for about 25% [1,3]. Many factors have previously been reported to be associated with the survival of this malignancy, such as lymph node metastasis, histologic type, and tumor size [4]. However, overall survival varies among cervical cancer patients at the individual level, even for those with the same disease stage and histologic type. A single predictive biomarker is insufficient to evaluate survival comprehensively.
As a class of artificial intelligence, machine learning uses algorithmic methods that allow machines to perform disease prediction without being explicitly programmed [5]. Applying machine learning to big data provides a powerful method for evaluating complex healthcare information [6]. Therefore, this study aimed to develop a prediction tool for the overall survival of cervical cancer patients based on machine learning.

Data Source
The Surveillance, Epidemiology, and End Results (SEER) Program collects population-based cancer incidence and survival data from US cancer registries, which cover about 48% of the total US population. Patient demographics, tumor site, morphology, stage at diagnosis, treatment, and follow-up survival status are routinely collected in the SEER registries. This study obtained data from the "Incidence-SEER Research Data, 8 Registries, Nov 2021Sub (1975". The cervical cancer diagnosis was based on the International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3). We included participants who were (1) pathologically diagnosed with cervical cancer, (2) aged 25–69 years, (3) with complete survival records, and (4) newly diagnosed between 2008 and 2019. Exclusion criteria were (1) diagnosis only by autopsy or death certificate, (2) no record of race, tumor site, size, grade, or stage, and (3) a missing surgery record. In the SEER database, the surgery status of participants is recorded as (1) Yes (received surgical treatment), (2) No (did not receive surgical treatment), or (3) Not available (no information about surgical treatment). We excluded patients without available information on surgery. It should be noted that a "missing surgery record" indicates that it is unknown whether the participant received surgical treatment, not that the participant did not receive surgery. Follow-up time was defined as the time from diagnosis to death or the last contact date. Finally, 4116 female patients diagnosed with cervical cancer during 2008–2019 were included in this study.
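The inclusion and exclusion criteria above can be sketched as a simple record filter. This is only an illustration in Python, not the extraction procedure used in the study (which relied on SEER*Stat and R); all field names, such as `age_at_diagnosis` and `autopsy_or_death_certificate_only`, are hypothetical and do not correspond to actual SEER variable names.

```python
# Hypothetical field names; real SEER variables are named differently.
REQUIRED_FIELDS = ("race", "tumor_site", "tumor_size", "grade", "stage")


def is_eligible(record):
    """Return True if a patient record meets the stated criteria."""
    # Inclusion (1): pathologically confirmed cervical cancer.
    if not record.get("pathologically_confirmed"):
        return False
    # Inclusion (2): aged 25-69 years at diagnosis.
    age = record.get("age_at_diagnosis")
    if age is None or not 25 <= age <= 69:
        return False
    # Inclusion (3): complete survival record.
    if record.get("survival_months") is None or record.get("vital_status") is None:
        return False
    # Inclusion (4): newly diagnosed between 2008 and 2019.
    year = record.get("year_of_diagnosis")
    if year is None or not 2008 <= year <= 2019:
        return False
    # Exclusion (1): diagnosed only by autopsy or death certificate.
    if record.get("autopsy_or_death_certificate_only"):
        return False
    # Exclusion (2): missing race, tumor site, size, grade, or stage.
    if any(record.get(field) is None for field in REQUIRED_FIELDS):
        return False
    # Exclusion (3): surgery status unknown ("Not available" is not "No").
    if record.get("surgery") not in ("Yes", "No"):
        return False
    return True
```

Note that a record with surgery status "No" is retained, whereas "Not available" is excluded, which mirrors the distinction drawn in the text.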
In this study, sociodemographic, pathologic, and clinical variables were obtained for further analysis, including age at diagnosis, race (White, Black, and other races), primary site (cervix uteri, endocervix, exocervix, and overlapping lesion), tumor size, grade (Grade I, well-differentiated; Grade II, moderately differentiated; Grade III, poorly differentiated; Grade IV, undifferentiated), combined summary stage (localized, regional, and distant), pathology (squamous cell carcinoma, adenocarcinoma, and others), and surgical treatment. All data were acquired from the SEER database using SEER*Stat software (version 2.4.0).

Development and Validation of the Prediction Model
We first fitted a multivariate Cox proportional hazards regression model to evaluate the predictive values of the variables. Multiple variables were input to the Cox regression model, including age at diagnosis, race, primary site, tumor size, grade, combined summary stage, pathology, and surgical treatment. The results are presented as hazard ratios (HRs) with 95% confidence intervals (CIs). Multicollinearity refers to a high correlation between two or more predictor variables in a regression model. It can lead to unstable estimates of the regression coefficients, which makes it difficult to determine the true effect of each predictor variable on the outcome variable. The variance inflation factor is a widely used measure of the degree of multicollinearity in a regression model. We calculated the variance inflation factor of each variable to evaluate multicollinearity. A variance inflation factor of 1 indicates no correlation, a value between 1 and 5 indicates moderate correlation, and a value above 5 indicates high correlation.
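To make the variance inflation factor concrete, the following minimal Python sketch computes it for numeric predictors as the diagonal of the inverse of their correlation matrix, which equals 1 / (1 - R²) from regressing each predictor on the others. It is an illustration under stated assumptions, not the authors' R code: the function names are hypothetical, and categorical predictors in a Cox model (such as grade) would typically need dummy coding or a generalized variant that is not shown here.

```python
def correlation_matrix(columns):
    """Pearson correlation matrix of equally long numeric columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [sum(v * v for v in col) ** 0.5 for col in centered]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centered, norms)]
        for ci, ni in zip(centered, norms)
    ]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(columns):
    """VIF of each predictor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]


def describe_vif(vif):
    """Label a VIF using the thresholds stated in the text."""
    if vif > 5:
        return "high"
    if vif > 1:
        return "moderate"
    return "none"


# Made-up values for two predictors (not study data).
age = [1.0, 2.0, 3.0, 4.0, 5.0]
tumor_size = [2.0, 1.0, 4.0, 3.0, 5.0]
for vif in variance_inflation_factors([age, tumor_size]):
    print(round(vif, 2), describe_vif(vif))
```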
The input data were randomly divided into a training set and a testing set at a 7:3 ratio. The training set (n = 2882) was used to create the prediction model, while the testing set (n = 1234) was used to validate the model performance. The prediction model was created by Cox regression and presented as a nomogram. The proposed nomogram was designed to predict the 1-year, 3-year, and 5-year overall survival of patients with cervical cancer. In addition to the C-index, time-dependent receiver operating characteristic curves and calibration curves were used to evaluate the accuracy of the nomogram at one year, three years, and five years.
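The C-index used to evaluate the nomogram can be illustrated with a small Python sketch of Harrell's concordance index. This is a simplified version written for clarity rather than the routine used in the study: it assumes that a higher risk score means a worse prognosis, skips pairs with tied survival times, and uses a hypothetical function name.

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times: follow-up time of each patient
    events: 1 if death was observed, 0 if censored
    risk_scores: model output, higher meaning higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i died strictly before
            # the follow-up time of patient j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0  # earlier death had the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Made-up example (not study data): months of follow-up, death indicator,
# and a hypothetical risk score per patient.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.2, 0.4, 0.6]
print(harrell_c_index(times, events, risks))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported values of 0.82 and 0.81 indicate that the model ordered most comparable patient pairs correctly.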

Statistical Analysis
Descriptive statistics were used to describe the baseline characteristics. Continuous variables are presented as mean ± standard deviation and categorical variables as percentages. The baseline characteristics were compared by the Kruskal–Wallis test or chi-square test as appropriate. Overall survival between groups was illustrated by the Kaplan–Meier method and compared by a log-rank test adjusted by the Bonferroni–Holm method. We used bootstrapping with 1000 resamples to validate the performance of the proposed model internally. p values < 0.05 were considered statistically significant. All statistical analyses were performed in R software (version 4.0).
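As an illustration of the Kaplan–Meier method mentioned above, the following Python sketch computes the product-limit survival estimate from follow-up times and event indicators. It is a minimal example on made-up data, not the study's R analysis, and the function name is hypothetical.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate as a list of (time, survival probability).

    times: follow-up time of each patient
    events: 1 if death was observed at that time, 0 if censored
    Only time points with at least one death are returned.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored
        if deaths:
            # Multiply by the conditional probability of surviving time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Made-up follow-up data in months (not study data).
months = [6, 12, 12, 30, 54]
died = [1, 1, 0, 0, 1]
for t, s in kaplan_meier(months, died):
    print(t, round(s, 3))
```

Censored patients contribute to the number at risk until their last contact date and are then removed without lowering the curve, which is why follow-up time was defined up to death or the last contact date.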

Participant Characteristics
With a median follow-up of 54 (28, 92) months, 1045 (25.39%) patients had died. Compared with surviving patients, those who died were significantly older and more likely to have a primary site of cervix uteri, a larger tumor size, a higher grade, and a higher combined summary stage (all p values < 0.001). The baseline participant characteristics are shown in Table 1.

Cox Regression Analysis
In the multivariate Cox regression, age at diagnosis, race, tumor size, grade, combined summary stage, pathology, and surgical treatment were significantly associated with all-cause mortality in patients with cervical cancer. The results of the multivariate Cox regression analysis are shown in Table 2. Compared with White patients, Black patients had a 1.37-fold (95% CI, 1.14–1.65) risk of all-cause death. Furthermore, Figure 1 illustrates the overall survival of cervical cancer patients of different races (log-rank p value < 0.0001). Survival was significantly lower in Black patients than in White patients (Bonferroni–Holm (BH)-adjusted p value < 0.001) and patients of other races (BH-adjusted p value < 0.001). However, no statistical difference was observed between White patients and patients of other races (BH-adjusted p value = 0.62). Additionally, the variance inflation factors of each variable are provided in Supplementary Table S1. The moderately differentiated and poorly differentiated grades showed high multicollinearity, with variance inflation factor values of 5.4 and 5.8, respectively. Nevertheless, all variables were input into the prediction model.
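The adjustment of the pairwise log-rank p values can be illustrated with a short Python sketch of Holm's step-down procedure, which is how "Bonferroni–Holm" is interpreted here. The raw p values below are hypothetical, since the unadjusted values are not reported in the text, and the function name is an assumption.

```python
def holm_adjust(p_values):
    """Holm (Bonferroni-Holm) step-down adjusted p values.

    The smallest p value is multiplied by m, the next by m - 1, and so
    on; a running maximum keeps the adjusted values monotone, and each
    result is capped at 1.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p values for three pairwise comparisons
# (Black vs White, Black vs other, White vs other); not study results.
raw = [0.01, 0.03, 0.04]
print(holm_adjust(raw))
```

With three race groups there are three pairwise comparisons, so the smallest raw p value is multiplied by 3, the next by 2, and the largest by 1.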

Development and Validation of the Prediction Model
Figure 1. Kaplan–Meier plot of overall survival for patients with cervical cancer of different races (log-rank p value < 0.0001). Survival was significantly lower in Black patients than in White patients (BH-adjusted p value < 0.001) and patients of other races (BH-adjusted p value < 0.001). However, no statistical difference was observed between White patients and patients of other races (BH-adjusted p value = 0.62).
A prediction model was created by Cox regression based on the training set, and the proposed model is presented as a nomogram in Figure 2. Multiple variables were input: age at diagnosis, race, primary site, tumor size, grade, combined summary stage, pathology, and surgical treatment. The C-index in the training set was 0.82.

In the testing set, the nomogram showed good performance with a C-index of 0.81. Figure 3 shows that the performance remained satisfactory at one year, three years, and five years, with areas under the curve (AUCs) of 0.88 (0.84, 0.91), 0.84 (0.81, 0.87), and 0.83 (0.80, 0.86), respectively. Additionally, the calibration plots of the nomogram are shown in Figure 4. The sensitivity and specificity of the model in the testing set are shown in Table 3. Our results showed that the nomogram had good calibration when predicting 1-year, 3-year, and 5-year overall survival probability.
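A nomogram built on a Cox model is essentially a graphical way of evaluating S(t | x) = S0(t)^exp(lp), where S0(t) is the baseline survival at time t and lp is the linear predictor (the sum of each coefficient multiplied by its covariate value). The Python sketch below shows this calculation under loudly stated assumptions: the baseline 5-year survival of 0.80 is invented for illustration, the function and covariate names are hypothetical, and the single coefficient is simply the logarithm of the hazard ratio of 1.37 reported above for Black versus White patients, not a coefficient taken from the fitted nomogram.

```python
import math


def predicted_survival(baseline_survival, coefficients, covariates):
    """Survival probability from a Cox model at one time point.

    baseline_survival: S0(t) for a patient with all covariates at the
        reference level (hypothetical value in the example below)
    coefficients: log hazard ratio for each covariate
    covariates: covariate values for the patient, relative to reference
    """
    linear_predictor = sum(
        coefficients[name] * value for name, value in covariates.items()
    )
    return baseline_survival ** math.exp(linear_predictor)


# Illustrative inputs only: an assumed baseline 5-year survival and one
# coefficient derived from the hazard ratio of 1.37 reported in the text.
coefficients = {"black_race": math.log(1.37)}
reference_patient = {"black_race": 0}
black_patient = {"black_race": 1}

print(predicted_survival(0.80, coefficients, reference_patient))
print(predicted_survival(0.80, coefficients, black_patient))
```

In a nomogram, each covariate's contribution to the linear predictor is rescaled to points, and the total points are mapped to the 1-year, 3-year, and 5-year survival probabilities through the same relationship.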

Discussion
In this study, we obtained data on 4116 female patients diagnosed with cervical cancer during 2008–2019 from the SEER Program. Based on Cox regression analysis, we developed a prediction model and presented it as a nomogram. The model showed good performance, with a C-index of 0.82 in the training set and 0.81 in the testing set. The receiver operating characteristic (ROC) curves show that the performance remained satisfactory at one year, three years, and five years, with AUCs of 0.89, 0.86, and 0.84.
Previous studies have identified many risk factors that predict overall survival in patients with cervical cancer (e.g., lymph node metastasis, histologic type, tumor size) [4]. However, cervical cancer patients show distinct prognoses, even those with the same histologic type. Prediction using a single biomarker is insufficient to evaluate survival comprehensively. Nomograms based on machine learning integrate multiple biomarkers to comprehensively evaluate disease prognosis [7–9]. This visual method is designed to generate a precise prediction tailored to an individual patient, providing a simple-to-use tool for clinicians to predict overall survival [10]. Recently, many nomograms have been developed for cancer diagnosis and prognosis, and they have shown better performance than the traditional clinical staging system [11–13].
Few studies have proposed prediction tools to evaluate the overall survival of cervical cancer patients [14,15]. Polterauer et al. [14] developed a nomogram to predict overall survival in cervical cancer using 528 consecutive patients. International Federation of Gynecology and Obstetrics (FIGO) stage, tumor size, age at diagnosis, histologic subtype, lymph node ratio, and parametrial involvement were input to the prediction model as nomogram covariates. This model was internally validated using 1000 bootstrap resamples, and the C-index for overall survival was 0.72 (25th and 75th percentiles, 0.70 and 0.74) [14]. In another study by Kidd and colleagues [15], 234 cervical cancer patients were included to develop the nomograms. The proposed nomograms showed reliable performance for recurrence-free survival, disease-specific survival, and overall survival, with C-indexes of 0.741, 0.739, and 0.658, respectively [15]. Compared with previous studies, we included a large population-based sample, which makes our results more reliable and potentially applicable to the general population. Importantly, our nomogram showed good performance, with a C-index of 0.82 and 1-year, 3-year, and 5-year AUCs of 0.89, 0.86, and 0.84, respectively. The proposed nomogram is convenient and can be easily converted into an online prediction tool, which would help clinicians make treatment decisions.
Tumor stage and histologic subtype are well-established risk factors for worse survival in cervical cancer patients. However, it remains uncertain whether older age reduces overall survival [16–18]. The median age at cervical cancer diagnosis is 49 years, and cervical cancer is most frequently diagnosed in patients aged 35 to 44 years. Current guidelines recommend cervical cancer screening for women younger than 65 years but not for those older than 65 years [19]. Still, many patients are diagnosed at an older age (above 65 years), accounting for about 20% of all patients [20]. Therefore, research is required to investigate the risk factors that predict the survival of cervical cancer patients. In a previous study of 43,350 cervical cancer patients, Quinn et al. [21] reported that increased age (particularly > 70 years) was significantly associated with decreased survival. The trend remained consistent when stratified by tumor stage and histologic subtype [21]. In our study, patients aged 65–69 years had a 1.6-fold risk of all-cause death compared with those aged 25–29 years.
Despite these advantages, some limitations should be noted. First, this study is based on the SEER Program, which is conducted in the US. It remains unclear whether the model can be applied to other races. Second, the predictive factors input to this model were all taken from SEER records. However, this database does not collect some important predictive biomarkers for cervical cancer (especially recently proposed ones). Third, the prediction model was validated only in the internal validation set. Further validation based on an external dataset would be necessary. Last but not least, this study excluded participants with missing records, which might have introduced additional selection bias.

Conclusions
In the present study, we developed a nomogram with good performance for predicting the overall survival of cervical cancer patients. Further studies are required to validate the prediction model.