Mathematical Modelling of Cervical Precancerous Lesion Grade Risk Scores: Linear Regression Analysis of Cellular Protein Biomarkers and Human Papillomavirus E6/E7 RNA Staining Patterns

The current practice of determining histologic grade with a single molecular biomarker can facilitate differential diagnosis but cannot predict the risk of lesion progression. Cancer is caused by complex mechanisms, and no single biomarker can both make accurate diagnoses and predict progression risk. Modelling using multiple biomarkers can be used to derive scores for risk prediction. Mathematical models (MMs) may be capable of making predictions from biomarker data. Therefore, this study aimed to develop MM–based scores for predicting the risk of precancerous cervical lesion progression and identifying precancerous lesions in patients in northern Thailand by evaluating the expression of multiple biomarkers. The MMs (Models 1–5) were developed in the test sample set based on patient age range (five categories) and biomarker levels (cortactin, p16INK4A, and Ki–67 by immunohistochemistry [IHC], and HPV E6/E7 ribonucleic acid (RNA) by in situ hybridization [ISH]). The risk scores for the prediction of cervical lesion progression (“risk biomolecules”) ranged from 2.56–2.60 in the normal and low–grade squamous intraepithelial lesion (LSIL) cases and from 3.54–3.62 in cases where precancerous lesions were predicted to progress. In Model 4, 23/86 (26.7%) normal and LSIL cases had biomolecule levels that suggested a risk of progression, while 5/86 (5.8%) cases were identified as precancerous lesions. Additionally, histologic grading with a single molecular biomarker did not identify 23 cases with risk, preventing close patient monitoring. These results suggest that biomarker level–based risk scores are useful for predicting the risk of cervical lesion progression and identifying precancerous lesion development. This multiple biomarker–based strategy may ultimately have utility for predicting cancer progression in other contexts.


Introduction
Almost all cervical cancers and their precursor lesions, including squamous intraepithelial lesions (SIL) and cervical intraepithelial neoplasia (CIN), are caused by persistent recruits Arp2/3 complex proteins and binds to actin microfilaments. It also promotes lamellipodia and invadopodia formation, cell migration, endocytosis, cell mortality, and tumor invasiveness [33,34]. Cortactin is overexpressed in many cancers [35,36] at high risk of invasiveness and metastasis, including hepatocellular carcinoma [37], colorectal cancer, glioblastoma, HNSCC, oral squamous cell carcinoma, lung squamous cell carcinoma, gliosarcoma, breast cancer, and melanoma [35]. Amplification of the CTTN gene and the resulting overexpression of cortactin have been observed in 15% of primary metastatic breast carcinomas and nearly 30% of HNSCCs [33,38]. Cortactin may also be associated with E6/E7 RNA in HR-HPV-associated cervical cancer and is a potential diagnostic biomarker studied by our group. Furthermore, the majority of these highly sensitive techniques have not yet been introduced to clinical practice.
Over the last two decades, a variety of machine learning techniques and feature selection algorithms have been widely applied to determine disease prognosis and predict certain conditions [39]. These techniques are used in conjunction with logistic regression models to assess the importance of various genes. After important genes are identified, the same logistic regression model is then used for cancer classification and risk prediction [39]. Several prediction models are currently widely used in clinical practice, including the model for breast cancer incidence [40,41] and the predictive risk-scoring model for central lymph node metastasis [42]. The prediction model for breast cancer recurrence can be viewed at https://breast.predict.nhs.uk/predict.html (accessed on 8 March 2023). Mathematical models (MMs) are also used to determine the likelihood of relapse and predict responses to chemotherapy among patients with breast cancer [43], as well as to diagnose precancerous cervical lesions and predict progression [44,45].
In the comparison between histopathological method and modelling using multiple biomarkers, this study showed more advantages that can be used to derive scores for risk prediction, not only for the diagnosis of cervical lesion from similar biopsy samples. However, the current practice of histologic grading or using a single molecular biomarker can facilitate differential diagnosis.
Our study investigated the expression of cortactin in FFPE cervical specimens with diverse lesion grades in combination with other related biomarkers. Biomarkers p16 INK4A and Ki-67 were used as protein biomarker controls during immunohistochemical (IHC) staining. HPV E6/E7 RNA ISH was also used. The relationship between IHC staining and ISH data was evaluated in association with clinical characteristics, and MMs were developed to estimate risk scores using linear regression analysis. Receiver operating characteristic (ROC) curves and areas under the curve (AUC) were used to identify the best MMs. Risk scores from the model were then used to predict the risk of abnormal or precancerous cervical lesion progression and may have utility in other cancer contexts in the future.

Specimens
Three hundred and sixty-three FFPE cervical tissue samples were collected from women who underwent routine cervical cancer testing with a colposcopy at Phayao Hospital, Phayao, Thailand in 2012 (233 samples) and 2013 (130 samples). This work was approved by the Human Research Ethics Committee of the University of Phayao (2/015/59) and Phayao Hospital (HE-59-02-0008). The sample size was calculated according to the known prevalence of HPV in the community as follows: N (case/age group) = Z 2 1−a P(1 − P)/d 2 ". The required number of participants was calculated from a mean ± SD of 52.9 ± 32.1% of HPV prevalence, a Z of 1.96 for the 95% confidence level, and a d of 0.05 [46].
All FFPE cervical tissues were reviewed by two pathologists, and the following histopathological grades were assigned: normal (211 cases), low-grade squamous intraepithelial lesion (LSIL; 65 cases), HSIL (58 cases), and invasive cervical cancer (squamous cell carcinoma [SCC]; 29 cases). The 233 samples collected in 2012 were defined as the "test sample set" and the 130 samples collected in 2013 as the "confirmed sample set" ( Table 1).
The test sample set was used to develop the MM using a linear regression model, and the confirmed sample set was used to test the regression model.

Tissue Microarray (TMA) Preparation
The selected areas of the FFPE cervical tissues were stained with hematoxylin and eosin and graded by a pathologist according to the World Health Organization criteria. Paraffin tissue blocks were made by removing 1.5 mm cores of the tissues and organized into TMAs (Arraymold, Salt Lake City, UT, USA).

HR-HPV E6/E7 RNA Chromogenic ISH
E6/E7 RNA chromogenic ISH was performed using the RNAscope 2.5 HD Detection Kit (BROWN) and Quick Guide for FFPE Tissues (Advanced Cell Diagnostics, Hayward, CA, USA) with specific combinations of E6 or E7 probes to detect 18 different HR-HPV types when low copy target gene expression was anticipated (1-20 copies per cell). The FFPE sections (5 µm) were de-paraffinized through xylene and ethanol washes and treated as follows: pre-treatment 1 (endogenous hydrogen peroxide block solution) for 10 min at RT; pre-treatment 2 for 45 min at 105 • C; and pre-treatment 3 (protease digestion) for 30 min at 40 • C. After the treatments, the sections were rinsed with water. The tissues were hybridized in a hybridization solution with E6/E7 RNA chromogenic ISH probes in a moist chamber and without a cover slip for 2-3 h at 40 • C. Thereafter, the hybridized probe's signal was amplified through the serial application of Amp 1 (pre-amplifier step), Amp 2 (signal enhancer step), Amp 3 (amplifier step), Amp 4 (label probe step), Amp 5, and Amp 6 (signal amplification steps); this was followed by the washing steps. Horseradish peroxidase (HRP) activity was then evaluated through the application of 3, 3 -diaminobenzidine (DAB) for 10 min at RT. The sections were then counterstained with hematoxylin, cleared in xylene, and mounted with Permount. The expression signal data were recorded according to negative and positive staining. The internal controls used for the RNAscope chromogenic ISH were proprietary probes for human sequence ubiquitin C (positive control to demonstrate detectable RNA in the FFPE samples) and Bacillus subtilis (B. subtilis) dapB RNA targets (negative control). Ubiquitin C staining was scored to confirm the presence of the signal and its intensity. B. subtilis dapB staining was reviewed to confirm the absence of staining.

IHC Staining
IHC staining was performed on the TMAs to determine the expression of cortactin, p16 INK4A , and Ki-67. Briefly, following de-paraffinization and re-hydration, the tissue sections on the slides were antigen-retrieved using a target retrieval solution (citrate buffer, pH 6.0) at 105 • C in an autoclave for 30 min. Rabbit monoclonal anti-cortactin antibody clone Ep1922Y (Abcam, Cambridge, MA, USA), mouse monoclonal anti-human p16 INK4A clone D25 (EMD Millipore Corporation, Temecula, CA, USA), and Ki-67 monoclonal antibody clone 20Raj1 (eBioscience TM , Thermo Fisher Scientific, San Diego, CA, USA) were applied at dilutions of 1:200, 1:100, and 1:100, respectively, in 1× phosphate-buffered saline for 60 min. This was followed by incubation with secondary detection antibodies using the Genemed Power-Stain TM 1.0 Poly HRP DAB Kit for Mouse + Rabbit (Sakura Finetek, Torrance, CA, USA). Immunostaining results were evaluated using light microscopy with a 40× objective, and both Allred score (AS; score and intensity of staining) and positive/negative status were recorded.
The criteria for distinguishing positive and negative IHC statuses are shown in Table 2. (1) Staining was assessed as strong positive (block positive) according to the amount of uniform strong positive staining in the cytoplasm and nucleus in~1/3 to 3/3 thickness, signal strength (which would appear as a dark brown color), and diffusion (the signal involved >50% of the epithelium). "Positive" (2) Positive ambiguous results were further grouped into three patterns: (2.1) Strong/basal (strong, diffuse, continuous staining of the lower third of the epithelium without upward extension).
(2.2) Weak/diffuse (weak, diffuse, discontinuous staining reaching at least two third of the epithelium). (2.3) Strong/focal (strong, focal, and discontinuous staining located at any level of the epithelium). "Positive" (3) Negative results were defined as either the total absence of staining or weak, focal, and discontinuous staining. "Negative" Ki-67 Negative Ki-67 staining was defined as either the total absence of staining or weak basal staining. "Negative"

Statistical Analysis
Statistical analysis was performed using SPSS version 16. The correlations between the cervical grade and the protein marker (positive/negative) were evaluated using the Pearson Chi-square test (significance level: p < 0.05). The correlations between the cervical grade and the protein marker (AS) were evaluated using one-way ANOVA (significance level: p < 0.05). The MM was included in the regression analysis (significance level: p < 0.05), and the ROC curves and AUC were evaluated using SPSS.

Baseline Characteristics
A total of 363 FFPE cervical tissue samples were studied. These samples were retrieved in 2012 (233 samples) and 2013 (130 samples) from women aged 19-95 years. Table 1 shows the sample characteristics. The most common age group in both 2012 and 2013 was the 41-50-year age group. LSILs and HSILs were common based on the abnormal histopathological grades.

HR-HPV E6/E7 RNA Chromogenic ISH
Positive E6/E7 RNA signals were mostly present in the cells ( Figure 1). An increase in positive E6/E7 RNA signals was associated with a severe grade of cervical lesions (Table 3). Table 4 shows the sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) of RNA E6/E7 for detecting LSIL+ * and HSIL+ ** compared with normal cervical tissues. An association between p16 INK4A and RNA E6/E7 was found in this study, which was consistent with Zappacosta R. et al. (2013) [32].    Note: * Pearson Chi-Square, ** one-way ANOVA, P/N = positive or negative status, AS = Allred score, SD = standard deviation. Note: PPV = Positive predictive value. NPV = Negative predictive value. LSIL+ * indicates cervical lesion grades of LSIL and more severe (HSIL and SCC). HSIL+ ** was cervical lesion grades as HSIL and SCC.

p16 INK4A and Ki-67 Immunostaining
The expression patterns of p16 INK4a and Ki-67 are shown in Figures 2 and 3, respectively. A positive expression and the AS of p16 INK4a and Ki-67 were significantly associated with increasing severity grades of the cervical lesions, as shown in Table 3. Table 4 shows the sensitivity and specificity of p16 INK4A and Ki-67 for detecting LSIL+ and normal cases and for detecting HSIL+ and normal cases, respectively.

Cortactin Immunostaining
Five expression patterns of cortactin were observed according to their localization, staining, and intensity (

Cortactin Immunostaining
Five expression patterns of cortactin were observed according to their localization, staining, and intensity (  (Table 3). Positive cortactin staining was generally detected with significant differences across different grades of cervical lesions.

Mathematical Models (MM)
The MMs developed were simple linear regression models that included independent variables that consisted of age range (five categories) and the following biomarkers: cortactin, p16 INK4A , Ki-67, and HPV E6/E7 RNA. The coefficients of the independent variables in the best five MMs are shown in Table 5. Table 6 shows the five best MMs used for

Mathematical Models (MM)
The MMs developed were simple linear regression models that included independent variables that consisted of age range (five categories) and the following biomarkers: cortactin, p16 INK4A , Ki-67, and HPV E6/E7 RNA. The coefficients of the independent variables in the best five MMs are shown in Table 5. Table 6 shows the five best MMs used for calculating the expected value (mean ± SD) for each cervical lesion grade shown in Table 7, Supplementary Figure S1 and Supplementary Table S2. The risk score for the prediction of abnormal cervical progression and precancerous lesion was calculated on the basis of the mean ± SD of each cervical lesion grade. Based on the expected values (mean ± SD), none of the models could differentiate between the normal and LSIL cases (p > 0.05); therefore, the normal cases were included with the LSIL cases in the confirmed sample set. In model 3, the risk score for the progression from LSIL was 2.60 (mean ± SD: 1.4843 ± 1.10780). The risk scores for the progression from HSIL (mean ± SD: 3.5374 ± 1.01427) and SCC (mean ± SD: 3.9516 ± 0.41838) ranged from 3.54 to 4.56 and from 3.95 to 4.37, respectively (  Table S3 shows the sensitivity and specificity of the five best models. Models 1-5 yielded a greater association with the variables with the disease outcome (OR) than did the other models. Models 3-5 revealed a great association with the variables with the disease outcome (OR). Models 2 and 3 could not differentiate between HSIL and SCC (p > 0.05). Interestingly, the predictive value of Models 1, 4, and 5 could significantly differentiate (1)      Models 3, 4, and 5 were selected to assess the risk of abnormal cervical lesion progression and precancerous lesions in the confirmed sample set. The expected value (Y) was calculated in each case and compared with the risk score. When Y was equal to or lower than the risk score, the cervical lesions were suggested to have biomolecules characteristic of the baseline (i.e., normal tissues). When Y was higher than the risk score, the individuals were expected to be at risk of progression or to have "risk biomolecules." Such individuals should be monitored. For example, when a histopathological LSIL case was evaluated by Model 3 and showed a predictive value of 1.59 (risk score: <2.60), the presence of LSIL with "baseline characteristic biomolecules" was suggested. However, when a histopathological LSIL case showed a predictive value of 3.02 (risk score: >2.60), it was suggested to be an LSIL case with present "risk biomolecules". For the prediction of precancerous lesions in the normal and LSIL cases, the cases were predicted to have precancerous lesions when Y was higher than the risk score for HSIL (risk score: >3.54).
Supplementary Figures S3 and S4 demonstrate the prediction of cervical lesions using Models 3, 4, and 5. Model 4 showed the highest detection rate of cases with risk biomolecules in the LSIL (23/86 normal + LSIL cases, 26.7%) and HSIL groups (29/34 cases, 85.3%). The traditional histologic grading of the biopsies did not identify the 23 normal and LSIL cases. Without this knowledge, 23 patients would not have undergone close monitoring.
The next best models were Models 5 and 3. Model 4 best predicted the cases with precancerous lesions in the LSIL (5/86 cases, 5.8%) and HSIL groups (24/34 cases, 70.6%), while Models 3 and 5 predicted the cases with precancerous lesions in 3/86 (3.5%) cases in the LSIL group and 20/34 (58.8%) cases in the HSIL group. As shown in Table 8, the risk scores obtained by Models 3, 4, and 5 were suitable for detecting abnormal cervical lesions among patients in the LSIL group and for determining the risk of LSIL and HSIL. The ROC curve and AUC of Model 4 were significantly higher than those of Models 3 and 5 (p = 0.000) in terms of predicting the histopathological normal and LSIL cases with risk biomolecules and precancerous lesions ( Supplementary Figures S3 and S4). In the comparison between the sensitivity and specificity of Models 3 to 5 to distinguish between normal tissue and LSIL+HSIL in the confirmed sample set (Supplementary Figure S3), the AUC values for predicting risk biomolecules were 0.757, 0.793, and 0.751, respectively. In the comparison between the sensitivity and specificity of Models 3 to 5 to distinguish between LSIL and HSIL in the confirmed sample set (Supplementary Figure S4), the AUC values for predicting precancerous lesions were 0.777, 0.824, and 0.762, respectively. Table 8. Sensitivity, specificity, PPV, and NPV of the pathological grade in Models 3 to 5 in the confirmed sample set.

Discussion
As previously reported, atypical cervical cells slowly grow and progress to precancerous lesions over a period of 10-20 years. Patients with these abnormal cells need to be monitored closely to prevent cervical cancer. In low-income countries, including Thailand, it is difficult to monitor these patients since they are usually lost in the follow-up. In this study, we were able to collect data from the initial presentations of our patients; however, we were unable to obtain follow-up results. Some of the patients might be at risk of developing cervical cancer.
Many studies have reported the clinical significance of p16 INK4A and Ki-67 expression as risk factors for cervical cancer. However, to date, no biomarkers have accurately predicted the progression of abnormal cervical cells and the development of precancerous lesions. The present study aimed to develop MMs and risk scores using a new biomarker, cortactin, combined with p16 INK4A , Ki-67, and HPV mRNA. We intended to identify the best MM and risk score to predict the progression from normal cervical tissues to LSILs and HSILs and the risk of developing cervical precancerous lesions. We found that the sensitivity, specificity, PPV, and NPV of p16 INK4A /Ki-67 were 68%, 69%, 61%, and 74% for detecting LSIL+ and 92%, 68%, 91%, and 96% for detecting HSIL+, respectively. These results are comparable to those of other studies. Li and colleagues found that the sensitivity, specificity, PPV, and NPV of p16 INK4A /Ki-67 FFPE were 94%, 88%, 69%, and 98% for CIN 2+ detection, respectively, and 84%, 96%, 88%, and 96%, respectively, for CIN 3+ detec-tion [51,52]. Among women with CIN 2, positive IHC staining for p16 INK4A and Ki-67 was strongly associated with disease progression [53].
Cortactin can promote cell migration, cell mortality, and tumor invasiveness in melanoma, colorectal cancer, and glioblastoma [33,34], and its expression was demonstrated to be significantly associated with poorer survival rates in patients with OSCC [54][55][56]. Meta-analyses have concluded that an overexpression of p16 INK4A [57,58] in cervical cancer relates to increased overall and disease-free survival rates, which differs from the function of cortactin. We found that cortactin staining (Table 3) might be a useful molecular diagnostic aid for cervical cancer screening, based on its sensitivity and specificity. However, the cellular functions of cortactin in cervical cancer require further investigation. The abnormal expression of cortactin was manifested both in intensity and localized distribution ( Table 2). Correspondingly, a study of invasive and metastatic melanomas showed cortactin expression with a high density of (very strong) expression in SCC of 83% [59]. However, different distribution patterns of cortactin were also seen, such as in cases of nevi. This study reported that weak staining with low intensity was evenly distributed in the cytoplasm in normal nevi tissue and that strong staining was found in the cytoplasm of high-grade lesions. In contrast, strong staining was accentuated in the cell's periphery in most melanomas. This was also seen in cultured melanoma cells, in which cortactin was distributed in the membrane ruffles and lamellipodia [59]. Therefore, the level of protein expression and the distribution of cortactin may reflect the abnormal upregulation of protein expression. The expression of cortactin in cervical cancer, which is reported for the first time by our group, may act as a biomarker for cervical cancer progression.
An increased expression of HR-HPV E6 and E7 correlates with the progression to high-grade lesions [60] and eventually to carcinoma in situ. These oncoproteins have been shown to induce abnormal chromosome copy numbers and miRNA expression in infectious processes [12][13][14][15][16]. The detection of HPV E6/E7 RNA was combined with assays of biomarkers of human DNA, RNA, or protein for the diagnosis and prediction of abnormal cervical lesions. The sensitivity and specificity of HPV E6/E7 RNA for detecting high-grade cytology (CIN 2) were 71.4% and 75.8% [12][13][14][15][16] [59]. Herein, the sensitivity and specificity of HPV E6/E7 RNA were 88% and 54% for predicting LSIL+, and they were 93% and 44% for predicting HSIL+, respectively ( Table 4). The presence of HPV E6/E7 RNA was associated with the future development of CIN 2+ among women with LSIL [60]. Moreover, the higher specificity (54% for LSIL+ and 44% for HSIL+) and NPV (81% for LSIL+ and 93% for HSIL+) of HPV E6/E7 mRNA testing are valuable in predicting clinically insignificant HPV DNA infections and helping to avoid aggressive procedures (biopsies and over-referral for transient HPV infections), as well as for reducing patients' anxieties and frequencies of follow up [18,61].
Several prediction models are currently widely used in clinical practice, including the model for breast cancer incidence, the Adjuvant Online Decision Aid [41,44,62] and that from http://www.predict.nhs.uk/predict.html (accessed on 8 March 2023), which uses MMs to determine the likelihood of relapse and to predict responses to chemotherapy for breast cancer [41]. Three of our five best MMs were evaluated using the confirmed sample set; Model 4. with risk scores of >2.60 and >3.62. showed the highest sensitivity for predicting risk biomolecules in the normal and LSIL cases and precancerous lesions, respectively.
Our model also suggested that 13-14/34 (38-41%) cases of HSIL with "risk biomolecules" (3.95-4.23) might progress to cervical cancer. This is in broad agreement with the findings by Austin (2020), wherein they determined that only around 30% of CIN 3 lesions would progress to cervical cancer in 30 years [65]. However, this study found that slides suffer from issues such as the positions of the biopsies.
Wu et al. (2021) validated a prediction model in two cohorts in China with a followup duration of 3 years. In the first cohort, 42 cases were diagnosed as CIN 2+, with thirty-seven cases predicted to progress and five cases to not progress. In the second cohort, 28 cases were diagnosed as CIN 2+, with 11 cases predicted to progress and 17 cases to not progress [66]. Although this is a starting point for research using machine learning, our study demonstrates that machine-learning-based algorithms using input data from the expression levels of multiple biomarkers have potential for diagnosing and predicting disease progression [67,68] and consequently for solving health problems currently considered unsolvable, such as cancer.

Conclusions
MM-based analysis of the expression levels of multiple biomarkers, including p16 INK4A , Ki-67, cortactin, and HPV E6/E7 RNA, can provide a risk score for predicting the progression of abnormal cervical cells and the development of precancerous lesions in patients with normal histology and LSILs. For example, the relevant equation (Model 4) was Y = 0.535 + 0.387 (Ki-67 AS ) + 0.142(p16 INK4A AS ) + 0.530(cortactin P/N ) + 0.506(RNA E6/E7 P/N ) − 0.786 (Ki-67 P/N ). These results suggest that monitoring patients with MMbased analyses of multiple biomarkers could help physicians design optimal therapeutic strategies and help predict cancer progression in the future.