Next Article in Journal
P21 Overexpression Promotes Cell Death and Induces Senescence in Human Glioblastoma
Next Article in Special Issue
Utilization of a Machine Learning Algorithm for the Application of Ancillary Features to LI-RADS Categories LR3 and LR4 on Gadoxetate Disodium-Enhanced MRI
Previous Article in Journal
Beta Blockers with Statins May Decrease All-Cause Mortality in Patients with Cardiovascular Diseases and Locally Advanced Unresectable Non-Small-Cell Lung Cancer after Chemoradiotherapy
Previous Article in Special Issue
Chimp Optimization Algorithm Influenced Type-2 Intuitionistic Fuzzy C-Means Clustering-Based Breast Cancer Detection System
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Improving the Post-Operative Prediction of BCR-Free Survival Time with mRNA Variables and Machine Learning

by
Autumn O’Donnell
1,*,
Eric Wolsztynski
1,2,†,
Michael Cronin
1 and
Shirin Moghaddam
3,†
1
School of Mathematical Sciences, Western Gateway Building, Western Road, University College Cork, T12 XF62 Cork, Ireland
2
Insight SFI Centre for Data Analytics, Western Gateway Building, Western Road, University College Cork, T12 XF62 Cork, Ireland
3
Department of Mathematics and Statistics (MACSI), University of Limerick, V94 T9PX Limerick, Ireland
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work as senior authors.
Cancers 2023, 15(4), 1276; https://doi.org/10.3390/cancers15041276
Submission received: 31 January 2023 / Revised: 11 February 2023 / Accepted: 13 February 2023 / Published: 17 February 2023
(This article belongs to the Collection Artificial Intelligence and Machine Learning in Cancer Research)

Abstract

:

Simple Summary

Prostate cancer is among the most prevalent cancers for men globally, accounting for 13% of cancer diagnoses in the male population each year. Surgical intervention is the primary treatment option but fails in up to 40% of patients, who experience biochemical recurrence (BCR). Determining the likelihood of recurrence and the length of time between surgery and BCR is critical for patient treatment decision-making. Traditional predictive models exploit routine clinical variables such as cancer stage, and may be improved upon by leveraging other accessible information about the patient. This study considers including patient-specific genomic data to identify relevant additional predictors of BCR-free survival, which requires the use of modern machine learning techniques. The results of this study indicate that including such genomic data leads to a gain in BCR prediction performance over models using clinical variables only.

Abstract

Predicting the risk of, and time to biochemical recurrence (BCR) in prostate cancer patients post-operatively is critical in patient treatment decision pathways following surgical intervention. This study aimed to investigate the predictive potential of mRNA information to improve upon reference nomograms and clinical-only models, using a dataset of 187 patients that includes over 20,000 features. Several machine learning methodologies were implemented for the analysis of censored patient follow-up information with such high-dimensional genomic data. Our findings demonstrated the potential of inclusion of mRNA information for BCR-free survival prediction. A random survival forest pipeline was found to achieve high predictive performance with respect to discrimination, calibration, and net benefit. Two mRNA variables, namely ESM1 and DHAH8, were identified as consistently strong predictors with this dataset.

Graphical Abstract

1. Introduction

Radiation prostatectomy (RP) is the primary treatment for prostate cancer [1]. Approximately 30% of patients undergo RP [2], of which up to 70% remain disease-free after 10 years [3], while 20–40% fail and experience biochemical recurrence (BCR) [4]. In the latter case, the time from surgical intervention until BCR varies across patients, with over 60% of BCR cases occurring in the first two years post-operatively [5]. The probability of BCR-free survival is a widely investigated topic as it plays a role in patient treatment decision making [6,7]. The probability of recurrence can be used to determine if and when adjunctive therapies, such as radiotherapy, hormone therapy, and chemotherapy, should be initiated [8]. Current state-of-the-art nomograms such as that produced by the Memorial Sloan Kettering Cancer Centre (MSKCC) use routine clinical variables to make the prediction [9].
Originally developed by Kattan et al. in 1999, a post-operative nomogram using prostate specific antigen (PSA) at diagnosis, Gleason sum score, prostatic capsular invasion, surgical marginal status, seminal vesicle invasion, and lymph node status are used to make prediction of BCR at seven years [10] with updates on this in 2005 and 2009 [11,12]. Other models such as the Walz et al., 2009, are designed for prediction at 2-years post-operatively [5]. Two potential limitations of these reference nomograms are the modelling structure they are based on and the nature of the variables used. Nomograms can predict only for the probability of an event occurring at a specific point in time; the original Kattan nomogram [10], for example, considers a 7-year post-RP horizon. This limits the ability to make treatment decisions at different time-points. Both nomograms also consider only clinical variables, although further information may become routinely available for prediction. This study considered models that allowed for less stringent assumptions on the variable effects, and the inclusion of a number of genetic variables, to overcome these limitations.
The Cox proportional hazard model [13] is a statistical approach of reference for the analysis of censored survival data. Two key assumptions of this model that may be unrealistic are that the independent risk predictors interact linearly in quantifying the relative risk of patients, and that all hazards are proportional within the cohort. Any continuous predictor variable with a non-linear relationship to the risk would need to be transformed to fit the model. However, this would require prior knowledge of the distribution of all variables to be considered for inclusion [14]. Furthermore, such traditional regression models are prone to overfitting, in that the addition of (possibly redundant) variables to the model can lead to an improved fit whilst reducing the generalisability of the model to unseen patients [14,15]. Other important limitations of Cox models in the context of modern, high-dimensional clinical datasets, are their inability to adjust for collinearity (i.e., redundancy) in the feature set, and their non-applicability to datasets containing more candidate predictors than observations [15,16].
Machine learning methodologies offer solutions to overcome these limitations. Model regularisation can be applied to traditional models including Cox models [17] to handle collinearity and reduce the potential of overfitting, yielding the likes of the Cox lasso and Cox ridge regression models [15]. Gradient boosting may also be applied to Cox modelling in order to mitigate the risk of overfitting, and so, is naturally applicable in high-dimensional environments [18,19]. Other alternative models include the random survival forest, a non-parametric, tree-based strategy that does not impose linearity constraints and thus allows the inclusion of covariates with associations that are not linear in nature [20].
Improvements may be achieved also in terms of the nature of the information used in current clinical nomograms, since these use routine clinical variables only. Over the past 15 years, the utility of gene expression biomarkers has been widely researched and demonstrated [21,22,23,24,25]. The assessment of mRNA variables has become more cost-reasonable and feasible, offering new opportunities to improve on current models. Currently, mRNA data typically becomes available post-operatively when clinicians have large biological samples of diseased tissue to extract the genetic component from. The primary goal of this study is, firstly, to determine whether the inclusion of mRNA variables can lead to an improvement in the predictive performance of models determining time-to-BCR, and secondly, to identify relevant statistical or machine learning methodologies to perform this prediction.

2. Materials and Methods

2.1. Data

The dataset used for this analysis was developed in a study by Taylor et al. and contains a total of 232 patients [25] and is comprised of 2 parts, i.e., a set of routine clinical variables and another of messenger ribonucleic acid (mRNA) variables. Following pre-processing for duplicates and missing information, a final 187 patients with complete information including BCR-free duration were retained. BCR in this dataset, is defined as PSA ≥ 0.2 ng/mL on two occasions post-operatively [25]. Any records of a Gleason score of 5 were upgraded to a score of 6 per the current grading system [26]. Of these 187 remaining patients, 133 had mRNA information, and so any mRNA analysis was carried out on this sub-cohort. Although the primary and secondary Gleason scores were available for all remaining individuals, they were not considered for use in our model pipelines as their combined information is already captured in the combined Gleason sum score. Pre-operative variables such as clinical stage and biopsy Gleason score were also excluded as these are seen as approximations of the pathological variables available for analysis in the post-operative setting. As extracapsular extension (ECE), seminal vesicle invasion (SVI), and lymph node involvement (LNI) were included in the modelling, the pathological stage was also removed as the combined effect of these variables should account for all of its effect [27]. All of the aforementioned variable exclusions are in line with the method of the original Kattan nomogram and the current, state-of-the-art MSKCC nomogram [9,10].
The final set of routine clinical variables consisted of both continuous and categorical features. The continuous variables were PSA (mean = 10.4, standard deviation (SD) = 16.13) and age (mean = 58.1, SD = 6.67) at diagnosis and the categorical variables considered for inclusion in a model can be seen in Table 1 along with their descriptive statistics. The set of mRNA variables was not pre-processed, and was comprised of over 20,000 features.

2.2. Statistical Analysis

Four distinct predictive modelling strategies were considered: the traditional Cox proportional hazard (CPH) model [13]; an L 1 -penalised regularisation of this model named LASSO Cox [28]; a gradient boosting approach consisting of a sequential ensemble of CPH models, named Boosted Cox [29]; and a random survival forest (RSF) [20], as an alternative to Cox-based modelling of the risk of BCR. Each modelling approach was implemented with either routine clinical variables only, or with a combination of clinical and mRNA variables.
Before inclusion, the set of mRNA variables was reduced using pre-filtering and feature selection techniques outlined hereafter, applied either separately or combined, prior to model building. A correlation-based pre-filter was considered in order to remove variables with a high Pearson correlation with other variables from the dataset in order to prevent multicollinearity, using an absolute cut-off value of 0.6. This cut-off was selected as it had previously been shown to be optimal by Goh et al. on similar data [30]. The feature selection technique consisted of performing univariate Cox PH analysis on each variable separately and ordering the output with respect to their adjusted log-rank score, whilst also adjusting all p-values for false discovery rate (FDR) [31]. The 50 variables with an FDR-adjusted p-value less than 5% and yielding highest log-rank scores were retained for inclusion for multivariate modelling. This method was implemented in line with that outlined by Beer et al. on their assessment of gene expression and the prediction of lung cancer [32]. Such multiple testing methods can be prone to high false positive rates [33], and FDR p-value correction is a popular method used to mitigate this risk. All traditional CPH models underwent further feature selection by means of a forward stepwise selection with a maximum of 10 variables chosen for the models with mRNA variables. All RSF models underwent recursive feature elimination [34] with the selected number of variables being evaluated based on the model yielding the highest concordance index (C-index), which is defined as the proportion of all usable pairs of patients whose predicted and observed outcomes are concordant [35,36].
R (version 2022.07) [37] was used for all statistical and machine learning analyses. The regularisation parameter ( λ ) was selected using cross-validation for the LASSO Cox models [38], using a final value for λ corresponding to one standard deviation of the cross-validated error, as a standard approach to achieve parsimony [34]. Hyperparameters for the boosted Cox and RSF models were kept to default settings [29,39]. Thus, boosted Cox models contained 100 boosting iterations and a fixed shrinkage parameter of 0.1. The RSF models were built using 500 trees, considering P variables at each split (where P denotes the number of features considered for modelling) and with a minimum size of 15 for the terminal nodes.

2.3. Evaluation Framework

The models were assessed and bench-marked via bootstrapping, using 100 bootstrap resamples for model building and the corresponding out-of-bag (OOB, i.e., test) points for independent testing. Thus, training samples of size N = 133, sampled from the original dataset with replacement, were used. Those observations not used in the training sample were used in the independent test samples, each of size n = 49. The predictive performance of the models was assessed in terms of discrimination and calibration. In this study, discrimination refers to the model’s ability to correctly rank patients relative to each other, using a bootstrap-corrected c-index and corresponding confidence intervals (CI), in order to obtain unbiased estimates of external predictive discrimination [40]. Calibration analysis provides a way to assess model bias for various risk profiles, by comparing the observed and predicted probabilities [41]. Here, two approaches to calibration were implemented within the bootstrapping framework. The method of Harrell et al. compares quintiles of the predicted survival of patients at a specific time point to their Kaplan–Meier estimates [40]. Two- and five-year BCR-free survival endpoints were considered in keeping with calibration of the reference MSKCC nomogram. The second method involves the comparison of quintiles of the predicted overall BCR-free survival probability with their observed Kaplan–Meier estimates [42]. This second method allows for inspection of the model’s ability to predict across all time points, as opposed to the 2-year or 5-year survival horizons.
Receiver operating characteristic (ROC) curve and decision curve analysis (DCA) were also assessed at both 2-year and 5-year horizons to further assess the predictive performance. ROC analysis was summarised in terms of the corresponding bootstrapped area under the ROC curve (AUC) values calculated on the OOB points [36,43]. DCA [44] provided complementary assessment in terms of the clinical value and potential net benefit to patients of the models’ predictions.
In complement to predictive performance, the feature selection stability of the best-performing modelling pipelines across the 100 bootstraps was assessed in terms of the overall feature selection rates, and measured using the Jaccard, Davis, and Dice indices [45,46]. These indices range between 0 and 1, a higher index value indicating a more stable model [47]. Further, individual feature selection rates were analysed to extract information about potentially relevant predictors for BCR.

3. Results

The results from the evaluation of the predictive pipelines described in Section 2.2 using the frameworks outlined in Section 2.3 are reported in this section.

3.1. Discrimination

The models’ discriminatory performance based on the bootstrapped C-indices are shown in Table 2 along with their corresponding confidence intervals. The boxplots of the performing pipelines are shown in Figure 1.
Pairwise two-sided Wilcoxon signed-rank tests of the null hypothesis of no difference in the C-index distributions reported in Table 2 were carried out on all pairs of the models, in order to compare the discriminative performance of the pipelines relative to each other, with FDR correction for p-values [31]. The traditional CPH model and the RSF using only clinical variables outperformed the MSKCC nomogram (p-value < 0.001). No significant difference was found between the MSKCC and either the LASSO or Boosted models (p-value = 0.210 and 0.915, respectively). There was also no significant difference between the traditional CPH model and the RSF (p-value = 0.280). These tests also indicated that at least one of the models that included mRNA variables outperformed each of the clinical-only models. Furthermore, the inclusion of mRNA variables yielded improved performance for the CPH, Boosted Cox, and RSF models (all p-values < 0.001), but not for the LASSO Cox model (all p-values > 0.300). For all modelling strategies, the pipeline yielding the highest bootstrapped C-index was identified as the best-performing pipeline (shown in bold in Table 2) and retained for further analyses.

3.2. Calibration

The MSKCC nomogram, the current state-of-the-art post-operative model, is designed with a 2-year or 5-year horizon as the predictive endpoints, and all models were tested for both 2- and 5-year survival calibration, as shown in Figure 2. For this data, the MSKCC nomogram showed reasonable calibration for both these endpoints. Some under-prediction occurred for those with low observed survival and some over-prediction occurred for those with higher observed survival. For overall survival, however, the nomogram appeared mis-calibrated for this data, yielding under-prediction at all time points of reference.
In the clinical-only models, overall survival was slightly over-predicted but with much closer adherence to the 45-degree line overall (thus, closer to the observed values) than the MSKCC model. All clinical-only models also showed reasonable calibration at the 2- and 5-year endpoints. It can be noted that in both 2- and 5-year endpoint curves the LASSO and Boosted Cox models yielded very little separation between the top four quintiles.
As for calibration of the best-performing mRNA models; RSF, Boosted, and LASSO models all yielded similar calibration curves with close adherence to the 45-degree line for overall survival as well as 2- and 5-year survival. The CPH model, however, showed poorer calibration for all three endpoints. Thus, optimisation of the CPH model with respect to the C-index, with the inclusion of mRNA variable, led to a decrease in model calibration.

3.3. Prediction Performance

ROC and decision curve analysis (DCA) were also both carried out for the 2- and 5-year survival endpoints, with the output shown in Figure 3. Optimal out-of-bag prediction [48] was achieved by the RSF at the 2-year horizon (AUC = 0.824, CI of 2000 bootstrap resamples = (0.716, 0.933)) and 5-year horizon (AUC = 0.772, CI = (0.651, 0.893)). DCA indicated that only the RSF model showed a continued net benefit across all threshold probabilities for both the 2- and 5-year endpoints. All models improved upon the “Treat All” method, but none of the other methodologies showed benefit at 2 years. At 5 years both the LASSO and Boosted Cox models retained some net benefit across all thresholds but were below that of the RSF model.

3.4. Model Stability and Feature Selection

Overall model stability was evaluated on all clinical-only pipelines and on the best performing pipelines for each modelling strategy when including the entire set of over 20,000 mRNA variables, identified in Table 2 in terms of their C-index performance. A first assessment consisted of statistical summaries of the sizes of the final predictive subsets selected by each pipeline, reported in Table 3. All clinical-only pipelines showed little deviation around a typical 5-variable model structure. In terms of the make-up of these models, however, the LASSO and Boosted Cox pipelines selected 14 clinical variables across all bootstraps, compared to 8 for traditional CPH and the RSF (Table 3), suggesting higher stability of the latter 2 for the clinical-only models. When including mRNA data, all models typically tended to contain around 10 to 16 variables for prediction. Remarkably, the typical final feature set size of Cox-based mRNA-inclusive pipelines had a coefficient of variation (CoV) either lower than, or comparable to that of clinical-only pipelines, indicating reasonably low variability in final model sizes for mRNA-inclusive models. An increase in CoV, however acceptable, was noted for RSF after introduction of mRNA information. This first analysis thus suggested that mRNA-inclusive pipelines tended to include a reasonable number of variables in their models.
Model stability was further measured by considering the overall individual feature selection rates across bootstraps, as well as on the basis of three overall stability indices [47], namely the Jaccard, Davis, and Dice indices, as shown in Figure 4. The Davis index in Figure 4A confirmed increased stability of the traditional Cox and RSF pipelines over the LASSO and Boosted Cox alternatives for clinical-only modelling. As for mRNA-inclusive pipelines, the comparison of metrics shown in Figure 4A with those of Figure 4B indicated lower stability, as would be expected given the sheer size of the genomic dataset. Finally, RSF outperformed the CPH pipeline when using mRNA, as shown in Figure 4B. In complement, the traces of overall feature selection rates shown in Figure 4C,D illustrate the tendency of RSF pipelines to yield higher model stability and feature selection rates overall compared to the LASSO and Boosted Cox models, whether using clinical-only or combined clinical and mRNA information.
In summary, this set of results clearly indicates that the RSF is a more stable pipeline than the LASSO and Boosted Cox alternatives. Whether considering clinical-only or clinical and mRNA features, the RSF yielded the most stable pipeline overall in the latter case. Moreover, the RSF and Boosted Cox methodologies, which were the most competitive overall, do not rely on a proportional hazard assumption. This may suggest that the Cox PH assumption is a possible limitation for the inclusion of mRNA information. Similarly, RSF does not constrain to linear interactions between the predictors, which may be a reason for its overall higher predictive performance. The overall lower performance seen with LASSO Cox, on the other end, may be due to the inability of this model to cope well with multi-collinearity or confounding, both of which are possibly present among the mRNA feature set. The differences in performance observed between the models must be confirmed in independent studies before any reliable insights can be derived on the specific aspects behind the performance of each model.
Overall, the pathological Gleason Score was the single-most prevalent predictor across all pipelines (Figure 4C,D; Table 4). It was the only variable that was selected more than 50% of the time in conventional Cox modelling. The other three modelling approaches had more stable individual feature selection rates. Figure 4C illustrates the consistent tendency by all clinical-only models to use pathological Gleason score and SVI in particular. This trend remained present in LASSO and Boosted Cox models following the introduction of mRNA data (Figure 4D and Table 4), whereas RSF tended to use LNI and PSA mainly alongside the Gleason score. Finally, two mRNA variables, namely DNAH8 and ESM1, appeared consistently in a number of pipelines. Both of these have been noted in other reports for their association with prostate cancer. DNAH8 has been shown to be a marker for poor prognosis in prostate cancer [49]. Elevated ESM1 has been shown to be associated with prostate cancer progression and metastasis [50].

4. Discussion

4.1. Key Findings

In this study, the machine learning pipelines developed for prediction of BCR-free survival using Cox proportional hazards and random survival forest models using clinical variables only outperformed the reference MSKCC nomogram with respect to model discrimination and calibration. This is in line with previous findings by Jeffers et al. where their RSF model with clinical variables has only been shown to improve on the Kattan nomogram [51]. Additionally, predictive performance of the models was enhanced by the addition of mRNA variables. The three best performing pipelines included both mRNA information and clinical variables, indicating the potential for genetic information to improve the predictive performance when combined with routine clinical variables. The mRNA features most frequently included in the models were DNAH8 (Dynein axonemal heavy chain 8) and ESM1 (Endothelial Cell Specific Molecule 1). The main function of DNAH8 is currently understood to be related to sperm motility [52] and has been shown to be a marker for poor prognosis in prostate cancer [49] and to be associated with metastasis [53]. Elevated ESM1 readings have been shown to be associated with prostate cancer progression and metastasis [50], ESM1 over-expression correlates to a higher Gleason Score [54]. Both of these variables were also found to be useful predictors in the pre-operative setting by this group in other analyses.

4.2. Other Markers of Interest

Four clinical variables were repeatedly retained in the models that included mRNA information; pathological Gleason sum, seminal vesicle invasion, preoperative PSA, and lymph node involvement. PSA was not consistently selected in the Cox-based models, but was in the RSF pipelines, which suggests that this antigen level may interact with predictors in a nonlinear way with respect to the endpoint of interest. None of the mRNA variables identified by Taylor et al. as tumour suppressors [25] were selected in our models. This may indicate a lack of association between tumour suppression and biochemical recurrence. Similarly, none of the variables used in the Decipher classifier for the prediction of metastasis following RP [22] were selected. These mRNA variables and their pathways, which are associated with metastasis, may be different to those involved in biochemical recurrence itself. Other non-genetic markers such as measures of systemic inflammation such as neutrophil-to-lymphocyte ratio (NLR) have also been investigated recently for their potential as predictive markers for BCR. Results on the prognostic potential of NLR for BCR prediction presented in the literature are mixed [55,56,57,58,59,60,61]. As the dataset used in this study did not contain this information, it was not possible to investigate the potential of NLR in conjunction with mRNA variables at this time, and it is therefore an avenue for future work.

4.3. Model Performance and Key Feature Identification

While the traditional CPH model with prefiltered mRNA variables demonstrated the best discrimination performance, its poorer calibration at specific endpoints as well as overall calibration suggests overfitting. On the other end, the best performing RSF-based pipeline outperformed other modelling strategies in terms of discrimination and calibration for 2-year, 5-year, and overall BCR-free survival prediction. The RSF model was also shown to have the highest AUC at both 2-year and 5-year horizons and gave rise to the most consistent net benefit at these time-points in the decision curve analysis. Among the most common methodologies available for survival analysis in continuous time, it may thus be a more opportune strategy overall compared to Cox-based approaches. Although the machine learning models were tested internally using a bootstrapping framework, these findings require external validation, which was not possible given the relatively small size of this dataset. Further improvement may also be possible using novel deep learning techniques, such as the Cox-based deep neural network recently described by Hao et al., 2021 [62], for application to high-dimensional data with small sample sizes. Experimentation of this recent innovation to BCR prediction for prostate cancer will be done in future work by this group.
Not all mRNA-inclusive pipelines yielded an increase in the predictive performance over their clinical-only counterpart. Whilst including mRNA variables in Boosted Cox models always led to a statistically significant increase in performance, regardless of feature pre-filtering and selection technique used, none of the LASSO Cox alternatives did. As for RSF pipelines, improvement required the combined use of correlation prefiltering and univariate feature selection. These findings are in line with those found by Gade et al. who investigated the inclusion of microRNA and mRNA data by using a bipartite graph as a single predictor [63]. The differences across the pipelines may be due to a large extent to the lack of statistical power associated with the relatively small sample size available, but it also indicates that the model building process is non trivial. Less than 2% of the features available were used at least once by a pipeline within the bootstrapping framework, and a gain in both model stability and predictive performance may be achieved with feature pre-processing strategies that are better suited to the nature of these data.
Moreover, unknown potential predictors of BCR may have been inadvertently removed from the pool of candidates prior to model building, e.g., due to the inability of feature selection methods based on multiple univariate selection to identify variables that may provide additional predictive potential only when used in conjunction with other variables. Other methods of pre-filtering and feature elimination will be investigated in future work including the potential benefit of clustering and other relevant methods such as principal component analysis.

5. Conclusions

The ability to predict BCR-free survival is critical for therapeutic decision-making. After comparing several machine learning methodologies with the reference Memorial Sloan Kettering nomogram, the main finding of this study was that the inclusion of mRNA variables increased the predictive performance. The best-performing pipeline combined correlation based pre-filtering and univariate feature selection of mRNA variables with a random survival forest model. It yielded high discriminative performance while retaining satisfactory calibration at 2-year, 5-year and overall survival, as well as the highest AUC and net benefit across all pipelines benchmarked. mRNA variables ESM1 and DNAH8 were consistently found to be strong predictors for 2-year, 5-year, and overall BCR-free survival.

Author Contributions

Conceptualization, A.O., E.W., and S.M.; methodology, A.O., E.W., and S.M.; software, A.O., E.W., and S.M.; formal analysis, A.O., E.W., and S.M.; investigation, A.O., E.W., and S.M.; resources, E.W., M.C., and S.M.; data curation, A.O. and S.M.; writing—original draft preparation, A.O., E.W., and S.M.; writing—review and editing, A.O., E.W., M.C., and S.M.; supervision, E.W., M.C., and S.M.; project administration, A.O., E.W., M.C., and S.M.; funding acquisition, M.C. and E.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the HEA Human Capital Initiative Pillar 1, 2020 and by Science Foundation Ireland under grant number 12/RC/2289-P2, co-funded under the European Regional Development Fund.

Data Availability Statement

Publicly available datasets were analysed in this study (accessed 14 June 2021). This data can be found here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE21032.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BCRBiochemical recurrence
mRNAMessenger ribonucleic acid
RPRadical prostatectomy
PSAProstate specific antigen
CPHCox proportional hazard
ECEExtracapsular Extension
SVISeminal vesicle invasion
LNILymph node involvement
MSKCCMemorial Sloan Kettering Cancer Centre

References

  1. Sanda, M.G.; Cadeddu, J.A.; Kirkby, E.; Chen, R.C.; Crispino, T.; Fontanarosa, J.; Freedland, S.J.; Greene, K.; Klotz, L.H.; Makarov, D.V.; et al. Clinically localized prostate cancer: AUA/ASTRO/SUO guideline. Part II: Recommended approaches and details of specific care options. J. Urol. 2018, 199, 990–997. [Google Scholar] [CrossRef] [PubMed]
  2. Wang, J.; Xia, H.H.X.; Zhang, Y.; Zhang, L. Trends in treatments for prostate cancer in the United States, 2010–2015. Am. J. Cancer Res. 2021, 11, 2351. [Google Scholar] [PubMed]
  3. John Hopkins Medicine. Prostate Cancer Prognosis. 2022. Available online: https://www.hopkinsmedicine.org/health/conditions-and-diseases/prostate-cancer/prostate-cancer-prognosis (accessed on 30 November 2022).
  4. Tourinho-Barbosa, R.; Srougi, V.; Nunes-Silva, I.; Baghdadi, M.; Rembeyo, G.; Eiffel, S.S.; Barret, E.; Rozet, F.; Galiano, M.; Cathelineau, X.; et al. Biochemical recurrence after radical prostatectomy: What does it mean? Int. Braz. Urol. 2018, 44, 14–21. [Google Scholar] [CrossRef] [Green Version]
  5. Walz, J.; Chun, F.K.H.; Klein, E.A.; Reuther, A.; Saad, F.; Graefen, M.; Huland, H.; Karakiewicz, P.I. Nomogram predicting the probability of early recurrence after radical prostatectomy for prostate cancer. J. Urol. 2009, 181, 601–608. [Google Scholar] [CrossRef] [PubMed]
  6. Shariat, S.F.; Karakiewicz, P.I.; Roehrborn, C.G.; Kattan, M.W. An updated catalog of prostate cancer predictive tools. Cancer 2008, 113, 3075–3099. [Google Scholar] [CrossRef] [PubMed]
  7. Moghaddam, S.; Jalali, A.; O’Neill, A.; Murphy, L.; Gorman, L.; Reilly, A.M.; Heffernan, Á.; Lynch, T.; Power, R.; O’malley, K.J.; et al. Integrating serum biomarkers into prediction models for biochemical recurrence following radical prostatectomy. Cancers 2021, 13, 4162. [Google Scholar] [CrossRef]
  8. Diblasio, C.J.; Kattan, M.W. Use of nomograms to predict the risk of disease recurrence after definitive local therapy for prostate cancer. Urology 2003, 62, 9–18. [Google Scholar] [CrossRef]
  9. Memorial Sloan Kettering Cancer Center. Prostate Cancer Nomograms. 2022. Available online: https://www.mskcc.org/nomograms/prostate (accessed on 30 November 2022).
  10. Kattan, M.W.; Wheeler, T.M.; Scardino, P.T. Postoperative nomogram for disease recurrence after radical prostatectomy for prostate cancer. J. Clin. Oncol. 1999, 17, 1499. [Google Scholar] [CrossRef] [Green Version]
  11. Stephenson, A.J.; Scardino, P.T.; Eastham, J.A.; Bianco Jr, F.J.; Dotan, Z.A.; DiBlasio, C.J.; Reuther, A.; Klein, E.A.; Kattan, M.W. Postoperative nomogram predicting the 10-year probability of prostate cancer recurrence after radical prostatectomy. J. Clin. Oncol. 2005, 23, 7005. [Google Scholar] [CrossRef] [Green Version]
  12. Kattan, M.W.; Vickers, A.J.; Yu, C.; Bianco, F.J.; Cronin, A.M.; Eastham, J.A.; Klein, E.A.; Reuther, A.M.; Edson Pontes, J.; Scardino, P.T. Preoperative and postoperative nomograms incorporating surgeon experience for clinically localized prostate cancer. Cancer 2009, 115, 1005–1010. [Google Scholar] [CrossRef] [Green Version]
  13. Cox, D.R. Regression models and life-tables. J. R. Stat. Soc. Ser. (Methodol.) 1972, 34, 187–202. [Google Scholar] [CrossRef]
  14. Kattan, M.W.; Eastham, J.A.; Stapleton, A.M.; Wheeler, T.M.; Scardino, P.T. A preoperative nomogram for disease recurrence following radical prostatectomy for prostate cancer. JNCI J. Natl. Cancer Inst. 1998, 90, 766–771. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  15. Tibshirani, R. The lasso method for variable selection in the Cox model. Stat. Med. 1997, 16, 385–395. [Google Scholar] [CrossRef]
  16. Sohn, I.; Kim, J.; Jung, S.H.; Park, C. Gradient lasso for Cox proportional hazards model. Bioinformatics 2009, 25, 1775–1781. [Google Scholar] [CrossRef] [Green Version]
  17. Joffe, E.; Coombes, K.R.; Qiu, Y.H.; Yoo, S.Y.; Zhang, N.; Bernstam, E.V.; Kornblau, S.M. Survival prediction in high dimensional datasets–Comparative evaluation of lasso regularization and random survival forests. Blood 2013, 122, 1728. [Google Scholar] [CrossRef]
  18. De Bin, R. Boosting in Cox regression: A comparison between the likelihood-based and the model-based approaches with focus on the R-packages CoxBoost and mboost. Comput. Stat. 2016, 31, 513–531. [Google Scholar] [CrossRef] [Green Version]
  19. Binder, H.; Schumacher, M. Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models. BMC Bioinform. 2008, 9, 14. [Google Scholar] [CrossRef] [Green Version]
  20. Ishwaran, H.; Kogalur, U.B.; Blackstone, E.H.; Lauer, M.S. Random survival forests. Ann. Appl. Stat. 2008, 2, 841–860. [Google Scholar] [CrossRef]
  21. Sager, M.; Yeat, N.C.; Pajaro-Van der Stadt, S.; Lin, C.; Ren, Q.; Lin, J. Transcriptomics in cancer diagnostics: Developments in technology, clinical research and commercialization. Expert Rev. Mol. Diagn. 2015, 15, 1589–1603. [Google Scholar] [CrossRef]
  22. Erho, N.; Crisan, A.; Vergara, I.A.; Mitra, A.P.; Ghadessi, M.; Buerki, C.; Bergstralh, E.J.; Kollmeyer, T.; Fink, S.; Haddad, Z.; et al. Discovery and validation of a prostate cancer genomic classifier that predicts early metastasis following radical prostatectomy. PLoS ONE 2013, 8, e66855. [Google Scholar] [CrossRef] [Green Version]
  23. Cuzick, J.; Berney, D.; Fisher, G.; Mesher, D.; Møller, H.; Reid, J.; Perry, M.; Park, J.; Younus, A.; Gutin, A.; et al. Prognostic value of a cell cycle progression signature for prostate cancer death in a conservatively managed needle biopsy cohort. Br. J. Cancer 2012, 106, 1095–1099. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  24. Klein, E.A.; Cooperberg, M.R.; Magi-Galluzzi, C.; Simko, J.P.; Falzarano, S.M.; Maddala, T.; Chan, J.M.; Li, J.; Cowan, J.E.; Tsiatis, A.C.; et al. A 17-gene assay to predict prostate cancer aggressiveness in the context of Gleason grade heterogeneity, tumor multifocality, and biopsy undersampling. Eur. Urol. 2014, 66, 550–560. [Google Scholar] [CrossRef] [Green Version]
  25. Taylor, B.S.; Schultz, N.; Hieronymus, H.; Gopalan, A.; Xiao, Y.; Carver, B.S.; Arora, V.K.; Kaushik, P.; Cerami, E.; Reva, B.; et al. Integrative genomic profiling of human prostate cancer. Cancer Cell 2010, 18, 11–22. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  26. Kryvenko, O.N.; Epstein, J.I. Prostate cancer grading: A decade after the 2005 modified Gleason grading system. Arch. Pathol. Lab. Med. 2016, 140, 1140–1152. [Google Scholar] [CrossRef] [Green Version]
  27. Wittekind, C.; Asamura, H.; Sobin, L.H. TNM Atlas: Illustrated Guide to the TNM Classification of Malignant Tumours; Wiley-Blackwell: Hoboken, NJ, USA, 2014. [Google Scholar]
  28. Simon, N.; Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J. Stat. Softw. 2011, 39, 1. [Google Scholar] [CrossRef]
  29. Hothorn, T.; Buehlmann, P.; Kneib, T.; Schmid, M.; Hofner, B. mboost: Model-Based Boosting; R Package Version 2.9-7; 2022. Available online: https://cran.r-project.org/package=mboost (accessed on 30 January 2023).
  30. Goh, L.; Song, Q.; Kasabov, N. A novel feature selection method to improve classification of gene expression data. In Proceedings of the Second Conference on Asia-Pacific Bioinformatics, Dunedin, New Zealand, 18–22 January 2004; Volume 29, pp. 161–166. [Google Scholar]
  31. Benjamini, Y.; Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. (Methodol.) 1995, 57, 289–300. [Google Scholar] [CrossRef]
  32. Beer, D.G.; Kardia, S.L.; Huang, C.C.; Giordano, T.J.; Levin, A.M.; Misek, D.E.; Lin, L.; Chen, G.; Gharib, T.G.; Thomas, D.G.; et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med. 2002, 8, 816–824. [Google Scholar] [CrossRef]
  33. Witten, D.M.; Tibshirani, R. Survival analysis with high-dimensional covariates. Stat. Methods Med. Res. 2010, 19, 29–51. [Google Scholar] [CrossRef]
  34. Hastie, T.; R, T.; J, F. The Elements of Statistical Learning; Springer: Berlin, Germany, 2009. [Google Scholar]
  35. Harrell, F.E.; Califf, R.M.; Pryor, D.B.; Lee, K.L.; Rosati, R.A. Evaluating the yield of medical tests. JAMA 1982, 247, 2543–2546. [Google Scholar] [CrossRef]
  36. Hanley, J.A.; McNeil, B.J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982, 143, 29–36. [Google Scholar] [CrossRef] [Green Version]
  37. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2021. [Google Scholar]
  38. Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1–22. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  39. Ishwaran, H.; Kogalur, U. Fast Unified Random Forests for Survival, Regression, and Classification (RF-SRC); R Package Version 2.13.0; 2021. Available online: https://github.com/kogalur/randomForestSRC/issues/ (accessed on 30 January 2023).
  40. Harrell, F.E., Jr.; Lee, K.L.; Mark, D.B. Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat. Med. 1996, 15, 361–387. [Google Scholar] [CrossRef]
  41. Remmers, S.; Verbeek, J.F.; Nieboer, D.; van der Kwast, T.; Roobol, M.J. Predicting biochemical recurrence and prostate cancer-specific mortality after radical prostatectomy: Comparison of six prediction models in a cohort of patients with screening-and clinically detected prostate cancer. BJU Int. 2019, 124, 635–642. [Google Scholar] [CrossRef] [Green Version]
  42. Royston, P. Tools for checking calibration of a Cox model in external validation: Prediction of population-averaged survival curves based on risk groups. Stata J. 2015, 15, 275–291. [Google Scholar] [CrossRef] [Green Version]
  43. Bewick, V.; Cheek, L.; Ball, J. Statistics review 13: Receiver operating characteristic curves. Crit. Care 2004, 8, 1–5. [Google Scholar]
  44. Vickers, A.J.; Elkin, E.B. Decision curve analysis: A novel method for evaluating prediction models. Med. Decis. Mak. 2006, 26, 565–574. [Google Scholar] [CrossRef] [Green Version]
  45. Khaire, U.M.; Dhanalakshmi, R. Stability of feature selection algorithm: A review. J. King Saud-Univ.-Comput. Inf. Sci. 2022, 34, 1060–1073. [Google Scholar] [CrossRef]
  46. Davis, C.A.; Gerick, F.; Hintermair, V.; Friedel, C.C.; Fundel, K.; Küffner, R.; Zimmer, R. Reliable gene signatures for microarray classification: Assessment of stability and performance. Bioinformatics 2006, 22, 2356–2363. [Google Scholar] [CrossRef] [Green Version]
  47. Bommert, A.; Lang, M. stabm: Stability measures for feature selection. J. Open Source Softw. 2021, 6, 3010. [Google Scholar] [CrossRef]
  48. Breiman, L. Out-of-bag estimation. Mach. Learn. 1996, 26, 123–140. [Google Scholar] [CrossRef] [Green Version]
  49. Wang, Y.; Ledet, R.J.; Imberg-Kazdan, K.; Logan, S.K.; Garabedian, M.J. Dynein axonemal heavy chain 8 promotes androgen receptor activity and associates with prostate cancer progression. Oncotarget 2016, 7, 49268. [Google Scholar] [CrossRef] [Green Version]
  50. Pan, K.F.; Lee, W.J.; Chou, C.C.; Yang, Y.C.; Chang, Y.C.; Chien, M.H.; Hsiao, M.; Hua, K.T. Direct interaction of β-catenin with nuclear ESM1 supports stemness of metastatic prostate cancer. EMBO J. 2021, 40, e105450. [Google Scholar] [CrossRef]
  51. Jeffers, A.; Sochat, V.; Kattan, M.W.; Yu, C.; Melcon, E.; Yamoah, K.; Rebbeck, T.R.; Whittemore, A.S. Predicting prostate cancer recurrence after radical prostatectomy. Prostate 2017, 77, 291–298. [Google Scholar] [CrossRef] [Green Version]
  52. Liu, C.; Miyata, H.; Gao, Y.; Sha, Y.; Tang, S.; Xu, Z.; Whitfield, M.; Patrat, C.; Wu, H.; Dulioust, E.; et al. Bi-allelic DNAH8 variants lead to multiple morphological abnormalities of the sperm flagella and primary male infertility. Am. J. Hum. Genet. 2020, 107, 330–341. [Google Scholar] [CrossRef]
  53. Song, F.; Zhang, Y.; Pan, Z.; Hu, X.; Yi, Y.; Zheng, X.; Wei, H.; Huang, P. Identification of novel key genes associated with the metastasis of prostate cancer based on bioinformatics prediction and validation. Cancer Cell Int. 2021, 21, 1–16. [Google Scholar] [CrossRef]
  54. Lai, C.Y.; Chen, C.M.; Hsu, W.H.; Hsieh, Y.H.; Liu, C.J. Overexpression of endothelial cell-specific molecule 1 correlates with Gleason score and expression of androgen receptor in prostate carcinoma. Int. J. Med. Sci. 2017, 14, 1263. [Google Scholar] [CrossRef] [Green Version]
  55. Bahig, H.; Taussky, D.; Delouya, G.; Nadiri, A.; Gagnon-Jacques, A.; Bodson-Clermont, P.; Soulieres, D. Neutrophil count is associated with survival in localized prostate cancer. BMC Cancer 2015, 15, 594. [Google Scholar] [CrossRef] [Green Version]
  56. Jang, W.; Cho, K.; Kim, K.; Yoon, C.; Kang, Y.; Lee, J.; Ham, W.; Rha, K.; Hong, S.; Choi, Y. Prognostic impact of preoperative neutrophil-to-lymphocyte ratio after radical prostatectomy in localized prostate cancer. Prostate Cancer Prostatic Dis. 2016, 19, 298–304. [Google Scholar] [CrossRef]
  57. Kwon, Y.S.; Han, C.S.; Yu, J.W.; Kim, S.; Modi, P.; Davis, R.; Park, J.H.; Lee, P.; Ha, Y.S.; Kim, W.J.; et al. Neutrophil and lymphocyte counts as clinical markers for stratifying low-risk prostate cancer. Clin. Genitourin. Cancer 2016, 14, e1–e8. [Google Scholar] [CrossRef] [Green Version]
  58. Lee, H.; Jeong, S.J.; Hong, S.K.; Byun, S.S.; Lee, S.E.; Oh, J.J. High preoperative neutrophil–lymphocyte ratio predicts biochemical recurrence in patients with localized prostate cancer after radical prostatectomy. World J. Urol. 2016, 34, 821–827. [Google Scholar] [CrossRef]
  59. Poyet, C.; Adank, J.P.; Keller, E.; Mortezavi, A.; Rabgang, T.; Pfister, B.; Sulser, T.; Hermanns, T. Mp66-05 pretreatment systemic inflamatory response parameters do not predict the outcome in men with prostate cancer undergoing radical prostatectomy. J. Urol. 2015, 193, e817. [Google Scholar] [CrossRef]
  60. Sharma, V.; Cockerill, P.A.; Viers, B.R.; Rangel, L.J.; Carlson, R.E.; Karnes, R.J.; Tollefson, M.K. Mp6-05 the association of preoperative neutrophil to lymphocyte ratio with oncologic outcomes following radical prostatectomy for prostate cancer. J. Urol. 2015, 193, e55–e56. [Google Scholar] [CrossRef]
  61. Zhang, G.M.; Zhu, Y.; Ma, X.C.; Qin, X.J.; Wan, F.N.; Dai, B.; Sun, L.J.; Ye, D.W. Pretreatment neutrophil-to-lymphocyte ratio: A predictor of advanced prostate cancer and biochemical recurrence in patients receiving radical prostatectomy. Medicine 2015, 94. [Google Scholar] [CrossRef] [PubMed]
  62. Hao, L.; Kim, J.; Kwon, S.; Ha, I.D. Deep learning-based survival analysis for high-dimensional survival data. Mathematics 2021, 9, 1244. [Google Scholar] [CrossRef]
  63. Gade, S.; Porzelius, C.; Fälth, M.; Brase, J.C.; Wuttig, D.; Kuner, R.; Binder, H.; Sültmann, H.; Beißbarth, T. Graph based fusion of miRNA and mRNA expression data improves clinical outcome prediction in prostate cancer. BMC Bioinform. 2011, 12, 488. [Google Scholar] [CrossRef] [Green Version]
Figure 1. Bootstrap C-Index distributions of the MSKCC nomogram and the most competitive model pipelines using mRNA data. From left to right; MSKCC, Cox model with clinical variables and correlation prefiltered mRNA variables, RSF model with clinical variables and correlation prefiltered and univariate Cox feature selected mRNA, Boosted Cox model with clinical variables and correlation prefiltered mRNA variables, LASSO Cox model with clinical variables and univariate Cox feature selected mRNA.
Figure 1. Bootstrap C-Index distributions of the MSKCC nomogram and the most competitive model pipelines using mRNA data. From left to right; MSKCC, Cox model with clinical variables and correlation prefiltered mRNA variables, RSF model with clinical variables and correlation prefiltered and univariate Cox feature selected mRNA, Boosted Cox model with clinical variables and correlation prefiltered mRNA variables, LASSO Cox model with clinical variables and univariate Cox feature selected mRNA.
Cancers 15 01276 g001
Figure 2. Calibration curves for MSKCC (left), clinical-only models (centre) and best-performing mRNA-based models (right) for (from top to bottom) overall, 2-year and 5-year survival.
Figure 2. Calibration curves for MSKCC (left), clinical-only models (centre) and best-performing mRNA-based models (right) for (from top to bottom) overall, 2-year and 5-year survival.
Cancers 15 01276 g002
Figure 3. (A,B): ROC for 2-year (A) and 5-year (B) horizons. (C,D): DCA curves at the same horizons.
Figure 3. (A,B): ROC for 2-year (A) and 5-year (B) horizons. (C,D): DCA curves at the same horizons.
Cancers 15 01276 g003
Figure 4. (A) Stability indices of the clinical-only models. (B) Stability indices of the top mRNA-inclusive pipelines. (C) Feature selection rates in clinical-only models (a darker shade indicates a higher rate). (D) Feature selection rates in the best-performing mRNA-inclusive pipelines, for features with a selection rate ≥0.2 in at least one of these top pipelines. For comparative purposes, the shade of grey used for ESM1 in RSF corresponds to a 50% selection rate.
Figure 4. (A) Stability indices of the clinical-only models. (B) Stability indices of the top mRNA-inclusive pipelines. (C) Feature selection rates in clinical-only models (a darker shade indicates a higher rate). (D) Feature selection rates in the best-performing mRNA-inclusive pipelines, for features with a selection rate ≥0.2 in at least one of these top pipelines. For comparative purposes, the shade of grey used for ESM1 in RSF corresponds to a 50% selection rate.
Cancers 15 01276 g004
Table 1. Descriptive statistics of the categorical clinical variables.
Table 1. Descriptive statistics of the categorical clinical variables.
VariableN%
BCR-free Survival
Censored13572
BCR5228
Ethnicity
White14779
Black3016
Asian42
Unknown63
Pathological Gleason Score
65228
710355
8179
9158
Extracapsular Extension (ECE)
Absent5228
Present13572
Seminal Vesicle Invasion (SVI)
Negative15683
Positive3117
Lymph Node Involvement (LNI)
Normal14075
Abnormal1910
Not Done2815
Surgical Marginal Status (SMS)
Negative13773
Positive5027
Table 2. Bootstrapped C-indices. The top performing pipelines for each statistical model (namely CPH, LASSO Cox, Boosted Cox, and RSF) are highlighted in bold.
Table 2. Bootstrapped C-indices. The top performing pipelines for each statistical model (namely CPH, LASSO Cox, Boosted Cox, and RSF) are highlighted in bold.
ModelC-Index95% Confidence Interval
Memorial Sloan Kettering (MSKCC)0.770(0.675, 0.844)
Clinical Variables Only
Cox0.798(0.724, 0.857)
LASSO0.761(0.710, 0.824)
Boosted0.770(0.709, 0.829)
RSF0.795(0.727, 0.852)
Clinical Variable and Correlation Prefiltered mRNA variables
Cox 0.828 (0.720, 0.909)
LASSO0.762(0.696, 0.826)
Boosted0.816(0.757, 0.866)
RSF0.749(0.695, 0.811)
Clinical Variables and Univariate Cox Feature Selected mRNA variables
Cox0.765(0.688, 0.827)
LASSO0.768(0.703, 0.835)
Boosted0.798(0.737, 0.845)
RSF0.782(0.730, 0.846)
Clinical Variables and Correlation Prefiltered and Univariate Cox Feature Selected mRNA variables
Cox0.792(0.671, 0.865)
LASSO0.763(0.464, 0.918)
Boosted0.814(0.753, 0.863)
RSF0.821(0.747, 0.874)
Table 3. Summaries of final predictive model sizes (i.e., number of variables used) selected within the clinical-only models and best-performing pipelines, in terms of their mean, median, standard deviation, and coefficient of variation (CoV), as well as the overall number of features selected at least once across 100 bootstraps.
Table 3. Summaries of final predictive model sizes (i.e., number of variables used) selected within the clinical-only models and best-performing pipelines, in terms of their mean, median, standard deviation, and coefficient of variation (CoV), as well as the overall number of features selected at least once across 100 bootstraps.
MeanMedianSDCoVOverall
Clinical-only
Cox4.5551.25027.5%8
LASSO3.5131.54143.9%14
Boosted5.1050.87017.1%14
RSF4.9651.52430.7%8
Top performing pipelines
Cox9.46100.9269.8%623
LASSO11.88123.89132.8%343
Boosted14.98153.05220.4%323
RSF17.85169.05750.7 %335
Table 4. Selection rates of all variables with a selection rate ≥ 50% in at least one of the best-performing mRNA-inclusive pipelines.
Table 4. Selection rates of all variables with a selection rate ≥ 50% in at least one of the best-performing mRNA-inclusive pipelines.
Pathological Gleason Sum ScoreLNISVIPSADNAH8ESM1
Cox56%16%5%4%9%10%
LASSO60%9%64%34%55%29%
Boosted67%2%65%34%64%62%
RSF86%88%31%62%59%50%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

O’Donnell, A.; Wolsztynski, E.; Cronin, M.; Moghaddam, S. Improving the Post-Operative Prediction of BCR-Free Survival Time with mRNA Variables and Machine Learning. Cancers 2023, 15, 1276. https://doi.org/10.3390/cancers15041276

AMA Style

O’Donnell A, Wolsztynski E, Cronin M, Moghaddam S. Improving the Post-Operative Prediction of BCR-Free Survival Time with mRNA Variables and Machine Learning. Cancers. 2023; 15(4):1276. https://doi.org/10.3390/cancers15041276

Chicago/Turabian Style

O’Donnell, Autumn, Eric Wolsztynski, Michael Cronin, and Shirin Moghaddam. 2023. "Improving the Post-Operative Prediction of BCR-Free Survival Time with mRNA Variables and Machine Learning" Cancers 15, no. 4: 1276. https://doi.org/10.3390/cancers15041276

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop