Prediction System for Prostate Cancer Recurrence Using Machine Learning

Prostate cancer is the fourth most common cancer affecting South Korean males, and the biochemical recurrence (BCR) of prostate cancer occurs in approximately 25% of patients five years after radical prostatectomy. The ability to predict BCR would help clinicians and patients to make better treatment decisions. Therefore, in this study, we have proposed a web-based clinical decision support system that predicts the BCR of prostate cancer in Korean patients. The data were obtained from the Korean Prostate Cancer Registry (KPCR) database, which contained information about 7394 patients with prostate cancer who were treated at one of the six major medical institutions in South Korea between May 2001 and December 2014. We tested 13 prediction models and selected the gradient boosting classifier because it demonstrated excellent prediction performance. Using this model, we were able to create a web application and once clinical data from patients were entered, the three- and five-year post-surgery BCR predictions could be extracted. We developed a clinical decision support system to provide a prostate cancer BCR predictive function to facilitate postoperative follow-up and clinical management. This system will help clinicians develop a strategic approach for prostate cancer treatment by predicting the likelihood of prostate cancer recurrence.


Introduction
According to a 2017 Korean National Cancer Center survey, prostate cancer is the fourth most common cancer affecting South Korean males, with an incidence rate of 29 per 100,000 people in the country's population [1].
Biochemical recurrence (BCR) of prostate cancer occurs in approximately 25% of patients within five years after radical prostatectomy [2]. BCR is accepted as evidence of cancer recurrence if measurable serum prostate-specific antigen (PSA) levels are confirmed after radical prostatectomy, or if PSA levels increase after radiation treatments [3][4][5]. BCR is also accepted as an indicator of postoperative progress and outcomes [6]. If we could predict prostate cancer BCR by analyzing data on the progress and recurrence patterns of prostate cancers, it would help clinicians to make better decisions with regard to future treatment options. In addition, depending on the predicted BCR probability, the incidence of recurrence could be reduced by providing patients with physical examinations or related medication.
Most medical institutions use computer-based systems to accurately and efficiently manage medical data [7]. In particular, the clinical decision support system (CDSS) is a valuable tool that helps healthcare providers to make decisions and solve complex problems [8,9]. Few prostate cancer recurrence prediction systems are currently in clinical use, so providing this function through a facility's CDSS will facilitate postoperative follow-up and aid in the clinical management of patients with prostate cancer [10].
The Memorial Sloan Kettering Cancer Center (MSKCC) in the United States provides a function that can be used to predict the probability of 2-, 5-, 7-, 10-, and 15-year BCR-free survival after prostate cancer surgery. It applies statistical techniques such as linear regression, logistic regression, and survival analysis models to provide a web-based cancer recurrence prediction system [11]. However, the incidence of prostate cancer varies with country and race [12][13][14][15], which means that a predictive system designed for MSKCC patients may not be as accurate at predicting prostate cancer recurrence in South Korea.
Therefore, in this study, we propose a web-based CDSS that can predict BCR in Korean patients with prostate cancer to support postoperative care after radical prostatectomy. Our study uses data from Korean patients with prostate cancer to predict the recurrence of prostate cancer in this population. We tested several statistical and machine learning techniques that could be used to predict prostate cancer recurrence and then adopted the model that exhibited the best prediction performance for use in our prostate cancer recurrence prediction system. This study aimed to develop a machine learning model that could use the Korean Prostate Cancer Registry (KPCR) database to predict three- and five-year BCR in Korean patients with prostate cancer who had undergone radical prostatectomy.

BCR Prediction Data
The data used in this study were obtained from the KPCR database, which included 7394 patients with prostate cancer. This database contained electronic medical record (EMR) data collected from six major medical institutions in South Korea between May 2001 and December 2014 (Figure 1). The study protocol was approved by the institutional review board of the Catholic University of Korea (IRB No. MC16RIMI0107). Data standardization and quality control were implemented to ensure data integrity, and exclusion criteria were created and applied to the database to refine the data used in the BCR prediction.
Our data subset was created from the main KPCR database using the following process (Figure 2):
• There were 7394 potentially relevant records in the KPCR database.
• A total of 2280 records were excluded: those of non-Koreans (51), those without a full follow-up period of at least one year (633), those who did not have a neoclassical prostatectomy (80), those who had received pre-supplementary therapy (230), and those for whom critical data were missing (1286).
• We identified 5114 individual records as being statistically usable and relevant.
• We classified the 5114 patient records into BCR (1207) and Non-BCR (3907) groups to identify the characteristics of each group.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 3 of 9
We found that follow-up periods ranged between 0 and 157 months, and out of the total of 7394 patients, 70% or more had a follow-up period of less than five years. Therefore, we decided to predict BCR likelihood at three and five years after radical prostate cancer surgery.
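The exclusion counts listed above can be checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' code: the counts are taken from the text, while the dictionary keys and the function name `cohort_flow` are labels introduced here for illustration.

```python
# Counts as reported in the text (cohort selection, Figure 2).
TOTAL_RECORDS = 7394

# Keys are illustrative labels chosen for this sketch, not registry field names.
EXCLUSIONS = {
    "non_korean": 51,
    "follow_up_under_one_year": 633,
    "prostatectomy_criterion": 80,
    "pre_supplementary_therapy": 230,
    "missing_critical_data": 1286,
}

GROUPS = {"BCR": 1207, "Non-BCR": 3907}


def cohort_flow(total, exclusions):
    """Return (number excluded, number remaining) after applying all exclusions."""
    n_excluded = sum(exclusions.values())
    return n_excluded, total - n_excluded


excluded, remaining = cohort_flow(TOTAL_RECORDS, EXCLUSIONS)
bcr_share = GROUPS["BCR"] / remaining

print(f"excluded={excluded}, remaining={remaining}")
print(f"BCR share of analysed cohort: {bcr_share:.1%}")
```

The five exclusion counts sum to 2280, leaving 5114 records, and the two outcome groups add up to the same 5114, so the reported flow is internally consistent.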
Factors affecting BCR have been reported to include age, initial PSA, clinical T stage, pathological Gleason score sum, pathological T stage, surgical margin, perineural invasion, seminal vesicle invasion, extracapsular extension, and lymphovascular invasion [16][17][18][19]. We extracted these factors from the 5114 analytical records and generated a training data set for predicting three- and five-year BCR after radical prostate cancer surgery.
To select the model with the best BCR prediction performance, we evaluated the area under the receiver operating characteristics curve (AUC), accuracy, sensitivity, specificity, and Matthews correlation coefficient (MCC) for each model. We then selected the analytical model that performed the best and applied it to the BCR prediction system.
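The evaluation criteria named above have standard definitions, which the following sketch spells out using only the Python standard library. It is an illustration under stated assumptions rather than the authors' evaluation code: the function names are hypothetical, and the AUC is computed with the pairwise-ranking definition (the probability that a positive case is scored above a negative one), which is equivalent to the area under the ROC curve.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


def auc_from_scores(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy inputs for illustration only; they are not study data.
print(classification_metrics(tp=40, fp=10, tn=45, fn=5))
print(auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

MCC is useful here because the BCR and Non-BCR groups are imbalanced: unlike accuracy, it only approaches 1 when both classes are predicted well.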

BCR Prediction System
A system was needed that could analyze the database of patients with prostate cancer and use the results to make predictions about prostate cancer recurrence. Such a system can then be used to analyze data from a new patient and predict the probability of recurrence. Based on the results of the model, clinicians can select treatment options to reduce the incidence of BCR. The system was developed as a web-based application so that clinicians could access it easily.
In this study, we proposed a system that used a prediction algorithm to analyze the KPCR database in order to develop a BCR prediction model using data collected from patients with prostate cancer.

Patient Characteristics and Distribution
We analyzed the characteristics and distribution of patients by dividing them into those with prostate cancer BCR and those without recurrence (Non-BCR) (Table 1). The average age of the 5114 patients with prostate cancer was 66 years, and the average age in both the BCR and Non-BCR groups was 65 years. The oldest and youngest BCR patients were 84 and 38 years old, respectively, while the oldest and youngest Non-BCR patients were 90 and 37 years old, respectively. The average initial PSA value for all patients was 11.59 ng/mL, and the mean values for BCR and Non-BCR patients were 17.58 ng/mL and 9.74 ng/mL, respectively. For BCR patients, the initial PSA ranged from 0.66 to 261.77 ng/mL; for Non-BCR patients, it ranged from 0.09 to 305 ng/mL. Pathological Gleason scores were classified as grade 2-4, 5, 6, 7, and 8-10, which included 5 (0.09%), 22 (0.43%), 1105 (21%), 3295 (64%), and 687 (13.43%) patients, respectively. We classified clinical T stages into four of the existing eight grades [29]: 2152 patients for T1, 1883 for T2, 953 for T3, and 126 for T4. The BCR patients were widely distributed within T2, and the Non-BCR patients were widely distributed within T1.

BCR Prediction Model Analysis
To analyze the statistical performance of the BCR predictions, we evaluated the AUC, accuracy, sensitivity, specificity, and MCC of each analysis method. The analytical data were divided randomly, with 80% allocated to the training dataset and 20% allocated to the test dataset. We trained each model using the training data and then evaluated its performance using the test data.
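A random 80/20 split of this kind can be sketched with the standard library alone. The code below is an assumption-laden illustration: the function name, the fixed seed, and the rounding of the test-set size are choices made here, and the source does not state how the split was implemented or the exact sizes of the two subsets.

```python
import random


def train_test_split_indices(n_records, test_fraction=0.2, seed=0):
    """Shuffle record indices and split them into training and test index lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)          # seeded for reproducibility
    n_test = round(n_records * test_fraction)     # rounding rule is an assumption
    return indices[n_test:], indices[:n_test]


# 5114 is the number of analysable records reported in the text.
train_idx, test_idx = train_test_split_indices(5114)
print(len(train_idx), len(test_idx))
```

The two index lists are disjoint and together cover every record, so no patient appears in both the training and the test data. In practice a library routine such as scikit-learn's `train_test_split` does the same job; whether the authors used it is not stated.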
According to the analysis results (Table 2), the highest AUC for the three-year predictions was 0.8419, obtained with the gradient boosting classifier (GBC) model. For the five-year predictions, the highest AUC (0.8071) was obtained with the Ridge Regression model. The highest MCC values for the three-year and five-year predictions were obtained with the Random Forest (ntrees = 80) model (0.4621) and the GBC model (0.4836), respectively. The Cox proportional hazards model showed good accuracy and specificity but poor sensitivity, and so was not used as a predictive model. Across the five analysis criteria (AUC, accuracy, sensitivity, specificity, and MCC), the GBC model had the highest average value. Therefore, we applied this model to develop the BCR prediction system. We implemented the three- and five-year GBC prediction models and the BCR prediction system in Python.
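Selecting a model by its average over the five criteria can be expressed compactly. The sketch below assumes an unweighted mean, which is one reading of "highest average value"; the model names and scores are placeholders invented for illustration and are not the values in Table 2.

```python
from statistics import mean

CRITERIA = ("auc", "accuracy", "sensitivity", "specificity", "mcc")


def rank_models_by_mean(results):
    """Sort (model name, mean score) pairs by the unweighted mean, best first."""
    averages = {
        name: mean(scores[c] for c in CRITERIA) for name, scores in results.items()
    }
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


# Placeholder numbers for illustration only; they are NOT taken from Table 2.
example_results = {
    "model_a": {"auc": 0.84, "accuracy": 0.80, "sensitivity": 0.60,
                "specificity": 0.88, "mcc": 0.46},
    "model_b": {"auc": 0.80, "accuracy": 0.82, "sensitivity": 0.30,
                "specificity": 0.97, "mcc": 0.40},
    "model_c": {"auc": 0.78, "accuracy": 0.76, "sensitivity": 0.58,
                "specificity": 0.82, "mcc": 0.38},
}

for name, avg in rank_models_by_mean(example_results):
    print(f"{name}: {avg:.3f}")
```

In this toy example, `model_b` has the best accuracy and specificity but ranks last because its low sensitivity drags down the mean, which mirrors the reason given above for rejecting a model that is strong on some criteria but weak on sensitivity.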

Development of BCR Prediction System
The BCR prediction system used data from Korean patients with prostate cancer to implement a web-based application. The structure of the prediction system was classified into two parts: predictive modeling and web application development (Figure 3). The web application used the predictive model to make predictions.

The BCR prediction system was set up to perform several functions on one screen, including reading the data list of patients with prostate cancer, entering the input values, and extracting the prediction results (Figure 4). Once the input fields (age at diagnosis, initial PSA, clinical T stage, pathological Gleason score, and pathological T stage) have been filled in and the status information (surgical margin, perineural invasion, seminal vesicle invasion, extracapsular extension, and lymphovascular invasion) has been selected, the three- and five-year post-surgery BCR predictions can be extracted. The prediction results appear as an output item at the bottom of the screen.
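The ten inputs described above could be collected and validated before being passed to a model, as in the following sketch. Everything about the representation is an assumption made for illustration: the class name `PatientInput`, the field names, the integer encoding of the T stages, and the validation ranges are hypothetical, and the call to the trained model is omitted because the source does not describe its interface.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class PatientInput:
    """One form submission; names and encodings are illustrative assumptions."""
    age_at_diagnosis: int
    initial_psa: float              # ng/mL
    clinical_t_stage: int           # assumed coding 1-4 for T1-T4
    pathological_gleason: int       # Gleason score sum, 2-10
    pathological_t_stage: int       # coding not specified in the source
    surgical_margin: bool
    perineural_invasion: bool
    seminal_vesicle_invasion: bool
    extracapsular_extension: bool
    lymphovascular_invasion: bool

    def __post_init__(self):
        # Basic range checks so that obviously invalid entries are rejected early.
        if not 0 < self.age_at_diagnosis < 120:
            raise ValueError("age_at_diagnosis out of range")
        if self.initial_psa < 0:
            raise ValueError("initial_psa must be non-negative")
        if self.clinical_t_stage not in (1, 2, 3, 4):
            raise ValueError("clinical_t_stage must be 1-4")
        if not 2 <= self.pathological_gleason <= 10:
            raise ValueError("pathological_gleason must be 2-10")

    def to_features(self):
        """Flatten the record into a list of floats in field order."""
        return [float(value) for value in astuple(self)]


# Example values are made up for illustration.
patient = PatientInput(65, 11.59, 2, 7, 2, True, False, False, True, False)
print(patient.to_features())
```

Keeping the validation next to the field definitions means the web form and the model see the same constraints, and a fixed field order keeps the feature vector aligned with whatever order was used during training.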

Follow-Up Period for Predicting BCR
Long-term follow-up is needed to identify prostate cancer recurrence. The current data includes follow-up periods of up to 13 years, although the average is just four years, which is short. As a result, our work was limited to predicting prostate cancer recurrence three and five years after radical prostate cancer surgery. If Korean prostate cancer data could be accumulated over a longer time period, the BCR prediction system developed here could be applied to predicting the likelihood of cancer recurrence more than 10 years after surgery.

Prostate Cancer Patients Data and Data Provider Characteristics
The data on patients with prostate cancer used in this study were collected from six Korean medical institutions. Although this is not a large number of institutions, they are large hospitals that would have been visited by most cancer patients in South Korea. These institutions are therefore considered to have provided representative datasets, and it was assumed that institutional dataset characteristics were unlikely to exhibit significant differences.
The analytical classifier used in this study is specific to the Korean prostate cancer dataset. There are racial differences in the recurrence of prostate cancer, and since we currently only have a dataset that represents Korean patients, we can only develop a specific predictive model that applies to that population. We would, therefore, have difficulty predicting recurrence in other races. If we can collect data from other racial populations in the future, we will be able to develop more general models that can predict recurrence in various groups of people.

The Potential of Misdiagnosis
The CDSS developed in this study can help clinicians determine treatment options by predicting the probability of prostate cancer recurrence. However, there was an error rate of about 20 percent in this study, which means that there is a potential for misdiagnosis. The CDSS should therefore be used as an ancillary tool, and, in the future, measures to improve data quality and the accuracy of predictive models should be investigated.

Conclusions
In this study, we designed and developed a CDSS that can predict the likelihood of BCR in Korean patients with prostate cancer. We used gradient boosting classifier (GBC) models, which showed stable performance, and trained them on the KPCR, a database of Korean patients with prostate cancer, to develop BCR prediction models specialized for Korean data. The prediction model was incorporated into a web-based system to allow easy access to data from patients with prostate cancer and facilitate its use for BCR predictions. The BCR prediction could facilitate postoperative follow-up and clinical management of patients with prostate cancer. This system will help clinicians develop a strategic approach by predicting the likelihood that prostate cancer will recur in a particular patient.