Prediction Model for Pancreatic Cancer—A Population-Based Study from NHIRD

Simple Summary
Pancreatic cancer has been ranked seventh in the top ten cancer mortality rates for the past three years in Taiwan. It is one of the more difficult cancers to detect early due to the lack of early diagnostic tools. This is a population-based study from NHIRD. A higher-performance pancreatic cancer prediction model has been established. This predictive model can improve the awareness of the risk of pancreatic cancer and give patients with pancreatic cancer a simpler tool for early screening in the golden period when the disease can still be eradicated.

Abstract
(1) Background: Cancer has been the leading cause of death in Taiwan for 39 years, and among cancers, pancreatic cancer has been ranked seventh in the top ten cancer mortality rates for the past three years. While the incidence rate of pancreatic cancer is ranked at the bottom of the top 10 cancers, the survival rate is very low. Pancreatic cancer is one of the more difficult cancers to detect early due to the lack of early diagnostic tools. Early screening is important for the treatment of pancreatic cancer. Only a few studies have designed predictive models for pancreatic cancer. (2) Methods: The Taiwan Health Insurance Database was used in this study, covering over 99% of the population in Taiwan. The subset sample was not significantly different from the original NHIRD sample. A machine learning approach was used to develop a predictive model for pancreatic cancer. Four models, including logistic regression (LR), deep neural networks (DNN), ensemble learning (stacking), and voting ensemble, were used in this study. The ROC curve and a confusion matrix were used to evaluate the accuracy of the pancreatic cancer prediction models. (3) Results: The AUC of the LR model was higher than the other three models in the external testing set for all three of the factor combinations. Sensitivity was best measured by the stacking model for the first factor combinations, and specificity was best measured by the DNN model for the second factor combination. The result of the model that used only nine factors (third factor combinations) was no worse than that of the other two factor combinations. The AUC of the previous models for the early assessment of pancreatic cancer ranged from approximately 0.57 to 0.71. The AUC of this study was higher than that of previous studies and ranged from 0.71 to 0.76, which provides higher accuracy. (4) Conclusions: This study compared the performances of LR, DNN, stacking, and voting models for pancreatic cancer prediction and constructed a pancreatic cancer prediction model with accuracy higher than that of previous studies. This predictive model will improve awareness of the risk of pancreatic cancer and give patients with pancreatic cancer a simpler tool for early screening in the golden period when the disease can still be eradicated.


Introduction
While targeted drugs are one of the treatments for cancer, there is still a lack of widely available and effective targeted drugs for pancreatic cancer. According to recent studies, pancreatic cancer does not have a specific molecular variant, as lung and breast cancers do, and researchers have even suggested that there may be more than one molecular variant in pancreatic cancer. Pancreatic cancer tends to have more non-specific symptoms than other diseases, which often leads to the initial diagnosis of other abdominal diseases, making initial treatment plans ineffective and delaying treatment. The lack of early diagnostic tools for pancreatic cancer results in poor overall outcomes [1].
Cancer has been the leading cause of death in Taiwan for 39 years, and among them, pancreatic cancer has been ranked seventh in the top ten cancer mortality rates for the past three years [2]. While the incidence rate of pancreatic cancer is ranked at the bottom of the top 10 cancers, the survival rate is very low. In Taiwan, the survival rate of pancreatic cancer patients is about 25.52% at one year, 9.22% at three years, 6.6% at five years, and 4.71% at ten years. Based on the number of new diagnoses and deaths each year, it is estimated that the incidence rate of the disease is generally close to the mortality rate [3,4].
Due to the absence of obvious disease features in the initial stage, and the fact that the tumor is located in the posterior abdominal cavity, pancreatic cancer is often diagnosed at an advanced stage [4]. According to the American Cancer Society's Facts and Figures annual report, the average five-year survival rate for pancreatic cancer is 9%, with 37% of patients having an early localized disease and only 3% of patients having an advanced disease [5]. In 2019, a joint study by the Lancet and the Global Burden of Diseases (GBD) found that, between 1990 and 2017, the number of deaths, incident cases, and disability-adjusted life years (DALYs) of pancreatic cancer doubled worldwide. Therefore, there is an urgent need for a screening method and effective treatment strategy for the early detection of pancreatic cancer [6]. Surgery remains the ideal treatment for pancreatic cancer and is currently the only method considered to have a chance of curing pancreatic cancer, which significantly improves patient survival compared to other treatment options [7,8]. However, the success rate of surgery depends on the stage of the disease. If the tumor has invaded a major artery or has metastasized distantly, the chances of surgery for pancreatic cancer are low. The data show that only 10-15% of patients have a chance of undergoing radical surgery, with most patients having a tumor that is too large by the time it is discovered and the surrounding lymph nodes, blood vessels, and nerves have been eroded. In 2014, the International Study Group of Pancreatic Surgery (ISGPS) published a study stating that formal resection is highly discouraged if the cancer shows signs of erosion of the superior mesenteric artery [9].
The pre-cancerous nature of pancreatic cancer is not well understood in current studies, and evidence of its incidence, biomarkers, and natural history progression remain insufficient [10,11]; therefore, there is a great need for successful early detection markers when screening for pancreatic cancer in populations at higher risk. Epidemiological studies of clinical factors can be used to effectively reduce and screen high-risk groups [12,13].
A number of symptoms and diseases have been found to be associated with pancreatic cancer; for example, pancreatitis (idiopathic pancreatitis, alcoholic pancreatitis, hereditary pancreatitis, and febrile pancreatitis) is associated with a significantly increased risk of pancreatic cancer [14,15]. Compared to other inflammatory conditions, chronic pancreatitis carries one of the higher relative risks (RR) or odds ratios (OR) among the risk factors for pancreatic cancer [16,17].
Type 2 diabetes mellitus (T2DM) is considered to be an early manifestation of asymptomatic pancreatic cancer and has been suggested as a potential early detection marker [18,19]. Annika Bergquist showed that patients with primary sclerosing cholangitis had a 37% risk of developing hepatobiliary malignancies within one year of diagnosis, a 161-fold increase in the risk of hepatobiliary malignancies, and a 14-fold increase in the risk of pancreatic cancer, as compared to those without prior cholangitis [20]. Thus, cholangitis is a significant risk factor for pancreatic cancer [21]. Clinical observations support the potential carcinogenic role of the hepatitis B virus (HBV) in pancreatic tumors [22,23]. According to the National Health and Nutrition Examination Survey (NHANES) epidemiological follow-up survey, people with periodontitis were found to be at higher risk of developing pancreatic cancer [24,25]. Another study showed a strong positive association between periodontal disease and pancreatic cancer [26]. Pancreatic cancer occurs predominantly in older patients: only about 10% of patients who develop a tumor are under 50 years old, and the incidence rate increases rapidly with age [27].
Only a few studies have designed predictive models for pancreatic cancer. Limor Appelbaum et al. used a feedforward neural network and logistic regression to construct a model for the early assessment of pancreatic cancer, with an AUC of 0.71 for the training set and 0.68 for the validation set [28]. A case-control study by Aileen Baecker et al. developed a multivariate logistic regression model using 16 risk factors and patients' symptoms in the 15 months prior to the diagnosis of pancreatic cancer. After matching the age and the gender of the case and control groups, the model achieved an AUC of 0.68 [29]. Alison P. Klein et al. developed a predictive model based on odds ratios (OR), including smoking, alcohol consumption, diabetes, obesity, family history of pancreatic cancer, and non-O ABO genotype. The subsequent results showed that the AUCs of the risk models were 0.58, 0.57, and 0.61 for non-genetic factors only, genetic factors only, and combined non-genetic and genetic factors, respectively [30].

Data Source
This study used data from the Taiwan National Health Insurance Research Database (NHIRD) from 2000 to 2009. The NHIRD covers 99.98% of the population in Taiwan [31] and contains basic data files, outpatient and inpatient surgical, diagnostic, and medication information, treatment records, and other detailed clinical information. The dataset used in this study was published by NHI Taiwan and contains all of the data for 2,000,000 randomly sampled individuals from NHIRD. The subset sample was not significantly different from the original NHIRD sample in terms of gender and age distribution.
As a diagnosis recorded in the NHIRD may be a one-off code entered for a cancer examination, cases with fewer than two diagnoses of pancreatic cancer were excluded from the experimental group.
As metastasis was not a predictive target, only patients without any prior malignancy (cancer) were included in the experimental group. Ultimately, the time of the first diagnosis of pancreatic cancer in the experimental group was used as the study data, which incorporated the short-term historical medical information within one year before diagnosis date. The targeted control group had never had any malignancy in the entire database, and in order to avoid bias and inequity due to age and gender, the sample was matched according to the gender and age of the subjects in the experimental group. A one-to-three control ratio was used for the control group sample [32], thus, a total of 738 subjects were included in the experimental group and 2952 subjects in the control group.

Data Processing
The dataset was divided into a training set and a testing set in a ratio of 8:2. The training set was further divided into an 80% training subset for model training and a 20% validation set for model validation. The testing set served as the external test data and was used for the final performance testing of the model. As data imbalance may compromise model accuracy [33], random oversampling was used during the training and validation phases, in which the minority class was randomly re-sampled to achieve a 1:1 balance. The RandomOverSampler class of the Python imbalanced-learn package was used to implement this method by randomly re-sampling the minority class until it was numerically balanced with the majority class. In order to ensure fairness of the external test data, the testing set was not involved in any significant factor checking or sampling process. Chi-squared testing was used to examine the relationship between the categorical factors and pancreatic cancer in order to initially confirm the statistical significance of the factors. The Akaike information criterion (AIC) was used to identify the best factor combination for inclusion in the model training.
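The splitting and re-sampling procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's code: the cohort is synthetic (738 cases and 2952 controls, as reported for the two groups), the helper names `split` and `random_oversample` are hypothetical, and only the training subset is balanced here for brevity.

```python
import random

def split(rows, first_fraction, rng):
    """Shuffle a copy of rows and split it into two lists."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * first_fraction)
    return shuffled[:cut], shuffled[cut:]

def random_oversample(rows, rng):
    """Re-sample minority-class rows with replacement until classes are 1:1."""
    by_label = {0: [], 1: []}
    for row in rows:
        by_label[row["label"]].append(row)
    minority, majority = sorted(by_label.values(), key=len)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return rows + extra

rng = random.Random(0)
# Synthetic stand-in for the matched cohort: label 1 = case, label 0 = control.
cohort = [{"id": i, "label": 1 if i < 738 else 0} for i in range(3690)]

train_val, test = split(cohort, 0.8, rng)   # 8:2 split; the testing set is held out
train, val = split(train_val, 0.8, rng)     # 80% training subset, 20% validation set
train_balanced = random_oversample(train, rng)  # the testing set is never re-sampled
```

Keeping the testing set out of `random_oversample` mirrors the statement that the external test data took no part in any sampling process.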

Logistic Regression (LR)
Logistic regression [34] is a probabilistic non-linear regression model for multivariate analysis of the relationship between a binary outcome (Y) and multiple variables (X1, X2, ..., Xn). In contrast, linear regression aims to fit all data points to a straight line to predict a continuous value. In general, in linear regression, the equation is constructed directly on the target Y using the features X. In logistic regression, a linear equation is constructed for the logarithm of the odds and is converted through a sigmoid function into a probability between 0 and 1, representing 0% and 100%, respectively, for the purpose of binary classification.
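The relationship described above can be written out as a standard formulation. The coefficient symbols (an intercept and one coefficient per variable) are notation introduced here for illustration and do not appear in the original text; n denotes the number of variables.

```latex
\ln\frac{p}{1-p} = \beta_0 + \sum_{i=1}^{n} \beta_i X_i ,
\qquad
p = P(Y = 1 \mid X_1, \ldots, X_n)
  = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{i=1}^{n} \beta_i X_i\right)}}
```

The left-hand equation is the linear model on the log-odds, and the right-hand one is the same model solved for p, which is the sigmoid function that keeps the prediction between 0 and 1.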

Deep Neural Networks (DNN)
DNNs [35] are artificial neural networks with multiple hidden layers. While their architecture is similar to that of early perceptrons, the main differences are the deeper hidden layers, the greater variety of activation functions, and the generally better fitting effect. A single-layer perceptron can be regarded as a simple feedforward neural network, where Xi is the input factor, Wi is the weight value, and f is the activation function; together they simulate the structure of nerve cells in a living organism. However, as single-layer perceptrons are unable to learn more complex non-linear models, or provide multivariate outputs, they have been extended to deep neural networks.
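As a rough illustration of the perceptron described above, the following sketch computes the weighted sum of the inputs and passes it through an activation function, then chains two such layers. The weights, the bias term, the layer sizes, and the choice of a sigmoid activation are assumptions made for the example; the study's actual network architecture is not specified in this section.

```python
import math

def sigmoid(z):
    """Sigmoid activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, followed by an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def dense_layer(inputs, weight_rows, biases, activation=sigmoid):
    """A layer is several perceptrons that share the same inputs."""
    return [perceptron(inputs, w, b, activation)
            for w, b in zip(weight_rows, biases)]

# Hypothetical binary inputs, e.g. presence or absence of three diagnoses.
x = [1, 0, 1]
hidden = dense_layer(x, [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, -0.1])
output = perceptron(hidden, [1.2, -0.7], 0.05)  # single output unit
```

Stacking further `dense_layer` calls between `x` and `output` gives the deeper hidden layers that distinguish a DNN from a single perceptron.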

Ensemble Learning
Ensemble learning [36,37] refers to the systematic combination of multiple classifiers, and the aim of combining multiple classifiers is to produce a more powerful model. It is primarily used to improve the classification, prediction, and function approximation performance of a model or to reduce the likelihood of selecting a poorly performing algorithm.
Stacking is an ensemble learning approach that focuses on reducing bias by combining multiple prediction results. It consists of a two-tier structure: multiple base classifiers are built in the first tier, and their outputs are combined by a meta-learner (logistic regression) in the second tier. Seven base classifiers were used in this study: Multi-Layer Perceptron (MLP), Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Gradient Boosted Decision Tree (XGBoost), Classification and Regression Tree (CART), and Bayes. The stacking ensemble learning model architecture is shown in Figure 1. The stacking steps are as follows:
(1) Split the data into a training set and a testing set.
(2) Split the training set by k-fold.
(3) Train and predict until a prediction is available for each fold.
(4) Fit a base model on the complete training set.
(5) Use the model to make predictions on the testing set.
(6) Repeat the above steps for the other base models.
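The stacking steps can be sketched in plain Python as follows. This is a toy illustration rather than the study's implementation: the base learners are simple single-feature rate models standing in for the seven classifiers named above, the data are synthetic, and every function name is hypothetical. Only the meta-learner (logistic regression) follows the text.

```python
import math
import random

def fit_rate_model(X, y, feature):
    """Toy base learner: P(y=1 | one binary feature), with add-one smoothing."""
    counts = {0: [1, 2], 1: [1, 2]}  # feature value -> [positives + 1, total + 2]
    for row, label in zip(X, y):
        counts[row[feature]][0] += label
        counts[row[feature]][1] += 1
    return lambda row: counts[row[feature]][0] / counts[row[feature]][1]

def fit_logistic(Z, y, epochs=500, lr=0.5):
    """Meta-learner: logistic regression fitted by batch gradient descent."""
    w, b = [0.0] * len(Z[0]), 0.0
    def predict(z):
        return 1 / (1 + math.exp(-(sum(wi * zi for wi, zi in zip(w, z)) + b)))
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for z, label in zip(Z, y):
            err = predict(z) - label
            grad_w = [g + err * zj for g, zj in zip(grad_w, z)]
            grad_b += err
        w[:] = [wi - lr * g / len(Z) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(Z)
    return predict

def stack(X_train, y_train, X_test, base_fitters, k=5):
    n = len(X_train)
    folds = [list(range(i, n, k)) for i in range(k)]        # step 2: k folds
    oof = [[0.0] * len(base_fitters) for _ in range(n)]
    test_meta = [[0.0] * len(base_fitters) for _ in X_test]
    for m, fitter in enumerate(base_fitters):               # step 6: every base model
        for fold in folds:                                  # step 3: out-of-fold predictions
            held = set(fold)
            X_fit = [X_train[i] for i in range(n) if i not in held]
            y_fit = [y_train[i] for i in range(n) if i not in held]
            model = fitter(X_fit, y_fit)
            for i in fold:
                oof[i][m] = model(X_train[i])
        full_model = fitter(X_train, y_train)               # step 4: fit on full training set
        for i, row in enumerate(X_test):                    # step 5: predict the testing set
            test_meta[i][m] = full_model(row)
    meta = fit_logistic(oof, y_train)                       # second tier
    return [meta(z) for z in test_meta]

rng = random.Random(1)
X = [[rng.randint(0, 1) for _ in range(3)] for _ in range(200)]
y = [1 if (row[0] and rng.random() < 0.8) or rng.random() < 0.1 else 0 for row in X]
X_train, y_train, X_test, y_test = X[:160], y[:160], X[160:], y[160:]  # step 1
fitters = [lambda Xf, yf, f=f: fit_rate_model(Xf, yf, f) for f in range(3)]
probs = stack(X_train, y_train, X_test, fitters)
```

The meta-learner is trained only on out-of-fold predictions, so it never sees a base model's prediction for a row that the base model was fitted on.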

Voting Ensemble
The voting ensemble, also known as a majority voting ensemble, is an ensemble learning method that combines the predictions of several different models and can be used for classification or regression tasks. In a regression task, the final prediction is the average of the predictions of the models. In a classification task, the final outcome is determined by a majority vote over the predicted classes of the models. This study used seven models, including MLP, LR, RF, SVM, KNN, and XGBoost, and applied soft voting for classification, which sums and averages the predicted class probabilities of each model. The voting ensemble model architecture is shown in Figure 2.
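The difference between averaging predicted probabilities (soft voting) and counting predicted classes (majority voting) can be shown with a small sketch. The probabilities below are made up for illustration, and the function names and the 0.5 threshold are assumptions.

```python
def soft_vote(prob_lists):
    """Average the positive-class probabilities predicted by each model."""
    n_models = len(prob_lists)
    return [sum(ps) / n_models for ps in zip(*prob_lists)]

def hard_vote(prob_lists, threshold=0.5):
    """Majority vote over each model's thresholded class prediction."""
    n_models = len(prob_lists)
    return [int(sum(p >= threshold for p in ps) * 2 > n_models)
            for ps in zip(*prob_lists)]

# Hypothetical probabilities from three models for four subjects.
model_probs = [
    [0.9, 0.4, 0.2, 0.55],
    [0.8, 0.6, 0.1, 0.55],
    [0.7, 0.2, 0.3, 0.10],
]
averaged = soft_vote(model_probs)
soft_labels = [int(p >= 0.5) for p in averaged]
hard_labels = hard_vote(model_probs)
```

For the fourth subject, two of three models vote positive by a narrow margin, so the majority vote is positive, while the averaged probability (0.4) stays below the threshold. Soft voting therefore takes each model's confidence into account.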

Model Development Environment
This study used SPSS 22 for data pre-processing, chi-squared independence checking, and sample matching; R language version 3.5.2 for AIC criterion checking, such as backward elimination in stepwise regression; Python version 3.7.3 for logistic regression (LR) [34], deep neural networks (DNN) [35], ensemble learning [36,37], voting ensemble model building, and visual mapping; Microsoft SQL Server 2014 was used for database retrieval and filtering. The ROC curve and a confusion matrix were used to evaluate the accuracy of the pancreatic cancer prediction models.
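The evaluation measures named above can be computed from first principles. The sketch below uses made-up labels and scores; the function names and the 0.5 decision threshold are assumptions, and AUC is computed through its rank interpretation (the probability that a randomly chosen case scores higher than a randomly chosen control).

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels coded 1 = case, 0 = control."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

def summary(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from the confusion matrix."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }

def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.5, 0.1]
y_pred = [int(s >= 0.5) for s in scores]
metrics = summary(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy, sensitivity, and specificity depend on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds.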

Results
A total of 3690 subjects were included in this study. The dataset was divided into a training set and a testing set in a ratio of 8:2. The training set (N = 2952) was divided into an 80% training subset (N = 2362) for model training and a 20% validation set (N = 590) for model validation. There were more male than female subjects (53.9% vs. 46.1%). Age was concentrated in the middle-aged to elderly groups, mainly over 65 years of age (more than 55%). In the first stage, the 2952 subjects in the training set were used to identify risk factors associated with pancreatic cancer. The risk factors identified in previous studies, including pancreatitis, diabetes, peptic ulcer, cholangitis, hepatitis, periodontal disease, sleep disorders, and fasciitis, were included in this study. Other factors were obtained from the short-term medical history of the pancreatic cancer patients in the study group over the period of one year. Finally, a total of 74 candidate factors were included in the follow-up factor validation.
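For a single binary candidate factor, the chi-squared test of independence reduces to a 2x2 table of factor status against group. The sketch below uses the Pearson statistic without continuity correction, which may differ from the exact procedure run in SPSS; the counts are hypothetical and not taken from the study's tables.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for a 2x2 table.

    a, b: cases with / without the factor
    c, d: controls with / without the factor
    Returns the statistic and the p-value for 1 degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    return chi2, p_value

# Hypothetical counts: 30 of 100 cases and 10 of 100 controls have the factor.
chi2, p = chi_square_2x2(a=30, b=70, c=10, d=90)
```

A factor would pass this first screen when `p` falls below the chosen significance level, after which it becomes a candidate for the stepwise selection.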

[Table: Group | Experimental Group (With Pancreatic Cancer) | Control Group (No Cancer) | Chi-Square p-Value; row: Chronic bronchitis (ICD-9 = 491)]
In order to streamline the number of factors, and thereby improve the convenience and reduce the complexity of the model, the next two combinations applied more stringent statistical significance thresholds as the selection criterion. The purpose was to identify the key factors with truly high predictive power among the significant factors (first factor combinations).
The second factor combination used backward elimination in stepwise regression to select 19 factors with p-values of <0.05 from the first factor combinations, including the following: abdominal pain, peptic ulcer site unspecified, symptoms involving the digestive system, gastritis and duodenitis, disorders of function of the stomach, chronic liver disease and cirrhosis, general symptoms, cholangitis, pancreatitis, symptoms involving the head and neck, symptoms involving the respiratory system and other chest symptoms, urticaria, other cellulitis and abscess, acute bronchitis and bronchiolitis, cardiac dysrhythmias, acute and subacute necrosis of liver, diabetes mellitus, gout, and functional digestive disorders not elsewhere classified. The third factor combination used backward elimination in stepwise regression to select nine factors with p-values of <0.001 from the first factor combinations, including the following: abdominal pain, peptic ulcer site unspecified, symptoms involving the digestive system, gastritis and duodenitis, disorders of function of the stomach, chronic liver disease and cirrhosis, general symptoms, cholangitis, and pancreatitis. The stepwise regression results are shown in Table 2.
This study constructed four models using three different sets of factors. ROC curves are reported for both the validation and testing sets. While the validation set was used in the aforementioned chi-square and stepwise regression, it was not included in the model training. The testing set was not used for factor selection or model training; it was used to simulate real-world data for testing.
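For reference, the AIC used to guide the stepwise procedure has the standard form below, where k is the number of estimated parameters and L-hat is the maximized likelihood of the fitted logistic regression. The symbols are introduced here for illustration and are not the paper's notation.

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}
```

In a typical backward elimination, the procedure starts from the full model, removes at each step the factor whose removal lowers the AIC the most, and stops when no removal lowers it further. This describes the general technique, and the exact stopping rule used in the study is an assumption.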

Model Performance Comparison
This study compared the performance of each model on the validation and testing sets for the three factor combinations, using ROC curves, accuracy, sensitivity, and specificity.

First Factor Combinations (32 Factors)
The ROC curves of the validation and testing sets for the first factor combination are shown in Figures 3 and 4.

Second Factor Combinations (19 Factors)
The ROC curves of the validation and testing sets for the second factor combination are shown in Figures 5 and 6.
According to the results, the validation set showed that the voting, stacking, and DNN models tended to be over-optimistic when the factor combinations were more complex (more factors). While a slight improvement was seen in the training phase using SMOTE data augmentation (synthetic training data), a significant improvement was seen when the model was simplified (by reducing the number of factors).
The AUC of the LR model was higher than the other three models in the external testing set for all three of the factor combinations. Sensitivity was best measured by the stacking model for the first factor combinations, and specificity was best measured by the DNN model for the second factor combination. The result of the model that used only nine factors (third factor combinations) was no worse than the other two factor combinations that used more factors for the external testing results. The detailed results are shown in Table 3.

Discussion
This study developed a pancreatic cancer risk identification prediction model using disease diagnosis records from the NHIRD, and the results were validated through an independent testing set. Pancreatic cancer progresses rapidly, and the average estimated time for progression from stage T1 to stage T4 is 14 months [38], thus, the immediate detection of pancreatic cancer at the resectable stage is the critical goal of early assessment.
This study constructed a predictive model based on 12 months of diagnostic data, in order to provide an early warning to patients at the early stage of pancreatic cancer. The results of this model show that a short history of diseases has the potential for screening and prediction. The AUC of previous models for the early assessment of pancreatic cancer ranged from approximately 0.57 to 0.71. The AUC of this study was higher than that of previous studies and ranged from 0.71 to 0.76, which provides higher accuracy.
A recent study on the development of a prediction model for pancreatic cancer risk screening among the general population [32] proposed a total of 15 factors (abdominal pain, angina pectoris, asthma, atherosclerotic heart disease, gallbladder stones, chest pain, chronic pancreatitis, coronary heart disease, diabetes, emphysema, primary hypertension, family history of pancreatic cancer, jaundice, stroke, and ulcers). The AUCs of the training and testing sets were 0.71 and 0.68, respectively. Another model for the assessment of pancreatic cancer applied to health care delivery [29] used 16 factors to establish a prediction model, including the following: acute pancreatitis, chronic pancreatitis, diabetes, dyspepsia, gastritis/peptic ulcer/gallbladder disease, acute cholecystitis, depression, abdominal pain, chest pain, gastrointestinal symptoms, esophageal reflux, jaundice, weight loss/anorexia, nausea/vomiting, fatigue, and tickling. The performance analysis of a model with an AUC of 0.61 found that, even though the data sources were from different ethnic or national populations, the selected factors were similar to those in our study, such as those related to gastrointestinal, gallbladder, and pancreatic diseases, diabetes, and chest pain.
Although not all of the factors were clinically confirmed, the data showed consistency across different regions of the population. In addition, sleep disturbance and hepatitis, which have been less frequently adopted as training factors than those used in previous studies, were found to be among the important key factors in this study. This study presented nine key independent predictors as third combination factors and used a smaller number of predictors than previous studies, and the results show that the model's AUC performance was higher than that of the previous models for identifying pancreatic cancer in the general population. The detailed results are shown in Table 4.

Conclusions
This study compared the performances of LR, DNN, stacking, and voting models for pancreatic cancer prediction and constructed a pancreatic cancer prediction model with an accuracy that was higher than that of previous studies. As a reference tool, this diagnostic-based model will help physicians and the public to identify the risk of pancreatic cancer. As this model uses only nine key disease factors, it offers the advantages of low cost and rapid screening. This predictive model will improve awareness of the risk of pancreatic cancer and will give patients with pancreatic cancer a simpler tool for early screening in the golden period when the disease can still be eradicated.
Institutional Review Board Statement: The study was exempt from full review by the Institutional Review Board, since the dataset used in this study is deidentified secondary data released to the public for research purposes. All of the NHIRD data related to personal identification were encrypted by the National Health Insurance Administration (NHIA) before being published. The confidentiality of subjects in the dataset was protected by the NHIA regulations.
Informed Consent Statement: Not applicable.

Data Availability Statement: The data presented in this study are available on request from the corresponding author.