Establishment of a Potential Serum Biomarker Panel for the Diagnosis and Prognosis of Cholangiocarcinoma Using Decision Tree Algorithms

Potential biomarkers including S100 calcium binding protein A9 (S100A9), mucin 5AC (MUC5AC), transforming growth factor β1 (TGF-β1), and angiopoietin-2 have previously been shown to be effective for cholangiocarcinoma (CCA) diagnosis. This study measured the serum levels of these biomarkers and compared them with carbohydrate antigen 19-9 (CA19-9). Forty serum samples each from CCA patients, patients with other gastrointestinal cancers (non-CCA), and healthy subjects were examined using an enzyme-linked immunosorbent assay. The panel of biomarkers was evaluated for accuracy in diagnosing CCA and subsequently used as inputs to construct decision tree (DT) models as a basis for binary classification. Serum levels of S100A9, MUC5AC, and TGF-β1 were markedly elevated in CCA patients. In addition, a panel incorporating all five candidate biomarkers achieved 95% sensitivity and 90% specificity for differentiating CCA from healthy cases, and 70% sensitivity and 83% specificity for CCA versus non-CCA cases. In CCA patients with low CA19-9 levels, S100A9 might be a complementary marker for improved diagnostic accuracy. High levels of TGF-β1 and angiopoietin-2 were both associated with severe tumor stages and metastasis, indicating that they could be used as a reliable prognostic biomarker panel for CCA patients. Furthermore, the model obtained from the Classification and Regression Tree (CART) algorithm using serial CA19-9 and S100A9 showed high diagnostic efficiency. In conclusion, the results show the efficacy of the novel CCA biomarker panel examined herein for CCA diagnosis and prognosis, which may prove useful in clinical settings.


Introduction
Cholangiocarcinoma (CCA) is a complex group of malignancies that arise in the biliary tree. It is the most common liver cancer and a major public health issue in the northeastern region of Thailand [1,2]. Infection with the liver fluke Opisthorchis viverrini, which causes chronic inflammation and advanced periductal fibrosis, is a significant oncogenic risk factor for CCA development in this region. Additionally, effective diagnosis and prognosis remain a critical challenge because CCA is typically asymptomatic in early stages, is most often diagnosed in late stages, and is difficult to differentiate from other gastrointestinal (GI) cancers, including hepatocellular carcinoma (HCC) [3,4].
Currently, the blood-based tumor biomarker carbohydrate antigen 19-9 (CA19-9) is generally used to diagnose CCA. However, CA19-9 provides unsatisfactory sensitivity and specificity because 7% of individuals in the population are Lewis antigen-negative with no or low production of CA19-9, and this marker is also often elevated in benign conditions and other GI tract malignancies [5,6].
Numerous studies of biomarkers in bile duct cancer have focused on individual protein measurements (i.e., single biomarkers). Few studies to date have examined multiple biomarkers (a biomarker panel) as a strategy to improve the diagnostic accuracy of CCA [7][8][9]. Tshering et al. and Wongkham et al. have shown that no individual biomarker provides acceptable sensitivity and specificity for CCA and suggested that a combination of biomarkers may provide more accurate diagnosis [10,11]. Proteomic studies have been used to analyze the pattern of all proteins in patients' sera for many types of cancer [12][13][14]. Duangkumpha et al. compared candidate proteins in the sera of CCA patients and normal groups by mass spectrometry and found significantly increased levels of S100 calcium-binding protein A9 (S100A9), which has the potential to be used as a diagnostic biomarker for CCA [15]. A meta-analysis revealed that the detection of mucin 5AC (MUC5AC) in sera could be used as a powerful biomarker of CCA, providing a specificity of up to 97% and a sensitivity of 63% [16][17][18]. Furthermore, Kimawaha et al. found that transforming growth factor β1 (TGF-β1) in sera could be a potential biomarker to predict the risk of developing CCA [19]. In addition, studies of angiopoietin-2, which provides a sensitivity of 74% and a specificity of 94% for CCA diagnosis, have also been reported [10,11,20].
Computer-based diagnostic algorithms, such as decision tree classification, have been used for many cancers to construct supervised models that integrate candidate biomarkers [21][22][23]. The main task of these diagnostic models is to classify unknown objects into pre-defined groups using a hierarchical structure that directs interpretation toward a final decision [24]. Consequently, combined biomarker studies that draw on both biomedical and bioinformatics fields are required to establish potentially effective CCA biomarker panels, thereby enhancing the efficiency of CCA diagnosis.
In this study, we aimed to validate the group of reported candidate-CCA biomarkers, namely S100A9, MUC5AC, TGF-β1, angiopoietin-2, and the commonly used tumor marker, CA19-9, in the sera of CCA patients compared with healthy people and other GI cancer groups. The pattern of serum biomarkers together with the clinicopathological data of CCA patients was subsequently analyzed. Furthermore, to improve the diagnostic and prognostic efficiency in the management of CCA patients, the DT classification model was applied as a hierarchical structure of multi-biomarkers to establish the CCA biomarkers panel.

Patients and Serum Samples
Blood samples were collected from CCA patients (n = 40) and non-CCA patients (n = 40), including hepatocellular carcinoma (HCC) (n = 23), CA gallbladder (n = 7), CA pancreas (n = 5), and liver metastasis (n = 5), at Srinagarind Hospital, and specimens were kept in the biobank of the Cholangiocarcinoma Research Institute (CARI), Khon Kaen University, Khon Kaen, Thailand. In addition, blood samples from people who had normal ultrasonography results (normal group; n = 40) were collected from Ban Wa sub-district, Khon Kaen, Thailand. All human specimens and protocols in this study were approved by the Human Ethics Committee of Khon Kaen University, based on the ethics of human specimen experimentation of the National Research Council of Thailand (HE611196), and informed consent was obtained from each subject before surgery.

The Detection of Candidate Proteins in Sera by Sandwich ELISA
A sandwich ELISA was performed to determine the levels of the candidate proteins S100A9, MUC5AC, TGF-β1, and angiopoietin-2. Levels of these proteins in the sera of CCA and non-CCA patients were compared with those of the normal ultrasonography group using a Quantikine ELISA Kit (S100A9, CSB-E11834h, Cusabio, Houston, TX, USA; MUC5AC, CSB-E10109h, Cusabio, Houston, TX, USA; TGF-β1, DB100B, R&D systems, Minneapolis, MN, USA; angiopoietin-2, DANG20, R&D systems, Minneapolis, MN, USA). According to the manufacturer's instructions, assay diluent was added to each well of a plate coated with a primary antibody specific to each protein. Standard, control, and diluted samples were added to each well in duplicate and incubated for 2-2.5 h at room temperature with gentle shaking. After washing, the biotinylated antibody specific for each candidate protein was added and incubated for 1-2 h. Subsequently, streptavidin-HRP solution was added to each well for 45-60 min at room temperature. The TMB substrate solution was then added and the plate was protected from light. The reaction was stopped with hydrochloric acid and the plates were read on an ELISA reader using Magellan at an optical density (OD) of 450 nm. Concentrations were calculated by reference to the standard curve for each protein.

The Detection of CA19-9 in Sera by Automated ELECSYS COBAS
The quantitative determination of CA19-9 was performed by electrochemiluminescence immunoassay (ECLIA) on a Cobas e analyzer at Srinagarind Hospital, Khon Kaen University, Khon Kaen, Thailand. In brief, 10 µL of sample, a biotinylated monoclonal CA19-9-specific antibody, and a monoclonal CA19-9-specific antibody labeled with a ruthenium complex reacted to form a sandwich complex. Subsequently, streptavidin-coated microparticles were added and the complex was bound to the solid phase via the interaction of biotin and streptavidin. The reaction mixture was aspirated into the measuring cell, where the microparticles were magnetically captured onto the surface of the electrode. Unbound substances were then removed, and application of a voltage to the electrode induced chemiluminescent emission, which was measured by a photomultiplier. The CA19-9 concentration was determined via a calibration curve in the analyzer. The measuring range was 0.60-1000 U/mL.

Decision Tree Construction for CCA Biomarkers Panel
Decision Tree (DT) is a tree-based classifier/hypothesis and has been widely considered as a practical decision-making technique with a simple knowledge representation [25]. DT has been applied in many medical applications [22,23,25]. The optimum combination of the candidate variables is derived through an exhaustive search of all possibilities by recursively partitioning a dataset to achieve the minimum impurity measure.
We applied the Classification and Regression Tree (CART) algorithm [26] to construct a binary classification model based on five candidate biomarkers with corresponding optimal cut-off values as obtained in the previous section. CART is very similar to C4.5 but additionally supports numerical target variables (regression). The CART procedure [26] can be summarized as follows:

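1. Start at the root node (t = 1), which holds the entire training sample.
2. Search for a split s* among the set of all possible candidate splits s that gives the largest decrease in impurity.
3. Split node t = 1 into two child nodes (t = 2 and t = 3) using the split s*.
4. Repeat the split search process, following steps 1-3, for the obtained nodes (t = 2 and t = 3) until the tree grows fully or the stopping rules are met.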
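The split search in the procedure above can be sketched in a few lines. This is only an illustration under stated assumptions: Gini impurity is used as the impurity measure (one common choice for CART; the text does not say which criterion was finally selected), the inputs are binary 0/1 features, and the function names and toy data are hypothetical rather than taken from the study.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def impurity_decrease(rows, labels, feature):
    """Decrease in impurity obtained by splitting on a binary (0/1) feature."""
    left = [y for x, y in zip(rows, labels) if x[feature] == 0]
    right = [y for x, y in zip(rows, labels) if x[feature] == 1]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

def best_split(rows, labels):
    """Return the feature index s* with the largest impurity decrease."""
    n_features = len(rows[0])
    return max(range(n_features),
               key=lambda f: impurity_decrease(rows, labels, f))

# Hypothetical toy data (not study data): two binarized markers per subject,
# labels 0 = normal and 1 = CCA.
rows = [(1, 0), (1, 1), (0, 1), (0, 0), (1, 0), (0, 1)]
labels = [1, 1, 0, 0, 1, 0]
s_star = best_split(rows, labels)
```

In this toy set the first feature separates the two classes perfectly, so it is selected as the root split; a full tree would apply the same search recursively to each child node.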
Three distinct binary classifiers based on the CART algorithm were built: Model I for classifying normal and CCA, Model II for classifying normal and non-CCA, and Model III for classifying non-CCA and CCA. For model construction, the relevant samples for each binary classification were divided into training (70%) and testing (30%) datasets. The distribution of clinicopathological data in the training and testing datasets is shown in Table 1. The CART algorithm was implemented in Python with the scikit-learn library [27]. Five biomarkers were used as input variables: S100A9, MUC5AC, TGF-β1, angiopoietin-2, and CA19-9. The input variables were transformed into 0 or 1 based on their cut-off values: a value that satisfied the cut-off condition was set to 1; otherwise, it was set to 0. To find the best tree parameters, 5-fold cross-validation with GridSearchCV was used to evaluate the performance of DT on the training dataset. Five parameters were investigated: max_depth, max_features, min_samples_leaf, min_samples_split, and criterion. The list of tree parameters with their candidate values is given in Table S1 (view Supplementary Materials).
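The workflow described above (binarize by cut-off, split 70/30, tune with 5-fold GridSearchCV) might look roughly like the following sketch. Everything numeric here is an assumption for illustration: the data are synthetic, the cut-offs are simply column medians, the candidate parameter values are not those of Table S1, and the stratified split and fixed random seeds are choices made for reproducibility rather than details stated in the text.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for 80 subjects x 5 markers
# (S100A9, MUC5AC, TGF-b1, angiopoietin-2, CA19-9); not study data.
raw = rng.lognormal(mean=3.0, sigma=1.0, size=(80, 5))
y = np.repeat([0, 1], 40)           # 0 = normal, 1 = CCA
raw[y == 1, 4] *= 5.0               # make the last marker informative

# Hypothetical cut-offs; the study derived its own from ROC analysis.
cutoffs = np.median(raw, axis=0)
X = (raw > cutoffs).astype(int)     # 1 if the cut-off condition is met, else 0

# 70% training / 30% testing (stratification is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Illustrative grid over the five parameters named in the text.
param_grid = {
    "max_depth": [1, 2, 3],
    "max_features": [None, "sqrt"],
    "min_samples_leaf": [1, 5],
    "min_samples_split": [2, 10],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)   # accuracy of the refit best tree
```

Tuning only on the training portion keeps the 30% test set untouched until the final performance estimate, which is the design the text describes.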

Statistical Analysis
The statistical analyses were performed using SPSS V.23.0 statistical software (IBM Corporation, Armonk, NY, USA). Data are presented as mean ± SD. The associations between candidate biomarker levels and clinicopathological parameters of CCA patients were analyzed by two independent-samples t-tests. For non-parametric statistics, Mann-Whitney tests were performed. Survival curves for overall survival were plotted by the Kaplan-Meier method, and the log-rank test was used to compare survival distributions between patients with low and high serum levels of each protein. The diagnostic performance of selected proteins was evaluated using ROC curve analysis. The AUC with 95% CI and the Youden index (YI) were calculated, and the optimal cut-off OD values for protein levels were chosen to balance sensitivity and specificity. Odds ratios (OR) were estimated by logistic regression to predict risk. Values of p < 0.05 were considered statistically significant.
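As an illustration of how a cut-off can be chosen with the Youden index (YI = sensitivity + specificity - 1), the sketch below scans each observed value as a candidate threshold and keeps the one with the largest YI. The study performed this analysis in SPSS; the Python functions, the "greater than cut-off is positive" convention, and the marker values here are hypothetical.

```python
def sens_spec(values, labels, cutoff):
    """Sensitivity and specificity when 'value > cutoff' is called positive."""
    tp = sum(1 for v, y in zip(values, labels) if y == 1 and v > cutoff)
    fn = sum(1 for v, y in zip(values, labels) if y == 1 and v <= cutoff)
    tn = sum(1 for v, y in zip(values, labels) if y == 0 and v <= cutoff)
    fp = sum(1 for v, y in zip(values, labels) if y == 0 and v > cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def youden_cutoff(values, labels):
    """Return (cutoff, YI) for the candidate cut-off that maximizes YI.

    Ties are resolved in favor of the smallest cut-off.
    """
    best = None
    for c in sorted(set(values)):
        sn, sp = sens_spec(values, labels, c)
        yi = sn + sp - 1.0
        if best is None or yi > best[1]:
            best = (c, yi)
    return best

# Hypothetical marker levels (not study data): 1 = diseased, 0 = control.
values = [12.0, 15.0, 18.0, 22.0, 30.0, 35.0, 41.0, 50.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff, yi = youden_cutoff(values, labels)
```

With these toy values the selected cut-off is 18.0, where sensitivity is 100% and specificity is 75%, giving YI = 0.75.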

The Validation of Candidate Biomarkers in Sera of CCA Patients
Serum levels of S100A9, MUC5AC, TGF-β1, angiopoietin-2, and CA19-9 were examined in 40 patients diagnosed with CCA, 40 non-CCA subjects, and 40 healthy individuals. The CCA sera in this validation study were obtained from 13 females (33%) and 27 males (67%). The median age of patients was 62 years (range, 39-82 years) (Table 1 and Supplementary data Table S2).
As shown in Figure 1, the average values for each marker in the CCA patients were significantly higher than the normal control group, except for angiopoietin-2. Furthermore, the serum levels of S100A9 and TGF-β1 were significantly greater in the non-CCA group compared with normal subjects. Only CA19-9 levels were considerably higher in CCA patients compared with other GI cancers patients (p = 0.0007).
According to subgroups of non-CCA subjects (view Supplementary Materials; Figure S1), the levels of S100A9 and TGF-β1 were significantly higher in CA gallbladder and liver metastasis when compared with the normal group. Only CA19-9 was significantly higher in CCA subjects when compared with HCC and CA gallbladder patients.

The Correlation between Serum Candidate Biomarkers Level with Clinicopathological Data of CCA Patients
The correlations between each biomarker level and clinicopathological data of CCA patients were evaluated. As shown in Table 2, high angiopoietin-2 and TGF-β1 levels were significantly correlated with late TNM stages (stage III-IV) (OR = 4.846, 5.333; p < 0.05) and only a high TGF-β1 level was associated with metastasis (OR = 3.467; p = 0.024) and lymph node metastasis (OR = 3.111). In contrast, there was no significant correlation of S100A9, MUC5AC, and CA19-9 levels with any clinicopathological variables.
The symbol (*) indicates a statistically significant p value < 0.05, as determined by 2 independent sample t-tests. a Odds ratios (OR) were analyzed to demonstrate the association of serum levels of candidate biomarkers with clinicopathological variables. The symbol (-) indicates a reference variable and symbol (**) denotes a statistically significant p value < 0.05, as analyzed by logistic regression.

The Overall Survival Analysis of Candidate Biomarkers in Sera of CCA Patients
The overall survival (OS) analysis by the Kaplan-Meier method with a log-rank test revealed that only CCA patients with a high level of angiopoietin-2 showed a trend of shorter survival times than those with a low angiopoietin-2 level (p = 0.083). The mean overall survival times for low and high levels of angiopoietin-2 in CCA patients were 330 and 219 days, respectively (Figure 2). Nevertheless, the combined biomarker panel could predict poor prognosis in CCA patients. The OS analysis indicated that CCA patients with high levels of four to five combined biomarkers had significantly shorter survival times than those with other levels of markers (green and yellow lines, p = 0.018). The mean overall survival time in each group was 445 days for all low markers, 257 days for one high marker, 317 days for two high markers, 478 days for three high markers, 280 days for four high markers, and 49 days for five high markers (view Supplementary Materials; Figure S2).
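For readers unfamiliar with the Kaplan-Meier method used above, the product-limit estimate can be sketched as follows. This is a minimal illustration, not the SPSS procedure used in the study; the function name and the follow-up times are hypothetical, and the log-rank comparison between groups is not shown.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up in days; events: 1 = death observed, 0 = censored.
    Returns a list of (time, S(time)) at each distinct time with a death.
    """
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)   # deaths plus censored at t
        if deaths:
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve

# Hypothetical follow-up data (not study data).
times = [60, 120, 200, 200, 300, 400]
events = [1, 1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
```

Censored subjects reduce the number at risk without lowering the survival estimate, which is why the curve only steps down at times where a death was observed.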

The Analysis of Candidate Biomarkers as Potential Prognostic Biomarkers in CCA Patients
In this study, the prognostic biomarkers included all candidate biomarkers to monitor the prognosis in CCA patients. The results showed that only serum TGF-β1 levels were significantly different between non-metastatic and metastatic CCA patients (p = 0.011) ( Figure 5A). Moreover, ROC analysis showed that TGF-β1 could be used to differentiate metastasis from non-metastasis with a cut-off of 48.8 ng/mL, which resulted in 44% for sensitivity and 91% for specificity (p = 0.012, AUC = 0.700, YI = 0.35) ( Figure 5C, view Supplementary Materials; Table S3).

Figure 5. Serum levels of candidate prognostic-biomarkers in CCA patients according to metastasis (A) and TNM stages (B). Column bar graph represents mean ± standard deviation (SD). Receiver operating characteristic curve (ROC) analysis of candidate biomarkers for prognosis in CCA, TGF-β1 for prediction of metastasis (C), TGF-β1 and angiopoietin-2 for prediction of TNM stages (D), combined TGF-β1 and angiopoietin-2 for prediction of TNM stages (E). The p value < 0.05 was considered statistically significant.
For TNM stages, the results showed that serum TGF-β1 and angiopoietin-2 levels were significantly different between these groups (p < 0.05) ( Figure 5B). ROC analysis showed that TGF-β1 could be used to predict late TNM stages with a cut-off of 43.6 ng/mL at 51% for sensitivity and 91% for specificity (p = 0.012, AUC = 0.748, YI = 0.42). Moreover, angiopoietin-2 level at cut-off 1457 pg/mL could predict severe stages of CCA in patients at 81% sensitivity and 78% specificity (p = 0.020, AUC = 0.758, YI = 0.59) ( Figure 5D, view Supplementary Materials; Table S3). The combination of TGF-β1 and angiopoietin-2 level could improve the prognostic power to determine severe CCA stage in patients as shown in Figure 5E, view Supplementary Materials; Table S3 (32% sensitivity and 100% specificity, p = 0.002, AUC = 0.842, YI = 0.002).
According to OR values, at cut-off of 48.8 ng/mL, the TGF-β1 level was a significant predictor to determine metastatic status (OR crude = 7.43, OR adjusted = 11.29; p = 0.017, 0.012, respectively). Furthermore, either TGF-β1, angiopoietin-2, or combined markers could serve as effective predictors for prognosis of severe cancer stage(s) in CCA patients (view Supplementary Materials; Table S4).

Decision Tree Construction and Their Diagnostic Performance for CCA Biomarkers Panel
Three models for binary classification problems, called DT I (normal vs. CCA), DT II (normal vs. non-CCA), and DT III (non-CCA vs. CCA), were built in our study. After performing five-fold cross-validation with GridSearchCV, the best tree parameters for each model were obtained and are shown in Table 5. The DT diagrams for DT I, DT II, and DT III are shown in Figure 6. The circles depict the selected biomarkers with their cut-off conditions. The rectangles show the predicted class label with the percentage of correctly classified subjects in the training dataset. The performance measures of each of the five single biomarkers and of the DT models for the three kinds of diagnosis, computed from the confusion matrices on the test dataset, were compared (Table 6).
To differentiate CCA patients from the normal population, a DT model (DT I) was constructed, containing a hierarchical structure of CA19-9 and S100A9 (Figure 6A). Three classification rules were obtained from DT I; the number of classification rules corresponds to the number of rectangles. From Figure 6A, subjects were labeled as CCA patients with 100% correct classification if CA19-9 was > 37 U/mL. Otherwise, S100A9 was applied to differentiate CCA from normal cases. Using S100A9 < 197.9 ng/mL as a cut-off, subjects were labeled as normal cases with 96% correct classification. The diagnostic performance of the five single biomarkers and DT I in the testing dataset is shown in Table 6 (normal vs. CCA). Interestingly, DT I gave the highest values (highlighted in bold typeface) in four performance measures, SN, YI, NPV, and ACC, and the second-highest value (identified by an underline) in PPV, compared to the single markers.
Table 5. The best tree's parameters for each decision tree (DT) model after performing five-fold cross validation method with GridSearchCV criterion.
The bold indicates the highest value, while the underline indicates the second-highest value in each category of biomarker diagnostic performance. Abbreviations; SN = sensitivity; SP = specificity; YI = Youden index; PPV = positive predictive value; NPV = negative predictive value; ACC = accuracy.
To distinguish the non-CCA group from the normal group, the resulting DT model, DT II, with a hierarchical structure of angiopoietin-2, TGF-β1, and S100A9, is shown in Figure 6B. All non-CCA cases were discriminated by serum angiopoietin-2 values < 1312 pg/mL in the control group (correctly classified 83.33% in the training dataset). In comparison, a serial decision on TGF-β1 and S100A9 based on their cut-off conditions achieved 87.5% correct classification. In Table 6 (normal vs. non-CCA), DT II gave the highest values (bold typeface) in two performance measures, YI and ACC, and the second-highest values (identified by an underline) in SN and NPV, compared to the single markers.
Finally, the DT III model suggested a serial decision involving TGF-β1 and CA19-9 to distinguish CCA patients from non-CCA subjects (Figure 6C). Using a serum TGF-β1 value > 39.9 ng/mL and a CA19-9 value greater than 37 U/mL, CCA cases could be discriminated from non-CCA cases with 82% correct classification. Using TGF-β1 less than or equal to 39.9 ng/mL and angiopoietin-2 > 1008 pg/mL as cut-offs, non-CCA cases were identified with 100% correct classification. In addition, MUC5AC could be used to discriminate CCA from non-CCA at a cut-off value higher than 90.5 ng/mL, providing 83% correct classification. Moreover, in Table 6 (non-CCA vs. CCA), DT III gave the highest values (bold typeface) in three performance measures, YI, PPV, and ACC, and the second-highest values (noted by an underline) in SN and SP, compared to the single markers.
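Because a fitted tree reduces to a small rule set, DT I (normal vs. CCA) can be written as a plain function using the two cut-offs reported in the text, and the performance measures listed for Table 6 can be computed from a confusion matrix. This is a sketch under assumptions: the label of the third leaf (CA19-9 not above 37 U/mL and S100A9 not below 197.9 ng/mL) is not spelled out in the text and is assumed here to be CCA, and the subjects are hypothetical.

```python
def classify_dt1(ca19_9, s100a9):
    """Rule-set sketch of DT I using the cut-offs reported in the text."""
    if ca19_9 > 37.0:        # U/mL
        return "CCA"
    if s100a9 < 197.9:       # ng/mL
        return "normal"
    # Third leaf: not stated explicitly in the text; assumed to be CCA.
    return "CCA"

def performance(y_true, y_pred, positive="CCA"):
    """SN, SP, YI, PPV, NPV and ACC from a 2x2 confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    sn, sp = tp / (tp + fn), tn / (tn + fp)
    return {"SN": sn, "SP": sp, "YI": sn + sp - 1,
            "PPV": tp / (tp + fp), "NPV": tn / (tn + fn),
            "ACC": (tp + tn) / len(pairs)}

# Hypothetical subjects: (CA19-9 in U/mL, S100A9 in ng/mL, true label).
subjects = [(120.0, 150.0, "CCA"), (20.0, 250.0, "CCA"), (10.0, 100.0, "CCA"),
            (15.0, 120.0, "normal"), (30.0, 90.0, "normal"),
            (25.0, 210.0, "normal")]
y_true = [s[2] for s in subjects]
y_pred = [classify_dt1(s[0], s[1]) for s in subjects]
metrics = performance(y_true, y_pred)
```

The second hypothetical subject shows the case the tree is meant to catch: CA19-9 is low, but an elevated S100A9 still leads to a CCA label.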

Discussion
We aimed to validate and evaluate already existing potential biomarkers for their applications in a diagnostic approach for CCA detection from healthy and related-GI cancers by using supervised learning algorithms. Based on previous findings, serum TGF-β1 alone could diagnose CCA patients at a cut-off of 38.54 ng/mL with adequate sensitivity and specificity. Interestingly, TGF-β1 combined with alkaline phosphatase (ALP), the routine liver biomarker, might provide a more efficient diagnosis of the disease given improved sensitivity and specificity [19]. Hence, only one biomarker test might not be appropriate to correctly diagnose the disease. Nevertheless, a comparatively limited number of studies have tested many blood-based biomarkers for CCA diagnosis. Thus, we must attempt to find the combination of biomarkers that might boost the capacity to diagnose CCA patients effectively.
The S100 protein family comprises a group of small acidic calcium-binding proteins, of which S100A8 and S100A9 are two major members. S100A9 has emerged as an effective proinflammatory mediator in acute and chronic inflammation and can play a critical role in inflammation-associated cancer [30]. Many studies have found that the serum level of S100A9 is significantly increased in many types of cancer and in benign biliary diseases (BBD) [31][32][33]. These findings established that S100A9 was a promising diagnostic biomarker with 78% sensitivity, 88% specificity, and a 0.888 AUC value, values comparable to those reported for the differential diagnosis of CCA and normal controls [15]. When S100A9 was combined with CA19-9 to enhance diagnostic efficiency, the sensitivity increased from 78% for S100A9 alone to 95% for the two markers. Impressively, S100A9 provides a diagnostic yield of 95% in CCA patients with low CA19-9 levels. These results suggest the potential diagnostic usefulness of S100A9 in combination with CA19-9 or in cases in which the CA19-9 level is normal or low.
MUC5AC is a high molecular weight O-glycosylated glycoprotein member of the membrane-bound and secreted epithelial mucin family. It is the most studied mucin with high potential as a biomarker for CCA [34]. We have shown that the serum levels of MUC5AC are greater in CCA patients than in healthy subjects, and among two-marker combinations, MUC5AC with CA19-9 obtained the highest AUC value for differentiating CCA from GI tract cancer patients. Serum MUC5AC is a highly specific tumor-associated mucin that could be helpful in the diagnosis and development of therapeutic strategies for biliary tract cancer (BTC), as supported by a previous study [35]. In addition, the BTC tumor biopsies of most patients have demonstrated a high MUC5AC reactivity, suggesting that the tumor-associated MUC5AC antigen is shed into the blood, where it can be detected [36]. Serum extracellular vesicles (EVs) are currently regarded as a promising source of clinically beneficial biomarkers that could increase cancer detection sensitivity and specificity. Arbelaiz et al. [37] discovered a new potential biomarker in serum EVs of CCA, primary sclerosing cholangitis (PSC), and HCC patients. According to their report, CCA-derived EVs contain oncogenic proteins, including mucin and S100 family proteins, which have a high differential diagnostic capacity for CCA [37]. Thailand is an endemic area of liver fluke infection, which is the major cause of the CCA burden in our region. As a result, serum EVs from patients with liver fluke-related CCA should be examined further in a prospective study, as they could provide possible EV-derived biomarkers. This may have contributed to S100A9 and MUC5AC being reliable diagnostic markers in the CCA biomarker panel.
Most of the mortality of CCA patients results from poor prognosis; therefore, prognostic markers are needed to follow up treatment outcomes after resection and to predict which patients will benefit from treatment. TGF-β1, a multifunctional polypeptide with potent effects, has serum levels with diagnostic and prognostic potential, as confirmed in our previous CCA studies [19]. In this study, the serum TGF-β1 level appeared to have less diagnostic power to differentiate CCA from the control group, given its unsatisfactory sensitivity and specificity values. However, our study revealed that serum TGF-β1 could serve as a significant prognostic biomarker for monitoring metastasis and severe tumor stages of CCA.
Many studies have shown that elevated TGF-β1 levels are significantly associated with metastasis and poor prognosis in many cancers [38][39][40], as TGF-β1 can modulate the metastatic potential of tumor cells by regulating their ability to break down and infiltrate barriers of the basement membrane [41]. In CCA cell lines, the metastatic role of TGF-β1 was shown to effectively induce CCA cell migration by activation of the expression of Twist, N-cadherin and vimentin [42]. These results suggest that a possible prognostic biomarker for monitoring pathological conditions in patients with CCA may be TGF-β1.
In general, angiopoietin-2, an endothelial cell-specific angiogenic growth factor, has been used as an angiogenesis-related biomarker of various types of tumors, but its expression and function in CCA have not been thoroughly examined. According to our current study, angiopoietin-2 alone was not one of the best potential diagnostic biomarkers, in contrast to a previous study which revealed that the serum angiopoietin-2 level can be useful for differentiating CCA from primary sclerosing cholangitis (PSC) with an acceptable AUC value [20]. One difference is that we did not include a BBD population such as PSC patients. We only compared angiopoietin-2 levels in CCA with those in normal and non-CCA groups, whereas the previous research examined only patients with CCA, PSC, and bile duct stones [20]. The different populations studied may explain the discrepancies between studies in the diagnostic outcomes of angiopoietin-2 in CCA diagnosis. However, a high level of angiopoietin-2 was associated with a trend toward shorter survival time and predicted severe cancer stages with an adjusted OR of 23.22. Preliminary studies indicated a potential role for angiopoietin-2 as a prognostic factor in cancers, for instance breast cancer [43], lung cancer [44], and HCC [45], not only by inducing angiogenesis but also by encouraging metastasis via the α5β1 integrin/integrin-linked kinase (ILK)/Akt, GSK-3β/Snail/E-cadherin signaling pathway [46]. Additionally, the combination of TGF-β1 and angiopoietin-2 could strongly predict the relative risk of poor prognosis in severe cancer stages in our study. This co-occurrence could be explained by the fact that tumor angiogenesis is regulated by a network of growth factors, including members of the TGF-β family [47] and angiogenic inducers [48]. However, in-depth studies on the roles of these biomarkers in CCA genesis are required.
The CART algorithm is based on Classification and Regression Trees by Breiman et al. [26]. A CART tree is a binary decision tree that is built by repeatedly splitting a node into two child nodes, starting with the root node that holds the entire learning sample. In this analysis, the algorithm was used as a classifier because it provides a set of rules that describe the relationship between the inputs (the candidate biomarkers) and the output, a diagnostic outcome of normal, non-CCA, or CCA. The DT diagram built in Python with the scikit-learn library provides physicians with an easy and practical guideline to diagnose CCA patients without depending on any additional computers or other devices.
In our study, DT was preferred to the artificial neural network because its performance reached the defined goal and because CART provides a logical rule set that is convenient for medical use. For the training set, five-fold cross-validation with GridSearchCV was employed to evaluate the performance of DT and to select the best tree parameters. This provided three models consisting of various candidate biomarkers with varying accuracy, which was greater than that achieved by any single biomarker. Of these, for normal versus CCA, the two markers CA19-9 and S100A9 provided better diagnostic power than the other multi-marker combinations and than any single marker. The diagnostic power of these two markers was further validated in the testing set, where they showed the best diagnostic power to discriminate CCA from the healthy group.
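The tuning procedure described above can be sketched as follows, assuming a stratified five-fold split and accuracy as the selection score. The parameter grid shown here is hypothetical (the candidate values actually used are listed in Table S1), the 70/30 train/test split is an assumption, and the data are again synthetic.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, illustrative values only (NOT data from this study).
rng = np.random.default_rng(1)
n = 80
X = np.vstack([
    np.column_stack([rng.normal(15, 5, n), rng.normal(20, 5, n)]),    # normal
    np.column_stack([rng.normal(120, 30, n), rng.normal(60, 10, n)]), # CCA
])
y = np.array([0] * n + [1] * n)

# Hold out a testing set; the split ratio is an assumption for this sketch.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Hypothetical candidate values; the study's own grid is given in Table S1.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [1, 2, 3],
    "min_samples_leaf": [1, 5, 10],
}

# Five-fold cross-validation over every parameter combination.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="accuracy",
)
search.fit(X_train, y_train)

# The best tree is refit on the full training set and scored on held-out data.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_, test_accuracy)
```

Keeping the testing set out of the grid search means the reported test accuracy is not influenced by the parameter selection step.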
The challenging aspect of CCA diagnosis is to reliably distinguish CCA from other gastrointestinal cancers that share a similar pathophysiology. According to the results for the CCA versus non-CCA groups, we found that the integrated and combined analysis of novel candidate biomarkers (TGF-β1, CA19-9, angiopoietin-2, and MUC5AC) tended to improve CCA diagnosis, with an adequate sensitivity of 82% and specificity of 92%. This is similar to the findings of Pattanapairoj and co-workers, who showed that a potential classification model consisting of CCA-CA and ALP could differentiate CCA from non-CCA [49]. Moreover, Negrini et al. studied the efficiency of machine learning models based on plasma bile acid profiles and reported that the Naïve Bayes model demonstrated improved diagnostic efficiency for differentiating patients with CCA from those with BBD [50]. In terms of diagnostic capacity, however, the classification output of this DT model (CCA versus non-CCA) was still not satisfactory. Further research is needed to explore potential biomarker panels that differentiate CCA from other cancers, especially HCC. Recently, Jamnongkan et al. first identified glycoform patterns of serotransferrin in CCA serum by a glycoproteomic method. The results revealed that serotransferrin glycoform 6503, a highly sialylated glycoform, could be used to differentiate CCA from HCC patients [51]. That study could provide new insight for the discovery of novel glyco-biomarkers for CCA diagnosis.
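For reference, sensitivity and specificity of a binary classifier follow directly from the confusion matrix. The short sketch below shows the calculation on hypothetical labels (1 = CCA, 0 = non-CCA); the label vectors are invented for illustration and do not reproduce any result from this study.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = CCA, 0 = non-CCA).
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

# With labels=[0, 1], ravel() returns counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)  # proportion of CCA cases correctly detected
specificity = tn / (tn + fp)  # proportion of non-CCA cases correctly excluded

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```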
Our study has several notable features for the diagnosis of CCA. Firstly, we recruited patients with GI-related cancers, including hepatocellular carcinoma, CA gallbladder, CA pancreas, and liver metastasis, which have pathological conditions similar to those of CCA. Thus, the real diagnostic performance of the test is reflected more accurately than when cancer subjects are compared with normal controls alone. Secondly, our study focused on the finding that the candidate biomarkers are correlated with the clinicopathological data of CCA patients. Thirdly, our study was based on serum ELISA analysis, which measures biomarker concentrations in real quantitative units. No previous studies have investigated a panel of these biomarkers by quantitative analysis. Lastly, this study provided a new DT algorithm that offers physicians an easy and practical guideline as a potential workflow to diagnose CCA patients effectively. However, the study still has some limitations. Compared with the number of individuals in the CCA group, the number in each non-CCA subgroup is small. Our findings should be further confirmed by prospective trials with larger sample sizes involving patients with malignant GI diseases, and the biomarker panel could be validated in an external bank of liquid biopsies to support the conclusions of this report.

Conclusions
The present study suggests the efficacy of combined biomarker analysis for CCA diagnosis. The DT algorithm was used to establish a CCA biomarker panel that could distinguish CCA patients from healthy people, with a panel consisting of CA19-9 followed by S100A9 having the highest diagnostic power. In CCA patients with low CA19-9 levels, S100A9 was especially useful and could be used as a complementary marker to provide greater diagnostic yield. For distinguishing CCA from other GI cancers, the results showed that a potent serial biomarker model was obtained with CA19-9, followed by MUC5AC and TGF-β1. Moreover, the set of two markers, TGF-β1 and angiopoietin-2, provided effective prognoses in CCA patients with metastasis and severe cancer stages. Our results strengthen the value of a blood-based biomarker panel for the diagnosis and prognosis of CCA and disclose a DT classification model that can be used as an effective tool for CCA diagnosis.
Supplementary Materials: The following are available online at https://www.mdpi.com/2075-4418/11/4/589/s1. Figure S1: Serum levels of S100A9 (A), MUC5AC (B), TGF-β1 (C), angiopoietin-2 (D), and CA19-9 (E) in the normal control group, non-CCA group, and CCA patients. The non-CCA group includes hepatocellular carcinoma (HCC), CA gallbladder, CA pancreas, and liver metastasis patients. Scatter plots represent mean ± standard deviation (SD). A p value < 0.05 was considered statistically significant for comparisons between groups; Figure S2: Overall survival analysis according to the Kaplan-Meier method with a log-rank test, calculated for combined biomarkers with survival rate in CCA patients. A p value < 0.05 was considered statistically significant; Table S1: List of tree parameters with the sets of their candidate values; Table S2: The characteristics of CCA patients; Table S3: Predictive values of serum TGF-β1 and angiopoietin-2 levels for the prognosis of metastasis and TNM stages in CCA patients, based on the optimal cut-off derived from ROC analysis and YI calculation; Table S4: Predictive risk of metastasis and TNM stages according to serum levels of TGF-β1 and angiopoietin-2.