An Integrative Data Mining and Omics-Based Translational Model for the Identification and Validation of Oncogenic Biomarkers of Pancreatic Cancer

Substantial alterations at the multi-omics level of pancreatic cancer (PC) impede the possibility to diagnose and treat patients in early stages. Herein, we conducted an integrative omics-based translational analysis, utilizing next-generation sequencing, transcriptome meta-analysis, and immunohistochemistry, combined with statistical learning, to validate multiplex biomarker candidates for the diagnosis, prognosis, and management of PC. Experiment-based validation was conducted and supportive evidence for the essentiality of the candidates in PC were found at gene expression or protein level by practical biochemical methods. Remarkably, the random forests (RF) model exhibited an excellent diagnostic performance and LAMC2, ANXA2, ADAM9, and APLP2 greatly influenced its decisions. An explanation approach for the RF model was successfully constructed. Moreover, protein expression of LAMC2, ANXA2, ADAM9, and APLP2 was found correlated and significantly higher in PC patients in independent cohorts. Survival analysis revealed that patients with high expression of ADAM9 (Hazard ratio (HR)OS = 2.2, p-value < 0.001), ANXA2 (HROS = 2.1, p-value < 0.001), and LAMC2 (HRDFS = 1.8, p-value = 0.012) exhibited poorer survival rates. In conclusion, we successfully explore hidden biological insights from large-scale omics data and suggest that LAMC2, ANXA2, ADAM9, and APLP2 are robust biomarkers for early diagnosis, prognosis, and management for PC.


Introduction
Pancreatic cancer (PC) is the seventh leading cause of cancer-related death worldwide, with less than seven percent of patients surviving after five years from the initial diagnosis [1].Early diagnosis at a curable stage of PC is exceptionally challenging.Resection surgery may increase the 5-year survival rate to 20%, but only 10-20% of the patients are eligible for this procedure [2].The symptoms of PC are usually not apparent due to its location deep inside the abdomen, and there are no current biomarkers with sufficient sensitivity and specificity for accurate diagnosis [3].The cut-off level of carbohydrate antigen (CA) 19-9 for differentiating malignant and benign tumors is still controversial due to unconvincing diagnostic values.Additionally, the sensitivity of CA 19-9 drops dramatically in early and small tumors [4].The lack of effective screening strategies and the paucity of powerful systematic therapies are also significant causes of PC fatality [5].Collectively, novel methods for early detection and effective management of PC are urgently needed.
Only a small fraction of an enormous number of cancer biomarkers has been successfully implemented in the clinical practice.Cancer is characterized by tumor heterogeneity and a variety of molecular biosignatures; this calls for a more effective strategy of simultaneously combining a set of biomarkers [6].A study has been reported on the improvement in diagnostic sensitivity of cancer using panels of combined biomarkers rather than individual molecules [7].In PC, novel approaches applying methylated DNAs, circulating tumor DNAs, miRNAs, proteins, and metabolites have been developed and used with CA 19-9 in an attempt to enhance diagnostic and prognostic performance [8][9][10].For instance, the sensitivity and specificity of the combined CA 19-9 and serum cytokines panel for differentiating PC from the normal group were 85.7% and 92.3%, respectively, and those of the combined CA19-9, ICAM-1, and OPG panel were 88% and 90%, respectively [11,12].However, it is worth mentioning that several confounding factors may affect the establishment of a cancer biomarker panel.The samples for research on PC usually stem from patients with the advanced disease, as most patients are diagnosed at late stages, whereas biomarkers are expected to be for early, curable stages [13].Additional clinical parameters, notably jaundice, are also worth considering when recruiting PC samples for biomarker studies [14].
Recent large-scale gene expression data studies have been conducted to shed light on molecular characterization and to provide a system-level view of human cancers, with a minor focus on PC [15,16].Remarkably, a compendium including 2516 genes that showed changes in expression levels associated with PC was investigated, which accounted for nearly 13% of the known human coding genes [17].However, the huge data load makes it impossible to translate the results into clinical applications.In this context, a promising and practical set derived from this database is needed.In the current study, we applied high-throughput next-generation sequencing (NGS)-based gene expression analysis, transcriptome meta-analysis, various data mining techniques with the previously published compendium and databases, and immunohistochemistry (IHC) on various cohorts for candidate validation to introduce multiplex signatures capable of differentiating PC tissue from the normal tissue.A powerful statistical learning technique and easy-to-comprehend explanation approach, using local interpretable model-agnostic explanations algorithm, were also utilized to enhance the diagnostic practice.The impact of the proposed biomarkers on patients' survival was examined using The Cancer Genome Atlas (TCGA) cohort.Ultimately, an exceptional panel of biomarkers was introduced and initially validated.We also aimed to systematically assess the most promising therapeutic targets for PC.Finally, the associated biological and functionally altered signaling cascades of the introduced biosignatures in PC carcinogenesis have been discussed.

Integrative Screening Approach Suggests 23 Novel Biomarker Candidates for the Diagnosis and Management of Pancreatic Cancer
We conducted an effect size meta-analysis from two datasets to detect differentially expressed genes.Dataset GSE16515 included 36 cancerous samples and 16 adjacent normal tissues.Dataset GSE28735 contained 45 matching pairs of pancreatic cancer and adjacent non-tumor tissues.Table S1 lists the characteristics of all utilized microarray data.We identified 3915 significantly upregulated and 4007 significantly downregulated genes in the PC tissue compared to normal tissue.It is worthy to note that we targeted upregulated genes in an effort to identify oncogenic biomarkers for the diagnosis, prognosis, and treatment of PC.It turned out that less than 7% (266 out of 3915) of overexpressed genes demonstrated a combined effect size (cES) greater than or equal to 1.5, the selection criterion for potential candidates (Figure 1a).In parallel, we utilized an exhaustive compendium of PC biomarkers collected by Harsha et al. as the filtering resource [17].The genes coding for secretory proteins and membrane proteins overexpressed in PC, precursor lesions, the stroma associated with PC, and other subtypes of PC, but not in pancreatitis, were selected.Of note, we tried to limit our study to genes that are only overexpressed in PC but not in chronic pancreatitis, which are also those having potential to distinguish PC from chronic pancreatitis, since it may be difficult to distinguish chronic pancreatitis from PC in some cases, as there have been no excellent biomarkers to distinguish these two conditions [17,18].Given these criteria, 95 secretory and 114 membrane protein-coding genes from the compendium fulfilled the criteria.Based on the Venn diagram analysis of the candidates from the meta-analysis and the compendium, five secretory (AGR2, GAPDH, LAMC2, MMP11, and TAGLN2), 12 membrane (CD82, CLDN18, EPHA2, EZR, FXYD3, GPRC5A, ITGA2, ITGB6, MET, MST1R, NQO1, and SLC2A1), and six "dual" (ADAM9, ANXA2, APLP2, CDH3, MSLN, and SERPINB5) protein-coding genes were selected for further analysis (Figure 1b,c).
Cancers 2019, 11, x 3 of 22 lists the characteristics of all utilized microarray data.We identified 3915 significantly upregulated and 4007 significantly downregulated genes in the PC tissue compared to normal tissue.It is worthy to note that we targeted upregulated genes in an effort to identify oncogenic biomarkers for the diagnosis, prognosis, and treatment of PC.It turned out that less than 7% (266 out of 3915) of overexpressed genes demonstrated a combined effect size (cES) greater than or equal to 1.5, the selection criterion for potential candidates (Figure 1a).In parallel, we utilized an exhaustive compendium of PC biomarkers collected by Harsha et al. as the filtering resource [17].The genes coding for secretory proteins and membrane proteins overexpressed in PC, precursor lesions, the stroma associated with PC, and other subtypes of PC, but not in pancreatitis, were selected.Of note, we tried to limit our study to genes that are only overexpressed in PC but not in chronic pancreatitis, which are also those having potential to distinguish PC from chronic pancreatitis, since it may be difficult to distinguish chronic pancreatitis from PC in some cases, as there have been no excellent biomarkers to distinguish these two conditions [17,18].Given these criteria, 95 secretory and 114 membrane protein-coding genes from the compendium fulfilled the criteria.Based on the Venn diagram analysis of the candidates from the meta-analysis and the compendium, five secretory (AGR2, GAPDH, LAMC2, MMP11, and TAGLN2), 12 membrane (CD82, CLDN18, EPHA2, EZR, FXYD3, GPRC5A, ITGA2, ITGB6, MET, MST1R, NQO1, and SLC2A1), and six "dual" (ADAM9, ANXA2, APLP2, CDH3, MSLN, and SERPINB5) protein-coding genes were selected for further analysis (Figure 1b,c).

Literature-and Experiment-Based Validation of the 23 Selected Candidates
Published papers and data submitted to public repositories were extensively mined for literaturebased validation.The overexpression of the 23 candidates in pancreatic ductal adenocarcinoma (PDAC) was experimentally identified and validated by multiple techniques, such as microarray, quantitative proteomics, and antibody-based methods.According to the results, each gene was overexpressed with a change greater than two-fold at least once in PC or PC cell lines at the mRNA or protein level.For instance, ANXA2 was preferably reported to be overexpressed in PDAC tissue in the mRNA analysis using DNA Microarray and in the protein analyses using IHC, isotope-coded affinity tag, western blot, mass spectrometry, two-dimensional electrophoresis, and label-free quantitation mass spectrometry.
We further conducted an NGS gene expression analysis on a cohort containing six PDAC samples and six controls for experiment-based validation.Despite the heterogeneity and other confounding factors, the NGS result was promising, showing 17 out of 23 consistently overexpressed genes with a cut-off of two-fold change (log FC = 1.00), and 14 out of 23 genes using log FC = 1.50.Remarkably, SERPINB5 was more than 250 times overexpressed in the PC tissue compared to normal tissue (log FC = 8.35).LAMC2, CDH3, and ANXA2 exhibited the altered expression difference with log FC of 3.91, 2.72, and 1.16, respectively.Collectively, the experimental and in silico analyses largely supported our 23 selected candidates as reliable biomarkers for differentiating PC tissue from normal tissue.
The list of techniques for gene expression, identification, and validation, cES, and gene expression in premalignant pancreatic lesions are summarized in Table 1.

ADAM9, ANXA2, ITGA2, MET, and LAMC2 Exhibit Substantial Impact on the Survival of PC Patients
The Kaplan-Meier (KM) survival analysis and the multivariate Cox regression analysis were conducted based on the available data from TCGA pancreatic cancer cohort.The KM survival analysis revealed that PC patients with high expression levels of 17 of the candidate genes exhibited a poorer overall survival (OS) or disease-free survival (DFS).Particularly, patients overexpressing ADAM9, ANXA2, ITGA2, or MET, had both poorer OS and DFS rates than those with low gene expression.Figure 2 shows the KM plots of the OS for ADAM9, ANXA2, ITGA2, and MET expression.Of note, the HR was greater than 2 for both analyses, except for ITGA2 that had an HR of 1.9 for the DFS analysis.LAMC2 exhibited a significantly poor DFS (HR = 1.8, p-value = 0.012).A multivariate Cox regression analysis of the TCGA cohort was adopted to evaluate the impact of the 23 genes in patient survival.The analysis revealed that LAMC2, ADAM9, ANXA2, CDH3, SERPINB5, EPHA2, GPRC5A, ITGA2, ITGB6, and MET were likely associated with the worse outcome of PC with FDR < 0.05.The detailed information regarding Cox regression survival analysis of these genes is described in Table 2 and Figure S1.

ADAM9, ANXA2, ITGA2, MET, and LAMC2 Exhibit Substantial Impact on the Survival of PC Patients
The Kaplan-Meier (KM) survival analysis and the multivariate Cox regression analysis were conducted based on the available data from TCGA pancreatic cancer cohort.The KM survival analysis revealed that PC patients with high expression levels of 17 of the candidate genes exhibited a poorer overall survival (OS) or disease-free survival (DFS).Particularly, patients overexpressing ADAM9, ANXA2, ITGA2, or MET, had both poorer OS and DFS rates than those with low gene expression.Figure 2 shows the KM plots of the OS for ADAM9, ANXA2, ITGA2, and MET expression.
Of note, the HR was greater than 2 for both analyses, except for ITGA2 that had an HR of 1.9 for the DFS analysis.LAMC2 exhibited a significantly poor DFS (HR = 1.8, p-value = 0.012).A multivariate Cox regression analysis of the TCGA cohort was adopted to evaluate the impact of the 23 genes in patient survival.The analysis revealed that LAMC2, ADAM9, ANXA2, CDH3, SERPINB5, EPHA2, GPRC5A, ITGA2, ITGB6, and MET were likely associated with the worse outcome of PC with FDR < 0.05.The detailed information regarding Cox regression survival analysis of these genes is described in Table 2 and Figure S1.The role of 11 secretory protein-coding genes for differentiating PC tissue from normal tissue was further tested.We visualized the expression profiling of the 11 genes by means of boxplots.For instance, the expression levels of two promising candidates, CDH3 and LAMC2, were significantly higher in PDAC compared to normal tissue (Figure 3a,b).Boxplots of the remaining potential genes can be found in Figure S2.As illustrated in Figure 3c, two classes of PC and normal tissues showed a relatively separated tendency in the principal component analysis.However, there was still an overlapped region between them, making grouping of those cases more challenging.A heatmap analysis further provided a general overview of relative differences in gene expression between PC and normal tissue (Figure 3d).As expected, in general, the 11 candidates exhibited a higher level of expression in PC tissues than in normal tissue.The dataset was then randomly divided into the training set (70%) and test set (30%) for supervised RF model calibration and validation.RF classification was conducted to build the prediction model based on 11 potential genes.The number of randomly chosen variables at each split ranged from one to 11, and the data were cross-validated tenfold to find the best tuning parameter, which was repeated five times.The optimal RF model (mtry of two) in our trial revealed an excellent performance with an area under the ROC curve of 0.96 in the training set.The model validated in the test set also exhibited promising results with only one misclassified case, reporting accuracy of 0.96, sensitivity of 1.00, and specificity of 0.93 (Figure 3e).Of note, LAMC2, ANXA2, APLP2, and ADAM9 were the most important variables of the model, although other genes facilitated the decision (Figure 3f).

Immunohistochemistry Analysis Reveals High Protein Expression Levels of LAMC2, ADAM9, ANXA2, and APLP2 in Pancreatic Cancer
We assessed the expression of LAMC2, ADAM9, and ANXA2 at the protein level using tissue microarray samples of 86 patients and normal controls.They were chosen due to their roles in the diagnostic model, prognostic impact, and profound validated evidence.Typical immunostaining is shown in Figure 4a, and the cohort characteristics are given in Table S2.Individual proteins showed moderate to strong immunostaining in most PC tissues, and weak or negative immunostaining in normal tissues; the differences are statistically significant (Figure 4b, p-value < 0.001).LAMC2, ADAM9, and ANXA2 exhibited moderate correlations with their protein levels according to

Immunohistochemistry Analysis Reveals High Protein Expression Levels of LAMC2, ADAM9, ANXA2, and APLP2 in Pancreatic Cancer
We assessed the expression of LAMC2, ADAM9, and ANXA2 at the protein level using tissue microarray samples of 86 patients and normal controls.They were chosen due to their roles in the diagnostic model, prognostic impact, and profound validated evidence.Typical immunostaining is shown in Figure 4a, and the cohort characteristics are given in Table S2.Individual proteins showed moderate to strong immunostaining in most PC tissues, and weak or negative immunostaining in normal tissues; the differences are statistically significant (Figure 4b, p-value < 0.001).LAMC2, ADAM9, and ANXA2 exhibited moderate correlations with their protein levels according to Spearman's pairwise correlation (Figure 4c, Figures S4 and S5).The protein expression of LAMC2, ADAM9, and ANXA2 between PC and adjacent tissues (eight paired samples) presented similar results, although only ADAM9 showed statistical significance (Figure S5, p-value = 0.031).The protein expression of APLP2 was also investigated in a cohort comprising 64 PC and 64 adjacent normal tissues (Table S3).As shown in Figure S6, the expression of APLP2 was significantly higher in PC tissues (p-value = 0.0139), making them valuable for further investigation.The Pearson's correlation analysis of gene expression between LAMC2-ADAM9 (R = 0.71, p-value < 0.001), ANXA2-LAMC2 (R = 0.75, p-value < 0.001), and ANXA2-ADAM9 (R = 0.78, p-value < 0.001) revealed strong correlations among them (Figure 4d).Cancer Hallmarks Analytics Tool (CHAT) was employed to clarify the involvement of our candidates in the biological processes leading to PC according to the current ten hallmarks of cancer.Surprisingly, 18 of the 23 genes were implicated in at least five cancer hallmarks, suggesting that they may be highly associated with PC carcinogenesis and progression.Finally, a pathway enrichment analysis was performed based on the 397 upregulated genes with a cES greater than 1.5, derived from the meta-analysis of five datasets, to provide solid evidence on the PC related biological processes.As a result, various KEGG-annotated pathways were enriched in the PC tissue compared to normal tissue, including prominent cancer-related pathways, such as cell cycle and intrinsic molecular processes of PC and others, such as prostate cancer and chronic myeloid leukemia.The results indicate both unique and overlapping features of PC and other cancers, which would be helpful for further studies.In the protein-protein interaction network of 397 upregulated genes, strong connections among the nodes were observed (score > 0.900).There were two major clusters formed by pathways in cancer and cell cycles in the network (Figure S7).Pathway enrichment, protein-protein networks with known and predicted interactions between the nodes, and CHAT analysis can be found in Tables S4 and S5, and Figure S7.S3).As shown in Figure S6, the expression of APLP2 was significantly higher in PC tissues (p-value = 0.0139), making them valuable for further investigation.The Pearson's correlation analysis of gene expression between LAMC2-ADAM9 (R = 0.71, p-value < 0.001), ANXA2-LAMC2 (R = 0.75, p-value < 0.001), and ANXA2-ADAM9 (R = 0.78, p-value < 0.001) revealed strong correlations among them (Figure 4d).

Functional Analysis and Public Data Mining to Create a Comprehensive Picture of the Biological Processes in Pancreatic Cancer
Cancer Hallmarks Analytics Tool (CHAT) was employed to clarify the involvement of our candidates in the biological processes leading to PC according to the current ten hallmarks of cancer.
Surprisingly, 18 of the 23 genes were implicated in at least five cancer hallmarks, suggesting that they may be highly associated with PC carcinogenesis and progression.Finally, a pathway enrichment analysis was performed based on the 397 upregulated genes with a cES greater than 1.5, derived from the meta-analysis of five datasets, to provide solid evidence on the PC related biological processes.
As a result, various KEGG-annotated pathways were enriched in the PC tissue compared to normal tissue, including prominent cancer-related pathways, such as cell cycle and intrinsic molecular processes of PC and others, such as prostate cancer and chronic myeloid leukemia.The results indicate both unique and overlapping features of PC and other cancers, which would be helpful for further studies.In the protein-protein interaction network of 397 upregulated genes, strong connections among the nodes were observed (score > 0.900).There were two major clusters formed by pathways in cancer and cell cycles in the network (Figure . S7). Pathway enrichment, proteinprotein networks with known and predicted interactions between the nodes, and CHAT analysis can be found in Table S4, Table S5, and Figure S7.

Drug-Gene Interaction and miRNA-Gene for Further Development of Novel Therapeutics
Increasing evidence suggests that aberrantly expressed miRNAs are attractive targets for preventing PC initiation, progression, metastasis, and chemoresistance.Given their enormous potential as therapeutic tools for cancer management, we checked whether the transcripts of our 18 membrane protein-coding genes were the targets of miRNAs based on experimentally validated evidence.It turned out that nine candidates exhibited a strong connection with miRNAs (ADAM9, ANXA2, MSLN, SERPINB5, EZR, GPRC5A, ITGA2, MET, and SLC2A1).The Pearson's correlation coefficient between each candidate and the miRNA was also recorded.Finally, we input our genes in curated resources combined by the drug-gene interaction database (DGIdb) to examine if they coded for a drug-metabolizing enzyme or a drug transporter (so-called druggable genes), and then determined possible drug-gene interactions.As an example of the results, ADAM9 was positively correlated with hsa-miR-126-3p (the correlation coefficient, 0.315; p-value < 0.05) with strong evidence from the reporter assay, western blot, and qPCR.According to DrugBank, the drug Ilomastat was detected to inhibit the cell-surface protein ADAM9.The detailed information regarding miRNA-gene interaction and drug-gene interaction is presented in Table 3.

Discussion
PC is a complicated and highly heterogeneous disease, yet it takes approximately two decades for a primary pancreatic tumor to develop from initiation to cancer death, opening a large window for an effective screening and early detection program [20].In that context, a plethora of biomarkers have been studied, but most of them have failed to be translated into clinical practice.Robust biomarkers, even identified by advanced, reliable technologies, have not exhibited a better diagnostic and prognostic ability than conventional biomarkers in large-scale clinical trials [21].In recent years, advances in computational techniques have facilitated and enhanced cancer omics-based research [22].The current study utilizes an unbiased data-driven approach together with the integration of omics data, machine learning, and conventional immunohistochemistry approaches to suggest and validate potential biomarkers for PC diagnostics and management.Our approach covers a broad spectrum of cancer biology, of which LAMC2, ANXA2, ADAM9, and APLP2 are the most important features of the tested diagnostic model, and their implementation in clinical settings could be further examined.The roles of LAMC2, ANXA2, ADAM9, and APLP2 in cancer biology have been documented [23][24][25][26][27].However, our study has been a pioneer in suggesting the combination of LAMC2, ANXA2, ADAM9, and APLP2 and statistical learning algorithms to properly detect PDAC in early stages.Of note, the expression at the gene and protein level of LAMC2, ANXA2, and ADAM9 are positively correlated to PC. PC patients overexpressing LAMC2 potentially have a higher chance of getting distant metastasis and poorer prognosis [28].The elevation of LAMC2 in serum was also considered an indicator for differentiating healthy people from an early stage of PDAC with outstanding performance [29].Similarly, the high expression of stromal ANXA2 was significantly correlated with short disease-free survival and overall survival, while the high expression of ADAM9 was associated with poor tumor differentiation and short overall survival in PDAC patients [26].We also screened for the presence of somatic mutations of our candidates using the most comprehensive global database for the annotation of somatic variants in human cancer, the Catalogue of Somatic Mutations in Cancer (COSMIC) [30].COSMIC has systematically curated published cancer mutation data and provides in-depth knowledge on cancer genes, including their involvement in cancer genesis and mutation mechanisms.It turned out that our candidates were not frequently mutated genes.Further studies integrating the suggested biosignature and conventional biomarkers, such as CA 19-9, are needed to validate the performance of the panel.Nevertheless, our investigation demonstrates that with the power of machine learning and omics-based big data, scientists could gather patient tumor data and clinical parameters together with biomarker screening using the combined panel to explore the pattern of alterations in multiple PDAC patient tumors, thereby increasing the diagnostic and prognostic accuracy.
Out of the 23 genes in our panel, 18 were involved in at least five cancer hallmarks, indicating the relation of these genes and associated products with cancer initiation and progression.However, with respect to panel, the roles of immune destruction, cellular energetics, and replicative immortality in PC have not been studied frequently compared with the other hallmarks.More studies regarding these aspects are needed to fully explore the potential of our multiplex panels in PC diagnosis and treatment.In recent years, extensive efforts have been made to shed light on the association between immune destructions, cancer, and associated regulators [31,32].The immune hallmark can be summarized as the ability to survive in the chronically infiltrated microenvironment, to escape the immune recognition, and to suppress the immune response [33].Despite their great impact on cancer treatment and prevention, these processes may also give some clues for developing new screening and diagnostic approaches.Of note, the deregulating cellular energetics or cancer metabolism has been recently recognized as a cancer hallmark [34].Investigations in the past 20 years and the successful implementation of metabolic tumor imaging into clinical practice have once again emphasized that altered metabolic processes are critical for tumor survival and proliferation [35].Among them, the alterations of glycolysis, amino acids, and lipid metabolism are the most important changes of cancer cellular energetics [36].Multiple metabolic enzymes and pathways could be investigated as attractive targets not only for anticancer therapeutics but also for developing novel biomarkers [37].
It was expected that miRNA would diminish drug resistance of tumors through various regulatory mechanisms [38,39].In our investigation, ADAM9 and ANXA2 were attractive candidates for miRNAbased anti-PC therapies.They exhibited strong interactions with miR-126, miR-33a, and miR-125a (ADAM9); and miR-155 and miR-206 (ANXA2).In PC, miR-126 regulated ADAM9 by targeting the epithelial-mesenchymal transition, the initial step of the metastatic cascade [40].ANXA2 and KRAS were also direct downstream targets of miR-206 that modulated cancer cell growth, metastasis, and lymphangiogenesis in PDAC [41].In addition to miRNAs, various drug molecules are worthy of more attention.Lately, in an effort to implement evidence-based personalized treatment recommendations into clinical settings, some software and databases have been developed to assist the decision of PC treatment.A recent evidence mining software analyzing next-generation sequencing data suggests pharmacogenomic biomarkers and has potential to predict PC treatment efficacy and toxicity [42].Further studies are necessary to fully explore the potential of ADAM9 and ANXA2, with high level of evidence of clinical validity, among other candidates, as drug candidates in PC management with regard to their associated miRNAs.
The current study set out to introduce an integrative data mining model for translational research to facilitate the discovery and validation of new biomarkers.In particular, a comprehensive literature and experimental data mining approach to examine potential roles of overexpressed genes as biomarkers for diagnosis and prognosis of pancreatic cancer were conducted.Thus, as a limitation, down-regulated genes were neglected in all of our analyses, which might result in the missing of valuable candidates for further validation.In addition, a follow-up study integrating four proposed candidates and statistical learning is needed to examine their actual clinical validity.

Ethical Approval and Consent of Patients
The Institutional Review Board (IRB) of Seoul National University reviewed and exempted our protocol for a full ethical approval application when using human gene expression data from the public database without personal information, survival data, and commercial tissue micro-array (SNU 16-11-014 and SNU 17-05-043).The IRB of Inha University approved the protocol of the NGS gene expression analysis (INHAUH 2016-08-008-001).Every patient signed an informed consent before any procedures were conducted.The experiments followed the Ethical Principles of the World Medical Association Declaration of Helsinki.

NGS Sample Collection
Tissue samples were obtained from six PC patients, and six paired controls (stage IIa/IIb, median age of 66.5, range from 55 to 75, male:female of 2:1) using standard protocols of the Inha University Hospital.All samples were formalin-fixed paraffin-embedded and stored at 4 • C.

Microarray Sample Collection
The Affymetrix-based datasets GSE16515, GSE28735, GSE15471, GSE18670, and GSE41368, were included in the gene expression meta-analysis.After that, we divided the datasets into two groups with similar sample sizes: set A including, GSE16515 and GSE28735 for variable selection purposes; and set B, including GSE15471, GSE18670, and GSE41368 for statistical learning-based validation of the proposed set of novel biomarkers.Dataset GSE19650 was also included for the examination of the multistep PC development.

RNA Extraction from Formalin-Fixed Paraffin-Embedded Tissue
Formalin-fixed paraffin-embedded tissue was cut using a razor blade (20 µm), transferred to a 1.5 mL tube, and filled with mineral oil.The tube was incubated at 70 • C for 3 min to melt the paraffin.The sample was then centrifuged for 1 min at 16,000× g.The pellet was washed with 1 mL of ethanol and air dried.Next, 150 µL protease K digestion buffer containing 500 µg/mL protease K was added, and the mixture was incubated for 3 h. 1 mL of Trizol was added to the sample and incubated at room temperature for at least 5 min to dissociate nucleoprotein complexes.Subsequently, 0.2 mL chloroform was added, and the mixture was vigorously vortexed for 15 s and centrifuged at 12,000× g for 10 min at 4 • C. The aqueous phase was transferred to a fresh tube.Total RNA was precipitated by mixing with an equal volume of isopropyl alcohol at −20 • C for at least 1 h, followed by 10 min of centrifugation at 12,000× g at 4 • C. Finally, RNA pellet was washed with 70% ethanol, briefly air-dried, and resuspended in RNase-free water.

RNA Data Processing and Analysis
To construct cDNA libraries, 1 µg of total RNA was used with TruSeq RNA library kit (Illumina, San Diego, CA, USA).The protocol included polyA-selected RNA extraction, RNA fragmentation, random hexamer-primed reverse transcription, and 100-nt paired-end sequencing by Illumina HiSeq2500 (Illumina).The libraries were quantified using qPCR and qualified using an Agilent Technologies 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA).To estimate the expression level, the RNA-seq reads were mapped to the genome of Homo sapiens.The reference genome sequence of Homo sapiens and the annotation data were downloaded from the UCSC website (http://genome.uscs.edu).The transcript counts at the gene level were calculated [43].Data normalization and statistical analysis were then conducted using edgeR.We determined the significant result by adjusting the absolute value of fold change ≥ 2.

Microarray Data Processing and Gene Expression Meta-Analysis
We normalized raw gene expression data using a robust multi-array analysis implemented in the Bioconductor affy package [44].Database for Annotation, Visualization, and Integrated Discovery (DAVID) version 6.8 was applied to get the Entrez IDs.The Empirical Bayes (ComBat) cross-study normalization method was applied to remove batch effects.The random effects model was chosen as the statistical meta-analysis method based on the result of Cochran's Q test.The analysis was implemented in the comprehensive web-based tool NetworkAnalyst [45].

Immunohistochemistry (IHC) Experiments
The procedures of IHC were carried out as described previously [46].Two experts independently assessed staining scores.The immunostaining intensity was graded as 0, 1, 2, or 3 for the tissue with stainless cells, <10% stained cells, 10-50% stained cells, and > 50% stained cells, respectively.We excluded the tissues that were not suitable for analysis (e.g., not adenocarcinoma or tissue loss cores).IHC scoring was visualized by Aperio ScanScope digital slide scanners, and analyzed by the vendor's software (Leica, Wetzlar, Germany).The Mann-Whitney and Wilcoxon matched-pairs signed rank tests were conducted to test two-group differences among PC tissues, adjacent normal tissues, and normal tissues using GraphPad Prism 6 (San Diego, CA, USA) where applicable.

Variable Selection
We treated the overexpressed genes coding for secretory and membrane proteins separately because of their distinct clinical roles.Only those coding for secretory proteins and not upregulated in pancreatitis from the compendium were selected for a further classification.

Data Exploration and Visualization
The principal component analysis (PCA) was incorporated to facilitate data visualization and to detect potential outliers prior to the class assignment analysis using FactoMineR version 1.35 and factoextra version 1.0.4[47].A heatmap was also applied to highlight the differences in gene expression between the two comparison groups by means of MetaboAnalyst 3.0 [48].

Random Forests Classification Model and Explanation
Classification and Regression Training (caret) package version 6.0.77 was used for data splitting and supervised classification analysis [49].Set B was divided into the training set and the test set with a ratio of 7:3.A random forests (RF) model using 500 trees was optimized on the training set.The number of variables for splitting at each tree node (mtry) was tuned.Five-time repeated 10-fold cross-validation sampling iteration was used during the training process to select the optimal model.A seed number was given prior to data splitting and model training to achieve reproducible results.We used the explainer local interpretable model-agnostic explanations (LIME) to interpret the rationale behind the black-box RF model's predictions [50].

Correlation Analysis
Pearson's correlation analysis and Spearman's rank correlation analysis were conducted for gene expression level and protein expression score, respectively.

Kaplan-Meier Plots and Cox Regression Analysis
The overall survival and disease-free survival were investigated using the Kaplan-Meier method with the log-rank test.We set the high and low gene expression level groups by the median value.The overall survival plot and disease-free survival plot were obtained with the hazard ratios (HR) and the 95% confidence interval information.The whole process was implemented using the web-based tool GEPIA [51].Additionally, multivariate Cox regression analysis of TCGA PC patients was also extracted using Oncolnc to determine the potential prognostic biomarkers [52].

Database Mining, Pathway Enrichment, STRING, and CHAT Analyses
Kyoto Encyclopedia of Genes, Genomes (KEGG) pathway enrichment, and OmicsNet and STRING protein-protein interaction analyses and visualizations were conducted using the significantly overexpressed genes.The Database of Human Pancreatic Cancer (HPCDb) and the Pancreatic Expression Database (PED) were used as resources for literature data mining [53,54].A public literature review of a reported gene list in PC was also conducted under the guidance of Cancer Hallmarks Analytics Tool (CHAT) [55].

Drug-Gene Interaction and miRNA-Gene Interaction for Further Developing Novel Therapeutics
To examine the connection between our membrane protein-coding genes with miRNA, we employed the miRTarBase [56].The miRTarBase offers data regarding experimentally validated miRNA-target interactions.The Pearson's correlation coefficient between each candidate and miRNA was also recorded.Subsequently, the drug-gene interaction database was investigated to examine if our genes coded for a drug-metabolizing enzyme or a drug transporter [57].

Statistical Significance Level
R statistic 3.4.2was used to implement the analysis except otherwise stated.A p-value of 0.05 was used as the cut-off for significance in statistical tests.False discovery rate (FDR) (Benjamini-Hochberg method), when applicable (e.g., in NGS differential analysis), with a threshold of 0.05, was utilized for all multiple hypothesis tests.

Availability of Data and Materials
NGS data generated and analyzed during the current study are available in the Gene Expression Omnibus (GEO) with the assession code of GSE119224.Other datasets analyzed during the current study are available in the GEO (GSE16515, GSE28735, GSE15471, GSE18670, GSE41368, and GSE19650).All data generated or analyzed during this study are included in this published article and its supplementary information files.

Conclusions
Recent advances in high-throughput multi-omic technologies along with advanced bioinformatics and machine learning techniques have provided a new paradigm for the discovery of human disease biomarkers [10,46,58,59].This perspective should remark the continued employment of multi-omic approaches and machine learning techniques under the guidance of big data and biological mechanisms in cancer biomarker research.In view of this, we successfully demonstrated an integrative method to introduce and validate a panel of LAMC2, ANXA2, ADAM9, and ALPL2 for innovating PC diagnosis, prognosis, and management.Future studies are expected to take a step forward and concentrate on validating biomarkers by integrating multi-omics data in an epidemiological and clinical context-dependent manner.

Figure 1 .Figure 1 .
Figure 1.Screening process of innovative biomarker candidates for pancreatic cancer diagnosis and treatment.(a) Combined effect size distributions in transcriptome meta-analysis.(b) Selection of membrane protein-coding genes.(c) Selection of secretory protein-coding genes.The left circle of the Figure 1.Screening process of innovative biomarker candidates for pancreatic cancer diagnosis and treatment.(a) Combined effect size distributions in transcriptome meta-analysis.(b) Selection of membrane protein-coding genes.(c) Selection of secretory protein-coding genes.The left circle of the Venn diagram (orange) represents the candidates from meta-analysis and the right circle of the Venn diagram (blue) represents the candidates curated from [17].

Figure 2 .
Figure 2. Kaplan-Meier (KM) plots of the overall survival of four promising prognostic candidates.(a) KM plot of ADAM9.(b) KM plot of ANXA2.(c) KM plot of ITGA2.(d) KM plot of MET.

Figure 2 .
Figure 2. Kaplan-Meier (KM) plots of the overall survival of four promising prognostic candidates.(a) KM plot of ADAM9.(b) KM plot of ANXA2.(c) KM plot of ITGA2.(d) KM plot of MET.

Cancers 2019, 11 , x 9 of 22 Figure 3 .
Figure 3. Data exploration and diagnostic performance of 11 biomarker candidates in the Random Forests model.(a) The gene expression of CDH3 is higher in PDAC than in normal controls (n = 96), ****: p < 0.0001.(b) The gene expression of LAMC2 is higher in PDAC than in normal controls (n = 96).(c) Principal component analysis of PDAC versus normal controls.(d) Heatmap analysis of 11 biomarker candidates.(e) ROC curve of the random forests model in the test set (n = 28).(f) The importance scores of 11 biomarker candidates in the random forests model.

Figure 3 .
Figure 3. Data exploration and diagnostic performance of 11 biomarker candidates in the Random Forests model.(a) The gene expression of CDH3 is higher in PDAC than in normal controls (n = 96), ****: p < 0.0001.(b) The gene expression of LAMC2 is higher in PDAC than in normal controls (n = 96).(c) Principal component analysis of PDAC versus normal controls.(d) Heatmap analysis of 11 biomarker candidates.(e) ROC curve of the random forests model in the test set (n = 28).(f) The importance scores of 11 biomarker candidates in the random forests model.

2. 6 .
Functional Analysis and Public Data Mining to Create a Comprehensive Picture of the Biological Processes in Pancreatic Cancer

Cancers 2019, 11 , x 10 of 22 and
ANXA2 between PC and adjacent tissues (eight paired samples) presented similar results, although only ADAM9 showed statistical significance (FigureS5, p-value = 0.031).The protein expression of APLP2 was also investigated in a cohort comprising 64 PC and 64 adjacent normal tissues (Table

Table 1 .
Literature-and experiment-based validation of the 23 candidates.

Table 2 .
Combined effect sizes, Cox regression analysis, and alterations at the mRNA and protein level of the 23 candidates.

Table 3 .
Drug-gene and miRNA-gene interactions of 18 membrane proteins.