Identification of Circulating Serum miRNAs as Novel Biomarkers in Pancreatic Cancer Using a Penalized Algorithm

Pancreatic cancer (PC) is difficult to detect in the early stages; thus, identifying specific and sensitive biomarkers for PC diagnosis is crucial, especially in the case of early-stage tumors. Circulating microRNAs are promising non-invasive biomarkers. Therefore, we aimed to identify non-invasive miRNA biomarkers and build a model for PC diagnosis. For the training model, blood serum samples from 63 PC patients and 63 control subjects were used. We selected 39 miRNA markers using a smoothly clipped absolute deviation-based penalized support vector machine and built a PC diagnosis model. From the double cross-validation, the average test AUC was 0.98. We validated the diagnosis model using independent samples from 25 PC patients and 81 patients with intrahepatic cholangiocarcinoma (ICC) and compared the results with those obtained from the diagnosis using carbohydrate antigen 19-9. For the markers miR-155-5p, miR-4284, miR-346, miR-7145-5p, miR-5100, miR-661, miR-22-3p, miR-4486, let-7b-5p, and miR-4703-5p, we conducted quantitative reverse transcription PCR using samples from 17 independent PC patients, 8 ICC patients, and 8 healthy individuals. Differential expression was observed in samples from PC patients. The diagnosis model based on the identified markers showed high sensitivity and specificity for PC detection and is potentially useful for early PC diagnosis.


Introduction
Pancreatic cancer (PC) is one of the leading causes of cancer-related mortality, as the symptoms of PC seldom appear in the early stages of the disease, and the cancer is mostly detected after it has metastasized to other organs. According to cancer statistics in 2020, the five-year survival rate of patients with PC is 9%, although that of patients with localized PC is higher than 37%, based on patients diagnosed with pancreatic cancer between 2009 and 2015 [1].
The most effective strategy for reducing PC-related mortality is early diagnosis and treatment. However, the lack of reliable markers for PC detection reduces the efficacy of screening strategies in at-risk populations, such as those with chronic pancreatitis [2]. Carbohydrate antigen 19-9 (CA19-9) and carcinoembryonic antigen (CEA) are the most commonly used serological biomarkers; however, they lack sufficient sensitivity and specificity for the detection of PC [3]. To improve the prognosis of patients with this form of cancer, it is important to identify diagnostic biomarkers for PC.
Recently, microRNAs (miRNAs), which are small non-coding RNA molecules, have been reported to play important roles in post-transcriptional regulation in cancer [4]. Increasing evidence has shown that miRNAs are essential for the development, diagnosis, and prognosis of cancer, suggesting that these RNAs have potential for use as diagnostic markers in cancer [5]. To date, nearly 100 miRNAs have been identified to be associated with PC using tissue samples [2]. However, it is difficult to perform tissue biopsies in every patient suspected of having PC. Therefore, the optimal biomarkers would be non-invasive and derived from blood, such as circulating miRNAs, which may be readily collected from the patient. Another reason for using circulating miRNAs as biomarkers is their remarkable stability in plasma and serum. They are protected from RNase degradation as they can be packaged in microparticles (e.g., exosomes) or bound to Argonaute proteins or high-density lipoproteins [6][7][8][9][10].
Currently, highly sensitive and specific invasive biomarkers are not available for the detection of PC. Therefore, the primary objective of this study was to identify noninvasive miRNA biomarkers and to build a prediction model for the diagnosis of PC. In this study, 63 PC patients and 63 control subjects were used for the identification of miRNA biomarkers, and an additional 25 PC samples and 81 intrahepatic cholangiocarcinoma (ICC) samples were used for the validation of our proposed prediction model. For comparison, we also obtained diagnosis results based on serum levels of CA19-9 in the same blood samples. For additional validation, quantitative reverse transcription PCR (qRT-PCR) was conducted using additional RNA samples from 17 patients with PC, 8 patients with ICC, and 8 healthy individuals.

Study Design
The present study included 105 patients with PC, 109 patients with ICC, 7 patients with stomach cancer (SC), 5 patients with colorectal cancer (CRC), 2 patients with gastrointestinal stromal tumor (GIST), 10 patients with cholelithiasis (Ch), and 27 healthy subjects who had been clinically classified at the time of participation. A case-control study was designed to identify differentially expressed miRNAs (DEmiRNAs) between the case-control groups and to build a diagnostic model for PC. For cases, 63 PC patients were used, and for controls, two types of control groups were used.
The first type of control group consisted of 19 healthy subjects and 10 Ch patients. The second type of control group, the non-PC group, included samples from patients with other cancers as well as those from non-cancer subjects. In particular, we included 20 ICC patients, 7 SC patients, 5 CRC patients, and 2 GIST patients. We set aside 25 PC and 81 ICC samples for the validation study. The clinical characteristics of the samples in the microarray experiments and the grouping details are presented in Table 1. qRT-PCR was conducted using samples from 17 PC patients, 8 ICC patients, and 8 healthy individuals. The purpose of our study was not only to identify biomarkers for PC but also to build a prediction model. Therefore, although the age of the subjects was significantly different between the case and control groups, we decided to use the model without the covariate, as the model with the covariate had a similar prediction performance to the model without the covariate. The study protocol conformed to the ethical guidelines of the 1975 Declaration of Helsinki, and the Ethical Committee and Institutional Review Board of Yonsei University College of Medicine approved the protocol of serum acquisition from the patients' specimens. Written informed consent was obtained from all participating patients and healthy controls (IRB approval code 4-2012-0528, 20 September 2012).

Sample Preparation
Patient samples were prospectively obtained from consenting individuals who underwent a detailed clinical examination and were diagnosed at the Severance Hospital, Yonsei University College of Medicine. Serum samples from 63 patients with PC, 63 non-PC control subjects, and another 25 patients with PC and 81 patients with ICC were collected in 10-mL BD serum tubes. Samples were centrifuged at 4 °C for 20 min at 3000× g. The supernatant serum was then aliquoted and stored at −80 °C until further use.

MicroRNA Extraction
Total RNA containing miRNA was extracted from the serum samples using a serum miRNA purification kit (Genolution, Seoul, Korea) according to the manufacturer's instructions, and the RNA was resuspended in 12 µL of RNase-free water and stored at −80 °C until microarray or qRT-PCR analysis.

MicroRNA Microarray Experiments
For quality control, the purity and integrity of the RNA were evaluated based on the OD260/280 ratio and analyzed using the Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA). Analysis using the Affymetrix GeneChip miRNA 4.0 array (Affymetrix, Santa Clara, CA, USA) was performed according to the manufacturer's protocol. RNA samples (130 ng) were labeled using the FlashTag Biotin RNA Labeling Kit (Genisphere, Hatfield, PA, USA). The labeled RNA was quantified, fractionated, and hybridized to the miRNA microarray according to the standard procedures provided by the manufacturer.
Next, the labeled RNA was heated to 99 °C for 5 min and then to 45 °C for 5 min. RNA-array hybridization was performed with agitation at 60 rotations per minute for 16 h at 48 °C on an Affymetrix 450 Fluidics Station. The chips were washed and stained using a GeneChip Fluidics Station 450 (Affymetrix). The chips were then scanned using an Affymetrix GCS 3000 scanner; 232 CEL files were analyzed and normalized using the Expression Console software. The Affymetrix GeneChip miRNA 4.0 Array provides 100% miRBase v20 coverage (www.mirbase.org) using a one-color approach. This chip contains 6658 human probe sets, which include pre-mature miRNAs (n = 2025) and other small RNAs (n = 1996), including internal and negative controls. For further analysis, we extracted the 2578 mature human miRNAs from all probe sets.

Principal Component Analysis Based on Differentially Expressed Genes
Log2-transformed and normalized intensities for the 2578 human mature miRNAs were analyzed for the difference in expression levels between the cases and controls. To identify DEmiRNAs, we used a logistic regression analysis. Statistical significance was determined using the false discovery rate (FDR) method; FDR < 0.05 was considered significant in this analysis.
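The text does not name the specific FDR procedure. As a minimal sketch, assuming the widely used Benjamini-Hochberg step-up adjustment, the per-miRNA p-values from the logistic regressions could be converted to adjusted values and filtered at FDR < 0.05 as follows. The function name and the example p-values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Assumption: the study's "FDR method" is taken to be Benjamini-Hochberg.
    """
    m = len(p_values)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five miRNAs (illustrative only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
fdr = benjamini_hochberg(p_values)
# Keep the miRNAs whose adjusted p-value is below 0.05.
selected = [i for i, q in enumerate(fdr) if q < 0.05]
```

In this toy input, the third and fourth p-values are below 0.05 before adjustment but not after it, which is the kind of filtering the FDR threshold is meant to provide.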
To examine the difference in miRNA profiles between the cases and controls, we conducted a principal component analysis (PCA). The principal components of the two groups were computed based on different sets of miRNAs: (i) all miRNAs and (ii) FDR < 0.05. Based on this PCA model, we also predicted the principal components of the validation samples (25 PC samples and 81 ICC samples). To visualize the pattern of each group, we added 95% confidence ellipses of principal components in a PCA plot based on the multivariate t distribution.

Biomarker Selection for Diagnosis
For diagnosis of PC, miRNA biomarkers were selected from the 2578 human mature miRNAs, using the following procedure:

• Step 1 (training/test data assignment): The whole dataset was randomly divided into 5 approximately equal-sized subsets (folds). Each of the five folds was used in turn as test data, and the remaining folds were designated as training data (5-fold cross-validation).

• Step 2 (candidate variable selection): Using the individually assigned training data, logistic regression analysis was conducted, and p-values and adjusted p-values (FDR) were computed for each miRNA. First candidate miRNAs were selected (FDR < 0.05). By applying a smoothly clipped absolute deviation (SCAD) penalty to the first candidate miRNAs, second candidate miRNAs with non-zero coefficients were selected.

• Step 4 (final variable selection by voting): From the 1000 sets of candidate miRNAs (5 folds × 200 repetitions), the frequency of each candidate miRNA was computed. The candidate miRNAs were sorted by frequency.
The optimal K was determined based on the performance of each model. As a final model, an RBF-kernel SVM with the K top-ranked miRNAs was fitted using the whole training dataset.
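The voting step can be illustrated with a short sketch. It assumes each candidate set is simply a list of miRNA names and that ties in frequency are broken alphabetically; the tie rule, the function names, and the placeholder miRNA names are assumptions made for illustration only.

```python
from collections import Counter


def rank_by_selection_frequency(candidate_sets):
    """Count how often each miRNA appears across candidate sets, most frequent first."""
    counts = Counter()
    for candidates in candidate_sets:
        # Each candidate set contributes at most one vote per miRNA.
        counts.update(set(candidates))
    # Sort by descending frequency, then by name so that ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def top_k(ranked, k):
    """Return the names of the k top-ranked miRNAs."""
    return [name for name, _ in ranked[:k]]


# Toy candidate sets standing in for the 1000 sets; the names are placeholders.
candidate_sets = [
    ["miR-A", "miR-B"],
    ["miR-A", "miR-C"],
    ["miR-A", "miR-B", "miR-D"],
    ["miR-B"],
]
ranked = rank_by_selection_frequency(candidate_sets)
panel = top_k(ranked, 2)
```

The resulting ranked list is what a later step would truncate at K before fitting the final classifier.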

Double Cross-Validation
For the parametrization and validation of our diagnostic model, we used double cross-validation [11,12], which consists of inner and outer cross-validation. We conducted the outer 5-fold cross-validation to determine the optimal K and the inner 5-fold cross-validation for the hyperparameter assignment of the SVM. In the inner 5-fold cross-validation, for the grid search of kernel hyperparameters, we assigned gamma values in the range of $2^{-10}$ to $2^{10}$ ($2^{-10}, 2^{-9}, \ldots, 2^{9}, 2^{10}$) and cost values in the range of $2^{-7}$ to $2^{7}$ ($2^{-7}, 2^{-6}, \ldots, 2^{6}, 2^{7}$). In the outer 5-fold cross-validation, the diagnostic models with K top-ranked miRNAs were applied to the test data, and the area under the curve (AUC), sensitivity, and specificity were calculated for each fold. We calculated these performance measures for several K values (K = 2, . . . , 50). This double cross-validation was repeated 20 times with different random seeds. The performance metrics were then averaged over the 5 folds and 20 repetitions. Based on this performance, we determined the final number of biomarkers (=K).
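The structure of the double cross-validation can be sketched without any modeling library. The code below only builds the power-of-two hyperparameter grids and the nested index splits; it does not fit an SVM. The helper names, the interleaved fold assignment, and the seeding scheme are assumptions for illustration, not the authors' implementation.

```python
import random


def k_fold_indices(n_samples, n_folds, seed):
    """Shuffle sample indices and deal them into n_folds near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


# Hyperparameter grids as powers of two.
gamma_grid = [2.0 ** e for e in range(-10, 11)]  # 2^-10 ... 2^10
cost_grid = [2.0 ** e for e in range(-7, 8)]     # 2^-7 ... 2^7


def double_cv_splits(n_samples, n_outer=5, n_inner=5, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for one repetition.

    inner_splits holds (inner_train, inner_valid) index lists drawn only from
    outer_train, so the outer test samples never influence tuning.
    """
    outer_folds = k_fold_indices(n_samples, n_outer, seed)
    for f, outer_test in enumerate(outer_folds):
        outer_train = [i for g, fold in enumerate(outer_folds) if g != f for i in fold]
        # Inner folds are positions into outer_train (hypothetical seeding scheme).
        inner_folds = k_fold_indices(len(outer_train), n_inner, seed + 1 + f)
        inner_splits = []
        for h, valid_pos in enumerate(inner_folds):
            valid = [outer_train[p] for p in valid_pos]
            train = [outer_train[p]
                     for g, fold in enumerate(inner_folds) if g != h
                     for p in fold]
            inner_splits.append((train, valid))
        yield outer_train, outer_test, inner_splits


# One repetition over 126 training samples (63 PC + 63 non-PC).
splits = list(double_cv_splits(n_samples=126))
```

Repeating the call with 20 different `seed` values and averaging the per-fold metrics would mirror the repetition scheme described above; the grid search itself would evaluate each (gamma, cost) pair on the inner splits.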

Smoothly Clipped Absolute Deviation (SCAD) Penalty
SCAD is a non-concave penalty function introduced by Fan and Li [13], and Zhang et al. [14] considered the sparse SVM with SCAD for feature selection. The SCAD-penalized term for each coefficient $t_j$ has the following form, Equation (1) [14]:
\[
p_\lambda(|t_j|) =
\begin{cases}
\lambda |t_j|, & |t_j| \le \lambda, \\
-\dfrac{|t_j|^2 - 2a\lambda |t_j| + \lambda^2}{2(a-1)}, & \lambda < |t_j| \le a\lambda, \\
\dfrac{(a+1)\lambda^2}{2}, & |t_j| > a\lambda,
\end{cases}
\qquad (1)
\]
where $a > 2$ and $\lambda > 0$ are tuning parameters. In our analysis, Fan and Li's suggested value of a = 3.7 was used. The parameter λ was assigned by minimizing the approximate generalized cross-validation statistic. Among the various penalized methods for feature selection, we chose SCAD because it has several desirable properties. For example, SCAD produces nearly unbiased estimates for large coefficients, and the set of features selected using SCAD is asymptotically equivalent to the set of true signal features; that is, SCAD satisfies the oracle property. We conducted a penalized SVM with the SCAD penalty for multiple miRNA selection in the double cross-validation in our study.
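As a small illustration of why SCAD leaves large coefficients nearly unbiased, the sketch below evaluates the standard Fan and Li piecewise penalty for a single coefficient. The function name is hypothetical, and the code only evaluates the penalty; it does not fit the penalized SVM.

```python
def scad_penalty(t, lam, a=3.7):
    """SCAD penalty for one coefficient t (standard Fan and Li form).

    Assumes lam > 0 and a > 2; a = 3.7 is the default used in the text.
    """
    x = abs(t)
    if x <= lam:
        # Near zero the penalty behaves like the L1 (lasso) penalty.
        return lam * x
    if x <= a * lam:
        # Quadratic transition that smoothly flattens the penalty.
        return -(x * x - 2.0 * a * lam * x + lam * lam) / (2.0 * (a - 1.0))
    # Beyond a * lam the penalty is constant, so large coefficients
    # receive no additional shrinkage.
    return (a + 1.0) * lam * lam / 2.0


# With lam = 1, the penalty stops growing once |t| exceeds a * lam = 3.7.
small = scad_penalty(0.5, 1.0)
large = scad_penalty(10.0, 1.0)
larger = scad_penalty(100.0, 1.0)
```

The two large coefficients receive the same penalty, whereas an L1 penalty would keep growing linearly with the coefficient size.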

Quantitative RT-PCR
Reverse transcription and qRT-PCR were performed using a TaqMan Advanced miRNA cDNA Synthesis Kit (Applied Biosystems, Foster City, CA, USA), TaqMan Advanced miRNA Assays (Applied Biosystems), and TaqMan Fast Advanced Master Mix (Applied Biosystems), according to the manufacturer's protocols. qRT-PCR was performed using an ABI Prism 7300 Sequence Detection System (Applied Biosystems), and primers for the mature miRNAs were purchased from Applied Biosystems. PCR amplification consisted of an initiation step at 95 °C for 10 min, followed by 55 cycles at 95 °C for 30 s, 56 °C for 30 s, and 72 °C for 15 s. All qRT-PCR assays were performed in triplicate using total RNA samples from 17 patients with PC, 8 patients with ICC, and 8 healthy individuals. Statistical analyses were performed using GraphPad 5 (GraphPad Software). Differences in miRNA expression between groups were assessed by one-way ANOVA with Bonferroni post-tests.

Comparison between Case and Control by PCA
Upon comparing the 63 PC samples with the 29 non-cancer samples in the DEmiRNA analysis, we identified 103 miRNAs that showed significant differences in expression between the two groups (FDR < 0.05) (Table S1, Figure S1). When the 103 miRNAs were used in the PCA, the 63 PC samples (green dots) and 29 non-cancer samples (red dots) were well distinguished compared to when all miRNAs were used. Furthermore, the 25 validation-PC samples (purple dots) had similar patterns to the 63 PC samples, as shown in Figure 1a,b. However, some of the 81 validation-ICC samples (blue dots) had overlapping patterns with the PC case samples. Thus, if we used only non-cancer samples as controls, the biomarkers led to many false positives (for example, the biomarker could diagnose some ICC patients as PC patients) and were not appropriate for the PC-specific diagnostic model.
Upon comparing the 63 PC samples with the 63 non-PC samples, we identified 149 miRNAs that showed significant differential expression between the two groups (FDR < 0.05) (Table S2, Figure S2). When we used all the miRNA data in the PCA, the PC and non-PC samples exhibited overlapping patterns of principal components (Figure 1c). When the 149 differentially expressed miRNAs were used in the PCA, the clustering patterns of the 63 PC samples and 63 non-PC samples were nearly distinguished, and the validation samples (25 PC and 81 ICC) had similar patterns to those of the training case-control samples, as shown in Figure 1d.

Building a Diagnostic Model Based on the Selected miRNA Markers
To build the diagnostic model, we decided to use the results of the comparison between the PC samples and non-PC samples, including samples from patients with other cancers as controls to obtain PC-specific diagnostic markers. For the selection of diagnostic markers, we used 5-fold cross validation with 200 repetitions. In each fold of the cross-validation, we conducted a logistic regression analysis without a covariate and selected a set of candidate miRNA markers whose FDR was less than 0.05. Then, through the use of the SVM with the SCAD penalty function, the candidate markers were narrowed down to the markers with non-zero coefficients.
We ranked the markers according to the selection frequency. Based on these frequencies, K top-ranked miRNAs were used to build the RBF kernel SVM model. To determine the value of K, through double cross-validation, we estimated the diagnostic performance of the model with the K top-ranked miRNAs by varying K (K = 1, . . . , 50). As shown in Figure 2, the performance measures increased as K increased and began to saturate at an AUC of 0.98 and an accuracy of 0.93 when K was 39. Therefore, we decided to select the top 39 miRNAs as diagnostic biomarkers for PC among the candidate miRNAs (Table 2).
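For readers less familiar with the AUC used to compare models across K, a minimal sketch follows. It computes the AUC as the probability that a randomly chosen case scores higher than a randomly chosen control, which is equivalent to the area under the ROC curve. The function name and the toy scores are illustrative assumptions and do not reproduce the study's data.

```python
def auc(scores, labels):
    """AUC as the probability that a case outscores a control (ties count 0.5).

    labels: 1 = case (PC), 0 = control (non-PC).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    # Compare every case with every control.
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy scores: one case (0.4) falls below one control (0.7).
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = auc(toy_scores, toy_labels)
```

Averaging such per-fold values over the outer folds and repetitions gives the kind of mean test AUC that is plotted against K.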

At K = 39, the mean sensitivity and mean specificity of the diagnostic model were 0.93 and 0.93, respectively, given an optimal decision threshold. The optimal threshold of diagnosis probability was determined to be 0.55 by comparing the performance results based on thresholds (0.5, 0.55, 0.6, 0.65, and 0.7). Among the 39 miRNAs, 28 miRNAs were also differentially expressed between the PC samples and non-cancer samples (FDR < 0.05); 11 miRNAs were differentially expressed between the PC and non-PC samples (FDR < 0.05) (Figure 3).
For validation, we next applied our PC-specific diagnostic model to a different set of 25 PC and 81 ICC samples. When the PC-diagnosis probability from the diagnostic model was >0.55, we diagnosed the patient as having PC. We also applied CA19-9 diagnosis to the same samples for comparison. When the CA19-9 value was >37 U/mL, we diagnosed the patient as having PC. As shown in Figure 4, the AUC of the proposed diagnostic model was 1.5 times higher, the sensitivity was 1.3 times higher, and the specificity was 2 times higher than that of the CA19-9 diagnosis model (the AUC, sensitivity, and specificity are presented in Figure 4).
Table 2 (excerpt): ... [15][16][17][18]; hsa-miR-4284, 836, 0.708 [19]; hsa-miR-939-5p *, 810, 0.734 [20]; hsa-miR-642b-3p *, 805, 0.759 [21,22]; hsa-miR-346 *, 736, 0.749 [23].
We also validated 10 miRNAs of the 39 diagnostic markers using qRT-PCR. For qRT-PCR, blood samples from another 17 patients with PC, 8 patients with ICC, and 8 healthy individuals were used (Table S3). The expression levels of miR-155-5p, miR-4284, miR-346, miR-7154-5p, miR-5100, miR-661, miR-22-3p, miR-4486, let-7b-5p, and miR-4703-5p were analyzed using primers for mature miRNAs. The findings indicated differential expression in PC versus ICC and healthy individuals. Decreased expression of miR-155-5p, miR-7154-5p, miR-661, and miR-4703-5p and elevated expression of miR-5100, miR-22-3p, miR-4486, and let-7b-5p were observed in PC patients. miR-4284 was only detected in cancer groups and miR-346 was absent in patients with PC (Figure 5).
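The two decision rules used in the validation (model probability > 0.55 and CA19-9 > 37) can be compared with the same helper. The sketch below is a minimal illustration with made-up scores and labels; none of the numbers are from the study, and the function name is hypothetical.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Call a sample PC when score > threshold; labels: 1 = PC, 0 = non-PC."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y == 1 and s > threshold)
    fn = sum(1 for s, y in pairs if y == 1 and s <= threshold)
    tn = sum(1 for s, y in pairs if y == 0 and s <= threshold)
    fp = sum(1 for s, y in pairs if y == 0 and s > threshold)
    # Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    return tp / (tp + fn), tn / (tn + fp)


# Made-up validation data: four PC samples followed by four non-PC samples.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
model_prob = [0.91, 0.72, 0.58, 0.40, 0.52, 0.30, 0.10, 0.61]
ca19_9 = [250.0, 12.0, 88.0, 40.0, 95.0, 20.0, 8.0, 36.0]

# Same helper, two different decision rules.
model_sens, model_spec = sensitivity_specificity(model_prob, labels, 0.55)
ca_sens, ca_spec = sensitivity_specificity(ca19_9, labels, 37.0)
```

Using one function for both rules keeps the comparison consistent, since each marker is judged on the same samples with the same definition of a positive call.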

Discussion
Despite multiple clinical trials and continued efforts, PC remains the most difficult cancer to cure as it is difficult to diagnose at the early stages. In this study, we aimed to identify circulating miRNA biomarkers for the detection of PC and to develop a diagnostic model based on these markers. For the identification of diagnostic markers, we used two types of control groups. The first control group consisted of 29 non-cancer samples and the second consisted of 63 non-PC samples, including those from patients with other cancers. DEmiRNAs selected from the PC vs. non-cancer study successfully enabled discrimination between the training-case samples and the training-control samples but could not distinguish the validation-case samples from the validation-control samples, possibly because the validation-control samples consisted of samples from patients with ICC. For validation control, patients with ICC were used instead of healthy subjects to verify the specificity of PC diagnosis. PC and ICC are known to have overlapping immunohistochemical profiles [40]. DEmiRNAs selected from the PC vs. non-PC study differentiated the cases from the controls well, both in the training samples and in the validation samples. As a result, we found that the PC vs. non-PC grouping was more suitable for the identification of PC-specific diagnostic markers than the PC vs. non-cancer grouping. Based on this grouping, we sought to identify PC-specific diagnostic markers from 2578 miRNAs. To account for joint effects from multiple core miRNAs and to filter out the negative effects caused by irrelevant miRNAs, we conducted a penalized SVM with the SCAD penalty. As a result, we identified 39 PC-specific diagnostic markers using SCAD-based penalized SVM with a double cross-validation technique.
The markers identified in the present study have potential for use in the early diagnosis of PC and are expected to serve as a major platform for developing commercial models for the timely diagnosis of PC.

Conclusions
In this study, we identified 39 circulating miRNAs as PC-specific diagnostic markers using penalized methods. They include several novel biomarkers that have not yet been reported for PC diagnosis. For internal validation, we estimated the sensitivity and specificity of our diagnostic model through double cross-validation and obtained a mean sensitivity of 0.93 and a mean specificity of 0.93. We also validated the specificity using 25 independent PC and 81 ICC samples with PCA and conducted qRT-PCR validation on several diagnostic markers using independent samples from 17 PC patients, 8 ICC patients, and 8 healthy control subjects. qRT-PCR analysis indicated that miR-155-5p, miR-4284, miR-346, miR-7145-5p, miR-5100, miR-661, miR-22-3p, miR-4486, let-7b-5p, and miR-4703-5p were differentially expressed in samples from patients with PC. Overall, while we are convinced that our identified miRNA biomarkers based on the PC-specific diagnosis model improve the detection rate for PC, further validation studies will be needed in the future.
Supplementary Materials: The following are available online at https://www.mdpi.com/1422-0067/22/3/1007/s1, Figure S1: The heatmap of 103 miRNAs for distinguishing PC and non-cancer samples, Figure S2: The heatmap of 149 miRNAs for distinguishing PC and non-PC samples, Table S1: The differentially expressed 103 miRNAs for distinguishing PC and non-cancer samples, Table S2: The differentially expressed 149 miRNAs for distinguishing PC and non-PC samples, Table S3: Clinical characteristics of the samples used for qRT-PCR.