Non-Coding RNAs as Prognostic Biomarkers: A miRNA Signature Specific for Aggressive Early-Stage Lung Adenocarcinomas

Lung cancer burden can be reduced by adopting primary and secondary prevention strategies such as anti-smoking campaigns and low-dose CT screening for high risk subjects (aged >50 and smokers >30 packs/year). Recent CT screening trials demonstrated a stage-shift towards earlier stage lung cancer and reduction of mortality (~20%). However, a sizable fraction of patients (30–50%) with early stage disease still experience relapse and an adverse prognosis. Thus, the identification of effective prognostic biomarkers in stage I lung cancer is nowadays paramount. Here, we applied a multi-tiered approach relying on coupled RNA-seq and miRNA-seq data analysis of a large cohort of lung cancer patients (TCGA-LUAD, n = 510), which enabled us to identify prognostic miRNA signatures in stage I lung adenocarcinoma. Such signatures showed high accuracy (AUC ranging between 0.79 and 0.85) in scoring aggressive disease. Importantly, using a network-based approach we rewired miRNA-mRNA regulatory networks, identifying a minimal signature of 7 miRNAs, which was validated in a cohort of FFPE lung adenocarcinoma samples (CSS, n = 44) and controls a variety of genes overlapping with cancer relevant pathways. Our results further demonstrate the reliability of miRNA-based biomarkers for lung cancer prognostication and make a step forward to the application of miRNA biomarkers in the clinical routine.


Introduction
The latest global lung cancer data indicate a burden of 2.09 million new cases and 1.76 million deaths in 2018 [1]. The main type of lung cancer is represented by Non-Small-Cell Lung Cancer (NSCLC) (80-85%) including several heterogeneous tumor subtypes: lung adenocarcinoma (ADC,~40% of lung cancers), squamous cell carcinoma (SqCC,~25% of lung cancers) and large cell carcinoma (LCC, 10% of lung cancers) [2]. In the last decades, there have been significant improvements in lung cancer treatment, such as stereotactic ablative radiotherapy (SABR), targeted therapy and immunotherapy [3][4][5]. Nevertheless, despite the successful introduction of these new treatments in clinical practice, global lung cancer mortality rates remained rather unchanged in the last 40 years, with some variability worldwide due to different lifestyle, environmental and occupational exposures [4,6]. However, primary and secondary prevention strategies such as anti-smoking campaigns and the implementation of large CT screening programs resulted in a reduction of lung cancer mortality of~20% in enrolled patients and progressive lung cancer stage-shift [7,8]. In addition, the high level of molecular heterogeneity of lung cancer enhances the metastatic dissemination of a large fraction of early stage tumors (~30-50%) [9]. In-depth molecular and functional characterization of ADC could help to contextualize tumor heterogeneity in specific molecular subtypes which may suggest alternative therapeutic options. We recently described a 10-gene prognostic signature for stage I ADC which identified a subset of tumors, namely C1-ADC [10,11], with peculiar gene/protein expression and genetic alterations resembling more advanced cancer. This prognostic gene signature can be measured by qRT-PCR, Affymetrix or RNA-seq, in fresh-frozen or in formalin-fixed, paraffin-embedded (FFPE) specimens [11].
To foster clinical translation of this 10-gene signature, here we present a miRNA signature as a surrogate of the 10 genes, for prognostic risk stratification of ADC. A miRNA-based prognostic signature would overcome the problem of using low-quality mRNA when extracted from FFPE samples, which are routinely used for diagnostic purposes. Indeed, shorter non-coding RNA molecules such as miRNA are more resistant to harsh conditions [12,13] and compatible with most of the expression profiling methods including qRT-PCR.

MiRNA-Signature Identification
We developed a multi-tiered approach summarized in Figure 1, which allowed us to identify a surrogate miRNA-based signature for prognostication of ADC patients.
Firstly, we performed gene expression profile analysis of a total of 515 ADC patients belonging to the TCGA-LUAD cohort (see Section 4), with available mRNA data. Patients and tumors characteristics are reported in Table 1. Stage I tumors represented 54% of the cohort and smoking habit was present in 71%. Median length of follow-up in survivors was 2.1 years.
Hierarchical clustering analysis using the 10-gene signature of the TCGA-LUAD cohort (n = 515) patients revealed 4 main branches, namely C1 (n = 201), C2 (n = 98), C3 (n = 39), and C4 (n = 177) clusters ( Figure 2a) that are consistent with previous findings [11]. Analysis of the 3-years overall survival showed non-significant differences between C2, C3 and C4 clusters (log-rank test p-value = 0.90 and p-value = 0.48 in stage I and advanced stages, respectively), that were therefore collapsed into non-C1 clusters. C1 cluster displayed the worse prognosis both in stage I (p-value = 0.0010) and in more advanced stages (p-value = 0.0061) ( Figure 2b). Furthermore, C1 cluster displayed a significant higher fraction of male subjects and patients with more advanced lung cancer, and a nearly significant higher proportion of smokers (Table S1), which is in line with the reported worse prognosis [9].
We then performed miRNA expression profile of 510 out of the 515 ADC of the TCGA-LUAD cohort, with miRNAs expression data available. We used both DESeq2 R package and BRB-ArrayTools (see Section 4) as alternative statistical approaches in order to identify differentially expressed miRNAs in C1 and non-C1 clusters of ADC. We analyzed a total of 382 miRNAs, of which 200 were found differentially expressed by DESeq2 and 90 by BRB-ArrayTools (Table S2A,B, respectively). A total of 87 miRNAs were overlapping in the two sets. Lasso regularization was then applied to identify optimized miRNA-based signatures capable of stratifying C1 from non-C1 tumors. In total, two signatures of 14-(from the 90 miRNA-set) and 19-miRNA (from the 200 miRNA-set) were derived (5 miRNA overlapping; Table 2), which displayed a high accuracy in C1/non-C1 cancer patients stratification (cross-validated AUC = 0.81 and AUC = 0.85, respectively; Figure 2c).  To further reduce complexity of these miRNA-based biomarkers, we looked for a minimal set of miRNAs capable of the same accuracy of the 14-and 19-miRNA signatures to identify C1 aggressive disease. The following assumptions were made: (i) the molecular function of a miRNA is dependent to the network of targeted mRNAs which, in this case, are those differentially expressed in C1/non-C1 tumors; (ii) a prognostic biomarker should be functionally linked to mechanisms involved in tumor progression. Accordingly, we explored the miRNA-mRNA interactome characterizing C1 tumors by performing ARACNe (Algorithm for the Reconstruction of Accurate Cellular Networks) (see Section 4) using the set of 200 miRNA, and a set of 2900 mRNA genes found significantly regulated in C1-ADC (p-value < 0.05) by DESeq2 (see Section 4). Our analysis was restricted to genes identified by DESeq2 in order to reduce technical variability. The following rules were applied to rewire C1 miRNA-mRNA interactome: (1) we selected miRNA-mRNA pairs generated in only C1 tumors and specific, but not exclusive, for stage I (n = 2858); (2) we selected miRNA predicted to target C1-genes (n = 1787, miRWalk3.0, see Section 4), and (3) with an opposite trend of expression than C1-genes (n = 598); (4) we selected miRNA interacting with a least three C1-genes (n = 528). Percentages could not add up to 100 due to rounding; 1 19 patients with missing information on age; 2 1 patient with adenocarcinoma in situ; 3 9 patients with missing follow-up in the TCGA-LUAD cohort; 4 3 deaths were excluded: 1 without date of death, and 2 within 30 days from surgery.
Among the miRNA-mRNA networks identified, we found a set of interacting networks with 7 miRNA as "HUBs" which derived from both the 19-miRNA and 14-miRNA signatures ( Table 2 and Figure 2d). Hierarchical clustering analysis of this 7-miRNA signature (Table S3) showed an overall increased expression in the more aggressive C1 tumors ( Figure 2e). Importantly, the 7-miRNA signature had a cross-validated AUC of 0.79 in C1/non-C1 patients' stratification, which is comparable to the other two signatures (Figure 3a), as well as when we considered differences in C1 predicted probability ( Figure 3b). The predicted C1 class from all the three signatures (7-, 14-and 19-miRNA) presented significantly increased hazard of death at 3 years in patients of all stages, with an increased risk comparable to C1 patients identified by using the 10-gene signature (Table 3). However, when we focused the analysis to stage I ADC patients, we scored that the best risk-stratification was held by the 7-miRNA signature with approximately two-fold increased risk of death for C1 patients (HR = 2.11; 95% Confidence Interval: 1.11-4.00; p-value = 0.0223) ( Table 3). Interestingly enough, the networks of genes targeted by these 7 miRNAs were found significantly (q-value < 0.0001) enriched in gene sets representing molecular mechanisms related to cancer progression, which fulfilled our initial hypotheses ( Figure 3c).
Despite most of 90 miRNAs identified by BRB-ArrayTools (87/90, 97%) were comprised in the 200-miRNA set found by DESeq2, including 12 out of 14 miRNAs of the BRB-derived model, we performed ARACNe as well by using this 90-miRNAs set. Among the three not overlapping miRNAs, only hsa-miR-210-3p passed all the selection filters we described previously. However, when we added this additional miRNA to the 7-miRNA signature and performed cross-validation in C1/non-C1 patients' stratification, the prediction performance remained the same (AUC = 0.79).

Seven-miRNA-Signature Validation
Finally, we validated the 7 miRNA-signature in an external cohort of 44 lung adenocarcinoma patients, which was collected at the IRCCS Casa Sollievo della Sofferenza Hospital (CSS). Clinical pathological characteristics of the CSS cohort are reported in Table 1, with an overrepresentation of stage I tumors in CSS (70%) with respect to the TCGA-LUAD cohort (54%). We performed qRT-PCR analysis of FFPE samples using the 10-gene signature and calculated relative risk-score to stratify the cohort into C1 (n = 16) and non-C1 (n = 28) groups (Figure 3d) (see Section 4). Next, we performed Low-Density Taqman miRNA Arrays to profile the 7-miRNA signature in the same cohort of 44 ADC and, using logistic regression, we rederived a model based on the expression profile of the 7-miRNA signature ( Table S4). The 7-miRNA model stratified C1 from non-C1 tumors with an AUC of 0.76 ( Figure 3e) and with significant difference (p-value = 0.0028) in C1 predicted probability (Figure 3f). Remarkably, when we limited the analysis to stage I tumors, we scored an AUC of 0.81 ( Figure 3e) and a significant difference (p-value = 0.0108) in C1 predicted probability (Figure 3f).

Discussion
Improvements in lung cancer early diagnosis by large scale low-dose CT screening trials is resulting in a stage-shift towards earlier stage lung cancer, with subsequent reduction in mortality as observed in NELSON and NLST trials [7,8]. For this reason, there is an urgent need of prognostic biomarkers for patients with stage I lung cancer who could eventually benefit from systemic adjuvant chemotherapy (platinum-based) rather than molecular targeted/immuno therapeutics in case of aggressive disease. Nowadays, the advent of more sophisticated and precise bioinformatic tools and computational approaches has sped up the identification of novel diagnostic and prognostic cancer biomarkers, either focused on proteins, coding transcripts, epigenome modifications or non-coding transcripts [14,15]. In lung cancer, in particular, the differential expression of miRNAs allowed the development of innovative and promising cancer biomarkers [16].
Here we present surrogate miRNA signatures which recapitulate a previously described 10-gene prognostic signature in stage I ADC [11]. The 7-, 14-and 19-miRNA signatures were all effective in identifying aggressive C1-ADC disease (AUC = 0.79-0.85). Notably, 6 out of 7 miRNAs of the 7-miRNA signature were well-detected in FFPE samples (median Ct < 30; Table S3) which confirmed the proven higher stability of miRNAs in low-quality RNA [17]. Importantly, in our approach, we adopted a network-rewiring strategy by specifically selecting miRNA-mRNA pairs which characterize aggressive stage I tumors (C1). Such an approach allowed us to select a core of 7 miRNAs capable to stratify C1 from non-C1 samples with an accuracy comparable to the 14-and 19-miRNA models (Figures 2c and 3a), and, importantly, interacting with C1 tumor transcriptome. This is relevant for capturing molecular mechanisms controlled by miRNAs which are associated to tumor progression. As a matter of fact, we observed a large overlap between the '7-miRNA network' with several gene sets representing cancer relevant pathways (Figure 3c). Further experiments are therefore warranted in order to investigate whether this 7-miRNA network is functionally linked to lung cancer progression.
On the other hand, very little is known about hsa-miR-550a-5p and hsa-miR-584-5p biological functions. However, hsa-miR-550a-5p was found to be a prognostic factor in ADC in association with other 3 miRNAs indicating its possible oncogenic role [31]. Lastly, in gastric cancer, hsa-miR-584-5p was found to induce apoptosis and inhibits proliferation [32], while in hepatocellular carcinoma hsa-miR-584-5p was shown to have an oncogenic role [33].
In conclusion, we developed a 7-miRNA signature which is capable in identifying aggressive early stage lung adenocarcinoma. The expression profile of such miRNAs can be measured by standard qRT-PCR and using FFPE samples. A limit of the present study is the relatively small size of the external validation cohort (CSS) and the short follow-up (1.6 median years in survivors, and three deaths recorded within 3 years), which did not allow us to quantify the excess of mortality risk for patients with predicted aggressive tumors.

TCGA-LUAD Cohort
We selected the cohort of 515 patients with lung adenocarcinomas from the TCGA data portal (https://portal.gdc.cancer.gov/) at 2018. A total of 510 tumors were profiled for both gene and miRNA expression. Log2 read counts were used for expression analysis. Patients follow-up information was used for survival analysis: overall survival was defined as the time from the date of tumor resection until death from any cause. Follow-up was truncated at 3 years to reduce the potential overestimation of overall mortality with respect to lung cancer-specific mortality.

The CSS Cohort
We selected a cohort of 44 patients with lung adenocarcinoma underwent surgery between February 2017 and February 2020 at the CSS. Written informed consent was obtained from all study patients. None of these patients received preoperative chemotherapy. Clinical information was obtained through review of medical records. Vital status was assessed through the Vital Records Offices of the patients' towns of residence or by contacting directly the patients or their families.

Gene Expression Analysis of the TCGA-LUAD Cohort
Hierarchical clustering analysis was performed on the 10-gene signature for the entire cohort of 510 patients. Clustering was done by using Cluster 3.0 for Mac OS X (C Clustering Library 1.56, Tokyo, Japan) with uncentered correlation and centroid linkage, and Java TreeView software environment (version 1.1.6r4; http://jtreeview.sourceforge.net). A total of four main branches were selected to build clusters. Kaplan-Meier survival curves were stratified by clusters and log-rank test p-values were calculated. C1 cluster was associated to the worse prognosis, and all other clusters were pooled together (non-C1 clusters).
To retain most informative expression data (i.e., transcripts detected in most of tumor samples), we considered miRNAs with raw counts >0 in at least the 50% of patients either in C1 or non-C1, identifying a total of 382 miRNAs. This allowed us to reduce also the complexity of the TCGA-LUAD dataset (2237 miRNAs). We applied the complexity reduction also to genes and we selected the most varying across all samples (standard deviation in the top 25%), identifying a total of 4899 genes. Using DESeq2 R package (R Core Team, R Foundation for Statistical Computing, Vienna, Austria) [34], we identified a total of 2900 differentially expressed genes between C1 and non-C1 tumors.
BRB-ArrayTools [35] and DESeq2 (R package) [34] tools were used for class prediction (C1 cluster vs. non-C1 clusters) according to miRNA expression. BRB-ArrayTools uses statistics based on two-sample T-test with multivariate permutations test (1000 random permutations); confidence level of false discovery rate assessment, 80%; maximum allowed proportion of false-positive genes, 0.05. DESeq2 is based on Wald test statistics to identify differentially expressed transcripts. Lists of miRNAs differentially expressed were obtained from BRB-ArrayTools and DESeq2 tools were subsequently reduced via Lasso regularization. In details, a penalized unconditional logistic regression was applied considering cluster as discrete outcome (C1 cluster vs. non-C1 clusters) and miRNA expressions as explanatory variables. Cross-validated (10-fold) log-likelihood with optimization (50 simulations) of the tuning penalty parameter was used to control for potential overfitting.
Starting from differentially expressed genes (identified with DESeq2) and miRNAs (identified with both DESeq2 and BRB-ArrayTools), we used ARACNe [36] with 1000 bootstraps to infer direct regulatory relationships between transcriptional regulators (i.e., miRNAs) and target genes. ARACNe was performed using all patients, stage I patients and stage II-IV patients. miRNA target genes were retrieved using miRWalk 3.0 [37].
Probability of being in the C1 cluster was estimated using the unconditional logistic regression for the three signatures of 19, 14 and 7 miRNAs. Model performance was assessed using the cross-validated area under the receiver operating curve, and assessing the difference in C1 predicted probability between C1 and non-C1 patients (Wilcoxon-Mann-Whitney test). Cox regression model was used to evaluate the prognostic role of these miRNA signatures and their ability to recapitulate the risk-stratification of the original 10-genes signature.
To receive insights into the biology of the 7-miRNA model, we verified the enrichment of cancer-relevant pathways associated to their target genes. We investigated the Molecular Signature Database (MSigDB; v7.2) (https://www.gsea-msigdb.org/gsea/msigdb/annotate.jsp) using the list of 87 targeted genes by interrogating the CGP (chemical and genetic perturbations, 3358 gene sets). Bubble plot analysis was performed using JMP 15.2.1 (SAS Institute, Inc., Cary, NC, USA) software.
Hierarchical clustering analysis was performed on the 7-miRNA signature for 510 patients, those with available miRNA expression data. Clustering was completed by using Cluster 3.0 for Mac OS X (C Clustering Library 1.56) with uncentered correlation and centroid linkage, and Java TreeView software environment (version 1.1.6r4; http://jtreeview.sourceforge.net).

RNA Extraction and qRT-PCR Analysis and Data Interpretation
One tissue core (1.5 mm in diameter) from FFPE blocks, in representative tumor areas with adequate tumor cellularity (>60%) selected by a pathologist, was processed for total RNA extraction. The AllPrep DNA/RNA FFPE kit (QIAGEN, Inc., Hilden, Germany) was used for total RNA extraction. Quantitative real-time PCR (qRT-PCR) was performed to analyze the 10-genes signature as described in Dama et al. [11]. Briefly, RNA was quantified using Nanodrop ND-10000 Spectrophotometer and a total of 200 ng was retro-transcribed using SuperScript VILO cDNA Synthesis Kit (Thermo Fisher Scientific, Inc., Waltham, MA, USA) (Thermo Fisher Scientific) and pre-amplified for 10 cycles with PreAmp Master Mix Kit (Thermo Fisher Scientific), following manufacturer's instructions. qRT-PCR analysis was performed starting from 1:10 diluted pre-amplified cDNA, using the TaqMan Fast Advance Master Mix and hydrolysis probes (Thermo Fisher Scientific; for primers see Dama et al. [11]), in a QuantStudio 12k Flex (Thermo Fisher Scientific). Thermal cycling amplification was performed with an initial incubation at 95 • C for 30 s, followed by 45 cycles of 95 • C for 5 s and 60 • C for 30 s. For miRNA expression analysis, a total of 10 ng RNA was reverse-transcribed using the TaqMan Advanced miRNA cDNA Synthesis Kit (Thermo Fisher Scientific). Poly(A) tailing, adapter ligation, RT reaction and miR-Amp were performed following manufacturer's instructions. qRT-PCR was performed following manufacturer's instructions (i.e., 95 • C for 30 s, 45 cycles of 95 • C for 5 s, and 60 • C for 30 s) using a Card Custom Advance (Thermo Fisher Scientific; Table S5) in a QuantStudio 12k Flex (Thermo Fisher Scientific). The hsa-miR-16-5p was used as standard reference for CT normalization using a previously described methodology [38]. Briefly, the normalized CT of each miRNA (i) of each sample (j) was alculated as the difference between the raw CT ij and a scaling factor (SF) specific for each sample (j); the SF j represented the difference between the raw CT of the miRNA "hsa-miR-16-5p" used as a reference in the sample (j) and a constant equal to 21.87. Notably, hsa-miR-16-5p expression profile analysis in both TCGA-LUAD and CSS cohorts revealed a comparable expression in C1 and non-C1 tumors subsets (Table S6).
Risk-scores were assigned to each patient based to the 10-gene risk model described in Dama et al. [11]. Before applying the risk-model, data were rescaled (q1-q3 normalization). Patients with risk-scores higher than the 66th percentiles [11] were classified as C1 tumors. Next, unconditional logistic regression (C1 vs. non-C1 tumors) with 7 miRNAs as explanatory variables was applied, and the area under the receiver operating curve was calculated. Difference in C1 predicted probability between C1 and non-C1 patients was evaluated through Wilcoxon-Mann-Whitney test.
All statistical analyses were performed using SAS software, version 9.4 (SAS Institute, Inc., Cary, NC, USA) and R 3.3.1 (R Core Team, 2016) and JMP 15 (SAS). p-values less than 0.05 were considered statistically significant.