Non-Coding RNAs in Lung Cancer: Contribution of Bioinformatics Analysis to the Development of Non-Invasive Diagnostic Tools

Lung cancer is currently the leading cause of cancer-related mortality due to late diagnosis and limited treatment options. Non-coding RNAs are not translated into proteins and have emerged as fundamental regulators of gene expression. Recent studies reported that microRNAs and long non-coding RNAs are involved in lung cancer development and progression. Moreover, they are promising new non-invasive biomarkers for early lung cancer diagnosis. Here, we highlight their potential as biomarkers in lung cancer and present how bioinformatics can contribute to the development of non-invasive diagnostic tools. To this end, we discuss several bioinformatics algorithms and software tools for a comprehensive understanding and functional characterization of microRNAs and long non-coding RNAs.

Altered miRNA expression levels not only correlate with patient relapse and survival but can also clearly distinguish between lung cancers; thus, they might function as novel diagnostic biomarkers.
Several studies highlight miRNA expression signatures as non-invasive diagnostic tools for early detection and classification of lung cancer, including first promising clinical results. For example, Hennessey et al. (2012) [49] reported, from a phase I/II biomarker study, the potential of a serum miRNA-15b/miRNA-27b pair as a sensitive and effective biomarker for early NSCLC detection. Moreover, Sozzi et al. (2014) [13] described the clinical utility of blood-based miRNA signatures, which have high potential to reduce the false-positive rate of low-dose CT (LDCT) screening when used as an additional diagnostic tool (3.7% compared to 19.7% for LDCT alone; MILD trial). Similarly, Montani et al. (2015) developed a blood miRNA test for identifying the optimal patients for LDCT screening (COSMOS trial; accuracy, sensitivity and specificity approximately 75%-78%) [14].
These miRNA blood tests were mainly developed for smokers as a first-line tool ahead of LDCT screening, which per se has high false-positive rates; in such high-risk cohorts they reduce unnecessary LDCT screenings in disease-free individuals. For the general population, such tests are probably not yet very beneficial and need further improvement and validation [13,14,49].
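The accuracy, sensitivity and specificity values quoted for such tests follow from a standard confusion matrix. As a minimal sketch (the function name and the counts are hypothetical and are not taken from the cited trials), these quantities could be computed as follows:

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute basic performance measures of a binary diagnostic test.

    tp/fn: diseased individuals with a positive/negative test result
    tn/fp: disease-free individuals with a negative/positive test result
    """
    return {
        "sensitivity": tp / (tp + fn),          # true-positive rate
        "specificity": tn / (tn + fp),          # true-negative rate
        "false_positive_rate": fp / (fp + tn),  # 1 - specificity
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical screening cohort: 100 patients and 1000 disease-free smokers.
metrics = diagnostic_metrics(tp=78, fp=250, tn=750, fn=22)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the false-positive rate reported for a screening programme may be defined differently (for example, per screened individual), so the definition used should always be checked against the original study.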
On the other hand, studies investigating the potential of miRNAs to differentiate between NSCLC patients and smokers have begun. For example, Zhu et al. (2016) [40] validated the diagnostic value of a four-miRNA serum signature (miRNA-182, miRNA-183, miRNA-210 and miRNA-126) for early diagnosis, which can also distinguish NSCLC patients from current smokers. This points to the use of a combined biomarker signature for better diagnostic value and an effective molecular characterization of lung cancer patients [37,45].
Nevertheless, miRNA-based signatures have drawbacks and limitations: they depend on technical accuracy and on proper validation, standardization and normalization schemes. miRNAs have been detected either in tumor tissue or in blood samples. However, miRNA levels differ between plasma and serum, and the overlap of miRNAs present in both blood and tissue is small [14,50]. This illustrates the challenge of establishing miRNA signatures as a clinically reliable blood test. Despite the low overlap of miRNAs between different sample types, there is a consensus set of known miRNAs expressed in lung cancer, e.g., miRNA-21 and/or miRNA-30d, at least one of which is found across different studies and/or known to be expressed in tissue, plasma and serum [13,44,49-51]. These statistically significant miRNAs were also independently validated by bioinformatics meta-analyses, with miRNA-21 and miRNA-210 as the most consistently reported miRNAs in lung cancer [52,53]. Thus, circulating blood signatures based on those miRNAs have high potential as an effective diagnostic tool for early lung cancer detection.

lncRNAs as Diagnostic Biomarkers
lncRNAs represent a large and diverse class of ncRNAs that play an important biological role. They are multi-exonic transcripts longer than 200 nucleotides and are often less conserved among species [7,28,54,55]. lncRNAs show altered expression and regulate important biological processes and pathways associated with lung cancer pathogenesis and progression [15]. They have complex regulatory effects ranging from transcriptional regulation (enhancer, chromatin modifier and transcription factor (TF) binding) and regulation of epigenetic processes (nuclear lncRNAs) to translational regulation through mRNA and protein binding as well as miRNA sponging (cytoplasmic lncRNAs) [54-58] (see Figure 1). Thus, functional characterization is challenging and most lncRNAs are not well understood: several thousand lncRNA transcripts have been annotated, but the majority are functionally uncharacterized [59-62].
As lncRNAs show tissue- and disease-specific expression, they have emerged as potential diagnostic biomarkers [2,7]. lncRNAs are already known as promising biomarkers for different diseases, e.g., LIPCAR as a biomarker for cardiac remodeling [63] or HULC for liver cancer [15,64]. Moreover, lncRNA-based diagnostic tests, such as the lncRNA prostate cancer antigen 3 (PCA3) as a urinary biomarker for prostate cancer, are already available for clinical use [65]. Therefore, lncRNAs are currently also under investigation for lung cancer, showing high potential for developing effective screening tests for diagnosis [66,67]. Table 2 summarizes different clinical lncRNA lung cancer studies, including background information (number of patients, tumor stage, histological subtype; if available).
The most promising lncRNA candidate is the metastasis-associated lung AC transcript 1 (MALAT1), which is highly expressed, e.g., in lung and pancreas. MALAT1 is functionally well characterized and known to be a prognostic marker for early-stage NSCLC as well as cancer metastasis [29,68]. Moreover, MALAT1 has been shown to serve as a blood-based biomarker for the early detection of NSCLC. For example, Yao et al. (2012) identified a four-marker serum signature containing MALAT1 and three proteins that shows high diagnostic accuracy for detecting early-stage NSCLC [12]. Similarly, Weber et al. (2013) reported MALAT1 as a non-invasive and effective diagnostic biomarker for NSCLC diagnosis [22]. As the authors observed a low sensitivity, they suggest that MALAT1 might not be suitable as a single biomarker but is applicable as a complementary biomarker [22]. The lung AC-specific lncRNA colon cancer-associated transcript 2 (CCAT2) displays altered expression, promotes invasion of NSCLC and can serve as a biomarker for lymph node metastasis [69].
The lncRNA HOX transcript antisense intergenic RNA (HOTAIR) functions as gene expression repressor through recruitment of chromatin modifiers and correlates with metastasis and poor prognosis in NSCLC [71,77].
Further lncRNAs known to be associated with different cancer types are also under investigation as diagnostic biomarkers in lung cancer. For instance, Tantai et al. (2015) showed that the lncRNAs XIST and HIF1A-AS1 have significantly increased levels in tumor tissues or serum from NSCLC patients, highlighting the clinical significance of combining both lncRNAs for effective diagnostic screening for NSCLC [74].
lncRNAs were also investigated regarding their potential as clinical biomarkers to predict lung cancer risk and treatment response. In this context, Gong et al. (2016) found that genetic polymorphisms of well-characterized lncRNAs such as CCAT2, HOTAIR and MALAT1 were significantly associated with lung cancer susceptibility and platinum-based chemotherapy response, indicating that they might function as clinical biomarkers [75]. Furthermore, Yuan et al. (2016) found, in a large-scale meta-analysis of 690,564 single-nucleotide polymorphisms (SNPs) in 15,531 autosomal lncRNAs, a genetic SNP risk locus (1p31.1) in the lncRNA NEXN-AS1 which influences the secondary structure and is statistically associated with lung cancer risk [76]. It could thus serve as a potential risk biomarker for lung cancer diagnosis. The potential of lncRNAs as diagnostic biomarkers was also confirmed by several meta-analyses, with MALAT1 and the human urothelial carcinoma associated 1 (UCA1) as the most promising candidates in lung cancer patients [29,78-80].
In addition, altered lncRNA expression levels might also accurately distinguish between AC and SQ and predict the clinical outcome for both NSCLC subtypes [1,2,6,7]. White et al. (2014) characterized 567 RNA-Seq datasets from AC and SQ tumors and found 463 up-regulated and 315 down-regulated lncRNAs in AC tumors relative to SQ. Moreover, they reported 27 lncRNAs differentially expressed between AC and SQ that can potentially serve as important biomarkers for lung cancer diagnosis [7]. Furthermore, another study identified 1646 differentially expressed lncRNA transcripts, among which the lncRNA LINC01133 showed the largest up-regulation in SQ but not in AC samples and correlated with shorter survival time [2]. Recently, four six-lncRNA signature patterns were reported that are significantly associated with AC and SQ patient survival [1]. More interestingly, the authors also demonstrated that knockdown of the up-regulated lncRNAs AFAP1-AS1 and LINC00511 impaired AC cell proliferation, while knockdown of PVT1 inhibited SQ cell growth [1].
However, to the best of our knowledge, currently no study has been directly carried out to identify lncRNA expression signatures that can differentiate NSCLC and SCLC lung cancer subtypes.
An explanation might be that investigating lncRNAs in lung cancer is a relatively new research field and therefore only very few lncRNAs are well-characterized. Moreover, lncRNA studies focused primarily on NSCLC as the more commonly diagnosed subtype, but also mainly on lncRNAs that were already reported from other cancer types, e.g., the colon cancer lncRNA CCAT2. However, there exist reports on lncRNA characterization in different lung cancer subtypes. For example, Qiu et al. (2014) identified that the lncRNA CCAT2 shows a specific expression in AC and might function as biomarker for lymph node metastasis [69], whereas Chen et al. (2016) recently found that CCAT2 serves also as an independent unfavorable prognostic factor in SCLC patients [70]. In this regard, it would be of high importance for the future to investigate differentially expressed lncRNAs between NSCLC and SCLC in order to develop a diagnostic biomarker signature that can accurately distinguish between both lung cancer subtypes.

Bioinformatics Databases
Popular databases such as Rfam [84] provide information about several RNA families, including sequence and consensus secondary structure information, whereas LNCipedia [59], LncRBase [60] and miRBase [109] provide information about specific families together with further details, e.g., experimental data, tissue expression and targets. On the other hand, databases such as MiR2Disease [110] and LncRNADisease [114] focus on the disease- and interaction-specific context based on the literature and/or experimental disease data. All these databases provide information about miRNAs and lncRNAs and allow a fast overview of their sequence, structure and functional role. However, most miRNAs and lncRNAs are newly detected without concrete knowledge about their functional role, so an integrated bioinformatics analysis is required for a comprehensive understanding. Such an analysis should combine phylogenetic sequence-structure conservation analysis with functional interaction partner, biological process and pathway as well as promoter analysis (see Figure 2).

Table excerpt (tool, function, URL):
Pprint [128], interaction prediction, http://www.imtech.res.in/raghava/pprint/
KYG [129], interaction prediction, http://cib.cf.ocha.ac.jp/KYG/
Struct-NB [130], interaction prediction, http://www.public.iastate.edu/~ftowfic
PRINTR [131], interaction prediction, http://210.42.106.80/printr/
lncRNAtor [132], functions/interactions/networks, http://lncrnator.ewha.ac.kr/index.htm

Figure 2. Illustration of integrated bioinformatics analysis of ncRNAs (miRNAs, lncRNAs; red circle) which should focus on the sequence, structure, promoter and interaction partner prediction combined with functional analysis (rectangles). Dashed arrows represent the three main analysis steps (e.g., promoter analysis), whereas continuous arrows show the subsequent functional analysis step using the results obtained from the previous steps (e.g., transcription factors) to get a comprehensive functional understanding of ncRNAs (green hexagon).

Phylogenetic Sequence-Structure Conservation
Sequence data are available from the Ensembl (http://www.ensembl.org/) and UCSC genome browsers. The sequence can first be analyzed using the BLAST algorithm [85] to find homologous sequences among mammalian species, e.g., human, chimpanzee and mouse. The resulting sequences can be further analyzed for sequence and structure conservation using bioinformatics secondary structure prediction algorithms. Dynamic programming algorithms such as the Zuker algorithm, which are implemented in RNAfold and Mfold, calculate the thermodynamically optimal secondary structure of a sequence based on minimum free energy. These algorithms accurately calculate the optimal secondary structure; however, they are not suited to large-scale applications with several sequences because of the long calculation time [86,87,133,134]. A more effective approach is the Sankoff algorithm, which can simultaneously align and fold multiple sequences [88,133,135-137]. Programs using the Sankoff algorithm are, for instance, RNAalifold [88], FOLDALIGN [89] and LocARNA [90], in which a pairwise sequence alignment is generated and subsequently aligned and folded to calculate the optimal conserved secondary structure [135]. However, further extensions of the Sankoff algorithm are more efficient and allow a faster calculation, e.g., the RNAshapes program is reported to run in linear time based on a non-heuristic approach, rather than the exponential time of the Sankoff algorithm [91,138]. Another useful program for large sequence and structure data sets is 4SALE, which allows fast, synchronous analysis of sequence and secondary structure, including further analyses and manual editing [92].
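The dynamic programming principle behind such folding algorithms can be illustrated with the much simpler Nussinov recursion, which maximizes the number of nested base pairs instead of minimizing free energy. The sketch below is therefore a stand-in for illustration only and is not the Zuker algorithm as implemented in RNAfold or Mfold; the function name and the minimum loop length of three unpaired nucleotides are assumptions.

```python
def max_base_pairs(seq, min_loop=3):
    """Nussinov-style dynamic programming (simplified stand-in for MFE folding).

    Returns the maximum number of nested base pairs in `seq`, allowing
    Watson-Crick and G-U wobble pairs and requiring at least `min_loop`
    unpaired nucleotides inside every hairpin loop.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best score for the subsequence seq[i..j] (inclusive)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            for k in range(i, j - min_loop):  # case 2: j pairs with k
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


print(max_base_pairs("GGGAAACCC"))  # 3 (a three-pair stem closing an AAA loop)
```

The cubic running time of this recursion for a single sequence gives an impression of why folding many long sequences, or aligning and folding them simultaneously, becomes expensive.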

Functional Interaction Partner Analysis
miRNAs and lncRNAs show enormous clinical potential, but their use is currently limited because exact knowledge of the functional interaction context is necessary. Experimental methods for RNA-target detection range from quantitative proteomic analysis and high-throughput experiments such as tissue-specific microarray and RNA-Seq analysis to UV cross-linking immunoprecipitation-high-throughput sequencing (CLIP-Seq) [33,34,83,139]. Such methods are essential for the correct identification of interaction partners, but they are technically challenging as well as time- and cost-intensive, e.g., interaction analysis considering the whole interactome is laborious, and methods such as CLIP-Seq require a specific target RNA or protein [82,83,115,140]. Over the past decades, several bioinformatics interaction prediction algorithms have been developed which are helpful, e.g., for large-scale applications and for filtering and pre-selecting candidates for further experimental testing. However, RNA-RNA interaction prediction is challenging because of the high combinatorial number of RNA pairs, the time complexity of calculating the joint secondary structure of both RNA molecules, and the required knowledge of the intra-molecular and inter-molecular base-pair interactions between both RNA molecules [115]. In the following, we discuss computational databases and prediction algorithms for RNA interaction analysis (extensively reviewed in [33,115,139,141-143]).

miRNA Target Prediction Algorithms
miRNAs bind mRNAs by seed region matching, which can be predicted by bioinformatics algorithms. Numerous prediction algorithms have been developed which are mainly based on seed region similarity, but newer approaches also include sequence matching combined with structure and/or thermodynamic parameters (folding energy) or target site accessibility [33,34,107,144]. The popular TargetScan algorithm [111] predicts target interactions based on conserved seed region matching, whereas the miRanda [105] and PicTar [112] algorithms allow seed region mismatches but include free folding energy, and the PITA algorithm [113] includes target site accessibility for seed interaction prediction [139,145-147]. In addition, several other interaction prediction algorithms exist, such as RNAup (target site accessibility [106]) and IntaRNA (seed region and target site accessibility [107]), which focus mainly on specific RNA-RNA interactions such as miRNA-mRNA or bacterial small RNAs but can also be used for other RNA types.
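The seed-matching step common to these algorithms can be sketched in a few lines. The example below is a minimal illustration that assumes a perfect match to miRNA nucleotides 2-8; it ignores conservation, folding energy and site accessibility, and it is not the implementation of any of the tools named above. The function names and both sequences are chosen for illustration only.

```python
def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence (A, C, G, U only)."""
    complement = {"A": "U", "U": "A", "G": "C", "C": "G"}
    return "".join(complement[base] for base in reversed(rna))


def find_seed_sites(mirna, utr, seed_start=1, seed_end=8):
    """Find perfect seed matches of a miRNA in a target sequence.

    The seed is taken as miRNA nucleotides 2-8 (0-based slice 1:8), and the
    target site is its reverse complement. Returns 0-based start positions
    of all (possibly overlapping) sites in `utr`.
    """
    site = reverse_complement(mirna[seed_start:seed_end])
    positions = []
    pos = utr.find(site)
    while pos != -1:
        positions.append(pos)
        pos = utr.find(site, pos + 1)
    return positions


# Illustrative sequences only (not curated database entries).
mirna = "UAGCUUAUCAGACUGAUGUUGA"
utr = "GGCAUAAGCUACCAUAAGCUGG"
print(find_seed_sites(mirna, utr))  # [3, 13]
```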
It is worth mentioning that such mRNA target prediction algorithms have drawbacks and limitations which users need to consider carefully. For instance, most algorithms are not validated by experimental data, they do not include tissue-specific miRNA expression, and they are based on different parameters, resulting in a small target overlap between them [33,34,139,144]. Moreover, target predictions often show a large number of false positives, e.g., miRanda has an approximate false-positive rate of 24%-39%, whereas TargetScan shows 22%-31%, PicTar 30% and PITA 20% [113,148,149]. The false-positive rate reflects in this case all erroneously predicted target interactions that have not been experimentally validated. This parameter is of high interest because it approximates how specific and usable an algorithm is, which is especially important for computational target interaction prediction when no experimentally supported results are available [139].
Despite these limitations, several studies highlighted that bioinformatics prediction algorithms are beneficial as an additional tool for experiments [150-152]. In this context, studies demonstrated that algorithms using high seed region similarity achieve the highest accuracy and overlap of predicted and experimentally validated targets [33,34,150], e.g., TargetScanS (requires perfect complementarity), miRanda and PicTar appear to be the most effective methods, with a sensitivity of nearly 65% on experimentally validated interactions, although the miRanda algorithm predicts many more mRNA targets [139,148].
Similarly, Busch et al. (2008) compared several algorithms and demonstrated that IntaRNA shows the highest accuracy, whereas RNAup shows the overall accuracy closest to IntaRNA (average sensitivity of IntaRNA 0.783, RNAup 0.752) [107]. Therefore, users should apply different algorithms and use their overlapping targets in combination with experiments to find the best mRNA interaction partners in a functional and tissue-specific manner [33,34,139].
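Combining the output of several prediction algorithms, as recommended above, amounts to a simple voting scheme over the predicted target lists. The following sketch assumes that each tool's predictions have already been exported as a list of gene identifiers; the function name, the threshold parameter and the placeholder gene names are hypothetical.

```python
from collections import Counter


def consensus_targets(predictions, min_tools=2):
    """Return targets predicted by at least `min_tools` different tools.

    `predictions` maps a tool name to an iterable of predicted target genes.
    Duplicates within one tool are counted only once.
    """
    votes = Counter(
        gene for targets in predictions.values() for gene in set(targets)
    )
    return sorted(gene for gene, count in votes.items() if count >= min_tools)


# Placeholder gene identifiers, not real prediction results.
predictions = {
    "TargetScan": ["GENE_A", "GENE_B", "GENE_C"],
    "miRanda": ["GENE_B", "GENE_C", "GENE_D"],
    "PicTar": ["GENE_C", "GENE_E"],
}
print(consensus_targets(predictions, min_tools=2))  # ['GENE_B', 'GENE_C']
print(consensus_targets(predictions, min_tools=3))  # ['GENE_C']
```

A stricter threshold reduces the number of candidates and, presumably, the number of false positives, at the cost of losing targets that only one approach can detect.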

lncRNA-RNA Interaction Prediction Algorithms
Because of the high computational costs and the fact that lncRNAs are a new research field, methods for predicting lncRNA-RNA interactions are limited [115]. To the best of our knowledge, only two tools currently allow direct lncRNA-RNA interaction prediction. The Rtools pipeline calculates lncRNA-lncRNA and lncRNA-mRNA interactions considering seed matching and target site accessibility by combining different existing algorithms, e.g., IntaRNA, to reduce the computational complexity. The interactions between lncRNAs and the human transcriptome were calculated and also validated using different experimental RNA-RNA interaction datasets [115]. According to the authors, the pipeline is comparable with existing algorithms but significantly faster; however, there is room for improvement to reach higher accuracy, especially for lncRNA interaction predictions [115]. On the other hand, the LncTar software allows calculation of lncRNA-RNA interactions in large-scale datasets [116]. It rests on the assumption that the base pairing detected as primer-dimers in real-time polymerase chain reaction (PCR) design is also an important process in nature, thus enabling prediction of lncRNA-RNA interactions. Therefore, the software uses modified exact melting-temperature and primer-dimer prediction algorithms from the code of PerlPrimer [153], a platform developed for real-time PCR primer design [116]. The authors demonstrated that LncTar efficiently predicts lncRNA-RNA interaction partners with high accuracy, which was further validated using experimental lncRNA-mRNA interactions curated from the LncRNADisease and NPInter databases [116].
However, it currently has some limitations which need further improvement, e.g., it does not take into account stacking base pairs and loop energies when searching for stable joint secondary structures, nor the RNA tertiary structure, all of which are known to play a role in RNA-RNA interactions [116]. miRNAs potentially sponged by lncRNAs can be predicted, e.g., using the miRanda algorithm described above [105]. As the RNAup [106] and IntaRNA [107] algorithms can be used for several RNA types, they can also be applied to lncRNA-mRNA interaction prediction. However, they are not efficient for large-scale prediction of lncRNA targets and/or for larger lncRNA and mRNA molecules, e.g., because of sequence length limitations (IntaRNA ≤ 2 kb, RNAup ≤ 5 kb) [107,116]. Users therefore need to consider the specific features of each tool carefully and may need further steps such as running the software locally.
For more information and a detailed overview of several in silico approaches for functional prediction and mechanistic characterization of lncRNAs, please see the recent review by Signal et al. (2016) [62].

RNA-Protein Interaction Prediction Algorithms
lncRNAs not only regulate RNAs but also interact with proteins, which underlines the importance of an integrated bioinformatics analysis of potential interaction partners for a tissue- and disease-specific functional characterization of lncRNAs. Popular databases such as NPInter [117], the Protein Data Bank (PDB) [118] and the Nucleic Acid Database (NDB) [119] include information about experimentally determined structures of proteins and nucleic acids as well as RNA-protein complex assemblies, whereas BioGRID [120] and IntAct [121] contain protein-protein and RNA-protein interactions from several organisms. Furthermore, several resources combine different databases, thus allowing a comprehensive overview of interactions and further individual analysis. For example, the RPIntDB [82], PRD [122], iMEX [123] and UniProt [124] databases provide functional information and experimental interactions curated, e.g., from BioGRID [120] and IntAct [121], whereas the Protein-RNA Interface Database (PRIDB) [125] collects interactions based on PDB [141]. In addition, the DrumPID database [126], which was developed by our group, focuses especially on the drug-target interaction context by combining interaction and pathway data, but also allows organism- or tissue-specific analyses. Additional features include structural and sequence domain analyses of proteins and RNAs, which help in detecting functional interaction and recognition binding sites such as the RNA recognition motif and RNA-binding domain in proteins [126,141,154].
As these databases mainly contain experimentally validated RNA-protein interactions, they are of limited use for newly annotated lncRNAs. To close this gap, several bioinformatics prediction algorithms were developed that focus on the sequence and/or structure by using different machine learning algorithms. For example, the software "fast predictions of RNA and protein interactions and domains at the Center for Genomic Regulation, Barcelona, Catalonia" (catRAPID) [127] performs predictions based on a sequence HMMscan (probabilistic statistical profile Hidden Markov Model (HMM)) using propensities of individual residues from PDB [127,141,142]. The RPISeq software calculates interactions using different machine learning classifiers [82,141,142], whereas Pprint predicts RNA binding sites using evolutionary position-specific scoring matrix (PSSM) information combined with a support vector machine (SVM) [128,155]. The KYG algorithm focuses on the structure and calculates binding regions by applying a position-specific multiple sequence profile to protein-RNA structures from PDB, also without biochemical or functional data [129,155]. Other algorithms such as Struct-NB combine sequence- and structure-based information on protein-RNA complex interfaces from PDB using a Gaussian Naive Bayes classifier [130], whereas PRINTR calculates interactions using an SVM and PSSM [131,142].
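Sequence-based classifiers of this kind need a fixed-length numerical representation of each RNA or protein before any machine learning step. A common and simple choice is a vector of k-mer frequencies; the sketch below illustrates this idea only, and it is an assumption that it resembles the features of any particular tool named above. The function name and default parameters are hypothetical.

```python
from itertools import product


def kmer_frequencies(seq, k=3, alphabet="ACGU"):
    """Encode a sequence as a vector of normalized k-mer frequencies.

    The vector has len(alphabet)**k entries in lexicographic order of the
    alphabet. Windows containing symbols outside the alphabet are skipped.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


features = kmer_frequencies("ACGUACGUNNACG", k=3)
print(len(features))  # 64 possible 3-mers over A, C, G, U
```

Such a vector for the RNA, concatenated with an analogous vector for the protein, could then be passed to a classifier such as an SVM; position-specific information as in PSSM profiles would require a multiple sequence alignment and is not covered by this sketch.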
All these algorithms perform high-confidence lncRNA-protein interaction predictions and are helpful for finding potential RNA-protein interaction partners [82,141]. Moreover, they show high accuracy, which was validated by independent training datasets including experimentally validated physical interactions [127,129,142,155], e.g., from the NPInter server [117]. Most prominently, catRAPID correctly calculates the interaction of the human lncRNAs XIST and HOTAIR with the Polycomb Repressive Complex 2 (PRC2), and the interaction between HOTAIR and Suz12 predicted by RPISeq and catRAPID was also in agreement with experimental data [82,127,141,142,156].
Nevertheless, these prediction algorithms also have some limitations. For instance, most of them do not consider the tissue-specific and functional context of lncRNA-protein interactions and often produce a large number of interaction predictions [127,141,157]. Moreover, most of them were not systematically validated on general benchmark datasets, they depend on different approaches, and the prediction outcome is affected by the distance threshold used to define interface residues [141,155,157]. In addition, several groups evaluated the influence of different machine learning classifiers and found that the accuracy of sequence-based methods can be increased by using PSSM profiles [128,131,155,158].
For example, Walia et al. (2012) compared several sequence-based and simple structure-based algorithms with complex structure-based algorithms and reported that results are comparable between these approaches [155]. Sequence-based methods using PSSM profiles achieve results comparable to state-of-the-art structure-based methods, although the latter reach higher specificity than exclusively sequence-based approaches [155]. Thus, sequence-based approaches can effectively predict RNA-protein interactions, but higher accuracy can be reached when using PSSM profiles and/or structure-based methods [128,155]. Furthermore, structure-based features also allow the identification of substrate-binding clefts and of how RNAs and proteins specifically recognize each other, although interface residues often show a higher degree of surface irregularity than non-interface residues. However, structure-based methods require information on protein-RNA complexes as training templates, which is often limited [131,141,142,155,159].
As parameters and outputs differ between sequence- and structure-based approaches, it is important for large-scale benchmark applications and for the prediction of unknown RNA-protein interactions to compare different methods and to understand how to use them and interpret their results. In this context, it is essential to have detailed knowledge of the features and datasets used by each interaction prediction algorithm, the evaluation process and the definition of interface residues, e.g., background data, validation and the usage of PSSM profiles [155].

Functional GO and Pathway Enrichment Analysis
Functional enrichment analysis of interaction partners and related regulatory networks is essential for understanding their tissue-specific and functional role and can improve the selection of the best candidates for further experimental validation. Several databases and software tools exist for data mining and analysis of large gene lists. The most prominent annotation resource is the Gene Ontology (GO) Consortium project, which functionally specifies genes/proteins and their relationships in categories, so-called GO terms, regarding molecular function, cellular component and biological process (including pathways) in a species-independent manner [93,94]. All annotation data can be downloaded or accessed online from the GO database or through the web-based application AmiGO, which also supports term enrichment analysis for user input lists using Panther [93,94]. The Protein ANalysis THrough Evolutionary Relationships (Panther) classification is a large database collection of protein families that are divided into functional categories using statistical HMMs, allowing functional analysis and classification of large gene lists and/or experimental datasets into significantly enriched ontology and pathway terms [95]. Other popular databases are Reactome, the Kyoto Encyclopedia of Genes and Genomes (KEGG) and WikiPathways, which also enable analysis of signaling pathways, including tissue analysis, e.g., from gene lists and/or large-scale expression data sets [96-98,160]. Recently, Herwig et al. (2016) developed ConsensusPathDB, which allows a functional and network-based characterization of biomolecules from a user input list and/or experimental high-throughput datasets such as RNA-seq [161].
For this, molecular interaction data from 32 different publicly available online repositories for human, mouse and yeast were integrated, and statistically significant over-represented and enriched interaction network modules and biochemical pathways are calculated using different computational analyses.
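Over-representation of a GO term or pathway in a gene list is commonly assessed with a one-sided hypergeometric test (equivalent to a one-sided Fisher's exact test). The sketch below shows the calculation under that assumption; it is not claimed to be the exact statistic used by any of the tools above, the function and parameter names are hypothetical, and correction for multiple testing is omitted.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """One-sided hypergeometric p-value for term over-representation.

    population: number of genes in the background set
    annotated:  background genes annotated with the term
    sample:     number of genes in the user's list
    hits:       genes in the user's list annotated with the term
    Returns P(X >= hits) when `sample` genes are drawn without replacement.
    """
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)


# Hypothetical numbers: 20,000 background genes, 200 annotated with a term,
# 100 genes in the input list, 8 of which carry the annotation.
print(f"{enrichment_p_value(20000, 200, 100, 8):.2e}")
```

Because many terms are tested at once, the resulting p-values would normally be adjusted, for example with the Benjamini-Hochberg procedure, before terms are reported as enriched.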
Limitations and drawbacks include large output lists and over-prediction, especially for large-scale gene lists. Moreover, results are not filtered for a specific biological process or pathway, and the programs do not consider interactions between genes/proteins and the interaction network in a cell-line and tissue-specific context [33,162,163]. Therefore, a user-curated collection of genes, proteins, processes and signaling pathways associated with lung cancer, together with tissue-specific information, can narrow the results of the functional enrichment analysis and allow a better functional interpretation in a biological context.
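To make the statistical idea behind term enrichment concrete, the following minimal Python sketch tests each term for over-representation with a one-sided hypergeometric test and adjusts the p-values with the Benjamini-Hochberg procedure. It is an illustration under stated assumptions, not the implementation used by Panther, AmiGO or ConsensusPathDB: the function names (`hypergeom_sf`, `enrich`), the gene identifiers and the term names are hypothetical placeholders.

```python
from math import comb


def hypergeom_sf(k, population, annotated, drawn):
    """Return P(X >= k) when `drawn` genes are sampled without replacement
    from `population` genes, of which `annotated` carry the term."""
    upper = min(annotated, drawn)
    total = comb(population, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


def enrich(gene_list, term_to_genes, background):
    """Over-representation test per term, with Benjamini-Hochberg adjustment.

    Returns (term, hits, p_value, adjusted_p) tuples sorted by p-value.
    """
    background = set(background)
    genes = set(gene_list) & background  # ignore genes outside the background
    rows = []
    for term, members in term_to_genes.items():
        annotated = set(members) & background
        hits = genes & annotated
        p = hypergeom_sf(len(hits), len(background), len(annotated), len(genes))
        rows.append((term, len(hits), p))

    # Benjamini-Hochberg: walk from the largest p-value down, keeping the
    # running minimum of p * m / rank so adjusted values stay monotone.
    rows.sort(key=lambda row: row[2])
    m = len(rows)
    adjusted, running_min = [], 1.0
    for rank in range(m, 0, -1):
        term, n_hits, p = rows[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted.append((term, n_hits, p, running_min))
    return adjusted[::-1]


# Toy data: identifiers and term names are placeholders, not real annotations.
background = [f"g{i}" for i in range(20)]
terms = {
    "term_A": {"g0", "g1", "g2", "g3", "g4"},
    "term_B": {"g10", "g11", "g12", "g13", "g14"},
}
candidates = ["g0", "g1", "g2", "g3"]

for term, n_hits, p, q in enrich(candidates, terms, background):
    print(term, n_hits, f"p={p:.4g}", f"q={q:.4g}")
```

The choice of background strongly affects the result: restricting it to genes expressed in the tissue of interest, rather than the whole genome, is one simple way to bring in the tissue context discussed above.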

Bioinformatics Tools for Integrated Functional Analysis
There are several bioinformatics tools for integrated functional analysis and interpretation, e.g., co-expression, disease and tissue-specific analysis, which help to comprehensively understand the functional role of miRNAs and lncRNAs from large input data lists. For example, the starBase web tool deciphers lncRNAs and miRNAs from experimental large-scale CLIP-Seq datasets and tumor samples and provides RNA-protein, miRNA-lncRNA and miRNA-mRNA interactions, including further analyses, e.g., of biological processes and signaling pathways [100]. Similarly, the lncRNAtor tool is specifically designed to investigate and functionally understand lncRNAs by combining, for instance, lncRNA expression profiles, RNA-protein interactions and functional enrichment analysis, using RNA-Seq and CLIP-Seq datasets from publicly available databases such as The Cancer Genome Atlas (TCGA), Gene Expression Omnibus (GEO) and ENCODE [132]. Moreover, the Cytoscape software tool allows visualization and analysis of regulatory networks, e.g., regarding functional GO terms and network topology, co-expression or identification of functional clusters of highly connected interaction partners [99,164,165]. As an example, the Cytoscape plugin ClueGO calculates statistically enriched processes and pathways from a user gene list using GO terms and information from KEGG, WikiPathways and Reactome [166]. Moreover, the ncINetView plugin integrates data from the ncRNA-DB, a database collection of ncRNA interactions from several sources, allowing users to search for associated interaction partners and regulatory networks, including related biological functions and diseases, and to filter for a specific disease [167].
However, current bioinformatics tools only provide information about already known ncRNAs, as they mainly analyze publicly available large-scale datasets or focus on a specific disease and/or ncRNA class. Nevertheless, for specific analyses, e.g., of interaction partners and functional enrichment, powerful tools such as ConsensusPathDB [161] are available, on which further analyses (e.g., promoter and structure analysis as discussed here) can build.
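As a small illustration of the network-topology idea mentioned above (highly connected interaction partners), the following Python sketch builds an undirected interaction network from an edge list and ranks nodes by their degree. It is a simplified stand-in for what dedicated tools such as Cytoscape offer, not a description of their algorithms; the function names and all node identifiers are hypothetical.

```python
from collections import defaultdict


def build_network(edges):
    """Build an undirected adjacency map from (node_a, node_b) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def rank_hubs(adjacency, top=3):
    """Return the `top` nodes with the most distinct interaction partners."""
    degrees = ((node, len(partners)) for node, partners in adjacency.items())
    # Sort by descending degree; ties are broken by node name for stability.
    return sorted(degrees, key=lambda item: (-item[1], item[0]))[:top]


# Toy interaction list: all identifiers are placeholders.
edges = [
    ("lncRNA_1", "miR_a"),
    ("lncRNA_1", "miR_b"),
    ("lncRNA_1", "mRNA_x"),
    ("miR_a", "mRNA_x"),
]
network = build_network(edges)
print(rank_hubs(network, top=1))
```

Degree is only the simplest topological measure; in practice, hub candidates would still need to be interpreted in their tissue and disease context.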

Promoter Analysis
Promoter analysis is an essential step in understanding the complex regulatory effects of ncRNAs, for instance the transcriptional regulation of miRNA and lncRNA expression or, in the case of lncRNAs, also their cooperation with a TF in targeting the promoter, e.g., to regulate their own transcription in a feedback loop; both are helpful with regard to posttranscriptional therapeutic use. TFs bind to specific transcription factor binding sites (TFBSs) in the promoter that can be represented bioinformatically by an IUPAC (International Union of Pure and Applied Chemistry) consensus nucleotide sequence or by a position weight matrix (PWM). The latter is the better representation because it displays the whole nucleotide distribution for each binding site position (extensively reviewed in [168]). Databases and tools using PWMs are TRANSFAC, JASPAR, Allgen PROMO and MatInspector (implemented in the Genomatix software (Genomatix GmbH, Munich, Germany)). These tools allow not only prediction of putative TFBSs for a given sequence but also additional analyses, e.g., genome-wide and comparative regulatory region analysis [101][102][103][104].
Limitations and drawbacks are, for instance, that most methods use different parameters for TFBS detection, are mostly not based on experimental TFBS profiles and do not consider the tissue and functional context of the TF. Moreover, they use different output parameters, e.g., a dissimilarity threshold that controls how similar a sequence must be to the matrix to be reported as a hit, and do not include correction for multiple statistical testing, which results in a high number of over-predictions. Thus, knowledge of the parameters is essential, as is the combination of different prediction software to find overlapping hits. Moreover, analysis of the tissue and functional context can improve the detection of potentially functional TFBSs and reduce the number of candidates to test, which ultimately still need to be confirmed by experiments [104].
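How a PWM and a similarity threshold interact can be sketched in a few lines of Python. The example converts per-position nucleotide counts into log-odds scores and reports every window of a sequence whose relative score (0 = worst possible match, 1 = perfect match) reaches a cut-off. This is a minimal sketch under stated assumptions: the pseudocount, the uniform background of 0.25, the relative-score definition and the default cut-off of 0.85 are illustrative choices and not the parameters of TRANSFAC, JASPAR, PROMO or MatInspector; the toy motif is invented, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"


def pwm_from_counts(counts, pseudocount=1.0, background=0.25):
    """Convert per-position nucleotide counts into a log-odds matrix."""
    pwm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def scan(sequence, pwm, min_rel_score=0.85):
    """Return (start, site, relative_score) for windows passing the cut-off."""
    width = len(pwm)
    max_score = sum(max(col.values()) for col in pwm)
    min_score = sum(min(col.values()) for col in pwm)
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous symbols such as N
        score = sum(col[base] for col, base in zip(pwm, window))
        rel = (score - min_score) / (max_score - min_score)
        if rel >= min_rel_score:
            hits.append((start, window, round(rel, 3)))
    return hits


# Invented 4-bp motif (consensus TGAC) built from 8 hypothetical binding sites.
toy_counts = [{"T": 8}, {"G": 8}, {"A": 8}, {"C": 8}]
toy_pwm = pwm_from_counts(toy_counts)
print(scan("AATGACGG", toy_pwm))
```

Lowering `min_rel_score` quickly increases the number of reported sites, which illustrates why the threshold must be understood and why hits are best compared across several prediction tools.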

Automation
As discussed in the previous sections, analyzing ncRNAs is a complicated process involving many different resources, e.g., various databases and prediction tools with different specializations. Especially considering the goal of using circulating miRNA and lncRNA biomarkers in the clinic, solutions for a more automated sample analysis are important. Partial automation has been achieved [115] by combining multiple prediction algorithms. This already reduces the complexity of the analysis and improves its quality. A similar approach integrates databases, e.g., the online database RNAcentral [108], which provides unified access to various previously discussed databases not only through a unified API but also through a comprehensive versioning system that makes the data analysis reproducible through the use of stable identifiers. Nevertheless, complete solutions do not yet exist and have to be developed using custom scripts or flexible pipeline frameworks like Galaxy [169], Ruffus [170] or Snakemake [171]. Considering the rapidly evolving miRNA and lncRNA annotations and analysis tools, reproducibility and versioning of the analysis pipeline are important. Pipeline-specific [172] as well as generic approaches using Docker [173] have been proposed.
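The idea of combining multiple prediction algorithms can be sketched as a simple voting scheme: a predicted interaction is kept only if a minimum number of tools report it. The following Python sketch is an illustration of this general principle and does not reproduce the method of [115]; the function name, the tool names and the miRNA-target pairs are hypothetical.

```python
from collections import Counter


def consensus_predictions(predictions_by_tool, min_support=2):
    """Keep predicted pairs reported by at least `min_support` tools.

    `predictions_by_tool` maps a tool name to an iterable of
    (regulator, target) pairs. Returns ((regulator, target), votes) tuples,
    sorted by descending number of supporting tools.
    """
    support = Counter()
    for pairs in predictions_by_tool.values():
        support.update(set(pairs))  # each tool votes at most once per pair
    kept = [(pair, votes) for pair, votes in support.items()
            if votes >= min_support]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


# Toy predictions: tool names and interaction pairs are placeholders.
predictions = {
    "tool_a": [("miR-x", "GENE1"), ("miR-x", "GENE2")],
    "tool_b": [("miR-x", "GENE1"), ("miR-y", "GENE3")],
    "tool_c": [("miR-x", "GENE1"), ("miR-x", "GENE2"), ("miR-x", "GENE2")],
}

for pair, votes in consensus_predictions(predictions):
    print(pair, votes)
```

In a reproducible pipeline, such a step would typically be run together with a record of the tool and database versions that produced each input list.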

Conclusions and Future Directions
miRNAs and lncRNAs have a high potential as non-invasive biomarkers in lung cancer diagnosis (see Figure 3 for a summary of key points presented in this review). The first large-scale blood biomarker signature studies based on miRNAs and lncRNAs show promising clinical results for early NSCLC diagnosis. However, studies to understand the implications and potential of such circulating biomarker signatures in lung cancer have just begun. There are major challenges for the transfer to the clinic, e.g., accuracy, reliability and well-established validity. Thus, further investigations and validation studies are required to identify highly sensitive tests for efficient RNA-based lung cancer diagnostics. Regarding their functional characterization, experimental methods are technically challenging and laborious, but can be supported by integrated bioinformatics analysis for filtering and pre-selection of experimental candidates, considering the tissue and functional interaction context. lncRNA characterization is especially challenging, and functional understanding of their role is limited, so further studies of their diagnostic potential are required. Blood-based tests must be highly accurate and face substantial hurdles [15,22,174]; however, such biomarker signatures have enormous potential as a first-line diagnostic tool. Despite their proven potential to reduce false-positive rates of LDCT screenings as an additional diagnostic tool, further refinement and evaluation of blood signatures are necessary. In this context, the use of tissue (e.g., from surgical NSCLC biopsies and/or GEO) and blood samples, the examination of several cohorts (e.g., low- and high-risk lung cancer patients), expression analysis using different normalization strategies, as well as bioinformatics meta-analysis will be essential to improve lung cancer diagnostics and their future application in the clinic.
This will not only refine the blood biomarker signature but can also reduce the complexity of such blood tests, e.g., through the identification of a combination of two or three differentially expressed miRNAs and lncRNAs. All this will contribute to better accuracy and lower costs, resulting in a higher number of correctly and early diagnosed individuals as well as a reduction of unnecessary invasive diagnostics. The vision for the future is that blood tests based on circulating miRNAs and lncRNAs can be transferred to the clinic for better clinical management of lung cancer, resulting in better patient survival.