PredLnc-GFStack: A Global Sequence Feature Based on a Stacked Ensemble Learning Method for Predicting lncRNAs from Transcripts
Abstract
1. Introduction
2. Datasets and Method
2.1. Datasets
2.2. Features Extraction
2.3. Feature Selection by Genetic Algorithm and Random Forest (GA-RF)
2.4. Stacked Ensemble Learning in PredLnc-GFStack
3. Results and Discussion
3.1. Performance Evaluation
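As a minimal sketch, assuming the metrics reported in this section (ACC, SN, SP, PRE, and F1) follow their standard confusion-matrix definitions, they can be computed from the four outcome counts as shown here. The `Confusion` class and `metrics` function are hypothetical names chosen for illustration and are not part of PredLnc-GFStack; which class (lncRNA or coding) is treated as positive is also an assumption left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts from a binary classifier (hypothetical helper type)."""
    tp: int  # positives predicted as positive
    fn: int  # positives predicted as negative
    tn: int  # negatives predicted as negative
    fp: int  # negatives predicted as positive


def metrics(c: Confusion) -> dict:
    """Standard definitions; assumes every denominator is non-zero."""
    sn = c.tp / (c.tp + c.fn)                         # sensitivity (recall)
    sp = c.tn / (c.tn + c.fp)                         # specificity
    acc = (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)  # accuracy
    pre = c.tp / (c.tp + c.fp)                        # precision
    f1 = 2 * pre * sn / (pre + sn)                    # harmonic mean of PRE and SN
    return {"ACC": acc, "SN": sn, "SP": sp, "PRE": pre, "F1": f1}


# Illustrative counts only, not taken from the paper.
example = metrics(Confusion(tp=90, fn=10, tn=80, fp=20))
```

AUC is not covered here because it depends on the ranked prediction scores rather than on a single confusion matrix.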
3.2. Evaluation of the Optimal Feature Subsets
3.3. Evaluation of PredLnc-GFStack on Different Datasets
3.4. Comparison with Other Methods
4. Conclusions
Supplementary Materials
Author Contributions
Funding
Conflicts of Interest
Data Source | Name | Coding RNAs | ncRNAs |
---|---|---|---|
GENCODE | Human-Main | 35760 | 20299 |
GENCODE | Human-Independent | 1500 | 1500 |
GENCODE | Mouse-Main | 23987 | 11746 |
GENCODE | Mouse-Independent | 1500 | 1500 |
Preprocessed CPPred | Human-Testing | 8557 | 8241 |
Preprocessed CPPred | Mouse-Testing | 31102 | 19930 |
Preprocessed CPPred | Zebrafish-Testing | 15594 | 10662 |
Preprocessed CPPred | Fruit-fly-Testing | 17400 | 4098 |
Preprocessed CPPred | S.cerevisiae-Testing | 6713 | 413 |
Preprocessed CPPred | Integrated-Testing | 13903 | 13903 |

Feature Type | Features (Dimension) |
---|---|
Codon-related features | stop codon count (1), stop codon frequency (1), stop codon frame score (1), stop codon frequency frame score (1), nucleotide position frequencies (4), Fickett TESTCODE score (1) |
Open reading frame (ORF)-related features | first ORF length (1), longest ORF length (1), ORF coverage (2), ORF integrity (1), ORF frame score (1), entropy density profiles (EDP) of ORF (16) |
GC-related features | GC (1), GC1 (1), GC2 (1), GC3 (1), GC frame score (1), UTR GC content (2) |
Coding sequence-related features | coding sequence (CDS) length (1), CDS percentage (1), coding potential of the transcripts (CDS score) (1) |
Transcript-related features | transcript length (1), k-mer (168), CTD (20), hexamer score (1), signal-to-noise ratio (SNR) (1), untranslated region (UTR) coverage (2), EDP (20) |
Structure-related features | molecular weight (Mw) (1), isoelectric point (pI) (1), pI/Mw (1), pI/Mw frame score (1), Gravy (1), instability index (1) |

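To make the GC-related entries concrete, the following is a minimal sketch of how GC, GC1, GC2, and GC3 are commonly computed: overall GC fraction, and GC fraction at the first, second, and third codon positions. It assumes codon positions are counted in frame 0 from the start of whatever sequence is passed in; whether the authors compute these over the full transcript or a predicted ORF is not stated here. The function names `gc_fraction` and `gc_features` are hypothetical.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G or C bases in seq; 0.0 for an empty sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def gc_features(seq: str) -> dict:
    """GC content overall and at codon positions 1, 2 and 3 (frame 0 assumed)."""
    return {
        "GC": gc_fraction(seq),
        "GC1": gc_fraction(seq[0::3]),  # first base of each codon
        "GC2": gc_fraction(seq[1::3]),  # second base of each codon
        "GC3": gc_fraction(seq[2::3]),  # third base of each codon
    }


# Toy sequence for illustration: ATG GCG CCC TAA
features = gc_features("ATGGCGCCCTAA")
```

The frame-score and UTR variants listed in the table would need the ORF boundaries as an extra input, so they are left out of this sketch.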
Optimal Feature Subset No. | Human AUC | Human Number of Features | Mouse AUC | Mouse Number of Features |
---|---|---|---|---|
1 | 0.94979 | 134 | 0.96382 | 118 |
2 | 0.94946 | 137 | 0.96350 | 125 |
3 | 0.94940 | 131 | 0.96343 | 127 |
4 | 0.94934 | 136 | 0.96334 | 123 |
5 | 0.94929 | 138 | 0.96327 | 114 |
6 | 0.94929 | 134 | 0.96324 | 123 |
7 | 0.94923 | 129 | 0.96323 | 115 |
8 | 0.94916 | 127 | 0.96323 | 121 |
9 | 0.94913 | 137 | 0.96322 | 122 |
10 | 0.94910 | 128 | 0.96322 | 119 |

Dataset | AUC | ACC | SN | SP | PRE | F1 |
---|---|---|---|---|---|---|
Human | 0.956 | 0.895 | 0.884 | 0.901 | 0.835 | 0.859 |
Mouse | 0.969 | 0.914 | 0.875 | 0.933 | 0.865 | 0.870 |
Training Dataset | Testing Dataset | AUC | ACC | SN | SP | PRE | F1 |
---|---|---|---|---|---|---|---|
Human-Main | Human-Testing | 0.995 | 0.968 | 0.962 | 0.974 | 0.973 | 0.967 |
Human-Main | Mouse-Testing | 0.987 | 0.941 | 0.879 | 0.981 | 0.968 | 0.921 |
Human-Main | Integrated-Testing | 0.985 | 0.907 | 0.831 | 0.982 | 0.979 | 0.899 |
Human-Main | Zebrafish-Testing | 0.971 | 0.901 | 0.772 | 0.989 | 0.980 | 0.863 |
Human-Main | Fruit-fly-Testing | 0.992 | 0.940 | 0.714 | 0.993 | 0.962 | 0.819 |
Human-Main | S.cerevisiae-Testing | 0.983 | 0.960 | 0.828 | 0.969 | 0.621 | 0.710 |
Mouse-Main | Human-Testing | 0.977 | 0.887 | 0.807 | 0.964 | 0.955 | 0.875 |
Mouse-Main | Mouse-Testing | 0.995 | 0.944 | 0.869 | 0.992 | 0.985 | 0.924 |
Mouse-Main | Integrated-Testing | 0.984 | 0.871 | 0.757 | 0.985 | 0.981 | 0.855 |
Mouse-Main | Zebrafish-Testing | 0.971 | 0.843 | 0.626 | 0.991 | 0.979 | 0.764 |
Mouse-Main | Fruit-fly-Testing | 0.990 | 0.917 | 0.593 | 0.994 | 0.957 | 0.733 |
Mouse-Main | S.cerevisiae-Testing | 0.964 | 0.942 | 0.382 | 0.976 | 0.500 | 0.433 |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Liu, S.; Zhao, X.; Zhang, G.; Li, W.; Liu, F.; Liu, S.; Zhang, W. PredLnc-GFStack: A Global Sequence Feature Based on a Stacked Ensemble Learning Method for Predicting lncRNAs from Transcripts. Genes 2019, 10, 672. https://doi.org/10.3390/genes10090672