TrAnnoScope: A Modular Snakemake Pipeline for Full-Length Transcriptome Analysis and Functional Annotation
Abstract
:1. Introduction
2. Materials and Methods
2.1. Components of the TrAnnoScope Pipeline
2.2. Pipeline Implementation
2.2.1. Quality Control and Preprocessing of Illumina Reads
2.2.2. Preprocessing of PacBio Reads
2.2.3. Contamination Removal
2.2.4. Error Correction
2.2.5. Clustering and Classification
2.2.6. Quality Assessment
2.2.7. Annotation
2.2.8. Data Selection
2.2.9. Mapping to the Zebra Finch Reference Genome
3. Results and Discussion
4. Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Chen, J.W.; Shrestha, L.; Green, G.; Leier, A.; Marquez-Lago, T.T. The hitchhikers’ guide to RNA sequencing and functional analysis. Brief. Bioinform. 2023, 24, bbac529. [Google Scholar] [CrossRef] [PubMed]
- Deshpande, D.; Chhugani, K.; Chang, Y.; Karlsberg, A.; Loeffler, C.; Zhang, J.; Muszynska, A.; Munteanu, V.; Yang, H.; Rotman, J.; et al. RNA-seq data science: From raw data to effective interpretation. Front. Genet. 2023, 14, 997383. [Google Scholar] [CrossRef]
- Raghavan, V.; Kraft, L.; Mesny, F.; Rigerte, L. A simple guide to de novo transcriptome assembly and annotation. Brief. Bioinform. 2022, 23, bbab563. [Google Scholar] [CrossRef] [PubMed]
- Esteve-Codina, A. RNA-Seq Data Analysis, Applications and Challenges. In Comprehensive Analytical Chemistry; Jaumot, J., Bedia, C., Tauler, R., Eds.; Elsevier: Amsterdam, The Netherlands, 2018; Volume 82, pp. 71–106. [Google Scholar]
- Garg, R.; Jain, M. RNA-Seq for transcriptome analysis in non-model plants. Methods Mol. Biol. 2013, 1069, 43–58. [Google Scholar] [CrossRef] [PubMed]
- Dohm, J.C.; Peters, P.; Stralis-Pavese, N.; Himmelbauer, H. Benchmarking of long-read correction methods. NAR Genom. Bioinform. 2020, 2, lqaa037. [Google Scholar] [CrossRef] [PubMed]
- Amarasinghe, S.L.; Su, S.; Dong, X.; Zappia, L.; Ritchie, M.E.; Gouil, Q. Opportunities and challenges in long-read sequencing data analysis. Genome Biol. 2020, 21, 30. [Google Scholar] [CrossRef] [PubMed]
- Fu, S.; Wang, A.; Au, K.F. A comparative evaluation of hybrid error correction methods for error-prone long reads. Genome Biol. 2019, 20, 26. [Google Scholar] [CrossRef]
- Lataretu, M.; Holzer, M. RNAflow: An Effective and Simple RNA-Seq Differential Gene Expression Pipeline Using Nextflow. Genes 2020, 11, 1487. [Google Scholar] [CrossRef] [PubMed]
- Zhang, X.; Jonassen, I. RASflow: An RNA-Seq analysis workflow with Snakemake. BMC Bioinform. 2020, 21, 110. [Google Scholar] [CrossRef] [PubMed]
- Fallon, T.R.; Calounova, T.; Mokrejs, M.; Weng, J.K.; Pluskal, T. transXpress: A Snakemake pipeline for streamlined de novo transcriptome assembly and annotation. BMC Bioinform. 2023, 24, 133. [Google Scholar] [CrossRef] [PubMed]
- Rivera-Vicens, R.E.; Garcia-Escudero, C.A.; Conci, N.; Eitel, M.; Worheide, G. TransPi-a comprehensive TRanscriptome ANalysiS PIpeline for de novo transcriptome assembly. Mol. Ecol. Resour. 2022, 22, 2070–2086. [Google Scholar] [CrossRef] [PubMed]
- Ortiz, R.; Gera, P.; Rivera, C.; Santos, J.C. Pincho: A Modular Approach to High Quality De Novo Transcriptomics. Genes 2021, 12, 953. [Google Scholar] [CrossRef] [PubMed]
- FIT: Functional IsoTranscriptomics Analyses. Available online: https://tappas.org/ (accessed on 11 September 2024).
- Lienhard, M.; van den Beucken, T.; Timmermann, B.; Hochradel, M.; Borno, S.; Caiment, F.; Vingron, M.; Herwig, R. IsoTools: A flexible workflow for long-read transcriptome sequencing analysis. Bioinformatics 2023, 39, btad364. [Google Scholar] [CrossRef] [PubMed]
- Xia, Y.; Jin, Z.; Zhang, C.; Ouyang, L.; Dong, Y.; Li, J.; Guo, L.; Jing, B.; Shi, Y.; Miao, S.; et al. TAGET: A toolkit for analyzing full-length transcripts from long-read sequencing. Nat. Commun. 2023, 14, 5935. [Google Scholar] [CrossRef]
- Guizard, S.; Miedzinska, K.; Smith, J.; Smith, J.; Kuo, R.I.; Davey, M.; Archibald, A.; Watson, M. nf-core/isoseq: Simple gene and isoform annotation with PacBio Iso-Seq long-read sequencing. Bioinformatics 2023, 39, btad150. [Google Scholar] [CrossRef] [PubMed]
- Kasianova, A.M.; Penin, A.A.; Schelkunov, M.I.; Kasianov, A.S.; Logacheva, M.D.; Klepikova, A.V. Trans2express—De novo transcriptome assembly pipeline optimized for gene expression analysis. Plant Methods 2024, 20, 128. [Google Scholar] [CrossRef] [PubMed]
- Zhang, W.; Petegrosso, R.; Chang, J.W.; Sun, J.; Yong, J.; Chien, J.; Kuang, R. A large-scale comparative study of isoform expressions measured on four platforms. BMC Genom. 2020, 21, 272. [Google Scholar] [CrossRef]
- Molder, F.; Jablonski, K.P.; Letcher, B.; Hall, M.B.; Tomkins-Tinch, C.H.; Sochat, V.; Forster, J.; Lee, S.; Twardziok, S.O.; Kanitz, A.; et al. Sustainable data analysis with Snakemake. F1000Research 2021, 10, 33. [Google Scholar] [CrossRef]
- Babraham Bioinformatics. FastQC. Available online: https://www.bioinformatics.babraham.ac.uk/projects/fastqc (accessed on 8 August 2024).
- Ewels, P.; Magnusson, M.; Lundin, S.; Kaller, M. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics 2016, 32, 3047–3048. [Google Scholar] [CrossRef]
- Wingett, S.W.; Andrews, S. FastQ Screen: A tool for multi-genome mapping and quality control. F1000Research 2018, 7, 1338. [Google Scholar] [CrossRef]
- Chen, S.; Zhou, Y.; Chen, Y.; Gu, J. fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 2018, 34, i884–i890. [Google Scholar] [CrossRef] [PubMed]
- Pacific Biosciences. Iso-Seq—Scalable De Novo Isoform Discovery from Pacbio HiFi Reads. Available online: https://isoseq.how/ (accessed on 8 August 2024).
- Mortezaei, Z. Computational methods for analyzing RNA-sequencing contaminated samples and its impact on cancer genome studies. Inform. Med. Unlocked 2022, 32, 101054. [Google Scholar] [CrossRef]
- Gondane, A.; Itkonen, H.M. Revealing the History and Mystery of RNA-Seq. Curr. Issues Mol. Biol. 2023, 45, 1860–1874. [Google Scholar] [CrossRef] [PubMed]
- Laetsch, D.R.; Blaxter, M.L. BlobTools: Interrogation of genome assemblies. F1000Research 2017, 6, 1287. [Google Scholar] [CrossRef]
- Langmead, B.; Salzberg, S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 2012, 9, 357–359. [Google Scholar] [CrossRef] [PubMed]
- Manni, M.; Berkeley, M.R.; Seppey, M.; Zdobnov, E.M. BUSCO: Assessing Genomic Data Quality and Beyond. Curr. Protoc. 2021, 1, e323. [Google Scholar] [CrossRef] [PubMed]
- Camacho, C.; Coulouris, G.; Avagyan, V.; Ma, N.; Papadopoulos, J.; Bealer, K.; Madden, T.L. BLAST+: Architecture and applications. BMC Bioinform. 2009, 10, 421. [Google Scholar] [CrossRef] [PubMed]
- Wang, J.R.; Holt, J.; McMillan, L.; Jones, C.D. FMLRC: Hybrid long read error correction using an FM-index. BMC Bioinform. 2018, 19, 50. [Google Scholar] [CrossRef] [PubMed]
- Li, W.; Godzik, A. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006, 22, 1658–1659. [Google Scholar] [CrossRef] [PubMed]
- Gilbert, D. EvidentialGene: tr2aacds, mRNA Transcript Assembly Software. Available online: http://arthropods.eugenes.org/EvidentialGene/about/EvidentialGene_trassembly_pipe.html (accessed on 29 October 2024).
- De Coster, W.; D’Hert, S.; Schultz, D.T.; Cruts, M.; Van Broeckhoven, C. NanoPack: Visualizing and processing long-read sequencing data. Bioinformatics 2018, 34, 2666–2669. [Google Scholar] [CrossRef] [PubMed]
- Trinity. Counting Full Length Trinity Transcripts. Available online: https://github.com/trinityrnaseq/trinityrnaseq/wiki (accessed on 8 August 2024).
- Trinotate: Transcriptome Functional Annotation and Analysis. Available online: https://github.com/Trinotate/Trinotate/wiki (accessed on 21 October 2024).
- Mistry, J.; Chuguransky, S.; Williams, L.; Qureshi, M.; Salazar, G.A.; Sonnhammer, E.L.L.; Tosatto, S.C.E.; Paladin, L.; Raj, S.; Richardson, L.J.; et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 2021, 49, D412–D419. [Google Scholar] [CrossRef]
- UniProt, C. UniProt: The universal protein knowledgebase in 2021. Nucleic Acids Res. 2021, 49, D480–D489. [Google Scholar] [CrossRef]
- Teufel, F.; Almagro Armenteros, J.J.; Johansen, A.R.; Gislason, M.H.; Pihl, S.I.; Tsirigos, K.D.; Winther, O.; Brunak, S.; von Heijne, G.; Nielsen, H. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat. Biotechnol. 2022, 40, 1023–1025. [Google Scholar] [CrossRef] [PubMed]
- Krogh, A.; Larsson, B.; von Heijne, G.; Sonnhammer, E.L. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 2001, 305, 567–580. [Google Scholar] [CrossRef] [PubMed]
- Huerta-Cepas, J.; Szklarczyk, D.; Heller, D.; Hernandez-Plaza, A.; Forslund, S.K.; Cook, H.; Mende, D.R.; Letunic, I.; Rattei, T.; Jensen, L.J.; et al. eggNOG 5.0: A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 2019, 47, D309–D314. [Google Scholar] [CrossRef]
- Nawrocki, E.P.; Eddy, S.R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics 2013, 29, 2933–2935. [Google Scholar] [CrossRef]
- Gene Ontology, C.; Aleksander, S.A.; Balhoff, J.; Carbon, S.; Cherry, J.M.; Drabkin, H.J.; Ebert, D.; Feuermann, M.; Gaudet, P.; Harris, N.L.; et al. The Gene Ontology knowledgebase in 2023. Genetics 2023, 224, iyad031. [Google Scholar] [CrossRef] [PubMed]
- Kanehisa, M.; Goto, S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000, 28, 27–30. [Google Scholar] [CrossRef] [PubMed]
- Tatusov, R.L.; Galperin, M.Y.; Natale, D.A.; Koonin, E.V. The COG database: A tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 2000, 28, 33–36. [Google Scholar] [CrossRef] [PubMed]
- Mello, C.V. The zebra finch, Taeniopygia guttata: An avian model for investigating the neurobiological basis of vocal learning. Cold Spring Harb. Protoc. 2014, 2014, 1237–1242. [Google Scholar] [CrossRef] [PubMed]
- Hauber, M.E.; Louder, M.I.; Griffith, S.C. The Natural History of Model Organisms: Neurogenomic insights into the behavioral and vocal development of the zebra finch. eLife 2021, 10, e61849. [Google Scholar] [CrossRef] [PubMed]
- Leinonen, R.; Sugawara, H.; Shumway, M. International Nucleotide Sequence Database Collaboration. Seq. Read Archive. Nucleic Acids Res. 2011, 39, D19–D21. [Google Scholar] [CrossRef] [PubMed]
- Quast, C.; Pruesse, E.; Yilmaz, P.; Gerken, J.; Schweer, T.; Yarza, P.; Peplies, J.; Glockner, F.O. The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools. Nucleic Acids Res. 2013, 41, D590–D596. [Google Scholar] [CrossRef] [PubMed]
- Wang, Y.; Zhao, Y.; Bollas, A.; Wang, Y.; Au, K.F. Nanopore sequencing technology, bioinformatics and applications. Nat. Biotechnol. 2021, 39, 1348–1365. [Google Scholar] [CrossRef]
- Ferrarini, M.; Moretto, M.; Ward, J.A.; Surbanovski, N.; Stevanovic, V.; Giongo, L.; Viola, R.; Cavalieri, D.; Velasco, R.; Cestaro, A.; et al. An evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome. BMC Genom. 2013, 14, 670. [Google Scholar] [CrossRef] [PubMed]
- Tvedte, E.S.; Gasser, M.; Sparklin, B.C.; Michalski, J.; Hjelmen, C.E.; Johnston, J.S.; Zhao, X.; Bromley, R.; Tallon, L.J.; Sadzewicz, L.; et al. Comparison of long-read sequencing technologies in interrogating bacteria and fly genomes. G3 2021, 11, jkab083. [Google Scholar] [CrossRef] [PubMed]
- Sacristan-Horcajada, E.; Gonzalez-de la Fuente, S.; Peiro-Pastor, R.; Carrasco-Ramiro, F.; Amils, R.; Requena, J.M.; Berenguer, J.; Aguado, B. ARAMIS: From systematic errors of NGS long reads to accurate assemblies. Brief. Bioinform. 2021, 22, bbab170. [Google Scholar] [CrossRef]
- Pourmohammadi, R.; Abouei, J.; Anpalagan, A. Error analysis of the PacBio sequencing CCS reads. Int. J. Biostat. 2023, 19, 439–453. [Google Scholar] [CrossRef] [PubMed]
- Waterhouse, R.M.; Seppey, M.; Simao, F.A.; Manni, M.; Ioannidis, P.; Klioutchnikov, G.; Kriventseva, E.V.; Zdobnov, E.M. BUSCO Applications from Quality Assessments to Gene Prediction and Phylogenomics. Mol. Biol. Evol. 2018, 35, 543–548. [Google Scholar] [CrossRef] [PubMed]
Step/Sample | Brain_2 | Brain_5 | Ovary_2 | Testis_5 |
---|---|---|---|---|
Raw | 32,217,548 | 29,323,820 | 33,201,568 | 28,474,620 |
FastQScreen | 30,396,892 | 28,010,526 | 31,965,452 | 27,981,280 |
fastp | 29,777,552 | 27,446,223 | 31,089,662 | 27,519,646 |
Step/Sample | Brain_2 | Brain_5 | Ovary_2 | Testis_5 |
---|---|---|---|---|
Raw | 444,968 | 717,758 | 483,419 | 729,821 |
CCS | 124,615 | 56,773 | 198,608 | 42,832 |
FL | 93,967 | 15,245 | 172,050 | 25,332 |
FLNC | 89,017 | 15,103 | 168,336 | 25,095 |
Clustered | 47,129 | 10,229 | 80,405 | 18,101 |
Contamination | 46,769 | 9990 | 80,166 | 17,993 |
Error Correction | 46,769 | 9990 | 80,166 | 17,993 |
CD-Hit-Est | 23,636 | 6175 | 37,838 | 11,284 |
Step/Sample | Transcriptome |
---|---|
Total isoforms | 39,984 |
Mean, median, N50 transcripts | 3097.7|2794|4108 |
Total proteins | 39,984 |
Mean, median, N50 proteins | 398.2|271|597 |
Full-length proteins (EvidentialGene) | 86.7% |
Transcriptome completeness | C: 79.1% [S: 38.2%, D: 40.9%], F: 4.7%, M: 16.2%, n: 3354 |
Database | Hits (%) |
---|---|
UniProt/SwissProt Blastx | 28,274 (70.7%) |
UniProt/SwissProt Blastp | 25,051 (62.7%) |
Pfam Domains | 23,689 (59.2%) |
GO | 28,399 (71.0%) |
KOG | 26,409 (66.0%) |
KEGG | 26,097 (65.3%) |
Transmembrane Domains (TmHMM) | 7155 (17.9%) |
Signal Peptides (SignalP) | 2305 (5.8%) |
Non-coding RNAs (Infernal) | 169 (0.4%) |
Non-redundant protein DB (NR Blastx) | 33,049 (82.7%) |
Non-redundant protein DB (NR Blastp) | 28,303 (70.8%) |
Nucleotide DB (NT Blastn) | 39,827 (99.6%) |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Pektas, A.; Panitz, F.; Thomsen, B. TrAnnoScope: A Modular Snakemake Pipeline for Full-Length Transcriptome Analysis and Functional Annotation. Genes 2024, 15, 1547. https://doi.org/10.3390/genes15121547
Pektas A, Panitz F, Thomsen B. TrAnnoScope: A Modular Snakemake Pipeline for Full-Length Transcriptome Analysis and Functional Annotation. Genes. 2024; 15(12):1547. https://doi.org/10.3390/genes15121547
Chicago/Turabian StylePektas, Aysevil, Frank Panitz, and Bo Thomsen. 2024. "TrAnnoScope: A Modular Snakemake Pipeline for Full-Length Transcriptome Analysis and Functional Annotation" Genes 15, no. 12: 1547. https://doi.org/10.3390/genes15121547
APA StylePektas, A., Panitz, F., & Thomsen, B. (2024). TrAnnoScope: A Modular Snakemake Pipeline for Full-Length Transcriptome Analysis and Functional Annotation. Genes, 15(12), 1547. https://doi.org/10.3390/genes15121547