MTBseq-nf: Enabling Scalable Tuberculosis Genomics “Big Data” Analysis Through a User-Friendly Nextflow Wrapper for MTBseq Pipeline
Abstract
1. Introduction
2. Materials and Methods
2.1. The Design of MTBseq (Standard) Pipeline
2.2. Implementation of MTBseq-nf Wrapper
2.3. Validation Infrastructure and Dataset
2.4. Experimental Set Up for Evaluation of Scalability and Reproducibility
3. Results
3.1. Thematic Improvements in MTBseq-nf
3.2. Reproducibility Analysis of Intra-Modal Comparison
3.3. Reproducibility Analysis of Inter-Modal Comparison
3.4. Scalability Analysis of MTBseq-nf (Default) and MTBseq-nf (Parallel)
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AWS | Amazon Web Services |
| BWA | Burrows-Wheeler Aligner |
| CNPq | Conselho Nacional de Desenvolvimento Científico e Tecnológico (Brazilian National Council for Scientific and Technological Development) |
| CPU | Central Processing Unit |
| CRediT | Contributor Role Taxonomy |
| DNA | Deoxyribonucleic Acid |
| Docker | Containerization platform |
| ENA | European Nucleotide Archive |
| ERS | ENA Sample Accession Prefix |
| FASTQ | File format for sequencing reads; originally stands for “FASTA+Quality” |
| GATK | Genome Analysis Toolkit |
| GB | Gigabyte |
| GUI | Graphical User Interface |
| HPC | High-Performance Computing |
| IBM | International Business Machines |
| IDs | Identifications |
| IQ-TREE | Program for phylogenetic analysis |
| LMICs | Low- and Middle-Income Countries |
| MTBC | Mycobacterium tuberculosis Complex |
| MTBseq | Mycobacterium tuberculosis Sequencing Pipeline |
| MTBseq-nf | MTBseq Nextflow Wrapper Pipeline |
| MultiQC | Multi-Tool Quality Control Summary Software |
| M. tuberculosis | Mycobacterium tuberculosis |
| N/NE | Norte/Nordeste (North/Northeast) |
| NGS | Next-Generation Sequencing |
| nf-core | Community-curated Nextflow pipelines |
| NRF | National Research Foundation (South Africa) |
| Perl5 | Perl Programming Language, version 5 |
| PICARD | Picard Toolkit |
| PRJEB7727 | ENA Project Accession Number |
| RAM | Random Access Memory |
| R | R Statistical Language |
| SAMTOOLS | Sequence Alignment/Map Tools |
| SNP | Single Nucleotide Polymorphism |
| TB | Tuberculosis |
| TBamend | MTBseq pipeline step |
| TBbwa | MTBseq pipeline step |
| TBfull | MTBseq pipeline step |
| TBgroups | MTBseq pipeline step |
| TBjoin | MTBseq pipeline step |
| TBstrains | MTBseq pipeline step |
| TBstats | MTBseq pipeline step |
| TBvariants | MTBseq pipeline step |
| TSV | Tab-Separated Values |
| WGS | Whole-Genome Sequencing |

| Theme | Feature |
|---|---|
| User-friendliness | Ease of download |
| User-friendliness | Explicit samplesheet |
| User-friendliness | Graphical user interface |
| User-friendliness | MultiQC Summary report |
| User-friendliness | CSV and TSV format cleanup |
| User-friendliness | Remote monitoring |
| User-friendliness | Manual steps |
| User-friendliness | Flexible output location |
| Maintainability | Extensibility |
| Maintainability | Module testing |
| Maintainability | Test dataset |
| Scalability | Parallel execution |
| Scalability | HPC compatibility |
| Scalability | Resource allocation |
| Scalability | Dynamic retries |
| Scalability | Execution cache |
| Scalability | Reduced data footprint |
| Scalability | Reduced cloud computing costs |
| Reproducibility | Declarative parameters file |
| Reproducibility | Portability |
| Reproducibility | Save intermediate files |

| Principal Output | MTBseq (Standard) | MTBseq-nf (Default Mode) | MTBseq-nf (Parallel Mode) |
|---|---|---|---|
| Classification | No differences | No differences | No differences |
| SNP distance matrix | No differences | No differences | No differences |
| Phylogenetic tree | No differences | No differences | No differences |
| Cluster groups | Consistent agglomeration | Consistent agglomeration | Consistent agglomeration |
| Statistics | Minor differences | Minor differences | No differences |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Sharma, A.; Marcon, D.J.; Loubser, J.; Lima, K.V.B.; van der Spuy, G.; Conceição, E.C. MTBseq-nf: Enabling Scalable Tuberculosis Genomics “Big Data” Analysis Through a User-Friendly Nextflow Wrapper for MTBseq Pipeline. Microorganisms 2025, 13, 2685. https://doi.org/10.3390/microorganisms13122685

