Machine-Learning-Derived, Mechanistically Informed Transcriptomic Signature to Diagnose Active Tuberculosis and Guide Host-Directed Therapy
Abstract
1. Introduction
1.1. Background
1.2. Research Motivation
- Multi-group analysis: One-way ANOVA (p < 0.001) followed by Tukey’s HSD post hoc testing is used to identify biomarkers that are specific to active TB rather than to latent infection or healthy controls, which reduces the false-discovery risk of repeated pairwise analyses.
- Machine-learning (ML)-based feature selection: Boruta-XGBoost and LASSO regularization are combined to select and rank robust biomarkers in high-dimensional data, addressing the limitations of univariate statistical methods.
- Cross-cohort validation: External validation in GSE19444 supports generalizability and addresses the reproducibility problems of earlier single-cohort studies.
1.3. Study Objectives and Methodology
- Multi-group differential expression model: Gene expression is compared across three clinical states using ANOVA (p < 0.001) and Tukey’s HSD post hoc testing, which avoids the limitations of binary comparisons.
- ML pipeline optimized for biomarker selection: Boruta-XGBoost selects the optimal features (biomarkers) and LASSO regularization ranks them, with discovery in GSE19439 and validation in GSE19444.
- Functional annotation: The selected biomarkers are linked to TB immunopathology and therapeutic targets through functional annotation.
1.4. Main Contributions
- A curated, mechanistically linked panel: Discovery of a proposed minimal four-gene signature (TAP2, SORT1, WARS, and ANKRD22) in which each biomarker is localized to a core, therapeutically relevant host pathway—antigen presentation, immunometabolism, interferon response, and inflammasome activation—generating a diagnostic that is also a mechanistic map.
- Strong cross-cohort validation: The diagnostic performance of these biomarkers was tested in GSE19439 data, and the AUC was 0.9911 (95% CI: 0.983–0.997). The expression dynamics were also validated by the GSE19444 cohort, which showed a significant difference in the expression between the clinical states (ANOVA, p < 0.001).
- Fundamental functionality: We have developed a deployable ML pipeline for TB staging to be used in resource-constrained environments to promote access to high-quality diagnostics.
- Mechanistic insights: Functional annotation links the biomarkers to possible therapeutic targets and points to new immune–metabolic interactions in TB progression.
- Improved diagnostic performance: Our four-gene signature meets the WHO Target Product Profile criteria for non-sputum triage tests, with a sensitivity of 90% (95% CI: 85.5–93.8%) and a specificity of 89.47% (95% CI: 84.2–93.5%). This performance suggests that the signature could be developed into a rapid non-sputum triage tool. A validated version of such a tool could help fill critical diagnostic gaps in high-burden, resource-limited clinical settings.
2. Literature Review: Development of Transcriptomic Biomarkers to Diagnose Tuberculosis
| Study | Statistical Model | Indication | Number of Genes | Sensitivity (%) | Specificity (%) | AUC |
|---|---|---|---|---|---|---|
| Berry et al., 2010 [19] | K-nearest neighbors | ATB vs. LTBI and HCs | 393 | 61.67 | 93.75 | N/A |
| | | ATB vs. ODs | 86 | 92 | 83 | N/A |
| Kaforou et al., 2013 [20] | Difference of means | ATB vs. LTBI | 27 | 95 | 90 | 0.98 |
| | | ATB vs. ODs | 44 | 93 | 88 | 0.95 |
| Anderson et al., 2014 [21] | Difference of sums | ATB vs. LTBI | 42 | 96 | 91 | 0.984 |
| | | ATB vs. ODs | 51 | 74 | 78 | 0.862 |
| Laux da Costa et al., 2015 [4] | Random Forest | ATB vs. ODs | 3 | 93 | 95 | 0.955 |
| Lee et al., 2016 [23] | Naive Bayes | ATB vs. LTBI | 3 | 97.9 | 98 | 0.979 |
| Maertzdorf et al., 2016 [24] | Random Forest | ATB vs. LTBI and HCs | 4 | 88 | 75 | 0.98 |
| Sweeney et al., 2016 [22] | Difference of geometric means | ATB vs. LTBI and ODs and HCs | 3 | 82 | 79 | 0.88 |
| Sambarey et al., 2017 [8] | Linear discriminant analysis | ATB vs. LTBI and HCs and ODs | 10 | 89.67 | 81.0 | N/A |
| Leong et al., 2018 [26] | Rigid logistic regression | ATB vs. LTBI | 24 | 93.07 | 94.5 | 0.9840 |
| Bayaa et al., 2021 [27] | LASSO | ATB vs. HCs | 6 | 90.9 | 87.8 | 0.94 |
| | | ATB vs. LTBI | 6 | 90.9 | 88.5 | 0.93 |
| Wang et al., 2019 [18] | Decision Tree | ATB vs. LTBI and HCs | 3 | 82.4 | 92.4 | 0.806 |
| Gliddon et al., 2021 [10] | Disease Risk Score Method | TB/LTBI | 3 | 95 | 85 | 0.973 |
| | | TB/OD | 3 | 95 | 85 | 0.938 |
| Perumal et al., 2021 [32] | Simple arithmetic algorithms | HCs vs. ATB | 2 | 90.48 | 66.67 | 0.9048 |
| | | HCs/LTBI vs. ATB | 2 | 90.91 | 71.43 | 0.8615 |
| | | HCs vs. LTBI | 2 | 91.67 | 23.81 | 0.5357 |
| | | LTBI vs. ATB | 2 | 90.48 | 71.43 | 0.8367 |
| Natarajan et al., 2022 [25] | N/A | ATB vs. LTBI | 7 | 80–100 | 80–95 | 0.84–1.00 |
| Sutherland et al., 2022 [28] | Mann–Whitney U tests | TB vs. ORD | 3 | 87 | 94 | 0.88 |
| Luo et al., 2022 [29] | Cforest | ATB vs. LTBI | 8 | 93.39 | 91.18 | 0.978 |
| Xie et al., 2024 [30] | LASSO/Random Forest | ATB vs. LTBI | 2 | -- | -- | 0.994 |
| | | ATB vs. HCs | 2 | -- | -- | 0.782 |
| | | LTBI vs. HCs | 2 | -- | -- | 0.914 |
| Ren et al., 2025 [31] | Support Vector Machine | ATB vs. LTBI | 4 | -- | -- | 0.86 |
| | | ATB vs. HCs | 4 | -- | -- | 0.99 |
| This study (2025) | Voting Classifier | ATB vs. LTBI and HCs | 4 | 90 | 89.47 | 0.9911 |
3. Materials and Methods
3.1. Dataset for Gene Biomarker Discovery and Validation
3.2. Data Preprocessing
3.2.1. Normalization and Transformation
3.2.2. Scaling Method Comparison and Justification
3.2.3. Data Splitting
3.3. Differential Expression Analysis
3.4. Multi-Group Comparisons and Post Hoc Testing
- Statistical robustness: Control of false-positive discoveries (Type I error) when dealing with high-dimensional data.
- Biological specificity: The omnibus ANOVA shows that a global difference exists, and the post hoc Tukey’s HSD test shows exactly which clinical states differ from each other, so the observed expression pattern can be attributed to a specific stage.
- Sensitivity analysis of ANOVA threshold: To assess the robustness of our stringent cutoff, we repeated the analysis with less stringent thresholds (p < 0.01 and p < 0.05), in line with recommended practice for sensitivity analysis [44]. The findings (Supplementary Table S1) showed that (1) the four-gene signature was always among the 20 most significant genes at all thresholds, (2) the machine learning pipeline always selected the same four genes, and (3) model performance remained high (AUC > 0.985) at all thresholds. This supports the robustness of our biomarker selection strategy. The threshold of p < 0.001 gives a good balance of statistical rigor and clinical applicability for initial biomarker discovery.
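The two-step test described above (omnibus ANOVA, then Tukey’s HSD only if the omnibus test passes) can be sketched for a single gene as follows. This is a minimal illustration, not the authors’ code: the expression values are synthetic, the helper name `anova_with_tukey` is hypothetical, and SciPy (version 1.8 or later, for `scipy.stats.tukey_hsd`) is assumed to be the tool of choice.

```python
import numpy as np
from scipy import stats

# Synthetic log2 expression values for ONE gene (illustrative only).
base = np.linspace(-1.0, 1.0, 15)
control = base.copy()
latent = base.copy()      # same values as control: no latent-vs-control shift
active = base + 2.0       # shifted upward in active TB


def anova_with_tukey(groups, alpha=0.001):
    """One-way ANOVA; if significant at alpha, Tukey's HSD for every pair."""
    names = list(groups)
    samples = [np.asarray(groups[name], dtype=float) for name in names]
    f_stat, p_omnibus = stats.f_oneway(*samples)
    result = {"F": float(f_stat), "p": float(p_omnibus), "pairs": {}}
    if p_omnibus < alpha:
        tukey = stats.tukey_hsd(*samples)
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                diff = samples[i].mean() - samples[j].mean()
                # (mean difference, Tukey-adjusted p-value) for this pair
                result["pairs"][(names[i], names[j])] = (
                    float(diff),
                    float(tukey.pvalue[i, j]),
                )
    return result


res = anova_with_tukey({"active": active, "latent": latent, "control": control})
print(f"F = {res['F']:.2f}, omnibus p = {res['p']:.2e}")
for pair, (diff, p_adj) in res["pairs"].items():
    print(pair, round(diff, 3), f"{p_adj:.4f}")
```

A gene counts as specific to active TB in this sketch when both pairs involving `active` are significant and the latent-vs-control pair is not. Rerunning with `alpha=0.01` or `alpha=0.05` mirrors the threshold sensitivity analysis.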
3.5. Multiple Testing Correction Strategy
3.6. Group-Specific DEG Categorization of Validated Genes
- Active-specific DEGs: Genes upregulated in active TB vs. both control and latent TB (, log2FC > 1).
- Latent-specific DEGs: Genes upregulated in latent TB vs. control (, log2FC > 1) and downregulated in active vs. latent TB (, log2FC < –1).
- Control-specific DEGs: Genes downregulated in both active vs. control and latent vs. control (, |log2FC| > 1).
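The three categories above amount to a small decision rule over the pairwise contrasts. The sketch below encodes that rule under stated assumptions: the significance criterion is not legible in the list above, so `alpha` is a placeholder parameter (0.05 is an assumed default, not a value taken from the study), and the contrast keys, function name, and gene values are hypothetical.

```python
def categorize_gene(lfc, padj, alpha=0.05, lfc_cut=1.0):
    """Assign a gene to a group-specific DEG category.

    lfc and padj are dicts keyed by contrast name:
    'active_vs_control', 'active_vs_latent', 'latent_vs_control'.
    alpha is an ASSUMED significance cutoff for the adjusted p-value.
    """
    def up(contrast):
        return padj[contrast] < alpha and lfc[contrast] > lfc_cut

    def down(contrast):
        return padj[contrast] < alpha and lfc[contrast] < -lfc_cut

    if up("active_vs_control") and up("active_vs_latent"):
        return "active-specific"
    if up("latent_vs_control") and down("active_vs_latent"):
        return "latent-specific"
    if down("active_vs_control") and down("latent_vs_control"):
        return "control-specific"
    return "unassigned"


# Hypothetical genes with made-up contrast statistics.
gene_a = categorize_gene(
    lfc={"active_vs_control": 1.8, "active_vs_latent": 1.6, "latent_vs_control": 0.1},
    padj={"active_vs_control": 1e-6, "active_vs_latent": 1e-5, "latent_vs_control": 0.9},
)
gene_b = categorize_gene(
    lfc={"active_vs_control": 0.0, "active_vs_latent": -1.4, "latent_vs_control": 1.3},
    padj={"active_vs_control": 0.8, "active_vs_latent": 1e-4, "latent_vs_control": 1e-4},
)
gene_c = categorize_gene(
    lfc={"active_vs_control": -1.5, "active_vs_latent": 0.2, "latent_vs_control": -1.2},
    padj={"active_vs_control": 1e-4, "active_vs_latent": 0.7, "latent_vs_control": 1e-3},
)
print(gene_a, gene_b, gene_c)
```

The rules are checked in a fixed order, so a gene that satisfies none of them is reported as `unassigned` rather than forced into a category.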
3.7. Machine Learning Pipeline
3.7.1. Feature Selection
- Correlation filtering: Features with an absolute Pearson correlation of |r| < 0.1 with the target were removed as uninformative, and features with a pairwise correlation of |r| > 0.9 were pruned to avoid multicollinearity [46].
- Boruta-XGBoost: This wrapper method iteratively identified stable features using XGBoost’s gain-based importance [47]. Features were deemed significant if their importance exceeded the maximum importance of shadow features (permuted copies) across 100 iterations.
- LASSO regularization: The Least Absolute Shrinkage and Selection Operator (LASSO) was used to further refine the signature panel. It employs an L1-penalized objective function (see Supplementary Methods Equation (S1)) to induce sparsity and select a minimal set of robust predictive gene biomarkers. The regularization parameter (α = 0.01) was optimized by grid search [48,49].
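The three stages above can be sketched end to end on synthetic data. Several simplifications are assumptions of this sketch rather than details of the study: a scikit-learn random forest stands in for XGBoost’s gain-based importance, a fixed hit-rate threshold replaces the statistical test used by the Boruta package, 20 iterations are run instead of 100 to keep the example fast, and the target is encoded as a binary numeric label. All function names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler


def correlation_filter(X, y, min_target_r=0.1, max_pair_r=0.9):
    """Keep features with |r| >= min_target_r to y and no kept partner with |r| > max_pair_r."""
    target_r = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    kept = []
    # Visit features from strongest to weakest target correlation, so the
    # stronger member of a redundant pair is the one that survives.
    for j in np.argsort(-target_r):
        if not target_r[j] >= min_target_r:
            continue
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) <= max_pair_r for k in kept):
            kept.append(int(j))
    return sorted(kept)


def shadow_screen(X, y, n_iter=20, min_hit_rate=0.9, seed=0):
    """Boruta-style screen: a hit is scored when a feature beats the best shadow feature."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for it in range(n_iter):
        # Shadow features are column-wise permuted copies of the real features.
        shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(n_features)])
        model = RandomForestClassifier(n_estimators=100, random_state=seed + it)
        model.fit(np.hstack([X, shadows]), y)
        importance = model.feature_importances_
        hits += importance[:n_features] > importance[n_features:].max()
    return [j for j in range(n_features) if hits[j] / n_iter >= min_hit_rate]


def lasso_rank(X, y, alpha=0.01):
    """Rank features by |coefficient| of an L1-penalized linear model; zeroed features are dropped."""
    coef = Lasso(alpha=alpha).fit(StandardScaler().fit_transform(X), y).coef_
    return [int(j) for j in np.argsort(-np.abs(coef)) if coef[j] != 0]


# Synthetic data: x0 and x3 are informative, x1 duplicates x0, x2 is uninformative.
rng = np.random.default_rng(0)
y = np.tile([0, 1], 60)
x0 = y + rng.normal(0, 0.3, 120)
x1 = x0 + rng.normal(0, 0.02, 120)
x2 = np.tile([1.0, 1.0, -1.0, -1.0], 30)
x3 = 1.5 * y + rng.normal(0, 0.5, 120)
X = np.column_stack([x0, x1, x2, x3])

kept = correlation_filter(X, y)
stable = [kept[i] for i in shadow_screen(X[:, kept], y)]
ranked = [stable[i] for i in lasso_rank(X[:, stable], y)]
print("after correlation filter:", kept)
print("after shadow screen:", stable)
print("LASSO ranking:", ranked)
```

In a real analysis each stage would be fitted on the training split only, so that the held-out samples do not influence which genes are selected.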
3.7.2. Model Training and Evaluation
- Kernel-based method: Support Vector Machine (SVM) [54].
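For orientation, a soft-voting ensemble of the kind reported in this study can be assembled as sketched below. Only the SVM member and the use of a Voting Classifier are named in this section; the random forest and logistic regression members, the soft-voting rule, the bootstrap confidence interval, and the synthetic four-feature data are all assumptions of this example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVC

# Synthetic stand-in for a 4-gene expression matrix with three clinical states.
centers = [[0, 0, 0, 0], [3, 3, 3, 3], [0, 3, 0, 3]]
X, y = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

ensemble = VotingClassifier(
    estimators=[
        ("svm", SVC(kernel="rbf", probability=True, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average the predicted class probabilities
)
# The scaler sits inside the pipeline so it is fitted on training data only.
model = make_pipeline(RobustScaler(), ensemble)
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)
auc = roc_auc_score(y_test, proba, multi_class="ovr", average="macro")


def bootstrap_auc_ci(y_true, y_proba, n_boot=200, seed=0):
    """Percentile bootstrap 95% CI for the macro one-vs-rest AUC."""
    rng = np.random.default_rng(seed)
    n_classes = y_proba.shape[1]
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < n_classes:
            continue  # skip resamples that lost a class
        scores.append(
            roc_auc_score(y_true[idx], y_proba[idx], multi_class="ovr", average="macro")
        )
    return np.percentile(scores, [2.5, 97.5])


ci_low, ci_high = bootstrap_auc_ci(y_test, proba)
print(f"macro OvR AUC = {auc:.4f} (95% CI: {ci_low:.4f}-{ci_high:.4f})")
```

Soft voting requires every member to expose class probabilities, which is why the SVM is created with `probability=True`.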
3.8. Biomarker Validation and Visualization
3.9. Functional Enrichment Analysis
4. Results
- Active TB-specific transcriptional changes revealed by differential expression: Multi-group analysis identified several dysregulated genes in active TB (log2FC > 1, FDR < 0.05), whereas few dysregulated genes were observed between latent TB and controls.
- Feature-selection-based refinement of significant dysregulated genes: A hybrid Boruta-XGBoost + LASSO pipeline selected the most robust biomarker signature, comprising a minimal set of four genes.
- Diagnostic performance of the signature: A Voting Classifier fitted on this panel achieved an AUC of 0.9911 (95% CI: 0.983–0.997), a sensitivity of 90.00% (95% CI: 85.5–93.8%), and a specificity of 89.47% (95% CI: 84.2–93.5%), and stratified the three clinical states.
- External validation confirms robust expression: The expression of all four biomarkers was validated in an independent cohort (GSE19444), and all were significantly upregulated in active TB (ANOVA, Tukey’s HSD; p < 0.001).
- Functional pathway mapping: Enrichment analysis associated each biomarker with a core dysregulated pathway: antigen presentation (TAP2), lipid metabolism (SORT1), interferon-gamma response (WARS), and inflammasome activation (ANKRD22).
- Comparative performance: The signature meets the WHO triage test criteria and compares favorably with available transcriptomic panels, indicating potential for diagnostic application and for informing host-directed therapy.
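Sensitivity and specificity for a three-class model are usually reported one-vs-rest, with active TB as the positive class. The sketch below shows that arithmetic together with a Wilson score interval. The labels are hypothetical and were chosen only so that the point estimates equal 90% and 89.47%. The method used for the reported confidence intervals is not stated in this section, so the intervals printed here are not expected to reproduce them.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def one_vs_rest_counts(y_true, y_pred, positive="active"):
    """Collapse a multi-class result into TP, FN, TN, FP for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return tp, fn, tn, fp


# Hypothetical labels (NOT the study's confusion matrix).
y_true = ["active"] * 200 + ["latent"] * 100 + ["control"] * 90
y_pred = (
    ["active"] * 180 + ["latent"] * 20      # true active
    + ["latent"] * 90 + ["active"] * 10     # true latent
    + ["control"] * 80 + ["active"] * 10    # true control
)

tp, fn, tn, fp = one_vs_rest_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
sens_ci = wilson_interval(tp, tp + fn)
spec_ci = wilson_interval(tn, tn + fp)
print(f"sensitivity = {sensitivity:.4f}, 95% CI = ({sens_ci[0]:.4f}, {sens_ci[1]:.4f})")
print(f"specificity = {specificity:.4f}, 95% CI = ({spec_ci[0]:.4f}, {spec_ci[1]:.4f})")
```

The width of either interval depends on the number of samples behind the proportion, so the same point estimate from a smaller test set yields a noticeably wider interval.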
4.1. Identification of Differentially Expressed Genes (DEGs) Across Clinical States
4.2. Multi-Group Validation of Transcriptional Signatures
4.3. Transcriptome Analysis Discovers State-Specific Molecular Signatures
4.4. Machine Learning Prioritizes Minimal Biomarker Panels with Clinical Utility
4.5. Biomarker Validation Highlights Expression Dynamics
- TAP2: Higher expression is consistent with an increased demand for MHC-I antigen processing and presentation during active mycobacterial infection.
- SORT1: Overexpression suggests dysregulated intracellular sorting of proteins and lipids, processes that are important for immune cell function and inflammation.
- WARS: As an immunomodulatory tRNA synthetase, its increase indicates an enhanced interferon-mediated antimicrobial response.
- ANKRD22: Elevated expression suggests a role in regulating immune cell activation, possibly through the inflammasome or other innate signaling pathways that are active in the disease.
4.6. Sensitivity Analysis Confirms Signature Robustness
4.7. Robustness of the Signature to Scaling Methodology
4.8. Comparative Performance of TB Diagnostic ML Models
4.9. Gene Ontology and Pathway Enrichment Analysis of the Six Key DEGs
- Antigen presentation pathways:
- Functional enrichment analysis linked TAP2 to peptide loading in the MHC class I pathway (GO:0042590, FDR = 2.1 × 10⁻⁵; Reactome R-HSA-983170, p = 7.8 × 10⁻⁶). This result is in line with its known role in the processing and presentation of mycobacterial antigens during active infection.
- Interferon-mediated immunity:
- In line with its immune functions, WARS was highly enriched in interferon-gamma signaling (GO:0060333, FDR = 3.4 × 10⁻⁴; Reactome R-HSA-877300, p = 1.2 × 10⁻⁵), which supports its role in antimicrobial defense.
- Cellular protein trafficking:
- Functional analysis associated SORT1 with lysosomal sorting and vesicular transport (GO:0007041, FDR = 0.003; KEGG hsa04142, FDR = 0.008), an indicator of the dysregulated protein trafficking that occurs in active TB.
- Innate immune activation:
- Pathway analysis showed that ANKRD22 is functionally related to neutrophil degranulation (Reactome R-HSA-6798695, p = 4.5 × 10⁻⁴). Since neutrophils play a primary role in the early immune response against M. tuberculosis, this connection suggests that ANKRD22 may be involved in the inflammatory mechanisms of granuloma development.
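Enrichment p-values like those listed above typically come from an over-representation test. A minimal sketch of the plain hypergeometric tail probability follows. It is a generic illustration with hypothetical counts: tools such as DAVID apply their own variant of Fisher’s exact test plus a multiple-testing correction, so this code is not expected to reproduce the values reported here.

```python
from math import comb


def hypergeom_enrichment_p(overlap, list_size, term_size, background_size):
    """P(X >= overlap) when list_size genes are drawn without replacement
    from a background in which term_size genes are annotated to the term."""
    max_overlap = min(list_size, term_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 background genes, 150 annotated to a pathway,
# a 50-gene query list with 5 genes falling in that pathway.
p_value = hypergeom_enrichment_p(overlap=5, list_size=50, term_size=150, background_size=20000)
print(f"over-representation p = {p_value:.3e}")
```

With a very short query list, a single annotated gene can produce a small p-value, which is one reason enrichment results for minimal signatures are best read as annotations rather than as independent statistical evidence.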
5. Discussion and Future Work
- SORT1 regulates PPARγ-dependent lipid trafficking, facilitating foam cell formation in granulomas [66].
- Experimental validation of the four-gene signature using targeted techniques such as quantitative PCR (qPCR) or NanoString on prospectively collected whole blood from well-defined clinical groups.
- Validation in underrepresented groups, such as pediatric and HIV-coinfected patients, to support unbiased diagnostic utility.
- Assessment of the signature performance in subclinical TB, an essential gap in the current diagnostic spectrum.
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- World Health Organization. 2025 Tuberculosis Global Report; World Health Organization: Geneva, Switzerland, 2025.
- Kohli, M.; Korobitsyn, A.; Ismail, N.; Zignol, M.; Kasaeva, T.; Dewan, P.; Ruhwald, M.; Anyaike, C.; Ayles, H.; Basilio, R.; et al. WHO Target Product Profile for TB Detection at Peripheral Settings: 2024 Update. PLoS Glob. Public Health 2025, 5, e0004612.
- World Health Organization. WHO Consolidated Guidelines on Tuberculosis: Module 3: Diagnosis—Rapid Diagnostics for Tuberculosis Detection; World Health Organization: Geneva, Switzerland, 2021.
- Laux Da Costa, L.; Delcroix, M.; Dalla Costa, E.R.; Prestes, I.V.; Milano, M.; Francis, S.S.; Unis, G.; Silva, D.R.; Riley, L.W.; Rossetti, M.L.R. A Real-Time PCR Signature to Discriminate between Tuberculosis and Other Pulmonary Diseases. Tuberculosis 2015, 95, 421–425.
- Afzal, Z.; Javed, M.T.; Mohsin, M.; Ahmad, H.M.W.; Saeed, Z.; Taimoor, M.; Aleem, R.A.; Raza, A.; Ayub, A.; Israr, F.; et al. The usefulness of glutaraldehyde coagulation test as a conjuncture test in the diagnosis of tuberculosis in humans and animals. Agrobiol. Rec. 2024, 15, 34–40.
- Wahyuda, A.; Suharto, R.H.; Muflih, M.; Rasdiyanah; Danawir, M. Exploration of Tuberculosis Transmission between Humans and Cows through Milk Testing in South Sulawesi, Indonesia. Int. J. Vet. Sci. 2025, 14, 1190–1195.
- Hassine, A.B.H.; Marzouk, M.; Saad, J.; Boukadida, J.; Drancourt, M. Molecular epidemiology of mycobacterium tuberculosis complex in the center of Tunisia (2008–2010 and 2014–2017). Agrobiol. Rec. 2024, 17, 69–74.
- Sambarey, A.; Devaprasad, A.; Mohan, A.; Ahmed, A.; Nayak, S.; Swaminathan, S.; D’Souza, G.; Jesuraj, A.; Dhar, C.; Babu, S.; et al. Unbiased Identification of Blood-Based Biomarkers for Pulmonary Tuberculosis by Modeling and Mining Molecular Interaction Networks. EBioMedicine 2017, 15, 112–126.
- Sambarey, A.; Devaprasad, A.; Baloni, P.; Mishra, M.; Mohan, A.; Tyagi, P.; Singh, A.; Akshata, J.; Sultana, R.; Buggi, S.; et al. Meta-Analysis of Host Response Networks Identifies a Common Core in Tuberculosis. NPJ Syst. Biol. Appl. 2017, 3, 4.
- Gliddon, H.D.; Kaforou, M.; Alikian, M.; Habgood-Coote, D.; Zhou, C.; Oni, T.; Anderson, S.T.; Brent, A.J.; Crampin, A.C.; Eley, B.; et al. Identification of Reduced Host Transcriptomic Signatures for Tuberculosis Disease and Digital PCR-Based Validation and Quantification. Front. Immunol. 2021, 12, 637164.
- Zhang, H.; Shi, M.; Yu, L.; Ran, F.; Zheng, N.; Wang, X.; Liu, Y.; Li, C.; Li, D.; Li, J. Identification of a Diagnostic Multiomics-Based Biomarker Cluster in a Mouse Model of Pulmonary Tuberculosis. Discov. Med. 2024, 36, 1268–1279.
- Chen, E.; Chen, C.; Chen, F.; Yu, P.; Lin, L. Positive Association between MIC Gene Polymorphism and Tuberculosis in Chinese Population. Immunol. Lett. 2019, 213, 62–69.
- Herrera, M.; Keynan, Y.; McLaren, P.J.; Isaza, J.P.; Abrenica, B.; López, L.; Marin, D.; Rueda, Z.V. Gene Expression Profiling Identifies Candidate Biomarkers for New Latent Tuberculosis Infections. A Cohort Study. PLoS ONE 2022, 17, e0274257.
- Vargas, R.; Abbott, L.; Bower, D.; Frahm, N.; Shaffer, M.; Yu, W.H. Gene Signature Discovery and Systematic Validation across Diverse Clinical Cohorts for TB Prognosis and Response to Treatment. PLoS Comput. Biol. 2023, 19, e1010770.
- Zheng, S.; Qu, W.; Zhang, D.; Zhou, J.; Xu, Y.; Wu, W.; Liu, C.; Huang, M.; Shen, E.; Chen, X.; et al. International Multicenter Development of Ensemble Machine Learning Driven Host Response Based Diagnosis for Tuberculosis. iScience 2025, 28, 113444.
- Łukaszuk, T.; Krawczuk, J.; Żyła, K.; Kęsik, J. Stability of Feature Selection in Multi-Omics Data Analysis. Appl. Sci. 2024, 14, 11103.
- Boumait, Y.; Ettetuani, B.; Chrairi, M.; Lamzouri, A.; Chahboune, R. Identification of Gene Expression Biomarkers Predictive of Latent Tuberculosis Infection Using Machine Learning Approaches. Genes 2025, 16, 715.
- Wang, S.; He, L.; Wu, J.; Zhou, Z.; Gao, Y.; Chen, J.; Shao, L.; Zhang, Y.; Zhang, W. Transcriptional Profiling of Human Peripheral Blood Mononuclear Cells Identifies Diagnostic Biomarkers That Distinguish Active and Latent Tuberculosis. Front. Immunol. 2019, 10, 2948.
- Berry, M.P.R.; Graham, C.M.; McNab, F.W.; Xu, Z.; Bloch, S.A.A.; Oni, T.; Wilkinson, K.A.; Banchereau, R.; Skinner, J.; Wilkinson, R.J.; et al. An Interferon-Inducible Neutrophil-Driven Blood Transcriptional Signature in Human Tuberculosis. Nature 2010, 466, 973–977.
- Kaforou, M.; Wright, V.J.; Oni, T.; French, N.; Anderson, S.T.; Bangani, N.; Banwell, C.M.; Brent, A.J.; Crampin, A.C.; Dockrell, H.M.; et al. Detection of Tuberculosis in HIV-Infected and -Uninfected African Adults Using Whole Blood RNA Expression Signatures: A Case-Control Study. PLoS Med. 2013, 10, e1001538.
- Anderson, S.T.; Kaforou, M.; Brent, A.J.; Wright, V.J.; Banwell, C.M.; Chagaluka, G.; Crampin, A.C.; Dockrell, H.M.; French, N.; Hamilton, M.S.; et al. Diagnosis of Childhood Tuberculosis and Host RNA Expression in Africa. N. Engl. J. Med. 2014, 370, 1712–1723.
- Sweeney, T.E.; Braviak, L.; Tato, C.M.; Khatri, P. Genome-Wide Expression for Diagnosis of Pulmonary Tuberculosis: A Multicohort Analysis. Lancet Respir. Med. 2016, 4, 213–224.
- Lee, S.W.; Wu, L.S.H.; Huang, G.M.; Huang, K.Y.; Lee, T.Y.; Weng, J.T.Y. Gene Expression Profiling Identifies Candidate Biomarkers for Active and Latent Tuberculosis. BMC Bioinform. 2016, 17, 27–39.
- Maertzdorf, J.; McEwen, G.; Weiner, J.; Tian, S.; Lader, E.; Schriek, U.; Mayanja-Kizza, H.; Ota, M.; Kenneth, J.; Kaufmann, S.H. Concise Gene Signature for Point-of-care Classification of Tuberculosis. EMBO Mol. Med. 2016, 8, 86–95.
- Natarajan, S.; Ranganathan, M.; Hanna, L.E.; Tripathy, S. Transcriptional Profiling and Deriving a Seven-Gene Signature That Discriminates Active and Latent Tuberculosis: An Integrative Bioinformatics Approach. Genes 2022, 13, 616.
- Leong, S.; Zhao, Y.; Joseph, N.M.; Hochberg, N.S.; Sarkar, S.; Pleskunas, J.; Hom, D.; Lakshminarayanan, S.; Horsburgh, C.R.; Roy, G.; et al. Existing Blood Transcriptional Classifiers Accurately Discriminate Active Tuberculosis from Latent Infection in Individuals from South India. Tuberculosis 2018, 109, 41–51.
- Bayaa, R.; Ndiaye, M.D.B.; Chedid, C.; Kokhreidze, E.; Tukvadze, N.; Banu, S.; Uddin, M.K.M.; Biswas, S.; Nasrin, R.; Ranaivomanana, P.; et al. Multi-Country Evaluation of RISK6, a 6-Gene Blood Transcriptomic Signature, for Tuberculosis Diagnosis and Treatment Monitoring. Sci. Rep. 2021, 11, 13646.
- Sutherland, J.S.; Van Der Spuy, G.; Gindeh, A.; Thuong, N.T.T.; Namuganga, A.R.; Owolabi, O.; Mayanja-Kizza, H.; Nsereko, M.; Thwaites, G.; Winter, J.; et al. Diagnostic Accuracy of the Cepheid 3-Gene Host Response Fingerstick Blood Test in a Prospective, Multi-Site Study: Interim Results. Clin. Infect. Dis. 2022, 74, 2136–2141.
- Luo, Y.; Xue, Y.; Liu, W.; Song, H.; Huang, Y.; Tang, G.; Wang, F.; Wang, Q.; Cai, Y.; Sun, Z. Development of Diagnostic Algorithm Using Machine Learning for Distinguishing between Active Tuberculosis and Latent Tuberculosis Infection. BMC Infect. Dis. 2022, 22, 965.
- Xie, L.; Zhu, G.; Long, S.; Wang, M.; Cheng, X.; Dong, Y.; Wang, C.; Wang, G. Identification of MORN3 and LLGL2 as Novel Diagnostic Biomarkers for Latent Tuberculosis Infection Using Machine Learning Strategies and Experimental Verification. Ann. Med. 2024, 56, 2380797.
- Ren, B.; Jia, F.; Fang, Q.; Xu, J.; Lin, K.; Huang, R.; Liu, Z.; Xing, X. Development of a Four Autophagy-Related Gene Signature for Active Tuberculosis Diagnosis. Front. Cell. Infect. Microbiol. 2025, 15, 1600348.
- Perumal, P.; Abdullatif, M.B.; Garlant, H.N.; Honeyborne, I.; Lipman, M.; McHugh, T.D.; Southern, J.; Breen, R.; Santis, G.; Ellappan, K.; et al. Validation of Differentially Expressed Immune Biomarkers in Latent and Active Tuberculosis by Real-Time PCR. Front. Immunol. 2021, 11, 612564.
- Irizarry, R.A.; Hobbs, B.; Collin, F.; Beazer-Barclay, Y.D.; Antonellis, K.J.; Scherf, U.; Speed, T.P. Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics 2003, 4, 249–264.
- Zhang, Y.; Parmigiani, G.; Johnson, W.E. ComBat-Seq: Batch Effect Adjustment for RNA-Seq Count Data. NAR Genom. Bioinform. 2020, 2, lqaa078.
- Champely, S.; Ekstrom, C.; Dalgaard, P.; Gill, J. Pwr: Basic Functions for Power Analysis; R Package Version 1.3-0; Comprehensive R Archive Network (CRAN): Vienna, Austria, 2020; Available online: https://CRAN.R-project.org/package=pwr (accessed on 20 September 2025).
- Rahnenführer, J.; De Bin, R.; Benner, A.; Ambrogi, F.; Lusa, L.; Boulesteix, A.L.; Migliavacca, E.; Binder, H.; Michiels, S.; Sauerbrei, W.; et al. Statistical Analysis of High-Dimensional Biomedical Data: A Gentle Introduction to Analytical Goals, Common Approaches and Challenges. BMC Med. 2023, 21, 182.
- Huber, W.; Von Heydebreck, A.; Sültmann, H.; Poustka, A.; Vingron, M. Variance Stabilization Applied to Microarray Data Calibration and to the Quantification of Differential Expression. Bioinformatics 2002, 18, S96–S104.
- Han, J.; Kamber, M. Data Mining: Concepts and Techniques, 2nd ed.; Morgan Kaufmann: San Francisco, CA, USA, 2006.
- Huber, W.; Carey, V.J.; Gentleman, R.; Anders, S.; Carlson, M.; Carvalho, B.S.; Bravo, H.C.; Davis, S.; Gatto, L.; Girke, T.; et al. Orchestrating High-Throughput Genomic Analysis with Bioconductor. Nat. Methods 2015, 12, 115–121.
- Benjamini, Y.; Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B (Methodol.) 1995, 57, 289–300.
- Dudoit, S.; Yang, Y.H.; Callow, M.J.; Speed, T.P. Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments. Stat. Sin. 2002, 12, 111–139. Available online: http://www.jstor.org/stable/24307038 (accessed on 19 February 2026).
- Kim, T.K. Understanding One-Way ANOVA Using Conceptual Figures. Korean J. Anesthesiol. 2017, 70, 22.
- Tukey, J.W. Comparing Individual Means in the Analysis of Variance. Biometrics 1949, 5, 99–114.
- Saltelli, A.; Aleksankina, K.; Becker, W.; Fennell, P.; Ferretti, F.; Holst, N.; Li, S.; Wu, Q. Why So Many Published Sensitivity Analyses Are False: A Systematic Review of Sensitivity Analysis Practices. Environ. Model. Softw. 2019, 114, 29–39.
- Bender, R.; Lange, S. Adjusting for Multiple Testing—When and How? J. Clin. Epidemiol. 2001, 54, 343–349.
- James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning with Applications in R; Springer: New York, NY, USA, 2013.
- Kursa, M.B.; Rudnicki, W.R. Feature Selection with the Boruta Package. J. Stat. Softw. 2010, 36, 1–13.
- Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 1996, 58, 267–288.
- Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Association for Computing Machinery: New York, NY, USA, 2016; Volume 13–17, pp. 785–794.
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
- Scornet, E. Random Forests and Kernel Methods. IEEE Trans. Inf. Theory 2015, 62, 1485–1500.
- Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232.
- Cui, H.; Zhang, X. Alignment-Free Supervised Classification of Metagenomes by Recursive SVM. BMC Genom. 2013, 14, 641.
- Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: New York, NY, USA, 2009.
- Hanley, J.A.; McNeil, B.J. The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology 1982, 143, 29–36.
- Waskom, M. Seaborn: Statistical Data Visualization. J. Open Source Softw. 2021, 6, 3021.
- Wolf, F.A.; Angerer, P.; Theis, F.J. SCANPY: Large-Scale Single-Cell Gene Expression Data Analysis. Genome Biol. 2018, 19, 15.
- Dennis, G.; Sherman, B.T.; Hosack, D.A.; Yang, J.; Gao, W.; Lane, C.; Lempicki, R.A. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003, 4, R60.
- Sherman, B.T.; Hao, M.; Qiu, J.; Jiao, X.; Baseler, M.W.; Lane, H.C.; Imamichi, T.; Chang, W. DAVID: A Web Server for Functional Enrichment Analysis and Functional Annotation of Gene Lists (2021 Update). Nucleic Acids Res. 2022, 50, W216–W221.
- Kanehisa, M.; Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000, 28, 27–30.
- Pai, M.; Behr, M. Latent Mycobacterium Tuberculosis Infection and Interferon-Gamma Release Assays. Microbiol. Spectr. 2016, 4, 10-1128.
- Esmail, H.; Riou, C.; du Bruyn, E.; Pei-Jen Lai, R.; Harley, Y.X.; Meintjes, G.; Wilkinson, K.A.; Wilkinson, R.J. The Immune Response to Mycobacterium Tuberculosis in HIV-1-Coinfected Persons. Annu. Rev. Immunol. 2018, 36, 603–638.
- Huynh, J.; Abo, Y.N.; Triasih, R.; Singh, V.; Pukai, G.; Masta, P.; Tsogt, B.; Luu, B.K.; Felisia, F.; Pank, N.; et al. Emerging Evidence to Reduce the Burden of Tuberculosis in Children and Young People. Int. J. Infect. Dis. 2025, 155, 107869.
- Kasule, G.W.; Hermans, S.; Semugenze, D.; Wekiya, E.; Nsubuga, J.; Mwachan, P.; Kabugo, J.; Joloba, M.; García-Basteiro, A.L.; Ssengooba, W. Non-Sputum-Based Samples and Biomarkers for Detection of Mycobacterium Tuberculosis: The Hope to Improve Childhood and HIV-Associated Tuberculosis Diagnosis. Eur. J. Med. Res. 2024, 29, 502.
- Vázquez, C.L.; Rodgers, A.; Herbst, S.; Coade, S.; Gronow, A.; Guzman, C.A.; Wilson, M.S.; Kanzaki, M.; Nykjaer, A.; Gutierrez, M.G. The Proneurotrophin Receptor Sortilin Is Required for Mycobacterium Tuberculosis Control by Macrophages. Sci. Rep. 2016, 6, 29332.
- Barbet, G.; Nair-Gupta, P.; Schotsaert, M.; Yeung, S.T.; Moretti, J.; Seyffer, F.; Metreveli, G.; Gardner, T.; Choi, A.; Tortorella, D.; et al. TAP Dysfunction in Dendritic Cells Enables Noncanonical Cross-Presentation for T Cell Priming. Nat. Immunol. 2021, 22, 497–509.
- Lin, P.L.; Flynn, J.A.L. CD8 T Cells and Mycobacterium Tuberculosis Infection. Semin. Immunopathol. 2015, 37, 239–249.
- Jobe, D.; Darboe, F.; Muefong, C.N.; Barry, A.; Coker, E.G.; Mohammed, N.; Jobe, A.; Davies, M.M.; Faye, B.; Jallow, R.; et al. Gene Expression in TB Disease Measured from the Periphery Is Different from the Site of Infection. Tuberculosis 2022, 134, 102187.
- Ahmed, M.; Thirunavukkarasu, S.; Rosa, B.A.; Thomas, K.A.; Das, S.; Rangel-Moreno, J.; Lu, L.; Mehra, S.; Mbandi, S.K.; Thackray, L.B.; et al. Immune Correlates of Tuberculosis Disease and Risk Translate across Species. Sci. Transl. Med. 2020, 12, eaay0233.
- Yang, Y.; Li, C.; Fan, X.; Long, W.; Hu, Y.; Wang, Y.; Qu, J. Effectiveness of Omadacycline in a Patient with Chlamydia Psittaci and KPC-Producing Gram-Negative Bacteria Infection. Infect. Drug Resist. 2025, 18, 903–908.
- Pai, M.; Behr, M.A.; Dowdy, D.; Dheda, K.; Divangahi, M.; Boehme, C.C.; Ginsberg, A.; Swaminathan, S.; Spigelman, M.; Getahun, H.; et al. Tuberculosis. Nat. Rev. Dis. Primers 2016, 2, 16076.
- Horne, D.J.; Zifodya, J.S.; Shapiro, A.E.; Church, E.C.; Kreniske, J.S.; Kay, A.W.; Scandrett, K.; Steingart, K.R.; Takwoingi, Y. Xpert MTB/RIF Ultra Assay for Pulmonary Tuberculosis and Rifampicin Resistance in Adults and Adolescents. Cochrane Database Syst. Rev. 2025, 7, CD009593.
- Liu, T.; Wang, Y.; Gui, J.; Fu, Y.; Ye, C.; Hong, X.; Chen, L.; Li, Y.; Zhang, X.; Hong, W. Transcriptome Analysis of the Impact of Diabetes as a Comorbidity on Tuberculosis. Medicine 2022, 101, E31652.
- Liu, Y.; Pu, Y.; Wang, J.; Li, Z.; Liu, S.; Tang, S. A Bioinformatics-Driven Approach to Identify Biomarkers and Elucidate the Pathogenesis of Type 2 Diabetes Concurrent with Pulmonary Tuberculosis. Sci. Rep. 2025, 15, 16931.
- Darboe, F.; Mbandi, S.K.; Naidoo, K.; Yende-Zuma, N.; Lewis, L.; Thompson, E.G.; Duffy, F.J.; Fisher, M.; Filander, E.; Van Rooyen, M.; et al. Detection of Tuberculosis Recurrence, Diagnosis and Treatment Response by a Blood Transcriptomic Risk Signature in HIV-Infected Persons on Antiretroviral Therapy. Front. Microbiol. 2019, 10, 1441.
- Zak, D.E.; Penn-Nicholson, A.; Scriba, T.J.; Thompson, E.; Suliman, S.; Amon, L.M.; Mahomed, H.; Erasmus, M.; Whatney, W.; Hussey, G.D.; et al. A Blood RNA Signature for Tuberculosis Disease Risk: A Prospective Cohort Study. Lancet 2016, 387, 2312–2322.
- Gootenberg, J.S.; Abudayyeh, O.O.; Lee, J.W.; Essletzbichler, P.; Dy, A.J.; Joung, J.; Verdine, V.; Donghia, N.; Daringer, N.M.; Freije, C.A.; et al. Nucleic Acid Detection with CRISPR-Cas13a/C2c2. Science 2017, 356, 438–442.

| Gene | ANOVA F | ANOVA p-Value | Comparison | Δlog2(FC) | Adjusted p-Value |
|---|---|---|---|---|---|
| TAP2 | 25.363 | 8.79 × 10⁻⁸ | Active vs. Latent | 1.114 | <0.0001 |
| | | | Active vs. Control | 1.195 | <0.0001 |
| | | | Latent vs. Control | 0.081 | 0.8963 |
| SORT1 | 39.702 | 4 × 10⁻¹⁰ | Active vs. Latent | 1.166 | <0.0001 |
| | | | Active vs. Control | 1.077 | <0.0001 |
| | | | Latent vs. Control | −0.089 | 0.8084 |
| WARS | 41.905 | 2 × 10⁻¹⁰ | Active vs. Latent | 1.737 | <0.0001 |
| | | | Active vs. Control | 1.754 | <0.0001 |
| | | | Latent vs. Control | 0.017 | 0.9964 |
| ANKRD22 | 45.285 | 6.79 × 10⁻¹¹ | Active vs. Latent | 3.852 | <0.0001 |
| | | | Active vs. Control | 3.559 | <0.0001 |
| | | | Latent vs. Control | −0.293 | 0.7872 |

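
As a rough illustration of how the per-gene columns in the table above could be derived, the following Python sketch computes a one-way ANOVA F statistic and the pairwise differences in mean log2 expression for three groups. The expression values are synthetic and the function names (`one_way_anova_f`, `pairwise_mean_differences`) are hypothetical; this is not the authors' code. The ANOVA p-value and the Tukey-adjusted p-values additionally require the F and studentized range distributions, which a statistics library would supply and which are omitted here.

```python
from itertools import combinations
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a dict of group name -> values."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = fmean(all_values)
    # Between-group sum of squares: group size times squared offset of the group mean.
    ss_between = sum(
        len(values) * (fmean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    # Within-group sum of squares: squared deviations from each group's own mean.
    ss_within = sum(
        (v - fmean(values)) ** 2
        for values in groups.values()
        for v in values
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def pairwise_mean_differences(groups):
    """Difference in mean log2 expression for every pair of groups."""
    means = {name: fmean(values) for name, values in groups.items()}
    return {
        f"{a} vs. {b}": means[a] - means[b]
        for a, b in combinations(groups, 2)
    }


# Synthetic log2 expression values for one gene (assumption, NOT study data).
expression = {
    "Active": [9.1, 9.4, 8.9, 9.6, 9.2],
    "Latent": [8.0, 8.2, 7.9, 8.1, 8.3],
    "Control": [8.1, 7.9, 8.0, 8.2, 7.8],
}

f_stat, df_between, df_within = one_way_anova_f(expression)
diffs = pairwise_mean_differences(expression)

print(f"F({df_between}, {df_within}) = {f_stat:.3f}")
for comparison, delta in diffs.items():
    print(f"{comparison}: delta log2 = {delta:.3f}")
```

Because the values are already on a log2 scale, each difference in means is itself a log2 fold change. A large Active vs. Latent or Active vs. Control difference combined with a near-zero Latent vs. Control difference is the pattern the table reports for all four genes.
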
| Scaling Method | AUC (95% CI) | Sensitivity (95% CI) | Specificity (95% CI) | Accuracy (95% CI) | Macro F1-Score (95% CI) |
|---|---|---|---|---|---|
| RobustScaler | 0.9911 (0.983–0.997) | 90.0% (85.5–93.8%) | 89.47% (84.2–93.5%) | 86.2% (81.0–90.5%) | 86.2% (81.0–90.3%) |
| Z-score | 0.9902 (0.981–0.996) | 90.3% (85.8–94.0%) | 88.9% (83.5–93.0%) | 85.8% (80.5–90.0%) | 85.7% (80.4–89.8%) |
| Quantile | 0.9895 (0.980–0.996) | 89.7% (85.2–93.5%) | 89.5% (84.2–93.5%) | 85.5% (80.2–89.8%) | 85.5% (80.1–89.6%) |
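
To show how metrics of the kind reported in the table above can be obtained, the sketch below computes AUC, sensitivity, and specificity for a binary active versus non-active split and attaches a percentile bootstrap 95% confidence interval to the AUC. This section does not state how the intervals were computed, so the bootstrap is an assumption, as are the 0.5 decision threshold, the synthetic labels and scores, and all function names.

```python
import random


def auc_score(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def bootstrap_ci(labels, scores, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a metric(labels, scores)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample is missing one class, so the metric is undefined
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[round((alpha / 2) * n_boot)]
    upper = stats[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic data (assumption, NOT study data): 1 = active TB, 0 = latent or control.
labels = [1] * 10 + [0] * 10
scores = [0.92, 0.85, 0.77, 0.66, 0.95, 0.58, 0.81, 0.43, 0.88, 0.71,
          0.12, 0.35, 0.22, 0.48, 0.05, 0.61, 0.30, 0.18, 0.40, 0.27]

auc = auc_score(labels, scores)
sens, spec = sensitivity_specificity(labels, scores)
ci_low, ci_high = bootstrap_ci(labels, scores, auc_score)

print(f"AUC = {auc:.4f} (95% CI: {ci_low:.3f}-{ci_high:.3f})")
print(f"Sensitivity = {sens:.1%}, Specificity = {spec:.1%}")
```

Passing a different `metric` callable to `bootstrap_ci` gives intervals for the other columns in the same way. The percentile method is only one of several interval constructions, and the resulting width depends strongly on sample size.
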

| Study | Statistical Model | Indication | Number of Genes | Sensitivity (%) | Specificity (%) | AUC |
|---|---|---|---|---|---|---|
| Berry et al., 2010 [19] | K-nearest neighbors | ATB vs. LTBI and HCs | 393 | 61.67 | 93.75 | N/A |
| | | ATB vs. ODs | 86 | 92 | 83 | N/A |
| Kaforou et al., 2013 [20] | Difference of means | ATB vs. LTBI | 27 | 95 | 90 | 0.98 |
| | | ATB vs. ODs | 44 | 93 | 88 | 0.95 |
| Anderson et al., 2014 [21] | Difference of sums | ATB vs. LTBI | 42 | 96 | 91 | 0.984 |
| | | ATB vs. ODs | 51 | 74 | 78 | 0.862 |
| Laux da Costa et al., 2015 [4] | Random Forest | ATB vs. ODs | 3 | 93 | 95 | 0.955 |
| Lee et al., 2016 [23] | Naive Bayes | ATB vs. LTBI | 3 | 97.9 | 98 | 0.979 |
| Maertzdorf et al., 2016 [24] | Random Forest | ATB vs. LTBI and HCs | 4 | 88 | 75 | 0.98 |
| Sweeney et al., 2016 [22] | Difference of geometric means | ATB vs. LTBI and ODs and HCs | 3 | 82 | 79 | 0.88 |
| Sambarey et al., 2017 [8] | Linear discriminant analysis | ATB vs. LTBI and HCs and ODs | 10 | 89.67 | 81.0 | N/A |
| Leong et al., 2018 [26] | Ridge logistic regression | ATB vs. LTBI | 24 | 93.07 | 94.5 | 0.9840 |
| Bayaa et al., 2018 [27] | LASSO | ATB vs. HCs | 6 | 90.9 | 87.8 | 0.94 |
| | | ATB vs. LTBI | 6 | 90.9 | 88.5 | 0.93 |
| Wang et al., 2019 [18] | Decision Tree | ATB vs. LTBI and HCs | 3 | 82.4 | 92.4 | 0.806 |
| Gliddon et al., 2021 [10] | Disease Risk Score Method | TB/LTBI | 3 | 95 | 85 | 0.973 |
| | | TB/OD | 3 | 95 | 85 | 0.938 |
| Perumal et al., 2021 [32] | Simple arithmetic algorithms | HCs vs. ATB | 2 | 90.48 | 66.67 | 0.9048 |
| | | HCs/LTBI vs. ATB | 2 | 90.91 | 71.43 | 0.8615 |
| | | HCs vs. LTBI | 2 | 91.67 | 23.81 | 0.5357 |
| | | LTBI vs. ATB | 2 | 90.48 | 71.43 | 0.8367 |
| Natarajan et al., 2022 [25] | N/A | ATB vs. LTBI | 7 | 80–100 | 80–95 | 0.84–1.00 |
| Sutherland et al., 2022 [28] | Mann–Whitney U tests | TB vs. ORD | 3 | 87 | 94 | 0.88 |
| Luo et al., 2022 [29] | Cforest | ATB vs. LTBI | 8 | 93.39 | 91.18 | 0.978 |
| Xie et al., 2024 [30] | LASSO/Random Forest | ATB vs. LTBI | 2 | -- | -- | 0.994 |
| | | ATB vs. HCs | 2 | -- | -- | 0.782 |
| | | LTBI vs. HCs | 2 | -- | -- | 0.914 |
| Ren et al., 2025 [31] | Support Vector Machine | ATB vs. LTBI | 4 | -- | -- | 0.86 |
| | | ATB vs. HCs | 4 | -- | -- | 0.99 |
| This study (2025) | Voting Classifier | ATB vs. LTBI and HCs | 4 | 90 | 89.47 | 0.9911 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Syed, A.H.; Alromema, N.; Almazarqi, H.A.; Irfan, J.; Ahmad, S.; Taha, A.A.; Alsayed, A.O. Machine-Learning-Derived, Mechanistically Informed Transcriptomic Signature to Diagnose Active Tuberculosis and Guide Host-Directed Therapy. Diagnostics 2026, 16, 693. https://doi.org/10.3390/diagnostics16050693

