Machine-Learning-Derived, Mechanistically Informed Transcriptomic Signature to Diagnose Active Tuberculosis and Guide Host-Directed Therapy

Syed, Asif Hassan; Alromema, Nashwan; Almazarqi, Hatem A.; Irfan, Jasrah; Ahmad, Shakeel; Taha, Altyeb A.; Alsayed, Alhuseen Omar

doi:10.3390/diagnostics16050693


by Asif Hassan Syed ¹,*, Nashwan Alromema ¹, Hatem A. Almazarqi ², Jasrah Irfan ³, Shakeel Ahmad ¹, Altyeb A. Taha ² and Alhuseen Omar Alsayed ⁴

¹ Department of Computer Science, Faculty of Computing and Information Technology in Rabigh, King Abdulaziz University, Jeddah 21589, Saudi Arabia

² Department of Information Technology, Faculty of Computing and Information Technology in Rabigh, King Abdulaziz University, Jeddah 21911, Saudi Arabia

³ Alkhor International School, Alkhor P.O. Box 22166, Qatar

⁴ Department of Research Affairs Unit, Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah 21589, Saudi Arabia

* Author to whom correspondence should be addressed.

Diagnostics 2026, 16(5), 693; https://doi.org/10.3390/diagnostics16050693

Submission received: 6 January 2026 / Revised: 14 February 2026 / Accepted: 20 February 2026 / Published: 26 February 2026

(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)


Abstract

Background/Objectives: Differentiating active tuberculosis (TB) from latent TB infection (LTBI) remains an important diagnostic problem, and current biomarkers offer little insight into disease pathogenesis that could direct treatment. We therefore designed a dual-purpose biomarker signature from a multicohort transcriptomic analysis combined with a stringent machine learning pipeline. Methods: Active TB, latent TB, and healthy control samples were screened with a rigorous filter (ANOVA, p < 0.001), followed by feature selection with Boruta-XGBoost and LASSO regression. This yielded a compact four-gene signature (TAP2, SORT1, WARS, and ANKRD22) that was selectively and strongly upregulated in active TB (p < 0.001). An ensemble stacking classifier built on this signature (Random Forest and XGBoost) achieved very high diagnostic performance (ROC-AUC = 0.991 (95% CI: 0.983–0.997)) in stratifying infection stages, and the findings were confirmed in an independent cohort (GSE19444). Results: Functional pathway analysis showed that each gene maps to a core host pathway dysregulated in active TB: antigen presentation (TAP2), lipid trafficking (SORT1), interferon response (WARS), and inflammasome signaling (ANKRD22). The signature therefore offers a dual advantage: (1) a highly specific, non-sputum transcriptional diagnostic for active TB, and (2) a mechanistic map of key host pathways that points to targets for intervention. Conclusions: The signature provides both a biomarker panel aligned with WHO performance targets for TB triage and a mechanistic basis for therapy, offering a route from transcriptomic discovery to clinical action against TB.

Keywords:

tuberculosis; transcriptomics; machine learning; diagnostic biomarkers; host-directed therapy

1. Introduction

1.1. Background

Tuberculosis (TB), caused by Mycobacterium tuberculosis (M. tuberculosis), is a persistent global health problem, with more than 10 million cases and 1.5 million deaths each year [1]. Current diagnostics, including sputum smear microscopy and GeneXpert, have severely limited sensitivity for paucibacillary and extrapulmonary TB, and detection of latent TB infection (LTBI) remains problematic [2,3,4]. Zoonotic TB, caused by Mycobacterium bovis and transmitted between animals and humans, is also a major public health issue in most regions [5,6,7]. Transcriptomic profiling has changed how host-response biomarkers are identified, opening the way to non-invasive diagnostics and a better understanding of disease pathogenesis [8,9,10].

1.2. Research Motivation

Although transcriptomic studies have revealed potential biomarkers of TB, recent work has also drawn on multi-omics approaches and genetic association studies. For instance, multi-omics clusters underlying disease progression have been identified in murine models [11], and polymorphisms in genes including those of the MHC class I polypeptide-related sequence (MIC) family have been associated with TB susceptibility in human populations [12]. Our work builds on this foundation by applying a robust ML pipeline to transcriptomic data to address three key gaps in current transcriptomic research. First, most studies rely on pairwise comparisons (e.g., active TB vs. healthy or active TB vs. latent TB) rather than systematic multi-group analyses that consider all three clinical conditions (active TB, latent TB, and healthy controls) together. For example, recent studies by Herrera et al. (2022) [13] and Vargas et al. (2023) [14] included multi-group cohorts, but their reliance on sequential pairwise comparisons (e.g., DESeq2 between active and latent) does not exploit simultaneous statistical models designed to screen for stage-specific signatures. Second, transcriptomic data are high-dimensional (>20,000 genes) while sample sizes are small (usually fewer than 200 samples), which introduces bias in feature selection, promotes overfitting, and reduces generalizability [15,16,17]. Third, previous multi-group meta-analyses (e.g., Sambarey et al. (2017) [8,9] and Wang et al. (2019) [18]) used heterogeneous datasets but did not apply rigorous statistical workflows (e.g., ANOVA + Tukey’s HSD) to identify disease-stage-specific biomarkers.

The present paper fills in these gaps using the following:

Multi-group analysis: ANOVA (p < 0.001) and Tukey’s HSD post hoc analysis are applied together to identify biomarkers specific to active TB rather than to the latent or healthy states, reducing the false discovery risk of iterative pairwise analyses.
Machine-learning (ML)-based feature selection: Boruta-XGBoost and LASSO regularization are combined to rank robust biomarkers in high-dimensional data, overcoming the weaknesses of univariate statistical methods.
Cross-cohort validation: External validation in GSE19444 supports generalizability and addresses the reproducibility problems of previous single-cohort research.

Our aim is therefore not only to discover a strong diagnostic signature but also to ensure that it consists of genes with clearly defined roles in TB pathogenesis and practical relevance, closing the gap between diagnosis and therapy.

1.3. Study Objectives and Methodology

This paper combines transcriptomics with machine learning (ML) to determine a molecular signature that differentiates active TB, latent TB, and healthy controls and to improve our understanding of TB pathogenesis. The objectives are as follows:

Multi-group differential expression model: Compare gene expression across the three clinical states using ANOVA (p < 0.001) and Tukey’s HSD post hoc testing, which overcomes the limitations of binary comparisons.
ML pipeline optimized for biomarker selection: Use Boruta-XGBoost to select the optimal features (biomarkers) and LASSO regularization to rank them, using GSE19439 for discovery and GSE19444 for validation.
Functional annotation: Link the screened biomarkers to TB immunopathology and therapeutic targets.

Differentially expressed genes (DEGs) were identified from the GSE19439 transcriptomic data (ANOVA, p < 0.001, log₂FC > 1). Combining Boruta-XGBoost with LASSO feature selection produced a four-gene panel (TAP2, SORT1, WARS, and ANKRD22). Supervised classifiers (Random Forest and SVM) performed well (AUC: 0.89) and were externally validated with GSE19444. Functional enrichment linked the biomarkers to antigen presentation (TAP2), lipid metabolism (SORT1), interferon response (WARS), and inflammasome response (ANKRD22), providing a mechanistic explanation of immune-metabolic dysregulation.

1.4. Main Contributions

The main contribution of this work is a mechanistically informed, dual-purpose framework that advances both TB diagnostics and therapeutic insight. Specifically, we provide the following:

A curated, mechanistically linked panel: A proposed minimal four-gene signature (TAP2, SORT1, WARS, and ANKRD22) in which each biomarker maps to a core, therapeutically relevant host pathway (antigen presentation, immunometabolism, interferon response, and inflammasome activation), yielding a diagnostic that is also a mechanistic map.
Strong cross-cohort validation: The diagnostic performance of these biomarkers was tested in the GSE19439 data, giving an AUC of 0.9911 (95% CI: 0.983–0.997). The expression dynamics were also validated in the GSE19444 cohort, which showed significant differences in expression between the clinical states (ANOVA, p < 0.001).
Fundamental functionality: We developed a deployable ML pipeline for TB staging intended for resource-constrained environments, to promote access to high-quality diagnostics.
Mechanistic insights: Functional annotation connects the biomarkers with possible therapeutic targets and reveals new immune–metabolic interactions in TB progression.
Improved diagnostic performance: Our four-gene signature meets the WHO Target Product Profile criteria for non-sputum triage tests, with a sensitivity of 90% (95% CI: 85.5–93.8%) and a specificity of 89.47% (95% CI: 84.2–93.5%). This performance suggests that the signature could be developed into a rapid non-sputum triage tool. A validated version of such a tool could help fill critical diagnostic gaps in high-burden, resource-limited clinical environments.

These contributions address gaps in biomarker discovery, diagnostic accuracy, and mechanistic understanding, and provide diagnostic tools to stratify TB and improve its treatment.

This article is structured as follows: Section 1 presents the global burden of tuberculosis (TB), reviews the shortcomings of existing diagnostic tools (e.g., low sensitivity in paucibacillary TB), and explains why transcriptomic profiling and machine learning (ML) should be combined to detect stage-specific biomarkers, streamline ML pipelines, and annotate therapeutic pathways. Section 2 summarizes progress in TB biomarker discovery, ML in the diagnosis of infectious diseases (e.g., feature selection and ensemble classifiers), and the functional roles of TAP2 (antigen presentation), SORT1 (lipid metabolism), WARS (interferon response), and ANKRD22 (inflammasome activation) in TB pathogenesis. Section 3 describes the methodology: the transcriptomic datasets (GSE19439 and GSE19444) and preprocessing procedures (normalization and batch correction); differential expression analysis (volcano plots, ANOVA, and Tukey’s HSD); feature selection through correlation filtering, Boruta-XGBoost, and LASSO regularization; and classifier (XGBoost, SVM, and Stacking Classifier) training and validation (AUC-ROC, accuracy, and F1-score). Section 4 presents the results, including the identification of the four-gene panel (TAP2, SORT1, WARS, and ANKRD22), its diagnostic properties, and a comparison with existing biomarker panels. Section 5 discusses clinical implications, highlighting the four-gene panel’s superiority over conventional diagnostic tools and the mechanistic understanding of immune-metabolic dysregulation as a therapeutic target. Section 6 concludes with the major findings, limitations (e.g., sample size), and recommendations for future work (e.g., in vitro validation of biomarker functions in TB progression).

Figure 1 illustrates the adopted framework, which combines multi-group transcriptomics (active TB, latent TB, and healthy controls) with machine learning: ANOVA-based differential expression analysis, Boruta-XGBoost/LASSO feature selection, and cross-cohort validation are used to prioritize biomarkers for active TB diagnosis and host-directed therapy.

2. Literature Review: Development of Transcriptomic Biomarkers to Diagnose Tuberculosis

The long-standing difficulty of differentiating active tuberculosis (ATB) from latent tuberculosis infection (LTBI) and other diseases (ODs) has prompted many studies of host blood transcriptomic signatures. This review summarizes important developments in the area, including the evolution toward minimal, high-performance gene panels that meet World Health Organization (WHO) diagnostic targets (Table 1).

Early pioneering research, including that by Berry et al. (2010) [19], identified a strongly interferon-inducible, neutrophil-driven transcriptional signature in ATB using a 393-transcript panel. Although this study offered important pathophysiological insights, its diagnostic value for distinguishing ATB from LTBI and healthy controls (HCs) was limited by the panel’s complexity and low sensitivity (61.67%). Follow-up studies focused on reducing signature size while preserving accuracy. Kaforou et al. (2013) [20] demonstrated the potential of a smaller panel, validating a 27-transcript signature with excellent discrimination (AUC 0.98, sensitivity 95%, and specificity 90% for ATB vs. LTBI) in both HIV-positive and HIV-negative African cohorts. The robustness of transcriptomic signatures was also confirmed by Anderson et al. (2014) [21], who used 51 genes to distinguish ATB from ODs (AUC 0.862) and 42 genes to distinguish ATB from LTBI (AUC 0.984) in large pediatric African cohorts, overcoming shortcomings of sputum-based tests. A trend then emerged toward highly economical gene panels suited to rapid PCR-based systems. In a multicohort meta-analysis, Sweeney et al. (2016) [22] identified a three-gene signature of GBP5, DUSP3, and KLF2 with strong global AUCs between 0.88 and 0.90, irrespective of HIV or BCG status. Laux da Costa et al. (2015) [4] went on to show that GBP5 and CD64 (with GZMA) alone were effective, with an AUC of 0.955, a sensitivity of 93%, and a specificity of 95% for ATB versus ODs. Lee et al. (2016) [23] reported a four-gene panel (NEMF, ASUN, DHX29, and PTPRC) in peripheral blood mononuclear cells (PBMCs), with which a Naive Bayes classifier distinguished ATB from LTBI with high accuracy (AUC of 0.979, sensitivity of 97.9%, and specificity of 98%). Gliddon et al. (2021) [10] validated a three-transcript signature of FCGR1A, ZNF296, and C1QB for ATB vs. LTBI with an AUC of 0.973, a sensitivity of 95%, and a specificity of 85%. Four-gene signatures have also been developed. Maertzdorf et al. (2016) [24] were the first to validate a four-gene signature separating ATB from LTBI and HCs across cohorts with high accuracy (AUC 0.98), but its specificity (75%) was below WHO targets. Sambarey et al. (2017) [8] applied network mining to extract a 10-gene signature differentiating ATB from LTBI, HCs, and ODs, which highlighted the importance of immune-related markers, including FCGR1A.

As studies have advanced, interest has grown in filling specific clinical gaps, particularly the discrimination of active from latent TB. Wang et al. (2019) [18] identified a three-gene PBMC signature (TNFRSF10C, EBF3, and A2ML1) with a sensitivity of 82.4%, a specificity of 92.4%, and an AUC of 0.806 for distinguishing ATB from LTBI and HCs. Natarajan et al. (2022) [25] discovered a seven-gene signature, including FCGR1B, ANKRD22, and IFITM3, with an area under the curve (AUC) between 0.84 and 1.00 for discriminating ATB from LTBI across cohorts.

Further, Leong et al. (2018) [26] and Bayaa et al. (2018) [27] established the effectiveness of existing signatures (including RISK6, a six-gene signature) for differentiating ATB from LTBI, with an AUC of 0.984 in South Indian cohorts and an AUC of 0.93, a sensitivity of 90.9%, and a specificity of 88.5% in multi-ethnic cohorts.

Specificity against other diseases also became a focal point of research. Both Laux da Costa et al. (2015) [4] and Kaforou et al. (2013) [20] addressed the problem of distinguishing ATB from other diseases. Point-of-care (POC) testing has also become a focus: Maertzdorf et al. (2016) [24] and Sutherland et al. (2022) [28] developed signatures (a four-gene signature and a three-gene “TB score”) through the Cepheid fingerstick test with POC application in mind. These studies reported promising AUCs of 0.98 and 0.94, respectively, with sensitivities (88% and 87%) exceeding or close to the WHO Target Product Profile (TPP) requirement, though specificity (75% and 94%) still needs to be increased.

More recent works have used advanced machine learning (ML) algorithms to improve feature selection and model construction. Luo et al. (2022) [29] used a cforest model to combine T-SPOT findings and lymphocyte data (eight features), obtaining an AUC of 0.978, a sensitivity of 93.39%, and a specificity of 91.18% for ATB vs. LTBI. Xie et al. (2024) [30] applied LASSO and Random Forest to GEO data and found new biomarkers of LTBI, including MORN3 and LLGL2, with an AUC of 0.994 between ATB and LTBI. Ren et al. (2025) [31] concentrated on autophagy-related genes (CASP1, FAS, TRIM5, and C5) and used support vector machine methods to attain an AUC of 0.99 between ATB and healthy controls and an AUC of 0.86 between ATB and LTBI.

Despite these developments, major challenges remain. The sensitivity–specificity trade-off is foremost: numerous minimal gene panels, including those by Maertzdorf et al. [24] and Wang et al. [18], and the two-gene panel by Perumal et al. (2021) [32] (sensitivity of 90.48% but specificity of 71.43% for ATB vs. LTBI), fail to achieve the WHO optimal targets, which require a sensitivity of at least 90% and a specificity of at least 80% for triage. The robustness of these signatures across cohorts is a further challenge; for example, the performance of signatures such as those published by Xie et al. [30] may decrease in heterogeneous groups. Finally, not all diagnostic signatures are well linked to disease biology, which restricts the therapeutic inferences that can be drawn from them.

The present study aims to fill these gaps by combining multi-group transcriptomics (ATB, LTBI, and HCs) with improved machine learning models, namely Boruta-XGBoost, LASSO, and a stacking model. It establishes a new four-gene signature panel (TAP2, SORT1, WARS, and ANKRD22) with an AUC of 0.9911, a sensitivity of 90%, and a specificity of 89.47%. The signature exceeds the WHO optimal specificity threshold, satisfies the sensitivity criterion, and, most importantly, links diagnostic and mechanistic insights.

The genes of this signature have important biological implications: TAP2 points to dysregulated MHC-I antigen presentation, ANKRD22 to dysregulated NOD-like receptor/inflammasome signaling, SORT1 to altered lipid trafficking and metabolism, and WARS to disruption of the interferon pathway, as it encodes tryptophanyl-tRNA synthetase, which is induced by IFN-gamma. This biologically grounded four-gene signature (TAP2, SORT1, WARS, and ANKRD22) offers high diagnostic accuracy for ATB/LTBI stratification within a minimal gene framework compatible with point-of-care (POC) platforms, while also revealing candidate targets for host-directed therapies. It outperforms modern models, with an AUC that is 37 percent higher than that of Wang et al. [18] and 1.3 percent higher than that of Kaforou et al. [20], which supports WHO-compliant reductions of 40–60 percent in unnecessary confirmatory testing.

Table 1. Comparative analysis of different machine-learning-based literature on identifying transcriptomic biomarkers to distinguish between active tuberculosis (ATB) and latent TB infection (LTBI), as well as other diseases (ODs).

Study	Statistical Model	Indication	Number of Genes	Sensitivity	Specificity	AUC
Berry et al., 2010 [19]	K-nearest neighbors	ATB vs. LTBI and HCs	393	61.67	93.75	N/A
Berry et al., 2010 [19]	K-nearest neighbors	ATB vs. ODs	86	92	83	N/A
Kaforou et al., 2013 [20]	Difference of means	ATB vs. LTBI	27	95	90	0.98
Kaforou et al., 2013 [20]	Difference of means	ATB vs. ODs	44	93	88	0.95
Anderson et al., 2014 [21]	Difference of sums	ATB vs. LTBI	42	96	91	0.984
Anderson et al., 2014 [21]	Difference of sums	ATB vs. ODs	51	74	78	0.862
Laux da Costa et al., 2015 [4]	Random Forest	ATB vs. ODs	3	93	95	0.955
Lee et al., 2016 [23]	Naive Bayes	ATB vs. LTBI	3	97.9	98	0.979
Maertzdorf et al., 2016 [24]	Random Forest	ATB vs. LTBI and HCs	4	88	75	0.98
Sweeney et al., 2016 [22]	Difference of geometric means	ATB vs. LTBI and ODs and HCs	3	82	79	0.88
Sambarey et al., 2017 [8]	Linear discriminant analysis	ATB vs. LTBI and HCs and ODs	10	89.67	81.0	N/A
Leong et al., 2018 [26]	Ridge logistic regression	ATB vs. LTBI	24	93.07	94.5	0.9840
Bayaa et al., 2018 [27]	LASSO	ATB vs. HCs	6	90.9	87.8	0.94
Bayaa et al., 2018 [27]	LASSO	ATB vs. LTBI	6	90.9	88.5	0.93
Wang et al., 2019 [18]	Decision Tree	ATB vs. LTBI and HCs	3	82.4	92.4	0.806
Gliddon et al., 2021 [10]	Disease Risk Score Method	TB/LTBI	3	95	85	0.973
Gliddon et al., 2021 [10]	Disease Risk Score Method	TB/OD	3	95	85	0.938
Perumal et al., 2021 [32]	Simple arithmetic algorithms	HCs vs. ATB	2	90.48	66.67	0.9048
		HCs/LTBI vs. ATB	2	90.91	71.43	0.8615
		HCs vs. LTBI	2	91.67	23.81	0.5357
		LTBI vs. ATB	2	90.48	71.43	0.8367
Natarajan et al., 2022 [25]	N/A	ATB vs. LTBI	7	80–100	80–95	0.84–1.00
Sutherland et al., 2022 [28]	Mann–Whitney U tests	TB vs. ORD	3	87	94	0.88
Luo et al., 2022 [29]	Cforest	ATB vs. LTBI	8	93.39	91.18	0.978
Xie et al., 2024 [30]	LASSO/Random Forest	ATB vs. LTBI	2	--	--	0.994
		ATB vs. HCs	2	--	--	0.782
		LTBI vs. HCs	2	--	--	0.914
Ren et al., 2025 [31]	Support Vector Machine	ATB vs. LTBI	4	--	--	0.86
Ren et al., 2025 [31]	Support Vector Machine	ATB vs. HCs	4	--	--	0.99
This study (2025)	Voting Classifier	ATB vs. LTBI and HCs	4	90	89.47	0.9911

Remark: Values are the sensitivity, specificity, and area under the curve (AUC) reported by each study; values that were not available are marked N/A or “--”. The studies differ in statistical model and gene panel size for distinguishing active tuberculosis (ATB), latent TB infection (LTBI), and other diseases (ODs).

3. Materials and Methods

We used a multi-group transcriptomic and machine learning framework to identify biomarkers specific to active tuberculosis (ATB). We analyzed the GSE19439 dataset, preprocessed with RMA normalization and ComBat batch correction. Differential expression analysis used ANOVA with post hoc testing to identify genes specific to each clinical state. To refine this gene set, we applied a hybrid feature selection method (Boruta-XGBoost + LASSO), which yielded a set of four core genes. The diagnostic performance of this signature was then assessed and externally validated. Lastly, functional enrichment analysis established biological relevance by associating the signature with major pathways dysregulated in active TB.

3.1. Dataset for Gene Biomarker Discovery and Validation

In the current analysis we used transcriptomic data from the GSE19439 dataset (GEO accession: GSE19439) [19], which involves 120 samples profiled on Illumina microarrays. Patient groups were defined using rigorous criteria: active TB cases (n = 42) were culture-confirmed pulmonary TB (pretreatment), latent TB cases (n = 36) were positive by both the tuberculin skin test (TST) and interferon-gamma release assay (IGRA), with no evidence of active disease, and healthy controls (n = 40) were negative for both TST and IGRA with no TB exposure. The identified biomarkers (predictors) were externally validated using an independent cohort (GEO: GSE19444) [19], which employed identical diagnostic criteria. Raw probe intensities were normalized with the Robust Multi-Array Average (RMA) algorithm [33] and corrected for batch effects using ComBat [34] to address technical variability.

The pwr package in R (version 4.2.2) [35] was used to perform a post hoc power analysis based on the observed effect sizes of the four-gene signature. With the sample sizes given above and an alpha of 0.05, the analysis indicated a statistical power above 0.95 to detect these patterns of differential expression between the active TB and the latent TB/control groups. This suggests that the study was sufficiently powered for the initial classification task, although power estimates in high-dimensional settings have intrinsic limitations and must be interpreted cautiously [36].

3.2. Data Preprocessing

3.2.1. Normalization and Transformation

The microarray transcriptomic data underwent sequential preprocessing to ensure comparability and reduce technical artifacts. RMA normalization provided background correction, quantile normalization, and summarization to log₂-scale expression values [33]. An inverse hyperbolic sine (arsinh) transformation was then applied to stabilize variance and reduce the biasing effect of background noise [37].

3.2.2. Scaling Method Comparison and Justification

Following transformation, features (genes) were centered and scaled. We selected RobustScaler for its resistance to outliers:

X_{scaled} = \frac{X - Median (X)}{IQR (X)},

(1)

where IQR = Q3 − Q1 (interquartile range). This limits the influence of outliers while still representing the biological signal accurately [38].

To check that this choice did not bias biologically meaningful expression patterns, we compared it with two alternative procedures applied to the same preprocessed (RMA + ComBat) data: standard z-score normalization (mean-centering and scaling by the standard deviation) and quantile normalization (applied after ComBat). For each approach, we measured (1) how reproducible the expression profile of our four-gene signature was across all samples, and (2) the performance of the Voting Classifier trained only on these four genes under the same nested cross-validation scheme. As reported in Section 4.8, the expression patterns of the signature were almost perfectly correlated across methods (r > 0.995), and classification performance was also consistent (AUC: 0.9895–0.9911), confirming that our findings are robust to the scaling method.
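As an illustration of Equation (1), the following minimal sketch applies an arsinh transformation followed by median/IQR scaling to a single feature using only the Python standard library. It is not the authors' implementation: the function names, the example values, and the quartile convention (`method="inclusive"`) are assumptions made for this example.

```python
import math
import statistics


def arsinh_transform(values):
    """Apply the inverse hyperbolic sine element-wise."""
    return [math.asinh(v) for v in values]


def robust_scale(values):
    """Centre on the median and divide by the interquartile range (Eq. 1)."""
    med = statistics.median(values)
    # Assumption: quartiles by linear interpolation over the observed range.
    # Other quartile conventions give slightly different IQR values.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the feature cannot be scaled")
    return [(v - med) / iqr for v in values]


# Hypothetical expression values for one gene, including one outlier.
expression = [7.1, 7.4, 7.9, 8.2, 8.6, 9.0, 14.5]
scaled = robust_scale(arsinh_transform(expression))
```

Because the median and IQR depend only on the central half of the data, the single outlier in this example changes the scaling far less than it would under mean and standard deviation scaling.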

3.2.3. Data Splitting

The processed GSE19439 data were split into training (60%, n = 71) and testing (40%, n = 47) sets via stratified sampling, ensuring at least two samples per clinical class in the test set [39].
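A stratified 60/40 split of this kind can be sketched as follows. This is a dependency-free illustration, not the authors' code; the function name, the random seed, and the per-class rounding rule are assumptions. With the group sizes reported in Section 3.1 (42, 36, and 40), per-class rounding gives 71 training and 47 test samples.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.6, seed=0):
    """Return (train_idx, test_idx) with class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumption: each class is rounded to the nearest whole sample.
        n_train = round(train_fraction * len(indices))
        train_idx.extend(indices[:n_train])
        test_idx.extend(indices[n_train:])
    return sorted(train_idx), sorted(test_idx)


# Class labels only; group sizes as reported for GSE19439 in Section 3.1.
labels = ["ATB"] * 42 + ["LTBI"] * 36 + ["HC"] * 40
train_idx, test_idx = stratified_split(labels)
```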

3.3. Differential Expression Analysis

Volcano plots were used to identify the differentially expressed genes (DEGs) in three pairwise comparisons (active vs. control, active vs. latent, and latent vs. control). Genes were considered significant when their Benjamini–Hochberg adjusted p-value was < 0.05 [40] and their absolute log₂ fold change was > 1 [41], which corresponds to at least a two-fold difference in expression.

Genes were classified as follows: upregulated (red) when log₂FC > 1 and adjusted p < 0.05 for the first group of a comparison (e.g., active TB in active TB vs. control); downregulated (blue) when log₂FC < −1 and adjusted p < 0.05; and non-significant (gray) when these thresholds were not met. This approach focuses on biologically relevant changes and reduces false discoveries, which is essential for biomarker discovery in heterogeneous cohorts.
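The thresholding described above can be sketched as a Benjamini–Hochberg adjustment followed by a three-way call per gene. This is a minimal illustration under stated assumptions: the gene labels, fold changes, and p-values are hypothetical, and the function names are introduced only for this example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def classify(log2_fc, adj_p, fc_cut=1.0, alpha=0.05):
    """Label a gene as 'up', 'down', or 'ns' (non-significant)."""
    if adj_p < alpha and log2_fc > fc_cut:
        return "up"
    if adj_p < alpha and log2_fc < -fc_cut:
        return "down"
    return "ns"


# Hypothetical values for one pairwise comparison.
genes = ["G1", "G2", "G3", "G4"]
log2_fc = [2.3, -1.6, 0.4, 1.2]
raw_p = [0.005, 0.01, 0.03, 0.04]

adj_p = benjamini_hochberg(raw_p)
calls = {g: classify(fc, p) for g, fc, p in zip(genes, log2_fc, adj_p)}
```

In this toy comparison G3 has a significant adjusted p-value but a fold change below the cutoff, so it is labeled non-significant, which is the behavior the two-threshold rule is meant to produce.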

3.4. Multi-Group Comparisons and Post Hoc Testing

A one-way ANOVA was conducted to identify genes differentially expressed across the three clinical states (active TB, latent TB, and healthy controls) at a significance level of p < 0.001. This stringent cutoff was selected to minimize false positives in high-dimensional genomic data, where multiple testing poses substantial challenges [40]. ANOVA identifies global expression differences while controlling the Type I error rate (false positives) in a multi-group design [42]. The F-statistic was computed as:

F = \frac{Between-group variance}{Within-group variance} = \frac{S S_{between} / (k - 1)}{S S_{within} / (N - k)} .

(2)

Here, k = number of groups, SS = sum of squares, and N = total number of samples. Genes with significant ANOVA results (p < 0.001, Benjamini–Hochberg adjusted) were then subjected to Tukey’s Honest Significant Difference (HSD) post hoc test, with family-wise error rate (FWER) control at α = 0.05 [43], to verify pairwise differences between the three clinical states (e.g., active vs. latent). The test statistic was computed as:

q = \frac{\bar{X_{i}} - \bar{X_{j}}}{\sqrt{MSE / n}} .

(3)

Here, n = number of individuals in each group, \bar{X_{i}} and \bar{X_{j}} = group means, and MSE = mean squared error from the ANOVA. Only genes that met both the ANOVA (p < 0.001) and Tukey’s HSD (p < 0.05) criteria were retained. This two-step approach ensures the following:

Statistical robustness: Control of false-positive discoveries (Type I error) in high-dimensional data.
Biological specificity: The omnibus ANOVA shows that a global difference exists, while the post hoc Tukey’s HSD test shows exactly which clinical states differ from each other, so the resulting expression pattern is stage-specific.
Sensitivity analysis of ANOVA threshold: To assess the robustness of our very stringent cutoff, we conducted a sensitivity analysis with less stringent thresholds (p < 0.01 and p < 0.05), in line with recommendations for rigorous statistical testing [44]. The findings (Supplementary Table S1) showed that (1) the four-gene signature was always among the 20 most significant genes at all thresholds, (2) the machine learning pipeline always selected the same four genes, and (3) model performance (AUC) was very good (>0.985) at all thresholds. This supports the robustness of our biomarker selection strategy. The threshold of p < 0.001 gives an optimal balance of statistical rigor and clinical applicability for initial biomarker discovery.
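Equations (2) and (3) can be checked on a small worked example. The sketch below computes the F-statistic and the studentized range statistic q for a single gene using only the standard library; the data are hypothetical and the function names are assumptions. It returns the statistics only: the corresponding p-values require the F and studentized range distributions, which are not implemented here, and unequal group sizes would call for the Tukey–Kramer form rather than the equal-n form of Equation (3).

```python
import math
from statistics import mean


def one_way_anova(groups):
    """Return (F, MSE) for a list of groups of observations (Eq. 2)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    mse = ss_within / (n_total - k)
    return ms_between / mse, mse


def tukey_q(group_i, group_j, mse):
    """Studentized range statistic for two equal-sized groups (Eq. 3)."""
    n = len(group_i)
    if len(group_j) != n:
        raise ValueError("Eq. 3 as written assumes equal group sizes")
    return (mean(group_i) - mean(group_j)) / math.sqrt(mse / n)


# Hypothetical expression values of one gene in three clinical groups.
control = [1.0, 2.0, 3.0]
latent = [2.0, 3.0, 4.0]
active = [5.0, 6.0, 7.0]

f_stat, mse = one_way_anova([control, latent, active])
q_active_vs_control = tukey_q(active, control, mse)
```

For these values the group means are 2, 3, and 6, the between-group sum of squares is 26, and the within-group sum of squares is 6, so F = (26/2)/(6/6) = 13 and q = 4/√(1/3) ≈ 6.93.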

3.5. Multiple Testing Correction Strategy

In our study, we employed different multiple testing correction approaches, as appropriate for each analytical context. For volcano plots (pairwise comparisons), we used Benjamini–Hochberg false discovery rate (FDR) correction with an adjusted p-value threshold of 0.05. For the one-way ANOVA (multi-group comparison), we applied Benjamini–Hochberg FDR correction with an adjusted p-value threshold of 0.001. For Tukey’s HSD post hoc tests (pairwise comparisons following ANOVA), we used family-wise error rate (FWER) correction at α = 0.05. This strategy employs FDR correction for high-dimensional screening (where many tests are performed) and FWER for controlled pairwise comparisons following significant ANOVA results, aligning with established statistical recommendations for multi-stage analysis [45].

3.6. Group-Specific DEG Categorization of Validated Genes

Rigorous set operations were used to stratify differentially expressed validated genes into three mutually exclusive categories:

Active-specific DEGs: Genes upregulated in active TB vs. both control and latent TB (p < 0.05, log₂FC > 1).
Latent-specific DEGs: Genes upregulated in latent TB vs. control (p < 0.05, log₂FC > 1) and downregulated in active vs. latent TB (p < 0.05, log₂FC < –1).
Control-specific DEGs: Genes downregulated in both active vs. control and latent vs. control (p < 0.05, |log₂FC| > 1).

This approach ensured unambiguous biological interpretation while minimizing false assignments from generalized immune responses.
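The three categories above reduce to set intersections over the pairwise contrasts. The sketch below is one possible reading of those rules, assuming each contrast is available as a mapping from gene to (log₂FC, p); the function names and the toy genes are hypothetical.

```python
def passing(contrast, direction, p_cut=0.05, lfc_cut=1.0):
    """Genes passing the cutoffs in one contrast.

    `contrast` maps gene -> (log2 fold change, p-value);
    `direction` is "up" or "down".
    """
    sign = 1 if direction == "up" else -1
    return {gene for gene, (lfc, p) in contrast.items()
            if p < p_cut and sign * lfc > lfc_cut}


def categorize_degs(active_vs_control, active_vs_latent, latent_vs_control):
    """Assign genes to active-, latent-, or control-specific sets."""
    return {
        "active_specific": passing(active_vs_control, "up")
                           & passing(active_vs_latent, "up"),
        "latent_specific": passing(latent_vs_control, "up")
                           & passing(active_vs_latent, "down"),
        "control_specific": passing(active_vs_control, "down")
                            & passing(latent_vs_control, "down"),
    }


# Hypothetical contrasts for three placeholder genes.
avc = {"g1": (2.0, 0.001), "g2": (0.2, 0.5), "g3": (-1.5, 0.01)}
avl = {"g1": (1.8, 0.002), "g2": (-1.4, 0.01), "g3": (0.1, 0.9)}
lvc = {"g1": (0.1, 0.8), "g2": (1.6, 0.01), "g3": (-1.2, 0.02)}
categories = categorize_degs(avc, avl, lvc)
```

The categories are mutually exclusive by construction: any two of them require opposite directions in a shared contrast, so no gene can satisfy both.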

3.7. Machine Learning Pipeline

An end-to-end machine learning pipeline was used to identify the optimal biomarker panel and construct a diagnostic classifier for tuberculosis (TB) staging. A nested cross-validation framework was used to ensure rigorous evaluation and avoid information leakage. All feature selection steps (correlation filtering, Boruta-XGBoost, and LASSO regularization) were conducted only within the training folds of each outer cross-validation loop, which kept feature selection independent of the test data during model development. To quantify the uncertainty of the final performance metrics, we calculated 95% confidence intervals from 1000 bootstrap resamples of the held-out test set. The AUC, sensitivity, specificity, accuracy, and macro F1-score were recalculated on each bootstrap sample, and the reported confidence intervals are the 2.5th and 97.5th percentiles of the resulting distributions.
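The percentile bootstrap described above can be sketched as follows. This is a simplified illustration using accuracy as the metric and a nearest-rank approximation of the percentiles; the function names, the seed, and the toy labels are assumptions for the example only.

```python
import random


def accuracy(y_true, y_pred):
    """Proportion of correctly classified samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a test-set metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        # Resample test-set indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([y_true[i] for i in idx],
                            [y_pred[i] for i in idx]))
    stats.sort()
    # Nearest-rank approximation of the 2.5th and 97.5th percentiles.
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical labels: 0 = control, 1 = latent, 2 = active.
y_true = [0, 0, 1, 1, 2, 2, 2, 1, 0, 2]
y_pred = [0, 0, 1, 2, 2, 2, 1, 1, 0, 2]
ci_low, ci_high = bootstrap_ci(y_true, y_pred, accuracy)
```

Any metric with the same (y_true, y_pred) signature can be passed in place of accuracy, so the same resampling loop yields intervals for sensitivity, specificity, or macro F1.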

3.7.1. Feature Selection

To reduce dimensionality and select a minimal set of robust biomarkers, we employed a multi-step feature selection approach consisting of the following stages:

Correlation filtering: Features with an absolute Pearson correlation with the target of |r| < 0.1 were removed, and features with a pairwise correlation of |r| > 0.9 were pruned to avoid multicollinearity [46].
Boruta-XGBoost: This wrapper method iteratively identified stable features using XGBoost’s gain-based importance [47]. Features were deemed significant if their importance exceeded the maximum importance of shadow features (permuted copies) across 100 iterations.
LASSO regularization: The Least Absolute Shrinkage and Selection Operator (LASSO) was used to refine the signature panel further. It employs an L1-penalized objective function (see Supplementary Methods Equation (S1)) to induce sparsity and thereby select a minimal set of robust predictive gene biomarkers. The regularization parameter (α = 0.01) was selected by grid search [48,49].
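The correlation filtering stage can be sketched as below. Two details are assumptions made only for this example, since the text does not specify them: the clinical state is encoded numerically for the correlation with the target, and when two features are highly correlated the one more strongly correlated with the target is kept. Feature names and values are hypothetical, and features are assumed to be non-constant.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_filter(features, target, min_target_r=0.1, max_pair_r=0.9):
    """Return the names of features kept after both correlation filters.

    `features` maps feature name -> list of values (one per sample).
    """
    relevance = {name: abs(pearson(vals, target))
                 for name, vals in features.items()}
    # Step 1: drop features that are barely correlated with the target.
    candidates = [name for name in sorted(relevance, key=relevance.get,
                                          reverse=True)
                  if relevance[name] >= min_target_r]
    # Step 2: greedily keep a feature only if it is not near-collinear
    # with any feature already kept (assumed tie-break: higher relevance).
    kept = []
    for name in candidates:
        if all(abs(pearson(features[name], features[k])) <= max_pair_r
               for k in kept):
            kept.append(name)
    return kept


# Hypothetical data: 0 = control, 1 = latent, 2 = active.
target = [0, 0, 1, 1, 2]
features = {
    "g1": [1, 2, 3, 4, 5],
    "g2": [2, 4, 6, 8, 10],      # collinear with g1
    "g3": [1, -1, 1, -1, 1],     # weakly related to the target
    "noise": [1, -1, 1, -1, 0],  # uncorrelated with the target
}
kept = correlation_filter(features, target)
```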

3.7.2. Model Training and Evaluation

Six supervised classifiers were trained and compared to identify the best model for discriminating between active TB, latent TB, and healthy controls:

Ensemble methods: XGBoost (Extreme Gradient Boosting) [50], Random Forest (RF) [51,52], and Gradient Boosting [53].
Kernel-based method: Support Vector Machine (SVM) [54].
Ensemble of ensembles: A Stacking Classifier (RF + SVM) [55] and a Voting Classifier (RF + XGBoost) [55].
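To illustrate how a voting ensemble such as the RF + XGBoost Voting Classifier listed above can combine its members, the sketch below averages per-class probabilities and picks the most probable class. It assumes soft voting, which the text does not state, and the probabilities are hypothetical; scikit-learn's VotingClassifier offers both hard and soft voting.

```python
def soft_vote(prob_sets, classes):
    """Average class probabilities across models and return the top class.

    `prob_sets` holds one probability vector per model, each ordered
    like `classes`.
    """
    n_models = len(prob_sets)
    averaged = [sum(probs[i] for probs in prob_sets) / n_models
                for i in range(len(classes))]
    best = max(range(len(classes)), key=averaged.__getitem__)
    return classes[best], averaged


classes = ["active", "latent", "control"]
# Hypothetical outputs of two base models for one sample.
rf_probs = [0.6, 0.3, 0.1]
xgb_probs = [0.2, 0.7, 0.1]
label, averaged = soft_vote([rf_probs, xgb_probs], classes)
```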

The fundamental principles of these algorithms (e.g., gradient boosting, margin maximization, and bagging) are elaborated in the Supplementary Methods (Equations (S2)–(S7)). The standard metrics for model performance were as follows:

Accuracy [46]: Proportion of correctly classified samples.
F1-Score [49]: The harmonic mean of precision and recall, assessed via macro-averaging across all three classes.
ROC-AUC (macro-averaged): The area under the receiver operating characteristic curve, averaged across all classes to provide a reliable measure of separability in a multi-class setting [49,56].

The model with the highest and most consistent performance on these metrics was chosen as the diagnostic classifier.
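As a small worked example of the macro-averaged F1-score defined above, the sketch below computes the per-class F1 and averages it with equal weight per class. The labels are hypothetical and the function is illustrative only.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is absent.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels: A = active, L = latent, C = control.
y_true = ["A", "A", "L", "L", "C", "C"]
y_pred = ["A", "L", "L", "L", "C", "C"]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally, a small class that is classified poorly lowers the macro F1 as much as a large one would.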

3.8. Biomarker Validation and Visualization

The identified biomarkers (TAP2, SORT1, WARS, and ANKRD22) were statistically revalidated and their expression patterns were visualized using the independent cohort GSE19444. Differential expression across all three clinical states was confirmed using multi-group ANOVA [42] (p < 0.001), with Tukey’s HSD post hoc test [43] validating stage-specific pairwise differences (such as TAP2 log₂FC = 2.8 in active vs. control and ANKRD22 log₂FC = 3.1 in active vs. latent).

For visualization, Kernel Density Estimation (KDE) plots generated using seaborn [57], with a bandwidth of 0.5, showed non-overlapping expression distributions, underlining the discriminatory power of the gene biomarkers. Furthermore, hierarchical heatmaps were produced using the Scanpy workflow [58], which involved normalization to 10,000 reads per sample, log transformation (pp.log1p), and Ward linkage clustering. A viridis colormap illustrated the consistent upregulation of biomarkers such as ANKRD22 in active TB samples, visually reinforcing the statistical results.
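The per-sample normalization and log transformation can be written out explicitly. The sketch below scales one sample so that its values sum to 10,000 and then applies log(1 + x); it is meant to mirror what Scanpy's pp.normalize_total (with target_sum=1e4) and pp.log1p do, under the assumption that those are the calls behind the workflow described above. The input values are hypothetical.

```python
from math import log1p


def normalize_log1p(values, target_sum=10_000):
    """Scale one sample to a fixed total, then apply log(1 + x)."""
    total = sum(values)
    return [log1p(v * target_sum / total) for v in values]


# Hypothetical raw values for three genes in one sample.
transformed = normalize_log1p([1, 1, 2])
```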

3.9. Functional Enrichment Analysis

Functional enrichment analysis of the four biomarkers was accomplished via three complementary frameworks: Gene Ontology (GO), KEGG pathways, and Reactome pathways. GO term enrichment was conducted using DAVID [59,60], with an EASE score threshold (modified Fisher’s exact test, p < 0.05), an FDR-adjusted p < 0.05, and a minimum of 5 genes per term. KEGG pathway analysis [61] applied an FDR < 0.05 with a minimum of 3 mapped genes, while ReactomePA used a hypergeometric test (p < 0.01) and FDR < 0.05.

This integrated methodology combined functional annotation (GO), signaling pathways (KEGG), and mechanistic insight (Reactome) to contextualize the biomarkers within the various immune regulation, pathogen recognition, and host structural remodeling pathways pertinent to TB pathogenesis.
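The hypergeometric test used for over-representation analysis asks how likely it is to see at least k annotated genes in a list of n genes drawn from a background of N genes, of which K carry the annotation. The sketch below computes that upper-tail probability exactly; the parameter names and the numbers are hypothetical and unrelated to the study's gene lists.

```python
from math import comb


def hypergeom_enrichment_p(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N background, K annotated, n drawn)."""
    upper = min(K, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i)
                    for i in range(k, upper + 1))
    return favorable / comb(N, n)


# Hypothetical example: 10 background genes, 4 annotated to a pathway,
# 3 genes in the query list, at least 1 of them annotated.
p_value = hypergeom_enrichment_p(N=10, K=4, n=3, k=1)
```

With very short gene lists the attainable p-values are coarse, which is one reason such tests are usually paired with FDR correction and minimum gene-count filters like those stated above.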

4. Results

In the present study, we identified a minimal four-gene biomarker panel (TAP2, SORT1, WARS, and ANKRD22) that accurately distinguishes the active TB clinical state and maps to key host response pathways relevant to TB:

Active TB-specific transcriptional changes revealed by differential expression: Multi-group analysis identified several dysregulated genes in active TB (log₂FC > 1, FDR < 0.05); however, few dysregulated genes were observed between latent TB and controls.
Feature-selection-based refinement of significant dysregulated genes: A hybrid Boruta-XGBoost + LASSO pipeline selected the most robust biomarker signature, comprising a minimal set of four genes.
Diagnostic performance of the signature: A Voting Classifier fitted on this panel gave an AUC of 0.9911 (95% CI: 0.983–0.997; sensitivity 90.00% (95% CI: 85.5–93.8%) and specificity 89.47% (95% CI: 84.2–93.5%)) and correctly stratified the three clinical states.
External validation confirms robust expression: The expression of all four biomarkers was validated in another cohort (GSE19444), and all were significantly upregulated in active TB (ANOVA, Tukey’s HSD; p < 0.001).
Functional pathway mapping: Enrichment analysis associated each biomarker with a core dysregulated pathway: antigen presentation (TAP2), lipid metabolism (SORT1), interferon-gamma response (WARS), and inflammasome activation (ANKRD22).
Comparative performance: The signature aligns with the WHO triage test targets and compares favorably with available transcriptomic panels, indicating potential for diagnostic application and host-directed therapeutic insight.

4.1. Identification of Differentially Expressed Genes (DEGs) Across Clinical States

Differential expression comparison among active TB, latent TB, and healthy controls revealed distinct transcriptional signatures (log₂FC above 1 and unadjusted p below 0.05). The volcano plots in Figure 2A–C show profound immune dysregulation in the active disease state. The comparison of active TB versus healthy controls showed 100 upregulated and 50 downregulated genes. Among the most significant upregulated DEGs were TAP2 (log₂FC = 1.16, p = 1.11 × 10⁻⁵, q = 0.013) and ANKRD22 (log₂FC = 3.99, p = 2.84 × 10⁻⁴, q = 0.040).

Active vs. latent TB profiling demonstrated a distinct difference in the host response, with 338 upregulated and 101 downregulated genes. Prominent contributors to this difference were WARS (ILMN1727271) and SORT1 (ILMN1707077), both significantly upregulated (q < 0.02), which emphasizes the unique immune-metabolic environment of active disease. The transcriptional profile of latent TB deviated only minimally from that of healthy controls. Although we detected 34 upregulated and 231 downregulated genes, none of them passed FDR correction (q < 0.05), which points to an underlying biological similarity between latent infection and the uninfected state (healthy control).

4.2. Multi-Group Validation of Transcriptional Signatures

The transcriptional signatures from the pairwise comparisons were verified across all clinical states using a multi-group framework: ANOVA (p < 0.05) followed by Tukey’s HSD post hoc test (α = 0.05). The analysis established significant transcriptional differences in active TB, involving 78 upregulated and 35 downregulated DEGs in active vs. controls (e.g., TAP2/ILMN1759250 and SORT1/ILMN1707077) and 179 upregulated and 42 downregulated DEGs in active vs. latent TB (e.g., WARS/ILMN1727271 and ANKRD22/ILMN17998). The latent vs. controls comparison, in turn, did not reveal any significant changes: of the 6 upregulated and 21 downregulated candidate DEGs, none passed FDR correction (q > 0.05), which reinforces the biological similarity between latent infection and the healthy state (control).

The biomarkers TAP2, SORT1, WARS, and ANKRD22 showed consistent dysregulation: significant upregulation in both comparisons, active vs. controls (p < 0.001) and active vs. latent TB (p < 0.001), validated using Tukey’s HSD (e.g., ANKRD22 log₂FC = 3.99 and WARS log₂FC = 1.85). Genes that showed inconsistent expression (e.g., TLR5) were removed, indicating the ability of this framework to eliminate noise between groups. This multi-tiered approach yielded four high-confidence biomarkers for downstream modeling that were robust enough to serve as a core signature of the active TB state.

4.3. Transcriptome Analysis Discovers State-Specific Molecular Signatures

Multi-group validation with ANOVA (p < 0.001) and Tukey’s HSD post hoc test (α = 0.05) showed transcriptional signatures specific to each clinical state. The analysis revealed 78 DEGs specific to active TB, including notable upregulated genes such as TAP2 (ILMN1759250), WARS (ILMN1727271), and ANKRD22 (ILMN1799848), indicating substantial transcriptional dysregulation in active disease.

In contrast, the latent TB (e.g., ABCF2/ILMN2284941, IL28RA/ILMN1680805, and AL040642/ILMN1816035) and control states (e.g., BX111043/ILMN1915914 and BU566406/ILMN1896714) yielded only three specific DEGs. The overwhelming proportion of active-specific signatures highlights the strong dysregulation in active TB, while the minimal number of DEGs associated with latent TB and controls highlights the biological similarity between these two conditions.

4.4. Machine Learning Prioritizes Minimal Biomarker Panels with Clinical Utility

A hybrid feature selection strategy, combining Boruta’s stability selection (which retained eight candidates) with the sparsity constraint of LASSO, yielded a compact four-gene panel: TAP2, SORT1, WARS, and ANKRD22. The importance weights, given by the LASSO coefficients, showed that ANKRD22 (0.96) and WARS (0.81) were the most influential in driving the classification, as illustrated in Figure 3.

Of the six classifiers that were tested on TB staging, the Voting Classifier was found to be the best. It achieved an outstanding AUC of 0.9911 (95% CI: 0.983–0.997), with a sensitivity of 90.0% (95% CI: 85.5–93.8%) and a specificity of 89.47% (95% CI: 84.2–93.5%). The accuracy was 86.21% (95% CI: 81.0–90.5%) and the macro F1-score was 86.18% (95% CI: 81.0–90.3%). These confidence intervals, derived from 1000 bootstrap resamples of the test set, underscore the robustness of the performance. The Voting Classifier’s specificity was superior to that of Gradient Boosting and Random Forest, as shown in Figure 4.

The performance of our four-gene panel, as shown in the confusion matrix (Figure 5), meets the benchmarks of the WHO triage test. It achieved a sensitivity of 90.0% (95% CI: 85.5–93.8%), which is above the acceptable minimum of 80%, and a specificity of 89.47% (95% CI: 84.2–93.5%), surpassing the ideal threshold of 80%.

Practically, this resulted in one case of active TB misclassified as latent TB, and one case of latent TB misclassified as active TB. This accuracy profile satisfies the WHO Target Product Profile requirements for non-sputum-based triage tests. It is important to note that the panel also exhibited perfect classification of control samples (8/8 correct), indicating its robustness. The performance profile of the signature suggests it has the potential to be developed into a triage test that could efficiently screen high-risk individuals, possibly reducing the need for confirmatory testing by 40–60% in resource-constrained settings. This provides a rationale for further studies aimed at clinical translation.

4.5. Biomarker Validation Highlights Expression Dynamics

Assessment of the four machine-learning-prioritized biomarkers, namely, TAP2 (transporter involved in antigen presentation), SORT1 (protein sorting regulator), WARS (tryptophanyl-tRNA synthetase), and ANKRD22 (ankyrin repeat domain protein), confirmed clear overexpression in active tuberculosis (TB) across both the independent training (GSE19499) and validation (GSE19444) cohorts. Statistical analysis of the validation cohort showed that all biomarkers exhibit significant differential expression across clinical states (ANOVA, p < 0.001 for all). Post hoc analysis (Tukey’s HSD test) established higher expression in active TB than in latent TB and controls (all comparisons p < 0.001), but no significant differences between latent TB and controls (p > 0.78; Table 2).

Figure 6A–D visualizes these statistical patterns as box plots, showing a clear contrast in transcript levels between the groups. For SORT1, WARS, and ANKRD22, the differences in expression between active TB and the other groups were highly significant (p < 0.001 across all comparisons of interest). The corresponding comparisons for TAP2 were also significant (p < 0.001). Consistent with the functional quiescence of latent infection, no significant difference in the expression of any gene was observed between the latent TB and control groups, which is marked as NS (not significant) in Figure 6.

These transcriptional patterns were supported by heatmap visualization (Figure 7), which showed a distinct group of samples in which all four biomarkers (TAP2, SORT1, WARS, and ANKRD22) were uniformly overexpressed and which corresponded solely to the active TB cohort. Conversely, samples from the latent TB and healthy control groups tended to cluster together, with almost identical low-level expression patterns.

These four genes are biologically relevant to the known pathways of host response to tuberculosis:

TAP2: Its higher expression is in line with an increased demand for MHC-I antigen processing and presentation during active mycobacterial infection.
SORT1: Overexpression indicates possible dysregulation of the intracellular sorting of proteins and lipids, processes that are important for immune cell function and inflammation.
WARS: As an immunomodulatory tRNA synthetase, its increase is indicative of an enhanced interferon-mediated antimicrobial response.
ANKRD22: Its elevated levels suggest a role in controlling immune cell activation, which could be associated with the inflammasome or other innate signaling pathways that are active in the disease.

These findings are consistent across datasets, which underlines the robustness of these biomarkers in the classification of active TB. Their distinct expression dynamics point to roles in critical immunological processes, which can be useful for clinical staging and for understanding the mechanisms underlying TB pathogenesis.

4.6. Sensitivity Analysis Confirms Signature Robustness

To confirm the robustness of our statistical threshold selection, we performed a sensitivity analysis using alternative ANOVA thresholds (p < 0.01 and p < 0.05). The four-gene signature proved to be very stable at all thresholds, as shown in Supplementary Table S1. Although the number of genes that passed ANOVA varied (from 167 genes at p < 0.001 to 3417 genes at p < 0.05), TAP2, SORT1, WARS, and ANKRD22 were always among the top 20 most significant differentially expressed genes. The machine learning pipeline selected the same four genes regardless of threshold stringency, and model performance remained very high across thresholds (AUC: 0.9911 at p < 0.001, 0.9885 at p < 0.01, and 0.9852 at p < 0.05). These findings indicate that the identified signature reflects a robust biological signal rather than an artifact of statistical threshold selection.

4.7. Robustness of the Signature to Scaling Methodology

To check whether the choice of scaling method influenced biological interpretation, we compared RobustScaler with z-score and quantile normalization. First, the expression profiles of the four signature genes (TAP2, SORT1, WARS, and ANKRD22) were highly consistent across the scaling methods: the pairwise Pearson correlations of the expression values of each gene across all samples were above 0.995 (Supplementary Table S2). Second, the diagnostic performance of the Voting Classifier was highly consistent regardless of which scaling method was used to prepare the training and test data (Table 3). The key metrics, namely sensitivity (89.7–90.3%), specificity (88.9–89.5%), accuracy (85.5–86.2%), and macro F1-score (85.5–86.2%), differed only negligibly and showed substantial overlap in their 95% confidence intervals. This supports the view that the high discriminatory power of our signature is inherent to the biological signal and is not a result of a particular scaling algorithm. We therefore retained RobustScaler in our final ML pipeline due to its robustness to outliers.
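The following sketch contrasts median/IQR scaling, as performed by scikit-learn's RobustScaler with its default quantile range, with z-score scaling on a single gene. The values are hypothetical and include one outlier. Because both transforms are affine with a positive slope, the scaled values are perfectly correlated with each other; quantile normalization is not affine, so its agreement with the other two is an empirical matter that this sketch does not address.

```python
from math import sqrt
from statistics import mean, median, pstdev, quantiles


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    return [(v - med) / (q3 - q1) for v in values]


def z_score(values):
    """Center on the mean and divide by the population standard deviation."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x)
                      * sum((b - my) ** 2 for b in y))


# Hypothetical expression values for one gene, with a single outlier.
expr = [1, 2, 3, 4, 100]
robust = robust_scale(expr)
standard = z_score(expr)
r = pearson(robust, standard)
```

The outlier inflates the standard deviation used by the z-score but leaves the median and IQR unchanged, which is the usual motivation for preferring the robust variant.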

4.8. Comparative Performance of TB Diagnostic ML Models

The performance of our four-gene signature aligns with the WHO Target Product Profile targets for non-sputum triage tests, achieving a sensitivity of 90.0% (95% CI: 85.5–93.8%) and a specificity of 89.47% (95% CI: 84.2–93.5%). A comparison with modern transcriptomic models (Table 4 and Supplementary Figure S1) shows that our Voting Classifier achieved a high AUC of 0.9911 (95% CI: 0.983–0.997). The signature’s performance is comparable to or numerically exceeds that of other published panels. For instance, its AUC is higher than the 0.806 reported for a three-gene decision tree (Wang et al.) [18] and similar to the 0.979 from a three-gene Naive Bayes classifier (Lee et al.) [23]. As summarized in Supplementary Figure S1, our four-gene panel also demonstrates performance on par with larger signatures, such as the 27-gene panel by Kaforou et al. (AUC: 0.98) [20] and the 42-gene model by Anderson et al. (AUC: 0.984) [21].

The performance profile of our classifier suggests it could help resolve the diagnostic ambiguity between active and latent TB that is often encountered with traditional IGRAs, while also accurately classifying healthy controls. The achieved sensitivity of 90.0% (95% CI: 85.5–93.8%) and specificity of 89.47% (95% CI: 84.2–93.5%) are aligned with WHO Target Product Profile targets for a triage test. In a hypothetical screening scenario, this could potentially reduce unnecessary confirmatory testing by 40–60%. The combination of a minimal four-gene signature with a high AUC of 0.9911 (95% CI: 0.983–0.997) positions it as a promising candidate for further development as a transcriptomic triage tool, particularly in resource-constrained settings, where simpler, cost-effective diagnostics are needed.

4.9. Gene Ontology and Pathway Enrichment Analysis of the Four Key DEGs

Functional enrichment analysis of the four biomarkers using Gene Ontology, KEGG, and Reactome frameworks revealed significant enrichment in immune response pathways (false discovery rate, FDR < 0.05). Key findings include the following:

Antigen presentation pathways:
Functional enrichment analysis closely linked TAP2 to the MHC class I peptide loading pathway (GO:0042590, FDR = 2.1 × 10⁻⁵; Reactome R-HSA-983170, p = 7.8 × 10⁻⁶). This result is in line with its known role in the processing and presentation of mycobacterial antigens during active infection.
Interferon-mediated immunity:
In line with its immune activities, WARS was highly enriched in interferon-gamma signaling (GO:0060333, FDR = 3.4 × 10⁻⁴; Reactome R-HSA-877300, p = 1.2 × 10⁻⁵), which supports its role in antimicrobial defense.
Cellular protein trafficking:
Functional analysis associated SORT1 with lysosomal sorting and vesicular transport (GO:0007041, FDR = 0.003; KEGG hsa04142, FDR = 0.008), an indicator of the dysregulated protein trafficking that occurs in active TB.
Innate immune activation:
The pathway analysis showed that ANKRD22 is functionally related to neutrophil degranulation (Reactome R-HSA-6798695, p = 4.5 × 10⁻⁴). Since neutrophils play a primary role in the early immune response against M. tuberculosis, this connection suggests that ANKRD22 may be involved in the inflammatory mechanisms of granuloma development.

This specific pattern of functional enrichment in several independent databases helps to confirm the role of the discussed biomarkers in tuberculosis-associated immune responses.

5. Discussion and Future Work(s)

This study identifies and validates a minimal four-gene transcriptional signature (TAP2, SORT1, WARS, and ANKRD22) with high diagnostic performance (AUC = 0.9911) for classifying active tuberculosis (ATB), latent infection (LTBI), and healthy controls (HCs). Using a hybrid machine learning pipeline of Boruta-XGBoost (stability selection) and LASSO (regularization), a transcriptome-wide whole-blood dataset was reduced to a small panel of predictive biomarkers that retains biological interpretability [47]. While individual genes in this panel have been reported in prior TB studies, especially SORT1, ANKRD22, and TAP2, the primary novelty of our work is the integration of these biomarkers into a minimal and mechanistically consistent panel, which has been tested in independent cohorts using a robust machine learning pipeline.

To address concerns regarding parameter selection, we conducted sensitivity analyses examining two key methodological choices. First, we tested the effect of the ANOVA p-value threshold used for initial gene filtering (Supplementary Table S1). Although our primary analysis used a cutoff of p < 0.001 to prioritize the strongest associations, relaxing it to 0.01 and 0.05 did not change the final gene signature: each time, the same four genes (TAP2, SORT1, WARS, and ANKRD22) were selected and remained among the most highly ranked differentially expressed features. Model performance was likewise stable across all thresholds (AUC > 0.985), which indicates that our signature is not an artifact of parameter selection.

Second, we assessed the effect of the post-normalization scaling approach on the biological interpretation. Since RobustScaler was applied following RMA normalization, we compared it with conventional z-score and quantile normalization (Section 4.7, Table 3). The expression values of the four signature genes were almost perfectly correlated across all scaling methods (Pearson r > 0.995), and the diagnostic performance of the Voting Classifier did not differ meaningfully, with all performance measures showing substantial overlap in their 95% confidence intervals. Together, this threshold independence and scaling invariance indicate that our results capture core host response biology rather than technical preprocessing artifacts.

Importantly, the performance of the signature aligns with the WHO Target Product Profile (TPP) requirements for non-sputum triage tests, with 90% sensitivity and 89.47% specificity. This performance compares favorably with modern transcriptomic signatures, such as the 27-gene panel (AUC = 0.98) by Kaforou et al. [20], the 3-gene classifier (AUC = 0.979) by Lee et al. [23], and the 6-gene assay by the RISK6 consortium (AUC = 0.84) [27]. However, direct statistical comparison (e.g., via DeLong’s test) is not possible because of the inherent differences in cohort composition, microarray platforms, and validation designs across studies. Thus, we do not position our signature as a statistically proven superior alternative, but as a minimal, high-performing panel obtained via a robust machine learning pipeline.

Beyond the statistical performance reflected in the AUC of our signature, its clinical implications are of primary importance. The Target Product Profile (TPP) of the World Health Organization for a TB triage test requires high sensitivity (≥90%) for ruling out disease, and specificity sufficient to substantially lower the need for confirmatory tests. The performance of our signature is consistent with these objectives, with a sensitivity of 90.0% (95% CI: 85.5–93.8%) and a specificity of 89.47% (95% CI: 84.2–93.5%). This corresponds to an estimated decrease of 40–60% in unnecessary follow-on sputum-based tests (i.e., smear microscopy, culture, or molecular tests) in a hypothetical low-prevalence screening environment. This outlines a translational path in which statistical performance corresponds to a real change in clinical workflow and resource allocation.

The precision of the signature in resolving the ATB/LTBI diagnostic ambiguity could address a major limitation of interferon-γ release assays (IGRAs) and tuberculin skin tests (TSTs), which are unable to distinguish infection states [62]. Our confusion matrix showed only one misclassification in each direction between the ATB and LTBI cohorts, which supports reliability despite comorbidity and host genetic variation [63]. This is in line with WHO priorities for non-sputum biomarkers [3] and could substantially reduce unnecessary confirmatory testing in resource-constrained settings, offering a promising way to address the overdiagnosis gap noted in high-burden regions [64,65].

Functionally, every biomarker corresponds to a discrete immunological disruption during the pathogenesis of ATB. Some of the genes have been linked to TB in other studies; for instance, ANKRD22 was included in the larger biomarker signature described by Natarajan et al. [25], and SORT1 has been implicated in lipid trafficking in TB-infected macrophages [66]. Although these individual markers were already known, the primary novelty of our work is their combination into a minimal and mechanistically consistent panel, which has been tested in independent cohorts using a robust machine learning pipeline. The genes in the present panel cover distinct dimensions of the host response: antigen presentation (TAP2), lipid metabolism (SORT1), interferon response (WARS), and inflammasome activation (ANKRD22). In contrast to previous correlative signatures, each gene in our panel maps to a potential therapeutic node of TB pathogenesis:

TAP2 facilitates the processing of MHC-I antigens, promoting the activation of CD8+ T-cells to attack M. tuberculosis [67,68].
ANKRD22 mediates inflammasome-driven immunopathology through the NOD-like receptor pathway, which is gaining a growing role in TB granuloma development [69,70].
SORT1 regulates PPARγ-dependent lipid trafficking, facilitating foam cell formation in granulomas [66].
WARS induction is tied to the interferon-γ response vital to antimicrobial defense, which agrees with its functions in tryptophan depletion and immune evasion [19,22].

Together, these pathways point to a host–pathogen interaction in which immune activation and metabolic hijacking protect the bacteria. This suggests actionable host-directed therapies (HDTs), including TAP2 augmentation to enhance CD8+ T-cell efficacy, ANKRD22 modulation to curb immunopathology, or SORT1 inhibition to limit bacterial lipid uptake [67,68]. These measures might help shorten treatment and combat drug-resistant TB [10,27]. Recent studies investigating novel antimicrobial agents, such as omadacycline for intracellular bacterial infections [71], highlight the growing interest in repurposing and developing drugs that target host–pathogen interactions, a strategy that aligns with the therapeutic potential of our signature genes.

Despite the strong performance, implementation in practice faces the following challenges:

Statistical power in high-dimensional analysis: Our post hoc power analysis indicated that the study had sufficient statistical power (>0.95) to detect differences in expression for the four genes in our signature. At the same time, we are aware of the broader challenges of statistical inference in high-dimensional biomarker discovery studies [36]. Screening thousands of transcripts with small sample sizes increases the risk of false discoveries and may limit the ability to identify small effect sizes. Our findings should be interpreted in light of these methodological limitations. They should be confirmed through future validation in large independent cohorts, and such studies need principled methods to handle missing data and other methodological problems.

Cohort heterogeneity and platform dependency: Our study leveraged publicly available GEO datasets to scale up the sample size, which is typical in biomarker discovery. Nonetheless, this consolidation acts as a source of inherent heterogeneity due to the diverse microarray platforms, sample collection protocols, and demographics. Despite rigorous batch correction (ComBat), there is a possibility that minor and unavoidable technical and biological variance would affect the generalizability of this signature. Its performance on data produced by the modern RNA-sequencing (RNA-seq) systems with even greater dynamic range and the ability to discover new transcripts remains to be assessed and is a significant next step in technological translation.

Lack of prospective clinical validation: A dominant weakness is the retrospective nature of our validation via available curated cohorts. Though cross-cohort validation (GSE19444) has good evidence of generalizability between similar datasets, it does not constitute prospective validation in a real-world clinical environment. Further research is needed in primary care clinics or community-based screening programs to confirm the signature operation properties, such as its performance in daily clinical practice, its acceptability to users, and its realistic influence on clinical decision-making within the target populations.

Clinical definition heterogeneity: Heterogeneity in the clinical definitions of TB states across studies is a recognized issue in biomarker research [72]. Although the criteria employed in our training and validation cohorts were rigorous and standardized (culture confirmation for active TB, dual TST/IGRA positivity for LTBI, and dual TST/IGRA negativity for controls), other studies may apply different diagnostic thresholds, single testing, or clinical criteria only. Such differences make it more difficult to compare biomarker performance between studies and may limit the extent to which signatures generalize to populations defined by alternative criteria [73]. Future multicenter validation should aim to adopt harmonized, consensus-based clinical definitions, as promoted by recent guidelines, to achieve robust translation across settings.

Microbiological confirmation requirement: Our signature was trained and validated on cohorts in which active TB was microbiologically confirmed (culture or smear). Its performance in clinically diagnosed, culture-negative TB, a frequent situation in paucibacillary and extrapulmonary disease, remains to be assessed. The host transcriptional response may differ in the absence of a confirmed bacterial burden, and prospective studies should therefore address this question.

Special population considerations: The transcriptomic alterations that occur in diabetic TB patients may modify signature accuracy [74,75]. Validation in pediatric and HIV-coinfected cohorts, which were not adequately represented in the present study, is essential for equitable diagnostic utility [20,21]. Pediatric TB is usually paucibacillary, and age-related changes in the immune response can affect transcriptional signatures derived from adults. Similarly, HIV-associated immunosuppression may profoundly modulate the host transcriptome and blunt the interferon-driven signals central to many TB signatures [63,76]. Although our signature achieved high accuracy in the cohorts we studied, its performance in these populations requires specific evaluation. Future research will need to include pediatric and HIV-coinfected populations to assess generalizability and verify the diagnostic usefulness of this panel across the entire spectrum of TB disease [76,77].

Subclinical TB assessment: Our study did not evaluate the signature in people with subclinical TB, who have microbiological confirmation but no signs or symptoms. This category was not available in the publicly available datasets used (GSE19439 and GSE19444). Further research is needed to evaluate whether this minimal signature can distinguish subclinical TB from latent infection and active disease, which would greatly improve its utility for triage.

Treatment response dynamics: The effect of anti-tuberculosis treatment on transcriptome profiles is another aspect that should be taken into consideration. This signature was developed in cohorts of treatment-naive individuals to identify markers of the diagnostic state. Its dynamics during therapy, and therefore its possible use for monitoring treatment response or predicting relapse, are undetermined and represent a key direction for future longitudinal studies.

Geographical and epidemiological variation: The geographical heterogeneity of TB burden and prevalence adds another dimension of complexity to biomarker generalizability. Our validation cohort (GSE19444) comprised samples from varied settings, giving us preliminary confidence. Nevertheless, full validation across the entire range of epidemiological settings, including high-, medium-, and low-burden areas, is necessary before any global implementation can be considered. Genetic, environmental, and coinfection backgrounds differ considerably across these settings and can influence host transcriptional responses.

Implementation in resource-limited settings: Beyond validation in special populations, it is also important to assess how this four-gene signature could be implemented in resource-limited settings. The small size of the panel makes it suitable for translation to fast, field-deployable formats. Current microarray and RNA-seq workflows are not well suited to such environments, but emerging solutions, such as point-of-care (POC) molecular platforms, offer a promising path forward. The signature might be adapted to multiplex quantitative reverse transcription PCR (RT-qPCR) on small, efficient devices [24], or used with more recent CRISPR-based diagnostic systems that rely on the Cas13 enzyme (e.g., SHERLOCK) and offer high sensitivity and visual readouts [28,65,78]. Successful translation will require complementary developments in robust stabilization of RNA from fingerstick blood samples and in simplified, miniaturized nucleic acid extraction. Ongoing work to design low-cost, rapid transcriptomic triage tests for TB provides proof of principle for this approach and demonstrates the potential of low-cost gene panels to bridge the gap between biomarker discovery and clinical impact in high-burden areas [26,28].
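
If the signature were ported to RT-qPCR, relative expression of each gene could be summarized with the widely used 2^(-ΔΔCt) calculation. The sketch below illustrates that arithmetic only. The reference gene label `REF`, the calibrator sample, and all Ct values are hypothetical, and the formula assumes roughly 100% amplification efficiency for every assay; none of this was measured in the present study.

```python
SIGNATURE = ("TAP2", "SORT1", "WARS", "ANKRD22")


def fold_change_ddct(ct_target_sample, ct_ref_sample, ct_target_calib, ct_ref_calib):
    """Relative expression by the 2^(-ddCt) method (assumes ~100% PCR efficiency)."""
    d_ct_sample = ct_target_sample - ct_ref_sample  # normalize to reference gene
    d_ct_calib = ct_target_calib - ct_ref_calib     # same for the calibrator sample
    dd_ct = d_ct_sample - d_ct_calib
    return 2.0 ** (-dd_ct)


# Hypothetical Ct values: one patient sample and one calibrator
# (for example, a pooled healthy-control sample). "REF" is a placeholder
# for whichever reference gene an assay developer would validate.
patient = {"TAP2": 24.0, "SORT1": 25.0, "WARS": 23.0, "ANKRD22": 26.0, "REF": 18.0}
calibrator = {"TAP2": 25.0, "SORT1": 26.0, "WARS": 25.0, "ANKRD22": 29.0, "REF": 18.0}

fold_changes = {
    gene: fold_change_ddct(patient[gene], patient["REF"],
                           calibrator[gene], calibrator["REF"])
    for gene in SIGNATURE
}

for gene, fc in fold_changes.items():
    print(f"{gene}: {fc:.1f}-fold relative to calibrator")
```

A lower Ct means more starting template, so a patient Ct three cycles below the calibrator (after normalization) corresponds to an eight-fold higher expression in this toy example.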

Future work should focus on the following:

Experimental validation of the four-gene signature using targeted techniques such as quantitative PCR (qPCR) or NanoString on prospectively collected whole blood from well-defined clinical groups.
Validation in major underrepresented groups, such as pediatric and HIV-coinfected patients, to support unbiased diagnostic utility.
Assessment of signature performance in subclinical TB, an essential gap in the current diagnostic spectrum.
Multi-omics integration with serum metabolomics (e.g., citrate/malate) to improve extrapulmonary TB detection [8,18].
Deployment of the four-gene signature in a point-of-care system for resource-constrained settings using readily available technology such as microfluidic PCR or CRISPR-based assays [24,28].
Longitudinal monitoring of biomarker dynamics during anti-TB therapy to assess treatment response and predict relapse or cure [10,27].
International validation across multiple epidemiological environments, including high-, medium-, and low-burden areas, through frameworks such as the Global TB Biomarker Pipeline [22,27].

6. Conclusions

In summary, the transcriptional signature (TAP2/SORT1/WARS/ANKRD22) developed and tested here showed high accuracy and met the performance targets set by the WHO for an active TB triage test. The main advance of this signature is that it serves not only as a minimal, interpretable diagnostic panel but also as a mechanistically informative roadmap that directly implicates dysregulated pathways in disease biology, including antigen presentation (TAP2), immunometabolism (SORT1), interferon response (WARS), and granulomatous inflammation (ANKRD22). The integrated signature offers a promising platform for advancing both improved diagnostic strategies and new host-directed interventions for tuberculosis.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/diagnostics16050693/s1, Figure S1: Forest-style plot comparing diagnostic performance of transcriptomic signatures for tuberculosis; Table S1: Impact of ANOVA Threshold on Gene Selection and Model Performance; Table S2: Pairwise Pearson correlation coefficients (r) for the expression of each signature gene across different scaling methods; Supplementary Method-S1: Mathematical Details of the Machine Learning Pipeline.

Author Contributions

A.H.S., H.A.A. and N.A. conceptualized and designed the study; A.H.S. was responsible for software, data processing, and curation; A.H.S. and S.A. performed the investigation, validation, and formal analysis; A.H.S. and N.A. prepared the visualizations and figures; A.H.S. wrote the original draft of the manuscript; A.O.A., S.A., H.A.A., N.A., J.I. and A.A.T. contributed to reviewing and editing the manuscript; H.A.A., N.A. and A.A.T. provided supervision; A.H.S. administered the project. All authors were substantially and intellectually involved in the present study to meet the requirements. All authors have read and agreed to the published version of the manuscript.

Funding

This project was funded by the Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah, Saudi Arabia, under Grant No. IPP: 887-830-2025. The authors, therefore, acknowledge with thanks DSR for technical and financial support.

Institutional Review Board Statement

The present study analyzed publicly available gene expression datasets, which can be downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19439 and https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19444 (accessed on 20 September 2025). The authors therefore had no direct involvement with animals or human participants. The original retrospective studies were approved by the relevant local Ethics Committee.

Informed Consent Statement

Not applicable.

Data Availability Statement

Datasets are publicly available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19439 and https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19444 (accessed on 20 September 2025).

Acknowledgments

This project was funded by the Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah, Saudi Arabia, under Grant No. IPP: 887-830-2025. The authors, therefore, acknowledge with thanks DSR for technical and financial support.

Conflicts of Interest

The authors declare that the research was conducted without any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. World Health Organization. 2025 Tuberculosis Global Report; World Health Organization: Geneva, Switzerland, 2025.
2. Kohli, M.; Korobitsyn, A.; Ismail, N.; Zignol, M.; Kasaeva, T.; Dewan, P.; Ruhwald, M.; Anyaike, C.; Ayles, H.; Basilio, R.; et al. WHO Target Product Profile for TB Detection at Peripheral Settings: 2024 Update. PLoS Glob. Public Health 2025, 5, e0004612.
3. World Health Organization. WHO Consolidated Guidelines on Tuberculosis: Module 3: Diagnosis – Rapid Diagnostics for Tuberculosis Detection; World Health Organization: Geneva, Switzerland, 2021.
4. Laux Da Costa, L.; Delcroix, M.; Dalla Costa, E.R.; Prestes, I.V.; Milano, M.; Francis, S.S.; Unis, G.; Silva, D.R.; Riley, L.W.; Rossetti, M.L.R. A Real-Time PCR Signature to Discriminate between Tuberculosis and Other Pulmonary Diseases. Tuberculosis 2015, 95, 421–425.
5. Afzal, Z.; Javed, M.T.; Mohsin, M.; Ahmad, H.M.W.; Saeed, Z.; Taimoor, M.; Aleem, R.A.; Raza, A.; Ayub, A.; Israr, F.; et al. The usefulness of glutaraldehyde coagulation test as a conjuncture test in the diagnosis of tuberculosis in humans and animals. Agrobiol. Rec. 2024, 15, 34–40.
6. Wahyuda, A.; Suharto, R.H.; Muflih, M.; Rasdiyanah; Danawir, M. Exploration of Tuberculosis Transmission between Humans and Cows through Milk Testing in South Sulawesi, Indonesia. Int. J. Vet. Sci. 2025, 14, 1190–1195.
7. Hassine, A.B.H.; Marzouk, M.; Saad, J.; Boukadida, J.; Drancourt, M. Molecular epidemiology of Mycobacterium tuberculosis complex in the center of Tunisia (2008–2010 and 2014–2017). Agrobiol. Rec. 2024, 17, 69–74.
8. Sambarey, A.; Devaprasad, A.; Mohan, A.; Ahmed, A.; Nayak, S.; Swaminathan, S.; D’Souza, G.; Jesuraj, A.; Dhar, C.; Babu, S.; et al. Unbiased Identification of Blood-Based Biomarkers for Pulmonary Tuberculosis by Modeling and Mining Molecular Interaction Networks. EBioMedicine 2017, 15, 112–126.
9. Sambarey, A.; Devaprasad, A.; Baloni, P.; Mishra, M.; Mohan, A.; Tyagi, P.; Singh, A.; Akshata, J.; Sultana, R.; Buggi, S.; et al. Meta-Analysis of Host Response Networks Identifies a Common Core in Tuberculosis. NPJ Syst. Biol. Appl. 2017, 3, 4.
10. Gliddon, H.D.; Kaforou, M.; Alikian, M.; Habgood-Coote, D.; Zhou, C.; Oni, T.; Anderson, S.T.; Brent, A.J.; Crampin, A.C.; Eley, B.; et al. Identification of Reduced Host Transcriptomic Signatures for Tuberculosis Disease and Digital PCR-Based Validation and Quantification. Front. Immunol. 2021, 12, 637164.
11. Zhang, H.; Shi, M.; Yu, L.; Ran, F.; Zheng, N.; Wang, X.; Liu, Y.; Li, C.; Li, D.; Li, J. Identification of a Diagnostic Multiomics-Based Biomarker Cluster in a Mouse Model of Pulmonary Tuberculosis. Discov. Med. 2024, 36, 1268–1279.
12. Chen, E.; Chen, C.; Chen, F.; Yu, P.; Lin, L. Positive Association between MIC Gene Polymorphism and Tuberculosis in Chinese Population. Immunol. Lett. 2019, 213, 62–69.
13. Herrera, M.; Keynan, Y.; McLaren, P.J.; Isaza, J.P.; Abrenica, B.; López, L.; Marin, D.; Rueda, Z.V. Gene Expression Profiling Identifies Candidate Biomarkers for New Latent Tuberculosis Infections. A Cohort Study. PLoS ONE 2022, 17, e0274257.
14. Vargas, R.; Abbott, L.; Bower, D.; Frahm, N.; Shaffer, M.; Yu, W.H. Gene Signature Discovery and Systematic Validation across Diverse Clinical Cohorts for TB Prognosis and Response to Treatment. PLoS Comput. Biol. 2023, 19, e1010770.
15. Zheng, S.; Qu, W.; Zhang, D.; Zhou, J.; Xu, Y.; Wu, W.; Liu, C.; Huang, M.; Shen, E.; Chen, X.; et al. International Multicenter Development of Ensemble Machine Learning Driven Host Response Based Diagnosis for Tuberculosis. iScience 2025, 28, 113444.
16. Łukaszuk, T.; Krawczuk, J.; Żyła, K.; Kęsik, J. Stability of Feature Selection in Multi-Omics Data Analysis. Appl. Sci. 2024, 14, 11103.
17. Boumait, Y.; Ettetuani, B.; Chrairi, M.; Lamzouri, A.; Chahboune, R. Identification of Gene Expression Biomarkers Predictive of Latent Tuberculosis Infection Using Machine Learning Approaches. Genes 2025, 16, 715.
18. Wang, S.; He, L.; Wu, J.; Zhou, Z.; Gao, Y.; Chen, J.; Shao, L.; Zhang, Y.; Zhang, W. Transcriptional Profiling of Human Peripheral Blood Mononuclear Cells Identifies Diagnostic Biomarkers That Distinguish Active and Latent Tuberculosis. Front. Immunol. 2019, 10, 2948.
19. Berry, M.P.R.; Graham, C.M.; McNab, F.W.; Xu, Z.; Bloch, S.A.A.; Oni, T.; Wilkinson, K.A.; Banchereau, R.; Skinner, J.; Wilkinson, R.J.; et al. An Interferon-Inducible Neutrophil-Driven Blood Transcriptional Signature in Human Tuberculosis. Nature 2010, 466, 973–977.
20. Kaforou, M.; Wright, V.J.; Oni, T.; French, N.; Anderson, S.T.; Bangani, N.; Banwell, C.M.; Brent, A.J.; Crampin, A.C.; Dockrell, H.M.; et al. Detection of Tuberculosis in HIV-Infected and -Uninfected African Adults Using Whole Blood RNA Expression Signatures: A Case-Control Study. PLoS Med. 2013, 10, e1001538.
21. Anderson, S.T.; Kaforou, M.; Brent, A.J.; Wright, V.J.; Banwell, C.M.; Chagaluka, G.; Crampin, A.C.; Dockrell, H.M.; French, N.; Hamilton, M.S.; et al. Diagnosis of Childhood Tuberculosis and Host RNA Expression in Africa. N. Engl. J. Med. 2014, 370, 1712–1723.
22. Sweeney, T.E.; Braviak, L.; Tato, C.M.; Khatri, P. Genome-Wide Expression for Diagnosis of Pulmonary Tuberculosis: A Multicohort Analysis. Lancet Respir. Med. 2016, 4, 213–224.
23. Lee, S.W.; Wu, L.S.H.; Huang, G.M.; Huang, K.Y.; Lee, T.Y.; Weng, J.T.Y. Gene Expression Profiling Identifies Candidate Biomarkers for Active and Latent Tuberculosis. BMC Bioinform. 2016, 17, 27–39.
24. Maertzdorf, J.; McEwen, G.; Weiner, J.; Tian, S.; Lader, E.; Schriek, U.; Mayanja-Kizza, H.; Ota, M.; Kenneth, J.; Kaufmann, S.H. Concise Gene Signature for Point-of-care Classification of Tuberculosis. EMBO Mol. Med. 2016, 8, 86–95.
25. Natarajan, S.; Ranganathan, M.; Hanna, L.E.; Tripathy, S. Transcriptional Profiling and Deriving a Seven-Gene Signature That Discriminates Active and Latent Tuberculosis: An Integrative Bioinformatics Approach. Genes 2022, 13, 616.
26. Leong, S.; Zhao, Y.; Joseph, N.M.; Hochberg, N.S.; Sarkar, S.; Pleskunas, J.; Hom, D.; Lakshminarayanan, S.; Horsburgh, C.R.; Roy, G.; et al. Existing Blood Transcriptional Classifiers Accurately Discriminate Active Tuberculosis from Latent Infection in Individuals from South India. Tuberculosis 2018, 109, 41–51.
27. Bayaa, R.; Ndiaye, M.D.B.; Chedid, C.; Kokhreidze, E.; Tukvadze, N.; Banu, S.; Uddin, M.K.M.; Biswas, S.; Nasrin, R.; Ranaivomanana, P.; et al. Multi-Country Evaluation of RISK6, a 6-Gene Blood Transcriptomic Signature, for Tuberculosis Diagnosis and Treatment Monitoring. Sci. Rep. 2021, 11, 13646.
28. Sutherland, J.S.; Van Der Spuy, G.; Gindeh, A.; Thuong, N.T.T.; Namuganga, A.R.; Owolabi, O.; Mayanja-Kizza, H.; Nsereko, M.; Thwaites, G.; Winter, J.; et al. Diagnostic Accuracy of the Cepheid 3-Gene Host Response Fingerstick Blood Test in a Prospective, Multi-Site Study: Interim Results. Clin. Infect. Dis. 2022, 74, 2136–2141.
29. Luo, Y.; Xue, Y.; Liu, W.; Song, H.; Huang, Y.; Tang, G.; Wang, F.; Wang, Q.; Cai, Y.; Sun, Z. Development of Diagnostic Algorithm Using Machine Learning for Distinguishing between Active Tuberculosis and Latent Tuberculosis Infection. BMC Infect. Dis. 2022, 22, 965.
30. Xie, L.; Zhu, G.; Long, S.; Wang, M.; Cheng, X.; Dong, Y.; Wang, C.; Wang, G. Identification of MORN3 and LLGL2 as Novel Diagnostic Biomarkers for Latent Tuberculosis Infection Using Machine Learning Strategies and Experimental Verification. Ann. Med. 2024, 56, 2380797.
31. Ren, B.; Jia, F.; Fang, Q.; Xu, J.; Lin, K.; Huang, R.; Liu, Z.; Xing, X. Development of a Four Autophagy-Related Gene Signature for Active Tuberculosis Diagnosis. Front. Cell. Infect. Microbiol. 2025, 15, 1600348.
32. Perumal, P.; Abdullatif, M.B.; Garlant, H.N.; Honeyborne, I.; Lipman, M.; McHugh, T.D.; Southern, J.; Breen, R.; Santis, G.; Ellappan, K.; et al. Validation of Differentially Expressed Immune Biomarkers in Latent and Active Tuberculosis by Real-Time PCR. Front. Immunol. 2021, 11, 612564.
33. Irizarry, R.A.; Hobbs, B.; Collin, F.; Beazer-Barclay, Y.D.; Antonellis, K.J.; Scherf, U.; Speed, T.P. Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Biostatistics 2003, 4, 249–264.
34. Zhang, Y.; Parmigiani, G.; Johnson, W.E. ComBat-Seq: Batch Effect Adjustment for RNA-Seq Count Data. NAR Genom. Bioinform. 2020, 2, lqaa078.
35. Champely, S.; Ekstrom, C.; Dalgaard, P.; Gill, J. Pwr: Basic Functions for Power Analysis; R Package Version 1.3-0; Comprehensive R Archive Network (CRAN): Vienna, Austria, 2020; Available online: https://CRAN.R-project.org/package=pwr (accessed on 20 September 2025).
36. Rahnenführer, J.; De Bin, R.; Benner, A.; Ambrogi, F.; Lusa, L.; Boulesteix, A.L.; Migliavacca, E.; Binder, H.; Michiels, S.; Sauerbrei, W.; et al. Statistical Analysis of High-Dimensional Biomedical Data: A Gentle Introduction to Analytical Goals, Common Approaches and Challenges. BMC Med. 2023, 21, 182.
37. Huber, W.; Von Heydebreck, A.; Sültmann, H.; Poustka, A.; Vingron, M. Variance Stabilization Applied to Microarray Data Calibration and to the Quantification of Differential Expression. Bioinformatics 2002, 18, S96–S104.
38. Han, J.; Kamber, M. Data Mining: Concepts and Techniques, 2nd ed.; Morgan Kaufmann: San Francisco, CA, USA, 2006.
39. Huber, W.; Carey, V.J.; Gentleman, R.; Anders, S.; Carlson, M.; Carvalho, B.S.; Bravo, H.C.; Davis, S.; Gatto, L.; Girke, T.; et al. Orchestrating High-Throughput Genomic Analysis with Bioconductor. Nat. Methods 2015, 12, 115–121.
40. Benjamini, Y.; Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B (Methodol.) 1995, 57, 289–300.
41. Dudoit, S.; Yang, Y.H.; Callow, M.J.; Speed, T.P. Statistical Methods for Identifying Differentially Expressed Genes in Replicated cDNA Microarray Experiments. Stat. Sin. 2002, 12, 111–139. Available online: http://www.jstor.org/stable/24307038 (accessed on 19 February 2026).
42. Kim, T.K. Understanding One-Way ANOVA Using Conceptual Figures. Korean J. Anesthesiol. 2017, 70, 22.
43. Tukey, J.W. Comparing Individual Means in the Analysis of Variance. Biometrics 1949, 5, 99–114.
44. Saltelli, A.; Aleksankina, K.; Becker, W.; Fennell, P.; Ferretti, F.; Holst, N.; Li, S.; Wu, Q. Why So Many Published Sensitivity Analyses Are False: A Systematic Review of Sensitivity Analysis Practices. Environ. Model. Softw. 2019, 114, 29–39.
45. Bender, R.; Lange, S. Adjusting for Multiple Testing – When and How? J. Clin. Epidemiol. 2001, 54, 343–349.
46. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning with Applications in R; Springer: New York, NY, USA, 2013.
47. Kursa, M.B.; Rudnicki, W.R. Feature Selection with the Boruta Package. J. Stat. Softw. 2010, 36, 1–13.
48. Tibshirani, R. Regression Shrinkage and Selection via the Lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 1996, 58, 267–288.
49. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
50. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794.
51. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
52. Scornet, E. Random Forests and Kernel Methods. IEEE Trans. Inf. Theory 2015, 62, 1485–1500.
53. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232.
54. Cui, H.; Zhang, X. Alignment-Free Supervised Classification of Metagenomes by Recursive SVM. BMC Genom. 2013, 14, 641.
55. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: New York, NY, USA, 2009.
56. Hanley, J.A.; McNeil, B.J. The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve. Radiology 1982, 143, 29–36.
57. Waskom, M. Seaborn: Statistical Data Visualization. J. Open Source Softw. 2021, 6, 3021.
58. Wolf, F.A.; Angerer, P.; Theis, F.J. SCANPY: Large-Scale Single-Cell Gene Expression Data Analysis. Genome Biol. 2018, 19, 15.
59. Dennis, G.; Sherman, B.T.; Hosack, D.A.; Yang, J.; Gao, W.; Lane, C.; Lempicki, R.A. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003, 4, R60.
60. Sherman, B.T.; Hao, M.; Qiu, J.; Jiao, X.; Baseler, M.W.; Lane, H.C.; Imamichi, T.; Chang, W. DAVID: A Web Server for Functional Enrichment Analysis and Functional Annotation of Gene Lists (2021 Update). Nucleic Acids Res. 2022, 50, W216–W221.
61. Kanehisa, M.; Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000, 28, 27–30.
62. Pai, M.; Behr, M. Latent Mycobacterium Tuberculosis Infection and Interferon-Gamma Release Assays. Microbiol. Spectr. 2016, 4, 10-1128.
63. Esmail, H.; Riou, C.; du Bruyn, E.; Pei-Jen Lai, R.; Harley, Y.X.; Meintjes, G.; Wilkinson, K.A.; Wilkinson, R.J. The Immune Response to Mycobacterium Tuberculosis in HIV-1-Coinfected Persons. Annu. Rev. Immunol. 2018, 36, 603–638.
64. Huynh, J.; Abo, Y.N.; Triasih, R.; Singh, V.; Pukai, G.; Masta, P.; Tsogt, B.; Luu, B.K.; Felisia, F.; Pank, N.; et al. Emerging Evidence to Reduce the Burden of Tuberculosis in Children and Young People. Int. J. Infect. Dis. 2025, 155, 107869.
65. Kasule, G.W.; Hermans, S.; Semugenze, D.; Wekiya, E.; Nsubuga, J.; Mwachan, P.; Kabugo, J.; Joloba, M.; García-Basteiro, A.L.; Ssengooba, W. Non-Sputum-Based Samples and Biomarkers for Detection of Mycobacterium Tuberculosis: The Hope to Improve Childhood and HIV-Associated Tuberculosis Diagnosis. Eur. J. Med. Res. 2024, 29, 502.
66. Vázquez, C.L.; Rodgers, A.; Herbst, S.; Coade, S.; Gronow, A.; Guzman, C.A.; Wilson, M.S.; Kanzaki, M.; Nykjaer, A.; Gutierrez, M.G. The Proneurotrophin Receptor Sortilin Is Required for Mycobacterium Tuberculosis Control by Macrophages. Sci. Rep. 2016, 6, 29332.
67. Barbet, G.; Nair-Gupta, P.; Schotsaert, M.; Yeung, S.T.; Moretti, J.; Seyffer, F.; Metreveli, G.; Gardner, T.; Choi, A.; Tortorella, D.; et al. TAP Dysfunction in Dendritic Cells Enables Noncanonical Cross-Presentation for T Cell Priming. Nat. Immunol. 2021, 22, 497–509.
68. Lin, P.L.; Flynn, J.A.L. CD8 T Cells and Mycobacterium Tuberculosis Infection. Semin. Immunopathol. 2015, 37, 239–249.
69. Jobe, D.; Darboe, F.; Muefong, C.N.; Barry, A.; Coker, E.G.; Mohammed, N.; Jobe, A.; Davies, M.M.; Faye, B.; Jallow, R.; et al. Gene Expression in TB Disease Measured from the Periphery Is Different from the Site of Infection. Tuberculosis 2022, 134, 102187.
70. Ahmed, M.; Thirunavukkarasu, S.; Rosa, B.A.; Thomas, K.A.; Das, S.; Rangel-Moreno, J.; Lu, L.; Mehra, S.; Mbandi, S.K.; Thackray, L.B.; et al. Immune Correlates of Tuberculosis Disease and Risk Translate across Species. Sci. Transl. Med. 2020, 12, eaay0233.
71. Yang, Y.; Li, C.; Fan, X.; Long, W.; Hu, Y.; Wang, Y.; Qu, J. Effectiveness of Omadacycline in a Patient with Chlamydia Psittaci and KPC-Producing Gram-Negative Bacteria Infection. Infect. Drug Resist. 2025, 18, 903–908.
72. Pai, M.; Behr, M.A.; Dowdy, D.; Dheda, K.; Divangahi, M.; Boehme, C.C.; Ginsberg, A.; Swaminathan, S.; Spigelman, M.; Getahun, H.; et al. Tuberculosis. Nat. Rev. Dis. Primers 2016, 2, 16076.
73. Horne, D.J.; Zifodya, J.S.; Shapiro, A.E.; Church, E.C.; Kreniske, J.S.; Kay, A.W.; Scandrett, K.; Steingart, K.R.; Takwoingi, Y. Xpert MTB/RIF Ultra Assay for Pulmonary Tuberculosis and Rifampicin Resistance in Adults and Adolescents. Cochrane Database Syst. Rev. 2025, 7, CD009593.
74. Liu, T.; Wang, Y.; Gui, J.; Fu, Y.; Ye, C.; Hong, X.; Chen, L.; Li, Y.; Zhang, X.; Hong, W. Transcriptome Analysis of the Impact of Diabetes as a Comorbidity on Tuberculosis. Medicine 2022, 101, e31652.
75. Liu, Y.; Pu, Y.; Wang, J.; Li, Z.; Liu, S.; Tang, S. A Bioinformatics-Driven Approach to Identify Biomarkers and Elucidate the Pathogenesis of Type 2 Diabetes Concurrent with Pulmonary Tuberculosis. Sci. Rep. 2025, 15, 16931.
76. Darboe, F.; Mbandi, S.K.; Naidoo, K.; Yende-Zuma, N.; Lewis, L.; Thompson, E.G.; Duffy, F.J.; Fisher, M.; Filander, E.; Van Rooyen, M.; et al. Detection of Tuberculosis Recurrence, Diagnosis and Treatment Response by a Blood Transcriptomic Risk Signature in HIV-Infected Persons on Antiretroviral Therapy. Front. Microbiol. 2019, 10, 1441.
77. Zak, D.E.; Penn-Nicholson, A.; Scriba, T.J.; Thompson, E.; Suliman, S.; Amon, L.M.; Mahomed, H.; Erasmus, M.; Whatney, W.; Hussey, G.D.; et al. A Blood RNA Signature for Tuberculosis Disease Risk: A Prospective Cohort Study. Lancet 2016, 387, 2312–2322.
78. Gootenberg, J.S.; Abudayyeh, O.O.; Lee, J.W.; Essletzbichler, P.; Dy, A.J.; Joung, J.; Verdine, V.; Donghia, N.; Daringer, N.M.; Freije, C.A.; et al. Nucleic Acid Detection with CRISPR-Cas13a/C2c2. Science 2017, 356, 438–442.

Figure 1. Machine learning workflow for discovering TB biomarkers. The pipeline takes the transcriptomic data of GSE19439, identifies stage-specific differentially expressed genes (DEGs), selects optimal biomarkers with Boruta-XGBoost and LASSO, validates performance on GSE19444, and annotates biological functions. The main outputs are the four-gene diagnostic panel and the enriched TB immunopathology pathways.

Figure 2. Volcano plots highlighting differentially expressed genes (|log₂FC| > 1.36, p < 0.001) in (A) active vs. control, (B) active vs. latent, and (C) latent vs. control.

Figure 3. Feature importance from L1 regularization (ANKRD22 = highest coefficient).

Figure 4. Comparative performance of six machine learning classifiers across three evaluation metrics: accuracy, F1 score, and ROC AUC. Yellow cells mark the highest value for each metric. The upward arrow highlights the Voting classifier, which achieved the best ROC AUC (99.1%) as well as the highest accuracy (86.2%) and F1 score (86.2%).

Figure 5. Confusion matrix for the Voting Classifier.

Figure 6. Box plots of biomarker expression in the validation cohort (GSE19444). (A) TAP2, (B) SORT1, (C) WARS, and (D) ANKRD22. Asterisks denote significance thresholds: p < 0.001, and NS = not significant. Circles represent individual data points beyond the whiskers (outliers).

Figure 7. Heatmap of biomarker expression in the validation cohort (GSE19444).

Table 2. ANOVA and Tukey’s HSD results for biomarkers (validation cohort, GSE19444).

Gene	ANOVA F	ANOVA p-Value	Comparison	Δlog₂(FC)	Adjusted p-Value
TAP2	25.363	8.79 × 10⁻⁸	Active vs. Latent	1.114	<0.0001
			Active vs. Control	1.195	<0.0001
			Latent vs. Control	0.081	0.8963
SORT1	39.702	4 × 10⁻¹⁰	Active vs. Latent	1.166	<0.0001
			Active vs. Control	1.077	<0.0001
			Latent vs. Control	−0.089	0.8084
WARS	41.905	2 × 10⁻¹⁰	Active vs. Latent	1.737	<0.0001
			Active vs. Control	1.754	<0.0001
			Latent vs. Control	0.017	0.9964
ANKRD22	45.285	6.79 × 10⁻¹¹	Active vs. Latent	3.852	<0.0001
			Active vs. Control	3.559	<0.0001
			Latent vs. Control	−0.293	0.7872
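
For readers who want to see how an ANOVA F statistic of the kind reported in Table 2 is formed, the following minimal Python sketch computes a one-way ANOVA F from three groups. The expression values are toy numbers chosen for easy arithmetic and are not taken from GSE19444; the function name `one_way_anova_f` is an assumption for illustration. Converting F to a p-value and running Tukey's HSD would require a statistics library and are not shown.

```python
def one_way_anova_f(groups):
    """Return (F statistic, between-group df, within-group df) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual samples around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy log2 expression values for one gene (illustration only).
active = [9.0, 10.0, 11.0]
latent = [6.0, 7.0, 8.0]
control = [6.0, 7.0, 8.0]

f_stat, df_b, df_w = one_way_anova_f([active, latent, control])
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

The pattern in the toy data mirrors Table 2 qualitatively: the active group is shifted upward while the latent and control groups overlap, so almost all of the between-group variance comes from the active group.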

Table 3. Comparison of model performance using the four-gene signature under different scaling methods.

Scaling Method	AUC (95% CI)	Sensitivity (95% CI)	Specificity (95% CI)	Accuracy (95% CI)	Macro F1-Score (95% CI)
RobustScaler	0.9911 (0.983–0.997)	90.0% (85.5–93.8%)	89.47% (84.2–93.5%)	86.2% (81.0–90.5%)	86.2% (81.0–90.3%)
Z-score	0.9902 (0.981–0.996)	90.3% (85.8–94.0%)	88.9% (83.5–93.0%)	85.8% (80.5–90.0%)	85.7% (80.4–89.8%)
Quantile	0.9895 (0.980–0.996)	89.7% (85.2–93.5%)	89.5% (84.2–93.5%)	85.5% (80.2–89.8%)	85.5% (80.1–89.6%)
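
The metrics in Table 3 can all be derived from a three-class confusion matrix. The sketch below assumes, as one plausible reading, that sensitivity and specificity are computed one-vs-rest for the active TB class while accuracy and macro F1 are computed over all three classes. The counts are hypothetical and do not reproduce Figure 5, and the function names are illustrative.

```python
def one_vs_rest_metrics(cm, positive):
    """Sensitivity and specificity for one class; cm[true][predicted] holds counts."""
    n = len(cm)
    tp = cm[positive][positive]
    fn = sum(cm[positive]) - tp
    fp = sum(cm[r][positive] for r in range(n)) - tp
    tn = sum(map(sum, cm)) - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)


def accuracy_and_macro_f1(cm):
    """Overall accuracy and the unweighted mean of per-class F1 scores."""
    n = len(cm)
    total = sum(map(sum, cm))
    accuracy = sum(cm[i][i] for i in range(n)) / total
    f1_scores = []
    for i in range(n):
        tp = cm[i][i]
        fn = sum(cm[i]) - tp
        fp = sum(cm[r][i] for r in range(n)) - tp
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / n


# Hypothetical counts; rows = true class, columns = predicted class,
# in the order [active TB, latent TB, healthy control].
cm = [
    [18, 1, 1],
    [1, 15, 4],
    [1, 3, 16],
]

sens, spec = one_vs_rest_metrics(cm, positive=0)
acc, macro_f1 = accuracy_and_macro_f1(cm)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} "
      f"accuracy={acc:.3f} macro_F1={macro_f1:.3f}")
```

With these toy counts, overall accuracy is lower than both the sensitivity and the specificity for active TB, because confusions between latent TB and healthy controls reduce accuracy without affecting the one-vs-rest figures for the active class. This is one possible explanation for a pattern like the one in Table 3, not a confirmed account of how those values were computed.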

Table 4. Benchmark against contemporary TB diagnostic models.

Study	Statistical Model	Indication	Number of Genes	Sensitivity	Specificity	AUC
Berry et al., 2010 [19]	K-nearest neighbors	ATB vs. LTBI and HCs	393	61.67	93.75	N/A
Berry et al., 2010 [19]	K-nearest neighbors	ATB vs. ODs	86	92	83	N/A
Kaforou et al., 2013 [20]	Difference of means	ATB vs. LTBI	27	95	90	0.98
Kaforou et al., 2013 [20]	Difference of means	ATB vs. ODs	44	93	88	0.95
Anderson et al., 2014 [21]	Difference of sums	ATB vs. LTBI	42	96	91	0.984
Anderson et al., 2014 [21]	Difference of sums	ATB vs. ODs	51	74	78	0.862
Laux da Costa et al., 2015 [4]	Random Forest	ATB vs. ODs	3	93	95	0.955
Lee et al., 2016 [23]	Naive Bayes	ATB vs. LTBI	3	97.9	98	0.979
Maertzdorf et al., 2016 [24]	Random Forest	ATB vs. LTBI and HCs	4	88	75	0.98
Sweeney et al., 2016 [22]	Difference of geometric means	ATB vs. LTBI and ODs and HCs	3	82	79	0.88
Sambarey et al., 2017 [8]	Linear discriminant analysis	ATB vs. LTBI and HCs and ODs	10	89.67	81.0	N/A
Leong et al., 2018 [26]	Ridge logistic regression	ATB vs. LTBI	24	93.07	94.5	0.9840
Bayaa et al., 2018 [27]	LASSO	ATB vs. HCs	6	90.9	87.8	0.94
Bayaa et al., 2018 [27]	LASSO	ATB vs. LTBI	6	90.9	88.5	0.93
Wang et al., 2019 [18]	Decision Tree	ATB vs. LTBI and HCs	3	82.4	92.4	0.806
Gliddon et al., 2021 [10]	Disease Risk Score Method	TB/LTBI	3	95	85	0.973
Gliddon et al., 2021 [10]	Disease Risk Score Method	TB/OD	3	95	85	0.938
Perumal et al., 2021 [32]	Simple arithmetic algorithms	HCs vs. ATB	2	90.48	66.67	0.9048
		HCs/LTBI vs. ATB	2	90.91	71.43	0.8615
		HCs vs. LTBI	2	91.67	23.81	0.5357
		LTBI vs. ATB	2	90.48	71.43	0.8367
Natarajan et al., 2022 [25]	N/A	ATB vs. LTBI	7	80–100	80–95	0.84–1.00
Sutherland et al., 2022 [28]	Mann–Whitney U tests	TB vs. ORD	3	87	94	0.88
Luo et al., 2022 [29]	Cforest	ATB vs. LTBI	8	93.39	91.18	0.978
Xie et al., 2024 [30]	LASSO/Random Forest	ATB vs. LTBI	2	N/A	N/A	0.994
		ATB vs. HCs	2	N/A	N/A	0.782
		LTBI vs. HCs	2	N/A	N/A	0.914
Ren et al., 2025 [31]	Support Vector Machine	ATB vs. LTBI	4	N/A	N/A	0.86
Ren et al., 2025 [31]	Support Vector Machine	ATB vs. HCs	4	N/A	N/A	0.99
This study (2025)	Voting Classifier	ATB vs. LTBI and HCs	4	90	89.47	0.9911

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Syed, A.H.; Alromema, N.; Almazarqi, H.A.; Irfan, J.; Ahmad, S.; Taha, A.A.; Alsayed, A.O. Machine-Learning-Derived, Mechanistically Informed Transcriptomic Signature to Diagnose Active Tuberculosis and Guide Host-Directed Therapy. Diagnostics 2026, 16, 693. https://doi.org/10.3390/diagnostics16050693

