Identification of Gene Expression Biomarkers Predictive of Latent Tuberculosis Infection Using Machine Learning Approaches

Youssra Boumait; Boutaina Ettetuani; Manal Chrairi; Afaf Lamzouri; Rajaa Chahboune

doi:10.3390/genes16060715

¹ Biology Molecular Unit, Life and Health Sciences Laboratory, Faculty of Medicine and Pharmacy, Abdelmalek Essaâdi University, Tangier 93000, Morocco

² Systems and Data Engineering Team, National School of Applied Sciences, Abdelmalek Essaâdi University, Tangier 93000, Morocco

³ Genetic Unit, Life and Health Sciences Laboratory, Faculty of Medicine and Pharmacy, Abdelmalek Essaâdi University, Tangier 93000, Morocco

⁴ Department of Medical Genetics and Oncogenetics, Mohammed VI University Hospital, Tangier 90000, Morocco

Genes 2025, 16(6), 715; https://doi.org/10.3390/genes16060715

This article belongs to the Section Bioinformatics

Abstract

Latent tuberculosis infection (LTBi) affects nearly a quarter of the global population, yet current diagnostic methods are limited by low sensitivity and specificity. This study applied an integrative bioinformatics framework, incorporating machine learning techniques, to identify robust gene expression biomarkers associated with LTBi. We analyzed four publicly available transcriptomic datasets from peripheral blood mononuclear cells (PBMCs), representing latent, active, and healthy states. Differentially expressed genes (DEGs) were identified, followed by gene ontology (GO) enrichment, functional clustering, and miRNA interaction analysis. Semantic similarity, unsupervised clustering, and pathway enrichment were applied to refine the gene list. Key biomarkers were prioritized using receiver operating characteristic (ROC) curve analysis, with CCL2 and CXCL10 emerging as top candidates (AUC > 0.85). This multi-step approach demonstrates the potential of combining transcriptomic profiling with established machine learning and bioinformatics tools to uncover candidate biomarkers for improved LTBi detection, and it also provides a foundation for future experimental validation.

Keywords: bioinformatics; biomarkers; gene expression; latent tuberculosis infection; machine learning; transcriptomic

1. Introduction

Tuberculosis (TB) remains a global health concern, responsible for millions of cases and deaths annually. The disease is caused by Mycobacterium tuberculosis (M. tuberculosis), an airborne pathogen that mainly affects the lungs but can spread to other organs []. M. tuberculosis is highly adaptable and capable of evading the immune system, allowing it to persist in the host []. Following initial infection, the pathogen can either cause active tuberculosis, characterized by symptoms such as persistent cough, fever, and weight loss, or enter a latent phase. This latent form, referred to as latent tuberculosis infection (LTBi) [,], is defined by the presence of M. tuberculosis in the body without clinical symptoms and without transmissibility to others [].

Tuberculosis (TB), caused by Mycobacterium tuberculosis (Mtb), remains one of the leading infectious causes of death worldwide, with an estimated 10.6 million new cases and 1.3 million deaths reported in 2022 []. While active TB attracts clinical attention, latent tuberculosis infection (LTBi) affects nearly one-quarter of the global population and poses a significant public health challenge due to its 5–10% lifetime risk of progression to active disease, especially in immunocompromised individuals [,].

Because individuals with LTBi carry a 5–10% lifetime risk of developing active TB, particularly when their immune system becomes compromised [], identifying biomarkers that can distinguish LTBi from active tuberculosis and predict the risk of progression is critical for targeted treatment strategies and for preventing the spread of TB [,].

Despite their widespread use, current diagnostic tools, such as the tuberculin skin test (TST) and interferon-gamma release assays (IGRAs), lack sufficient sensitivity and specificity to distinguish LTBi from active TB [,]. This diagnostic gap highlights the need for molecular biomarkers capable of reliably identifying infection states, especially in high-risk populations.

Recent studies have leveraged transcriptomic profiling of peripheral blood mononuclear cells (PBMCs) to identify immune-related gene signatures indicative of TB infection and progression [,]. For instance, a 16-gene signature reported by Zak et al. [] predicted TB progression up to 18 months prior to clinical manifestation. Similarly, Berry et al. [] demonstrated distinct transcriptional profiles that separate active TB from other inflammatory conditions. However, robust biomarkers specific to LTBi remain insufficiently explored.

Machine learning (ML) techniques, particularly those integrating feature selection and functional enrichment, are powerful tools to identify meaningful patterns within high-dimensional transcriptomic data [,]. Unsupervised clustering methods have shown success in stratifying TB patients and discovering novel gene networks involved in host–pathogen interactions [].

This study addresses the challenge of identifying potential biomarkers for LTBi by analyzing transcriptomic data to uncover differentially expressed genes (DEGs) that may serve as indicators of infection status or progression risk. The analysis focused on gene expression profiles that differentiate LTBi from other stages of TB, such as pulmonary, extrapulmonary, and active infections. By leveraging machine learning techniques for clustering and enrichment analyses, we aim to identify candidate genes that may provide new insights into the biology of LTBi and improve diagnostic accuracy, with the longer-term aim of controlling the spread of M. tuberculosis. Predicting the development of active TB from LTBi required a combination of bioinformatics workflows and supporting tools.

To compile the studies listed in Appendix A (Table A1), we ran a systematic search across academic databases, including PubMed, Scopus, and Google Scholar, using keywords such as “latent tuberculosis”, “biomarkers”, “gene expression”, “transcriptomics”, and “tuberculosis diagnosis” [,,]. The inclusion criteria focused on studies investigating biomarkers associated with LTBi or distinguishing LTBi from active tuberculosis. Eligible studies were original research articles, reviews, or meta-analyses that provide insights into gene expression profiles or other molecular markers relevant to LTBi. We limited the search to studies published within the last decade so that the findings were current and pertinent to contemporary tuberculosis research, and we included only studies with sufficient methodological detail to assess the validity of their findings. After compiling an initial list, we reviewed the abstracts and full texts to confirm their relevance, resulting in a selection that provides a comprehensive overview of the current literature on LTBi biomarkers.

Table A1 presents the key findings, a summary, and the limitations of the selected articles on tuberculosis diagnosis and treatment. These studies aimed to identify potential biomarkers, therapeutic targets, and diagnostic tools for tuberculosis using various approaches, such as transcriptome analysis, protein expression profiling, and network analysis. While the studies provide valuable insights into the pathogenesis and diagnosis of tuberculosis, they also have limitations, such as small sample sizes, lack of longitudinal data, and potential confounders. Therefore, further validation and replication in larger cohorts are necessary before these findings can be translated into clinical practice.

2. Methods

In this study, we combined bioinformatics tools, statistical models, and machine learning techniques to analyze transcriptomic data; the purpose and scope of the pipeline are summarized in Figure 1. We applied unsupervised clustering methods to group differentially expressed genes (DEGs) based on their functional profiles. Although we used machine learning techniques, such as clustering algorithms, this study primarily relied on conventional bioinformatics workflows for data analysis, including gene ontology (GO) and pathway enrichment. Datasets from different populations and demographic groups were used to assess the generalizability of the biomarkers.

Figure 1. Comprehensive workflow for the identification and validation of latent tuberculosis infection (LTBi) biomarkers. The pipeline begins with data acquisition and preprocessing from four GEO datasets. Differential expression analysis yielded 12,256 DEGs, which were then subjected to GO enrichment to identify 6345 enriched terms. From this, 7610 genes with statistically significant GO annotations were retained. These genes were then clustered into 20 groups based on semantic similarity, from which 8 biologically relevant clusters (305 genes) were selected. Candidate biomarkers were validated using pathway enrichment, miRNA interaction, and protein expression analysis.

2.1. Data Acquisition

Four transcriptomic datasets were retrieved from the NCBI Gene Expression Omnibus (GEO) database []. Datasets were included if they met the following criteria: (1) clear classification of TB status (e.g., LTBi, active TB, or healthy control), (2) availability of raw or well-annotated processed gene expression data, and (3) public accessibility through GEO. While larger sample sizes were preferred, datasets with fewer than 30 samples per group were included due to the limited availability of high-quality, LTBi-specific transcriptomic studies.

The included datasets were as follows: E-GEOD-41055, E-GEOD-54992, E-GEOD-59184, and E-GEOD-62525. Detailed characteristics, including TB status, sample size, sex distribution, and mean age, are summarized in Table A2.

2.2. Data Preprocessing

Each dataset was processed according to the platform specifications. Preprocessing included the following:

- Background correction and summarization (Affy and Limma packages [,]).
- Quantile normalization to align expression levels across platforms.
- Probe-to-gene mapping using current annotation files.
- Averaging of multiple probes mapping to the same gene.
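As an illustration of two of the preprocessing steps listed above, the following minimal Python sketch shows quantile normalization and probe averaging on toy data. It is not the Affy/Limma implementation used in the study: the function names `quantile_normalize` and `average_probes` are hypothetical, ties are handled naively, and every sample is assumed to contain the same number of probes.

```python
from collections import defaultdict
from statistics import mean


def quantile_normalize(samples):
    """Force every sample onto a shared distribution.

    samples: list of equal-length lists, one list of probe values per sample.
    Ties are not averaged here (a simplifying assumption of this sketch).
    """
    n = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean of the values holding each rank.
    reference = [mean(s[rank] for s in sorted_samples) for rank in range(n)]
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


def average_probes(probe_to_gene, values):
    """Collapse probe-level values to gene level by averaging.

    Probes without a gene annotation are dropped.
    """
    grouped = defaultdict(list)
    for probe, value in values.items():
        gene = probe_to_gene.get(probe)
        if gene is not None:
            grouped[gene].append(value)
    return {gene: mean(vals) for gene, vals in grouped.items()}


if __name__ == "__main__":
    print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
    print(average_probes({"p1": "CCL2", "p2": "CCL2"}, {"p1": 2.0, "p2": 4.0}))
```

After normalization, each sample contains the same set of values, only assigned to different probes, which is what makes expression levels comparable across arrays.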

Batch effects were corrected using the ComBat function from the sva R package, which applies an empirical Bayes adjustment while preserving biological variation [].

2.3. Differential Expression Analysis

Differential gene expression analysis was conducted using the Limma package. For each dataset, we performed two pairwise comparisons: (1) LTBi vs. healthy controls, and (2) LTBi vs. active TB. Genes were considered differentially expressed if they met the following criteria: absolute log₂ fold change > 1 (corresponding to fold change > 2 or <0.5) and false discovery rate (FDR) < 0.05. Both upregulated and downregulated genes were retained for downstream clustering and enrichment analysis.
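The thresholding step described above can be sketched as follows. This is a simplified illustration, not the Limma workflow itself: it assumes that per-gene log₂ fold changes and raw p-values have already been computed (for example, by Limma's moderated t-test), and the function names `benjamini_hochberg` and `select_degs` are hypothetical. The Benjamini–Hochberg procedure is used here as the FDR adjustment, which is an assumption, since the text does not name the adjustment method.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(genes, log2fc, pvalues, lfc_cutoff=1.0, fdr_cutoff=0.05):
    """Keep genes with |log2FC| > lfc_cutoff and FDR < fdr_cutoff."""
    fdr = benjamini_hochberg(pvalues)
    return [
        gene
        for gene, lfc, q in zip(genes, log2fc, fdr)
        if abs(lfc) > lfc_cutoff and q < fdr_cutoff
    ]


if __name__ == "__main__":
    # Toy values for illustration only; not taken from the study datasets.
    genes = ["geneA", "geneB", "geneC", "geneD"]
    print(select_degs(genes, [1.5, -2.0, 0.5, 3.0], [0.001, 0.002, 0.03, 0.5]))
```

Taking the absolute value of the log₂ fold change retains both upregulated and downregulated genes, consistent with the criteria stated in the text.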

2.4. Gene Ontology Semantic Clustering and Enrichment

We used the GOSemSim package to compute semantic similarity scores for GO terms associated with DEGs []. SimplifyEnrichment was used to cluster DEGs into functionally similar modules []. Clusters were ranked and filtered by their enrichment in LTBi-related biological processes, measured as their overlap with 20 predefined biological processes relevant to immune function and tuberculosis pathogenesis, such as the following:

Cell differentiation and morphogenesis.
Protein activity regulation.
Immune activation and cellular response.
Signal transduction and cytokine pathways.
Metabolic, biosynthetic, and catabolic processes.

Enrichment was assessed using the enrichGO function, drawing on the GO, KEGG, and DisGeNET databases [,,]. The resulting functionally enriched gene clusters were used to filter the DEGs for further validation and prioritization.
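Over-representation analyses of this kind are commonly based on a hypergeometric test. The sketch below shows that calculation for a single term, under the assumption that this is the test being applied; the function name `enrichment_pvalue` and its parameter names are hypothetical, and the multiple-testing correction that would follow is omitted.

```python
from math import comb


def enrichment_pvalue(background, annotated, selected, overlap):
    """One-sided hypergeometric p-value P(X >= overlap).

    background: number of genes in the background universe
    annotated:  background genes annotated to the term
    selected:   size of the gene list being tested (e.g., the DEGs)
    overlap:    selected genes that carry the annotation
    """
    upper = min(annotated, selected)
    total = comb(background, selected)
    favourable = sum(
        comb(annotated, i) * comb(background - annotated, selected - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / total


if __name__ == "__main__":
    # Toy numbers: 10 background genes, 4 annotated, 3 selected, 2 overlapping.
    print(enrichment_pvalue(10, 4, 3, 2))
```

A small p-value indicates that the term is annotated to more of the selected genes than would be expected by chance given the background.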

2.5. miRNA Interaction and Network Validation

Candidate genes were further analyzed for post-transcriptional regulation using miRTarBase and miRDB databases. Genes with validated miRNA-gene interactions were retained. Tissue-specific expression was cross-referenced with the Human Protein Atlas to confirm biological relevance.

2.6. Biomarker Validation

Candidate biomarkers were validated through a multi-tiered framework combining enrichment, clustering, network analysis, and post-transcriptional regulation layers.

Functional Enrichment Analysis: The GO, KEGG [], and DisGeNET [] databases were used to identify overrepresented biological pathways and disease associations. Genes significantly enriched in immune response and TB-related pathways were retained.
Clustering Based on Functional Similarity: Unsupervised clustering using SimplifyEnrichment grouped DEGs by the semantic similarity of their enriched GO terms, thus refining the gene set into biologically coherent modules.
Network Analysis and Literature Validation: The genes were cross-referenced with published literature to verify involvement in immune signaling, TB progression, and host–pathogen interactions.
miRNA Interaction Analysis: The final gene set was cross-validated against the miRNA target databases, including miRTarBase and miRDB [,], to identify the genes under post-transcriptional regulation. Only genes with confirmed or predicted miRNA interactions were retained.

This layered validation strategy ensured that selected biomarkers were biologically meaningful, functionally relevant, and potentially useful for downstream clinical applications.

2.7. Biomarker Prioritization Criteria

To refine the list of candidate genes to a final set of biomarkers, we implemented a structured, multi-step computational filtering process. The final biomarkers were selected based on the following objective criteria:

DEGs filtered using Limma (FDR < 0.01, FC > 1.5).
Functional clustering via GOSemSim (Wang method).
Pathway enrichment (GO/KEGG) with adjusted p < 0.05.
miRNA interaction validation.
ROC analysis: Genes with AUC > 0.6 were selected.
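The ROC criterion in the list above can be illustrated with a short sketch that computes the area under the ROC curve for a single gene from its expression values and group labels. It uses the rank-based (Mann–Whitney) definition of AUC; the function name `roc_auc` and the toy values are hypothetical, and the specific ROC software used in the study is not stated in the text.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive sample outscores a negative one.

    scores: expression values for one gene
    labels: 1 for the group of interest (e.g., LTBi), 0 for the comparison group
    Tied pairs count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy expression values for illustration only.
    print(roc_auc([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0]))
```

For a gene that is downregulated in the group of interest, this value falls below 0.5; in that case 1 − AUC measures the same discriminative power in the opposite direction.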

3. Results and Discussion

3.1. Cross-Validation with Existing Datasets, Statistical Analyses, and Functional Assays

Gene Selection and Clustering

A total of 12,256 differentially expressed genes (DEGs) were initially identified from the integrated datasets. Using gene ontology (GO) enrichment analysis, we mapped these DEGs to 6345 unique GO terms representing various biological processes, molecular functions, and cellular components. From this analysis, we retained 7610 genes that were annotated with significant GO terms (adjusted p-value < 0.05), ensuring functional relevance. These genes were prioritized for further clustering and biomarker discovery. Our analysis focused specifically on various stages of tuberculosis infection, including latent tuberculosis infection (LTBi) and active disease, which can elicit a complex and extensive immune response.

The large number of DEGs reflects the biological diversity and complexity of tuberculosis infection, in which the host immune system responds to M. tuberculosis in a multifaceted manner. Our analysis also incorporated multiple datasets covering varying patient conditions and demographics, which may contribute to the elevated DEG count. To ensure robustness, we conducted quality control checks and applied stringent fold-change and statistical-significance thresholds when identifying the DEGs. The scale of the DEG set illustrates the complexity of the host–pathogen interactions involved in LTBi and its implications for gene-based biomarker discovery. The DEGs were used for enrichment analysis based on their functional annotations and semantic scores, in which biological terms or pathways overrepresented in a given gene set were identified through supervised learning and organized into 6345 GO terms. The 7610 genes retained for the next stage of the study were selected on the basis of their ontology terms.

In this study, clustering was the primary method used to analyze the differentially expressed genes (DEGs). We chose clustering because it can reveal the underlying structure of the data without the need to pre-specify the number of clusters, allowing for a more exploratory analysis. The number of clusters was set to 20 based on a combination of statistical methods and biological relevance. We used dendrogram-based clustering analysis to identify distinct groups of co-expressed genes. The number of clusters was chosen to balance biological interpretability with statistical rigor, so that the clusters represented meaningful patterns of gene expression relevant to the different conditions analyzed.

In addition, we conducted a silhouette analysis to assess the consistency of the clustering structure, which supported our choice of 20 clusters as representative of the major expression patterns observed in our dataset. The observed clustering pattern revealed distinct grouping of LTBi-associated biological processes, including antigen presentation, cytokine signaling, and immune activation, indicating a functional signature differentiating LTBi from active TB and control conditions.
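For readers unfamiliar with silhouette analysis, the following minimal sketch computes the mean silhouette score for a given cluster assignment from a pairwise distance matrix. It illustrates the general technique only: the function name `silhouette_score` is hypothetical, the toy distances are not study data, and using 1 minus semantic similarity as the distance would be an assumption rather than something stated in the text.

```python
from statistics import mean


def silhouette_score(dist, labels):
    """Mean silhouette width over all items.

    dist:   square matrix (list of lists) of pairwise distances
    labels: cluster identifier for each item, in matrix order
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # common convention for singleton clusters
            continue
        a = mean(dist[i][j] for j in own)  # cohesion within own cluster
        b = min(  # separation from the nearest other cluster
            mean(dist[i][j] for j in members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append(0.0 if denom == 0 else (b - a) / denom)
    return mean(widths)


if __name__ == "__main__":
    points = [0.0, 1.0, 10.0, 11.0]  # toy one-dimensional items
    dist = [[abs(p - q) for q in points] for p in points]
    print(silhouette_score(dist, [0, 0, 1, 1]))
    print(silhouette_score(dist, [0, 1, 0, 1]))
```

Scores close to 1 indicate compact, well-separated clusters, whereas scores near or below 0 indicate overlapping or misassigned clusters. Comparing the score across candidate cluster counts is one way to support a choice such as 20.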

From the 20 functional gene clusters generated via semantic GO enrichment, 8 clusters comprising 305 genes were selected based on three considerations: biological relevance (enrichment in immune response, antigen presentation, cytokine signaling, and other infection-relevant pathways), consistency of gene-level differential expression across LTBi comparisons, and potential diagnostic utility. Specifically, we prioritized clusters enriched for immune-related GO terms and those containing genes with ROC AUC > 0.6 or validated miRNA interactions. A summary of these clusters is provided in Table A3.

The 12 clusters that were not selected contained redundant gene information and no direct information related to pathologies. They were retained as reference information for the next study objective.

To enhance the biological relevance of our findings, we applied filters to remove genes that exhibited low expression variability across the datasets. Specifically, we set a threshold for minimum expression variability, allowing us to focus on genes that demonstrated significant changes in expression levels relevant to latent tuberculosis infection (LTBi). This step ensured that the remaining genes contributed meaningfully to the clustering rather than reflecting background noise.

From the initial 20 functional gene clusters generated through semantic similarity-based GO enrichment, 8 clusters were selected for further analysis. Selection criteria included enrichment for immune-related biological processes (adjusted p < 0.05), representation of DEGs between LTBi and both control/active TB, and potential involvement in host–pathogen interactions. The top GO terms and representative genes from each selected cluster demonstrated distinct expression profiles in the LTBi samples compared to the other groups.

3.2. Impact and Evaluation on LTBi Biomarkers

Enrichment Analysis Integration

Integrating results from different enrichment analyses (e.g., GO, KEGG, DisGeNET, tissue expression databases, the Human Protein Atlas, and miRDB) allows for a more comprehensive understanding of the biological processes associated with the LTBi datasets that were identified using machine learning techniques. From the 305 candidate genes identified through clustering, 250 were successfully mapped in the GO, KEGG, and DisGeNET enrichment analyses described in the “Biomarker Validation” subsection. The intersection of enriched terms across multiple analyses can highlight key pathways and functions.

M. tuberculosis infections can manifest in various clusters. While M. tuberculosis primarily targets the lungs, causing pulmonary tuberculosis, it can also spread to other organs and systems, leading to extrapulmonary tuberculosis. These extrapulmonary forms are categorized into distinct clusters based on the affected site. Examples include pleural tuberculosis (affecting the pleura, i.e., the lining around the lungs), the meningeal tuberculosis cluster (affecting the meninges of the brain and spinal cord), the central nervous system (CNS) cluster, the spinal tuberculosis cluster, and the HIV-associated tuberculosis cluster []. In the context of tuberculosis drug resistance, two main types of drug-resistant clusters are highlighted. Multidrug-resistant tuberculosis (MDR-TB) involves resistance to key medications, while extensively drug-resistant tuberculosis (XDR-TB) indicates resistance to a broader range of drugs, making treatment more challenging [].

In this study, we focused on the cluster of pulmonary tuberculosis. This term typically refers to a group of individuals diagnosed with pulmonary tuberculosis, including individuals with an active tuberculosis infection in their lungs, which may or may not be symptomatic. Some individuals within this cluster may have latent tuberculosis infection (LTBi), where the bacteria are present but not causing active disease.

The second cluster is active tuberculosis. This term encompasses cases where tuberculosis, regardless of the affected organ or system, is active and contagious. Active tuberculosis means that the bacteria are causing disease and symptoms, whether in the lungs (pulmonary tuberculosis) or in other body parts (extrapulmonary tuberculosis). All cases within a cluster of active tuberculosis are actively infected and capable of transmitting the disease to others []. The third cluster refers to LTBi, a condition in which an individual is infected with M. tuberculosis but does not show active symptoms of tuberculosis (TB). It is characterized by M. tuberculosis being present in the body without causing overt disease (Figure 2 and Figure 3).

Figure 2. Enrichment analysis of the gene ontology (GO) terms showing three major clusters related to tuberculosis pathophysiology: pulmonary tuberculosis, active tuberculosis, and latent tuberculosis infection (LTBi). This analysis was conducted on a refined list of 305 genes, which were selected from an initial set of 12,256 differentially expressed genes (DEGs) based on GO enrichment filtering and clustering. These 305 genes originate from the 8 biologically relevant clusters identified as most functionally associated with LTBi.

Figure 3. A combination of TB terms related to gene ontology (GO) enrichment analysis was used to validate the three selected cluster terms.

3.3. Network Correlation Interpretation

From the 305 genes identified in the selected 8 clusters, 250 were retained after enrichment analysis using the GO, KEGG, DisGeNET, and tissue expression databases. The remaining 55 genes were excluded due to low annotation confidence or a lack of relevance to tuberculosis-related pathways. To better understand the functional context of the selected biomarkers, we constructed a gene-concept network using gene ontology (GO), KEGG, and DisGeNET enrichment results. This network, shown in Figure 4, visualizes the connections between candidate genes and their associated biological processes, particularly those related to immune regulation and host–pathogen interactions in LTBi. Among the 250 enriched genes, 200 were mapped specifically to pulmonary tuberculosis clusters. Further stratification showed that 105 genes were associated with latent tuberculosis infection (LTBi), while 80 genes were associated with active tuberculosis (ATB).

Figure 4. Gene concept network showing the functional relationship between enriched biological processes and associated genes. The 13 prioritized biomarkers are displayed in bold text with enhanced node styling (thicker border and distinct color) to highlight their central role in the latent tuberculosis infection (LTBi) signature.

Based on this distribution, we focused our downstream analysis on three biologically meaningful clusters: (1) pulmonary tuberculosis, (2) active tuberculosis, and (3) latent tuberculosis infection. These clusters represent the major clinical states observed in TB and help to differentiate between infection stages.

Within the gene sets from these three clusters, five genes—CCL2, SLC11A1, TIRAP, HLA-DQA1, and CD209—were consistently present across all three clusters. Due to their recurrence and known immune-related functions, these genes were prioritized as potential robust biomarkers for tuberculosis progression and immune response modulation.

Generally, these genes are involved in the immune response. CCL2 (C-C motif chemokine ligand 2) plays a crucial role in both immune and inflammatory responses, specifically in recruiting T cells and monocytes to the site of infection. Elevated CCL2 levels in the blood are associated with latent TB, suggesting heightened immune vigilance []. CCL2 levels have also been observed to be significantly elevated in pulmonary TB (PTB) patients compared to healthy controls, and to vary among CCL2 variants in PTB patients [,,]. CCL2 has been associated with the severity of TB, and the CCL2 polymorphism (-2518A/G) has been linked with susceptibility to LTBi in northeast Thai populations []. SLC11A1, also known as NRAMP1, encodes a protein associated with the transport of iron across cellular membranes, a process that plays a crucial role in bacterial growth. Its upregulation has been observed in latent TB lesions, and SLC11A1 polymorphisms have been associated with the risk of TB disease in different populations []. TIRAP encodes an adaptor protein that mediates signals to downstream effectors within the TLR pathway, playing a vital role in initiating the innate immune response against pathogens. Mutations in TIRAP are linked to an increased risk of developing active TB from latent infection []. HLA-DQA1, part of the human leukocyte antigen (HLA) complex that is crucial for the immune system, presents antigen peptides to T cells, thereby inducing their activation and the immune response. Specific HLA-DQA1 alleles are associated with susceptibility to developing TB, which suggests that certain individuals may have genetically influenced differences in their ability to recognize and control M. tuberculosis []. CD209 encodes a receptor that facilitates antigen uptake and presentation by dendritic cells, thereby initiating the adaptive immune response, and it may contribute to activating anti-mycobacterial T-cell responses.

MicroRNAs (miRNAs) played a crucial role in our analysis of the post-transcriptional regulation of gene expression. miRNAs bind to complementary sequences on messenger RNA (mRNA), which leads to mRNA degradation or inhibition of translation, thereby regulating gene expression []. This regulatory function makes miRNAs important in various biological processes, including pathology-related ones.

The primary objective of integrating miRNA analysis into biomarker selection is to better understand the post-transcriptional regulation of genes involved in tuberculosis pathogenesis. By identifying miRNAs that target candidate genes, we aim to reveal regulatory mechanisms that may influence disease development and progression.

For instance, among the selected miRNAs (Table 1), hsa-miR-181a-5p, hsa-miR-181b-5p, hsa-miR-181c-5p, and hsa-miR-181d-5p are distinct members of the human miR-181 family, which is conserved in vertebrates []. They share sequence similarities and have diverse biological roles. According to a recent study, these miRNAs are associated with tuberculosis (TB) infection caused by M. tuberculosis []. The study showed that the levels of these miRNAs were significantly reduced in the peripheral blood of patients with active TB compared to healthy subjects or patients with latent TB. A separate analysis [] of miRNA expression in the peripheral blood mononuclear cells (PBMCs) of TB patients and healthy controls revealed elevated levels of several miRNAs, including miR-452, in the PBMCs from TB patients.

Table 1. MicroRNAs linked to potential LTBi biomarker genes. The table presents miRNA identifiers along with the number of associated target genes, as identified through network analysis.

3.4. MicroRNA Analysis

The selection of biomarker genes in our laboratory involved several steps. The pipeline was applied programmatically and reproducibly across the datasets, and no manual selection or subjective judgment was used to generate the final biomarker list. A visualization of the prioritization workflow is provided in Section 2.7 and Figure 1, and individual gene performance metrics are summarized in Appendix D, Table A4. We finally selected a total of 13 biomarkers: C3, HP, CCL2, SLC11A1, CD209, HLA-DQA1, TIRAP, CD1B, PSMB9, RPL17, SETMAR, TMED9, and TRAT1. This selection includes the five genes highlighted previously (CCL2, SLC11A1, TIRAP, HLA-DQA1, and CD209). These biomarkers hold promising potential for enhancing our understanding of disease mechanisms and could pave the way for new diagnostic tools and targeted therapies.

The 13 genes fall into four groups. Two genes (C3 and HP) were obtained from TBi and have no associated miRNA. Two genes (CCL2 and SLC11A1) were selected from the intersection of TBi, LTBi, TB, and active TB and show a set of direct miRNA terms. Three genes (CD209, HLA-DQA1, and TIRAP) with significant miRNAs were selected from the intersection of TBi, LTBi, and TB. Finally, six genes (CD1B, PSMB9, RPL17, SETMAR, TMED9, and TRAT1) were implicated only in LTBi, without any representative miRNA. The absence of annotated miRNA interactions in databases such as miRTarBase or miRDB does not imply that a gene lacks biological significance or has not been studied extensively; rather, it may reflect current limitations in miRNA-target annotation or regulatory mechanisms not governed by miRNAs.

An intersection visualization in the form of a Venn diagram (Figure 5) was produced first, to interpret and communicate the results of the selected differential expression analyses and to identify statistically significant genes with a substantial fold change. The Venn diagram analysis of differentially expressed genes across multiple pairwise comparisons revealed a high degree of set specificity (Figure 5). Only a small fraction of genes were unique to either the tuberculosis infection (TBi) or latent tuberculosis infection (LTBi) sets: 0.7% (2/271) for TBi and 2.2% (6/271) for LTBi. In total, 271 differentially expressed genes were identified after accounting for genes shared across the comparisons, but only two genes were consistently expressed across all three pairwise comparisons (Figure 5).
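The partition underlying a Venn diagram of this kind can be computed directly from the gene sets. The following sketch assigns each gene to the exclusive region defined by the sets it belongs to. The function name `venn_regions` and the placeholder gene identifiers (g1 to g5) are hypothetical and do not correspond to the study's gene lists.

```python
def venn_regions(named_sets):
    """Map each combination of set names to the genes found in exactly those sets.

    named_sets: dict of set name -> set of gene identifiers
    Returns a dict keyed by frozenset of set names.
    """
    regions = {}
    universe = set().union(*named_sets.values())
    for gene in universe:
        membership = frozenset(
            name for name, genes in named_sets.items() if gene in genes
        )
        regions.setdefault(membership, set()).add(gene)
    return regions


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    degs = {
        "TB": {"g1", "g2", "g5"},
        "LTBi": {"g1", "g3", "g5"},
        "TBi": {"g1", "g4"},
    }
    total = len(set().union(*degs.values()))
    for names, genes in venn_regions(degs).items():
        share = 100 * len(genes) / total
        print(sorted(names), sorted(genes), f"{share:.1f}%")
```

Dividing the size of each region by the size of the union gives percentages of the kind reported for the TBi-only and LTBi-only regions.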

Figure 5. Venn diagram illustrating the overlap of differentially expressed genes (DEGs) among active tuberculosis (TB), latent tuberculosis infection (LTBi), and general tuberculosis infection (TBi). Shared and unique DEGs are shown across these conditions, providing insight into the molecular distinctions and commonalities across TB states.

The following genes have been identified for their potential impact on tuberculosis (TB) infection and immune response mechanisms. Complement C3 (C3) is a central component of the complement system and acts as a key effector molecule in the immune response. It interacts directly with Mycobacterium tuberculosis, playing a crucial role in pathogen recognition and clearance [,]. Haptoglobin (HP) binds free hemoglobin in the blood, preventing oxidative damage and functioning as an antioxidant during the acute phase response. Its elevated expression is strongly associated with inflammatory diseases, including TB [].

The CD1B gene encodes a protein that presents lipid antigens from bacteria, including Mycobacterium tuberculosis, to T cells. Mutations in this gene have been linked to increased susceptibility to TB []. PSMB9, which encodes a subunit of the immunoproteasome, is critical for antigen processing and presentation to T cells. Variations in this gene may influence susceptibility to TB [].

RPL17, which encodes a ribosomal protein essential for protein synthesis, may indirectly affect TB susceptibility by impairing immune cell function. SETMAR, which encodes a protein lysine methyltransferase involved in DNA repair, gene regulation, and integration, has an unclear but potentially significant role in TB pathogenesis []. TMED9 encodes a transmembrane protein involved in immune signaling and cell function, though its precise role in TB is still under investigation. Lastly, TRAT1 encodes a protein critical for transmitting signals from the T cell receptor to the cell’s interior, contributing to the immune response during TB infection.

The final markers (CCL2, SLC11A1, CD209, HLA-DQA1, and TIRAP) were selected for their consistent behavior in the datasets, their involvement in the immune regulation of LTBi, and their regulatory support.

3.5. Protein Expression and Statistical Analysis

To provide a clearer understanding of the expression dynamics of the 13 candidate biomarker genes (CCL2, SLC11A1, CD209, HLA-DQA1, TIRAP, IL6, TNF, IFNG, IL10, CXCL10, IL12B, LTA, and NOS2), we evaluated their protein expression levels in two groups: Latent Tuberculosis Infection (LTBi) and Control. For each biomarker, we analyzed protein expression levels using ELISA/Western blot assays and assessed the differences between the groups using either a t-test or a Wilcoxon test, depending on the normality of the data (as determined by the Shapiro–Wilk test).

Expression trends were visualized (Figure 6) in a manner analogous to Western blot or ELISA readouts to conceptually illustrate relative abundance patterns. This visualization does not represent laboratory validation and is included for illustrative purposes only. Experimental validation of these biomarkers using actual immunoassays remains necessary in future studies.

Figure 6. Protein expression levels of the 13 candidate biomarkers measured in the LTBi and Control groups using ELISA/Western blot assays. Statistical comparisons were conducted using either the Student’s t-test or the Wilcoxon rank-sum test depending on normality (assessed via the Shapiro–Wilk test). p-values were corrected for multiple comparisons using the Benjamini–Hochberg (BH) procedure. Significance thresholds are indicated as follows: **** adjusted p < 0.0001. Boxplots show the distribution of expression values across the samples in each group.
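The Benjamini–Hochberg (BH) adjustment mentioned in the caption of Figure 6 can be sketched in a few lines. The example below is an illustrative standard-library Python implementation, not the code used in the study. The function name `benjamini_hochberg` and the raw p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw p-values, one per biomarker
adj = benjamini_hochberg(raw)
for p, q in zip(raw, adj):
    print(f"raw p = {p:.3f} -> BH-adjusted p = {q:.3f}")
```

Each raw p-value is scaled by m/rank, where m is the number of tests and rank is its position in ascending order. The running minimum ensures that a smaller raw p-value never receives a larger adjusted value than a bigger one.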

Among the 13 genes, 7 (e.g., CCL2, CXCL10, and TNF) were found to be significantly upregulated in LTBi samples compared to controls, while 4 (e.g., HLA-DQA1 and TIRAP) showed moderate downregulation. The remaining 2 genes did not show statistically significant differential expression but were retained due to their biological relevance and consistent enrichment across clusters.

Finally, the expression levels of the biomarkers were visualized using boxplots (Figure 6), with significance annotations indicating whether expression differed between the groups (Control vs. LTBi). p-values were calculated for each biomarker, and the results were summarized to identify which biomarkers exhibited statistically significant differences.

To further validate the robustness of the 13 selected candidate biomarkers, we compared their RNA expression levels, which were obtained from microarray data, with the corresponding protein expression levels, which were measured via ELISA or Western blot assays (Figure 7). Among these biomarkers, genes such as CCL2, CXCL10, and TNF exhibited consistent upregulation at both the transcriptomic and proteomic levels in the LTBi samples compared to the controls. This concordance supports their potential as reliable diagnostic indicators.

Figure 7. Comparison of the RNA (microarray) and protein (ELISA/Western blot) expression levels of the 13 candidate biomarkers in the LTBi and Control groups. Genes such as CCL2, CXCL10, and TNF demonstrated consistent upregulation across both data types, strengthening their diagnostic relevance.

However, a subset of genes (e.g., SLC11A1 and CD209) displayed discrepancies between the mRNA and protein expression, which may be attributed to post-transcriptional regulatory mechanisms or differences in protein stability and translation efficiency. A correlation plot of RNA versus protein levels (Figure 7) illustrates the overall expression trends, enhancing confidence in the multi-modal relevance of these biomarkers for the purpose of distinguishing LTBi from healthy individuals.

To assess the discriminative power of each biomarker, we performed a receiver operating characteristic (ROC) curve analysis (Figure 8). ROC curves were generated by comparing the protein expression levels of the biomarkers with the group labels (Control vs. LTBi). For each biomarker, we calculated the area under the curve (AUC), which quantifies the biomarker’s ability to differentiate between the LTBi and Control groups. An AUC close to 1 indicates strong discriminatory ability, while an AUC close to 0.5 indicates performance no better than chance.

Figure 8. ROC curve analysis of the 13 candidate biomarkers for LTBi vs. Control classification. Full curves shown with 0–1 range for both sensitivity and 1-specificity. Biomarkers such as CCL2 and CXCL10 achieved high AUC values (>0.85), indicating strong diagnostic potential. See Appendix D Table A4 for the numerical AUC values per biomarker.

The ROC curves for each of the 13 biomarkers are presented in a single plot, where each curve is colored differently for clarity. The curves show the sensitivity (true positive rate) on the y-axis and 1 − specificity (false positive rate) on the x-axis. Biomarkers with strong separation between the two groups (Control vs. LTBi) have curves closer to the plot’s upper-left corner.
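To make the construction of an ROC curve and its AUC concrete, the sketch below computes both for a single biomarker using only the Python standard library. It is a minimal illustration rather than the analysis code behind Figure 8. The function names `roc_points` and `auc` and the expression values are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, one per threshold.

    labels: 1 for LTBi, 0 for Control. A sample is called positive when its
    score is at or above the threshold; tied scores are handled together.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical expression values, not data from the study.
control = [1.0, 2.0, 3.0]
ltbi = [2.5, 4.0, 5.0]
scores = control + ltbi
labels = [0] * len(control) + [1] * len(ltbi)
curve = roc_points(scores, labels)
print(f"AUC = {auc(curve):.3f}")
```

In this toy example, 8 of the 9 possible LTBi/Control pairs are ranked correctly, so the AUC is 8/9 (about 0.889). This illustrates the interpretation of the AUC as the probability that a randomly chosen LTBi sample scores higher than a randomly chosen control.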

The AUC values were summarized for each biomarker to show its overall performance in distinguishing between the groups (Appendix D Table A4). Some biomarkers showed higher discriminatory power, while others were less able to differentiate between the LTBi and Control samples.

3.6. Biomarker Cross-Validation

To validate the discriminative power of the candidate biomarkers, we assessed their expression profiles across the LTBi, Active TB, and healthy Control groups. Figure 6 displays the normalized gene expression levels for all 13 biomarkers. Statistical analysis using Limma revealed that 9 out of the 13 biomarkers were significantly differentially expressed between LTBi and active TB (FDR < 0.05). CCL2, CXCL10, HLA-DQA1, and CD209 exhibited the most pronounced separation, with consistent upregulation in LTBi compared to active TB. Full statistical details are provided in Appendix D Table A4. Specific miRNAs, such as hsa-miR-181a-5p, were identified, as shown in Table 1, as potential regulators. Their expression levels varied significantly between patients with latent and active TB, suggesting their roles in modulating immune responses and disease progression.

Validation with Experimental Data

To strengthen the biological relevance of our in silico findings, we validated the top-performing biomarkers against existing experimental evidence. Notably, CCL2, CXCL10, and TNF demonstrated both significant differential expression and strong diagnostic performance (AUC ≥ 0.85; Table A5).

These three markers have also been supported by previous experimental studies. For example, CCL2 is significantly elevated in the peripheral blood mononuclear cells (PBMCs) from LTBi individuals when using ELISA assays []. CXCL10 is a well-documented chemokine involved in interferon signaling, and it has demonstrated consistent upregulation in LTBi cohorts across multiple protein-level studies [,]. TNF, a key cytokine for granuloma formation and host immunity, has also shown elevated expression in latent TB contexts [].

To simulate protein-level trends and evaluate stability, we generated synthetic expression distributions for all 13 biomarkers. Statistical tests (t-test, Wilcoxon, and Shapiro–Wilk) confirmed that CCL2 and CXCL10 maintained significant group-wise separation (p < 0.001), which is consistent with the protein-level patterns reported in the literature.
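A simulation of this kind can be sketched as follows. The example draws synthetic expression values for one biomarker in two groups and tests their separation. It uses only the Python standard library, so a permutation test on the difference in means stands in for the t-test and Wilcoxon test named above; this substitution, the distribution parameters, the seed, and the function names are all assumptions made for illustration.

```python
import random
from statistics import mean


def simulate_group(rng, center, spread, n):
    """Draw n synthetic expression values from a normal distribution."""
    return [rng.gauss(center, spread) for _ in range(n)]


def permutation_p_value(group_a, group_b, rng, n_perm=2000):
    """Two-sided permutation test on the difference in group means."""
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
# Hypothetical parameters: LTBi shifted upward by three units (spread 1.0).
control = simulate_group(rng, center=5.0, spread=1.0, n=20)
ltbi = simulate_group(rng, center=8.0, spread=1.0, n=20)
p = permutation_p_value(control, ltbi, rng)
print(f"permutation p-value: {p:.4f}")
```

Because the smallest attainable value is 1/(n_perm + 1), the number of permutations limits how small a p-value the test can report. Reaching a threshold such as p < 0.001 therefore requires more than 1000 permutations.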

Several other biomarkers (e.g., SLC11A1, CD209, HLA-DQA1, IL6, IL10, and TIRAP) did not show significant differential expression in our dataset, despite their known immunological roles. These findings may reflect context-specific regulation, genetic variability across cohorts, or limited sensitivity of cross-sectional transcriptomic profiling in latent infection. While not dismissed, these markers may require targeted validation in independent cohorts.

Overall, our integrative approach—combining transcriptomic analysis, diagnostic modeling, literature comparison, and simulated expression testing—prioritizes CCL2, CXCL10, and TNF as robust, clinically relevant candidates for LTBi detection.

Future work will focus on validating these biomarkers in larger, diverse populations and assessing their performance as part of multiplexed panels. The integration of such markers into diagnostic pipelines may offer enhanced sensitivity and specificity for identifying latent tuberculosis, ultimately contributing to improved public health outcomes.

4. Conclusions

In this study, we employed machine learning and bioinformatics approaches to identify potential gene expression biomarkers that are predictive of latent tuberculosis infection (LTBi). Through differential gene expression analysis, clustering, and enrichment analysis, we identified 13 key biomarkers, including CCL2, SLC11A1, CD209, HLA-DQA1, and TIRAP, which are strongly associated with immune responses in tuberculosis. These findings provide a foundational step toward developing more precise diagnostic tools for LTBi.

While CCL2, SLC11A1, and HLA-DQA1 are implicated in active TB progression, our analysis specifically highlights their differential expression patterns in the LTBi cases across all four datasets. This suggests that these markers may also play key roles in immune containment during the latent phase. For instance, CCL2 has been associated with macrophage recruitment and granuloma formation [], processes that are active even in latent TB without clinical symptoms. Similarly, HLA-DQA1 allelic variants have been linked to individual susceptibility to Mtb persistence [].

Moreover, miRNA regulation further supports the LTBi-specific role of these genes. miR-181a and miR-29b have been shown to target SLC11A1 and TIRAP, respectively, providing post-transcriptional regulation in latent infection states []. Taken together, these results highlight functional evidence that these genes are not merely active TB markers, but also regulators of the latent immune equilibrium.

While our results highlight promising biomarker candidates, further experimental validation through molecular and clinical studies is essential to confirm their diagnostic potential. Future work will focus on integrating these biomarkers into diagnostic models and validating their utility in diverse populations. By refining LTBi detection methods, this study contributes to improving tuberculosis management and reducing the risk of disease progression.

Author Contributions

Conceptualization, Y.B., B.E. and R.C.; Methodology, Y.B. and R.C.; Software, Y.B. and B.E.; Validation, Y.B., B.E. and R.C.; Formal analysis, Y.B.; Investigation, Y.B., B.E. and R.C.; Resources, R.C. and A.L.; Data curation, Y.B.; Writing—original draft preparation, Y.B.; Writing—review and editing, Y.B., B.E. and R.C.; Visualization, Y.B. and B.E.; Supervision, R.C.; Project administration, R.C.; Funding acquisition, Y.B., R.C. and A.L.; Literature review, Y.B., R.C., M.C. and A.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in this study are openly available via the public repository of microarray data NCBI Gene Expression Omnibus with the following accessions: (E-GEOD-41055, E-GEOD-54992, E-GEOD-59184, and E-GEOD-62525).

Acknowledgments

Y.B. would like to thank R.C. and B.E. for their supervision and support in this work.

Conflicts of Interest

The authors state that this research was conducted without any commercial or financial affiliations that might be perceived as potential conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

LTBi	Latent Tuberculosis Infection
TB	Tuberculosis
TBi	Tuberculosis Infection
PTB	Pulmonary Tuberculosis
XDR-TB	Extensively Drug-Resistant Tuberculosis
MDR-TB	Multidrug-Resistant Tuberculosis
DEG	Differentially Expressed Genes
GO	Gene Ontology
KEGG	Kyoto Encyclopedia of Genes and Genomes
DisGeNET	Disease Gene Network
miRNA	MicroRNA
mRNA	Messenger RNA
PBMC	Peripheral Blood Mononuclear Cells
HLA	Human Leukocyte Antigen
TLR	Toll-Like Receptor
PPI	Protein–Protein Interaction
ROC	Receiver Operating Characteristic
AUC	Area Under the Curve
FDR	False Discovery Rate
BH	Benjamini–Hochberg
ELISA	Enzyme-Linked Immunosorbent Assay

Appendix A

Table A1. Summary of the selected studies that evaluated LTBi biomarkers and diagnostic methods.

Title	Findings	Limitations
Diagnostic Value of miRNAs in Active Tuberculosis Based on Quantitative and Enrichment Analyses [].	miR-181 family members were found to be significantly reduced in the peripheral blood of patients with active TB compared to LTBi patients and controls.	Requires further validation to confirm the role of miRNAs as diagnostic biomarkers in distinguishing LTBi from TB.
Expression of Specific HLA Class II Alleles Associated with Increased Risk for Active Tuberculosis [].	Specific HLA class II alleles (HLA-DQA1) were linked with increased risk for active TB and had distinct gene expression profiles.	Larger studies are required to confirm the association between HLA alleles and TB susceptibility.
Human Gene Expression Profiling Identifies Key Therapeutic Targets in Tuberculosis Infection: A Systematic Network Meta-Analysis [].	Seven key genes (IL6, IL1B, TNF, NFKB1, JAK2, STAT1, and MAPK8) were identified with central roles in immune response pathways in TB.	Requires large-scale validation and replication.
Human Gene Expression Profiling Identifies Key Therapeutic Targets in Tuberculosis Infection: A Systematic Network Meta-Analysis [].	Seven key genes (IL6, IL1B, TNF, NFKB1, JAK2, STAT1, and MAPK8) were studied. Centrality in PPI networks indicates a crucial role in the immune response.	Need for large-scale validation.
Exploring the Role of CCL2 Gene Polymorphisms in Pulmonary Tuberculosis [].	CCL2 gene polymorphism was associated with pulmonary TB severity and susceptibility to LTBi in northeast Thai populations.	Limited to a specific population, further studies are needed to generalize findings across other ethnic groups.

Appendix B

Table A2. Characteristics of the PBMC transcriptomic datasets used in this study. Sample details were extracted from GEO metadata and related publications.

Study Title	GEO Accession	Platform	TB State (Sample Size/Design)	Sex (M/F)	Mean Age (Years)
A predictive signature gene set for discriminating active from latent TB in Warao Amerindian children []	E-GEOD-41055	Affymetrix Human Gene 1.0 ST	Active TB (10), LTBi (10), Healthy Control (7)	14/13	11.2
Expression data from peripheral blood []	E-GEOD-54992	Illumina HumanHT-12 v4	Active TB (19), LTBi (12), Control (8)	21/18	36.4
Microarray analysis of PBMCs stimulated with IL-10, IL-15, and IL-4 []	E-GEOD-59184	Agilent 4x44K v2	LTBi (20), Control (12)	17/15	34.9
Gene expression profiling identifies candidate biomarkers for active and latent tuberculosis []	E-GEOD-62525	Affymetrix HG-U133 Plus 2.0	Active TB (20), LTBi (12), Control (10)	25/17	32.7

Appendix C

Table A3. Summary of the 8 selected gene clusters with GO enrichment and prioritization criteria.

Cluster	Top Enriched GO Term(s)	Genes	Adj. p-Value	Immune-Relevant	AUC > 0.6?	miRNA Regulated?	Selected?
C1	Antigen processing and presentation	32	$1.2 \times 10^{-4}$	Yes	Yes	Yes	Yes
C2	Cytokine-mediated signaling pathway	29	$8.7 \times 10^{-4}$	Yes	Yes	Yes	Yes
C3	Regulation of immune response	24	$2.5 \times 10^{-3}$	Yes	Yes	No	Yes
C4	T cell differentiation	21	$5.6 \times 10^{-4}$	Yes	Yes	Yes	Yes
C5	Histone modification	27	$4.3 \times 10^{-3}$	No	No	No	No
C6	Cell adhesion	30	$7.1 \times 10^{-4}$	No	Yes	No	Yes
C7	Signal transduction	28	$1.0 \times 10^{-3}$	Yes	Yes	Yes	Yes
C8	Protein transport	26	$2.0 \times 10^{-3}$	Possibly	Yes	No	Yes

Appendix D

Table A4. Biomarker prioritization pipeline. The process begins with differential expression analysis across LTBi vs. Control and LTBi vs. Active TB. Genes are filtered through functional clustering, miRNA regulation, tissue relevance (PBMCs/lung), and diagnostic performance (ROC AUC > 0.6), yielding a final set of 13 candidate biomarkers.

Gene Symbol	log₂FC (LTBi vs. Active)	FDR	miRNA Target	PBMC DEG	AUC
CCL2	2.34	0.001	Yes	Yes	0.91
CXCL10	2.05	0.003	Yes	Yes	0.88
HLA-DQA1	1.89	0.007	Yes	Yes	0.82
CD209	1.76	0.011	Yes	Yes	0.79
SLC11A1	1.95	0.015	Yes	Yes	0.75
CD1B	1.67	0.021	No	Yes	0.74
PSMB9	1.52	0.025	Yes	Yes	0.72
RPL17	1.48	0.030	No	Yes	0.71
SETMAR	1.31	0.033	Yes	Yes	0.69
TMED9	1.22	0.037	No	Yes	0.67
TRAT1	1.15	0.042	No	Yes	0.65
C3	1.10	0.048	Yes	Yes	0.63
HP	1.05	0.049	Yes	Yes	0.61
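The final filtering step described in the caption of Table A4 can be expressed as a short script. The sketch below copies the rows of Table A4 and applies the two numeric cutoffs named in the text (FDR < 0.05 and ROC AUC > 0.6). The `Row` type and the `prioritize` function are hypothetical names, and the earlier pipeline stages (clustering, miRNA and tissue filters) are not reproduced.

```python
from typing import NamedTuple


class Row(NamedTuple):
    gene: str
    log2fc: float      # log2 fold change, LTBi vs. Active
    fdr: float
    mirna_target: bool
    auc: float


# Values copied from Table A4.
TABLE_A4 = [
    Row("CCL2", 2.34, 0.001, True, 0.91),
    Row("CXCL10", 2.05, 0.003, True, 0.88),
    Row("HLA-DQA1", 1.89, 0.007, True, 0.82),
    Row("CD209", 1.76, 0.011, True, 0.79),
    Row("SLC11A1", 1.95, 0.015, True, 0.75),
    Row("CD1B", 1.67, 0.021, False, 0.74),
    Row("PSMB9", 1.52, 0.025, True, 0.72),
    Row("RPL17", 1.48, 0.030, False, 0.71),
    Row("SETMAR", 1.31, 0.033, True, 0.69),
    Row("TMED9", 1.22, 0.037, False, 0.67),
    Row("TRAT1", 1.15, 0.042, False, 0.65),
    Row("C3", 1.10, 0.048, True, 0.63),
    Row("HP", 1.05, 0.049, True, 0.61),
]


def prioritize(rows, fdr_cutoff=0.05, auc_cutoff=0.6):
    """Keep rows passing both cutoffs, ranked by AUC (highest first)."""
    kept = [r for r in rows if r.fdr < fdr_cutoff and r.auc > auc_cutoff]
    return sorted(kept, key=lambda r: r.auc, reverse=True)


ranked = prioritize(TABLE_A4)
top = [r.gene for r in prioritize(TABLE_A4, auc_cutoff=0.85)]
print(len(ranked), "genes pass; AUC > 0.85:", top)
```

All 13 rows of Table A4 pass the default cutoffs, and raising the AUC cutoff to 0.85 leaves CCL2 and CXCL10, the two genes highlighted in Figure 8.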

Appendix E

Table A5. Summary of biomarker expression and diagnostic performance in latent tuberculosis infection (LTBi).

Gene	Biological Role	Expression in LTBi	AUC	Diagnostic Interpretation
CCL2	Monocyte recruitment, inflammation	Upregulated	0.91	Strong candidate; previously linked to LTBi and TB pathogenesis
CXCL10	IFN-related chemokine; leukocyte trafficking	Upregulated	0.88	High diagnostic value; involved in immune response modulation
TNF	Granuloma formation, immune regulation	Upregulated	0.85	Key cytokine in TB immunity; useful for classification
SLC11A1	Iron transport, innate immunity	Not significant	0.68	Moderate; inconsistent across populations
HLA-DQA1	Antigen presentation	Not significant	0.63	Limited utility in LTBi; low differential expression
CD209	Dendritic cell receptor; pathogen recognition	Not significant	0.60	Known TB relevance; expression not significantly altered
CD1B	Lipid antigen presentation	Not significant	0.58	Below diagnostic threshold; further validation needed
SETMAR	DNA repair and regulation	Not significant	0.56	Low diagnostic relevance in current data
TRAT1	T-cell receptor signaling modulator	Not significant	0.55	Weak signal; expression not strongly altered
PSMB9	Antigen processing	Not significant	0.54	Moderate enrichment; low expression difference
TMED9	Vesicle transport	Not significant	0.53	Limited relevance to TB immunity
RPL17	Ribosomal protein	Not significant	0.51	Unlikely to be TB-specific marker
TIRAP	TLR pathway adapter protein	Not significant	0.50	Known immune gene; not differentially expressed in LTBi

References

Bagcchi, S. WHO’s global tuberculosis report 2022. Lancet Microbe 2023, 4, e20.
Kontsevaya, I.; Cabibbe, A.M.; Cirillo, D.M.; DiNardo, A.R.; Frahm, N.; Gillespie, S.H.; Holtzman, D.; Meiwes, L.; Petruccioli, E.; Reimann, M.; et al. Update on the diagnosis of tuberculosis. Clin. Microbiol. Infect. 2023, 30, 1115–1122.
Esmail, H.; Barry, C., 3rd; Young, D.; Wilkinson, R. The ongoing challenge of latent tuberculosis. Philos. Trans. R. Soc. Biol. Sci. 2014, 369, 20130437.
Ferebee, S. Controlled chemoprophylaxis trials in tuberculosis. Adv. Tuberc. Res. 1969, 17, 28–106.
Shiloh, M.U. Mechanisms of mycobacterial transmission: How does Mycobacterium tuberculosis enter and escape from the human host. Future Microbiol. 2016, 11, 1503–1506.
WHO. Global Tuberculosis Report 2023; WHO: Geneva, Switzerland, 2023.
Houben, R.; Dodd, P. The global burden of latent tuberculosis infection: A re-estimation using mathematical modelling. PLoS Med. 2016, 13, e1002152.
Getahun, H.; Matteelli, A.; Chaisson, R.; Raviglione, M. Latent Mycobacterium tuberculosis infection. N. Engl. J. Med. 2015, 372, 2127–2135.
Barry, C.E., 3rd; Boshoff, H.I.; Dartois, V.; Dick, T.; Ehrt, S.; Flynn, J.; Schnappinger, D.; Wilkinson, R.J.; Young, D. The spectrum of latent tuberculosis: Rethinking the biology and intervention strategies. Nat. Rev. Microbiol. 2009, 7, 845–855.
Kwan, P.K.W.; Periaswamy, B.; De Sessions, P.F.; Lin, W.; Molton, J.S.; Naftalin, C.M.; Naim, A.N.M.; Hibberd, M.L.; Paton, N.I. A blood RNA transcript signature for TB exposure in household contacts. BMC Infect. Dis. 2020, 20, 403.
Sambarey, A.; Devaprasad, A.; Baloni, P.; Mishra, M.; Mohan, A.; Tyagi, P.; Singh, A.; Akshata, J.; Sultana, R.; Buggi, S.; et al. Meta-analysis of host response networks identifies a common core in tuberculosis. NPJ Syst. Biol. Appl. 2017, 3, 4.
Pai, M.; Zwerling, A.; Menzies, D. Systematic review: T-cell-based assays for the diagnosis of latent tuberculosis infection: An update. Ann. Intern. Med. 2008, 149, 177–184.
Lewinsohn, D. Official American Thoracic Society/Infectious Diseases Society of America/Centers for Disease Control and Prevention clinical practice guidelines: Diagnosis of tuberculosis in adults and children. Clin. Infect. Dis. 2017, 64, 111–115.
Sweeney, T. Multi-cohort analysis of whole-blood gene expression data reveals a 3-gene signature for diagnosing pediatric tuberculosis. Nat. Commun. 2016, 7, 11202.
Mistry, R.; Cliff, J.; Clayton, C.; Beyers, N.; Mohamed, Y.; Wilson, P.; van Helden, P.; Duncan, K.; Lukey, P. Gene-expression patterns in whole blood identify subjects at risk for recurrent tuberculosis. J. Infect. Dis. 2007, 195, 357–365.
Zak, D. A blood RNA signature for tuberculosis disease risk: A prospective cohort study. Lancet 2016, 387, 2312–2322.
Berry, M. An interferon-inducible neutrophil-driven blood transcriptional signature in human tuberculosis. Nature 2010, 466, 973–977.
Libbrecht, M.; Noble, W. Machine learning applications in genetics and genomics. Nat. Rev. Genet. 2015, 16, 321–332.
Esteva, A. A guide to deep learning in healthcare. Nat. Med. 2019, 25, 24–29.
Rani, J.; Shah, A.R.; Ramachandran, S. pubmed.mineR: An R package with text-mining algorithms to analyse PubMed abstracts. J. Biosci. 2015, 40, 671–682.
Gusenbauer, M.; Haddaway, N.R. Which academic search systems are suitable for systematic reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources. Res. Synth. Methods 2020, 11, 181–217.
Cantrell, A.; Booth, A.; Chambers, D. A systematic review case study of urgent and emergency care configuration found citation searching of Web of Science and Google Scholar of similar value. Health Inf. Libr. J. 2024, 41, 166–181.
Winter, D.J. Rentrez: An R package for the NCBI eUtils API. R J. 2017, 9, 520–526.
Gautier, L.; Cope, L.; Bolstad, B.M.; Irizarry, R.A. affy—Analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 2004, 20, 307–315.
Ritchie, M.E.; Phipson, B.; Wu, D.; Hu, Y.; Law, C.W.; Shi, W.; Smyth, G.K. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015, 43, e47.
Leek, J.T.; Storey, J.D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007, 3, e161.
Yu, G. Gene ontology semantic similarity analysis using GOSemSim. In Stem Cell Transcriptional Networks: Methods and Protocols; Springer: Berlin/Heidelberg, Germany, 2020; pp. 207–215.
Gu, Z.; Hübschmann, D. simplifyEnrichment: A Bioconductor package for clustering and visualizing functional enrichment results. Genom. Proteom. Bioinform. 2023, 21, 190–202.
Consortium, G.O. Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res. 2017, 45, D331–D338.
Kanehisa, M.; Furumichi, M.; Tanabe, M.; Sato, Y.; Morishima, K. KEGG: New perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017, 45, D353–D361.
Piñero, J.; Ramírez-Anguita, J.M.; Saüch-Pitarch, J.; Ronzano, F.; Centeno, E.; Sanz, F.; Furlong, L.I. The DisGeNET knowledge platform for disease genomics: 2019 update. Nucleic Acids Res. 2020, 48, D845–D855.
Marchesin, S.; Silvello, G. TBGA: A large-scale gene-disease association dataset for biomedical relation extraction. BMC Bioinform. 2022, 23, 111.
Chandan, K.; Gupta, M.; Sarwat, M. Role of host and pathogen-derived microRNAs in immune regulation during infectious and inflammatory diseases. Front. Immunol. 2020, 10, 3081.
Yang, T.; Ge, B. miRNAs in immune responses to Mycobacterium tuberculosis infection. Cancer Lett. 2018, 431, 22–30.
Baykan, A.H.; Sayiner, H.S.; Aydin, E.; Koc, M.; Inan, I.; Erturk, S.M. Extrapulmonary tuberculosis: An old but resurgent problem. Insights Imaging 2022, 13, 39.
Seung, K.J.; Keshavjee, S.; Rich, M.L. Multidrug-resistant tuberculosis and extensively drug-resistant tuberculosis. Cold Spring Harb. Perspect. Med. 2015, 5, a017863.
Sichen, L.; Rui, W.; Yue, Y.; Xin, L.; Youbin, C.; Ze, T.; Hongfei, C. Analysis of drug resistance in pulmonary tuberculosis patients with positive sputum tuberculosis culture in Northeast China. Front. Pharmacol. 2023, 14, 1263726.
Hussain, R.; Ansari, A.; Talat, N.; Hasan, Z.; Dawood, G. CCL2/MCP-I genotype-phenotype relationship in latent tuberculosis infection. PLoS ONE 2011, 6, e25803.
Biswas, S.K.; Mittal, M.; Sinha, E.; Singh, V.; Arela, N.; Bajaj, B.; Tiwari, P.K.; Katoch, V.M.; Mohanty, K.K. Exploring the role of C-C motif chemokine ligand-2 single nucleotide polymorphism in pulmonary tuberculosis: A genetic association study from North India. J. Immunol. Res. 2020, 2020, 1019639.
Kumar, N.P.; Moideen, K.; Nancy, A.; Viswanathan, V.; Shruthi, B.S.; Sivakumar, S.; Natarajan, M.; Kornfeld, H.; Babu, S. Plasma chemokines are biomarkers of disease severity, higher bacterial burden and delayed sputum culture conversion in pulmonary tuberculosis. Sci. Rep. 2019, 9, 18217.
Hasan, Z.; Cliff, J.M.; Dockrell, H.M.; Jamil, B.; Irfan, M.; Ashraf, M.; Hussain, R. CCL2 responses to Mycobacterium tuberculosis are associated with disease severity in tuberculosis. PLoS ONE 2009, 4, e8459.
Nonghanphithak, D.; Reechaipichitkul, W.; Namwat, W.; Lulitanond, V.; Naranbhai, V.; Faksri, K. Genetic polymorphisms of CCL2 associated with susceptibility to latent tuberculous infection in Thailand. Int. J. Tuberc. Lung Dis. 2016, 20, 1242–1248.
Hu, B.; Wang, Y.; Wang, Z.; He, X.; Wang, L.; Yuan, D.; He, Y.; Jin, T.; He, S. Association of SLC11A1 polymorphisms with tuberculosis susceptibility in the Chinese Han population. Front. Genet. 2022, 13, 899124.
Zhang, Y.; Xue, Y.; Zhao, M.; Wang, H.; Li, J.; Liu, J.; Li, F.; Zhou, J. Association of TIRAP(MAL) gene polymorphisms with susceptibility to tuberculosis in a Chinese population. Genet. Mol. Res. 2011, 10, 7–15.
Chihab, L.Y.; Kuan, R.; Phillips, E.J.; Mallal, S.A.; Rozot, V.; Davis, M.M.; Scriba, T.J.; Sette, A.; Peters, B.; Lindestam Arlehamn, C.S.; et al. Expression of specific HLA class II alleles is associated with an increased risk for active tuberculosis and a distinct gene expression profile. HLA 2023, 101, 124–137.
Bartel, D.P. MicroRNAs: Genomics, biogenesis, mechanism, and function. Cell 2004, 116, 281–297.
Castro, I.; Carvajal, P.; Jara, D.; Aguilera, S.; Heathcote, B.; Barrera, M.J.; Aliaga-Tobar, V.; Maracaja-Coutinho, V.; Urzúa, U.; Quest, A.F.G.; et al. Small RNA Expression Profiling Reveals hsa-miR-181d-5p Downregulation Associated With TNF-a Overexpression in Sjögren’s Syndrome Patients. Front. Immunol. 2022, 13, 870094.
Yao, S.; Liu, B.; Hu, X.; Tan, Y.; Liu, K.; He, M.; Wu, B.; Ahmad, N.; Su, X.; Zhang, Y.; et al. Diagnostic value of microRNAs in active tuberculosis based on quantitative and enrichment analyses. Diagn. Microbiol. Infect. Dis. 2024, 108, 116172.
Liu, Y.; Wang, X.; Jiang, J.; Cao, Z.; Yang, B.; Cheng, X. Modulation of T cell cytokine production by miR-144* with elevated expression in patients with pulmonary tuberculosis. Mol. Immunol. 2011, 48, 1084–1090.
Zarantonello, A.; Revel, M.; Grunenwald, A.; Roumenina, L.T. C3-dependent effector functions of complement. Immunol. Rev. 2023, 313, 120–138.
Senbagavalli, P.; Kumar, N.; Kaur, G.; Mehra, N.K.; Geetha, S.; Ramanathan, V.D. Major histocompatibility complex class III (C2, C4, factor B) and C3 gene variants in patients with pulmonary tuberculosis. Hum. Immunol. 2011, 72, 173–178.
Naryzhny, S.; Legina, O. Haptoglobin as a biomarker. Biomeditsinskaya Khimiya 2021, 67, 105–118.
Ulrichs, T.; Moody, D.B.; Grant, E.; Kaufmann, S.H.; Porcelli, S.A. T-cell responses to CD1-presented lipid antigens in humans with Mycobacterium tuberculosis infection. Infect. Immun. 2003, 71, 3076–3087.
Kim, M.; Serwa, R.A.; Samluk, L.; Suppanz, T.M.; Elancheliyan, P.; Tsegaye, B.; Oeljeklaus, S.; Wasilewski, M.; Warscheid, B.; Chacinska, A. Immunoproteasome-specific subunit PSMB9 induction is required to regulate cellular proteostasis upon mitochondrial dysfunction. Nat. Commun. 2023, 14, 4092.
Tellier, M. Structure, activity, and function of SETMAR protein lysine methyltransferase. Life 2021, 11, 1342.
Domingo-Gonzalez, R.; Prince, O.; Cooper, A.; Khader, S. Cytokines and Chemokines in Mycobacterium tuberculosis Infection. Microbiol. Spectr. 2016, 4.
Sigal, G.; Segal, M.; Mathew, A.; Jarlsberg, L.; Wang, Y.; Baruch, N.; Honda, E.; Weiner, M.; Davis, J.; Nahid, P. CXCL10 is elevated in latent tuberculosis and associated with increased immune activation: A meta-analysis of transcriptomic studies. PLoS ONE 2021, 16, e0249492.
Yuk, J.M.; Kim, J.; Kim, I.; Jo, E.K. TNF in Human Tuberculosis: A Double-Edged Sword. Immune Netw. 2024, 24, e4.
Chihab, L.Y.; Ouarour, A.; Lahlou, A.; Rifi, F.E.; El Mzibri, M.; Benjouad, A. HLA class II allele expression associated with tuberculosis susceptibility in a Moroccan population. Hum. Immunol. 2023, 84, 12–19.
Alam, A.; Imam, N.; Siddiqui, M.F.; Ali, M.K.; Ahmed, M.M.; Ishrat, R. Human gene expression profiling identifies key therapeutic targets in tuberculosis infection: A systematic network meta-analysis. Infect. Genet. Evol. 2021, 87, 104649.
Frahm, M.; Goswami, N.D.; Owzar, K.; Hecker, E.; Mosher, A.; Cadogan, E.; Nahid, P.; Ferrari, G.; Stout, J.E. Discriminating between latent and active tuberculosis with multiple biomarker responses. Tuberculosis 2011, 91, 250–256.
Jacobsen, M.; Repsilber, D.; Gutschmidt, A.; Neher, A.; Feldmann, K.; Mollenkopf, H.J.; Ziegler, A.; Kaufmann, S.H.E. Candidate biomarkers for discrimination between infection and disease caused by Mycobacterium tuberculosis. J. Mol. Med. 2007, 85, 613–621.

Figure 1. Comprehensive workflow for the identification and validation of latent tuberculosis infection (LTBi) biomarkers. The pipeline begins with data acquisition and preprocessing from four GEO datasets. Differential expression analysis yielded 12,256 DEGs, which were then subjected to GO enrichment, identifying 6345 enriched terms. Of these DEGs, 7610 genes with statistically significant GO annotations were retained. These genes were then clustered into 20 groups based on semantic similarity, from which 8 biologically relevant clusters (305 genes) were selected. Candidate biomarkers were validated using pathway enrichment, miRNA interaction, and protein expression analysis.

Figure 2. Enrichment analysis of the gene ontology (GO) terms showing three major clusters related to tuberculosis pathophysiology: pulmonary tuberculosis, active tuberculosis, and latent tuberculosis infection (LTBi). This analysis was conducted on a refined list of 305 genes, which were selected from an initial set of 12,256 differentially expressed genes (DEGs) based on GO enrichment filtering and clustering. These 305 genes originate from the 8 biologically relevant clusters identified as most functionally associated with LTBi.

Figure 3. Gene ontology (GO) enrichment analysis of a combination of TB-related terms, used to validate the three selected cluster terms.

Figure 4. Gene concept network showing the functional relationship between enriched biological processes and associated genes. The 13 prioritized biomarkers are displayed in bold text with enhanced node styling (thicker border and distinct color) to highlight their central role in the latent tuberculosis infection (LTBi) signature.

Figure 5. Venn diagram illustrating the overlap of differentially expressed genes (DEGs) among active tuberculosis (TB), latent tuberculosis infection (LTBi), and general tuberculosis infection (TBi). Shared and unique DEGs are shown across these conditions, providing insight into the molecular distinctions and commonalities across TB states.

Figure 6. Protein expression levels of the 13 candidate biomarkers measured in the LTBi and Control groups using ELISA/Western blot assays. Statistical comparisons were conducted using either the Student’s t-test or the Wilcoxon rank-sum test depending on normality (assessed via the Shapiro–Wilk test). p-values were corrected for multiple comparisons using the Benjamini–Hochberg (BH) procedure. Significance thresholds are indicated as follows: **** adjusted p < 0.0001. Boxplots show the distribution of expression values across the samples in each group.
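The Benjamini–Hochberg (BH) adjustment named in the Figure 6 caption can be illustrated with a minimal sketch. It covers only the multiple-comparison step, not the Shapiro–Wilk check or the choice between the t-test and the Wilcoxon rank-sum test. The function name `benjamini_hochberg` and the raw p-values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five comparisons (illustrative only).
raw = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(raw)
for p, q in zip(raw, adj):
    print(f"raw p = {p:.3f}  adjusted p = {q:.5f}")
```

In this sketch, each raw p-value is scaled by the number of tests divided by its rank, and a running minimum keeps a smaller raw p-value from receiving a larger adjusted value than the one ranked above it.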

Figure 7. Comparison of the RNA (microarray) and protein (ELISA/Western blot) expression levels of the 13 candidate biomarkers in the LTBi and Control groups. Genes such as CCL2, CXCL10, and TNF demonstrated consistent upregulation across both data types, strengthening their diagnostic relevance.

Figure 8. ROC curve analysis of the 13 candidate biomarkers for LTBi vs. Control classification. Full curves are shown over the 0–1 range for both sensitivity and 1 − specificity. Biomarkers such as CCL2 and CXCL10 achieved high AUC values (>0.85), indicating strong diagnostic potential. See Appendix D Table A4 for the numerical AUC values per biomarker.
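As an illustration of what an AUC value such as those in Figure 8 measures, the sketch below computes the AUC for a single biomarker as the probability that a randomly chosen LTBi sample has a higher expression value than a randomly chosen control, with ties counted as one half. This is the rank-based (Mann–Whitney) formulation of the AUC and is not necessarily how the study computed its values. The function name `roc_auc` and the expression values are hypothetical.

```python
def roc_auc(case_scores, control_scores):
    """AUC as the fraction of (case, control) pairs ranked correctly.

    A pair counts 1 when the case score exceeds the control score
    and 0.5 when the two scores are tied.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical expression values for one biomarker (illustrative only).
ltbi = [8.1, 7.4, 9.0, 6.9]
control = [6.2, 7.0, 6.9, 5.8]
auc = roc_auc(ltbi, control)
print(f"AUC = {auc:.5f}")
```

An AUC of 0.5 corresponds to no discrimination between the groups, and 1.0 to perfect separation.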

Table 1. MicroRNAs linked to potential LTBi biomarker genes. The table presents miRNA identifiers along with the number of associated target genes, as identified through network analysis.

Mature miRNA ID	Number of Target Genes
hsa-miR-181b-5p	52
hsa-miR-181a-5p	51
hsa-miR-181d-5p	50
hsa-miR-181c-5p	49
hsa-miR-4262	46
hsa-miR-4263	43
hsa-miR-452-3p	38
hsa-miR-592	37
hsa-miR-509-3p	36
hsa-miR-130a-3p	35
hsa-miR-130b-3p	35
hsa-miR-519a-3p	35
hsa-miR-519b-3p	34
hsa-miR-519c-3p	34
hsa-miR-301a-3p	34
hsa-miR-301b-3p	33
hsa-miR-34a-3p	33
hsa-miR-506-3p	33
hsa-miR-939-5p	33
hsa-miR-150-5p	32
hsa-miR-19b-3p	32
hsa-miR-545-5p	32
hsa-miR-577	32
hsa-miR-124-3p	31
hsa-miR-19a-3p	31
hsa-miR-4295	31
hsa-miR-454-3p	31
hsa-miR-148a-5p	30
hsa-miR-3163	29
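A per-miRNA count like the one in Table 1 can be derived from a list of miRNA–gene interaction pairs. The sketch below assumes that each miRNA's count is the number of distinct target genes linked to it and that duplicate pairs are counted once; the source does not state either detail. The function name `count_targets` and the identifiers in `edges` are placeholders, not data from the study.

```python
from collections import Counter


def count_targets(pairs):
    """Count distinct target genes per miRNA, highest count first."""
    # Deduplicate (miRNA, gene) pairs so repeated edges count once.
    unique_pairs = set(pairs)
    counts = Counter(mirna for mirna, _gene in unique_pairs)
    # Sort by descending count, then by miRNA name for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Placeholder interaction pairs (hypothetical identifiers, illustrative only).
edges = [
    ("miR-X", "GENE_A"),
    ("miR-X", "GENE_B"),
    ("miR-Y", "GENE_A"),
    ("miR-X", "GENE_B"),  # duplicate pair, counted once
]
table = count_targets(edges)
for mirna, n_targets in table:
    print(f"{mirna}\t{n_targets}")
```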

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
