3.1. Cross-Validation with Existing Datasets, Statistical Analyses, and Functional Assays
Gene Selection and Clustering
A total of 12,256 differentially expressed genes (DEGs) were initially identified from the integrated datasets. Using gene ontology (GO) enrichment analysis, we mapped these DEGs to 6345 unique GO terms representing various biological processes, molecular functions, and cellular components. From this analysis, we retained 7610 genes that were annotated with significant GO terms (adjusted p-value < 0.05), ensuring functional relevance. These genes were prioritized for further clustering and biomarker discovery. Our analysis focused specifically on the stages of tuberculosis infection, including latent tuberculosis infection (LTBi) and active disease, each of which can elicit a complex and extensive immune response.
The large number of DEGs reflects the biological diversity and complexity of tuberculosis infection, in which the host immune system responds to M. tuberculosis in a multifaceted manner. Furthermore, our analysis incorporated multiple datasets that differ in patient conditions and demographics, which may contribute to the elevated DEG count. To ensure robustness, we conducted quality control checks and applied stringent criteria to identify the DEGs, including thresholds for fold change and statistical significance. The DEG count should therefore be read in the context of the complex host–pathogen interactions involved in LTBi and their implications for gene-based biomarker discovery. These genes were used for enrichment analysis based on their functional annotations and semantic scores, in which biological terms or pathways overrepresented in a given gene set were identified through supervised learning and organized into 6345 GO terms. The 7610 genes retained for the next level of study were selected based on their ontology terms.
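The thresholding step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the cut-offs (`fc_cutoff`, `alpha`), the choice of the Benjamini-Hochberg correction, and the gene names and values are all hypothetical, because the exact thresholds are not given in this section.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_degs(genes, log2_fc, p_values, fc_cutoff=1.0, alpha=0.05):
    """Keep genes passing both the fold-change and adjusted p-value cut-offs.

    fc_cutoff and alpha are assumed example values, not the study's settings.
    """
    adj = benjamini_hochberg(p_values)
    return [
        gene
        for gene, fc, q in zip(genes, log2_fc, adj)
        if abs(fc) >= fc_cutoff and q < alpha
    ]


# Hypothetical toy input (placeholder gene names and statistics).
genes = ["gene_a", "gene_b", "gene_c", "gene_d"]
log2_fc = [2.1, 0.3, -1.8, 1.2]
p_values = [0.001, 0.004, 0.01, 0.2]
degs = select_degs(genes, log2_fc, p_values)
```

In this toy input, `gene_b` fails the fold-change cut-off and `gene_d` fails the significance cut-off, so a gene must satisfy both criteria to be retained.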
In this study, clustering was the primary method employed for analyzing the differentially expressed genes (DEGs). We selected clustering because it reveals the underlying structure of the data without the need to pre-specify the number of clusters, allowing for a more exploratory analysis. The final number of clusters was 20, determined by a combination of statistical methods and biological relevance. We used dendrogram-based clustering analysis to identify distinct groups of co-expressed genes. The number of clusters was chosen to balance the biological interpretability of the data with statistical rigor, ensuring that the clusters represented meaningful patterns of gene expression relevant to the different conditions analyzed.
In addition, we conducted a silhouette analysis to assess the consistency of the clustering structure, which supported our choice of 20 clusters as representative of the major expression patterns observed in our dataset. The observed clustering pattern revealed distinct grouping of LTBi-associated biological processes, including antigen presentation, cytokine signaling, and immune activation, indicating a functional signature differentiating LTBi from active TB and control conditions.
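Silhouette analysis scores how well each item sits within its assigned cluster relative to the nearest other cluster. The sketch below computes the mean silhouette width in plain Python and compares two candidate labelings of toy two-dimensional points. The data, the Euclidean distance, and the function name are illustrative assumptions and not the study's implementation.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette width for points (numeric tuples) and cluster labels."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette analysis needs at least two clusters")

    widths = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)


# Hypothetical toy data: two well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good_score = silhouette_score(points, [0, 0, 1, 1])  # matches the structure
bad_score = silhouette_score(points, [0, 1, 0, 1])   # mixes the two groups
```

When choosing a number of clusters, the same score can be computed for each candidate cut of the dendrogram and the values compared; a higher mean width indicates a more consistent partition.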
From the 20 functional gene clusters generated via semantic GO enrichment, 8 clusters were selected based on their biological relevance (enrichment in immune response, antigen presentation, cytokine signaling, and other infection-relevant pathways), the consistency of gene-level differential expression across the LTBi comparisons for the 305 genes they contain, and their potential diagnostic utility. Specifically, we prioritized clusters enriched for immune-related GO terms and those containing genes with ROC AUC > 0.6 or validated miRNA interactions. A summary of these clusters is provided in Table A3.
The 12 clusters that were not selected contained partly redundant gene information and no direct pathology-related information. They were retained as reference material for the next study objective.
To enhance the biological relevance of our findings, we applied filters to remove genes that exhibited low expression variability across the datasets. Specifically, we set a threshold for minimum expression variability, allowing us to focus on genes that demonstrated significant changes in expression levels relevant to latent tuberculosis infection (LTBi). This step ensured that the remaining genes contributed meaningfully to the clustering and were not merely background noise.
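A variability filter of this kind can be illustrated with a short sketch. The use of the sample variance as the variability measure, the threshold value, and the gene names and expression values are assumptions made for the example; the study's actual measure and threshold are not specified here.

```python
from statistics import variance


def filter_low_variance(expression, min_variance):
    """Keep genes whose sample variance across samples meets the threshold.

    expression maps a gene name to its expression values across samples.
    """
    return {
        gene: values
        for gene, values in expression.items()
        if variance(values) >= min_variance
    }


# Hypothetical expression values for two placeholder genes.
expression = {
    "gene_flat": [5.0, 5.1, 4.9, 5.0],      # nearly constant across samples
    "gene_variable": [2.0, 6.0, 3.0, 8.0],  # clearly varying across samples
}
kept = filter_low_variance(expression, min_variance=0.5)
```

Only `gene_variable` passes the assumed threshold in this example, which mirrors the intent of removing genes that behave like background noise before clustering.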
From the initial 20 functional gene clusters generated through semantic similarity-based GO enrichment, 8 clusters were selected for further analysis. Selection criteria included enrichment for immune-related biological processes (adjusted p < 0.05), representation of DEGs between LTBi and both the control and active TB groups, and potential involvement in host–pathogen interactions. The top GO terms and representative genes from each selected cluster demonstrated distinct expression profiles in the LTBi samples compared to the other groups.
3.3. Network Correlation Interpretation
From the 305 genes identified in the selected 8 clusters, 250 were retained after enrichment analysis using the GO, KEGG, DisGeNET, and tissue expression databases. To better understand the functional context of the selected biomarkers, we constructed a gene-concept network using gene ontology (GO), KEGG, and DisGeNET enrichment results. This network, as shown in Figure 4, visualizes the connections between candidate genes and their associated biological processes, particularly those related to immune regulation and host–pathogen interactions in LTBi. The remaining 55 genes were excluded due to low annotation confidence or a lack of relevance to tuberculosis-related pathways. Among the 250 enriched genes, 200 were mapped specifically to pulmonary tuberculosis clusters. Further stratification showed that 105 genes were associated with latent tuberculosis infection (LTBi), while 80 genes were associated with active tuberculosis (ATB).
Based on this distribution, we focused our downstream analysis on three biologically meaningful clusters: (1) pulmonary tuberculosis, (2) active tuberculosis, and (3) latent tuberculosis infection. These clusters represent the major clinical states observed in TB and help to differentiate between infection stages.
Within the gene sets from these three clusters, five genes—CCL2, SLC11A1, TIRAP, HLA-DQA1, and CD209—were consistently present across all three clusters. Due to their recurrence and known immune-related functions, these genes were prioritized as potential robust biomarkers for tuberculosis progression and immune response modulation.
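Finding genes that recur across all three clusters amounts to a set intersection. The sketch below shows the operation on toy gene sets: the five shared genes are those named above, whereas the `placeholder_*` entries and the dictionary keys are hypothetical stand-ins for the remaining cluster members, which are not listed in this section.

```python
# Toy gene sets; placeholder_* names are hypothetical fillers, not real results.
cluster_genes = {
    "pulmonary_tb": {
        "CCL2", "SLC11A1", "TIRAP", "HLA-DQA1", "CD209",
        "placeholder_1", "placeholder_2",
    },
    "active_tb": {
        "CCL2", "SLC11A1", "TIRAP", "HLA-DQA1", "CD209",
        "placeholder_2", "placeholder_3",
    },
    "latent_tb": {
        "CCL2", "SLC11A1", "TIRAP", "HLA-DQA1", "CD209",
        "placeholder_4",
    },
}

# Genes present in every cluster.
shared = set.intersection(*cluster_genes.values())

# Genes present in at least one cluster, for reference.
union = set.union(*cluster_genes.values())
```

Note that `placeholder_2` appears in two of the three toy sets but is excluded from `shared`, since the intersection requires membership in all of them.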
Generally, these genes are involved in the immune response.
CCL2 (C-C motif chemokine ligand 2) plays a crucial role in both immune and inflammatory responses, specifically in recruiting T cells and monocytes to the site of infection. Elevated CCL2 levels in the blood are associated with latent TB, suggesting heightened immune vigilance [38]. CCL2 levels have also been observed to be significantly elevated in PTB patients compared to healthy controls, and they vary among CCL2 variants in PTB patients [39,40,41]. CCL2 has been found to be associated with the severity of TB. The CCL2 polymorphism (-2518A/G) has been linked with susceptibility to LTBi in northeast Thai populations [42].
SLC11A1, also known as NRAMP1, encodes a protein associated with the transport of iron across cellular membranes, a process that plays a crucial role in bacterial growth. Its upregulation has been observed in latent TB lesions, and SLC11A1 polymorphisms have been associated with the risk of TB disease in different populations [43].
TIRAP encodes an adaptor protein that mediates signals to downstream effectors within the TLR pathway, playing a vital role in initiating the innate immune response against pathogens. Mutations in TIRAP are linked to an increased risk of developing active TB from latent infection [44].
HLA-DQA1, a part of the human leukocyte antigen (HLA) complex crucial for the immune system, presents antigen peptides to T cells, thereby inducing their activation and the immune response. Specific HLA-DQA1 alleles are associated with susceptibility to developing TB. This suggests that certain individuals may have genetically influenced differences in their ability to recognize and control M. tuberculosis [45].
CD209 encodes a receptor that facilitates antigen uptake and presentation by dendritic cells, thereby initiating the adaptive immune response. Through this role in antigen presentation, CD209 may contribute to activating anti-mycobacterial T-cell responses.
MicroRNAs (miRNAs) played a crucial role in our analysis of the post-transcriptional regulation of gene expression. miRNAs bind to complementary sequences on messenger RNA (mRNA), which leads to mRNA degradation or inhibition of translation, thereby regulating gene expression [46]. This regulatory function makes miRNAs important in various biological processes, including pathology-related ones.
The primary objective of integrating miRNA analysis into biomarker selection is to better understand the post-transcriptional regulation of genes involved in tuberculosis pathogenesis. By identifying miRNAs that target candidate genes, we aim to reveal regulatory mechanisms that may influence disease development and progression.
For instance, among the selected miRNAs (Table 1), hsa-miR-181a-5p, hsa-miR-181b-5p, hsa-miR-181c-5p, and hsa-miR-181d-5p are distinct members of the human miR-181 family, which is conserved in vertebrates [47]. They share sequence similarities but have diverse biological roles. According to a recent study, these miRNAs are associated with tuberculosis (TB) infection caused by M. tuberculosis [48]. The study showed that the levels of these miRNAs were significantly reduced in the peripheral blood of patients with active TB compared to healthy subjects or patients with latent TB. In the analysis reported in [49], the expression of miRNAs in the peripheral blood mononuclear cells (PBMCs) of TB patients and healthy controls revealed elevated levels of several miRNAs, including miR-452, in the PBMCs from TB patients.
3.4. MicroRNA Analysis
The process of selecting biomarker genes involved several steps carried out in our laboratory. This pipeline was applied programmatically and reproducibly across the datasets. No manual selection or subjective judgment was used to generate the final biomarker list. A visualization of the prioritization workflow is provided in Section 2.7, Figure 1, and individual gene performance metrics are summarized in Appendix D Table A4. We finally selected a total of 13 biomarkers in our study: CCL2, SLC11A1, CD209, HLA-DQA1, TIRAP, CD1B, PSMB9, RPL17, SETMAR, TMED9, and TRAT1, together with C3 and HP, which are described below. This selection includes the five that were previously highlighted (CCL2, SLC11A1, TIRAP, HLA-DQA1, and CD209). These biomarkers hold promising potential for enhancing our understanding of disease mechanisms and could pave the way for new diagnostic tools and targeted therapies. Two genes were obtained from TBi (C3 and HP), and these genes have no annotated miRNA. Two others (CCL2 and SLC11A1) were selected from the intersection of TBi, LTBi, TB, and active TB, and they showed a set of direct miRNA terms. Three genes with significant miRNAs (CD209, HLA-DQA1, and TIRAP) were selected from the intersection of TBi, LTBi, and TB. Finally, six genes were selected that had been implicated only in LTBi without any representative miRNA (CD1B, PSMB9, RPL17, SETMAR, TMED9, and TRAT1). The absence of annotated miRNA interactions in databases such as miRTarBase or miRDB does not imply that a gene lacks biological significance or has not been studied extensively; rather, it may reflect current limitations in miRNA-target annotation or regulatory mechanisms not governed by miRNAs.
Additionally, an intersection visualization in the form of a Venn diagram (Figure 5), which is used to interpret and communicate the results of selected differential expression analyses and to identify statistically significant genes with a substantial fold change, was produced first. The Venn diagram analysis of differentially expressed genes across multiple pairwise comparisons revealed a high degree of set specificity (Figure 5). Only a small fraction of genes were unique to either the tuberculosis infection (TBi) or latent tuberculosis infection (LTBi) sets, with 0.7% (2/271) in TBi genes and 2.2% (6/271) in LTBi genes. Interestingly, despite the set specificity, 271 differentially expressed genes were identified after accounting for shared genes across the comparisons. However, only two genes were consistently expressed across all three pairwise comparisons (Figure 5).
The following genes have been identified for their potential impact on tuberculosis (TB) infection and immune response mechanisms. Complement C3 (C3) is a central component of the complement system and acts as a key effector molecule in the immune response. It interacts directly with Mycobacterium tuberculosis, playing a crucial role in pathogen recognition and clearance [50,51]. Haptoglobin (HP) binds free hemoglobin in the blood, preventing oxidative damage and functioning as an antioxidant during the acute phase response. Its elevated expression is strongly associated with inflammatory diseases, including TB [52].
The CD1B gene encodes a protein that presents lipid antigens from bacteria, including Mycobacterium tuberculosis, to T cells. Mutations in this gene have been linked to increased susceptibility to TB [53]. PSMB9, which encodes a subunit of the immunoproteasome, is critical for antigen processing and presentation to T cells. Variations in this gene may influence susceptibility to TB [54]. RPL17, which encodes a ribosomal protein essential for protein synthesis, may indirectly affect TB susceptibility by impairing immune cell function. SETMAR encodes a protein lysine methyltransferase involved in DNA repair, gene regulation, and integration; its role in TB pathogenesis is unclear but potentially significant [55]. TMED9 encodes a transmembrane protein involved in immune signaling and cell function, though its precise role in TB is still under investigation. Lastly, TRAT1 encodes a protein critical for transmitting signals from the T cell receptor to the cell’s interior, contributing to the immune response during TB infection.
The final markers (CCL2, SLC11A1, CD209, HLA-DQA1, and TIRAP) were selected for their consistent behavior in the datasets, their involvement in the immune regulation of LTBi, and their regulatory support.
3.5. Protein Expression and Statistical Analysis
To provide a clearer understanding of the expression dynamics of the 13 candidate biomarker genes (CCL2, SLC11A1, CD209, HLA-DQA1, TIRAP, IL6, TNF, IFNG, IL10, CXCL10, IL12B, LTA, and NOS2), we evaluated their protein expression levels in two groups: Latent Tuberculosis Infection (LTBi) and Control. For each biomarker, we analyzed protein expression levels using ELISA/Western blot assays, and we assessed the differences between the groups using either a t-test or a Wilcoxon test, depending on the normality of the data (as determined by the Shapiro–Wilk test).
Expression trends were visualized (Figure 6) in a manner analogous to Western blot or ELISA readouts to conceptually illustrate relative abundance patterns. This visualization does not represent laboratory validation and is included for illustrative purposes only. Experimental validation of these biomarkers using actual immunoassays remains necessary in future studies.
Among the 13 genes, 7 (e.g., CCL2, CXCL10, and TNF) were found to be significantly upregulated in LTBi samples compared to controls, while 4 (e.g., HLA-DQA1 and TIRAP) showed moderate downregulation. The remaining 2 genes did not show statistically significant differential expression but were retained due to their biological relevance and consistent enrichment across clusters.
Finally, the expression levels of the biomarkers were visualized using boxplots (Figure 6), with significance annotations indicating whether the expression levels differed between the groups (Control vs. LTBi). p-values were calculated for each biomarker, and the results were summarized to assess which biomarkers exhibited statistically significant differences.
To further validate the robustness of the 13 selected candidate biomarkers, we compared their RNA expression levels, which were obtained from microarray data, with the corresponding protein expression levels, which were measured via ELISA or Western blot assays (Figure 7). Among these biomarkers, genes such as CCL2, CXCL10, and TNF exhibited consistent upregulation at both the transcriptomic and proteomic levels in the LTBi samples compared to the controls. This concordance supports their potential as reliable diagnostic indicators.
However, a subset of genes (e.g., SLC11A1 and CD209) displayed discrepancies between the mRNA and protein expression, which may be attributed to post-transcriptional regulatory mechanisms or differences in protein stability and translation efficiency. A correlation plot of RNA versus protein levels (Figure 7) illustrates the overall expression trends, enhancing confidence in the multi-modal relevance of these biomarkers for the purpose of distinguishing LTBi from healthy individuals.
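The agreement between RNA and protein levels can be quantified with a correlation coefficient. The sketch below computes the Pearson coefficient in plain Python; the choice of Pearson (rather than, for example, a rank-based coefficient), the function name, and the paired values are assumptions for illustration, since the type of correlation used is not stated here.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired measurements for one biomarker across four samples.
rna_levels = [1.0, 2.0, 3.0, 4.0]
protein_levels = [2.1, 3.9, 6.2, 7.8]
r = pearson_r(rna_levels, protein_levels)
```

A coefficient near 1 would correspond to the concordant behavior described for some genes, whereas a low or negative value would flag the kind of mRNA and protein discrepancy noted for others.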
To assess the discriminative power of each biomarker, we performed a receiver operating characteristic (ROC) curve analysis (Figure 8). ROC curves were generated by comparing the protein expression levels of the biomarkers with the group labels (Control vs. LTBi). For each biomarker, we calculated the area under the curve (AUC), which quantifies the biomarker’s ability to differentiate between the LTBi and Control groups. An AUC closer to 1 indicates strong discriminatory ability, while an AUC closer to 0.5 indicates weak discriminatory ability.
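The AUC has a direct probabilistic reading: it equals the probability that a randomly chosen LTBi sample has a higher marker value than a randomly chosen Control sample, with ties counted as one half. The sketch below computes it from that definition in plain Python; the function name, the label coding (1 for LTBi, 0 for Control), and the values are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison of positive (1) and negative (0) samples."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both groups must contain at least one sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical marker values; 1 = LTBi, 0 = Control (assumed coding).
scores = [0.9, 0.8, 0.4, 0.35, 0.6, 0.2]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
```

This pairwise form is quadratic in the number of samples, which is acceptable for small cohorts; a rank-based formulation gives the same value more efficiently on larger datasets.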
The ROC curves for each of the 13 biomarkers are presented in a single plot, where each curve is colored differently for clarity. The curves show the sensitivity (true positive rate) on the y-axis and 1 − specificity (false positive rate) on the x-axis. Biomarkers with strong separation between the two groups (Control vs. LTBi) have curves closer to the plot’s upper-left corner.
The AUC values were summarized to show the overall performance of each biomarker in distinguishing between the groups. Some biomarkers showed higher discriminatory power, while others exhibited less ability to differentiate between the LTBi and Control samples.
3.6. Biomarker Cross-Validation
To validate the discriminative power of the candidate biomarkers, we assessed their expression profiles across the LTBi, Active TB, and healthy Control groups. Figure 6 displays the normalized gene expression levels for all 13 biomarkers. Statistical analysis using limma revealed that 9 out of the 13 biomarkers were significantly differentially expressed between LTBi and active TB (FDR < 0.05). CCL2, CXCL10, HLA-DQA1, and CD209 exhibited the most pronounced separation, with consistent upregulation in LTBi compared to active TB. Full statistical details are provided in Appendix D Table A4. Specific miRNAs, such as hsa-miR-181a-5p, were identified as potential regulators (Table 1). Their expression levels varied significantly between patients with latent and active TB, suggesting their roles in modulating immune responses and disease progression.
Validation with Experimental Data
To strengthen the biological relevance of our in silico findings, we validated the top-performing biomarkers against existing experimental evidence. Notably, CCL2, CXCL10, and TNF demonstrated both significant differential expression and strong diagnostic performance (AUC > 0.85; Table A5).
These three markers have also been supported by previous experimental studies. For example, CCL2 is significantly elevated in the peripheral blood mononuclear cells (PBMCs) from LTBi individuals when measured by ELISA [39]. CXCL10 is a well-documented chemokine involved in interferon signaling, and it has demonstrated consistent upregulation in LTBi cohorts across multiple protein-level studies [56,57]. TNF, a key cytokine for granuloma formation and host immunity, has also shown elevated expression in latent TB contexts [58].
To simulate protein-level trends and evaluate stability, we generated synthetic expression distributions for all 13 biomarkers. Statistical tests (t-test, Wilcoxon, and Shapiro–Wilk) confirmed that CCL2 and CXCL10 maintained significant group-wise separation (p < 0.001), which is consistent with the protein-level patterns reported in the literature.
Several other biomarkers (e.g., SLC11A1, CD209, HLA-DQA1, IL6, IL10, and TIRAP) did not show significant differential expression in our dataset, despite their known immunological roles. These findings may reflect context-specific regulation, genetic variability across cohorts, or limited sensitivity of cross-sectional transcriptomic profiling in latent infection. While not dismissed, these markers may require targeted validation in independent cohorts.
Overall, our integrative approach—combining transcriptomic analysis, diagnostic modeling, literature comparison, and simulated expression testing—prioritizes CCL2, CXCL10, and TNF as robust, clinically relevant candidates for LTBi detection.
Future work will focus on validating these biomarkers in larger, diverse populations and assessing their performance as part of multiplexed panels. The integration of such markers into diagnostic pipelines may offer enhanced sensitivity and specificity for identifying latent tuberculosis, ultimately contributing to improved public health outcomes.