Transcriptional Profiling and Deriving a Seven-Gene Signature That Discriminates Active and Latent Tuberculosis: An Integrative Bioinformatics Approach

Tuberculosis (TB) is an infectious disease caused by Mycobacterium tuberculosis (M.tb.). Our integrative analysis aims to identify the transcriptional profiling and gene expression signature that distinguish individuals with active TB (ATB) disease, and those with latent tuberculosis infection (LTBI). In the present study, we reanalyzed a microarray dataset (GSE37250) from GEO database and explored the data for differential gene expression analysis between those with ATB and LTBI derived from Malawi and South African cohorts. We used BRB array tool to distinguish DEGs (differentially expressed genes) between ATB and LTBI. Pathway enrichment analysis of DEGs was performed using DAVID bioinformatics tool. The protein–protein interaction (PPI) network of most upregulated genes was constructed using STRING analysis. We have identified 375 upregulated genes and 152 downregulated genes differentially expressed between ATB and LTBI samples commonly shared among Malawi and South African cohorts. The constructed PPI network was significantly enriched with 76 nodes connected to 151 edges. The enriched GO term/pathways were mainly related to expression of IFN stimulated genes, interleukin-1 production, and NOD-like receptor signaling pathway. Downregulated genes were significantly enriched in the Wnt signaling, B cell development, and B cell receptor signaling pathways. The short-listed DEGs were validated in a microarray data from an independent cohort (GSE19491). ROC curve analysis was done to assess the diagnostic accuracy of the gene signature in discrimination of active and latent tuberculosis. Thus, we have derived a seven-gene signature, which included five upregulated genes FCGR1B, ANKRD22, CARD17, IFITM3, TNFAIP6 and two downregulated genes FCGBP and KLF12, as a biomarker for discrimination of active and latent tuberculosis. The identified genes have a sensitivity of 80–100% and specificity of 80–95%. 
Area under the curve (AUC) value of the genes ranged from 0.84 to 1. This seven-gene signature has a high diagnostic accuracy in discrimination of active and latent tuberculosis.


Introduction
Tuberculosis (TB) is an infectious disease caused by Mycobacterium tuberculosis (M.tb.). Tuberculosis remains a global health problem and is the leading cause of mortality from a single infectious agent. The World Health Organization (WHO) estimates that 10 million new cases of TB occurred in the year 2019 and 1.4 million people died from TB [1]. Most infections of M.tb. manifest as a clinically asymptomatic contained state, known as latent tuberculosis infection (LTBI), which affects one fourth of the global population [2,3]. There is heterogeneity in clinical state of LTBI, which includes individuals who may have eliminated the pathogen, as well as individuals with an incipient or subclinical active disease [4].
Chest X-ray is widely used as a screening tool for pulmonary TB. Sputum smear microscopy for acid-fast bacilli and sputum culture are the most commonly used methods worldwide for the diagnosis of pulmonary TB. Sputum culture is the gold standard for diagnosis of active tuberculosis (ATB), but it takes at least 6-7 days for a positive diagnosis and up to 42 days for a confirmed negative diagnosis, while sputum smear suffers from low sensitivity [5]. The GeneXpert MTB/RIF test and PCR-based assays serve as rapid tests for detection of TB but require sophisticated technology and well-trained staff and hence are not affordable in low-resource settings [6]. Besides, none of the available sputum-based tests can predict reactivation of TB. The tuberculin skin test (TST) and interferon γ release assay (IGRA) available for diagnosis of LTBI cannot accurately differentiate between LTBI and active TB [7].
There is a high risk of latent TB reactivation associated with certain risk factors, such as HIV infection, diabetes, malnutrition, immune suppressive treatment, and active smoking [8]. About 5-10% of latently infected individuals will progress to active TB during their lifetime [9]. Accurate and early diagnosis of active tuberculosis is required to control the TB disease. The World Health Organization's (WHO) End TB Strategy has set the goals of reducing TB incidence by 90% and TB deaths by 95% globally by 2035 [10]. Therefore, there is an urgent need for development of simple, non-sputum based, highly sensitive, and specific tests for diagnosis of M.tb. infection. In this context, blood-based gene expression signatures act as the most sought-after biomarkers for distinguishing individuals with active TB (ATB) disease from those with LTBI. In addition, the prognostic biomarkers that can predict the risk of active TB in individuals with LTBI would be of enormous value, so that preventive drug treatment can be offered [11]. The WHO, in conjunction with Foundation for Innovative Diagnostics (FIND) and working group of the Stop TB Partnership, has published target product profile (TPP) for non-sputum biomarker triage test, and diagnostic and predictive tests for progression of LTBI to ATB disease. The TPPs require a minimum 90% sensitivity and 70% specificity for a triage test, 65% sensitivity and 98% specificity for a diagnostic test [12], and 75% sensitivity and 75% specificity for a test to predict progression from LTBI to active TB disease over a two-year period [13]. Several studies in recent years have found that whole-blood RNA signatures can predict the active TB infection [14][15][16][17] and progression of M.tb. infection in persons at risk of developing active TB [18].
Transcriptional profiling identifies differentially expressed genes, which may have implications for the diagnosis and prognosis of a disease and may serve as drug targets. High-throughput methods used for transcriptional profiling include microarray analysis, RNAseq, PCR array, and NanoString. Among these techniques, microarray and RNAseq data are deposited by researchers in the Gene Expression Omnibus (GEO) database, where they are available for meta-analysis. The raw data from GEO can be reanalyzed, annotated, and illustrated in various ways, including gene set enrichment analysis (GSEA), KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway analysis, Reactome pathway analysis, and STRING protein-protein interaction network analysis. Thus, transcriptional profiling aids in identifying the differentially expressed genes, the associated pathways underlying the biology of the disease, the role of the host cellular and immune response, and the cell signaling mechanisms involved in the pathogenesis of TB disease. Our integrative analysis aimed to identify the transcriptional profile that distinguishes individuals with active TB (ATB) disease from those with latent tuberculosis infection (LTBI). We have short-listed a seven-gene signature, which included five upregulated genes and two downregulated genes. The short-listed genes were validated using bioinformatics analysis in an independent cohort, which showed a statistically significant difference in gene expression between ATB and LTBI. ROC curve analysis was done to assess the accuracy of the identified biomarkers in discrimination of active and latent tuberculosis.

Data Sources
Microarray datasets used in this study were retrieved from the National Center for Biotechnology Information's (NCBI) Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/, accessed on 12 March 2019), a public repository for gene expression data [19][20][21]. A search of GEO profiles related to ATB and LTBI samples using the terms "Tuberculosis" [Mesh terms] OR active tuberculosis [All fields] AND "Homo sapiens" [porgn] identified the dataset GSE37250, from a landmark study conducted by Kaforou et al. Thus, GSE37250 (platform GPL10558, Illumina Human HT-12 V4.0 expression bead chip) was used to explore the differential gene expression profile between ATB and LTBI [14]. The study sites chosen by Kaforou et al. covered both urban and rural populations in Africa. Cape Town, South Africa, is an urban setting with one of the highest TB incidence rates and high rates of HIV infection. The other study site, Karonga district, Northern Malawi, is a rural setting with a comparatively low TB incidence rate and stable HIV prevalence [14].
The actual dataset GSE37250 includes data on samples from those with active tuberculosis with/without HIV infection, latent tuberculosis with/without HIV infection, and other diseases with/without HIV infection from Malawi and South African population. In the present study, we downloaded and reanalyzed 51 ATB samples and 35 LTBI samples both without HIV co-infection from Malawi cohort and 46 ATB samples and 48 LTBI samples both without HIV co-infection from South African cohort.

Data Processing and Differential Gene Expression Analysis
BRB-ArrayTools is an analytic and visualization tool integrated into Excel for visualizing and statistically analyzing microarray gene expression data. We used the BRB-ArrayTools (V 4.6.0, stable version) class comparison tool [22] to identify the DEGs (differentially expressed genes) in peripheral blood of ATB and LTBI in the Malawi and South African cohorts, respectively. The GSE37250 data file [14] was imported and processed through spot filtering, normalization, and gene-filtering criteria (gene exclusion criteria ≤ 1.5-fold change and expression data value less than 20%). Class comparison analysis was performed between the ATB and LTBI classes, and a multivariate permutation test was computed based on 1000 random permutations and a false discovery rate of 1%. After identifying the DEGs for ATB vs. LTBI in the Malawi cohort and in the South African cohort, we listed the upregulated and downregulated genes common to both cohorts and illustrated these shared genes using the VENNY online web server (https://bioinfogp.cnb.csic.es/tools/venny/, accessed on 15 March 2019). A heatmap of the differentially expressed upregulated and downregulated genes was drawn using the online tool Heatmapper (http://www.heatmapper.ca/, accessed on 18 March 2019) [23]. A volcano plot was used to display the statistically significant genes in the form of a scatter plot. The volcano plot was constructed with the −log10 p value of the differentially expressed upregulated and downregulated genes on the Y-axis versus the log2 fold change of the DEGs on the X-axis. The volcano plot of the DEGs was prepared using Microsoft Excel 2010 (ver 14.0).
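The volcano-plot coordinates and the cohort overlap described above can be sketched in a few lines of Python. This is only an illustration, not the BRB-ArrayTools or VENNY procedure: the gene names and values are hypothetical, and the cutoffs (fold change 1.5, p value 0.01) are assumed here as example filtering thresholds.

```python
import math

def volcano_point(fold_change, p_value):
    """Return (log2 fold change, -log10 p value) for one gene."""
    return math.log2(fold_change), -math.log10(p_value)

def classify(fold_change, p_value, fc_cutoff=1.5, p_cutoff=0.01):
    """Label a gene 'up', 'down', or 'ns' (not significant)."""
    if p_value >= p_cutoff:
        return "ns"
    if fold_change >= fc_cutoff:
        return "up"
    if fold_change <= 1.0 / fc_cutoff:
        return "down"
    return "ns"

def shared_degs(calls_a, calls_b):
    """Genes called in the same direction in both cohorts (the Venn overlap)."""
    return {g: d for g, d in calls_a.items() if d != "ns" and calls_b.get(g) == d}

# Hypothetical (fold change, p value) pairs for two cohorts.
cohort_1 = {"GENE_A": (4.0, 1e-6), "GENE_B": (0.5, 1e-4),
            "GENE_C": (1.2, 0.2), "GENE_D": (2.0, 0.001)}
cohort_2 = {"GENE_A": (3.0, 1e-5), "GENE_B": (0.4, 1e-3),
            "GENE_C": (1.1, 0.5), "GENE_D": (0.5, 0.001)}

calls_1 = {g: classify(fc, p) for g, (fc, p) in cohort_1.items()}
calls_2 = {g: classify(fc, p) for g, (fc, p) in cohort_2.items()}
shared = shared_degs(calls_1, calls_2)
print(shared)
```

Requiring the same direction of change in both cohorts, rather than mere presence in both DEG lists, is a design choice of this sketch; a gene that is up in one cohort and down in the other is dropped.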

Gene Ontology and Pathway Enrichment Analysis of Top DEGs
The most upregulated and downregulated genes were further analyzed using the Database for Annotation, Visualization and Integrated Discovery (DAVID) bioinformatics tool (https://david.ncifcrf.gov/tools.jsp, accessed on 20 March 2019). DAVID is a web-based program that investigates and extracts biological meaning from a large list of genes [24]. In this study, functional annotation terms, overrepresented Gene Ontology (GO) terms, and KEGG (Kyoto Encyclopedia of Genes and Genomes) pathways with p value < 0.05 and enrichment score ≥ 3 were short-listed. Reactome pathway analysis (https://reactome.org/, accessed on 21 March 2019) [25] was performed to identify genes participating in a network of biological interactions/pathways. A p value < 0.05 was considered to be statistically significant. Functional annotation of DEGs and pathway analysis have been reported earlier in various diseases [26][27][28][29].
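As a rough illustration of how an enrichment p value of this kind is obtained, the sketch below computes a one-sided Fisher exact p value for a pathway and the more conservative EASE variant, in which one is subtracted from the number of list genes in the pathway before testing. The way the 2x2 table is laid out here, the function names, and the gene counts are assumptions made for the example; DAVID's own implementation may differ in detail.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """P(X >= a) for the 2x2 table [[a, b], [c, d]] with fixed margins.

    a: list genes in pathway      b: other genes in pathway
    c: list genes not in pathway  d: other genes not in pathway
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(total - row1, col1 - x)
               for x in range(a, upper + 1))
    return tail / comb(total, col1)

def enrichment_p(list_hits, list_total, pop_hits, pop_total, ease=False):
    """Enrichment p value; ease=True removes one list hit before testing."""
    b = pop_hits - list_hits
    c = list_total - list_hits
    d = pop_total - list_total - b
    a = max(list_hits - 1, 0) if ease else list_hits
    return fisher_one_sided(a, b, c, d)

# Hypothetical counts: 3 of 300 list genes fall in a pathway that
# contains 40 of the 30,000 background genes.
fisher_p = enrichment_p(3, 300, 40, 30000)
ease_p = enrichment_p(3, 300, 40, 30000, ease=True)
print(fisher_p, ease_p)
```

Because the EASE variant discounts one hit, it always gives a p value at least as large as the plain Fisher test, which penalizes pathways supported by only a handful of genes.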

Analysis of Protein-Protein Interaction Network
The protein-protein interaction (PPI) network of the most upregulated genes was constructed using the STRING online database (https://string-db.org/, accessed on 21 March 2019) [30], which predicts protein-protein interactions, including direct (physical) and indirect (functional) associations. The resulting network was imported into Cytoscape software (https://cytoscape.org/, accessed on 22 March 2019) [31]. The MCODE clustering method was used to find closely associated regions in this network, and two densely connected clusters were identified with MCODE score > 3.2. MCODE analysis was performed with default parameters: degree cutoff = 2, node score cutoff = 0.2, k-core = 2, and max. depth = 100. Important hub genes in the PPI network were identified using the cytoHubba plugin with the Maximal Clique Centrality (MCC) method, and the top 20 hub genes were ranked by MCC score. Functionally enriched PPI subnetworks were created using the ClueGO/CluePedia plugin for Cytoscape [32,33], with a two-sided hypergeometric test, a Benjamini-Hochberg corrected p value ≤ 0.05, and kappa score ≥ 0.4 as criteria. ClueGO was used to visualize the biological terms of large gene clusters [33].
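The Benjamini-Hochberg correction mentioned above can be illustrated with a minimal sketch. This is not ClueGO's code; the function name and the example p values are hypothetical and serve only to show how raw p values are converted to adjusted ones before the ≤ 0.05 criterion is applied.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    # Indices of the p values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p values for four enriched terms.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
significant = [q <= 0.05 for q in adj]
print(adj, significant)
```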

Validation of DEGs in An Independent Cohort Using Bioinformatics Analysis
The DEGs common to the Malawi and South African cohorts of active TB and latent TB infection were short-listed based on significant expression and fold change, and the genes were validated in an independent cohort. We searched the GEO database using the terms "Tuberculosis" [mesh terms] OR active tuberculosis [all fields] AND "Homo sapiens" [porgn] and chose the dataset GSE19491 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19491, accessed on 24 November 2021) [34] for validation of DEGs in an independent cohort. We randomly selected 20 ATB samples and 20 LTBI samples drawn from South African and UK cohorts. Only ATB samples collected before the start of treatment were selected for analysis. The GEO2R tool was used to identify the differentially expressed genes between ATB and LTBI. GEO2R performs differential expression analysis using the GEOquery and limma R packages from the Bioconductor project [21,35]. The GSE19491 dataset was analyzed with the default settings. Probe identifiers were annotated using NCBI-generated or submitter-supplied platform annotation, and quantile normalization was applied so that the samples had identical value distributions. In GEO2R, ATB and LTBI samples were assigned to their respective groups for analysis. A significance cutoff of p value < 0.05 and log2 fold change > 1 was applied to distinguish the differentially expressed genes in ATB vs. LTBI. The analysis listed the top 250 differentially expressed genes with their log2 fold change (FC) and p value, ranked by significance.
The gene expression profile of a particular gene was visualized by clicking on its gene ID. Each red bar in the graph represented the expression measurement extracted from the original submitter-supplied sample record. The sample expression values were converted to log2 values and displayed in a box plot.

Receiver Operating Characteristic Curve Analysis
Receiver operating characteristic (ROC)/area under the curve (AUC) analysis was performed in GraphPad Prism 8.0, using the log2 expression of LTBI samples as control and ATB samples as test. The accuracy and performance of each single biomarker gene were measured by calculating the area under the curve (AUC). Optimal cutoff values were selected, and the sensitivity and specificity of each biomarker were read from the curve.
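For readers unfamiliar with these quantities, the following sketch shows one way to compute a single-gene AUC, and sensitivity and specificity at a cutoff, from log2 expression values. It is not the GraphPad Prism procedure. The expression values are hypothetical, and choosing the cutoff by maximizing Youden's J (sensitivity + specificity − 1) is an assumption made for this example, since the criterion used to select the optimal cutoff is not stated.

```python
def auc(controls, cases):
    """Probability that a random case exceeds a random control (ties = 0.5)."""
    wins = sum(1.0 if c > k else 0.5 if c == k else 0.0
               for c in cases for k in controls)
    return wins / (len(cases) * len(controls))

def sens_spec(controls, cases, cutoff):
    """Sensitivity and specificity when values >= cutoff are called positive."""
    sensitivity = sum(c >= cutoff for c in cases) / len(cases)
    specificity = sum(k < cutoff for k in controls) / len(controls)
    return sensitivity, specificity

def best_cutoff(controls, cases):
    """Observed value that maximizes Youden's J (assumed selection rule)."""
    candidates = sorted(set(controls) | set(cases))
    return max(candidates, key=lambda t: sum(sens_spec(controls, cases, t)) - 1)

# Hypothetical log2 expression of one upregulated gene.
ltbi = [5.0, 5.5, 6.0, 6.5, 7.3]        # controls
atb = [7.2, 7.5, 8.0, 8.5, 9.0, 9.5]    # cases

area = auc(ltbi, atb)
cutoff = best_cutoff(ltbi, atb)
sensitivity, specificity = sens_spec(ltbi, atb, cutoff)
print(area, cutoff, sensitivity, specificity)
```

For a downregulated gene, the same functions can be applied after negating the expression values so that lower expression counts as the positive direction.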

Statistical Analysis
A p value of < 0.05 was considered to be statistically significant in short-listing the DEGs. For the GEO2R analysis, the log2 expression values of ATB and LTBI samples were illustrated in box plots, and statistical significance was calculated using the nonparametric Mann-Whitney U test. Statistical analysis and ROC/AUC analysis were done using GraphPad Prism 8.0.

Identification of Top Significant DEGs Derived from Microarray Dataset GSE37250
We downloaded the gene expression dataset (GSE37250) comprised of microarray data from ATB and LTBI samples from GEO database, based on GPL10558 platform (Illumina Human HT-12V4.0 expression bead chip). Class comparison analysis between ATB and LTBI groups identified 591 DEGs in Malawi cohort and 962 DEGs in South African cohort.
The volcano plot in Figure 1A shows the distribution of statistically significant DEGs, using p value < 0.01 and fold change > 1.5 as gene-filtering criteria. Differentially expressed genes common to the Malawi and South African cohorts of ATB and LTBI groups were identified, comprising 375 upregulated and 152 downregulated genes (Venn diagram in Figure 1B). FCGR1A, FCGR1B, FCGR1C, BATF2, GBP1, ANKRD22, GBP5, AIM2, GBP6, and CASP4 were ranked among the top overexpressed genes in ATB compared to LTBI, whereas NDRG2, KLF12, ANO9, CD79A, FCGBP, GZMK, CXCR3, MIEF2, CXCR5, and CD27 were the most downregulated genes in active TB vs. LTBI. The top 10 upregulated and downregulated genes of ATB and LTBI from the Malawi and South African cohorts are presented in Table 1A,B. The complete list of the most upregulated and downregulated genes is presented in Supplementary Table S1.

Pathway Enrichment and Functional Annotation of Top DEGs
We investigated the biological function of the top DEGs using the GO, KEGG, and Reactome pathway analysis tools available in the DAVID online bioinformatics tool (a modified Fisher's exact p value (EASE score) ≤ 0.05 was considered significant for a GO term/pathway). GO biological process analysis revealed that the upregulated genes were associated with innate immune response, leukocyte migration, antibacterial humoral response, defense response to bacterium, inflammatory response, platelet degranulation, killing of cells of other organisms, and regulation of apoptotic process (Table 2). On the other hand, downregulated genes were mainly related to Wnt signaling pathway, cell differentiation, inflammatory response, regulation of cell proliferation, and B cell receptor signaling pathway (Table 3). The most enriched KEGG pathways of upregulated genes were Staphylococcus aureus infection, leishmaniasis, tuberculosis, systemic lupus erythematosus, complement and coagulation cascade, and phagosome formation (Table 2). Downregulated genes were significantly enriched in primary immunodeficiency, cytokine-cytokine receptor interaction, hematopoietic cell lineage, Wnt signaling, and B cell receptor signaling pathway (Table 3). Reactome pathway analysis of the upregulated genes revealed that neutrophil degranulation, immune response, interferon signaling, interferon γ signaling, fibronectin matrix formation, RMTs methylate histone arginines, α defensins, and cytokine signaling in immune system were significantly enriched (Table 2). Additionally, the Reactome pathways for WNT target genes, TNF binding to their physiological receptors, centrosome maturation, and antigen-activated B cell receptor were significantly enriched among the downregulated genes (Table 3).

Analysis of Protein-Protein Interaction (PPI) Network among the DEGs and Identification of Hub Genes for the Upregulated Gene Network
The protein-protein interaction (PPI) network analysis of the most upregulated genes was performed using the STRING software tool to examine the functional interaction between the upregulated genes and to identify the hub genes in the network. The resulting PPI network was significantly enriched (p value < 1 × 10^−16), with 76 nodes connected by 151 edges (Figure 2A). Molecular interactions of closely associated nodes were inferred at a confidence level (combined score) of 0.6-0.9; nodes interacting with each other were arranged in groups with a combined score of 0.9. The hub nodes CLEC4D, CEACAM8, CEACAM1, CEACAM6, GPR84, FCAR, and MCEMP1 interacted with each other and were connected by 21 edges, with an MCODE score of 7. Another six nodes, OLFM4, DEFA4, SLPI, HP, CAMP, and TNFAIP6, were interconnected with each other (Figure 2B). A further six nodes, FCGR1B, FCGR1A, GBP1, GBP5, GBP6, and IFIT3, were interconnected in a group with a combined score of 0.7-0.9. The functional interaction of the PPI network was further analyzed in detail using the ClueGO/CluePedia plugin of Cytoscape.
MCODE analysis in Cytoscape identified the highly interconnected regions of the PPI network. CLEC4D, GPR84, CEACAM8, CEACAM1, CD177, FCAR, and MCEMP1 are seven nodes that were highly interlinked, with an MCODE score of 7. Another 13 nodes, CASP4, DEFA4, CAMP, FCGR1A, TLR2, CASP5, CR1, AIM2, SLPI, NAIP, OLFM4, FCGR2A, and HP, were interlinked with an MCODE score of 5. TLR2 is one such hub node, interconnected with other nodes. The interlinked nodes and their MCODE scores in each module are listed in Table 4A. The MCC method of the cytoHubba plugin ranked the nodes by centrality in the network. A total of 19 hub nodes showed the closest connections with other nodes (Figure 2D). The top-ranked nodes and their MCC scores are shown in Table 4B.

Deriving and Validation of a Seven-Gene Signature in Discrimination of Active and Latent Tuberculosis
We aimed to derive a short signature including both upregulated and downregulated genes for discriminating active and latent tuberculosis infection. Differential gene expression analysis was done in GEO2R by randomly selecting 20 samples of ATB and 20 samples of LTBI from an independent cohort (GSE19491). Based on fold change of the most differentially expressed genes in Malawi and South African cohort (Table 1A) The expression of top hub genes (Table 4B) was also checked in the independent cohort. The expression of hub genes was as follows: CEACAM8 (0.97-fold up), CEACAM1 After obtaining these results, a new random set of 20 ATB and 20 LTBI samples was drawn from the validation cohort, and the differential gene expression analysis was performed again using GEO2R. This was repeated twice. The differentially expressed genes that were consistently expressed in the independent cohort were chosen. Thus, the most differentially expressed upregulated genes FCGR1B, ANKRD22, CARD17, TNFAIP6, and IFITM3 were selected after verifying the consistency of their expression in an independent cohort (GSE19491).
The most downregulated genes in ATB vs. LTBI (Table 1B) Thus, we have derived a seven-gene signature, which included five upregulated genes, FCGR1B, ANKRD22 (ankyrin repeat domain containing protein 22), CARD17 (caspase recruitment domain containing protein 17), IFITM3 (interferon-induced transmembrane protein 3), and TNFAIP6 (TNF α induced protein 6), and two downregulated genes, FCGBP (Fc γ binding protein) and KLF12 (Kruppel-like factor 12), as a diagnostic biomarker for discrimination of active and latent tuberculosis infection. The short-listed genes were validated in an independent cohort through bioinformatics analysis, which showed a statistically significant difference in gene expression between ATB and LTBI samples (Figure 3). In ROC curve analysis, the identified genes have a sensitivity of 80-100% and specificity of 80-95% (Figure 4).

Discussion
Whole-blood-gene expression profiling among active and latent TB-infected individuals can identify a wide range of potential transcriptional biomarkers for active TB diagnosis. Several host-response-based signatures have been reported over the last decade for distinguishing patients with ATB from those with LTBI, other diseases, and uninfected healthy controls [14,18,36,37]. In the present study, we reanalyzed the transcriptional profile of ATB and LTBI in microarray datasets derived from Malawi and South African cohorts [14]. Our analysis identified 375 upregulated genes and 152 downregulated genes that were common and consistently expressed in both the cohorts (Figure 1). Since only the common or overlapping differentially expressed genes in ATB vs. LTBI were explored for analysis, the DEGs identified can be used as a biomarker in a population with high incidence as well as low incidence of tuberculosis.
Gene ontology/pathway analysis helps us to understand the pathways involved in active tuberculosis. Blood transcriptional profiling of ATB vs. LTBI revealed that interferon signaling genes (FCGR1A and FCGR1B) were predominantly upregulated during active tuberculosis as reported in earlier studies [34,38]. Significantly enriched gene sets in the Gambian cohort study (ATB vs. LTBI) were involved in systemic lupus erythematosus, complement coagulation cascade, and Fc γ receptor-mediated phagocytosis [36]. Using RNAseq-based gene expression profiling, Estevez et al. showed that expression of genes related to neutrophil degranulation, interferon γ signaling, complement cascade, interferon (type I and type II) signaling, and antimicrobial peptide genes is highly activated in ATB compared to LTBI [39]. Thus, our gene-expression profiling and functional pathway analysis observed a compiled and unique pattern of gene expression signatures reported in previous studies. In accordance with earlier studies, the genes corresponding to immune response and to defense response to bacterium were enriched. On the other hand, downregulated genes were coding for proteins involved in Wnt signaling, B cell receptor signaling, primary immunodeficiency, cytokine-cytokine receptor interaction, and cell differentiation. Our results corroborate with earlier studies that show that key genes participating in Wnt signaling pathway were impaired in severe pulmonary TB patients [40] and B cell and T cell transcript signatures were decreased in active tuberculosis [36,41].
PPI network analysis of DEGs identified groups of closely interconnected nodes in the upregulated network (Figure 2A), and the ClueGO/CluePedia analysis provided a functional interpretation of these closely interacting nodes. The resulting ClueGO terms were related to complement, interferon (expression of IFN stimulated genes), and the NOD-like receptor signaling pathway, which were reported in previous studies done elsewhere [34,37,38,42,43]. In addition, we observed significant enrichment of HNP1-4 stored in primary neutrophil granules, exocytosis of membrane protein, exocytosis of lumen protein, and heterodimerization of CEACAMs (Figure 2C). MCODE cluster and MCC cytoHubba analysis identified densely connected hub genes in the network (Figure 2B). Our results demonstrate that the hub genes are predominantly active in tuberculosis and are involved in the host response to Mycobacterium tuberculosis. The crucial role of FCGRs in antigen uptake is influenced by highly activating Fc receptor for IgG and immune complex [44]. CLEC4D, CEACAM8, CEACAM1, CEACAM6, GPR84, FCAR, and MCEMP1 are the top hub nodes or proteins tightly connected with other nodes in active TB (Figure 2B). CD177 and CEACAM8 genes are responsible for neutrophil activation in active TB, and expression of CD66a is increased upon mycobacterial infection in a time-dependent manner [45]. The C-type lectin receptor Clecsf8/clec4d is a key component in anti-mycobacterial host defense [46]. CEACAM1, CEACAM6, and CEACAM8 genes were expressed >3-fold in active TB. CEACAM1 (CD66a), CEACAM6 (CD66c), and CEACAM8 (CD66b) are genes coding for carcinoembryonic antigens of the immunoglobulin superfamily. Soluble recombinant CEACAM8-Fc dampens the TLR2-triggered immune response by interacting with CEACAM1-expressing human airway epithelium [47]. AIM2 was one of the top 10 genes and was upregulated 4.1-fold in ATB compared to LTBI. AIM2 senses cytosolic double-stranded DNA (dsDNA), which activates the inflammasome host immune response to pathogens. AIM2-deficient mice showed increased susceptibility to M.tb. infection due to a defect in the production of IL-1β and IL-18 and an impaired Th1 response [48]. The TLR2 gene was among the top 30 genes upregulated in ATB. TLR2 is a hub node interlinked with FCGR1A, CAMP, FCGR2A, HP, CASP4, CASP5, AIM2, CR1, SLPI, NAIP, OLFM4, and DEFA4 (Figure 2B). Toll-like receptor 2 (TLR2), expressed on the apical surface of airway epithelial cells, is particularly important for the detection of inhaled bacteria in the human airways and for the initiation of the innate immune response [49]. TLR2 signaling is highly regulated during M.tb. infection and plays a protective, multifaceted role in containing chronic M.tb. infection [50].
Majority of the enlisted upregulated and downregulated genes in our study are consistent with the 27 transcript signatures identified by the original submitter Kaforou et al. for distinguishing TB from latent TB infection [14,16]. Our results also corroborate a UK study by Blankley et al. that identified these genes within the top 10 upregulated genes in pulmonary TB patients as compared to healthy controls [16]. Maertzdorf et al. derived a combination of five most prominently differentiating genes FCGR1B, CD64, LTF, GBP5, and GZMA as a biosignature for TB diagnosis. This five-gene biosignature yielded the highest accuracy in discriminating between ATB and LTBI, with a sensitivity and specificity of 94% and 97%, respectively [36]. A 16-gene signature was identified in a study by Zak et al. that comprised genes ANKRD22, APOL1, BATF2, ETV7, FCGR1A, FCGR1B, GBP1, GBP2, GBP4, GBP5, SCARF1, SEPT4, SERPING1, STAT1, TAP1, and TRAFD1 [18]. A meta-analysis performed with 16 microarray datasets to profile the host transcriptional response in active tuberculosis led to the identification of five upregulated genes: AIM2, BATF2, FCGR1B, HP, and TLR5 [16]. Sweeney et al., in an integrated multi-cohort analysis, discovered a blood-based three-gene signature, GBP5, DUSP3, and KLF2, that distinguish patients with ATB from healthy controls, as well as from those with LTBI and other diseases [17]. The same group later evidenced that the three-gene TB score was significantly associated with progression of individuals from LTBI to ATB, six months prior to a sputum positive test result [51]. 
Of note, the differentially expressed upregulated genes such as BATF2, C1QB, CAMP, CASP5, FCGR1A, FCGR1B, FCGR1C, GBP1, GBP6, IFIT3, and P2RY14, and the downregulated genes CD19, CD27, CD79A, CXCR3, CXCR5, GZMK, TCF7, and ID3, derived from the Malawi and South African cohorts, were consistent and comparable with the blood transcriptome profile of tuberculosis and tuberculosis-diabetes co-morbidity studied in an Indian population [52].
The differentially expressed genes common to the Malawi and South African cohorts of active TB and latent TB infection were short-listed based on significant expression and fold change, and the DEGs were validated in an independent cohort. The seven signature genes were selected mainly on the basis of fold change among the most differentially expressed genes. Though both FCGR1A and FCGR1B were consistently expressed in the validation cohort, we chose only one gene (FCGR1B) from the FCGR family of genes. Other top DEGs such as GBP5, GBP1, and BATF2 were consistently expressed, but these genes were part of known signatures reported earlier [16][17][18]; hence, they were not included in deriving the gene signature. Based on fold change of the most differentially expressed genes, FCGR1B, ANKRD22, CARD17, TNFAIP6, and IFITM3 were short-listed, and these genes were finalized on the basis of their consistent expression in an independent cohort. We tried to incorporate hub genes such as TLR2 and the CEACAM family of genes in deriving the signature. However, the TLR2 gene was not consistently expressed in the validation cohort, whereas CEACAM1 and CEACAM8 were expressed in the validation cohort but at a low level. Hence, CEACAM1 and CEACAM8 were not included among the signature genes. The downregulated genes KLF12 (1.2-fold down) and FCGBP (2.2-fold down) were selected based on fold change and consistency of expression in an independent cohort. Thus, we derived a seven-gene signature, which included the five upregulated genes FCGR1B, ANKRD22, CARD17, IFITM3, and TNFAIP6 and two downregulated genes FCGBP and KLF12, as a diagnostic biomarker for discrimination of active and latent tuberculosis infection.
The short-listed genes were validated in an independent cohort, through bioinformatics analysis, which showed a statistically significant difference in gene expression between ATB and LTBI samples (Figure 3). This seven-gene signature has not been reported so far in discriminating active and latent tuberculosis. FCGR1B, ANKRD22, CARD17, IFITM3, and TNFAIP6 genes are present in the 171 differentially expressed transcripts reported by Kaforou et al., but except for FCGR1B, none of the other genes are included in TB signatures classified by Kaforou et al. [14]. FCGR1B and ANKRD22 are present in Zak-16 signature, which was applied in predicting the progression of LTBI into active TB [18]. The accuracy and performance of each gene of the seven-gene signature was calculated by receiver operating characteristic curve (ROC)-area under the curve (AUC).
The derived seven-gene signature has a sensitivity of 80-100% and specificity of 80-95%. Area under the curve (AUC) value of the genes ranged from 0.84 to 1. The ROC curves of seven-gene signature are illustrated in Figure 4, and sensitivity, specificity and AUC of seven-gene signature are presented in Table 5. A 27-transcript signature by Kaforou et al. [14] distinguished TB from LTBI, with sensitivity of 95% (95% CI 87-100) and specificity of 90% (95% CI 80-97). Gliddon et al. derived a three-transcript signature (FCGR1A, ZNF296, and C1QB) that differentiated TB from LTBI, with CI 95% (93.3-100%) [53]. The derived seven-gene expression signature is a promising biomarker, with high diagnostic accuracy in discrimination of active and latent TB infection.

Conclusions
The present study provides insight into differentially expressed host genes in active TB compared to latent TB infection among an African population that has both a high incidence (Cape Town) and a low incidence (Northern Malawi) of tuberculosis. The results of the functional annotation and pathway enrichment analysis led to identification of the primary pathways involving the upregulated genes (interferon and immune related) and downregulated genes (Wnt signaling and B cell signaling).
We derived a seven-gene signature that can effectively discriminate between active and latent tuberculosis infection and validated it in an independent cohort. The seven-gene signature could be further explored in other ethnic populations for its potential in discrimination of active and latent tuberculosis infection.