SNPs in Genes Related to DNA Damage Repair in Mycobacterium Tuberculosis: Their Association with Type 2 Diabetes Mellitus and Drug Resistance

Genes related to DNA damage repair in Mycobacterium tuberculosis are critical for survival and genomic diversification. The aim of this study is to compare the presence of SNPs in genes related to DNA damage repair in sensitive and drug-resistant M. tuberculosis genomes isolated from patients with and without type 2 diabetes mellitus (T2DM). We collected 399 M. tuberculosis L4 genomes from several public repositories; 224 genomes belonging to hosts without T2DM, of which 123 (54.9%) had drug sensitive tuberculosis (TB) and 101 (45.1%) had drug resistance (DR)-TB; and 175 genomes from individuals with T2DM, of which 100 (57.1%) had drug sensitive TB and 75 (42.9%) had DR-TB. The presence of SNPs in the coding regions of 65 genes related to DNA damage repair was analyzed and compared with the resistance profile and the presence/absence of T2DM in the host. The results show the phylogenetic relationships of some SNPS and L4 sub-lineages, as well as differences in the distribution of SNPs present in DNA damage repair-related genes related to the resistance profile of the infecting strain and the presence of T2DM in the host. Given these differences, it was possible to generate two discriminant functions to distinguish between drug sensitive and drug resistant genomes, as well as patients with or without T2DM.


Introduction
With more than 10 million cases and 1.2 million deaths, tuberculosis (TB) remains an infectious disease of global importance [1]. Currently, drug resistance (DR), HIV, and type 2 diabetes mellitus (T2DM) are elements that contribute to the status of TB as a severe health problem. In this regard, T2DM leads to a 3.1 (2.27-4.26) fold increased risk of developing TB [2], through reactivation of latent infection or progression of recent infection [3], and the TB-T2DM binomial has been identified as a risk factor for unfavorable treatment outcomes, increasing the probability of failure, relapse, or death [4]. Patients with TB-T2DM are also more frequently found within the recent TB transmission groups [5], having a 4.7 (1.4-11.3) fold increased risk of becoming single-drug resistant, and a 2.8 (2.2-3.4) to 3.5 (1.1-11.1) fold increased risk of becoming multi-drug resistant (TB-MDR) [6,7]. This behavior has been explained as a consequence of the decrease in plasma concentrations of anti-tuberculosis drugs [8] and the interference from medications taken for glycemic control [9]. Besides, high glucose concentrations and immunological alterations in patients with T2DM [10][11][12] have been shown to promote the elongation of active infection and increased proliferation of TB bacilli [13].
From a molecular perspective, Single Nucleotide Polymorphisms (SNPs) are the primary source of variation in M. tuberculosis (Mtb) [14]. Through whole-genome sequencing (WGS) analysis of serial samples taken from affected patients, it was recently observed that a significant number of the genetic variations in Mtb are greatly influenced by the host environment [15,16]. In this sense, genes related to DNA damage repair (GRDDR) in Mtb have been the subject of increasing interest in recent years [17], since they play a fundamental role in maintaining DNA integrity and genomic diversification of Mtb [18,19], mainly in damage generated by the environment in which the bacterium is found, and by the host immune response. Mtb presents multiple DNA damage repair mechanisms, some redundant and others specifically related to induced mutagenesis [17]. It has been documented that the presence of SNPs in some Mycobacterium GRDDR can increase the mutation rate [20] and improve the adaptability of the bacteria [21] and can also be associated with increased DR or drug susceptibility [22].
It remains unknown if T2DM in the host could influence the generation of SNPs in Mtb GRDDRs and consequently drug resistance. Therefore, the aim of this study is to compare the presence of SNPs in GRDDRs in sensitive and drug-resistant Mtb genomes isolated from patients with and without T2DM.

Genome Compilation
A search for Mtb genomes was performed in the following public repositories: GenBank, TB Portals, ENA, and Patric. Additionally, a collection of genomes provided by Dr. Iñaki Comas of the Tuberculosis Genomics Unit of the Instituto de Biomedicina de Valencia, Spain, was also considered.
Inclusion of the genomes in the analysis was contingent on the metadata information meeting the following criteria: (1) presence or absence of T2DM in the host, (2) detailed description of the genotypic profile of resistance to first-line drugs (rifampicin, isoniazid, pyrazinamide and ethambutol), (3) coverage values > 99% and depth > 100×, and (4) exclusively belonging to L4, due to its high frequency and global distribution [23]. The final sample consisted of 399 genomes, which were organized into sensitive and drug-resistant TB from hosts with and without T2DM (Supplementary Table S1).

Bioinformatics Analysis of Genomes
First, the low-quality ends in the sequences (<30) were trimmed using Fastp [24], then Kraken V.2 [25,26] was used to filter reads belonging to the MTBC complex and avoid false variants as a result of DNA contamination. The reads were aligned with the BWA program [27] using a MTBC reference sequence [28], considering the default parameters. Variant calling (SNPs and INDELS) was performed following a previously described and validated pipeline [5,29], which is available online at http://tgu.ibv.csic.es/?page_id=1794, accessed on 4 October 2021. Variants present in at least 20 reads and at ≥90% frequency within each isolate were used to detect phylogenetic mutations and confirm pertinence to L4. In contrast, variants in at least 10 reads with a frequency of ≥10% to ≤90% were termed non-fixed SNPs and used to detect the first-and second-line drug resistance profile.
From the call of variants previously obtained for each one of the genomes, variants with an allelic frequency ≥ 10% in the coding regions of the 65 genes related to DNA damage repair were identified and selected (Supplementary Table S2). A database was then constructed with the non-synonymous SNPs identified in the 65 GRDDRs, the resistance profiles, and the presence/absence of DMT2 in the host.

Statistical Testing, Clustering, and Discriminant Analysis
To identify differences between resistant and sensitive TB genomes from hosts with and without T2DM, and the various sub-lineages comprising the sample, clustering analyses were performed on the non-synonymous SNPs in the GRDDRs only using the IQ-TREE software [30] (http://iqtree.cibiv.univie.ac.at/, accessed on 1 November 2021) with the default parameters for binary data. The generated consensus trees were visualized using iTOL [31] (https://itol.embl.de/, accessed on 1 November 2021).
SNPs present in >99% of genomes were considered as redundant sites. Identification of sub-lineage-related SNPs was performed using the fixation index (Fts = 1), which indicates that the SNP is fixed to the sub-lineage and is not present outside of it. The fixation index was calculated using the Genepop package for Rstudio [32], excluding mixed infections and single sub-lineages (n = 1).
Inter-group differences were analyzed by chi-square test using IBM SPSS V21 [33] (95% confidence level), excluding redundant SNPs and those fixed in sub-lineages.
For the discriminant analysis, two functions were generated; one to discriminate the presence or absence of T2DM in the host and another to determine DR-TB and sensibility. Both models were developed using the presence or absence of non-synonymous SNPs in each of the 65 GRDDRs analyzed (0 for absence and 1 for presence of SNPs in the gene). For this analysis, redundant SNPs and those fixed in sub-lineages were excluded, as well as genomes that did not present SNPs other than those related to sub-lineages. The eigenvalue, canonical correlation and Wilks' lambda were calculated as summary values for each discriminant function. Finally, the generated functions were validated by classifying one case (cross-validation) and a 10% random sample (random sample validation) after exclusion from the discriminant function calculation. SPSS ® V21 [33] software was used to perform this analysis and the validation.

SNPs in Genes Related to DNA Damage Repair
The genes with the highest number of non-synonymous SNPs were DnaE2 (18 SNPs), RecC (14 SNPs), and LigD (14 SNPs), whereas the genes Dut, Ku, Prim-PolC, RecR, and SSBa presented only one non-synonymous SNP. When contrasting the length of the genes, a positive correlation was observed in which the greater the length of the gene, the greater the number of non-synonymous SNPs (r = 0.639, p < 0.01). This is evidence that nonsynonymous SNPs are variable among GRDDRs, and this seems to be related to their length.
Ninety-nine percent of the genomes analyzed presented SNPs in the genes AlkA (position 1479085G>A), AdnA (3577958C>T), and ImuA (3811629T>C) ( Supplementary  Table S3). Additionally, <1% of the genomes presented non-synonymous SNPs in SSBa, SSBb, Ku, RecR, RecX, Mpg, DinB2, MazG, Prim-PolC, PolD2, RuvA, RuvC, MutT1, MutT4, and Ung. Notably, RuvX, which encodes a protein responsible for repairing branched DNA structures (Holliday structures) [17], was the only gene lacking non-synonymous SNPs. Likely, the SNPs observed with a frequency > 99% of the sample were previously acquired by a shared ancestor among the different L4 members. On the other hand, genes with lower presence of non-synonymous SNPs could suggest better function or some advantage in terms of conservation.

Cluster Analysis
The dendrogram created with the 352 non-synonymous SNPs identified in the GRD-DRs showed that the clusters generated were in complete agreement with the 19 sublineages that comprised the sample, but no association was observed with the presence of T2DM in the host ( Figure 1). Only lineages 4.2.1 and 4.10 showed differences in the presence of some clusters observed between sensitive and drug resistant genomes, reflected in the presence of non-synonymous SNPs in RecGwed (3011653A>G) and MutT2 (1286927C>A), respectively. The only genome belonging to lineage 4.6.1.1 was placed within the 4.3 lineage group due to the similarity between these lineages. Through the selection of 16 nonsynonymous SNPs in 13 GRDDRs (Fts = 1) ( Table 2), it was possible to generate a dendrogram that more clearly defined the sub-lineages comprised by the sample (Figure 2). This indicates that certain SNPs in GRDDR are specific to the sub-lineages and strongly related to their evolutionary development.        .1    Given this observation, both sub-lineage-fixed SNPs and redundant SNPs in the sample were given no further consideration in the comparative analysis between host groups or resistance profiles. It should be noted that 11.7% of the genomes (n = 47) did not present non-synonymous SNPs, other than those related to the sub-lineage.     Figure 2. Sub-lineage clustering using 16 non-synonymous SNPs in 13 genes related to DNA damage repair.

Comparison between Drug Sensitive and Drug Resistant Genomes of T2DM and Non-T2DM Hosts
With respect to drug resistance, the distribution of SNPs showed significant differences among groups (Table 3). In the genomes of drug resistant isolates, no non-synonymous SNPs were present in the genes RecR, DinB2, Prim-PolC, RNaseH1, RNaseH2, ImuB, or MutT2, but a high frequency of non-synonymous SNPs was found in RecGwed (p < 0.001) (only present in L4.2.1 and mixed infection), MutY (p < 0.001), and UvrA (p = 0.003). In contrast, drug sensitive TB isolates showed an absence of non-synonymous SNPs in the genes Mpg, SSBa, and Ku, but a high frequency was observed in ImuB (p = 0.028), RNaseH2 (p = 0.028), RNaseH1 (p = 0.046), and MutT2 (p < 0.001) (only present in L4.10 and mixed infection).
In relation to the presence/absence of DMT in the host, significant differences were observed in the distribution of genes with SNPs ( Table 3). The genomes of individuals with T2DM presented no non-synonymous SNPs in the genes Fpg2 or Ung, but showed a higher frequency of non-synonymous SNPs in LigB (p = 0.006), ImuA (p = 0.024) (only present in L4. 3   T2DM: Type 2 diabetes mellitus. Only genes with significant differences between groups are presented. • Lineage-related SNPs and SNPs > 99% of the sample are excluded. * Statistically significant difference between hosts with/without T2DM (chi-square test, p < 0.05). ** Statistically significant difference between sensitive and drug-resistant strains (chi-square test, p < 0.05). Genes not shown in the table are described in Supplementary Table S4.

Discriminant Analysis
Discriminant function analysis was performed on the 333 non-synonymous SNPs present in the 65 GRDDR genes, assigning values of 0 and 1, respectively, to the absence and presence of non-synonymous SNPs in the gene. Sub-lineage-related SNPs and redundant SNPs in the sample were excluded from the analysis, as well as those genomes that had no SNPs other than those related to sub-lineages. The discriminant function for TB from individuals with or without T2DM showed a moderate but statistically significant discrimination value (eigenvalue = 0.499, canonical correlation = 0.577, Wilks' lambda = 0.667, df = 63, p = 0.000), with functions at the centroids to differentiate between genomes from TB isolates from hosts without T2DM = 0.600 and genomes from TB isolates from hosts with T2DM = −0.827. This analysis correctly classified 73.6% of the genomes of TB from hosts with T2DM, with a sensitivity of 0.69 and a specificity of 0.80. However, it presented lower accuracy in the cross-validation (64.2% correct classification) and validation through a random sample including 10% of the genomes (64.4% correct classification) (Table 4A). From a selection of 63 Mtb GRDDRs, we can duplicate the model (Table 5A), establishing the presence of SNPs in RecR, LigB, and DnaE1 as the most discriminant genes associated with the presence of T2DM in the host. T2DM: Type 2 diabetes mellitus. * In cross-validation, each case is classified using the functions derived from the rest of the cases. ** 10% of the sample is excluded from the function calculation and is classified using the functions derived from the rest of the cases; the average obtained from 10 repetitions is presented. The standardized coefficients for both discriminant functions are detailed in Table 5.
On the other hand, the discriminant function for sensitive and drug resistant isolates showed moderate but significant values (eigenvalue = 0.788, Canonical correlation = 0.664, Wilks' lambda = 0.559, df = 63, p = 0.000), with functions at the centroids to differentiate between genomes from drug sensitive TB = −0.803 and genomes with drug resistant TB = 0.975. This discriminant function correctly classified 74.8% of the resistant genomes with sensitivity and specificity values of 0.82 and 0.81, respectively (Table 4B), while the cross-validation presented a classification that was 64.8% correct and the random sample validation was 68.1% correct. The model generated with a selection of 63 GRDDRs in the drug resistant and drug sensitive isolates shows that the presence of non-synonymous SNPs in the gene RecGwed had the highest discriminant value, followed by MutY, DnaE2, and Mfd (Table 5B). This demonstrates their high frequency in genomes with some level of drug resistance and thus their utility as discriminants. Finally, it is important to note that the differences observed between the proportions of isolates correctly classified by the discriminant functions and their validations confirms that the sample used was limited, but still generates functional models with which it is possible to determine the presence of pharmacological resistance and T2DM in the host through analysis of these genes.

Discussion
Analysis of the polymorphisms in GRDDR in a set of Mtb genomes (lineage 4) allowed the identification of several relationships among the identified SNPs, presence of T2DM in the host, pharmacological resistance and, unexpectedly, the L4 sub-lineages. The results represent the first approach to the study of GRDDRs and evidence an unexpected diversification of SNPs in these Mtb genes, as well as the possible influence of T2DM and DR on their development.
Some of the SNPs identified were highly specific for L4 sub-lineages and could be of further use for classification, while some Mtb genotyping systems refer to the use of certain SNPs from GRDDR, such as: UvrD2 (3570528C>G) [39], MutT3 (500224G>T), MutT4 (4393839C>T), MutY (4031203C>A) [40], DnaE1 (1750465T>C), RuvB (2923264G>A), DinB2 (3416734G>A), and SSBa (58786G>C) [41]. This is the first report of a new panel of 16 non-synonymous SNPs distributed in 13 GRDDRs that allows a correct sub-clustering of L4 sub-linages (Table 2). Although the limited size of the analyzed sample and the low representativeness of some sub-lineages are recognized, the clustering capacity observed with the SNPs present in GRDDR poses new questions about the coevolution of these mutations and the sub-lineages with which they are associated, and raises the possibility of their use as markers for phylogenetic classification.
It has been documented that the genes MutY and UvrA play an important role in DNA repair pathways and that their deficiency sensitizes Mtb to different clastogens [42][43][44].
While MutM (Fpg) and MutY are involved in the cleavage of oxidized guanine and its paired base [20], UvrA (with UvrB) plays an essential role in DNA damage recognition and initiates nucleotide excision repair [17]. On the other hand, RecGwed has a binding affinity for branched DNA structures, mainly in the stationary phase [37], and is involved in drug resistance acquisition pathways of the L4.2.1 [45]. Considering the above, the presence of a more significant number of SNPs in these genes in drug resistant strains could indicate that they result of positive selection induced by anti-TB treatment, which confers advantages when facing the effects of the drugs. On the other hand, the higher frequency of SNPs in the MutT2 (only present in L4.10 and mixed infection), RNaseH1, RNaseH2, and ImuB genes predominantly found in sensitive strains raises several questions regarding their origin and utility.
Differences in the non-synonymous SNPs in the GRDDR observed between individuals with and without T2DM suggest that the presence of T2DM could influence a greater diversity of polymorphisms in the GRDDR in Mtb. This could be explained by the influence of the metabolic and immune system alterations characteristic of patients with T2DM [16,46], reduced plasma concentrations of anti-tuberculosis drugs [8,47], consumption of drugs for glycemic control [9], and the increased formation of cavitary lesions [46], factors that also promote drug resistance [8,46], some of them due to the appearance of mutations generated by the increase of oxidative stress in bacteria [9].
In this sense, the higher frequency of SNPs in LigB ligase and multifunctional LigD ligase, which acts with Ku in the repair of double-strand breaks by non-homologous end joining [48]; Prim-PolC polymerase, which participates in abasic site filling [49]; the accessory protein ImuA, involved in DnaE2-induced mutagenesis [50]; MazG, the protein product of which degrades 5-OH-dCTP to prevent C>T (G>A) mutation when incorporated into DNA [51]; RecX, which regulates the functions of the RecA gene [17], which is fundamental to the SOS response, induced mutagenesis and drug resistance [52] and the RecG and RuvB genes, involved in the resolution of branched DNA structures [53], could explain the increased risk of individuals with the TB-T2DM binomial generating drug resistance or other negative outcomes to anti-tuberculosis treatment. However, further studies are required to determine the specific participation of these or other genes in the process of drug resistance in individuals with the TB-T2DM binomial.
The variations observed between the SNPs of sensitive and resistant isolates from individuals with or without T2DM made it possible to develop two discriminant functions with which to distinguish these characteristics. Besides providing the first description of their use in this context, these functions showed acceptable levels of sensitivity and specificity, which were higher than the random classification values (0.5) in terms of identifying both drug resistant isolates and those from hosts with T2DM. Although there are molecular markers for the diagnosis of drug resistance with better levels of sensitivity and specificity [54], we consider that the discriminant function for drug resistant strains using GRDDR non-synonymous SNPs could be used as a surrogate marker, since it does not use genes commonly related to drug resistance.
We consider that the misclassifications generated by both functions could be influenced by factors related to the time of evolution of the infection, different levels of drug resistance, and clinical control of T2DM in the host. Nevertheless, the presence of non-synonymous SNPs in GRDDR could be used as a tool to distinguish the presence or absence of T2DM in the host, based on Mtb genome analysis. However, further studies are required to confirm the true utility of this proposal.
The main limitation of this study was related to the reduced number of genomes and, consequently, of the members forming the study groups, which acted to generate an unpaired distribution in the sample. This number was related to the limited and nonuniform clinical and epidemiological information of the hosts in the databases. In most cases, information regarding T2DM is limited to presence/absence without incorporating pharmacological treatment for glycemic control, glycemic testing, time elapsed with T2DM condition, and presence of other comorbidities/addictions, which also limited the depth of analysis. Based on the above, incorporation of sufficient clinical and epidemiological information pertaining to the host should be considered a mandatory requirement for the registration of genomes in the repositories. Availability of this information will undoubtedly be key for subsequent studies.

Conclusions
The present study is the first analysis of polymorphisms present in GRDDR in the context of drug resistance and T2DM. The results show differences in the distribution of non-synonymous SNPs present in GRDDRs that were related to the resistance profile of the infecting strain and the presence of T2DM in the host. This provides evidence that the environment of a patient with T2DM influences the development of SNPs that could be involved in the development of drug resistance. However, further studies are required in order to confirm this possibility.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/genes13040609/s1, Table S1: Data of the analyzed sequences, Table S2: Genes related to DNA damage repair in M. Tuberculosis, Table S3: Distribution of nonsynonymous SNPs in genes related to DNA damage repair according to drug resistance and the presence/absence of T2DM in the host, Table S4: Distribution of genes with SNPs according to drug resistance and absence/presence of T2DM in the host (continuation).