Snapshot of Mycobacterium tuberculosis Phylogenetics from an Indian State of Arunachal Pradesh Bordering China

Uncontrolled transmission of Mycobacterium tuberculosis (M. tuberculosis, MTB) drug resistant strains is a challenge to control efforts of the global tuberculosis program. Due to increasing multi-drug resistant (MDR) cases in Arunachal Pradesh, a northeastern state of India, the tracking and tracing of these resistant MTB strains is crucial for infection control and spread of drug resistance. This study aims to correlate the phenotypic DST, genomic DST (gDST) and phylogenetic analysis of MDR-MTB strains in the region. Of the total 200 samples 22 (11%) patients suspected of MDR-TB and 160 (80%) previously treated MDR-TB cases, 125 (62.5%) were identified as MTB. MGIT-960 SIRE DST detected 71/125 (56.8%) isolates as MDR/RR-MTB of which 22 (30.9%) were detected resistant to second-line drugs. Whole-genome sequencing of 65 isolates and their gDST found Ser315Thr mutation in katG (35/45; 77.8%) and Ser531Leu mutation in rpoB (21/41; 51.2%) associated with drug resistance. SNP barcoding categorized the dataset with Lineage2 (41; 63.1%) being predominant followed by Lineage3 (10; 15.4%), Lineage1 (8; 12.3%) and Lineage4 (6; 9.2%) respectively. Phylogenetic assignment by cgMLST gave insights of two Beijing sub-lineages viz; 2.2.1 (SNP difference < 19) and 2.2.1.2 (SNP difference < 9) associated with recent ongoing transmission in Arunachal Pradesh. This study provides insights in identifying two virulent Beijing sub-lineages (sub-lineage 2.2.1 and 2.2.1.2) with ongoing transmission of TB drug resistance in Arunachal Pradesh.


Introduction
India is leading in the highest rates of tuberculosis (TB) incidence and mortality globally, with an estimate of 2.69 million cases [1,2]. Although drug-resistant tuberculosis (DR-TB) is a major public health concern globally, it represents an alarming situation in India, with 135,000 MDR-TB cases contributing to 27% of global DR-TB cases [1]. Patients with DR-TB often require profound changes in their drug regimens, which are invariably linked to poor treatment adherence and sub-optimal treatment outcomes compared to drugsensitive TB. Higher drug-resistant TB cases remain a challenge for clinicians and National Tuberculosis Elimination Programme (NTEP) for accurate and effective TB treatment in India [3,4]. In India, the paucity of rapid diagnosis in locations having low resources and high endemicity areas where access to health care centers is difficult, remains a major constraint in treating DR-TB cases. It is estimated that around 56% of MDR-TB cases in India remain undiagnosed [4]. Arunachal Pradesh, one of the states in the northeastern region of India bordering China with 80% area covered with forest, mostly with hilly terrains, has awakened consciousness of NTEPs due to the high prevalence of around 78.8% MDR-TB cases [5]. The reason for undiagnosed drug-resistant cases is the location of most villages in impoverished forest zones, poor connectivity of roads to health centers leading to inadequate access to health services.
There is increasing evidence that the inter-strain variation in M. tuberculosis exists due to variation in gene expression profiles and is biologically significant [6]. Several molecular epidemiological studies have proposed that certain types of M. tuberculosis strains that are prone to drug resistance may have rapid transmission rates and a higher rate of recurrence due to relapse [7]. Understanding the role of strain variation of M. tuberculosis to clinical phenotypes requires an approach to categorize M. tuberculosis isolates into groups that share most of the genotypic and phenotypic traits. For this, studies based on phylogenetic analysis, which organize clinical isolates into genetically related groups, are needed. Such studies provide an evolutionary framework for investigating polymorphisms and their potential biological relevance [8].
Molecular strain typing using the 24-loci MIRU-VNTR along with Spoligotyping is widely used for characterizing ongoing transmission of M. tuberculosis strains in a particular geographical location and has been shown to provide crucial information effective for the public health interventions [9]. However, these typing methods are reported to miss considerable amounts of genetic diversity, and where the overall diversity of circulating clones is limited, these approaches are insufficient to differentiate among strains [10].
With the advancement of Next-generation sequencing, Whole-genome sequencing (WGS) data are appraised for its use in epidemiological studies, strain typing for outbreak investigations, and surveillance of infectious disease due to higher sensitivity and rapid turnaround time [11]. Different genotyping methods such as spoligotyping and MIRU-VNTR were previously used for the categorization of lineages and sub-lineages. However, comparative analysis led to the use of single nucleotide polymorphisms (SNPs), which provided valuable insights into the epidemiology of circulating strains and were used as robust genetic signatures for phylogenetic categorization [12][13][14]. A well-established SNP barcode approach is already well known for analyzing 60 loci in M. tuberculosis sensu stricto genomes and has efficiently been used for categorization of major Lineages 1-7 and sub-lineages [13]. Moreover, core genome multilocus sequence typing (cgMLST) based on entire allele change is widely being used with confidence for assessing phylogenetic position to genomic data sets within a single species The sum of all core genes and their alleles for a species comprise the species cgMLST schema [14]. Phylogenetic heterogeneity, drug-resistant patterns, and association of lineages with drug resistance are not well known from the Arunachal Pradesh region of India. This study was aimed to utilize approaches of cgMLST and SNP barcoding to envision circulation of M. tuberculosis sensu stricto lineages. We also aim to perform phenotypic DST and genomic DST (gDST) to see the patterns of drug resistance in clinical strains of M. tuberculosis.
This study will help in the identification of hypervirulent strains/clones that are circulating among drug-resistant TB cases in Arunachal Pradesh and will provide crucial information to NTEP and public health programmers. Fast and accurate tracking of hypervirulent M. tuberculosis strains is, therefore, essential to keep track of ongoing circulating clones, which is decisive for infection control and can help in the prediction of potential future outbreaks. Sputum samples were processed using NALC-NaOH method [15]. Briefly, a minimum 3 mL of sputum sample was mixed with an equal amount of 0.5% NALC-4% NaOH, vortexed and incubated at 37 • C for 10 min. Samples were then neutralized and washed with phosphate buffer (PH 6.8) by centrifuging at 10,000 rpm for 10 min. Pellet was resuspended in 2 mL of phosphate buffer and mixed well. Smears were prepared for Ziehl-Neelsen (ZN) staining.

Materials and Methods
Five hundred microliter of the decontaminated sample was inoculated in Bactec MGIT 960 culture tubes containing 800 µL mixture of oleic acid, albumin, dextrose, and catalase (OADC) and polymyxin B, amphotericin B, nalidixic acid, trimethoprim, azlocillin (PANTA) supplement as per the manufacturer's instructions (Becton Dickinson Diagnostic Instrument Systems, Sparks, MD, USA). Leftover decontaminated sample aliquots were stored at −80 • C for future use.
Identified cultures of M. tuberculosis complex were subcultured on slants of Lowenstein Jensen (LJ) medium and incubated at 37 • C for 21-28 days. Growth from LJ medium was used for Bactec MGIT 960 DST and DNA extraction for WGS [17].

Bactec MGIT 960 SIRE DST
Single colony with the help of sterile inoculating loop from each LJ medium was inoculated in each MGIT 960 system tube and incubated in Bactec instrument until flagged positive. First-line SIRE DST was performed as per the manufacturer's protocol [18]. DST was performed on Day 1 and Day 2 by single dilution (0.5 mL of 1:100 dilution inoculum for growth control (GC) and 0.5 mL of inoculums directly in four drug panel tubes) and Day 3 to Day 5 (0.5 mL of 1:4 mL dilution inoculums directly for four drug panels and further dilution in 1:100 for GC) from the day of flagged positive of MGIT 960 instrument tube [18,19]. M. tuberculosis H37Rv [ATCC (American Type Culture Collection) number 2799] and known MDR-TB strain were used as quality control. The inoculated MGIT tubes with DST racks were loaded in the automated Bactec MGIT 960 system and the growth was continuously monitored by BD Epi-center.

Second Line DST Using Bactec MGIT 960
Isolates identified as MDR-TB were tested against second line drugs viz., moxifloxacin (MOX), levofloxacin (LEV), amikacin (AMK), and linezolid (LNZ). All drugs were purchased from Sigma-Aldrich Corporation (St. Louis MO, USA) and were chemically in the form of powder. Stock solution of drugs AMK (1 mg/mL), MFX (1 mg/mL), LFX (1 mg/mL) and LNZ (1 mg/mL) were prepared as per the instruction and sterilized through 0.22 µm pore-size Millex-GS filter units (Millipore Bedford, MA, USA). M. tuberculosis H37Rv [ATCC (American Type Culture Collection) number 2799] and known fluoroquinolone (FQ) resistant strains were used as quality control strains. Second-line DST was performed as per the protocol [19][20][21]. As AST carrier rack for second-line drug (SLDs) panels were not available commercially for the MGIT-960 system, it was registered as one of the SIRE (Streptomycin, Isoniazid, Rifampicin, Ethambutol) panel in order to obtain a printable report and drug susceptibility testing results.

Whole Genome Sequencing
Genomic DNA extraction from 65 clinical isolates (representatives of sensitive and resistant groups as described in Figure 1) of M. tuberculosis complex was carried out using the Chloroform-Isoamyl alcohol (CI) method [22], and quantification of genomic DNA was performed using Qubit Fluorometer (Thermo Fisher). The genomic DNA was sequenced using plexWell WGS-24 Library Preparation kit (Illumina, San Diego, CA, USA). Library pools were subjected to paired-end sequencing on a HiSeq platform (Illumina, San Diego, CA, USA). The quality of sequenced reads was screened using FastQC v0.11.9, and the reads with an average quality score of >=20 were retained [23]. Reads that were shorter than 36bp and possible adapter contaminating sequences were removed using Trim Galore Version 0.6.4 [24]. The output of contigs/genomes were assembled using the SPAdes genome assembler (version 3.9.0) using the default k-mer size [25] and annotated using Prokaryotic genome annotation pipeline (PGAP) [26].

Identification of SNPs
Raw reads of each genome were mapped to M. tuberculosis H37Rv reference genome (Accession NC_000962.3) using Burrows-Wheeler Aligner (BWA-MEM algorithm) bwa-0.7.12 [27]. SAM to BAM format conversion and sorting of mapped sequences, filling in mate coordinates to keep best reads using mate score tags was performed using Samtools 1.10 [28]. Duplicate alignments were marked and removed using samtools markdup command. BAM files were indexed for piling up variants by samtools mpileup. Mutations with read depth above 10 reads were considered true mutations. Annotation and filtering of variants were conducted using SNPEFF 5.0e [29] and SnpSift [30].

Assignment of Principal Genetic Groups
To assign a principal genetic group (PGG), each sequenced isolate was manually screened for polymorphisms in gyrA codon 95 and KatG codon 463 and were categorized accordingly to PGG as 1, 2, or 3, respectively, as explained previously [31].

Identification of Lineages and Sub-Lineages Using WGS SNP Barcoding
Isolates based on the patterns of SNP at the designated loci were categorized into phylogenetic lineages groups as lineage 1 (Indo-Oceanic), lineage 2 (East Asian), lineage 3 (East African Indian), lineage 4 (Euro-American), lineage 5 (West Africa 1), lineage 6 (West Africa 2), and lineage 7 (Horn of Africa) and lineage 8 [13]. After splitting WGS isolates into lineages, further categorization was conducted on the basis of SNP's [13]. Construction of the UPGMA tree was conducted by concatenating SNPs and visualized using iToL V.6 software [32].

Phylogenetic Analysis and Construction of cgMLST
Schema for cgMLST was set up with an efficient Workflow for a Blast Score Ratio Based Allele Calling Algorithm (Chew BBACA) [33]. For construction of cgMLST, first a wgMLST schema was created using M. tuberculosis H37Rv , were extracted and used as representative for M. tuberculosis lineage 1-8. These 39 publicly available complete genomes were used for validation of the cgMLST schema. The resulting loci was then subjected to AlleleCall, which identified and excluded 104 possible paralogous loci from further downstream analysis using the default BLAST Score Ratio (BSR) threshold of 0.6. Finally, cgMLST was extracted containing a set of 1443 core loci (present in 100% of the isolates). The resulting cgMLST matrix was uploaded in phyloviz 2.0 [34] to generate and visualize UPGMA Tree. The genome of M. cannettii was used to root the tree.

Data Analysis
All data obtained from gDST, phenotypic DST, PGG, and categorization of lineage on the basis of SNP barcoding were maintained on MS Excel 2013 for further analysis.

Demographic Details and Characteristics of MDR-TB Patients
Of the total 200 patients included in the study, 91 (45.5%) were males and 109 females with mean age (± standard deviation) of 29.52 ± 13.21 and 27.85 ± 14.17 years, respectively. The majority of cases were adults, 183 (91.5%), and 17 (8.5%) were from the pediatric age group.

Mutations in Genes Associated with First-and Second-Line Drugs Using WGS
A total of 65 isolates sequenced were analyzed for mutations conferring drug resistance in genes associated with first-line and second-line drug resistance (Table 2). Genomes sequences of each isolate were screened for mutations in genes conferring resistance to first-line anti-tuberculosis drugs viz; rpsL, rrs, gidB for STR; KatG, inhA, ahpA, fab, ndh for INH; rpoA, rpoB, rpoC for RIF; embABC for EMB and pncA for PZA.     (18/30; 60%) in embB, and Asp49Ala (5/10; 50%) in pncA was found to be predominantly present in genes known to confer drug resistance for first-line drugs STR, INH, RIF, EMB, and PZA, respectively.
In case of FQ, Asp94Gly (6/16; 37.5%) was found to be the predominant mutation in gyrA gene region, of which two genomes were also found to have mutation 1484G>T and 1401 A>G in rrs gene region conferring drug resistance to injectable class of drugs known to confer drug resistance. No mutations were found for genes associated with linezolid, clofazimine, delamanid, and bedaquiline.
Patterns of mutation resulting from WGS analysis and their association with phenotypic DST are shown in Table 2. Variant densities of each genome against M. tuberculosis H37Rv were generated using Blast Ring Image Generator BRIGv0.95 and is shown in Figure 2.

Phylogenetic Analysis Based on PrincipalGenetic Group (PGG)
Based on the constitution of amino acids at loci 95 and 493, the PGG informative sites within the genes gyrA and katG, we found each isolate was designated to a PGG. Out of 65 sequenced genomes, 60 were clustered in PGG1, 4 in PGG2, and only 1 in PGG3. On comparing the data based on sub-lineage classification, all the genomes belonging to sub-lineage 1, 2, and 3 were clustered in PGG1 group, while sub-lineages of lineage 4 dominated PGG2 group. Only 1 genome from lineage 4 was assigned PGG3, which was a pre-XDR isolate. Assignment of PGG and sub-lineage classification for each isolate is shown in Supplementary Table S1.

Discussion
With increasing drug-resistant TB cases in India, it becomes pivotal to recognize the clonal expansion of lineages or clones contributing to drug resistance, specifically in geographical regions where drug resistance is suspected [35]. Northeastern states of India have higher rates of MDR-TB, around 32.7% of which Arunachal Pradesh region is known to have 78.8% of TB drug resistance [5,36]. Most of the land in Arunachal Pradesh is under forest area, and villages located in such impoverished forest zones have very inadequate or non-existent access to health services, including the hampered efforts of the national TB elimination program. Further, proximity to China, difficult terrains, and heavy rainfalls and landslides impact health services, leading to a high MDR rate. Nonetheless, host genetics also might be contributing to high resistance in this area, but that aspect was not studied in this work, and we are not making any conclusions on the genetic reasons for high MDR. Whole-genome sequencing (WGS) has become a standard for typing of M. tuberculosis isolates and is known to have higher resolution over MIRU-VNTR-based clustering [37]. In this study we planned to utilize the approach of WGS and SNP barcoding for typing and clustering of M. tuberculosis lineages circulating in the Arunachal Pradesh region of India. This study will help in defining the transmission of strains associated with drug resistance circulating in the Arunachal Pradesh region of India.
Our report for MDR-TB from Arunachal Pradesh is 56.8%, slightly lower than previously reported 79.3%. The variation may be due to a lower number of sample sizes in the previous study, results based on DNA-based line probe Assay rather than Bactec MGIT 960, which detects viable bacilli and incorrect denominators used while calculating drug resistance rates [36]. In our study, 80% of the samples were obtained from previously treated TB cases resulting in an increased rate of drug resistance. However, no reports of FQ resistance were reported from Arunachal Pradesh earlier.
There is mounting evidence that strain diversity plays a role in the transmission of disease [38]. In order to spot strain transmission in Arunachal Pradesh, we randomly selected cultures for WGS from each category of varying data sets of drug-resistant patterns and used gene by gene PGG, SNP barcoding, and core genome MLST (cgMLST) method to allocate strains in well-defined phylogenetic groupings ( Figure 1). All these methods were used in various phylogenetic studies for standardized phylogenetic assignment of diseases and outbreak resolutions [13,39,40]. The PGG results also correlated with results of lineage grouping by SNP barcoding as 60 (92.3%) isolates belong to PGG group 1 (KatG Leu463Leu, gyrA Thr95Thr) of which 8 (13.3%) were Lineage 1 (Indo Oceanic), 41 (68.3%) Lineage 2 (East Asian) and 10 (16.7%) Lineage 3 (East-Arican Indian) and 1 (1.7%) Lineage 4 (Euro American). PGG group 2 (KatG Leu463Arg, gyrA Thr95Thr) and PGG group 3 (KatG Leu463Arg, gyrA Thr95Ser) included 4 (6.2%) and 1 (1.5%) strain belonging to lineage 4. Categorization of lineage as per the PGG group reported from other studies was consistent with our finding [31,40] (Supplementary Table S1).
This study provided for the first time a complete picture of TB phylogenetics across Arunachal Pradesh region based on cgMLST compared to phylogenetics that is based on SNP calling methods. The resulting cgMLST phylogenetic tree contained all reference lineages and four major M. tuberculosis lineages from our dataset and matched with results of SNP-based methods. None of the genomes belong to Lineage 5-8, and all were M. tuberculosis sensu stricto (Figure 1).
Using SNP barcoding method, we found 41 (63.1%) isolates grouped with lineage 2 (East Asian); 10 (15.4%) isolates with lineage 3 (Central Asian), 8 (12.3%) isolates with lineage 1, and 6 (9.2%) isolates to be lineage 4 (Supplementary Table S1). We found Lineage 2 (East Asian) as the predominant lineage (63.1%) circulating in Arunachal Pradesh among suspected drug-resistant TB cases. To gain further insight, we looked for SNP-based markers to differentiate sub-lineages. Among Lineage 2 only two clones circulating in Arunachal Pradesh were found viz: sub-lineage 2.2.1 (82.9%) and 2.2.1.2 (17.1%), respectively. Both these sub-lineages 2.2.1 and 2.2.1.2 belonged to modern Beijing clade, which was depicted by the presence of SNP markers at mutT2 codon Gly58Ala, ogt Gly12Gly specific to modern Beijing as reported in earlier studies [40,41]. Of a total of 6535 SNP's, Beijing clone 2.2.1 was responsible for >80% of transmission and clusters using thresholds of up to 19 SNPs showing recent transmission of strains. Another clone of Beijing 2.2.1.2 showed clusters with SNP differences of 9 SNPs, showing ongoing transmission of the strains specifically associated with drug resistance. All these 7 (100.0%) isolates of clone 2.2.1.2 were multidrugresistant, and 2 (28.6%) were resistant to FQ. Beijing sub-lineages 2.2.1 and 2.2.1.2 was reported from various parts of the world associated with outbreaks and drug resistance from Vietnam and Southern China [42][43][44][45][46][47]. We also observed Lineage 1 clone 1.1.3 with SNP differences of 101, showing this clone as endemic in Arunachal Pradesh for longer time periods. Lineage 1 is known to be associated with activation of long-term latent infection compared to that of Lineage 2 (modern Beijing) strains, which are known for more likely to progress to active disease in various host populations, more virulent, and thus highly transmissible [48,49]. A total of 6 isolates of lineage 4 were identified and were unclustered. Of Lineage 3, one clone including 9 (90%) isolates was found showing SNP differences of 18, also showing ongoing transmission. Out of 9 isolates, 3 (33.3%) were MDR-TB. One isolate (10%) of sub-lineage 3.1.2.1 was also found and was MDR-TB as well as resistant to FQ. New clades of lineage 3 were also reported to be circulating in the Assam region of India by Devi et al., which is consistent with our study [50].
The main strength of our study is that it highlights the higher rates of drug resistance in the Indian state of Arunachal Pradesh, which has a common border with China, and provides insights into the phylogenetic diversity of MDR-TB isolates from this state using cgMLST. These findings may have important implications in understanding the molecular epidemiology of DR-TB and for its control and prevention, particularly in the state of Arunachal Pradesh. However, our study has some limitations. The main limitation was that we could not include a control group from other states/regions of India in order to compare the strain diversity, specifically the Beijing sub-lineage 2.2.1 and 2.2.1.2, its association with drug resistance.

Conclusions
Our findings show dissemination of clusters of Beijing clones associated with drug resistance and regional spread may be emerging and aggressive. Approaches to contain Beijing strains (sub-lineage 2.2.1 and 2.2.1.2) may prevent transmission of these strains across other parts of India. Clonal expansion of these strains in Arunachal Pradesh in the future may lead to an outbreak of Beijing strains and underline the need for surveillance studies incorporating epidemiological information and a track of ongoing transmission to prevent drug-resistant TB outbreaks. We also found transmission of lineage 3 clade and the presence of Lineage 1 as endemic in Arunachal Pradesh. These findings may have important implications for control and prevention of TB in the northeastern part of India, Arunachal Pradesh.