Sequencing-Based Genotyping of Pakistani Burkholderia mallei Strains: A Useful Way for Investigating Glanders Outbreaks

Burkholderia (B.) mallei is a host-adapted equine pathogen that causes glanders, a re-emerging zoonotic disease, which is endemic in Pakistan and other developing countries and seriously impacts the global equine movement. Due to globalization, the geographical restriction of diseases vanishes and the lack of awareness of and experience with eradicated diseases in industrialized countries also promotes the re-introduction of infections in these regions. Owing to the high equine population, the Pakistani province Punjab is a potential hotspot where several glanders outbreaks have been seen over last two decades. For determining the genomic diversity of B. mallei in this and other equine-populated prefectures, the genomes of 19 B. mallei strains isolated between 1999 and 2020 in different locations were sequenced and their genotypes were determined. Particularly, for genetically highly homogenous pathogens like B. mallei genotyping techniques require a high discriminatory power for enabling differentiation on the strain level. Thus, core-genome single nucleotide polymorphism (cgSNP) analysis was applied for distinguishing the highly similar strains. Furthermore, a whole-genome sequence-based core genome multi locus sequence typing (cgMLST) scheme, specific to B. mallei, was developed and additionally applied to the data. It was found that B. mallei genotypes in Pakistan persisted over time and space and genotype clusters preferred connection with a time point rather than the place of isolation, probably due to frequent equine movement, which promotes the spread of glanders. The cgMLST approach proved to work in accord with SNP typing and may help to investigate future glanders outbreaks.


Introduction
International animal trading poses the risk of global dissemination of pathogens. Even on smaller scales (i.e. animal movement between districts), unrecognised infection carriers can have a fatal impact, causing outbreaks among the native population [1]. Such transmission scenarios have occurred for glanders, a bacterial infection caused by Burkholderia mallei that mainly affects equines [2,3]. Horses, in particular, can develop an asymptomatic, chronic form of glanders, which makes them perilous spreaders, while mules and donkeys usually decease rapidly due to acute glanders [4]. Although solipeds are its primary host, B. mallei can also infect other mammals, including humans, making it a zoonotic pathogen. Laboratory workers, veterinarians and animal caretakers are at highest risk [5,6]. There is no vaccine available and antibiotic treatment of glanders is laborious and protracted [7,8].
In the past, glanders received wide attention due to the fatal nature of the disease, but now it has been eradicated in many countries due to strict measurements and culling policy [9]. However, sporadic cases still occur, often attributed to animal import from

Strain Isolation and Identification
Between 2017 and 2020, eight B. mallei strains (Table 1) were recovered from clinically suspected cases of glanders in game (polo) and draught equid communities from different areas of Pakistan ( Figure S1). The identity as B. mallei was confirmed by PCR targeting bimA ma . For investigating whether there is a prevalence of certain genotypes in Pakistan, we also included a panel of B. mallei strains (n = 11) isolated from different outbreaks of glanders between 1999 and 2007 (Table 1). These strains were isolated from clinical samples of puss, blood and nasal swabs of equines that were used as working animals, e.g., in the police service or polo matches.

Genome Sequencing
The Pakistani strains were sequenced by Illumina technology yielding on average 2,419,160 reads (range: 1,615,876-3,957,054) per sample with an average length of 255 bp and sufficiently high coverage for further analysis (Tables 2 and S1). Genomes that were assembled from these reads met the expected size and GC content. However, due to the short-read sequencing approach, the genomes remained fragmented, comprising 262 to 379 contigs (Table 2). In contrast, by pursuing a hybrid assembly strategy combining Illumina short-read and nanopore long-read sequencing data, a higher level of contiguity could be reached for the genomes of four strains from our strain collection (Table 3).

SNP Typing of Pakistani Strains
The investigated strains from Pakistan were compared in a cgSNP analysis ( Figure 1, Table S2) to seven strains from India, one historic strain from Pakistan and one from Iran, isolated between 1932 and 2015. In this analysis, 1016 core genome SNPs were called. The recent strains from Pakistan formed one large cluster that clearly differentiated them from a cluster formed by Indian strains by at least 35 SNPs. The historic strain NCTC 3709, isolated in 1932 in Lahore, did not fall within the Pakistani cluster, exhibiting 327 to 367 SNP differences compared to the contemporary strains. than 10 varying SNPs, implying a higher genomic heterogeneity of strains from 2017 to 2020. The strains from Faisalabad (2018; Pak2018H10) and Islamabad (2017; Pak2017H7) were the most similar among these strains, as they were separated merely by two SNP, while all other strains exhibited at least 10 SNP differences. They belonged to the most prominent cluster, cluster III, formed by strains from 2017 to 2019 from almost every sampled location. When comparing the two sampling decades, the strains exhibited 10 to 96 SNPs difference.

cgMLST Scheme Development and Validation
Due to the high quality criteria for penetration query genomes as a basis for the cgMLST scheme, several strains commonly used in B. mallei genotyping studies were not represented in the set of chosen public database entries. Thus, we additionally sequenced four strains of our strain collection (Table 3) using Illumina short-read in combination with nanopore long-read sequencing for high-quality hybrid assemblies that were added to the set of genomes as the basis for cgMLST development. All in all, 22 genomes were chosen for cgMLST generation (Table S1), none of which were identified as taxonomic or quality outlier. Using the cgMLST Target Definer, 2838 of 5025 genes were identified as suitable targets for the scheme (56.5% of the reference genome), while 1890 genes (37.6%) Within the Pakistani cluster, the strains formed smaller separate clusters, which were defined by sampling decade rather than by location of isolation, i.e., the strains from 2017 to 2020 did not mix in clusters with the PRL-named strains (1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007). In a cgSNP analyses merely including the strains sequenced in this study, all in all, 660 core genome SNPs were detected and SNP differences of 0-96 SNPs were observed (Table S2). In the polytomy based on these cgSNP data, which was in agreement with Figure 1 ( Figure S2), five Pakistani clusters could be made out (Figure 1), although within these, the SNP distance could be as high as 22 SNPs (Pak2020M11 and Pak2020M8 forming cluster II). The most homogenous cluster, cluster IV, was formed by strains from Faisalabad and Sargodha between 1999 and 2007 with 0-3 differing SNPs. Furthermore, PRL1 and PRL41 (cluster V), both isolated in Faisalabad but with a four-year distance, exhibited identical cgSNP profiles. The assembly of strain PRL20, a strain that had already been sequenced and published before [1], was also included in the genotyping analysis. The distance between this strain and the cluster formed by PRL1 and PRL41 constituted 35 SNPs.
It was striking that the more contemporary strains exhibited fewer clusters, with less than 10 varying SNPs, implying a higher genomic heterogeneity of strains from 2017 to 2020. The strains from Faisalabad (2018; Pak2018H10) and Islamabad (2017; Pak2017H7) were the most similar among these strains, as they were separated merely by two SNP, while all other strains exhibited at least 10 SNP differences. They belonged to the most prominent cluster, cluster III, formed by strains from 2017 to 2019 from almost every sampled location. When comparing the two sampling decades, the strains exhibited 10 to 96 SNPs difference.

cgMLST Scheme Development and Validation
Due to the high quality criteria for penetration query genomes as a basis for the cgMLST scheme, several strains commonly used in B. mallei genotyping studies were not represented in the set of chosen public database entries. Thus, we additionally sequenced four strains of our strain collection (Table 3) using Illumina short-read in combination with nanopore long-read sequencing for high-quality hybrid assemblies that were added to the set of genomes as the basis for cgMLST development. All in all, 22 genomes were chosen for cgMLST generation (Table S1), none of which were identified as taxonomic or quality outlier. Using the cgMLST Target Definer, 2838 of 5025 genes were identified as suitable targets for the scheme (56.5% of the reference genome), while 1890 genes (37.6%) were classified as accessory. Further 297 genes (5.9%) were discarded as a result of the Multi Copy Filter analysis. Thus, the final scheme comprised 2838 genes.
In order to validate the newly defined scheme, sequences of 47 B. mallei strains representing the currently known genomic diversity, including sequences of the same strains from multiple sources, were analyzed by cgMLST and compared to the results of the cgSNP analysis, which is the current gold standard. For this cgSNP analysis, a read-and assemblybased approach were chosen, using the tools Snippy and Parsnp. The neighbor-joining analysis based on cgMLST profiles was very well in accordance with both trees based on cgSNP data ( Figure 2, Figure S3).
On average, 97.24% of the cgMLST targets were called from these assemblies, with a mean value of 98.1%, although the number of contigs of these assemblies ranged from 209 to 1382. In three tested assemblies, less than 75% of target genes were identified, thus they were excluded from the analysis (Table S1). These assemblies also displayed the lowest N50 values of all tested data, <6600 bp, and they were highly fragmented, comprising >1600 contigs.
Despite varying percentages of good targets, there were also duplicate strains from different sequencing projects clustered together that displayed 0-27 allelic differences in cgMLST. It was observed that an N50 value below 15,000 bp markedly reduced the number of identifiable targets to less than 95%.
For one strain, NCTC10230, the allelic profiles between one out of three datasets differed immensely from the other two, namely in 306 and 329 targets. Accordingly, the strains clustered differently in the tree, which was also the case in the trees based on SNP analysis. Thus, it must be concluded that the strain name was incorrectly assigned to this sequence data.
In the cgSNP analysis 2318 SNPs were called by Snippy and 2676 by Parsnp, depicting the variability between both approaches. Accordingly, the number of differing SNPs between strains varied with the methods. The distances calculated by cgMLST lay within the ranges of the SNP calling tools (Table S3).

Allele-Based Typing of Pakistani Strains
The Pakistani strains were subjected to cgMLST analysis using the newly developed cgMLST scheme. On average, 98.87% of the targets were called (97.2-99.3%; Table S1). In this analysis, no identical allelic profiles were detected and the strains differed in 1-87 targets (Table S4). In agreement with the cgSNP analysis, the more recent Pakistani strains displayed higher heterogeneity (13-73 targets) than the older strains (1-55 targets). The most homogenous cluster found in the cgSNP analysis, cluster IV, also showed in the cgMLST results ( Figure 3). These strains exhibited 1-4 allelic differences and formed a cluster in the center of the tree, around which branches with the more contemporary strains emerged. Thereby, in contrast to the cgSNP analysis, the cluster that was formed by five strains from 2017 to 2019, SNP cluster III, was dispersed and the congruence of Pak2017H7, Pak2018M4 and Pak2019H6 to strains from police horses from Faisalabad (1999) became more apparent. Furthermore, existing differences between some of these strains were more pronounced. Especially the strain Pak2018H10 (Faisalabad, 2018) clearly was differentiated from the others by 37 to 73 alleles. Likewise, the two strains from 2019 from Lahore and Islamabad, Pak2019H6 and Pak2019H9, did not group together in the cgMLST analysis, although the difference of 20 alleles was well in accordance with the 20 SNPs difference in cgSNP analysis.  Additionally, in silico multiple-locus variable number tandem repeat analysis (MLVA) was conducted for the Pakistani strains (Table S5). For some of the strains investigated here, MLVA profiles have been published before [1]. When the in silico profiles were compared to those, the allelic numbers for several loci diverged from these published profiles. Thus, in silico MLVA was dismissed for genotyping as the assembly of the target regions, which exhibit a high number of repeats, pose a particular challenge to assemblers and the MLVA data based on PCR was considered more reliable.

Discussion
Although glanders has been endemic in Pakistan since at least the 19th century, little is known about the distribution of different B. mallei genotypes in this region. The present study is the first extensively employing whole genome sequencing for molecular genotyping of B. mallei outbreak strains from Pakistan. Often, typing studies suffer from the availability of only a few strains that can be investigated, which might pose a problem for revealing the true genomic variability within this specie and determining the method that is best suited for differentiation of more distantly related strains as well as highly congruent outbreak isolates. Thus, we investigated 19 strains that cover a sampling period of over two decades and originate from different locations in Pakistan.
For a reliable differentiation of species and strains, it is necessary to identify unique molecular signatures with a high discriminatory power. Several methods are at hand. However, the Burkholderia-specific MLST scheme [21] fails to differentiate the highly clonal B. mallei strains while an MLVA scheme [22] provides higher resolution, but the investigated repeat regions are prone to homoplasy. Recently, a cgMLST scheme for B. pseudomallei, the assumed progenitor of B. mallei, was proposed [32]. As this scheme could also not sufficiently differentiate B. mallei strains (unpublished data), a B. mallei-specific cgMLST scheme was developed in the present study. The percentage of targets identified as suitable core genome genes included in this scheme was well in range with other studies, although this value is highly specie-dependent. The number of identified target genes, 2838 targets, was lower than the published number of 3456 B. mallei core genome genes [20]. However, for the definition of the core genome in the present study, more strains (22 strains) were used than before (seven strains, [20]) and it is known that with an increasing Furthermore, strains PRL1 and PRL41, both from Faisalabad but isolated four years apart, could be differentiated by cgMLST profiles (five alleles of a difference), which was not possible by cgSNP (zero SNP differences). However, the PRL and more recent strains still did not form clusters defined by isolation location or year. The three exemplary Indian strains that were included in the cgMLST analysis did not mix with Pakistani strains.
Additionally, in silico multiple-locus variable number tandem repeat analysis (MLVA) was conducted for the Pakistani strains (Table S5). For some of the strains investigated here, MLVA profiles have been published before [1]. When the in silico profiles were compared to those, the allelic numbers for several loci diverged from these published profiles. Thus, in silico MLVA was dismissed for genotyping as the assembly of the target regions, which exhibit a high number of repeats, pose a particular challenge to assemblers and the MLVA data based on PCR was considered more reliable.

Discussion
Although glanders has been endemic in Pakistan since at least the 19th century, little is known about the distribution of different B. mallei genotypes in this region. The present study is the first extensively employing whole genome sequencing for molecular genotyping of B. mallei outbreak strains from Pakistan. Often, typing studies suffer from the availability of only a few strains that can be investigated, which might pose a problem for revealing the true genomic variability within this specie and determining the method that is best suited for differentiation of more distantly related strains as well as highly congruent outbreak isolates. Thus, we investigated 19 strains that cover a sampling period of over two decades and originate from different locations in Pakistan.
For a reliable differentiation of species and strains, it is necessary to identify unique molecular signatures with a high discriminatory power. Several methods are at hand. However, the Burkholderia-specific MLST scheme [21] fails to differentiate the highly clonal B. mallei strains while an MLVA scheme [22] provides higher resolution, but the investigated repeat regions are prone to homoplasy. Recently, a cgMLST scheme for B. pseudomallei, the assumed progenitor of B. mallei, was proposed [32]. As this scheme could also not sufficiently differentiate B. mallei strains (unpublished data), a B. mallei-specific cgMLST scheme was developed in the present study. The percentage of targets identified as suitable core genome genes included in this scheme was well in range with other studies, although this value is highly specie-dependent. The number of identified target genes, 2838 targets, was lower than the published number of 3456 B. mallei core genome genes [20]. However, for the definition of the core genome in the present study, more strains (22 strains) were used than before (seven strains, [20]) and it is known that with an increasing number of genomes included in the analysis, the size of the detected core genome decreases [33].
In a well-defined cgMLST scheme the retrieval rate of these targets in outbreak strains should constitute at least on average 95% to 97.5% [29,31]. This was the case for the B. mallei scheme when challenged with sequences from worldwide strains covering the complete diversity of the species, as well as the Pakistani outbreak strains. Likewise, the new scheme was compatible with cgSNP analysis as both revealed the same epidemiological patterns. However, Pakistani strains that were identical in the SNP analysis showed differences in cgMLST allelic profiles. Thus, we believe the cgMLST scheme might help outbreak investigations, in which highly congruent strains have to be differentiated. For other species, thresholds of allelic differences have been determined that define a single outbreak event, e.g., five and twenty alleles in case of Bacillus anthracis and Enterococcus faecium, respectively [27,30]. As in the present study, the true epidemiologic connection between isolates remained elusive, and further studies are required to determine this value for B. mallei. Such a threshold is also not known yet for B. pseudomallei cgMLST analyses, as even two alleles separate outbreak strains of a single transmission event from unrelated isolates and epidemiological connections were merely assumed for isolates differing by one allele [32]. A similarly strict differentiation could be expected for B. mallei.
When the newly developed cgMLST scheme was employed for the analysis of a diverse set of global B. mallei strains, the resulting polytomy matched the known patterns and was in accordance with the cgSNP typing results. The analysis of duplicate strains from different sequencing projects deposited in public databases gave allelic patterns that for some strains differed in several targets. However, differences can be attributed to the age of the isolates resulting in different replication cycles in the laboratories, as well as varying sequence qualities. Observed differences in allelic numbers of one and the same strain could be attributed to ambiguities in the assemblies. In particular, the N50 value of the assemblies proved crucial for the success of cgMLST analysis. In fragmented assemblies, genomic elements might be truncated or missing. In order to overcome this lack of high-quality assemblies for the development of the cgMLST scheme, we added genome sequences of four strains to the set of query genomes. The assemblies for those strains were generated by a hybrid assembly approach combining Illumina and nanopore sequencing data, which improved genome completeness and accuracy.
Although glanders is endemic in Pakistan and neighboring countries for at least 150 years [13][14][15]18], there are rarely whole genome sequence data present in the public databases. This complicates epidemiological investigations that rely on the determination of genome similarity for tracking strain origins. Serological studies proved the prevalence of B. mallei in the Punjab province [34], to which Faisalabad, Sargodha and Lahore belong, although glanders seems to be restricted to local endemic points [15].
The whole genome genotyping approaches employed in this study showed that the Pakistani strains form a group within global B. mallei phylogeny that can be distinguished from closely located Indian strains. Apparently, there is no extensive mixing between B. mallei populations from both countries.
The B. mallei PRL strains have been investigated before by MLVA [1], on the basis of which three clusters could be defined (clades A-C). The largest of these clusters, clade A, contains amongst others the strains PRL3, PRL4, PRL11 and PRL44, which were also clustered in this study by both cgSNP (cluster IV) and cgMLST analyses. According to Hornstra et al. [1], the hosts to PRL3, PRL4 and PRL44 originate from the same farm. Two horses, hosts to the former two strains, got infected with B. mallei in Lahore, which might be the source of infection of a mule, host to PRL44, 1.5 years later. However, the MLVA profiles of the strains differ [1]. In the presented study these strains were highly similar in cgSNP as well as the cgMLST profile, suggesting an epidemiological connection. As repeat regions are less stable markers than SNPs, the time distance between the isolation might account for the differing MLVA profiles. It has to be remarked, that strains from 1999 (PRL2, PRL11) and 2007 (PRL34) also fell into cluster IV in the present study, as they exhibited a similar level of identity to those strains (0-3 SNPs and 1-4 cgMLST alleles differences), although no connection between the cases is known.
When comparing the cgSNP and cgMLST data, strains that differed merely by a couple or by no SNPs, although they were isolated from different places and/or years apart, showed larger differences in cgMLST profiles, as would be expected and this makes sense from an epidemiological point of view, as comparably high numbers of genome alterations occur in B. mallei during passage through a host [35]. The same source persisting over eight years for these strains is unlikely, although horses are known to carry chronic glanders infections for several years [10]. PRL 1 and PRL 41, that were identical in cgSNP analysis, differed in MLVA profile [1] as well as in cgMLST alleles. Furthermore, PRL2 and PRL3, which could neither be differentiated by MLVA [1] nor by SNP typing, showed a slightly different allelic cgMLST profile, which would be expected as they were isolated in 1999 and 2005 in Faisalabad and Sargodha, respectively. Thus, cgMLST can add information and might help differentiating strains compared to SNP and MLVA analysis. We can confirm the former notion [1] that numerous B. mallei lineages circulate in Pakistan. However, the connection between genomic links and epidemiological links remains difficult given the complex epidemiological situation where horses often move between different cities and regions.
The analysis is complicated by a gap of 11 years between the sampling periods and by the fact that various undetected glanders outbreak events can be expected in the region [14]. In prefectures with high equine population, glanders is still prevalent and outbreaks in breeding establishments, as observed in 2006 to 2007 in Sargodha [15] might support the spread of B. mallei throughout the country. However, the currently circulating strains are distinct from NCTC 3709, isolated at the beginning of the 20th century in Lahore. It is known that glanders was re-introduced in India in the 1960s during the Indo-China war by imported horses and also employed in warfare [10,18], which might be the reason for genotype differences between the historic and the currently circulating B. mallei strains in Pakistan.
Thorough genotype characterization of B. mallei strains would not only help to elucidate the origin of Pakistani strains, but also the detection and surveillance of glanders worldwide. Laroucau et al. [36] reported that in two horses that were serologically tested and found to be positive for glanders, the routine PCR with tissue targeting the fliP-IS407A gene did not give a positive result, probably due to genetic variation in the infecting strains. Solely by applying more extensive methods, like SNP and MLST typing, the serological results could be confirmed. For such cases, in silico analysis based on whole genome sequences might help improving diagnostics, which is also important for countries where glanders is eradicated for decades, as veterinarians do not recognize the symptoms of the disease easily [11]. The application of cgMLST that gives a lab-independent, uniform sequence type assignment could further support the tracing of the origins of infections.
Here, we could show that the read-dependent cgSNP approach works well together with cgMLST analysis that is based on assemblies, and we believe that future studies could benefit from employing both methods, especially when the analysis must be based on assemblies, when no raw read data is available.

Sampling and Identification
B. mallei strains were isolated from blood and puss samples of clinically suspected glanders equids (horses and mules), which were brought to the Veterinary Medical Teaching Hospital (VMTH) of the University of Agriculture (UAF), Pakistan, between 1999 and 2020. Blood cultures were carried out in Oxoid Signal Blood Culture System (Oxoid, Basingstoke, UK). For this purpose, approximately 10 mL peripheral venous blood collected from the jugular vein was inoculated and incubated on a shaker at 37 • C for 56 h. Samples where the indicator devices showed positive culture signals were sub-cultured on 5% sheep blood agar plates and presumptive colonies were biochemically tested and confirmed by conventional PCR targeting the B. mallei bimAma gene as described elsewhere [37]. The amplicons were cleaned and sequenced by Lab-Genetix (Lahore, Pakistan) for further confirmation. Puss samples were cultured on blood agar plates and incubated at 37 • C for 56 h. Whitish grey to grey, non-hemolytic or marginally hemolytic colonies were also confirmed by PCR as described above.

Cultivation and DNA Isolation
For DNA isolation, the Pakistani B. mallei strains were grown in brain-heart infusion broth containing 4% glycerol at 37 • C for 2d. DNA was extracted using enzymatic digestion and phenol-chloroform extraction, according to standard protocols [38].
For the development of a cgMLST scheme covering a high diversity of B. mallei, four B. mallei strains from the Friedrich-Loeffler Institutes' strain collection (Mukteswar, NCTC120, 34, BfR242) were selected, for which no high-quality genome assemblies were deposited in the public databases, yet. The strains were grown on nutrient agar (Merck, Darmstadt, Germany) supplemented with 3% glycerine and 7.5% blood for 48 h at 37 • C. DNA was isolated using the NucleoBond HMW DNA kit (Macherey-Nagel, Düren, Germany) and subsequently used for Illumina and nanopore sequencing library preparation.

Library Preparation, Sequencing and Assembly
Short-read sequencing libraries were prepared from the isolated DNA with the Nextera XT library preparation kit (Illumina Inc., San Diego, CA, USA) and subsequently sequenced using v3 chemistry on a MiSeq system (Illumina) in paired-end mode.
Four strains (Mukteswar, NCTC120, 34, BfR242) were additionally sequenced by nanopore long-read technology (ONT). For this purpose, libraries were prepared with the Ligation Sequencing Kit SQK-LSK 109 (Oxford Nanopore Technologies Ltd., Oxford, UK) together with the Barcoding Kit EXP-NBD 104 (Oxford Nanopore Technologies Ltd., Oxford, UK) and sequenced on an R9.4.1 flow cell with a MinION Mk1B sequencing device (Oxford Nanopore Technologies Ltd., Oxford, UK) for 24 h. Sequencing raw data and hybrid assemblies were deposited at the European Nucleotide Archive under project number PRJEB52165.
Additionally, the public Sequence Read Archive (SRA) and NCBI GenBank were browsed on 20 January 2022 for Illumina read data and assemblies of B. mallei strains isolated in Pakistan and neighboring countries. This foreign data was processed using WGSBAC as described above. Accession numbers and metadata of foreign strains, as well as their function in this study, can be found in Table S1.

cgMLST Scheme Generation and Validation
Public databases were searched for genome assemblies of B. mallei on 11 October 2021. Metadata and function of these sequences are listed in Table S1. The quality of the assemblies was assessed using the WGSBAC pipeline (see above). For the generation of the cgMLST scheme 18 publicly available genomes were chosen (Table S1) as they met the following criteria: sequences covered at least 89% of the genome fraction of the reference strain ATCC 23344 (GCF_000011705.1), showed less than 2 N's per 100 kb, no contaminations and the number of contigs did not exceed 20. For doublet strains, the assembly with the highest quality was chosen. Four hybrid assemblies generated in the course of the present study (see above) were included in the set of strains used for cgMLST scheme development.
The cgMLST scheme was generated by a genome-wide gene-by-gene comparison using the MLST + target definer incorporated in Ridom SeqSphere+ v7.7 with default parameters, as previously described [31], including several quality filters to ensure scheme stability. The genome of B. mallei ATCC 23344 (GCF_000011705.1) served as the seed genome.
For validation of the scheme, publicly available B. mallei raw reads from the NCBI SRA database were downloaded (accessed on 20 October 2021; Table S1) and assembled with Shovill v1.0.4. The assemblies were analyzed with the new cgMLST scheme. A gene was considered a good target if it aligned with 100% to the reference sequence and showed at least 90% identity to this reference. A distance tree was calculated based on the allelic profiles by neighbor-joining algorithm implemented in SeqSphere+ v7.7 with pairwise ignoring missing values. By default, only samples with at least 90% of detected targets were included in the analysis. Furthermore, a core genome SNP analysis based on the raw reads and assemblies was conducted by Snippy v.4.6.0 (https://github.com/tseemann/ snippy, accessed on 11 April 2022) in conjunction with RAxML v8.2.12 [45] and Parsnp v1.2 within the Harvest suite [46], respectively. The resulting trees were compared using Dendroscope v3.5.9 [47]. Bootstrapping was performed whenever possible using RAxML with 200 iterations.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/pathogens11060614/s1, Figure S1: Map of Pakistan and sampling locations (LHR-Lahore; FSD-Faisalabad; AZK-Azad Jammu and Kashmir; ISB-Islamabad; SGI-Sargodha). The map is property of Wikimedia Commons and licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license (https://creativecommons.org/licenses/bysa/3.0/deed.en, accessed on 12 April 2022) and was modified; Figure S2: Maximum likelihood tree based on cgSNP alignment of exclusively Pakistani strains using Snippy; Figure S3: Tanglegrams of cgMLST-and cgSNP-based trees. Red numbers indicate bootstrap support values; Table S1: Foreign data accession data and assembly quality of Pakistani strains and foreign data;