CRISPR-Cas Diversity in Clinical Salmonella enterica Serovar Typhi Isolates from South Asian Countries

Typhoid fever, caused by Salmonella enterica serovar Typhi (S. Typhi), is a global health concern and its treatment is problematic due to the rise in antimicrobial resistance (AMR). Rapid detection of patients infected with AMR positive S. Typhi is, therefore, crucial to prevent further spreading. Clustered Regularly Interspaced Short Palindromic Repeats and CRISPR-associated genes (CRISPR-Cas), is an adaptive immune system that initially was used for typing purposes. Later, it was discovered to play a role in defense against phages and plasmids, including ones that carry AMR genes, and, at present, it is being explored for its usage in diagnostics. Despite the availability of whole-genome sequences (WGS), very few studied the CRISPR-Cas system of S. Typhi, let alone in typing purposes or relation to AMR. In the present study, we analyzed the CRISPR-Cas system of S. Typhi using WGS data of 1059 isolates obtained from Bangladesh, India, Nepal, and Pakistan in combination with demographic data and AMR status. Our results reveal that the S. Typhi CRISPR loci can be classified into two groups: A (evidence level >2) and B (evidence level ≤2), in which we identified a total of 47 unique spacers and 15 unique direct repeats. Further analysis of the identified spacers and repeats demonstrated specific patterns that harbored significant associations with genotype, demographic characteristics, and AMR status, thus raising the possibility of their usage as biomarkers. Potential spacer targets were identified and, interestingly, the phage-targeting spacers belonged to the group-A and plasmid-targeting spacers to the group-B CRISPR loci. Further analyses of the spacer targets led to the identification of an S. Typhi protospacer adjacent motif (PAM) sequence, TTTCA/T. New cas-genes known as DinG, DEDDh, and WYL were also discovered in the S. Typhi genome. However, a specific variant of the WYL gene was only identified in the extensively drug-resistant (XDR) lineage from Pakistan and ciprofloxacin-resistant lineage from Bangladesh. From this work, we conclude that there are strong correlations between variations identified in the S. Typhi CRISPR-Cas system and endemic AMR positive S. Typhi isolates.


Introduction
Typhoid fever is a systemic enteric infection, caused by Salmonella enterica serovar Typhi (S. Typhi), a human-restricted bacterial pathogen [1,2]. It is estimated to lead to 117 thousand deaths and 11 million episodes of illnesses every year and thus remains a major global public health concern [3]. The fecal-oral transmission route of S. Typhi makes typhoid fever highly endemic in areas with poor water and sanitation systems, especially the South Asian countries such as Bangladesh, India, Nepal, and Pakistan [3,4]. Moreover, treating typhoid fever has become harder, because of the increasing antimicrobial resistance (AMR) [5]. Recently, a highly clonal and extensively drug-resistant (XDR) lineage of S. Typhi that is resistant to all, but one oral antibiotic, azithromycin, caused a large-scale typhoid outbreak in Pakistan [6]. A highly ciprofloxacin-resistant lineage (named 'Bdq'; as a part of genotype 4.3.1.3, it will be referred to as 4.3.1.3q1 in the rest of the article) has appeared in Bangladesh and carries a qnr gene-containing plasmid, pK91 [5,7]. Isolates with high azithromycin resistance have been reported in Bangladesh as well [8,9]. With the availability of whole-genome sequence (WGS) data, these AMR characteristics can be easily detected and a large amount of WGS data is publicly available for S. Typhi. WGS data can also shed light on the presence of defense mechanisms that can recognize and destroy foreign genetic materials [10].
One such system is the Clustered Regularly Interspaced Short Palindromic Repeat and CRISPR-associated genes (CRISPR-Cas) for which little information is available in S. Typhi [11][12][13][14]. A CRISPR locus usually contains two to several hundreds of direct repeat (DR) sequences of 23-50 bp in length, separated by unique spacer sequences of similar length [15]. Spacers share complementarity with sequences identified in foreign DNA elements (protospacers) and are acquired from phages, plasmids, and other transferrable elements that previously infected bacteria [16][17][18]. To differentiate foreign DNA elements from self-DNA, the Cas proteins follow often at least three-nucleotide long protospacer-adjacent motif (PAM) present on the target sequence [19,20].
The genus Salmonella is known to carry a class-1 type I-E system, closely related to the CRISPR-Cas system in Escherichia coli (E. coli) [21,22]. The systems have been reported to carry either one or two CRISPR loci and a cas-gene cluster of cas3, cse1-cse2-cas7-cas5-cas6e-cas1-cas2 genes [2,14]. CRISPR-Cas systems in other bacterial species have been explored extensively for typing purposes [23]. For AMR, it became evident that the size of the CRISPR loci correlates with the presence or absence of AMR-related genes [24][25][26][27]. In S. Typhi, only a few studies explored the usage of the CRISPR-Cas system for typing purposes, which is still an unexplored territory [11,12]. Moreover, the earlier studies analyzed only a smaller number of whole-genome sequences (WGS) to explore the diversity of the system. For example, Fabre et al. used 18 S. Typhi WGS data to report two different CRISPR loci in the genome (CRISPR1 and CRISPR2) and used PCR assays to amplify those loci to explore the diversity of DR and spacers [11]. Therefore, an opportunity exists to follow-up this work with a larger set of WGS data to explore the S. Typhi CRISPR-Cas system further and report on its diversity as well.
In this work, we analyzed the S. Typhi CRISPR-Cas system using WGS data of 1059 isolates obtained from four major typhoid-endemic countries (Bangladesh, India, Nepal and Pakistan) with the country of isolation, demographic data, and AMR status. Our work identified potential CRISPR-Cas system-related markers that associate specifically with endemic and AMR-related S. Typhi isolates. We further identified unique spacer targets in bacteriophages and plasmids that led to the identification of a specific PAM sequence for S. Typhi. Next, we annotated common and new cas genes, of which one, the gene WYL, could be specifically linked to XDR isolates from Pakistan. Collectively, our study reveals with an impressive dataset that the CRISPR-Cas system in S. Typhi might become of use to monitor the dissemination of AMR endemic isolates so that their spreading can be contained.

Source and Assembly of the Genome Data
We used published WGS data of 536 isolates from Bangladesh, 198 from Nepal, 131 from India, and 20 from Pakistan [5,28,29]. These 885 isolates were considered as "Surveillance" cases. WGS data of 100 isolates from the ongoing XDR S. Typhi outbreak in Pakistan were included and considered as "Outbreak" cases [6]. Moreover, we included WGS of 74 travel-associated typhoid cases from the UK who traveled from the four above mentioned countries and categorized them as "Travel" cases [30]. Details of all 1059 cases are provided in Dataset S1. Raw S. Typhi genome data (fastq files) of all cases were downloaded from the European Nucleotide Archive (ENA), following the accessions given in source articles. We used SPAdes v3.12.0 (options: cov-cutoff = 'auto') to assemble the fastq files and removed smaller contigs (<300 bp) [31]. N50 of the contig files were calculated and added in the Dataset S1.
To compare the S.

Detection of CRISPR Loci and Cas Genes
To detect the CRISPR loci, we ran all assembled contigs through the CRISPRCasFinder v4.2.19 locally (without the cas-gene option) [32]. Following the earlier reports of S. Typhi DR and spacer length (29 and 32 bp) [11,21], all DRs and spacers longer than that were checked manually for truncation and overlap. Identified confirmed loci with an evidence score of 3 or 4 harbored increased numbers of spacers and were considered as "group-A CRISPR loci", which is the same as CRISPR1, an earlier nomenclature used by Fabre et al. [11]. However, unlike previous reports, more than one locus with an evidence score of 1 or 2 (low number of spacers) were found and all can be compared to CRISPR2 of the earlier nomenclature, thus we considered them as part of the second group of loci, "group-B CRISPR loci".
Sequences of direct repeats (DR) and spacers were extracted from all S. Typhi, Salmonella spp. and E. coli isolates separately and screened for sequence identity within their groups (ignored if redundant otherwise termed as 'unique'). All unique DR and spacers were given a unique three-part identifier (e.g., Td29a, Ts32ac, Ss32aak) following the strategy explained in Figure S1. Spacer arrangements of all group-A and B CRISPR loci were determined separately.
To detect the cas genes, we used Prokka v1.13.3 (options: gcode = 11) to annotate all 1113 genomes and blastp v2.7.1 + (options: evalue = 1 × 10 −9 and qcov_hsp_perc = 80) to search the annotated protein sequences against the cas-gene repository published earlier [18,[33][34][35]. For each detected cas-gene, corresponding nucleotide sequences were extracted from the annotation files and searched against the contigs using blastn (options: evalue = 1 × 10 −9 and qcov_hsp_perc = 95, maximum_bit_score) to find the gene location (sorted by position in the contigs) and orientation (positive or, negative-strand) and define the cas-gene loci. The distance from the cas-gene loci to the nearby CRISPR loci was also calculated. These data were used to visualize the cas-gene loci with nearby CRISPR loci in all S. Typhi isolates and compare them among themselves. In the case of detected cas-genes of other types of CRISPR-Cas system than I-E, the length of their coding sequences (CDS) was determined and added with their gene name (in superscript) to define an identifier for the CDS. An asterisk (*) was added to all cas genes if its CDS had any non-sense mutation and was interrupted prematurely.
The Supplementary Methods provides details on the collection of epidemiological data, generating multilocus sequence typing (MLST), genotype and AMR data, conservation of direct repeats (DR), spacers and their phylogenetic analysis, and finding spacer targets (Doc S1).

CRISPR Loci of S. Typhi Genomes
A total of 1919 CRISPR loci were detected in the S. Typhi genomes analyzed in this study ( Table  1). Of them, 55% (1054/1919) were group-A and 45% (865/1919) were group-B CRISPR loci. One to even five CRISPR loci per isolate were detected, but the majority harbored just one (40%) or two (41%) of them ( Figure S2). Bangladeshi surveillance isolates showed a lower range of CRISPR loci per strain (1-2; 690/536 vs. 2-3; 1229/523) compared to those from other countries ( Figure 1a and Table 1). The extensive drug-resistant (XDR) and non-XDR isolates from the Pakistani outbreak also carried a relatively lower number of CRISPR loci (2-3; 184 loci from 88 XDR isolates of genotype 4.3.1.1.P1 and 26 loci from 12 non-XDR isolates) ( Table 1 and Figure 1a). All other isolates across different study settings and countries had a higher number of CRISPR loci (1019 loci from 423 isolates) (Figure 1a). Among the dominant genotypes (with >50 isolates) identified in this study (  Figure 1b).   Loci number Loci number Loci number Genes 2020, 11, 1365 7 of 24 The maximum likelihood-based phylogenetic tree (MLT) of all group-A CRISPR loci of S. Typhi showed only one primary clade, whereas the MLT was generated with the group-B CRISPR loci had subclades specific to their DR sequences ( Figure S3a,b). The MLT of all 1919 group-A and B CRISPR loci showed similar inferences ( Figure S4). Surprisingly, 2nd and 3rd group-A CRISPR loci of a Bangladeshi isolate (accession: ERR2663968) were placed outside the primary clade of the MLT ( Figure  S3a). A blastn analysis confirmed this finding by showing a 100% sequence identity with Salmonella enterica serotype Enteritidis and not S. Typhi.
In the 1919 CRISPR loci, we further identified 15 different DRs, with most having strict specificity toward a certain CRISPR loci group (Table 2 and Figure S3b). Next, 47 unique spacer sequences were detected. Most of the spacers (n = 39) showed specificity to either one of the two CRISPR loci groups, except the four spacers named Ts32c, g, h, i, Ts34a, c, e, and, f ( Figure 2 and Table 3) (See Figure S1 for details about the identifiers of DR and spacers). Among the highly present spacers, Ts32e and l were only present in group-A loci, whereas Ts55a, Ts54a, and Ts34d showed complete specificity to group-B CRISPR loci ( Figure 2 and Table 3). The MLT of all S. Typhi spacers did not show any clustering for CRISPR loci group specificity ( Figure 2). Instead, group-A and -B CRISPR loci had 7 and 22 different spacer arrangement patterns, respectively, named as a1-7 and b1-22 (Table 4).
Genes 2020, 11, x FOR PEER REVIEW 7 of 24 The maximum likelihood-based phylogenetic tree (MLT) of all group-A CRISPR loci of S. Typhi showed only one primary clade, whereas the MLT was generated with the group-B CRISPR loci had subclades specific to their DR sequences ( Figure S3a,b). The MLT of all 1919 group-A and B CRISPR loci showed similar inferences ( Figure S4). Surprisingly, 2nd and 3rd group-A CRISPR loci of a Bangladeshi isolate (accession: ERR2663968) were placed outside the primary clade of the MLT ( Figure S3a). A blastn analysis confirmed this finding by showing a 100% sequence identity with Salmonella enterica serotype Enteritidis and not S. Typhi.
In the 1919 CRISPR loci, we further identified 15 different DRs, with most having strict specificity toward a certain CRISPR loci group (Table 2 and Figure S3b). Next, 47 unique spacer sequences were detected. Most of the spacers (n = 39) showed specificity to either one of the two CRISPR loci groups, except the four spacers named Ts32c, g, h, i, Ts34a, c, e, and, f ( Figure 2 and Table 3) (See Figure S1 for details about the identifiers of DR and spacers). Among the highly present spacers, Ts32e and l were only present in group-A loci, whereas Ts55a, Ts54a, and Ts34d showed complete specificity to group-B CRISPR loci ( Figure 2 and Table 3). The MLT of all S. Typhi spacers did not show any clustering for CRISPR loci group specificity ( Figure 2). Instead, group-A and -B CRISPR loci had 7 and 22 different spacer arrangement patterns, respectively, named as a1-7 and b1-22 (Table 4).         Table 4. Different arrangement patterns of the spacers in S. Typhi isolates in this study. Each arrangement was considered as an "array pattern" and labeled after their loci type with a number (started with "a" for group-A loci, e.g., a1, a2, etc. and "b" for group-B loci, e.g., b1, b2, etc.).

CRISPR Loci of S. Typhi Versus other Salmonella Species
The addition of 91 group-A CRISPR loci from other Salmonella species (84 loci from 48 isolates of 19 serovars) and E. coli (seven loci from six isolates) to the MLT of all S. Typhi group-A CRISPR loci (n = 1054) revealed the presence of a primary clade (bootstrap 98) specific for S. Typhi (Dataset S2 and Figure 4a). The closest neighbor to the clade was S. enterica subsp. enterica (Figure 4a). This finding obtained further support from the estimated distance (median distance 3.12) calculated from the multiple sequence alignment (MSA) of all 1145 group-A CRISPR loci (Table S2). The lowest intraspecies distance (mean 0.07) among group-A CRISPR loci was found for S. Typhi (Table S2). Furthermore, the average length of these CRISPR loci was the shortest among all 19 different Salmonella serovars studied (after S. enterica Dublin; two isolates) (Table S2). Inside the S. Typhi serovar specific clade, two small and one large cluster were noticed (100% bootstrap). One small cluster had members of genotype 3.

CRISPR Loci of S. Typhi Versus other Salmonella Species
The addition of 91 group-A CRISPR loci from other Salmonella species (84 loci from 48 isolates of 19 serovars) and E. coli (seven loci from six isolates) to the MLT of all S. Typhi group-A CRISPR loci (n = 1054) revealed the presence of a primary clade (bootstrap 98) specific for S. Typhi (Dataset S2 and Figure 4a). The closest neighbor to the clade was S. enterica subsp. enterica (Figure 4a). This finding obtained further support from the estimated distance (median distance 3.12) calculated from the multiple sequence alignment (MSA) of all 1145 group-A CRISPR loci (Table S2). The lowest intra-species distance (mean 0.07) among group-A CRISPR loci was found for S. Typhi (Table S2). Furthermore, the average length of these CRISPR loci was the shortest among all 19 different Salmonella serovars studied (after S. enterica Dublin; two isolates) (Table S2). Inside the S. Typhi serovar specific clade, two small and one large cluster were noticed (100% bootstrap). One small cluster had members of genotype 3.2.2 and the other had 2.3.3 (ST2209), 3.0.0, and 3.2.1 (mostly ST2) from Bangladesh. The larger cluster had all other genotypes identified (Figure 4a and Table S1). No other specific genotype, country-, or MLST-related clustering was identified (Figure 4a).
The MLT of all group-B CRISPR loci did not show exclusive S. Typhi clades but re-illustrated the DR sequence specificity (Figure 4b). Only three S. Typhi DR sequences (Td35a, Td39a, and Td39b) showed specificity for the serovar, whereas the subclades of other DR sequences were connected to other Salmonella species (Figure 4b). MLT of all DR sequences detected in S. Typhi, other Salmonella spp., and E. coli showed striking sequence similarity to the most dominant S. Typhi DR, Td29a with five other DR sequences (three from other Salmonella serovars and two from E. coli), and all were accompanied by high spacer counts ( Figure S5a,b). The MLT of all group-B CRISPR loci did not show exclusive S. Typhi clades but re-illustrated the DR sequence specificity (Figure 4b). Only three S. Typhi DR sequences (Td35a, Td39a, and Td39b) showed specificity for the serovar, whereas the subclades of other DR sequences were connected to other Salmonella species (Figure 4b). MLT of all DR sequences detected in S. Typhi, other Salmonella spp., and E. coli showed striking sequence similarity to the most dominant S. Typhi DR, Td29a with five other DR sequences (three from other Salmonella serovars and two from E. coli), and all were accompanied by high spacer counts ( Figure S5a,b).

Spacers and DRs of S. Typhi
Among the spacers, Ts32i had ubiquitous presence (n = 976) among all study settings, countries, and genotypes, except for the genotype 3.2.2 (n = 0; p < 0.001). Among others, Ts32c, e, g, h, and l had a high number of presence among S. Typhi loci ( Table 3). The spacer arrangement pattern, a2, and a5 both presented high specificity (based on presence or absence) to a major non-multidrug resistance (MDR) genotype 3.2.2 (p < 0.001) ( Figure 5). Among the dominant spacer patterns, a5 was significantly underrepresented in the MDR or XDR group (p < 0.001), whereas a2 was present ubiquitously (p < 0.001; Figure S6). The same XDR isolates were also dominated by a combined pattern of spacer arrangement, a2-b1 (n = 79), however, the pattern was also present in a high number of non-XDR isolates ( Figure S6b,c).

Spacers and DRs of S. Typhi
Among the spacers, Ts32i had ubiquitous presence (n = 976) among all study settings, countries, and genotypes, except for the genotype 3.2.2 (n = 0; p < 0.001). Among others, Ts32c, e, g, h, and l had a high number of presence among S. Typhi loci ( Table 3). The spacer arrangement pattern, a2, and a5 both presented high specificity (based on presence or absence) to a major non-multidrug resistance (MDR) genotype 3.2.2 (p < 0.001) ( Figure 5). Among the dominant spacer patterns, a5 was significantly underrepresented in the MDR or XDR group (p < 0.001), whereas a2 was present ubiquitously (p < 0.001; Figure S6). The same XDR isolates were also dominated by a combined pattern of spacer arrangement, a2-b1 (n = 79), however, the pattern was also present in a high number of non-XDR isolates ( Figure S6b,c). A closer look at the presence of different DRs revealed a couple of specific patterns (n = 6) as well. The two group-B CRISPR loci specific DRs, Td39a, and Td39b, were more frequently observed among S. Typhi isolates obtained from surveillance (n = 173) than outbreak (n = 7) or travel (n = 0) cases (p < 0.001) ( Table 2 and Figure 3). Two of the dominant DRs, Td23a (n = 456) and Td35a (n = 192), were almost absent among the Bangladesh surveillance isolates (n = 3 and n = 6, respectively), whereas the latter DR was only identified in two Pakistan outbreak-related isolates (p < 0.001) ( Table  2). A few pairs of spacers and DR sequences (Ts34d-Td35a, Ts55a-Td23a, and Ts54a-Td39a/b) also showed specificity to different countries and study settings (p < 0.001) (Table S3). Dataset S3 has the sequences of all identified DR and spacers.

Spacer Targets and PAM Identification
We thus identified specific spacers, DRs, combined spacer patterns, and DR-spacer pairs of S. Typhi that could potentially serve as biomarkers to help identify regional endemicity and AMR amongst others. Only a few spacers harbored 100% (or, nearly 100%) identity with the bacterial, plasmid, phage, viral, and AMR-related sequences (see Doc S1 and Figure S7 for more details and about the databases and filter settings). For all the obtained spacer target hits, the possible PAM sequences (10 bp downstream and upstream of the protospacer) were not conserved, except for the spacers targeting the plasmid sequences (Figure 6a-e). Indeed, the potential PAM regions of plasmid sequences were highly conserved and were marked by the motif TTTCA (upstream) and TGCGT (downstream) (Figure 6b). An almost identical but less conserved motif TTTCT was also observed in the upstream PAM region of the protospacers identified in the phage sequences (Figure 6c). In total, only six different spacers (Ts23b, Ts32a, Ts32g, Ts32i, and Ts32o) harbored protospacers in the phage A closer look at the presence of different DRs revealed a couple of specific patterns (n = 6) as well. The two group-B CRISPR loci specific DRs, Td39a, and Td39b, were more frequently observed among S. Typhi isolates obtained from surveillance (n = 173) than outbreak (n = 7) or travel (n = 0) cases (p < 0.001) ( Table 2 and Figure 3). Two of the dominant DRs, Td23a (n = 456) and Td35a (n = 192), were almost absent among the Bangladesh surveillance isolates (n = 3 and n = 6, respectively), whereas the latter DR was only identified in two Pakistan outbreak-related isolates (p < 0.001) ( Table 2). A few pairs of spacers and DR sequences (Ts34d-Td35a, Ts55a-Td23a, and Ts54a-Td39a/b) also showed specificity to different countries and study settings (p < 0.001) (Table S3). Dataset S3 has the sequences of all identified DR and spacers.

Spacer Targets and PAM Identification
We thus identified specific spacers, DRs, combined spacer patterns, and DR-spacer pairs of S. Typhi that could potentially serve as biomarkers to help identify regional endemicity and AMR amongst others. Only a few spacers harbored 100% (or, nearly 100%) identity with the bacterial, plasmid, phage, viral, and AMR-related sequences (see Doc S1 and Figure S7 for more details and about the databases and filter settings). For all the obtained spacer target hits, the possible PAM sequences (10 bp downstream and upstream of the protospacer) were not conserved, except for the spacers targeting the plasmid sequences (Figure 6a-e). Indeed, the potential PAM regions of plasmid sequences were highly conserved and were marked by the motif TTTCA (upstream) and TGCGT (downstream) (Figure 6b). An almost identical but less conserved motif TTTCT was also observed in the upstream PAM region of the protospacers identified in the phage sequences (Figure 6c). In total, only six different spacers (Ts23b, Ts32a, Ts32g, Ts32i, and Ts32o) harbored protospacers in the phage sequences, all were short in length (23 or 32 bp), mostly present in group-A loci (Tables 3 and 6). In contrast, five spacers (Ts34j, Ts53a, Ts53b, Ts53c, and Ts59a) that harbored protospacers in the plasmid sequences showed specificity to the group-B CRISPR loci and longer in base-pair length (34,53 or 59 bp) (Tables 3 and 6).
Each phage-targeting spacer had a different viral target, but none, except Ts32i, targeted a Salmonella spp. phage (accession: MK268344.1) ( Table 6). Ts32i was present in 91% (974/1059) of the isolates except in genotype 3.2.2 (Table 3). Ts32g was also ubiquitously present (n = 1054) and had a target against Sinorhizobium phage phiN3 (Tables 3 and 6). In contrast, all plasmid sequences targeted by the S. Typhi spacers were part of only four different "hypothetical" proteins from either the species Salmonella enterica or the family Enterobacteriaceae ( Table 6). None of these proteins showed any hit in the Pfam database (https://pfam.xfam.org/). sequences, all were short in length (23 or 32 bp), mostly present in group-A loci (Tables 3 and 6). In contrast, five spacers (Ts34j, Ts53a, Ts53b, Ts53c, and Ts59a) that harbored protospacers in the plasmid sequences showed specificity to the group-B CRISPR loci and longer in base-pair length (34,53 or 59 bp) (Tables 3 and 6). Each phage-targeting spacer had a different viral target, but none, except Ts32i, targeted a Salmonella spp. phage (accession: MK268344.1) ( Table 6). Ts32i was present in 91% (974/1059) of the isolates except in genotype 3.2.2 (Table 3). Ts32g was also ubiquitously present (n = 1054) and had a target against Sinorhizobium phage phiN3 (Tables 3 and 6). In contrast, all plasmid sequences targeted by the S. Typhi spacers were part of only four different "hypothetical" proteins from either the species Salmonella enterica or the family Enterobacteriaceae ( Table 6). None of these proteins showed any hit in the Pfam database (https://pfam.xfam.org/).   Table 6. Detected targets of S. Typhi spacer against the plasmid and phage database. The target-finding algorithm is illustrated in Figure S7.

Cas Genes
All 1059 S. Typhi isolates had a set of eight cas-genes belonging to the type-I-E CRISPR-Cas system. Except for five, all (n = 1054) had the same cas locus length, gene arrangement, and orientation (Figure 7a-c). Among them, 1047 had a group-A CRISPR locus present, 85-87 bp downstream of the cas gene loci, whereas six had a group-B locus at that location and one had none (Figure 7a-c). Five other isolates had a non-sense mutation in their cas gene sequence (Figure 7d-f). The isolate (accession: ERR2663968) with two group-A CRISPR loci had two complete sets of cas genes (the second set is depicted in Figure 7g). Blastn analysis of the second set of cas genes showed >95% sequence identity with other Salmonella enterica rather than S. Typhi. The cas genes of the second set were placed outside the primary clade that contained all other S. Typhi cas genes in the MLTs. Indeed, this is true for all three MLTs of cas1, cas2, and cas3 genes from S. Typhi, other Salmonella species, and E. coli ( Figure  S8a-c). None of the three MLTs showed any high-bootstrap branching either inside that S. Typhi clade, reflecting a high level of conservation and stability of its cas-loci ( Figure S8). However, the presence of a few other Salmonella species inside the S. Typhi clade was noticed in the case of cas2-MLT ( Figure S8b). In addition to the type-I-E CRISPR-Cas system, all S. Typhi isolates had three copies of DinG, 2-4 copies of DEDDh, and 1-2 copies of WYL genes ( Figure S9 and Table S4). All three copies of DinG and two of the DEDDh genes were completely conserved in all 1059 S. Typhi isolates (Table S4). Blastn results of a copied variant of the WYL gene, WYL 888 (WYL gene of 888 bp length) showed high sequence identity with a gene that is commonly present on plasmid pK91 (found in S. Typhi genotype 4.3.1.3q1), plasmid-2 of the XDR (genotype 4.3.1.1.P1) isolates from Pakistan and pCTXM-2248 of an E. coli (accession: MG836696.1) [5][6][7]. Remarkably, it was only present in genotype 4.3.1.3q1 (100%; 56/56) and 4.3.1.1.P1 (XDR, 86/86) isolates (p < 0.001) ( Figure S9 and Table S4), making it a potential marker for lineage or plasmid identification.

Discussion
We here show that S. Typhi isolates can carry up to five different CRISPR loci and about 19% (203/1059) had three or more CRISPR loci ( Figure S2). Although previous studies reported only one or two loci [2,11,12], they analyzed WGS data of a handful of S. Typhi isolates, a maximum of 18 genomes by Fabre et al. [11], which could be the reason why others missed the third, fourth, or the fifth loci. However, these isolates carried only one group-A CRISPR locus with a high spacer count, resembles CRISPR1 in the previous nomenclature used by Fabre et al. and it agrees with a few of the previous reports on the CRISPR-Cas system in S. Typhi [11,12]. However, nearly 40% (422/1059) of our isolates had only one CRISPR locus and their number was significantly higher among Bangladeshi surveillance isolates, while the Pakistani outbreak isolates had a relatively lower average loci number (p < 0.001; Figure 1a and Table 1). Local and highly clonal S. Typhi lineages have been reported from both countries [5,6] and none of these lineages had higher average numbers of CRISPR In addition to the type-I-E CRISPR-Cas system, all S. Typhi isolates had three copies of DinG, 2-4 copies of DEDDh, and 1-2 copies of WYL genes ( Figure S9 and Table S4). All three copies of DinG and two of the DEDDh genes were completely conserved in all 1059 S. Typhi isolates (Table S4). Blastn results of a copied variant of the WYL gene, WYL 888 (WYL gene of 888 bp length) showed high sequence identity with a gene that is commonly present on plasmid pK91 (found in S. Typhi genotype 4.3.1.3q1), plasmid-2 of the XDR (genotype 4.3.1.1.P1) isolates from Pakistan and pCTXM-2248 of an E. coli (accession: MG836696.1) [5][6][7]. Remarkably, it was only present in genotype 4.3.1.3q1 (100%; 56/56) and 4.3.1.1.P1 (XDR, 86/86) isolates (p < 0.001) ( Figure S9 and Table S4), making it a potential marker for lineage or plasmid identification.

Discussion
We here show that S. Typhi isolates can carry up to five different CRISPR loci and about 19% (203/1059) had three or more CRISPR loci ( Figure S2). Although previous studies reported only one or two loci [2,11,12], they analyzed WGS data of a handful of S. Typhi isolates, a maximum of 18 genomes by Fabre et al. [11], which could be the reason why others missed the third, fourth, or the fifth loci. However, these isolates carried only one group-A CRISPR locus with a high spacer count, resembles CRISPR1 in the previous nomenclature used by Fabre et al. and it agrees with a few of the previous reports on the CRISPR-Cas system in S. Typhi [11,12]. However, nearly 40% (422/1059) of our isolates had only one CRISPR locus and their number was significantly higher among Bangladeshi surveillance isolates, while the Pakistani outbreak isolates had a relatively lower average loci number (p < 0.001; Figure 1a and Table 1). Local and highly clonal S. Typhi lineages have been reported from both countries [5,6] and none of these lineages had higher average numbers of CRISPR loci (Figure 1b). Hence, clonality could be a contributing factor for the lower number of CRISPR loci identified in these isolates.
The spacer sequences of S. Typhi showed remarkable conservation, and only 47 unique spacers were detected in 1919 CRISPRs identified in the genomes of 1059 S. Typhi isolates (Table 3 and Dataset S3). Many spacers in group-A loci (Ts32c, e, g, h, i, and l) were almost universally present in all S. Typhi isolates, whereas specific spacers (Ts55a, Ts54a, Ts34d) showed high numbers of presence in group-B loci ( Figure 2 and Table 3). Reports on CRISPRs identified in other pathogens described a higher number of unique spacers, i.e., 2823 spacers from 669 Pseudomonas aeruginosa and 745 from 100 E. coli isolates [26,39]. In our study, 48 other Salmonella (19 different serovars) and six E. coli isolates showed 857 unique spacers from 136 CRISPR loci and 118 unique spacers from 35 loci identified in their genome, respectively (data not shown). However, a study of 400 Salmonella enterica isolates of four serovars (Enteritidis, Typhimurium, Newport, and Heidelberg) reported 179 unique spacers [21]. A lower number of unique spacers have also been reported for pathogens like Campylobacter jejuni, Neisseria meningitidis, Pasteurella multocida, Streptococcus agalactiae, and Shigella spp. [40,41]. Such conservative nature of S. Typhi spacers could be due to host-restriction of S. Typhi.
It is now well established that spacers are likely to share complementarity with a target sequence (protospacer) in foreign DNA. The S. Typhi CRISPRs have been studied before, but the PAM sequence was yet to be defined. In our work, we report for the first time a possible PAM sequence, TTTCA/T. Although this PAM is based on the protospacers of only nine different spacers (Table 6), the nearly universal presence of two phage-targeting spacers, Ts32g (n = 1054) and Ts32i (n = 976), make this PAM motif more plausible. Besides that, Ts32i also targets a Salmonella phage suggestive for a functional CRISPR-Cas-related viral immunity system to protect the S. Typhi genome against bacteriophages.
Furthermore, the differentiation between the spacers or DRs of group-A and -B CRISPR loci were evident in our work. Very few spacers (n = 8) and DRs (n = 1) were present in both groups and considering the spacer targets, the S. Typhi group-A CRISPR loci seem more associated with phage defense, whereas group-B CRISPR loci potentially play a role in the defense against plasmids (Tables 3  and 6). This is not a common finding since the reports of defense mechanisms in other bacterial species against phages and plasmids are mainly linked to group-A CRISPR loci [42][43][44].
Similar to the previous reports [11][12][13], the CRISPR-Cas system identified in our study belongs all to the type I-E category in the case of S. Typhi. Among the identified cas genes, very few (n = 5) had an incomplete reading frame (Figure 7), which could be caused by non-sense mutations or sequencing errors. However, all cas gene loci were detected near a group-A locus, except six, where a group-B locus was present instead (Figure 7b). Thus, most of the group-B loci can be called "orphan" loci. According to the CRISPRCasFinder tool, CRISPR loci with low evidence score (which we termed group-B loci) might be false-positive, but some of the CRISPR arrays can be real. Indeed, the CRISPRCasFinder tool was specifically designed to identify these types of CRISPR loci so they could be functionally studied [32]. To our knowledge, orphan loci have never been reported for S. Typhi before. However, as identified in other prokaryotes, they can exist and even be functional without nearby cas-gene loci [15,16,18,32,45,46].
We also identified three different cas genes of other types of CRISPR-Cas system, i.e., DinG, DEDDh, and WYL ( Figure S9 and Table S4). Although the presence of the DinG family helicase gene suggests an existing type-IV-A CRISPR-Cas system [33], no other cas-genes of that system were found. No CRISPR loci were present on the same contigs either, but that is not uncommon for this type of system [16,18]. The type IV-A system is considered as a degraded derivative of class 1 CRISPR-Cas system, hypothesized to be originating from combinations of mobile genetic elements [16,18,47]. The presence of multiple copies of the WYL gene (part of the type-I system) among the S. Typhi isolates in our study was interesting, as two copies of this gene, WYL 693 and WYL 888 , had a difference in origin and presence. The former had a chromosomal match, whereas the latter was probably plasmid-borne (Table S4). WYL 888 matched the plasmid sequences of genotype 4.3.1.3q1 (Bdq lineage) and 4.3.1.1.P1 (XDR lineage) [5][6][7], making it a potential biomarker for these resistance lineages. However, the role of WYL 888 on these plasmids remains to be elucidated. Remarkably, both the S. Typhi lineages completely lacked a copy of the DEDDh 558 gene (Table S4). Proteins containing the WYL domain are not uncommon in bacteria and have been reported to regulate transcription of the CRISPR-Cas systems [48]. The DEDDh gene, on the other hand, has defined exonuclease activity and can fuse with cas1 and cas2 genes to exert such function [49,50]. The presence of multiple DEDDh domains in S. Typhi genomes may indicate a compensatory role for the shorter cas3 gene (compared to other Salmonella species, data not shown), which also functions as an exonuclease.

Conclusions
In conclusion, this study is the first large-scale bioinformatic investigation of the S. Typhi CRISPR-Cas system identified in the genomes obtained from isolates studied in different backgrounds and four endemic countries. Our results reveal unique conservation and clonality of the S. Typhi type I-E CRISPR-Cas system, specifically the cas-genes. Despite the clonality of this system, variations were identified in the type I-E CRISPR-Cas system of S. Typhi that significantly associated with AMR status, genotypes, demographic origin, and endemic isolates currently circulating in the south Asian countries. Although no AMR-gene targeting spacers were found, spacers targeting the AMR-containing plasmids were identified. This indicates a lack of a direct CRISPR-regulated pathway, rather regulating the AMR-gene acquisition or elimination via controlling the entry of plasmids. Finally, a possible S. Typhi PAM sequence, TTTCA/T, was defined in this study. Our findings lay a foundation for new genetic and biochemical experiments to dissect the CRISPR-Cas system of S. Typhi further and gain mechanistic insights into its molecular function. Overall, the strong correlations of variations identified in the system with AMR and demographic data of the endemic isolates from south Asian countries should be investigated further with keeping the development of rapid and inexpensive diagnostic tests as a target.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4425/11/11/1365/s1, Dataset S1: Complete data of all S. Typhi isolates used in this study, including the ENA accessions, metadata from the source articles, and the results we generated for each isolate like-MLST, genotype, number of CRISPR loci, spacer patterns, and presence of newly described cas-genes. Dataset S2: Accession numbers of 48 other Salmonella (19 different serovars, excluding Typhi) and six E. coli isolates. Dataset S3: Fasta sequence file of all S. Typhi direct repeats (DR) and spacers identified in this study. Doc S1: Describes the additional parts of the Method and Material section. Figure S1: An explanation of the identifier we generated for each unique direct repeat (DR) and spacer sequence in this study. Figure S2: Number of CRISPR loci present per S. Typhi isolate. Maximum five different loci were found. Figure Figure S7: Flow-chart showing the details of spacer-target finding algorithm. Figure S8: Phylogenetic trees of all S. Typhi, other Salmonella and E. coli isolates in this study, based on (a) cas1 (model: LG+G4), (b) cas2 (model: LG+G4), and (c) cas3 (model: PROTGAMMAVTF) gene sequences. Figure S9: Presence of the identified cas genes of other types of CRISPR-Cas system than type-I-E in all 1059 isolates. Each arrow represents a specific cas gene. A stripped arrow indicates the location of the cas-genes into two different contigs and the dashed line of the arrow specifies an interrupted gene. Except for the DinG *924 and DinG *1212 variants of the DinG gene, none were located on the same contig. Table S1: Different S. Typhi genotypes identified in this study. Table S2: Estimated distance within and between different Salmonella serovars including S. Typhi based on multiple sequence alignment of all group-A CRISPR loci sequences. Table S3: Presence of multiple DR-spacer pairs in different genotypes, countries, and study settings. Table S4: Details on different copies of DinG, DEDDh, and WYL genes. The length of the different copies of the genes was determined and added with their gene name (in superscript) to define an identifier for the coding sequence (CDS). An asterisk (*) was added to all cas genes of an isolate if the CDS had any non-sense mutation and interrupted prematurely.