Comparative Analysis of Streptococcus pneumoniae Type I Restriction-Modification Loci: Variation in hsdS Gene Target Recognition Domains

Streptococcus pneumoniae (pneumococcus) is a respiratory commensal pathogen that causes a range of infections, particularly in young children and the elderly. Pneumococci undergo spontaneous phase variation in colony opacity phenotype, in which DNA rearrangements within the Type I restriction-modification (R-M) system specificity gene hsdS can potentially generate up to six different hsdS alleles with differential DNA methylation activity, resulting in changes in gene expression. To gain a broader perspective of this system, we performed bioinformatic analyses of Type I R-M loci from 18 published pneumococcal genomes, and one R-M locus sequenced for this study, to compare genetic content, organization, and homology. All 19 loci encoded the genes hsdR, hsdM, hsdS, and at least one hsdS pseudogene, but differed in gene order, gene orientation, and hsdS target recognition domain (TRD) content. We determined the coding sequences of 87 hsdS TRDs and excluded seven from further analysis due to the presence of premature stop codons. Comparative analyses revealed that the TRD 1.1, 1.2, and 2.1 protein sequences each differed at a single amino acid position, whereas TRD 2.2 and 2.3 each differed at seven positions. The results of this study indicate that variability in gene content and arrangement within Type I R-M loci may provide an additional level of divergence between pneumococcal strains, such that phase variation-mediated control of virulence factors may vary significantly between individual strains. These findings are consistent with presently available transcript profile data.


Importance
Phase variation is common among bacterial pathogens and usually involves a "switch" between different subpopulations. For example, pneumococcal populations undergo phase variation via recombination events within the Type I restriction-modification locus, yielding alternate alleles of the target specificity subunit gene hsdS and resulting in subpopulations with differential DNA methylation and gene expression. Here we present results of bioinformatic analyses to profile and compare the Type I R-M loci from a panel of diverse pneumococcal strains. Genetic variation in hsdS could affect the rate and targets of methylase-mediated phase variation, with potential consequences including specific gene silencing and/or altered gene expression. Sequence variation in hsdS genes encoding target recognition subunits, as well as variation in the number of flanking hsdS' pseudogenes within the relevant genomic locus, suggests additional levels of diversity between strains. This interpretation is consistent with the observed variability between strains in terms of transcript profiles and rates of phase variation.

Introduction
Streptococcus pneumoniae (pneumococcus) is a significant opportunistic pathogen that can cause a variety of localized infections of the respiratory mucosa, as well as serious invasive diseases such as sepsis and meningitis [1][2][3]. Invasive pneumococcal infections are associated with a high degree of morbidity and mortality, despite the availability of vaccines and antibiotics [4,5]. Pneumococcal carriage in the nasopharynx and upper airways is quite common, especially among children, and represents the reservoir and initial stage for infection [6][7][8][9][10][11][12]. The ability of pneumococci to rapidly adapt to different host environments is important to combat host immune defenses. Phase variation between differentiated phenotypic states is a common mechanism by which pathogenic bacteria can adapt rapidly to changing host environments. Pneumococcal phase variation is estimated to occur at a rate of 10⁻³ to 10⁻⁶ per generation (markedly greater than the 10⁻⁸ per generation rate for spontaneous mutation) and is visible as opaque or transparent colony phenotypes when viewed under oblique light [13]. Opaque variants are typically recovered from invasive infection sites and have increased virulence-associated phenotypes such as resistance to complement and phagocytic killing, whereas transparent variants are associated with asymptomatic colonization and localized disease [13][14][15][16]. Clinical pneumococcal isolates contain a heterogeneous mixture of both colony phenotypes. Because the phenotypes are not genetically "locked," cells can freely switch back and forth, resulting in a variable and constantly changing proportion of each phenotype within the overall bacterial population. Pneumococcal phase variation is a complex process that has been the subject of intense interest and study over a number of years [13,14][17][18][19][20][21].
The genetic mechanism for pneumococcal phase variation is based on recombination-mediated diversity between six different alleles of the hsd Type I restriction-modification (R-M) locus [22][23][24]. This locus encodes the target specificity gene hsdS, methyltransferase gene hsdM, and restriction gene hsdR that encode the HsdS, HsdM, and HsdR subunits. The subunits assemble into a heteromeric enzyme complex that functions to specifically methylate the bacterial genomic DNA and destroy foreign DNA [25]. The HsdS subunit has two different DNA target recognition domains (designated as TRD 1 and TRD 2) which direct recognition of specific sequence motifs that are then methylated by the HsdM subunit [22,23]. The number of recognition sequence repeats in the genome determines the DNA methylation pattern, which in turn influences gene expression. This mechanism is further complicated by the presence of hsdS pseudogenes with divergent TRD sequences in the R-M locus. To date, two versions of TRD 1 (named 1.1 and 1.2) and three versions of TRD 2 (named 2.1, 2.2, and 2.3) have been characterized. Recombination events between these TRD sequences can produce six predicted hsdS genes, resulting in six unique HsdS subunits that each have unique bacterial DNA methylation patterns [22,23].
HsdS-mediated DNA methylation was shown to change gene expression linked to phase variation of colony morphology [22][23][24]. Notably, we found that specific TRD alleles conferred different phenotypes in different pneumococcal strain backgrounds. One explanation for this phenotypic difference was that genetic variation may have existed in the hsdS TRD coding sequences. Thus, in this study, we performed a bioinformatics analysis to compare the DNA sequences of TRD 1.1, 1.2, 2.1, 2.2, and 2.3 between 19 pneumococcal genetic backgrounds and found that hsdS TRDs were greater than 96% similar at the protein level. This led us to hypothesize that the differences in TRD sequences could affect target sequence specificity and/or activity mediating DNA methylation. Notably, the published transcript profiles between phase-locked strains indicate significant variation in hsdS phase-types in different pneumococcal strain backgrounds; i.e., specific hsdS alleles did not uniformly result in the same gene expression profile. In order to understand how methylase-mediated phase variation may have differential effects in different strains, we performed a comprehensive bioinformatic comparison of the relevant genes from a wide array of pneumococcal genomic sequences. The results indicate that while there is a high degree of sequence conservation among hsd genes, there were key divergences in gene arrangement within the loci. Potential implications for these differences in rearrangements in control of phase variation are discussed.

S. pneumoniae hsd Type I Restriction-Modification (R-M) Loci Are Genetically Diverse
To gain a broad perspective of the Type I R-M locus in pneumococci, we first aimed to acquire a diverse genetic dataset. We performed a nucleotide BLAST search on the NCBI GenBank Database using the S. pneumoniae TIGR4 hsdS gene (1569 bp) as the query sequence. The search summary showed BLAST hits on 73 subject sequences. Two partial matching subject sequences belonged to Fusobacterium nucleatum but they were excluded because no Type I R-M system was detected in either F. nucleatum genome. Of the remaining 71 subject sequences, we identified 38 candidate data sets that had complete genome data available. Each candidate data set was carefully screened until we collected a representative pool of 18 complete genomes that differed in serotype, body site of origin, and country of origin (Table 1). No information was available regarding the ability of the strains to undergo phase variation of colony phenotype. We first aimed to determine if pneumococci contained multiple copies of the Type I R-M locus. To address this, each genome was carefully examined and found to contain a single copy of the locus. To determine the genetic content, organization and homology between the R-M loci, the coding sequences in each genome were downloaded from the GenBank database and carefully annotated using a DNA editing program. An annotated genomic sequence for S. pneumoniae EF3030 was not available, so its R-M locus was sequenced for this study. We found that all 19 R-M loci encoded the restriction gene hsdR, methylase gene hsdM, specificity gene hsdS, and at least one hsdS pseudogene, and ranged in size from 7.2 to 8.5 kb. Since we knew from our previous study that S. pneumoniae D39 and S. pneumoniae TIGR4 hsdS target recognition domains (TRD) were distributed differently, we wondered whether this was true in other pneumococcal strains. To address this, the 19 R-M loci were examined and found to encode a total of 87 hsdS TRDs that differed in location. 
With this information in hand, we were able to create schematic maps for each strain and easily assess shared features (Figure 1).
Figure 1 (caption fragment): [...] Table 1. Strain names are shown on the left. The hsdS target recognition domains 1.1 (red), 1.2 (yellow), 2.1 (green), 2.2 (blue), and 2.3 (orange) are shown. Strains EF3030 and G54 encoded two identical copies of TRD 1.1, so the second copy in hsdS'' was labeled TRD 1.1′. The two small coding sequences between hsdS'' and glnA are hypothetical genes.
Comparative analyses revealed that the R-M loci differed in gene order, gene orientation, and hsdS TRD content. Some strains (e.g., S. pneumoniae D39 and S. pneumoniae AP200) had identical genetic content and were grouped together, resulting in a total of 10 unique locus "types." Most strains (13 out of 19) encoded all five TRDs (1.1, 1.2, 2.1, 2.2, and 2.3). Two strains encoded two identical copies of TRD 1.1 but completely lacked TRD 1.2, while four strains encoded only some TRDs. Four out of 19 strains lacked both the recombinase gene creX and TRD 2.1, suggesting that the two genetic factors are commonly linked and are dispensable in this system. Overall, we concluded that the Type I R-M locus was a conserved feature in pneumococci that was susceptible to high rates of recombination-mediated mutation in the hsdS gene and pseudogenes. Moreover, because the number of hsdS TRD pseudogenes varies, the number of potential allelic combinations may also vary between strains.
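The link between TRD content and the number of potential hsdS alleles can be illustrated with a short sketch. It assumes, as a simplification not tested in this study, that any TRD 1 variant present in a locus can pair with any TRD 2 variant present in the same locus. The function name and the three example TRD lists are hypothetical and are not taken from Table 1.

```python
from itertools import product


def predicted_hsds_alleles(trds):
    """Return every TRD 1 x TRD 2 pairing available from a locus's TRD content.

    Assumption: recombination can join any TRD 1 variant to any TRD 2 variant
    found in the locus. Duplicate copies (e.g., two copies of 1.1) count once.
    """
    unique = set(trds)
    trd1 = sorted(t for t in unique if t.startswith("1."))
    trd2 = sorted(t for t in unique if t.startswith("2."))
    return list(product(trd1, trd2))


# Illustrative TRD contents (hypothetical examples)
examples = {
    "all five TRDs": ["1.1", "1.2", "2.1", "2.2", "2.3"],
    "lacking TRD 2.1": ["1.1", "1.2", "2.2", "2.3"],
    "two copies of 1.1, no 1.2": ["1.1", "1.1", "2.1", "2.2", "2.3"],
}

for label, content in examples.items():
    print(label, len(predicted_hsds_alleles(content)))
# all five TRDs 6
# lacking TRD 2.1 4
# two copies of 1.1, no 1.2 3
```

Under this assumption, a locus encoding all five TRDs yields the six predicted alleles described in the Introduction, whereas a reduced TRD set yields fewer. Whether loci lacking creX can recombine at all is a separate question that this sketch does not address.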

S. pneumoniae hsdS Coding Sequences Were Highly Homologous
Since we and others showed that the combination of hsdS TRDs within a single strain directly affected pneumococcal phase variation, we next aimed to determine whether genetic variation existed within the hsdS TRD coding sequences. To address this, all 87 hsdS TRD protein-coding sequences were determined. Seven sequences had premature stop codons and were excluded from further analysis. We next tallied the total number of in-frame sequences to compare for each TRD: TRD 1.1 (n = 18), TRD 1.2 (n = 15), TRD 2.1 (n = 15), TRD 2.2 (n = 17), and TRD 2.3 (n = 15). A summary is listed in Table 2. Comparative analyses using a protein alignment program revealed a high level of similarity (>96%) between sequences belonging to the same TRD. In the 18 TRD 1.1 comparisons, a single amino acid substitution was detected at position one in strain G54 (Phe1Ile), resulting in an overall 99.4% similarity. In the 15 TRD 1.2 comparisons, a single amino acid substitution was identified in S. pneumoniae SPN994039 and S. pneumoniae OXC141 (Ala134Pro), resulting in 99.2% similarity. In the 15 TRD 2.1 comparisons, three different residues were found at position 101: eight strains encoded 101-Gly, six strains encoded 101-Ala, and one strain encoded 101-Val, resulting in 99.4% similarity. In the 17 TRD 2.2 comparisons and the 15 TRD 2.3 comparisons, seven different mutation sites resulted in 96.7% similarity for each data set. We next wondered whether other genes in the Type I R-M locus differed from one another (Figure 2, Table 3). To address this, the protein-coding sequences for the restriction subunit HsdR, the methylase subunit HsdM, and the recombinase CreX were determined and compared. Sequence alignments of 18 HsdR, 18 HsdM, and 15 CreX protein sequences revealed a high level of similarity (>98%) between the subunits (Table 2).
Based on the data from these analyses, we were able to create a summary schematic that mapped mutation positions and the overall percent similarity (Figure 3). It became apparent that the hsdS genes and hsdS pseudogenes had more differences in the TRD 2 coding sequences. It is possible that the variation observed in TRD 2 indicates a driving role in determining DNA methylation specificity as compared to TRD 1. The possibility also exists that variation in the hsdR and hsdM genes could alter inter-subunit interactions and thereby affect the fidelity of DNA methylation.
Figure caption (fragment): Vertical lines indicate that at least one amino acid mutation was detected at that position. The total number of matching residues was used to determine % similarity (shown above each coding sequence).
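The percent similarity values reported in this section can be illustrated with a minimal sketch. It assumes one plausible reading of the figure caption: similarity is the fraction of alignment positions at which every compared sequence carries the same residue. This is not the calculation performed by the alignment program itself. The function name and the toy sequences are hypothetical.

```python
def percent_similarity(aligned_seqs):
    """Return (% invariant positions, 1-based list of variable positions).

    Assumes the sequences are already aligned and of equal length.
    A position counts as 'matching' only if all sequences share the residue.
    """
    if not aligned_seqs:
        raise ValueError("at least one sequence is required")
    length = len(aligned_seqs[0])
    if length == 0 or any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be non-empty and of equal length")

    # zip(*aligned_seqs) walks the alignment column by column
    variable = [
        pos
        for pos, column in enumerate(zip(*aligned_seqs), start=1)
        if len(set(column)) > 1
    ]
    pct = 100.0 * (length - len(variable)) / length
    return pct, variable


# Toy alignment (hypothetical residues, not real TRD sequences)
pct, sites = percent_similarity(["MKTAG", "MKTAG", "MKSAG"])
print(pct, sites)  # 80.0 [3]
```

With this definition, a domain with a single variable position scores just under 100%, and a domain with seven variable positions scores lower, which matches the pattern reported for TRD 1 versus TRD 2.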

Discussion
S. pneumoniae phase variation of colony morphology is mediated by site-specific recombination of the hsdS gene in a Type I restriction-modification (R-M) system, which alters DNA methylation and ultimately results in differential gene expression [22]. Although the combination of hsdS target recognition domains (TRDs) was shown to be associated with phase variation in different strains [22][23][24], we aimed to investigate whether the TRD DNA and protein-coding sequences were conserved in different strains. To address this, we conducted a comparative analysis of 19 S. pneumoniae Type I R-M loci and determined their hsdS TRD content, organization, and sequence identity.
The identification of a Type I R-M system in 19 diverse S. pneumoniae genetic backgrounds was significant because it suggested that this system may act as a conserved and underappreciated virulence factor. Within this dataset, we characterized ten unique R-M locus "types" that differed in hsdS genetic content. Six R-M locus "types" were shared by more than two strains each, indicating that some R-M loci may be more favorable than others and are maintained in a population. Since all ten "types" encoded at least three TRDs each, we concluded that the preservation of many TRDs could increase the likelihood of generating an advantageous recombinant hsdS variant with a better chance of survival. However, the four strains lacking TRD 2.1 and creX may have reduced potential for the generation of hsdS allele variants.
The high amount of hsdS genetic diversity found in the 19 Type I R-M loci led us to investigate the sequence identity of 87 different hsdS TRD protein sequences. We demonstrated that they were very similar, but not always identical. For example, the protein sequences of TRD 1.1, 1.2, and 2.1 had single amino acid differences while those of TRD 2.2 and 2.3 had seven differences each. These findings are important because they suggest that phase variation-specific epigenetic regulation via DNA methylation may be mediated by small genetic differences, particularly in the second TRD. One potential implication is that a mutation in an hsdS TRD could alter the HsdS recognition sequence, result in differential DNA methylation, and ultimately alter gene expression. This could lead to changes in bacterial fitness due to increased or decreased virulence factor expression or gene silencing. It is also possible that sequence differences in the HsdR, HsdM, and HsdS TRDs can mediate how the subunits work together as a multimeric complex. Future studies would be necessary to determine how the subunits fit together and whether the mutation sites map to the predicted multimeric interface.
S. pneumoniae phase variation occurs at a rate of 10⁻³ to 10⁻⁶ per generation and typically results in opaque or transparent colony phenotypes [13]. It remains unclear whether certain hsdS types, or mutations in hsdS TRDs, can alter the rate of phase switching in pneumococci. We propose that they would do so in a strain-specific manner, and that the effect would be highly dependent on the type and number of surface-expressed factors encoded. It is difficult to speculate on potential rates of phase switching when two studies have shown that pneumococcal strains encoding a single hsdS allele produced monomorphic colonies that were either all opaque or all transparent [23,24]. Spontaneous phenotypic variation is a common theme in pathogenic bacteria to increase biological fitness under changing conditions. Bacterial adaptation occurs by either alterations in DNA sequences (genetic mutations) or differences in DNA methylation (epigenetic regulation). Intra-host bacterial evolution is primarily driven by spontaneous mutations (e.g., slipped-strand mispairing, recombination events, and point mutations) in surface-expressed factors/antigens [41]. For example, random on/off slipped-strand mispairing over simple sequence repeats in genes encoding phosphorylcholine and other lipooligosaccharide antigens in Haemophilus influenzae can result in high-frequency phase variation (10⁻² per cell per generation) [42][43][44][45]. This mechanism has also been reported in Helicobacter pylori [46], Escherichia coli [47], and Staphylococcus aureus [48,49]. Recombination-mediated mutations resulting in phase variants have been reported for Salmonella [50], Mycoplasma pneumoniae [51,52], and Neisseria meningitidis [53,54]. Neisseria gonorrhoeae undergoes low-frequency phase variation (10⁻⁶ per cell per generation) [55]. Small colony variants of Pseudomonas aeruginosa typically arise due to spontaneous mutations [56].
These examples highlight how simple genetic modifications in a variety of human pathogens can alter their fitness.
The major limitation of this study was the relatively small number of complete pneumococcal genome candidates analyzed (n = 18). Another limitation was the query sequence used to identify candidate genome data. In this analysis, we clearly showed that some pneumococcal R-M loci lack one or more hsdS TRDs. By searching only for sequences that matched S. pneumoniae TIGR4 hsdS, which encoded only TRDs 1.1 and 2.2, we may have excluded some pneumococcal strains. A separate, more inclusive search of all available genomes that encode at least one TRD would allow us to gain a comprehensive understanding of genetic variation in the R-M locus. By comparing DNA methylation patterns and transcript profiles, one could potentially develop a more refined understanding of phase variation-specific expression patterns. Alternatively, this system could be used to aid in the alteration of virulence factor expression or gene silencing, which could help refine the expression of factors involved in host-pathogen interactions. Overall, our findings may help explain how pathogenic pneumococci can rapidly undergo intra-host adaptations in order to survive in diverse and rapidly changing environments. Understanding the relationship between the hsdS TRD combination, sequence identity, and specific recognition sequence may aid in a better understanding of how gene expression is altered. The findings in this study may serve as a model for future studies of host-pathogen interactions.

Sequence Analysis
Type I restriction-modification (R-M) system sequence data from 18 S. pneumoniae strains were acquired from the GenBank database. The sequencing of the S. pneumoniae EF3030 R-M system was performed at the UAB Heflin Center for Genomic Science. The 19 Type I R-M systems were carefully annotated using the DNA alignment program A Plasmid Editor V2.0.47. Each hsdS target recognition domain coding sequence was translated to protein sequence using A Plasmid Editor. Multiple protein sequence alignments were performed using EBI Clustal Omega, Cambridge, UK (1.2.4; http://www.clustal.org/omega/). The exclusion criterion for data analysis was the presence of a premature stop codon. The excluded sequences came from the following S. pneumoniae strains: AP200 (TRD 1.2 and 2.2), G54 (TRD 2.3), SWU02 (TRD 1.2 and 2.3), OXC141 (TRD 1.1), and EF3030 (TRD 2.3). Strains G54 and EF3030 each had two identical copies of TRD 1.1, so only one copy of the sequence was used in this analysis.
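The translation and exclusion steps described above can be sketched in a few lines. This is an illustrative stand-in, not the procedure implemented in A Plasmid Editor. It assumes the coding sequence is supplied in frame on the coding strand, uses the standard codon assignments, and treats any stop codon before the final codon as premature. The function names and the toy sequences are hypothetical.

```python
# Standard codon table built from the conventional T, C, A, G ordering
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate an in-frame DNA sequence; unknown codons become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def has_premature_stop(cds):
    """True if a stop codon ('*') occurs before the last translated codon."""
    protein = translate(cds)
    return "*" in protein[:-1]


# Toy sequences (hypothetical, not real TRD coding sequences)
print(translate("ATGAAATAA"))           # MK*
print(has_premature_stop("ATGAAATAA"))  # False: stop is the final codon
print(has_premature_stop("ATGTAAAAA"))  # True: stop precedes the final codon
```

A sequence flagged by `has_premature_stop` would be set aside before alignment, mirroring the exclusion criterion stated above.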