AT homopolymer strings as a motif for self-recognition and repair of genomes in Salmonella enterica subspecies I

Adenine and thymine homopolymer strings of at least 8 nucleotides (AT 8+mers) were characterized in Salmonella enterica subspecies I and other Eubacteria. Incidence of the motif differed between Eubacteria but not between Salmonella enterica serotypes. Of 481 AT 8+mer loci in serovars Typhimurium, Enteritidis, and Gallinarum, 35 (12.3%) had mutations. We propose that the AT 8+mer motif identifies genomes with optimal gene content and provides self-recognition that facilitates efficient genome repair. A theory that genome regeneration accounts for both serovar diversity and persistence of the predominant Salmonella serovars provides a new framework for investigating root causes of foodborne illness.


Introduction
Approximately 30 of 1,500 Salmonella enterica subspecies I (S. enterica) serovars have been persistent agents of foodborne illness in people for the past several decades [1]. Despite improved biosecurity throughout the food production pipeline, reduction of salmonellosis has plateaued over the past two decades [2]. The inability to reduce salmonellosis indicates that new approaches to understanding the biology of this important pathogen are needed. Recently, the most commonly occurring single nucleotide polymorphism (SNP) that disrupted a gene in S. enterica serovar Enteritidis (Enteritidis) was identified: deletion of a single adenine in a homopolymer string of 8 nucleotides (nt) within the fimbrial gene sefD [3]. Mutational analysis, phenotype microarray, and infection experiments in the egg-laying hen indicated that the sefD mutation increased organ invasion and mortality in hens, disturbed egg production, enhanced growth of the pathogen to high cell density, and otherwise behaved as a regulator of phenotypic dimorphism [4]. One practical outcome of the discovery was that the performance of a killed vaccine for hens was enhanced by increasing SefD in preparations [5]. The drastic change in biological phenotype imparted by the single base pair deletion suggested that homopolymer strings of the purine adenine, AAAAAAAA, and of its pyrimidine base pair (bp) thymine, TTTTTTTT, should be characterized in S. enterica and other Eubacteria.
A homopolymer of adenine:thymine (AT) with 8 nucleotides or more is abbreviated in this manuscript as an AT 8+mer. It is a DNA motif suggested by conformational studies to bend DNA out of the B-conformation [6]. Polyadenine regions can impact gene regulation in prokaryotes and can contribute to microsatellite instability in eukaryotes [7][8][9][10]. Evidence exists that homopolymer nucleotide strings contribute to non-programmed slipped-strand replication and the accumulation of errors in DNA [11][12][13]. Thus, the physicochemical impact of these strings was another reason to catalogue this motif in the genome of S. enterica.
To evaluate AT 8+mers in S. enterica subspecies I, several serovars of S. enterica and other bacterial genera were compared for both AT 8+mers and GC 8+mers. S. enterica serovars Enteritidis and Typhimurium were analyzed because they are two of the three most common causes of foodborne salmonellosis in the US and abroad. They are from different genomic lineages and have been extensively studied and sequenced. Together they have caused approximately 40% of all foodborne salmonellosis in the US [1]. S. enterica Gallinarum was included because it is another poultry-associated pathogen that shares a genomic lineage with Enteritidis. However, its biological impact is different from that of Enteritidis. It does not cause human salmonellosis; instead, it causes devastating disease in poultry resulting in high morbidity, mortality, and economic loss [14]. Comparing Typhimurium and Enteritidis, which have different genomic lineages yet cause foodborne illness, to Gallinarum, which is genetically related to Enteritidis but has a drastically different epidemiology from both, is a comparative approach that has been used previously to link single nucleotide polymorphisms to phenotype [15]. In this study the three genomes were compared to better understand the content of AT 8+mer homopolymer nucleotide strings in S. enterica, and the association the motif might have with naturally occurring mutations that disrupt open reading frames of genes.
Additional background on select S. enterica serotypes
S. enterica serovars Enteritidis and Typhimurium differ biologically although both are predominant causes of foodborne illness. One way they differ is in the immunological properties of the cell surface. Serovar Typhimurium is a serogroup B organism, with an antigenic formula of 1,4,[5],12:i:1,2 [16]. Serovar Enteritidis is a serogroup D organism, with an antigenic formula of 1,9,12:g,m:-; thus, it is monophasic, expressing only one flagellar antigen phase [16].
Epidemiological patterns for the two predominant pathogens also differ. Enteritidis is an exceptional Salmonella pathogen in part because it efficiently contaminates the internal contents of eggs produced by otherwise healthy-appearing hens. It produces a high molecular mass (HMM) O-antigen, which not only protects the pathogen from killing by the host complement system, but also acts as a protective capsule in the hostile environment of the egg [17][18][19]. Typhimurium is also resistant to complement, but it does not produce HMM O-antigen and, thus, does not survive in the internal contents of eggs to an extent that can be detected by epidemiological surveillance. Both Typhimurium and Enteritidis can contaminate a broad spectrum of other sources such as the eggshell, the poultry gastrointestinal tract, poultry carcasses, and fresh vegetables. Both serovars can invade organs and survive in macrophages, which contributes to systemic spread during infection [20]. Variation between strains within each serovar occurs, but serotype characteristics and general genome organization are maintained [21,22]. There are serovar-specific patterns in plasmid carriage and fimbrial genes. Comprehensive reviews of the similarities and differences between Salmonella serovars are available [23][24][25][26][27].
S. enterica serovars Gallinarum and Enteritidis are genetically closely related [28]. Gallinarum's antigenic formula is 1,9,12:-:-, which indicates it has the same lipopolysaccharide O-antigen epitopes as Enteritidis; however, it lacks both H1 and H2 flagellin proteins and is thus non-motile. Both Gallinarum and Enteritidis can contaminate the internal contents of eggs; however, Gallinarum has mutations and rearrangements throughout its chromosome that restrict its host range to the avian host, possibly by reducing immunological response to infection and thus facilitating systemic infection [20]. Thus, the most striking difference between the foodborne pathogen Enteritidis and Gallinarum is that the latter makes poultry extremely sick, reduces egg production, and causes high mortality. In contrast, hens infected with Enteritidis often appear healthy and remain in production, and thus eggs become contaminated internally and are a source of foodborne illness. The ability of Enteritidis to spread through flocks that appear healthy was one of the contributing factors in its worldwide spread through the layer industry. The differences in the epidemiology, association with food, and virulence characteristics of the three pathogens, all of which occur in the poultry environment, suggested that comparative analysis of S. enterica serovars Typhimurium, Enteritidis, and Gallinarum would help set a baseline for the association between the AT 8+mer motif and naturally occurring mutation of an important foodborne pathogen. Other pathogenic Salmonellae and other Eubacteria were also included in the analysis.

Genomes of Eubacteria analyzed for strings of homopolymers
The database of 1,434 complete genomes of S. enterica subspecies I (taxid:59201), as well as the other Eubacteria listed in Table S1, as available from the National Center for Biotechnology Information (NCBI), was used as source material [29]. The last accession date was April 30, 2020. S. enterica serovar Typhimurium LT2 (NC_003197.2) was used as the primary reference sequence to name genes and gene functions, and it was used to order genes [30]. Two other references were S. enterica serovar Enteritidis strain P125109 and S. enterica serovar Gallinarum strain 9184, with respective NCBI accession numbers of NC_011294.1 and CP019035.1 [31,32]. S. enterica serovars Typhimurium, Typhi, and Enteritidis genomes were over-represented compared to other serovars, and together they comprised 39.4% of all completed genomes available. Only 51.2% of the S. enterica subspecies I sequences had a complete adenylate cyclase (cyaA) gene, which is required for virulence as a foodborne pathogen. The other sequences were plasmids, which were not under review in this study. Genome CP018657 is classified as serovar Enteritidis, but all analyses suggest it is serovar Typhimurium; thus, it was excluded from analyses. A broader examination of AT and GC 8+mer homopolymers included Escherichia coli, Proteus mirabilis, Shigella sonnei, Yersinia pseudotuberculosis, Vibrio vulnificus (chromosomes I and II), Staphylococcus aureus, Streptococcus pyogenes, Enterococcus faecalis, Bacillus anthracis, and Bacillus cereus. Genome databases at NCBI show homopolymer strings, as well as other low-complexity regions, in lower-case gray font because such strings can be susceptible to alignment error and may thus require masking during the alignment process. For the BLAST searches conducted here, the regions surrounding each gene were checked for high-fidelity alignment; therefore, it is unlikely that low complexity affected the observed alignments.

Incidence and location of homopolymer nucleotide strings
Counting of kmers, locating kmers within genomes, and determining impact on open reading frames within annotated genes were done with Geneious Prime 2020.0.3 (Biomatters, Inc., San Diego, California, USA). Homopolymer strings of all 4 nucleotides, ranging from 5 to 20 nucleotides, were catalogued in several S. enterica serovars and other genera (Table 1). For S. enterica subspecies I grouped by serovar, at least 12 complete genomes were assessed. For other genera, at least 3 complete genomes were assessed. Averages and standard deviations were calculated. A t-test was used to determine whether differences between groups were significant at p < 0.01. For additional data processing, the genome of interest was stored in SeqBuilder Pro, Lasergene V16.0.0 (352) (DNASTAR, Madison, Wisconsin, USA) and in Geneious format. Strings of homopolymers of different lengths were entered as windows of text and the genomes were searched. Results were copied into an Excel ".csv" file as Unicode text (Microsoft Excel for Mac, V16.16.20 (200307)). The text-to-columns feature, with appropriate delimiters, was used to produce columns of data to calculate the distance between nucleotide strings. The average, standard deviation, and median of the distances between AT 8+mer homopolymers were then calculated.
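The search-and-measure procedure described above was carried out in Geneious, SeqBuilder Pro, and Excel. As an illustration only, the same idea can be sketched in Python with the standard library. The function names, the toy sequence, and the choice to measure distance from the end of one motif to the start of the next are assumptions made for this sketch, not the authors' implementation. Because a run of adenines on one strand is a run of thymines on the other, scanning a single strand for both patterns covers both strands.

```python
import re
from statistics import mean, median, stdev

# Runs of 8 or more A, or 8 or more T (never mixed), as read on one strand.
AT_8PLUS = re.compile(r"A{8,}|T{8,}")


def find_at_8plus_mers(sequence):
    """Return (start, end, run) per AT 8+mer; start is 0-based, end is exclusive."""
    seq = sequence.upper()
    return [(m.start(), m.end(), m.group()) for m in AT_8PLUS.finditer(seq)]


def spacing_between(hits):
    """Distance in nt from the end of one motif to the start of the next (assumed definition)."""
    return [nxt[0] - cur[1] for cur, nxt in zip(hits, hits[1:])]


# Toy sequence for illustration only; this is not Salmonella data.
toy = "GC" * 5 + "A" * 8 + "GATTACA" * 3 + "T" * 9 + "CG" * 10 + "A" * 10 + "GGCC"
hits = find_at_8plus_mers(toy)
gaps = spacing_between(hits)

print(hits)
print("gaps:", gaps)
print("mean:", mean(gaps), "median:", median(gaps), "sd:", stdev(gaps))
```

A real genome would be read from a FASTA file rather than built in code, and a circular chromosome would also need the distance that wraps around the origin, which this sketch ignores.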

Determination of a common denominator for comparison of genomes from bacteria of different genera
S. enterica subspecies I serovar Typhimurium LT2 was the reference genome used to produce a common denominator to normalize genomes of different sizes. Every AT 8+mer in Typhimurium LT2 was tabulated and classified as intergenic, intragenic, or regulatory using Geneious Prime 2020.0.3 (Table S2). The same software was used to generate a map of AT kmers within the circular chromosome. Another approach used to establish a baseline incidence of AT 8+mers occurring in genes was to draw random numbers corresponding to the 4,600 predicted genes of the reference genome. Two hundred random numbers between 1 and 4,600 were generated, each corresponding to a numbered gene; a FASTA file of the selected genes was compiled; and the number of AT 8+mers within the randomly generated set was determined.
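The random-gene baseline can be sketched as follows. This is a minimal illustration, not the authors' procedure: the text does not state whether numbers were drawn with or without replacement, so sampling without replacement is assumed here, and the dictionary that maps gene numbers to sequences is a hypothetical stand-in for the compiled FASTA file.

```python
import random
import re

AT_8PLUS = re.compile(r"A{8,}|T{8,}")


def sample_gene_numbers(n_genes=4600, n_draws=200, seed=None):
    """Draw gene numbers from 1..n_genes without replacement (assumption)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, n_genes + 1), n_draws))


def genes_with_motif(gene_sequences, gene_numbers):
    """Return the sampled gene numbers whose sequence holds at least one AT 8+mer.

    gene_sequences is a hypothetical dict of gene number -> sequence string.
    """
    return [
        g for g in gene_numbers
        if AT_8PLUS.search(gene_sequences.get(g, "").upper())
    ]


picked = sample_gene_numbers(seed=1)

# Toy sequences for illustration only: one sampled gene carries the motif.
toy_genes = {g: "ATGGCC" * 5 for g in picked}
toy_genes[picked[0]] = "ATG" + "A" * 8 + "GCC"

with_motif = genes_with_motif(toy_genes, picked)
print(len(picked), "genes sampled;", len(with_motif), "with an AT 8+mer")
```

Fixing the seed makes the draw repeatable, which helps when the sampled set must be reported alongside the result.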

Comparison of AT 8+mers in 3 S. enterica serovars that vary in epidemiological parameter
A FASTA file was generated from the list of genes and regulatory regions having AT 8+mers in Typhimurium LT2. The reference genome FASTA file was used for BLAST searches against the other two serovars selected for detailed analysis, namely S. enterica Enteritidis P125109 and Gallinarum 9184. Each genome was sequentially processed for AT 8+mers as they appeared on either strand in either direction, using Geneious Prime 2020.0.3 functions. Differences occurring within AT 8+mers for the 3 genomes were tabulated. Other manipulations of genes used data available at NCBI or were further analyzed with Lasergene V16.0.0 (352) (DNASTAR, Madison, Wisconsin, USA). Table 1 lists all genera evaluated for the motif. First, S. enterica subspecies I serovars were collated to include 12 different complete genomes each for serovars Typhimurium, Enteritidis, and Typhi. A fourth S. enterica group included 12 foodborne Salmonellae of mixed serovars associated with poultry and/or foodborne disease. Only strains with complete genomes were analyzed because gaps associated with draft genomes would impact results. Table 1 also lists results from analysis of 3 strains each from a variety of Eubacteria genera; in addition, the outlier group for S. enterica was 12 strains of Escherichia coli (E. coli). Values greater than 1 indicate that more than the expected number of motifs were observed in comparison to S. enterica after normalizing for the size of the genome, and values less than 1 indicate that fewer motifs were observed than expected.
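One way to read the normalized values is as an observed-to-expected ratio, where the expected count scales the genome length by the average motif spacing in the reference genome. The sketch below works under that assumption; the exact formula used for Table 1 is not given in the text. The function names, the genome length, and the observed count are hypothetical, and the spacing of 16,634 nt is the Typhimurium LT2 average reported from Table S2.

```python
def expected_motifs(genome_length_nt, reference_spacing_nt):
    """Motifs expected if the genome matched the reference spacing."""
    return genome_length_nt / reference_spacing_nt


def normalized_incidence(observed, genome_length_nt, reference_spacing_nt):
    """Observed / expected; above 1 means more motifs than the reference predicts."""
    return observed / expected_motifs(genome_length_nt, reference_spacing_nt)


# Hypothetical genome: 5,000,000 nt with 450 AT 8+mers observed.
ratio = normalized_incidence(
    observed=450,
    genome_length_nt=5_000_000,
    reference_spacing_nt=16_634,  # LT2 average spacing reported in the text
)
print(round(ratio, 2))
```

Passing the reference spacing as a parameter keeps the calculation reusable if a different reference genome is chosen as the common denominator.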

The AT 8+mer motif in Eubacteria is specific to Genus and species
Results of comparisons between Eubacteria were as follows: 1) AT 8+mers in S. enterica groups were significantly more frequent than in E. coli (p < 0.005); 2) Results ranged from a minimum of 90.0 AT 8+mers for Vibrio vulnificus cII to a maximum of 712.7 for Proteus mirabilis; 3) Standard deviations between strains within each genus ranged from 2.3 for Yersinia pseudotuberculosis to 84.1 for Enterococcus faecalis; 4) All the genera examined, including S. enterica and E. coli, had a relative paucity of GC 8+mers as compared to AT 8+mers; thus, it appears there is a bias for Eubacteria maintaining AT 8+mers in genomes, or inversely, selecting against GC 8+mers; 5) Each genus appeared distinctly different from the others; thus, conservation of AT 8+mers appears to be species specific; 6) Vibrio vulnificus had 180 and 90 AT 8+mers in chromosomes cI and cII, respectively; thus, AT 8+mer content might be a chromosomal characteristic that maintains the organization of chromosomes.
Genomes varied widely in size across the Eubacteria, and a common denominator was needed to normalize data. To produce a common denominator, the reference genome of serovar Typhimurium LT2 was mapped for the location of all AT 8+mers. On average the motif occurred every 16,634 nt (Table S2). The AT 8+mers appeared to be dispersed throughout the entire genome of serovar Typhimurium LT2 (Figure 2). The distance between AT 8+mers ranged from 11 to 117,141 nt, and the median was 11,578 nt (Table S2). Distances of 52,048 nt or greater between motifs were over 3 standard deviations and were thus possibly deficient in AT 8+mers. Of 13 putatively deficient regions, the 4 longest were assessed for phage genes, pseudogenes, insertion elements, transposases, ribosome binding sites, and regulons. The 4 regions were located between nucleotides i) 1368633-1444823 (76,198 nt), ii) 2612956-2730097 (117,148 nt), iii) 4124625-4209022 (84,404 nt), and iv) 4342879-4418289 (75,418 nt). At this time, no feature could be found that differentiated AT 8+mer deficient regions from regions with shorter distances between AT 8+mers.

The AT 8+mer motif in Salmonella enterica is not specific to serotype
The genome of reference strain S. enterica serovar Typhimurium LT2 is 52.2% GC. When data were expressed as ratios of AT:GC homopolymer strings, the AT 8mer homopolymers (e.g., AAAAAAAA and TTTTTTTT) were much more prevalent than GC 8mers in the reference genome (Figure 1). In total there were 294 AT 8mers and 11 GC 8mers in the reference serovar, which is a ratio of approximately 27 AT 8mers to every GC 8mer. AT strings longer than 8 bp were less frequently observed (Figure 1). To account for every AT kmer of at least 8 nucleotides, the longer motifs were added to the 8mers in further analyses; thus, the term AT 8+mer is applied throughout to describe the motif. As noted in the introduction, the length of the homopolymer impacts the physicochemical bending properties of DNA, and thus we wanted to account for every kmer of 8 nucleotides or more.
Results from analysis of AT 8+mers between S. enterica serotypes were: i) The incidence of AT 8+mers in the reference genome for serovar Typhimurium LT2 was the lowest of the 12 strains in the group, which suggests that using this strain as a reference would not over-estimate the incidence of AT 8+mers for S. enterica or other genera; ii) The range of AT 8+mers per S. enterica grouping in Table 1 was from 315.6 to 332.6, and the average was 322.2 +/- 12.83 AT 8+mers; iii) The standard deviations for AT 8+mers in serovar Typhimurium and in the group of mixed serovars were, respectively, 13.0 and 13.9; iv) Serovars Enteritidis and Typhi, with respective standard deviations of 10.5 and 5.9, appeared more clonal than Typhimurium, which agrees with current knowledge; v) The foodborne serovars, namely Typhimurium, Enteritidis, and the group of mixed serovars, had a more variable motif content than host-restricted Typhi. Overall, the S. enterica serovar groups were not significantly different from each other. There were not enough completed genomes of the host-restricted serovar Gallinarum to include it in the analysis.
Table 2 lists all genes and regulatory regions with at least one AT 8+mer in serovars Typhimurium, Enteritidis, and/or Gallinarum. Genes were listed in the order in which they appeared in the reference genome for Typhimurium LT2 (NC_003197.2). Some genes in serovars Enteritidis and/or Gallinarum did not have homologs in the Typhimurium reference strain, and vice versa. Six categories of genes were listed, and a total of 175 genes and 13 regulons were included. The numbers of pseudogenes found with the motif for Typhimurium, Gallinarum, and Enteritidis were 3, 22, and 5, respectively, and the genomes had totals of 40, 287, and 96 pseudogenes, respectively. Overall, 7.5%, 8.4%, and 5.2% of pseudogenes had the motif, respectively. In total, 30 pseudogenes out of 481 loci (6.2%) were identified as having the motif.
For the 188 total genes and regulatory sites listed, 4.2% of genes had AT 8+mers for an S. enterica genome with an average of 4,517 genes.
Figure 1. The ratio of AT homopolymer kmers, either adenine or thymine but not mixed, to GC homopolymers was determined using Geneious software as described in the text. The range in number of nucleotides per kmer searched was 5 to 10 (see legend label). Results showed that a nucleotide motif of 8 was the most commonly encountered, and that approximately 27 AT homopolymers were found for every 1 GC homopolymer in the reference sequence of S. enterica LT2 NC_003197.2.

Discussion
The AT 8+mer motif was located in genes and regulatory regions that impact phenotype, growth potential, virulence, and metabolism of Salmonella enterica subspecies I. In addition, there is biological evidence that AT 8+mers influence evolution at the scale of the single nucleotide. For example, A and T homopolymers impact transcription termination in Archaea [33]. The canine herpesvirus thymidine kinase gene has mutational hotspots at stretches of 8 adenines [34]. T7 bacteriophage RNA polymerases undergo transcription slippage at A and T homopolymers [35]. As mentioned previously for S. enterica serovar Enteritidis, mutation at a hotspot in a string of 8 adenines increased virulence [3].
While there is reason to suspect AT 8+mers as mutational hotspots, the conundrum exists that there must be a mechanism for repair of accumulating mutations. Otherwise, evolution of any one serotype of S. enterica would be unidirectional towards extinction. There are several examples of Salmonella serotypes, e.g., Typhimurium, Enteritidis, Newport, Infantis, and Heidelberg, that continuously circulate over decades; however, the majority of serotypes cause illness inconsistently, rarely, or never [1,16]. For this reason, we theorize there is another function for AT 8+mers. It is proposed that AT 8+mers align sections of genomes during replication, DNA acquisition, and DNA repair processes, thus maintaining a general organization of the S. enterica genome. This function would result in repair of mutations occurring between stretches of wildtype AT 8+mers during the replication/repair process and/or during acquisition of new DNA by homologous recombination [36,37]. It would also account for an inherent mechanism of self-recognition, which would facilitate preferential, but not exclusive, DNA exchange within a genus and species. The pan-genome of S. enterica subspecies I has a mosaic structure between serotypes, with frequent inversions, deletions, and insertions occurring between serotypes; however, the chromosomal arrangement of many Salmonella lineages is comparatively stable [25,32,38,39]. A role for AT 8+mers in DNA replication, repair, and acquisition by repair mechanisms and homologous recombination would account for i) the stability of some serotypes with conserved genome features that are persistent, e.g., serovar Typhimurium [1], ii) the occasional emergence of a new serotype that happens to undergo clonal expansion in an environment favorable for growth, e.g., serovar Tennessee in peanut butter [40,41], iii) the rare emergence of a hybrid strain following a major recombination event that results in rapid proliferation of a serotype with new biological properties, e.g., serovar Enteritidis and its ability to contaminate and survive in the internal contents of eggs [42], and iv) the periodic emergence and disappearance of serotypes that are not optimized for survival in the environment in which they are generated.
S. enterica serovars with similar AT 8+mer content would thus be expected to maintain the ability to form Holliday structures, at least within subspecies I. In contrast, the two chromosomes of Vibrio vulnificus could be inhibited from recombination in part because their AT 8+mer content differs substantially. E. coli and S. enterica naturally exchange DNA, and an area of future research is to evaluate whether some, but not all, AT 8+mer content in chromosomal segments of different genera and species aligns to facilitate the formation of Holliday structures, which are an integral part of homologous recombination [43][44][45].

Conclusion
In summary, we suggest that AT 8+mers are a motif in the genome of Eubacteria that facilitates DNA replication, repair, and exchange while also maintaining speciation. With regard to Salmonella enterica subspecies I, the motif is proposed to contribute to the emergence of serotypes and, at the same time, to maintain some genomes with optimized gene content that are highly successful as foodborne pathogens [46][47][48][49]. Future research on the contribution of AT 8+mers to genome organization, fidelity of replication, and the ability to restore mutated gene content will require proof-of-concept experimentation. Biological experimentation at an applied level will focus on finding environmental niches within food production systems that facilitate genomic exchange and repair mechanisms. Application for improving food safety will involve determining effective interventions.