Comparative Genomics Applied to Systematically Assess Pathogenicity Potential in Shiga Toxin-Producing Escherichia coli O145:H28

Shiga toxin-producing Escherichia coli (STEC) O145:H28 can cause severe disease in humans and is a predominant serotype in STEC O145 environmental isolates. Here, comparative genomics was applied to a set of clinical and environmental strains to systematically evaluate the pathogenicity potential in environmental strains. While the core genes-based tree separated all O145:H28 strains from the non O145:H28 reference strains, it failed to segregate environmental strains from the clinical. In contrast, the accessory genes-based tree placed all clinical strains in the same clade regardless of their genotypes or serotypes, apart from the environmental strains. Loss-of-function mutations were common in the virulence genes examined, with a high frequency in genes related to adherence, autotransporters, and the type three secretion system. Distinct differences in pathogenicity islands LEE, OI-122, and OI-57, the acid fitness island, and the tellurite resistance island were detected between the O145:H28 and reference strains. A great amount of genetic variation was detected in O145:H28, which was mainly attributed to deletions, insertions, and gene acquisition at several chromosomal “hot spots”. Our study demonstrated a distinct virulence gene repertoire among the STEC O145:H28 strains originating from the same geographical region and revealed unforeseen contributions of loss-of-function mutations to virulence evolution and genetic diversification in STEC.


Introduction
Shiga toxin-producing Escherichia coli (STEC) consists of a group of genetically and phenotypically diverse strains differing greatly in pathogenicity. While some strains can cause severe disease, others are only associated with mild diarrhea or no disease at all [1]. Such variation is attributed in part to the differences in their genetic makeup, especially the repertoire of virulence genes. The genome of STEC is extremely diverse, consisting of a core genome that is conserved among the STEC strains and an accessory genome containing strain, lineage, or pathovar-specific genes [2][3][4]. Enterohemorrhagic E. coli (EHEC), originally referring to STEC strains that cause hemolytic uremic syndrome (HUS) or hemorrhagic colitis (HC), expresses Shiga toxin (Stx) and enterohemolysin, and produces the attaching-and-effacing (A/E) lesions on intestinal epithelial cells [5]. Although STEC O157:H7, a prototype of EHEC, has been considered the most frequent cause of STEClinked outbreaks, non-O157 strains are increasingly characterized as EHEC due to their association with HUS. Some of these strains, similar to O157:H7, are typical EHEC [6,7], whereas many are atypical EHEC, including the STEC O104:H4 linked to the 2011 large outbreak of hemorrhagic diarrhea in Europe [8]. These O104:H4 strains lack the locus of enterocyte effacement (LEE) harboring the genes responsible for formation of A/E lesions but carry two plasmids conferring cells' aggregative capability and multi-drug

Molecular Typing and Comparative Genomic Analyses
The bacterial strains, source/isolation year, and their genomic characteristics are listed in Table 1. The phylogroups were determined in silico using the Clermont method [34]. The Multi-Locus Sequence Typing (MLST)-based genotypes were determined using MLST 2.0 service at Center for Genomic Epidemiology via Warwick scheme [35] in April 2021. The pangenomes of STEC were calculated using Roary (V3. 13.0), a pangenome bioinformatic pipeline as described previously [36]. Briefly, the GFF files of each genome were retrieved from GenBank in December 2020 and the Coding DNA Sequences (CDSs) were extracted from each input genome and converted to protein sequences. Homologous proteins were identified by all-against-all comparison with 95% as the minimum BLASTP percentage identity. The relatedness of the STEC strains was first evaluated by a core genome-based tree. All core genes were aligned using MAFFT (V7.471) followed by construction of a maximum-likelihood tree with 1000 bootstrap in IQ-TREE (version 2.1.2) using the best-fit model GTR + F + I, as selected by ModelFinder [37,38]. To identify the strains carrying more common genes, an accessory genome-based tree was constructed using a FASTA file of each genome with the binary of presence and absence of the accessory genes in Roary. An approximately maximum-likelihood tree was constructed in FastTree with the default settings (Jukes-Cantor + CAT model or GTR + CAT model) for generalized time-reversible (GTR) models.  a The source and isolation year for strain MG1655 is based on the information available for its parental strain K-12 as described previously [43]. b The serotype of strain MG1655 is determined by BLAST searches of E. coli O-antigen and H-antigen databases described previously [44,45]. c pEHEC refers to the plasmid carrying genes encoding enterohemolysin.

Identification of Virulence Genes
The putative virulence genes were identified using VFanalyzer at The Virulence Factor DataBase (VFDB) using pre-annotated genomes in GenBank format. Each pre-annotated genome was uploaded to VFanalyzer to scan for all Escherichia Virulence Factors (VFs) deposited in VFDB as of April 2021, which included genes identified in 37 E. coli strains belonging to adherent invasive E. coli (AIEC), avian pathogenic E. coli (APEC), enteroaggregative E. coli (EAEC), Shiga toxin-producing enteroaggregative E. coli (StxEAEC), enterohemorrhagic E. coli (EHEC), enteropathogenic E. coli (EPEC), enterotoxigenic E. coli (ETEC), neonatal meningitis-associated E. coli (NMEC), and uropathogenic E. coli (UPEC). The identified virulence genes were further verified manually by BLASTn search of a database containing all STEC strains in Geneious Prime with a threshold of 80% for gene coverage and 75% for sequence identity.

Detection of GIs and PAIs
GIs were first revealed using IslandViewer 4 [46] with the genome of E. coli K-12 strain MG1655 as a reference. A GI was called when a prediction was made by at least one of the three methods (IslandPath-DIMOB, SIGI-HMM, and IslandPick). The GIs and PAIs known to contribute to stress resistance and/or pathogenicity in enteric pathogens (Table 2) were further examined by performing BLAST searches against a custom database containing all genomes described in this study. When a complete GI or PAI was not detected, all CDSs encoded by the query GI or PAI were used to search the genome of the testing strain by BLASTP to reveal if any homologs were present in the testing strain. Presence of key virulence genes on each PAI were revealed by BLASTn of each virulence gene followed by mapping their chromosomal locations in Geneious Prime. Strain EDL933 harbors two tellurite resistance islands (TRIs) (OI-43 and OI-48), exhibiting over 99% identity. The relatedness of the GIs or PAIs was assessed by multiple sequence alignment using Clustal Omega in Geneious Prime followed by construction of a consensus tree using the Neighbor-Joining method and Jukes-Cantor for the genetic distance model.

Statistical Analyses
The difference in gene content between the clinical and environmental strains was assessed by an unpaired t-test with a 95% confidence interval using Prism 8.0 (GraphPad Software). The association of virulence genes with isolation origin was evaluated with Fisher's exact test with Bonferroni correction for multiple comparisons using R version 4.1.0.

Comparative Genomics of STEC O145:H28
Comparative genomic analyses of the 19 STEC O145:H28 strains revealed a total of 8527 genes, among which, 3702 genes were detected in all genomes (core genes, 43.4%) ( Figure 1A). The accessory genes included 2229 genes present in more than three, but less than 19 genomes (shell genes) and 2596 genes present in less than three genomes (cloud genes), implying that STEC O145:H28 strains carry a large number of dispensable genes. As expected, the pangenome size (total genes) increased greatly when the three non O145:H28 STEC reference genomes were included in the analyses, as well as the number of cloud genes (Supplementary Materials Figure S1). Among the 10,288 genes detected in all STEC genomes, 29.1% (2998) were core genes and 70.9% were accessory genes, supporting that STEC possesses a large accessory genome. Examining the strain-specific genes revealed that the clinical O145:H28 strains carried more strain-specific genes than the environmental strains (unpaired t test, p = 0.0095) ( Figure 1B). The Belgium outbreak strain RM13516 carried the greatest number of strain-specific genes, followed by the U.S. outbreak strain RM13514. Among the environmental strains, the cattle isolates RM9872-C1 and RM9154-C1 carried more strain-specific genes than any other strains. The majority of the strain-specific genes in O145:H28 were MGE-associated and hypothetical genes. Others included genes related to transporters, toxin-antitoxin systems, secretion systems, the type three secretion system (T3SS) effectors, and the type IV pilus.

Statistical Analyses
The difference in gene content between the clinical and environmental strains w assessed by an unpaired t-test with a 95% confidence interval using Prism 8.0 (GraphPa Software). The association of virulence genes with isolation origin was evaluated wi Fisher's exact test with Bonferroni correction for multiple comparisons using R versio 4.1.0.

Comparative Genomics of STEC O145:H28
Comparative genomic analyses of the 19 STEC O145:H28 strains revealed a total 8527 genes, among which, 3702 genes were detected in all genomes (core genes, 43.4% ( Figure 1A). The accessory genes included 2229 genes present in more than three, but le than 19 genomes (shell genes) and 2596 genes present in less than three genomes (clou genes), implying that STEC O145:H28 strains carry a large number of dispensable gene As expected, the pangenome size (total genes) increased greatly when the three no O145:H28 STEC reference genomes were included in the analyses, as well as the numb of cloud genes (Supplementary Materials Figure S1). Among the 10,288 genes detected all STEC genomes, 29.1% (2998) were core genes and 70.9% were accessory gene supporting that STEC possesses a large accessory genome. Examining the strain-specif genes revealed that the clinical O145:H28 strains carried more strain-specific genes tha the environmental strains (unpaired t test, p = 0.0095) ( Figure 1B). The Belgium outbrea strain RM13516 carried the greatest number of strain-specific genes, followed by the U. outbreak strain RM13514. Among the environmental strains, the cattle isolat RM9872-C1 and RM9154-C1 carried more strain-specific genes than any other strain The majority of the strain-specific genes in O145:H28 were MGE-associated an hypothetical genes. Others included genes related to transporters, toxin-antitox systems, secretion systems, the type three secretion system (T3SS) effectors, and the typ IV pilus. the soft-core genes refer to all genes present in any 18 input genomes (n = 18); the shell genes refer to all genes present in at least three but less than 18 input genomes (3 ≤ n < 18); the cloud genes refer to all genes present in less than three input genomes (n < 3). (B): Number of strain specific genes in STEC O145:H28. The number of strain specific genes was calculated in Roary as detailed in the Materials and Methods section. Red bars represent the clinical isolates whereas blue bars represent environmental isolates. * indicates that no plasmids were present in the published genome (GenBank Accession numbers are presented in Table 1).

Strain Relatedness
The relatedness of the STEC strains was examined by core and accessory genes-based trees. When the 2998 core genes were used for a similarity calculation, all STEC O145:H28 strains were placed in a large clade, apart from the cluster containing the O145:H25 strains and the O157:H7 strain EDL933 (Figure 2A). Within the O145:H28 clade, the Belgium outbreak strain RM13516 was placed in the same cluster as the Japanese clinical isolate 112648, the U.S. clinical isolate 95-3192, and all the stx 1a -positive environmental isolates, whereas the U.S. outbreak strain RM13514 was placed in a different cluster with the Japanese clinical isolates 122715 and 10942, the U.S. clinical isolate 2015C-3125, and all the stx 2a -positive environmental isolates ( Figure 2A).
Microorganisms 2022, 10, x FOR PEER REVIEW 9 of 26 the Materials and Methods section. The core genes refer to all genes shared by all input genomes (n = 19); the soft-core genes refer to all genes present in any 18 input genomes (n = 18); the shell genes refer to all genes present in at least three but less than 18 input genomes (3 ≤ n < 18); the cloud genes refer to all genes present in less than three input genomes (n < 3). (B): Number of strain specific genes in STEC O145:H28. The number of strain specific genes was calculated in Roary as detailed in the Materials and Methods section. Red bars represent the clinical isolates whereas blue bars represent environmental isolates. * indicates that no plasmids were present in the published genome (GenBank Accession numbers are presented in Table 1).

Strain Relatedness
The relatedness of the STEC strains was examined by core and accessory genes-based trees. When the 2998 core genes were used for a similarity calculation, all STEC O145:H28 strains were placed in a large clade, apart from the cluster containing the O145:H25 strains and the O157:H7 strain EDL933 (Figure 2A). Within the O145:H28 clade, the Belgium outbreak strain RM13516 was placed in the same cluster as the Japanese clinical isolate 112648, the U.S. clinical isolate 95-3192, and all the stx1a-positive environmental isolates, whereas the U.S. outbreak strain RM13514 was placed in a different cluster with the Japanese clinical isolates 122715 and 10942, the U.S. clinical isolate 2015C-3125, and all the stx2a-positive environmental isolates ( Figure 2A). Core-genes based phylogenetic tree. The maximum likelihood-based phylogenetic tree was constructed based on the alignment of concatenated nucleotide sequences of 2998 homologous CDSs from each of the strains and using iqtree2 with the best fit model GTR + F + I as selected by ModelFinder and was assessed by bootstrapping with 1000 pseudoreplicates. (B): Accessory genes-based relatedness tree. The accessory genes of the STEC genomes were used to construct a FASTA file with the binary of presence and absence in Roary, followed by the construction of an approximately maximum-likelihood tree using FastTree as detailed in the Materials and Methods section. Thus, strains within the same cluster share more common accessory genes than the strains in a different cluster. The accessory genes of the STEC genomes were used to construct a FASTA file with the binary of presence and absence in Roary, followed by the construction of an approximately maximum-likelihood tree using FastTree as detailed in the Materials and Methods section. Thus, strains within the same cluster share more common accessory genes than the strains in a different cluster.
A distinct topology was observed when the accessory genes were used to assess strain relatedness. Regardless of the serotype or genotype, all clinical strains were placed in the same clade, where three Japanese strains formed a cluster, the two U.S. strains 95-3192 and 2015C-3125 also grouped together, and the outbreak strains RM13514 and RM13516 were placed in the same cluster as the two O145:H25 strains ( Figure 2B). All environmental isolates except RM9154-C1 were placed in a separate clade, consisting of two clusters, one containing all the stx 1a -positive environmental isolates and the other one containing all the stx 2a -positive environmental isolates. The environmental isolate RM9154-C1 formed a singleton and appeared to share more accessory genes with the clinical strains than with the other environmental strains ( Figure 2B).

Conservation of E. coli Virulence Genes in Environmental Strains
The 333 virulence genes retrieved from the VFDB encode VFs belonging to 10 major functional categories and contribute to E. coli pathogenicity in diverse pathotypes ( Figure 3A). Of the 333 genes examined, 165 were detected in EDL933, and 140 and 132 were detected in O145:H25 strains CFSAN004176 and CFSAN004177, respectively ( Figure 3B). Among the O145:H28 strains, the number of virulence genes ranged from 126 in the environmental strain RM12275-C1 to 151 in the clinical strain 112648. This variation was mainly due to the difference in number of genes encoding non-LEE encoded T3SS effectors and genes related to adherence and iron uptake (Supplementary Materials Table S1). There was no significant difference in the number of total virulence genes between the clinical and the environmental O145:H28 strains (unpaired t test, p > 0.05), nor in the number of virulence genes in each VF category examined (unpaired t test, p > 0.05). Furthermore, none of the detected virulence genes were specifically associated with the isolation source (Fisher's test, adjust p > 0.05).
Seven out of the 20 autotransporters genes, ag43, cah, ehaA, ehaB, espI, espP, and upaG, were detected in the majority of STEC strains examined. The ag43 encodes a 1039-aa autotransporter adhesin Ag43, exhibiting 69.0% identity with Cah, a 949-aa calcium-binding autotransporter protein. Strain EDL933 carried two copies of cah but no ag43, whereas all O145:H28 strains carried both ag43 and cah except for strains 10942 and RM12275-C1. Neither ag43 nor cah was detected in the O145:H25 strains. Loss-of-function mutations were detected in both ag43 and cah, including point insertions, point deletions, transposon insertions, and large deletions (Supplementary Materials Table S2). The serine protease autotransporter EspI exhibits 48.8% identity with the serine protease autotransporter EspP. The espI was detected only in the U.S. outbreak strain RM13514 and the two STEC O145:H25 strains, whereas the espP was detected in strain EDL933 and the majority of O145:H28 strains including RM13514. A point deletion in espP was detected in RM12275-C1 (Supplementary Materials Table S2). The upaG gene encodes a 1447-aa YadA-like family protein that mediates aggregation, biofilm formation, and adhesion to extracellular matrix proteins. A homolog of upaG was identified in all strains, ranging in size from 4620 bp in RM13514, 4767 bp in EDL933, 4851 bp in the two O145:H25 strains, to 5130 bp in other O145:H28 strains. Two point insertions in upaG were identified in strain 10942 (Supplementary Materials Table S2). The ehaA and ehaB encode a 1327-aa and a 980-aa autotransporter adhesin, respectively, contributing to attachment and biofilm formation [50,51]. Both ehaA and ehaB were present in all strains including the two O145:H25 strains; however, IS insertions in both genes and a point deletion in ehaA were detected in multiple strains (Supplementary  Materials Table S2).
( Figure 3A). Of the 333 genes examined, 165 were detected in EDL933, and 140 and 132 were detected in O145:H25 strains CFSAN004176 and CFSAN004177, respectively ( Figure 3B). Among the O145:H28 strains, the number of virulence genes ranged from 126 in the environmental strain RM12275-C1 to 151 in the clinical strain 112648. This variation was mainly due to the difference in number of genes encoding non-LEE encoded T3SS effectors and genes related to adherence and iron uptake (Supplementary  Materials Table S1). There was no significant difference in the number of total virulence genes between the clinical and the environmental O145:H28 strains (unpaired t test, p > 0.05), nor in the number of virulence genes in each VF category examined (unpaired t test, p > 0.05). Furthermore, none of the detected virulence genes were specifically associated with the isolation source (Fisher's test, adjust p > 0.05). Non-LEE encoded T3SS effectors (56) ABC transporter for dispersin (5) T3SS (34) T6SS (65) Toxin (18) Autotransporter (20) Invasion (4) Iron uptake (33) LEE encoded T3SS Effectors (7) B Presence of each virulence gene was verified by BLASTn search of a database containing all STEC strains in Geneious with a threshold of 80% for coverage and 75% for sequence identity. For genes carrying a loss-of-function mutation, they are marked as absent. * indicates that no plasmids were present in the published genome (GenBank Accession numbers are presented in Table 1).

Number of genes
Four genes (ibeA, ibeB, ibeC, and tia) related to E. coli invasion were examined. While no homologs of ibeA or tia were detected in any of the strains examined, homologs of ibeB and ibeC were detected. For ibeC, in addition to a highly similar homolog (%Identity, 95.3), a distantly related homolog (% Identity, 78.0) was detected on the plasmid pEHEC in all O145:H28 strains and in EDL933, although a point deletion was present in the plasmidborne ibeC in strain 2015C-3125 (Supplementary Materials Table S2).
Among the 33 genes related to iron uptake, homologs of 13 genes were detected. The genes encoding heme utilization system (chuAS and chuTWXYU) were present in all O145:H28 strains. In contrast, none of the heme utilization genes were detected in O145:H25. The genes involved in aerobactin siderophore synthesis (iucABCD) and the ferric siderophore receptor gene iutA were detected in all O145:H28 strains except RM9154-C1. A homolog of iroB, encoding the salmochelin biosynthesis C-glycosyltransferase, was present in all O145:H28 strain. IS insertions and point deletions in iucC, iucD, and iroB were detected (Supplementary Materials Table S2). Strain EDL933 carried two copies of iroB; one was intact while the other contained a point deletion, similar to the iroB in strain 2015C-3125.
There are three type six secretion system (T6SS) gene clusters reported in E. coli [52], among them, the T6SS-1 and T6SS-2 are the most common. Among the 22 genes related to T6SS-2, homologs of 15, eight, and five genes were identified in strain EDL933, both O145:H25 strains, and the majority of O145:H28 strains, respectively (Supplementary  Materials Table S1). No homologs of the genes related to T6SS-1 or T6SS-3 were identified in any of the O145:H28 strains examined.
The toxin genes detected in STEC included plasmid-borne enterohemolysin genes hlyCABD, chromosome-borne hemolysin gene hlyE, and Shiga toxin genes stx. All strains except strain 95-3192 carried hlyCABD, although point deletions in hlyB and hlyC were detected in multiple strains (Supplementary Materials Table S2).

Diversification of Pathogenicity Islands (PAIs) in STEC
Eight PAIs contributing to the pathogenicity of enteric pathogens and two GIs contributing to E. coli stress resistance were further examined ( Table 2). All strains carry a complete LEE; however, variations in PAIs OI-122 and OI-57, TRI, locus of proteolysis activity (LPA), and acid fitness island (AFI) were detected. Additionally, although a complete locus of adhesion and autoaggregation (LAA) island was not detected in any of the strains examined, LAA modules were identified.

LEE
The LEE in strain EDL933 corresponds to the O-island 148 (OI-148) and inserts downstream of tRNA gene selC (Table 3). A complete LEE was identified in all O145:H28 strains, ranging in size (bp) from 46,601 to 47,785. Similar to the EDL933 LEE, the O145:H28 LEEs mainly harbor genes related to biosynthesis of T3SS apparatus and the genes encoding T3SS effectors. Furthermore, all the O145:H28 LEEs insert downstream of selC and encode γ subtype intimins (Table 3). In contrast, the O145:H25 LEEs insert downstream of the tRNA gene pheV and encode β subtype intimins ( Table 3). The O145:H25 LEEs are much larger in size and form a separate cluster compared with the O145:H28 LEEs (Table 3 and Supplementary Materials Figure S2A). Besides the genes related to biosynthesis of T3SS apparatus and T3SS effectors, the O145:H25 LEEs harbor additional genes encoding MGIs, transporters, and transcriptional regulators.

OI-122
The OI-122 in EDL933 is 23,455 bp and inserts downstream of pheV. A complete or partial OI-122 was identified in all O145:H28 strains except RM12275-C1, ranging in size (bp) from 12,900 to 53,262 (Table 3). In RM12275-C1, the integration site for OI-122 is unoccupied. A sequence alignment of OI-122 failed to separate the clinical O145:H28 strains from the environmental isolates (Supplementary Materials Figure S2B).  [17]. The LEEs in strains RM13514 and RM13516 were reported previously [6]. The LEEs in strains 10942, 112648, and 122715 were based on the genome annotations reported previously [41]. The LEE regions in O145:H25 strains were defined by IslandViewer4 in this study. The LEE regions in environmental STEC O145:H28 strains were defined by BLASTn search of a database containing all STEC genomes examined in this study using the RM13514 LEE as a query. b The OI-122 PAIs in strains RM13514 and RM13516 were reported previously [6]. The OI-122 PAIs in strains 10942, 112648, and 122715 were based on the genome annotations reported previously [41]. The OI-122 regions in environmental STEC O145:H28 strains and in O145:H25 strains were defined by BLASTn search of a database containing all STEC genomes examined in this study using the OI-122 of the strain EDL933 and of the RM13514 as queries. The tRNA genes pheV and pheU are identical. In some genomes, this tRNA gene was annotated as pheV whereas in others, it was annotated as pheU. In strain RM12275-C1, this integration site is unoccupied. c The OI-57 regions in O145 strains were defined by BLASTn search of a database containing all STEC genomes examined in this study using the EDL933 OI-57 as a query.
The complete OI-122 contains three modules, each bordered by MGIs. Module I carries virulence gene pagC, encoding a membrane protein involved in serum resistance in Salmonella enterica. A complete module I was identified only in strain RM13516, exhibiting 99.6% identity with the OI-122 module I in EDL933. Module II carries genes encoding T3SS effectors EspL, NleB, and NleE. A complete module II was present on all OI-122s identified. Both O145:H25 strains carried a second copy of module II, located on the LEEs. Module III encodes an Efa1/LifA-like adherence protein. A full-length efa1 gene was detected only in strains RM13516 and CFSAN004177. A truncated efa1 was detected in strain CFSAN004176 and in 10 other O145:H28 strains. Module III was detected in five out of seven O145:H28 clinical strains and in six out of 12 environmental O145:H28 strains. In both O145:H25 strains, a second copy of module III was present on the LEEs, similar to the module II. In fact, the 17 kb DNA segment containing both module II and III on the LEE was nearly identical to the corresponding region on OI-122.

OI-57
The OI-57 in EDL933 belongs to the genome of prophage CP-933O, carrying virulence genes adfO (paa) and ckf, and is associated with the highly pathogenic seropathotypes A and B [53]. A homolog of OI-57, ranging in size (bp) from 14,385 to 137,948 was identified in all O145:H28 strains, but not in either of the O145:H25 strains (Table 3). Similar to EDL933 OI-57, the O145:H28 OI-57s insert in gene ompW. Both adfO and ckf were present on O145:H28 OI-57s but annotated as paa and the type I toxin-antitoxin system Hok family toxin gene, respectively. Sequence alignment of OI-57 failed to separate the clinical O145:H28 strains from the environmental strains (Supplementary Materials Figure S2C).

TRI
Strain EDL933 harbors two TRIs, one located downstream of tRNA gene serW (OI-43) and the other inserted downstream of tRNA gene serX (OI-48) ( Table 4). Both islands are bordered by a 14 bp DR sequence ( Figure 4A). Spontaneous deletions of TRIs in strain EDL933 occur via recombination between the DRs flanking each TRI [54]. In both O145:H25 strains, a TRI was identified, located downstream of tRNA gene ileX (Table 4). Among the O145:H28 strains examined, only RM13514 harbors two TRIs, one inserted downstream of serX (TRI-1), similar to the OI-48 in strain EDL933, and the other inserted downstream of ileX (TRI-2), just as the TRIs in O145:H25 strains ( Figure 4B). All other O145:H28 strains harbor one TRI, such as the OI-48 in strain EDL933, inserted downstream of serX. The two 14 bp DRs bordering the OI-43 or OI-48 in EDL933 were present in the Belgium outbreak strain RM13516 ( Figure 4C). In other O145:H28 strains including RM13514, only one 14 bp DR was detected. Two 12 bp DRs bordering the TRI-2 in strain RM13514 were identified, while in both O145:H25 strains, only one 12 bp DR was detected ( Figure 4D). A sequence alignment of TRIs grouped the O145:H25 TRIs with the RM13514 TRI-2, apart from all other TRIs, consistent with the TRI chromosomal integration sites (Supplementary Materials Figure S2D).

LAA
LAA is a large PAI initially identified in STEC O91:H21 strain B2F1 and present mainly in LEE-negative STEC strains [20]. LAA contains four modules, each flanked by ISs and/or DRs. The main virulence genes on the 11,465 bp module I are sisA and hes, related to attenuation of host immune response and adherence to epithelial cells, respectively. None of the strains examined harbored a complete module I, although sisA was detected in all O145:H28 strains and located on the TRIs (TRI-1 in strain RM13514). Two copies of sisA were identified in EDL933, located on OI-43 and OI-48, respectively. Neither sisA nor hes was detected in the O145:H25 strains.   [17]. The TRI regions in all O145:H28 strains were defined by BLASTn search of a database containing all STEC genomes examined in this study using the EDL933 OI-48 as a query. The TRI regions in O145:H25 strains were defined by BLAST search genome each using the TRI-2 (inserted downstream of the ileX gene in strain RM13514) as a query as no homologs were detected when OI-48 was used as a query. b This 5 kb LAA Module II segment in both O145:H25 strains mainly carries the virulence gene lesP (espC). A homolog of this segment is present on the virulence plasmid pEHEC. c IE06 is based on the annotation reported previously [6].   The main virulence genes on the 21,363 bp module II are iha and lesP, encoding adhesin Iha and a serine protease autotransporter, respectively. A fragment of module II was identified in all O145:H28 and O145:H25 strains. Two copies of the same module II fragments were identified in EDL933. Similar to sisA, this partial module II was located on the TRIs in O145:H28 strains and in EDL933. In fact, sisA inserts in the partial LAA module II. Unlike module II in strain B2F1, the partial module II in EDL933 and in all O145:H28 strains, except RM13514, carried iha but not lesP. RM13514 carried both iha and lesP, located on TRI-1 and TRI-2, respectively. No iha was detected in O145:H25 strains, although two copies of lesP were identified, located on the TRI and the pEHEC plasmid, respectively.
The main virulence genes on the 25,932 bp module III are pagC, related to serum resistance in Salmonella enterica, tpsA and tpsB, encoding a two-partner secretion system. A partial module III was detected in all O145:H28 strains and in strain EDL933, similarly to the partial module II, located on TRIs (TRI-1 in RM13514). No homologs of module III genes were identified in O145:H25 strains. The pagC was only detected in strains RM13514, RM13516, and EDL933, but located on the OI-122. The tpsA gene was not detected in any of the strains, whereas a 539 bp DNA fragment exhibiting 98.7% identity with the tpsB gene was detected in all O145:H28 strains and in strain EDL933, but not in the O145:H25 strains.
The main virulence gene on the 21,088 bp module IV is ag43, encoding an adhesin that promotes autoaggregation. A partial module IV ranging in size from 14,084 bp to 15,593 bp was identified in all O145:H28 strains. This partial module was also located on the TRI (TRI-1 in RM13514), about 13 Kb apart from the partial module III. The gene cah was detected on this partial module IV. A second partial module IV, located on the OI-122, was detected in all O145:H28 strains except 10942, RM9154-C1, and RM12275-C1, ranging in size from 14 kb to 19 kb (Table 4). A third partial module IV was detected in outbreak strain RM13516, located on the integrative element 06 (IE06). The ag43, but not cah, was present on the partial module IV located on OI-122 and IE06, respectively (Table 4). In strain EDL933, a partial module IV was present on OI-43 and OI-48, but not on OI-122. A fragment of module IV was detected in both O145:H25 strains, which, similarly to the LAA module II fragment, was located on the TRI (Table 4).

Others
The AFI in STEC O145:H25 strains is highly similar to the AFI in E. coli K-12 strain MG1655, carrying genes related to acid resistance (yhiF, yhiD, hdeBAD, gadE, gadW, gadX, and gadA) and genes encoding the efflux pump (mdtEF). The AFI in all O145:H28 strains is highly similar to the AFI in strain EDL933, besides the acid resistance genes it also harbors the iron acquisition genes chuSA and chuTWXYU. Therefore, the AFIs in all O145:H28 strains are larger in size (~22 Kb) than the AFIs in both O145:H25 strains and in strain MG1655 (~13 Kb) ( Table 4).
The 37 kb LPA island often inserts downstream of selC; therefore, it is likely present in LEE-negative strains. No homologs of LPA genes were identified in any of the O145:H28 strains examined; however, a 22 kb GI, located immediately downstream of selC, was identified in both O145:H25 strains. This GI exhibited high sequence similarity with the LPA only at the two border regions. Neither espP nor iha was detected on this GI. No homologs of the high-pathogenicity island (HPI), subtilase encoding pathogenicity island (SE-PAI), or locus of heat resistance (LHR) were identified in any of the strains examined (Table 2).

Discussion
The vast genetic variation in STEC poses challenges in detection of hypervirulent strains in diverse environmental and food samples. E. coli holds an open pangenome and evolves more rapidly compared to species with closed pangenomes. Emergence of hypervirulent strains could occur in any niches when the right set of virulence genes become available and assembled successfully. Such events are likely recorded in the accessory genome. The clear separation between clinical and environmental strains revealed in our study supports this speculation and suggests that accessory genes indeed carry information reflecting the recent strain evolution history (short evolution). This speculation is further supported by the clear separation of the stx 1a -positive O145:H28 isolates from the stx 2apositive O145:H28 isolates. Formation of the singleton by the cattle isolate RM9154-C1 is consistent with our recent report that strain RM9154-C1 harbors a stx 2a -prophage differing largely from the stx 2a -prophages in other environmental strains [30]. In fact, RM9154-C1 carries a stx 2a -prophage highly similar to the stx 2a -prophages in several hypervirulent strains including the O145:H28 strain RM13514, O157:H7 strain EDL933, and O104:H4 strain 2011C-3493 [30]. Unlike the core genome, which generally encodes essential functions, the accessory genome often encodes functions to provide additional traits for bacteria to expand their ecological niches [11,55]. The virulence genes repertoire of each STEC strain is likely impacted by the microbial community it resides in, as well as stress factors it encounters, considering that many common stresses promote HGT. A recent study revealed that clinically important STEC strains have emerged in multiple sublineages of bovine-adapted linage, implying that certain ecological niches, such as the bovine intestinal environment, may have provided selection pressures for STEC/EHEC virulence genes [14].
Our study revealed widespread loss-of-function mutations in diverse virulence genes. Loss-of-function mutations are an important mechanism in bacterial niche adaptation, including in the infected hosts [56,57]. Mutation in a regulatory gene could rewire the cell metabolism, providing the bacteria a largely altered physiology to exploit a resource in a new niche, as demonstrated for rpoS and rcsB mutations in STEC [58,59]. A mutation in fimH was linked to increased bacterial uroepithelial adhesion and bladder colonization [60], while loss of curli was suggested to improve pathogens' survival following infections [61][62][63]. Similarly, mutations in cah were suggested to confer STEC a selective trait when autoaggregative properties become detrimental [64], and mutations in autotransporter gene ompF were suggested to provide protection to E. coli cells from toxic compounds, such as tetracycline [65]. Our study revealed mutations in additional loci leading to silencing the production of curli, common pili, laminin-binding fimbriae, type I fimbriae, adhesin proteins, and autotransporters in STEC, although further studies are needed to understand the selection pressures for these mutations as well as the ecological benefits in maintaining these mutations in the STEC population.
Many bacterial pathogens utilize T3SS to inject toxins and/or effectors into host cells. These toxins or effectors have evolved to counteract the host's defenses via modulating host cell signaling pathways [66,67]. Our study revealed widespread mutations in genes related to T3SS apparatus and effectors, including Tir, one of the best characterized T3SS effectors in STEC. Tir functions as a receptor for the outer membrane protein intimin, mediates direct interaction between bacterial pathogen cells and the eukaryotic host cells and recruits α-actinin to the pedestal, a critical step in formation of A/E lesions [68]. Interestingly, loss-of-function mutations were also detected in eae. Therefore, T3SS genes are likely under positive selection in STEC and undergo adaptive changes, similarly to the genes encoding surface proteins and fimbrial/non-fimbrial surface structures that mediate the interactions with other bacteria, phages, and/or host immune systems [69,70].
Our study provides genomic insights into the divergent evolution of virulence in STEC. Consistent with a previous report that O145:H25 differs from O145:H28 phylogenetically and carries more fimbrial genes than O145:H28 [28], the CFA/I fimbriae genes were only detected in O145:H25. CFA/I fimbriae mediate the colonization of ETEC cells to human intestine [71]. This gene cluster is also present in a large number of STEC strains belonging to phylogroup B1, including the STEC O104:H4 outbreak strain 2011C-3493. However, compared to O145:H28, O145:H25 lacks the iron uptake genes on the AFI (heme utilization system) and the TRI (aerobacterin siderophore and ferric siderophore receptor), implying distinct physiological trait and fitness between the two serotypes, especially under ironlimiting conditions and in infected humans. Additionally, O145:H25 lacks other virulence genes located on the TRI, and the entire pathogenicity island OI-57, but carries a much larger LEE compared with O145:H28, which is likely a result of the transposon-mediated insertion of a 17 kb DNA fragment duplicated from OI-122.
Great genetic variations were also detected among the O145:H28 strains. Among the 333 virulence genes examined, the number of detected genes ranges from 128 to 154. This difference is mainly attributed to the differences in the number of genes encoding non-LEE-encoded T3SS effectors, and the genes encoding autotransporters and toxins. Since environmental isolate RM12275-C1 lacks OI-122, it lacks all OI-122-encoded virulence genes. The OI-122 in O145:H28 varies greatly in size and the number of genes encoding MGEs, suggesting that this chromosomal region is a "hot spot" for insertions, deletions, and/or recombination. The gene ompW serves as another "hot spot" in the O145:H28 genome since it contains the integration site for OI-57. In EDL933, OI-57 is part of the prophage CP-933O genome. In O145:H28 strains, the OI-57s are overlapped with the prophages and integrated within the ompW, including the stx 2a -prophage in RM12367-C1 [30]. Therefore, the variation in OI-57 is likely due to the difference in genomes of the prophages inserted in ompW.
Genetic variation in STEC O145:H28 is further contributed by the mobility of GIs. Strain RM13514 is the only one carrying two TRIs, one (TRI-1) highly similar to the EDL933 TRIs, whereas the other (TRI-2) highly similar to O145:H25 TRIs. TRI-1 appears to be fixed as one of the two 14 bp DRs is eliminated, but TRI-2 likely retains the mobility since the two 12 bp DRs flanking the TRI-2 are intact. The TRI in the Belgium outbreak strain RM13516 is the only one likely to retain mobility. Variation in O145:H28 TRIs are mainly attributed to the number of transposase genes and the genes involved in iron acquisition (iucABCD and iutA). The TRI in RM9154-C1 is much smaller than the TRIs in other strains due to lack of genes iucABCD and iutA. Great sequence diversity was also detected for the LAA modules. Consistent with a previous report that the LAA is largely present in LEE-negative strains, none of the strains examined carries a complete LAA. However, several key genes located on the four LAA modules were detected but were incorporated in TRIs. Duplications of the LAA module IV were detected in multiple strains and all are associated with transposons. Because the main virulence factor encoded by the LAA module IV is Ag43, variation in the LAA module IV would likely result in the difference in production of Ag43. Presence of the LAA modules on the TRIs in EDL933 perfectly explains the loss of bacterial adherence to human intestinal epithelial cells when the TRIs were excised from chromosome [54]. However, this impact would be averted if the LAA modules are located elsewhere, such as on the OI-122 as detected in this study.

Conclusions
The plasticity of the STEC genome provides the pathogen great potential for genome expansion and niche adaptation. Our study demonstrated the impacts of three major forces, HGT, loss-of-function mutations, and gene duplications, in driving divergent evolution of the genome and virulence of STEC. The high mutation rate in genes encoding protein adhesins, fimbriae, pili, autotransporters, and T3SS effectors implies the presence of positive selection for these genes in diverse ecological niches. Furthermore, environmental stresses that promote HGT or induce mutations accelerate genome and virulence evolution in STEC, including UV radiation, antimicrobial washes, and host defenses. Understanding the interactions between STEC and their indigenous community members would shed light on factors driving the emergence of STEC variants. Such knowledge is much needed for the development of effective mitigation strategies for detecting and controlling the transmission of STEC.