Validation of Core and Whole-Genome Multi-Locus Sequence Typing Schemes for Shiga-Toxin-Producing E. coli (STEC) Outbreak Detection in a National Surveillance Network, PulseNet 2.0, USA

Molly M. Leeper; Morgan N. Schroeder; Taylor Griswold; Mohit Thakur; Krittika Krishnan; Lee S. Katz; Kelley B. Hise; Grant M. Williams; Steven G. Stroika; Sung B. Im; Rebecca L. Lindsey; Peyton A. Smith; Jasmine Huffman; Alyssa Kelley; Sara Cleland; Alan J. Collins; Shruti Gautam; Eishita Tyagi; Subin Park; João A. Carriço; Miguel P. Machado; Hannes Pouseele; Dolf Michielsen; Heather A. Carleton

doi:10.3390/microorganisms13061310

¹ Division of Foodborne, Waterborne, and Environmental Diseases, Centers for Disease Control and Prevention, Atlanta, GA 30329, USA

² Applied Science Research and Technology, Inc., Smyrna, GA 30080, USA

³ Booz Allen Hamilton, Atlanta, GA 30309, USA

⁴ IHRC Inc., Atlanta, GA 30346, USA

Microorganisms 2025, 13(6), 1310; https://doi.org/10.3390/microorganisms13061310

This article belongs to the Special Issue The Molecular Epidemiology of Infectious Diseases

Abstract

Shiga-toxin-producing E. coli (STEC) is a leading cause of bacterial foodborne and zoonotic illnesses in the USA. Whole-genome sequencing (WGS) is a powerful tool used in public health and microbiology for the detection, surveillance, and outbreak investigation of STEC. In this study, we applied three WGS-based subtyping methods, high-quality single-nucleotide polymorphism (hqSNP) analysis, whole genome multi-locus sequence typing using chromosome-associated loci [wgMLST (chrom)], and core genome multi-locus sequence typing (cgMLST), to isolate sequences from 11 STEC outbreaks. For each outbreak, we evaluated the concordance between subtyping methods using pairwise genomic differences (number of SNPs or alleles), linear regression models, and tanglegrams. Pairwise genomic differences were highly concordant between methods for all but one outbreak, which was associated with international travel. The slopes of the regressions for hqSNP vs. allele differences were 0.432 (cgMLST) and 0.966 [wgMLST (chrom)]; the slope was 1.914 for cgMLST vs. wgMLST (chrom) differences. Tanglegrams composed of outbreak and sporadic sequences showed moderate clustering concordance between methods, where Baker’s Gamma Indices (BGIs) ranged between 0.35 and 0.99 and Cophenetic Correlation Coefficients (CCCs) were ≥0.88 across all outbreaks. The K-means analysis using the Silhouette method showed clear separation of outbreak groups, with average silhouette widths ≥0.87 across all methods. This study validates the use of cgMLST for the national surveillance of STEC illness clusters using the PulseNet 2.0 system and demonstrates that hqSNP or wgMLST can be used for further resolution.

Keywords:

PulseNet 2.0; Escherichia coli; STEC; whole genome sequencing; outbreak; cgMLST; wgMLST; hqSNP

1. Introduction

Escherichia coli (E. coli) is a large and diverse group of Gram-negative bacteria found in the environment and in the intestines of people and animals. While most strains of E. coli are harmless and are part of healthy intestinal tracts, some E. coli are pathogenic, with diarrheagenic strains causing symptoms such as stomach cramps, vomiting, fever, and watery or bloody diarrhea. Diarrheagenic E. coli are categorized into six pathotypes: Shiga-toxin-producing E. coli (STEC), Enterotoxigenic E. coli (ETEC), Enteropathogenic E. coli (EPEC), Enteroaggregative E. coli (EAEC), Diffusely Adherent E. coli (DAEC), and Enteroinvasive E. coli (EIEC), which produces a clinical manifestation similar to shigellosis [1]. Diarrheagenic E. coli is transmitted through the fecal–oral route via contaminated food or water, swimming in untreated water, or through direct contact with animals, people, or the environment [2].

One of these six pathotypes, STEC, causes human illness by producing a toxin known as Shiga toxin. STEC is a leading cause of foodborne and zoonotic illness in the United States, resulting in an estimated 265,000 illnesses, 2600 hospitalizations, and 30 deaths annually [3,4]. In some STEC infections, a condition known as hemolytic uremic syndrome (HUS) develops, which can cause anemia, acute renal failure, and death [5]. STEC bacteria are broadly categorized by serotype as STEC O157 and non-O157 STEC, and persons infected with STEC O157 strains are more likely to be hospitalized and to develop HUS than those infected with non-O157 STEC strains [4,6]. While humans constitute the primary reservoir for non-STEC pathotypes, the intestinal tracts of animals, especially cattle and other ruminants, are the primary reservoirs of STEC [2].

Since 1996, PulseNet USA has served as the national molecular subtyping network for foodborne, waterborne, and one-health-related disease surveillance in the United States [7,8,9,10]. PulseNet USA is coordinated by the U.S. Centers for Disease Control and Prevention (CDC) and the Association of Public Health Laboratories (APHL) and comprises over 80 state and local public health laboratories and food regulatory federal agencies [10,11]. PulseNet-participating laboratories use standardized laboratory workflows and data-analysis tools to detect local and multistate foodborne and zoonotic illness clusters, including those caused by E. coli, primarily by STEC (both O157 and non-O157 serotypes) and EIEC/Shigella pathotypes. PulseNet surveillance also encompasses other E. coli pathotypes, such as EAEC, EPEC, and ETEC; however, these pathotypes are less frequently associated with human illness clusters in the United States. Additionally, with the exception of EIEC/Shigella, non-STEC pathotypes are not nationally notifiable [12]; thus, many states do not collect or report information on these pathotypes. Moreover, most clinical and public health laboratories do not use methods that can detect diarrheagenic E. coli from pathotypes other than STEC in stool samples [2].

Between 1996 and 2019, pulsed-field gel electrophoresis (PFGE) was the subtyping method used by PulseNet to detect clusters of foodborne illness. While PFGE has a long history of utility in enteric disease surveillance, whole-genome sequencing (WGS) offers improved discriminatory power and concordance with epidemiologic data when compared with PFGE [10,11,12,13]. In 2012, PulseNet began exploring opportunities for replacing PFGE with WGS, and in July 2019, PulseNet fully transitioned all enteric bacterial surveillance from PFGE to WGS. This included incorporating WGS data into PulseNet’s previously existing bioinformatics and information technology infrastructure, formerly a customized version of BioNumerics v7.6 software [14].

The incorporation of WGS data within the PulseNet BioNumerics v7.6 databases presented challenges that necessitated frequent updates and customizations within the software. These challenges and the need to share all bioinformatics tools as open source brought about the development of a new bioinformatics and data management system for the PulseNet USA network. In September 2024, PulseNet transitioned to a new cloud-based, modular, open-source platform, referred to as PulseNet 2.0. Data analysis within PulseNet 2.0 follows a standardized workflow that performs the de novo assembly of sequenced genomes, speciation, sequence quality assessments, allele calling, and various genotyping tasks, such as serotyping, resistance profiling, and the identification of virulence markers and plasmids (Figure 1), using Nextflow v25.04.2 (https://github.com/nextflow-io/nextflow, last accessed for this study on 1 April 2025) as a workflow manager. The PulseNet 2.0 system was designed for end-to-end data analysis, making it suitable for use with WGS, a method that permits multiple characterizations of isolate genomes using one workflow [10,15,16,17].

Figure 1. PulseNet 2.0 data analysis workflow.

Specifically, for E. coli, some of these characterizations include species and serotype identification, pathotype determination, toxin profiling, and virulence marker detection [18,19,20,21,22,23,24]. Once the identification and genotyping workflow is complete, genomic and demographic data associated with each isolate are published in real time to the Escherichia national database within the PulseNet 2.0 system (Figure 2). The national database provides a centralized view of genomic and demographic data, allowing PulseNet data analysts at the CDC to rapidly detect multistate Escherichia illness clusters that have the potential to evolve into widespread outbreaks. For STEC (both O157 and non-O157), PulseNet’s primary national cluster definition is five or more clinical cases that are published to the PulseNet 2.0 national database within 60 days of each other and that are related to one another within 0 to 10 allelic differences based on cgMLST [18].

Figure 2. Overview of Escherichia identification and genotyping workflow in PulseNet 2.0. ¹ https://cge.food.dtu.dk/services/SerotypeFinder/; ² https://github.com/ncbi/stxtyper; ³ https://cge.food.dtu.dk/services/VirulenceFinder/. All websites were last accessed on 1 April 2025 for this study.

As next-generation sequence technology has advanced, public health surveillance networks, such as PulseNet, have used various approaches for analyzing WGS data for foodborne disease surveillance. These include high-quality single-nucleotide polymorphism (hqSNP) analysis, core genome multi-locus sequence typing (cgMLST), and whole genome multi-locus sequence typing (wgMLST). hqSNP analyses compare isolate genomes to a closely related reference sequence to examine single-nucleotide changes in the DNA sequence. cgMLST analyses examine differences in the loci found in 95–98% of the reference strains used to build the allele scheme, and wgMLST analyses examine differences in either all loci or chromosomal loci found in the reference strains used to develop the allele schemes. These methods enable phylogenetic comparisons between sequenced isolates and are used to identify isolates that may have a common source within a foodborne or zoonotic outbreak [11,17,25,26,27]. While MLST-based methods require the up-front establishment of allele schema and methods, they are particularly suited for large-scale surveillance for their standardization and comparability, especially for epidemiological studies. On the other hand, hqSNP-based methods are often more powerful for the fine-scale resolution of genetic differences but can be more resource intensive and require the selection of an appropriate reference genome.

Previous studies have shown that unsupervised machine learning techniques can be used to cluster genomic data [28,29,30]. For example, K-means analysis, which divides a dataset into a predefined number of clusters (“K”), where data points within a cluster are more similar to each other than to points in other clusters [31], can be used to examine the phylogenetic clustering of isolate genomes. This clustering of data can be performed independently of an established, predefined cluster-detection threshold, making it an objective and external approach to enhance the validation of genomic-cluster-detection methods [28,30]. K-means clustering analysis has been successfully applied to research in various fields of biological science, such as clustering gene expression data or protein sequence data [32].

This study has two primary objectives: (1) to evaluate the overall concordance of three WGS-based subtyping methods, cgMLST, wgMLST using chromosome-associated loci [wgMLST (chrom)], and hqSNP, for STEC human illness cluster detection in the United States, and (2) to evaluate the allele schemes built into the PulseNet 2.0 Escherichia national database to assess their reliability in detecting STEC outbreak clusters relative to hqSNP, a gold standard genomic comparison method. Multiple methods were used to meet these objectives, including an assessment of pairwise genomic differences across subtyping methods, linear regression models, phylogenetic clustering comparisons, and K-means clustering analysis. The findings of this study can be used to validate the use of allele-based methods for STEC illness cluster detection within the PulseNet 2.0 national USA surveillance network.

2. Materials and Methods

Selection of Isolate Datasets. A total of 251 STEC O157 and STEC non-O157 isolates from 10 foodborne outbreaks and 1 travel-associated outbreak were selected from the PulseNet national database. Outbreaks occurred between 2016 and 2022 and had well-characterized sources based on epidemiologic investigations. Each outbreak was assigned a number between 01 and 11 for the study (Table 1). A total of 46 sporadic/non-outbreak STEC isolates were also selected from the PulseNet national database to evaluate the ability of the allele schemes to differentiate outbreak and sporadic/non-outbreak isolate sequences. Sporadic isolates were matched to individual outbreaks by serotype, as determined via traditional serotyping or WGS, and the selection of sporadic isolates was limited to those having collection dates within 6 months of the outbreak’s median collection date. Isolates were considered sporadic if they were not associated with any previously identified or investigated illness clusters. The number of sporadic isolates compared per outbreak ranged between 2 and 8 (Table 1). Raw sequence data files for the 297 isolates included in this study have been deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under Bioproject PRJNA218110 (PulseNet Escherichia coli and Shigella genome sequencing), and accession numbers are listed in Supplementary Table S1.

Table 1. Summary of outbreaks included in the study.

Whole Genome Sequencing. WGS data were available for all 297 isolates in the study. Sequencing was performed on Illumina instruments using the Illumina Nextera XT or DNA Prep library preparation kits (San Diego, CA, USA) by PulseNet-participating public health laboratories or CDC according to the protocols available at https://www.aphl.org/programs/global_health/Pages/PulseNet-International-SOPs.aspx, last accessed for this study on 1 April 2025 [33].

WGS Analysis and Allele Calling. Illumina sequence read files for all isolates were linked and analyzed in the PulseNet 2.0 system. Genus identification was performed using MIDAS (v1.3.2) [19]. The raw reads were trimmed with fastp (v0.32.2; https://github.com/OpenGene/fastp, last accessed for this study on 1 April 2025), using an average quality threshold of 30, and downsampled to 100× coverage with seqtk (v1.3; https://github.com/lh3/seqtk/releases, last accessed for this study on 1 April 2025), using an expected genome size of 4.2 Mb. MIDAS (v1.3.2) was run a second time using the cleaned reads to detect any contamination. De novo assembly of the cleaned reads was performed with SPAdes (v3.15.5) [34] using the --isolate option. The cleaned reads were aligned back to the assembled genomes using BWA (Burrows–Wheeler Aligner; v0.7.17; https://github.com/lh3/bwa, last accessed for this study on 1 April 2025). The assembly was corrected by removing any contigs shorter than 500 bp, those with an average read depth below either a threshold of 15× or 25% of the assembly average depth, whichever value was greater, and those with a GC content less than 5%, as measured using Samtools (v1.16.1; https://github.com/samtools, last accessed for this study on 1 April 2025). The cleaned reads were mapped back to the corrected assembly to create cleaned Binary Alignment Map (BAM) files. The genus and species were identified using average nucleotide identity (ANI) with MUMmer (v3.23; https://github.com/chienchi/MUMmer, last accessed for this study on 1 April 2025), and the genus result was verified with the MIDAS result. Sequences determined by ANI to be E. coli (including Shigella spp.) were retained for further analysis.
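
The contig-correction rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the PulseNet 2.0 implementation: the `Contig` record, the `filter_contigs` function, and the choice of a length-weighted mean for the "assembly average depth" are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    """Hypothetical per-contig summary (names are illustrative)."""
    name: str
    length_bp: int
    mean_depth: float    # average read depth over the contig
    gc_fraction: float   # GC content as a fraction between 0 and 1


def filter_contigs(contigs, min_length_bp=500, min_depth=15.0,
                   min_depth_fraction=0.25, min_gc=0.05):
    """Keep contigs that satisfy the length, depth, and GC rules in the text."""
    total_bases = sum(c.length_bp for c in contigs)
    if total_bases == 0:
        return []
    # Assumption: "assembly average depth" is the length-weighted mean depth.
    assembly_depth = sum(c.mean_depth * c.length_bp for c in contigs) / total_bases
    # The depth cut-off is whichever is greater: 15x or 25% of the assembly depth.
    depth_cutoff = max(min_depth, min_depth_fraction * assembly_depth)
    return [
        c for c in contigs
        if c.length_bp >= min_length_bp
        and c.mean_depth >= depth_cutoff
        and c.gc_fraction >= min_gc
    ]
```

With an assembly whose average depth is close to 100×, the depth cut-off becomes roughly 25× rather than 15×, so a 20× contig would be dropped even though it clears the fixed threshold.
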
Allele calls were generated using the PulseNet 2.0 MLST caller with the following steps: the corrected assemblies were compared against reference allele sequences in the PulseNet 2.0 MLST database repository https://github.com/ncezid-biome/pn2.0-mlst-databases (schema further described below under the heading “PulseNet 2.0 Escherichia Allele Schema”) using a BLASTn (v2.12; https://ncbiinsights.ncbi.nlm.nih.gov/2021/07/09/blast-2-12-0/, last accessed for this study on 1 April 2025) approach to find the presence of each locus. The query allele sequence was defined by the presence of start and stop codons (without nonsense mutations) and a minimum similarity of 85% against a reference allele. Loci that were likely repeated (fully or partially) elsewhere in the genome were ignored. The query sequences were hashed using the 64-bit MD5 algorithm and then transformed into a 56-bit integer (https://github.com/ncezid-biome/pn2.0-mlst-databases?tab=readme-ov-file#hashing-function, last accessed for this study on 1 April 2025). A further filtration was performed using the aligned reads, and alleles that did not meet the following minimum quality standards for each nucleotide call were removed: a 65% homozygosity rate, a depth of coverage greater than 5×, and at least 1% of reads supporting each of the forward and reverse strands (Figure 3).
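
The read-based filtration step can be illustrated with a minimal Python sketch. The interpretation below is an assumption rather than the PulseNet 2.0 code: "homozygosity rate" is taken to mean the fraction of reads supporting the called base, and the strand requirement is taken as a fraction of the total depth at that site. `SiteSupport`, `site_passes`, and `allele_passes` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class SiteSupport:
    """Hypothetical read support for one nucleotide call within an allele."""
    depth: int             # total reads covering the site
    major_base_reads: int  # reads supporting the called base
    forward_reads: int     # reads on the forward strand
    reverse_reads: int     # reads on the reverse strand


def site_passes(site, min_major_fraction=0.65, min_depth=5,
                min_strand_fraction=0.01):
    """Apply the three per-nucleotide thresholds quoted in the text."""
    # Depth of coverage must be strictly greater than 5x.
    if site.depth <= min_depth:
        return False
    # Assumption: homozygosity rate = share of reads agreeing with the call.
    if site.major_base_reads / site.depth < min_major_fraction:
        return False
    # At least 1% of reads must come from each strand.
    return (site.forward_reads / site.depth >= min_strand_fraction
            and site.reverse_reads / site.depth >= min_strand_fraction)


def allele_passes(sites):
    """An allele call is retained only if every nucleotide call passes."""
    return all(site_passes(s) for s in sites)
```
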

Figure 3. PulseNet 2.0 allele calling workflow.

Sequence Quality Assessment. Once allele calling was performed, sequence quality was assessed for each genome. Genomes with <85% of alleles called within the cgMLST scheme were considered to fail quality. Genomes with ≥85% core alleles present were considered to pass quality if they met the following additional quality metric cut-offs: average coverage ≥40×; average base quality score ≥30; and assembly length = 4.2–5.9 Mbp. PulseNet uses an 85% core-allele-call threshold primarily to ensure data quality, consistency, and reliability in WGS-based subtyping across the network. Furthermore, when the additional quality metric cut-offs for the coverage, q-score, and length are met, sequences are more likely to meet the 85% cgMLST allele calling threshold, filtering out low-quality genomes from surveillance [11]. For sequences that passed quality, genotyping was performed (Figure 2) to determine the serotype, pathotype, and Shiga toxin profile (if applicable) of each isolate. Once genotyping was complete, sequences were published to the PulseNet 2.0 national database and submitted to NCBI’s sequence read archive (SRA) under Bioproject PRJNA218110.
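
The pass/fail logic above reduces to a short check, sketched here in Python. The function and parameter names are hypothetical, and the 2513-locus denominator is taken from the cgMLST scheme size reported in this article; the actual PulseNet 2.0 implementation may differ.

```python
CORE_LOCI = 2513  # size of the cgMLST scheme reported in this article


def passes_quality(core_alleles_called, mean_coverage, mean_base_quality,
                   assembly_length_bp, core_loci=CORE_LOCI):
    """Return True if a genome meets the quality cut-offs quoted in the text."""
    # Genomes with <85% of core alleles called fail regardless of other metrics.
    if core_alleles_called / core_loci < 0.85:
        return False
    return (mean_coverage >= 40
            and mean_base_quality >= 30
            and 4_200_000 <= assembly_length_bp <= 5_900_000)
```

Under these assumptions, 85% of 2513 loci is 2136.05, so a genome needs at least 2137 called core alleles before the remaining metrics are considered.
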

High-Quality SNP Analysis. For hqSNP comparisons, CDC uses an hqSNP pipeline called Lyve-SET [25] to assess the phylogeny of isolates within an outbreak. The design of Lyve-SET was optimized for epidemiologic investigations, and analyses using it have shown that as the phylogenetic relatedness between isolate sequences increases, the likelihood of epidemiological relatedness increases [11,13,17,25,35,36]. High-quality SNP (hqSNP) data were generated for all outbreak and sporadic isolates included in the study. The hqSNP analyses were generated through Lyve-SET v1.1.4f with the default modules selected for mapping and SNP calling. Reads were cleaned during the preparation phase before running Lyve-SET. Prior to SNP calling, options were set according to the Escherichia-specific thresholds specified under the “escherichia_coli” configuration of the Lyve-SET workflow option “--presets” [25]. An internal draft reference, belonging to the specified outbreak, or an external closed reference, associated with neither the outbreak nor the sporadic isolate set, was selected (Supplementary Table S2). Reference sequences were assembled using SPAdes v3.14.0 [34], and plasmids were masked on the generated SPAdes assemblies through identification and exclusion using PlasFlow v1.1 [37]. Phages were masked using the Lyve-SET workflow for all outbreaks by default. For each outbreak, two hqSNP analyses were performed: one contained solely outbreak-associated genomes, and the second also included the sporadic set for the outbreak. A phylogenetic tree (RAxML) [38] and pairwise SNP difference matrix were generated for each hqSNP analysis.

PulseNet 2.0 Escherichia Allele Schema. Currently, the PulseNet USA network uses allele-based methods for detecting Escherichia illness clusters, including those caused by STEC. Three MLST schemes have been incorporated into the PulseNet 2.0 national database for Escherichia. The core (cgMLST) scheme contains 2513 loci and represents the genes found in 95% or more of the reference strains used to develop the database [39]. The core and chromosomal accessory genes make up the whole-genome MLST (chromosomal) scheme, wgMLST (chrom), which contains 30,717 chromosomal loci, inclusive of the 2513 core genome loci. The wgMLST (all loci) scheme contains 34,483 loci and is inclusive of the core and accessory genome, as well as 3737 plasmid loci and loci from 7-gene, 8-gene, and 15-gene MLST schemes that are not already part of the core scheme (Figure 4). Locus names for all schemes incorporated into the PulseNet 2.0 national Escherichia database are included in Supplementary Table S3.

Figure 4. PulseNet 2.0 Escherichia schema. Number of loci included within schemes are shown for the overall scheme (all loci), core genome, whole genome (excluding core), plasmid, and 7-gene, 8-gene, and 15-gene MLST schemes. * indicates that the scheme is hosted on Enterobase: https://enterobase.warwick.ac.uk/ (last accessed for this study on 1 April 2025).

Comparison of WGS-Based Subtyping Methods. The concordance between SNP- and allele-based methods was determined using multiple approaches, including pairwise genomic differences, linear regression models, phylogenetic tanglegrams, and K-means analysis. To evaluate pairwise genomic differences for the cgMLST and wgMLST (chrom) allele-based methods, allele differences between isolate genomes were converted into pairwise matrices for each outbreak within the PulseNet 2.0 system. Similarly, using hqSNP data generated for each outbreak, SNP differences between isolate genomes were determined and converted into pairwise matrices. Pairwise cg/wgMLST and hqSNP differences were combined into one overall profile per subtyping method, and a Pearson correlation coefficient was calculated in RStudio v1.4.1717 (PerformanceAnalytics package), last accessed for this study on 01 January 2025, to show the overall correlation between each pair of methods, supported by a 95% confidence interval (CI). Appendix A describes a further comparison of pairwise differences obtained from PulseNet USA’s previous data management system, BioNumerics v.7.6.3, to those obtained from PulseNet 2.0.
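
As an illustration of the pairwise comparison, the following Python sketch counts allele differences between isolate profiles and computes a Pearson correlation with an approximate 95% CI. It is not the PulseNet 2.0 or R code used in the study. The profile layout (locus name mapped to an allele hash, with `None` for an uncalled locus), the decision to ignore loci missing in either isolate, and the Fisher z-transform interval are all assumptions made for the example.

```python
import math
from itertools import combinations


def allele_difference(profile_a, profile_b):
    """Count loci called in both profiles whose allele identifiers differ."""
    shared = profile_a.keys() & profile_b.keys()
    return sum(
        1 for locus in shared
        if profile_a[locus] is not None
        and profile_b[locus] is not None
        and profile_a[locus] != profile_b[locus]
    )


def pairwise_differences(profiles):
    """Map each unordered isolate pair to its allele difference."""
    names = sorted(profiles)
    return {
        (x, y): allele_difference(profiles[x], profiles[y])
        for x, y in combinations(names, 2)
    }


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def pearson_ci95(r, n):
    """Approximate 95% CI for r via the Fisher z-transform (needs n > 3)."""
    z = math.atanh(r)
    half_width = 1.959964 / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)
```

Because every isolate contributes to many pairs, pairwise differences are not independent observations, so an interval computed this way should be read as a rough guide.
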

For the linear regression models, three scatterplots were generated to compare the genomic differences produced by one subtyping method to that of the other two. Pairwise cgMLST and wgMLST (chrom) allele differences were plotted against their corresponding pairwise hqSNP differences, as well as to each other. For each scatterplot, a linear regression line was added to model the relationship between methods. Slopes of regression formulas, supported by 95% confidence intervals, were indicative of the genomic differences between subtyping methods among pairwise isolates. Y-intercepts were indicative of the pairwise cg/wgMLST (chrom) allele differences when hqSNP differences were zero or of the pairwise wgMLST (chrom) allele differences when cgMLST allele differences were zero. R² values were used to determine how well the data fit each regression model (goodness of fit).
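
For readers who want to reproduce this kind of model on their own pairwise data, a minimal ordinary least-squares fit is sketched below in Python. It is a generic illustration rather than the code used in the study, and `fit_line` is a hypothetical helper; confidence intervals for the slope are omitted because they require a t-distribution quantile.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot
```

Here `xs` would hold the pairwise hqSNP (or cgMLST) differences and `ys` the corresponding allele differences, so that the intercept is the expected allele difference when the x-axis difference is zero.
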

Tanglegrams (side-by-side facing dendrograms) were constructed to compare the phylogenies generated using each subtyping method [cgMLST, wgMLST (chrom), and hqSNP] when outbreak isolate sequences were combined with their corresponding sporadic/non-outbreak isolate sequences. Allele-based dendrograms were constructed in the PulseNet 2.0 system using absolute allelic differences. SNP-based dendrograms were constructed using the maximum likelihood method [40]. All allele and SNP-based dendrograms were converted to Newick format and assembled into tanglegrams in Base R v4.1.2 (dendextend package) [41], last accessed for this study on 01 November 2024. Because intricately tangled trees can become difficult to analyze, the layout was optimized to minimize entanglement using the step2side method [42]. The statistical association of the branches in the two facing dendrograms was assessed using two measures: the Baker’s Gamma Index (BGI) and Cophenetic Correlation Coefficient (CCC). The Baker’s Gamma Index, also known as the Goodman–Kruskal gamma index, is a statistical measure of the similarity between two hierarchical clustering trees and ranges from −1 to 1, with values closer to 1 indicating greater statistical similarity [43]. A value near 0 indicates that the trees are not statistically similar, and a negative value suggests strong disagreement between the dendrograms. The Cophenetic Correlation Coefficient is a statistical measurement that evaluates how well a dendrogram preserves the original distances between data points and is particularly useful for understanding clustering quality. A coefficient close to 1 indicates that the clustering algorithm preserves the original data structure well, while a lower coefficient suggests that the clustering represents the distances less accurately [44]. Both measures were acquired using the dendextend package in Base R v4.1.2 [41].
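
To make the Cophenetic Correlation Coefficient concrete, the sketch below computes it in Python for a single-linkage dendrogram built from a pairwise difference matrix. It is not the dendextend code used in the study, and the function names are hypothetical. It assumes single linkage, for which the cophenetic distance between two isolates equals the smallest possible value of the largest step on any path connecting them, so no explicit tree needs to be built.

```python
import math
from itertools import combinations


def single_linkage_cophenetic(dist):
    """Cophenetic distances of the single-linkage dendrogram.

    dist is a symmetric square matrix (list of lists) with a zero diagonal.
    For each pair, the result is the minimum over all paths of the largest
    edge on the path, computed with a Floyd-Warshall style update.
    """
    n = len(dist)
    coph = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(coph[i][k], coph[k][j])
                if via_k < coph[i][j]:
                    coph[i][j] = via_k
    return coph


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def cophenetic_correlation(dist):
    """Correlate original pairwise distances with dendrogram distances."""
    coph = single_linkage_cophenetic(dist)
    pairs = list(combinations(range(len(dist)), 2))
    original = [dist[i][j] for i, j in pairs]
    in_tree = [coph[i][j] for i, j in pairs]
    return pearson_r(original, in_tree)
```

For three isolates with differences of 1, 5, and 3, single linkage joins the first pair at 1 and the third isolate at 3, so the tree distances are 1, 3, and 3 and the coefficient is about 0.87. The coefficient is undefined when all original or all tree distances are identical.
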

For further validation, an unsupervised machine learning method, K-means analysis, was performed to compare the clustering results of the three WGS-based subtyping workflows. The Silhouette method [45] was applied in R/RStudio v1.4.1717 (NbClust package) [46], last accessed for this study on 01 November 2024, to each dataset of combined outbreak and sporadic isolate sequences. This method identifies the optimal or most favorable number of clusters, or “K”, within each dataset. The optimal K value was based on the maximum Silhouette score, and a Pearson gamma coefficient (Km_stats function) expressed the statistical significance of the chosen K. In addition, for every outbreak, average silhouette widths were obtained for the outbreak isolate group (K1) and the sporadic isolate group (K2) (Km_stats function). For visualization, the dendrogram function (scikit-learn package v1.6.1) [47], last accessed for this study on 01 November 2024, in Python v3.9.21 Jupyter v1.1.1 notebooks [47] was used to perform a hierarchical divisive cluster analysis using single linkage, in which the distance between clusters is taken as the minimum distance between their members, depicting the partitioning of K1 and K2 groups for each outbreak. This exercise was performed for each subtyping workflow using pairwise differences generated for each combined set of outbreak and sporadic isolate sequences.
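
The average silhouette width reported per group can be illustrated with a short Python sketch that works directly on a pairwise difference matrix. It covers only the silhouette calculation for a given grouping, not the K-means step or the NbClust workflow used in the study. The function names are hypothetical, and assigning a width of 0 to a single-member group is a common convention adopted here as an assumption.

```python
def silhouette_widths(dist, labels):
    """Silhouette width of each item, given distances and group labels.

    dist is a symmetric square matrix; labels[i] is the group of item i.
    At least two distinct groups are required.
    """
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    widths = []
    for i, label in enumerate(labels):
        own = [j for j in groups[label] if j != i]
        if not own:
            widths.append(0.0)  # convention for single-member groups
            continue
        # a: mean distance to the other members of the item's own group
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other group
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in groups.items() if other != label
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths


def average_width_by_group(dist, labels):
    """Average silhouette width for each group label."""
    widths = silhouette_widths(dist, labels)
    result = {}
    for label in set(labels):
        values = [w for w, lab in zip(widths, labels) if lab == label]
        result[label] = sum(values) / len(values)
    return result
```

A width near 1 means an isolate is far closer to its own group than to the other group. For example, two outbreak isolates 1 difference apart that are each 10 differences from the sporadic isolates have widths of (10 − 1)/10 = 0.9.
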

3. Results

3.1. Summary of Outbreak Information

Table 1 (shown above under the heading “Selection of Isolate Datasets”) provides a summary of the 11 outbreaks included in this study. Six unique serotypes were represented among the outbreaks, including E. coli O157:H7 (4 outbreaks), E. coli O121:H19 (2 outbreaks), E. coli O26:H11 (2 outbreaks), E. coli O103:H2 (1 outbreak), E. coli O111:H8 (1 outbreak), and E. coli O5:H9 (1 outbreak). Collection dates ranged from 1 February 2016 to 7 July 2022, and all outbreaks were foodborne with the exception of outbreak 07, an E. coli O111:H8 outbreak associated with international travel.

3.2. Pairwise Genomic Differences

The range of SNP- and allele-based pairwise genomic differences between isolates is shown in Table 2 for each outbreak. For 10/11 outbreaks, SNP differences were mostly concordant with cgMLST and wgMLST (chrom) allele differences, differing by no more than 5 SNPs from the allele-based results. However, for outbreak 07, a travel-associated outbreak, SNP differences (0–19) were more closely aligned with wgMLST (chrom) allele differences (0–16) than with cgMLST allele differences (0–8) (Table 2).

Table 2. Range of hqSNP- and allele-based pairwise genomic differences between outbreak isolates using PulseNet 2.0.

3.3. Linear Regression Models

For all outbreak isolate sequences, allele-based cgMLST and wgMLST (chrom) pairwise genetic differences were plotted against their respective SNP differences and are shown in Figure 5A (cgMLST) and Figure 5B [wgMLST (chrom)]. The slope of the linear regression for cgMLST vs. SNP pairwise differences was 0.432 [95% CI: 0.426, 0.437], indicating that there were fewer cgMLST allele differences than SNP differences between pairwise isolates. The y-intercept comparing cgMLST allele differences to SNP differences was 0.08, indicating that sequences that were zero SNPs different were also close to zero cgMLST alleles different on average. The slope of the linear regression for wgMLST (chrom) vs. SNP pairwise differences was 0.966 [95% CI: 0.956, 0.975], indicating that there were slightly fewer wgMLST (chrom) allele differences than SNP differences between pairwise isolates. The y-intercept comparing wgMLST (chrom) allele differences to SNP differences was 0.29, illustrating that sequences that were zero SNPs different were also <1 allele different on average. The goodness of fit for these models, as measured based on an R² value, was 0.75 for cgMLST vs. SNP and 0.82 for wgMLST (chrom) vs. SNP, reflecting moderate amounts of variation within the models. Outbreak 07 was removed from linear regression models due to outlying allele and SNP differences compared to other outbreaks.

Figure 5. (A) Scatterplot of hqSNP differences vs. cgMLST differences. (B) Scatterplot of hqSNP differences vs. wgMLST (chrom) differences. (C) Scatterplot of cgMLST vs. wgMLST (chrom) differences. Regression equations and R² values are displayed on the plots. Pearson correlation coefficients for each combination of pairwise matrices are shown below plots.

For cgMLST vs. wgMLST (chrom), the slope of the linear regression was 1.914 [95% CI: 1.895, 1.933] (R² = 0.81), indicating that there were more wgMLST (chrom) allele differences per cgMLST allele difference, as expected, since wgMLST incorporates more loci than cgMLST. The y-intercept was 0.35, illustrating that sequences that were zero cgMLST alleles different were less than 1 wgMLST (chrom) allele different on average (Figure 5C).
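
As a worked example of how to read these fitted lines, the Python sketch below plugs values into the three reported regression equations. The slopes and intercepts are those reported in this section; the dictionary keys and the `predict` helper are hypothetical names. The outputs are average expectations under models with R² values of 0.75 to 0.82 that excluded outbreak 07, so they should not be treated as a conversion rule for individual isolate pairs.

```python
# (slope, intercept) pairs as reported in Section 3.3; the response variable
# is the allele difference named first in each key.
FITS = {
    "cgMLST_from_hqSNP": (0.432, 0.08),
    "wgMLST_chrom_from_hqSNP": (0.966, 0.29),
    "wgMLST_chrom_from_cgMLST": (1.914, 0.35),
}


def predict(fit_name, x):
    """Expected allele difference for a given x-axis difference."""
    slope, intercept = FITS[fit_name]
    return slope * x + intercept


# Two isolates 10 hqSNPs apart are expected, on average, to differ by about
# 4.4 cgMLST alleles and about 9.95 wgMLST (chrom) alleles.
expected_cg = predict("cgMLST_from_hqSNP", 10)
expected_wg = predict("wgMLST_chrom_from_hqSNP", 10)
# Two isolates 5 cgMLST alleles apart are expected to differ by about
# 9.92 wgMLST (chrom) alleles.
expected_wg_from_cg = predict("wgMLST_chrom_from_cgMLST", 5)
```
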

There was high correlation between methods when overall pairwise differences were compared, as indicated by Pearson correlation coefficients supported by 95% confidence intervals. For cgMLST vs. SNP, the correlation coefficient was 0.86 [CI: 0.858, 0.870]. For wgMLST (chrom) vs. SNP, the correlation coefficient was 0.91 [CI: 0.904, 0.911], and for cgMLST vs. wgMLST (chrom), the correlation coefficient was 0.90 [CI: 0.895, 0.903] (Figure 5).

3.4. Tanglegrams

For all 11 outbreaks, tanglegrams showed moderate-to-high concordance between subtyping methods in terms of each method’s ability to separate outbreak and sporadic isolate sequences. For allele vs. hqSNP tanglegrams, BGI values ranged from 0.413 to 0.987 (cgMLST) and 0.354 to 0.936 [wgMLST (chrom)]. BGI values ranged from 0.686 to 0.964 when cgMLST was compared to wgMLST (chrom), representing statistically similar clustering between trees (Figure 6A, Supplementary Table S4). Across all three subtyping methods and for all 11 outbreaks, the Cophenetic Correlation Coefficient was ≥0.865, indicating the high fidelity of original pairwise distances in the dendrograms (Figure 6B, Supplementary Table S4). A visual representation of tanglegrams generated for one outbreak is shown in Figure 7.

Figure 6. (A) Baker’s gamma indices for outbreak tanglegrams. (B) Cophenetic Correlation Coefficients for outbreak tanglegrams.

Figure 7. (A) Tanglegram of cgMLST and hqSNP clustering using single linkage for one representative outbreak (outbreak 04) and its corresponding sporadic/non-outbreak isolates. (B) Tanglegram of wgMLST (chrom) and hqSNP clustering for the same set of isolate sequences. Outbreak isolates are depicted in orange, and sporadic isolates are depicted in green. The tanglegram links tips with the same label to each other via a straight line. Allele/hqSNP differences are labeled at each node.

3.5. K-Means Analysis

Across all three subtyping methods and for all 11 outbreaks, the Silhouette score was maximized at K = 2, designating 2 as the ideal number of clusters/groups within each combined set of outbreak and sporadic isolate sequences. The statistical significance of K = 2 was measured based on a Pearson gamma coefficient, which ranged from 0.85 to 0.99 across outbreaks, signifying robust clustering performance via the K-means analysis (Supplementary Table S5). For every outbreak, the average silhouette width for the outbreak isolate group was consistently high across subtyping methods and ranged from 0.92 to 0.99 (cgMLST), 0.89 to 0.99 [wgMLST (chrom)], and 0.87 to 0.99 (hqSNP), where a value close to 1.00 indicates more cohesive clustering within groups. The average silhouette widths for the sporadic isolate groups showed more variation, as expected, since sporadic isolates were not epidemiologically linked, ranging from 0.34 to 0.93 (cgMLST), 0.32 to 0.92 [wgMLST (chrom)], and 0.35 to 0.91 (hqSNP) (Supplementary Table S6). For all 11 outbreaks, a hierarchical divisive cluster analysis using single linkage showed consistent division of outbreak and sporadic isolates into two distinct groups across subtyping methods, as shown with cluster dendrograms. For 8 of the 11 outbreaks, K-means analysis assigned all outbreak and sporadic isolates to the correct groups based on ground truth data. For 3 of the 11 outbreaks (outbreaks 04, 06, and 11), K-means analysis assigned between one and six isolates to the wrong group (either outbreak or sporadic) based on ground truth data. These incorrect assignments occurred across all three subtyping methods for these outbreaks, and except for outbreak 06, the same isolates were misassigned across subtyping methods.
For outbreak 06, the same two sporadic isolates, PNUSAE026825 and PNUSAE033271, were incorrectly assigned to the outbreak group using cgMLST and wgMLST (chrom), whereas with hqSNP, all six sporadic isolates were incorrectly assigned to the outbreak group. Cluster dendrograms are shown for all outbreaks in Supplementary Table S7A. Supplementary Table S7B lists the incorrectly assigned isolates for outbreaks 04, 06, and 11, as well as the minimum pairwise difference between each isolate and the incorrect group to which it was assigned.
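The average silhouette widths reported above can be illustrated with a short sketch that scores a given two-group partition directly from a pairwise distance matrix. This is an assumption-laden example, not the study's analysis: the matrix and labels are hypothetical, and the code only evaluates a partition that is already known rather than running K-means to find one.

```python
def silhouette_widths(dist, labels):
    """Silhouette width of each item, given a distance matrix and group labels."""
    n = len(dist)
    groups = sorted(set(labels))
    widths = []
    for i in range(n):
        own = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not own:
            widths.append(0.0)  # common convention for a single-member group
            continue
        a = sum(own) / len(own)  # mean distance to the item's own group
        b = min(  # mean distance to the nearest other group
            sum(dist[i][j] for j in range(n) if labels[j] == g) / labels.count(g)
            for g in groups
            if g != labels[i]
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


def group_average(widths, labels, group):
    """Average silhouette width over the members of one group."""
    members = [w for w, lab in zip(widths, labels) if lab == group]
    return sum(members) / len(members)


# Hypothetical data: three tightly related outbreak isolates, two sporadic ones.
dist = [
    [0, 1, 2, 30, 42],
    [1, 0, 1, 31, 41],
    [2, 1, 0, 29, 40],
    [30, 31, 29, 0, 25],
    [42, 41, 40, 25, 0],
]
labels = ["outbreak", "outbreak", "outbreak", "sporadic", "sporadic"]

widths = silhouette_widths(dist, labels)
outbreak_avg = group_average(widths, labels, "outbreak")
sporadic_avg = group_average(widths, labels, "sporadic")
print(f"outbreak: {outbreak_avg:.2f}, sporadic: {sporadic_avg:.2f}")
```

In this toy case the outbreak group scores close to 1 while the sporadic group scores much lower, mirroring the pattern described above in which unlinked sporadic isolates form a looser group.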

3.6. Summary of Metrics

All analysis metrics obtained in this study are shown in Table 3, Table 4 and Table 5.

Table 3. Summary table of metrics (regression analysis).

Table 4. Summary table of metrics (phylogenetic clustering analysis).

Table 5. Summary table of metrics (K-means analysis).

4. Discussion

As WGS continues to be used to support epidemiological investigations of foodborne and zoonotic outbreaks, validation of the methods used for routine cluster surveillance is essential. While allele-based cluster detection methods have been previously validated for other foodborne pathogens under PulseNet USA surveillance, namely Campylobacter and Salmonella [11,30], this study aimed to support the use of PulseNet’s allele-based cluster detection methods for STEC. This study demonstrated that the allele schemes (core genome and whole-genome using chromosomal loci) integrated into the PulseNet 2.0 Escherichia national database generate outputs that are highly concordant with the hqSNP analysis and well-aligned with epidemiological data. These findings establish allele-based methods as a reliable mechanism for STEC illness cluster detection in the United States using the PulseNet 2.0 system.

This study is one of the first to compare hqSNP and allele-based outputs using the PulseNet 2.0 allele calling workflow. Specific improvements in the PulseNet 2.0 allele calling workflow include (1) the incorporation of hash values for allele identification, (2) an allele-filtering step, which ensures high quality locus classification, leading to a more accurate analysis for outbreak detection, and (3) improved time and memory performance of allele calling and filtering. Overall, cloud-based data-processing and analysis pipelines, such as PulseNet 2.0, support scalable and shareable bioinformatics tools across multiple users, making PulseNet 2.0 more cost-effective, particularly in terms of the cost savings from faster, more accurate outbreak detection and responses. These improvements, coupled with the results of this study, should grant public health researchers assurance in using the PulseNet 2.0 system for STEC outbreak detection.

Using 11 well-characterized STEC outbreaks, this study showed concordance between subtyping methods when pairwise differences between outbreak isolates were compared. Within each outbreak, there were very few (i.e., ≤5) genomic differences across subtyping methods, except for an international travel-associated outbreak, where there were 19 SNP differences compared to 8 cgMLST allele differences. In this outbreak, hqSNP results were more closely aligned with wgMLST (chrom) results (16 allelic differences), suggesting that the additional allelic differences likely occurred on loci belonging to the accessory genome. This implies that wgMLST, due to its enhanced resolution, may be used routinely in conjunction with cgMLST for comparing outbreak isolate sequences associated with international travel. This finding should be acknowledged, as outbreak sequences are frequently shared and compared between international public health organizations [48]. A future analysis that examines additional international travel-associated outbreaks is needed to support this finding and would be beneficial for international public health groups that engage in genomic data sharing and comparison.

Previous evaluations of the congruence between WGS-based subtyping methods using pairwise genomic differences have yielded similar results to those of this study. For example, Pearce et al.’s comparison of cgMLST and SNP typing within a European Salmonella Enteritidis outbreak demonstrated that cgMLST analysis using the EnteroBase scheme was congruent with an original SNP-based analysis, wgMLST analysis, and epidemiological data [49]. We emphasize Pearce et al.’s statement that cgMLST can be readily implemented in laboratories that have access to web-based bioinformatics analysis tools, something now available within the PulseNet 2.0 system. Similarly, Simon et al.’s comparison of WGS-based approaches for investigating a foodborne outbreak caused by Salmonella Derby in Germany found that both SNP- and cgMLST-based methods proved to be highly suitable for reliable cluster generation [50]. While these two previous evaluations focused on an individual outbreak caused by one particular serotype and our study comprises multiple outbreaks caused by varying serotypes, the overall conclusion is consistent across studies.

In this study, the concordance between SNP- and allele-based methods was statistically evaluated using Pearson correlation coefficients and simple linear regression. These approaches confirmed a direct linear relationship between subtyping methods and quantified the strength of their concordance with Pearson correlation coefficients between 0.86 and 0.91; all correlation coefficients were supported by 95% confidence intervals. A similar investigation conducted by Blanc et al. also used linear regression and correlation coefficients to show that for Pseudomonas aeruginosa, core and whole-genome MLST approaches were as discriminatory as SNP-based approaches during outbreak investigations, with correlation coefficients between 0.78 and 0.99 [51]. More relevant to Escherichia, Bernaquez et al. ranked the discriminatory power of four WGS-based subtyping methods applied to a dataset of Shigella sonnei and Shigella flexneri outbreak isolates using linear regression [52]. While our study did not attempt to rank each subtyping method according to its discriminatory power, the overall findings were similar in that, in both studies, SNP- and allele-based methods were highly comparable for clustering epidemiologically related isolates.

In this investigation, while each of the three subtyping methods showed high overall concordance using linear regression, hqSNP was found to be slightly more discriminating than cgMLST on average, in that the slope of the trend line showed fewer cgMLST allele differences (0.432 [CI: 0.426, 0.437]) per one hqSNP difference. This finding was foreseen, since SNP analysis can include non-core genes and intergenic regions, capturing a wider array of genetic differences than cgMLST, which is restricted to the core genome, and because multiple changes in the same locus only count as one difference in MLST-based methods. The resolution of wgMLST (chrom) and hqSNP was nearly equivalent, with the slope showing 0.966 [CI: 0.956, 0.975] wgMLST (chrom) allelic differences for every one hqSNP difference. This outcome provides additional assurance that the wgMLST (chrom) scheme within the PulseNet 2.0 system captures approximately the same genetic variation as SNP-based approaches for closely related strains. Previous comparative analyses in which PulseNet outbreak data were used yielded similar findings [11,25,30]. In the present study, as expected, wgMLST (chrom) was found to provide enhanced resolution over cgMLST, with 1.914 [CI: 1.895, 1.933] wgMLST (chrom) allelic differences for every one cgMLST difference. Because wgMLST comprises a larger set of loci, it provides a broader view of genetic diversity and can be used to provide further resolution as needed when comparing outbreak isolates.
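The point that several changes within one locus count as a single allele difference can be shown with a toy comparison. The sketch below is hypothetical throughout: the locus names and sequences are invented, alleles are compared by their full sequence, and the SNP count is taken only over the same aligned, equal-length loci, so it does not model the intergenic or non-core positions that hqSNP analysis can also capture.

```python
def allele_distance(profile_a, profile_b):
    """Number of shared loci whose allele sequences differ."""
    shared = profile_a.keys() & profile_b.keys()
    return sum(1 for locus in shared if profile_a[locus] != profile_b[locus])


def snp_distance(profile_a, profile_b):
    """Number of differing positions across shared loci (equal-length toy alleles)."""
    shared = profile_a.keys() & profile_b.keys()
    return sum(
        sum(1 for x, y in zip(profile_a[locus], profile_b[locus]) if x != y)
        for locus in shared
    )


# Hypothetical isolates: locus2 carries two substitutions, locus3 carries one.
isolate_a = {"locus1": "ATGCATGC", "locus2": "GGGTTTAA", "locus3": "CCCCAAAA"}
isolate_b = {"locus1": "ATGCATGC", "locus2": "GAGTCTAA", "locus3": "CCCCAAAT"}

alleles = allele_distance(isolate_a, isolate_b)
snps = snp_distance(isolate_a, isolate_b)
print(f"allele differences = {alleles}, SNP differences = {snps}")
```

Here the two substitutions in locus2 collapse into one allele difference, so the allele count is lower than the SNP count, which is consistent with a regression slope below 1 for allele differences per SNP difference.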

While SNP analysis is currently considered a gold standard genomic comparison method, the data analysis process for SNP analysis requires a certain level of expertise, is computationally intensive, and requires the selection of a reference genome [11,53]. Allele-based sequence typing presents a fitting substitute, particularly for large national surveillance networks, such as PulseNet USA, as a gene-by-gene method, such as MLST, provides a balance between resolution and computational efficiency [54,55]. Core genome schemes have been shown to be standardized and scalable for interlaboratory comparisons for enteric pathogens, as they offer a unified nomenclature that can facilitate communication and data sharing between public health entities [56]. Additionally, establishing sequence quality thresholds helps standardize results across different laboratories, making it easier to compare and aggregate data on a national or global scale. While cgMLST schemes generally include those loci present in the majority (95–100%) of isolates in a given group of bacteria [49,54,56], in a large national surveillance network, such as PulseNet USA, a lower threshold (85%) of core genes detected has been established to adjust for the variability that may be due to differences in technical replication across the network. Finally, by focusing on core genes, cgMLST may reduce the noise introduced by variations in non-core or accessory genomes, which may be less informative for distinguishing closely related strains. However, as noted in this study, some outbreaks, including those associated with international travel, may benefit from being compared using the added precision of wgMLST. Thus, while cgMLST is well-suited for long-term and routine surveillance, higher-resolution methods, such as wgMLST and hqSNP, can be used in the context of acute outbreaks, especially for pathogens with diverse accessory genomes. 
Higher-resolution methods may also be useful for studies exploring variation in mobile genetic elements or virulence/resistance genes not found in all strains.

In this study, tanglegrams were used as a visual tool to compare two hierarchical clusterings: cg/wgMLST (chrom) vs. hqSNP and cgMLST vs. wgMLST (chrom) of the same set of isolate sequences. The tanglegrams also provided a quantitative assessment of clustering similarity through the generation of BGI and CCC values. CCC values reflected greater concordance between methods than did BGI values (100% of outbreaks had CCC values > 0.88, regardless of which two subtyping methods were compared, whereas BGI values ranged between 0.35 and 0.99 depending on the outbreak and methods compared). This observation is likely due to differences in the intended purpose of these two measures, where the purpose of the BGI is to calculate the similarity between two clustering results, and the purpose of the CCC is to calculate how well the hierarchical clustering preserves the actual pairwise distances between isolate sequences. Nonetheless, the tanglegrams created similar tree topologies and allowed us to effectively visualize the relationships between clustering structures. A previous study by Zhang et al. also used tanglegrams to show that SNP, cgMLST, and wgMLST congruously separated porcine and environmental STEC O157:H7 isolates into various phylogenetic groups and revealed high CCC values between 0.995 and 0.996 across subtyping methods [57]. While the tanglegrams in Zhang et al.’s study demonstrated the effective clustering of STEC isolates by source type, our tanglegrams demonstrated the clustering of outbreak versus sporadic/non-outbreak isolates. Nevertheless, both studies illustrate how tanglegrams can be a useful visualization tool for comparing and clustering genomic data, particularly when they are accompanied by a statistical quantification of tree similarity.
We note that many other genomic clustering visualization tools are available and could have added value to this study, for example GrapeTree, an interactive tree visualization program within EnteroBase, which can be used to create phylogenies using EnteroBase’s wg/cgMLST schemes [58].

As an external form of validation, we applied an unsupervised machine learning technique, K-means analysis, to show how outbreak and sporadic/non-outbreak isolate sequences were effectively partitioned outside of the PulseNet 2.0 system. An initial step in performing K-means analysis involves identifying the optimal number of clusters in a dataset based on a score, such as the Silhouette score (as was used in this study), or Elbow or Gap score [45]. For this study, we chose to apply the Silhouette method, obtaining a Silhouette score for each outbreak, since the Silhouette method does not require a training set to evaluate clustering performance and has the added advantage of identifying outliers in a dataset [45]. Our approach was successful, but we note that several other methods and metrics are available for evaluating clustering performance within K-means analysis and should be explored using PulseNet 2.0 data. For example, Coipan et al. examined the consensus between the Silhouette score and additional indices such as Dunn2 and McClain–Rao internal validation indices for clustering genomic data obtained using wg/cgMLST and SNP workflows [28]. Their results showed that while there were slight variations across indices based on the workflow, all metrics yielded the same optimal number of clusters and effectively separated outbreak isolates from non-outbreak isolates [28]. In our study, K-means analysis correctly designated outbreak and sporadic/non-outbreak isolates into the appropriate groups based on ground truth data for all but three outbreaks (outbreaks 04, 06, and 11). In these three outbreaks, only 1–2 isolates per outbreak were incorrectly assigned to the wrong group (with the exception of outbreak 06, where all sporadic isolates were incorrectly assigned to the outbreak group; however, this incorrect assignment only occurred when using hqSNP data). 
For these three outbreaks and in particular for outbreak 06, even though the average silhouette width was maximized at 2, there was enough diversity among the sporadic isolates that three (instead of two) K-groups could have reasonably been produced (Supplementary Table S5), possibly explaining why the K-means analysis forced the sporadic isolates into the incorrect group. Still, the overall results of this external validation show that K-means analysis can be used as a reliable proxy for clustering genomic data or at least as a supplementary validation technique.

This study has some limitations. First, we excluded all genes predominantly found on mobile genetic elements (phage and/or plasmid) and instead chose to limit our analysis to core and chromosomally located accessory genes. We chose to use the chromosomal-only wgMLST scheme instead of the full accessory genome approach because chromosomal genes are generally more conserved and present in all or most isolates, making comparisons more reliable and robust. Additionally, it is well-known that differences on plasmid loci and/or those found on other mobile genetic elements in the accessory genome may cause inflated genetic variation, making comparisons between subtyping methods challenging [52]; therefore, we chose to exclude these so as not to introduce unnecessary variation. However, the increased resolution provided by using all loci (including those in the accessory genome) may reveal more about the evolutionary history and genetic relatedness of different strains, which could have potential value for epidemiologic investigations, particularly for those involving travel. Second, we limited this analysis to STEC outbreaks due to the predominance STEC has over other E. coli pathotypes in PulseNet USA surveillance, wherein STEC makes up approximately 80% of all E. coli pathotypes (not including EIEC/Shigella) submitted to PulseNet USA. An expansion of this study could examine additional pathotypes to determine if the same congruence between subtyping workflows observed in this study exists across non-STEC pathotypes. We propose that an entirely separate study is warranted for EIEC/Shigella, given that cases occur through continuous person-to-person transmission, predominantly involving men who have sex with men (MSM), leading to long-term and recurrent outbreaks [59]. 
Undoubtedly, Shigella-specific transmission patterns, along with the prolonged genetic evolution of outbreaks, necessitate a unique assessment of existing WGS-based subtyping methods and cluster interpretation criteria [52,60]. Third, this study included ten foodborne outbreaks and one travel-associated outbreak, but there were no animal-contact-associated outbreaks included in this study. Given that zoonotic outbreaks may show increased variation based on cgMLST due to the evolution of strains between animal and human hosts [61], an expansion of this study could examine how this increased variation may affect concordance between allele- and SNP-based outputs and/or whether genomic differences between animal and human sources occur in areas other than the core genome. Finally, this study was limited to domestic outbreaks and does not incorporate extensive recent global data on STEC and other E. coli. Admittedly, surveying broader international data, including patterns associated with antimicrobial resistance and virulence gene distribution, could reveal noteworthy genetic relationships that suggest mutual sources or transmission routes across different regions, as seen in Bakleh et al.’s recent systematic review [62].

5. Conclusions

This study demonstrates that the allele schemes and allele calling workflow integrated within the PulseNet 2.0 system reliably cluster STEC outbreak isolates with the same epidemiologic concordance as hqSNP. Using multiple techniques and statistical measures including pairwise differences, linear regression models, and tanglegrams, this study confirms that the PulseNet 2.0 system can be used to detect STEC outbreaks caused by different serotypes and sources. Further evaluation using K-means analysis as an unsupervised machine learning approach objectively validated the results of this study, ensuring that the results are meaningful and reproducible. Overall, this study suggests the use of cgMLST as an ideal WGS-based analysis technique for routine STEC outbreak detection within large public health networks, due to its scalability and concordance with hqSNP, while wgMLST and hqSNP analyses can be used when further precision is needed for comparing outbreak isolates.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/microorganisms13061310/s1. Table S1: Isolate accession numbers; Table S2: List of accession numbers for additional closed reference sequences used in hqSNP (Lyve-SET) analysis; Table S3: Loci names for schema; Table S4: Baker’s Gamma Indices (BGI) and Cophenetic Correlation Coefficients (CCC) for outbreak tanglegrams; Table S5: Silhouette method graphs; Table S6: Average Silhouette widths from K-means analysis; Table S7: K-means dendrograms.

Author Contributions

Conceptualization, M.M.L., M.N.S., K.B.H. and H.A.C.; Methodology, M.M.L., M.N.S., K.B.H. and H.A.C.; Software, J.A.C., M.P.M., H.P. and D.M.; Validation, M.M.L., K.B.H. and H.A.C.; Formal Analysis, M.M.L., T.G., M.T. and K.K.; Investigation, M.M.L.; Data Curation, M.M.L., M.N.S. and K.B.H.; Writing—Original Draft Preparation, M.M.L.; Writing—Review and Editing, M.M.L., M.N.S., R.L.L., P.A.S., T.G., M.T., K.K., L.S.K., K.B.H., G.M.W., S.G.S., S.B.I., J.H., A.K., S.C., A.J.C., S.G., E.T. and S.P.; Visualization, M.M.L.; Supervision, H.A.C. All authors have read and agreed to the published version of the manuscript.

Funding

The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This work was made possible through support from the CDC’s Advanced Molecular Detection (AMD) program.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Raw sequence data files for the 297 isolates included in this study have been deposited in the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) under Bioproject PRJNA218110 (PulseNet Escherichia coli and Shigella genome sequencing). Supplementary Table S1 contains biosample and SRA accession numbers for the 297 isolates used in this study. Supplementary Table S2 contains biosample and SRA accession numbers for an additional 11 closed reference sequences used in hqSNP analysis.

Acknowledgments

The authors wish to acknowledge the PulseNet-participating state, local, and federal public health and regulatory laboratories for sequencing the isolates that were used for this study and for their contributions to the PulseNet network and to the development of the PulseNet 2.0 system. The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention. The use of trade names and commercial sources is for identification only and does not imply endorsement by the Centers for Disease Control and Prevention, the Public Health Service, or the U.S. Department of Health and Human Services.

Conflicts of Interest

Authors João A. Carriço, Miguel P. Machado, Hannes Pouseele, and Dolf Michielsen are employed by the company bioMérieux. Author Krittika Krishnan is employed by the company Applied Science Research and Technology, Inc. Authors Jasmine Huffman, Alyssa Kelley, Sara Cleland, Alan J. Collins, Shruti Gautam, Eishita Tyagi and Subin Park are employed by the company Booz Allen Hamilton. Author Alan J. Collins is employed by the company IHRC, Inc. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

STEC	Shiga-toxin-producing Escherichia coli
WGS	Whole-genome sequencing
cgMLST	Core genome multi-locus sequence typing
wgMLST	Whole-genome multi-locus sequence typing
wgMLST (chrom)	Whole-genome multi-locus sequence typing using chromosome-associated loci
BGI	Baker’s Gamma Index
CCC	Cophenetic Correlation Coefficient
ETEC	Enterotoxigenic E. coli
EPEC	Enteropathogenic E. coli
EAEC	Enteroaggregative E. coli
DAEC	Diffusely Adherent E. coli
EIEC	Enteroinvasive E. coli
HUS	Hemolytic uremic syndrome
CDC	Centers for Disease Control and Prevention
APHL	Association of Public Health Laboratories
PFGE	Pulsed-field gel electrophoresis
NCBI	National Center for Biotechnology Information
SRA	Sequence Read Archive
CI	Confidence interval

Appendix A

Comparison of Pairwise Differences Obtained from BioNumerics v.7.6.3 vs. PulseNet 2.0

An ancillary objective of this study was to evaluate the concordance between PulseNet’s former data management system (BioNumerics v.7.6.3) and the PulseNet 2.0 system. Pairwise allelic differences among outbreak isolates were compared between systems and were found to be highly comparable. For cgMLST, allelic difference ranges were identical between systems for outbreaks 01–09, but for outbreaks 10 and 11, there was one additional cgMLST difference observed using BioNumerics (Table A1). More variation between systems was observed for wgMLST (chrom), where 8 outbreaks showed slightly different wgMLST (chrom) allele ranges, but this variation amounted to no more than 4 allelic differences between systems. Outbreaks 01, 02, 04, 10, and 11 showed between one and four additional wgMLST (chrom) allelic differences using BioNumerics compared to PulseNet 2.0, and outbreaks 05, 07, and 08 showed between one and two additional wgMLST (chrom) allelic differences using PulseNet 2.0 compared to BioNumerics. For outbreaks 03, 06, and 09, wgMLST (chrom) allelic difference ranges were identical between systems (Table A1). For outbreak and sporadic/non-outbreak isolates, cgMLST allele difference ranges were identical between systems for all outbreaks except outbreak 09, where there was one additional allele difference using BioNumerics (Table A2). The range of wgMLST (chrom) allele differences varied between systems for all outbreaks (Table A2).

Overall, pairwise allelic differences obtained from each system were highly concordant across multiple sets of outbreak isolates, particularly when using the cgMLST scheme. This finding suggests that allele difference outputs using the PulseNet 2.0 allele calling workflow can be interpreted similarly to those derived from BioNumerics, obviating the need to adjust STEC cluster detection thresholds following the transition to PulseNet 2.0.

Table A1. Range of hqSNP and allele-based pairwise genomic differences between outbreak isolates; BioNumerics v.7.6.3 vs. PulseNet 2.0. Red text indicates allele range discrepancies between systems.

Outbreak Number Assigned in Study	Outbreak Code	SNP Differences	cgMLST Differences (BioNumerics)	cgMLST Differences (PN 2.0)	wgMLST (chrom) Differences (BioNumerics)	wgMLST (chrom) Differences (PN 2.0)
01	1601MLEXK-1	0–2	0–2	0–2	0–6	0–5
02	1603VAEXH-1	0–1	0–0	0–0	0–4	0–2
03	1608MIEC5-1	0–3	0–3	0–3	0–4	0–4
04	1912IAEXW-1	0–7	0–2	0–2	0–4	0–3
05	1911MNEXH-1	0–2	0–2	0–2	0–2	0–4
06	1909CAEXH-1	0–7	0–3	0–3	0–7	0–7
07	2206MLEXD-1	0–19	0–8	0–8	0–15	0–16
08	1905MLEXK-1	0–2	0–3	0–3	0–4	0–5
09	1902MLEVC-1	0–2	0–1	0–1	0–3	0–3
10	1808MLEVC-1	0–1	0–2	0–1	0–5	0–1
11	1712MLEXH-1	0–5	0–3	0–2	0–7	0–5
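Because the red highlighting referred to in the caption may not survive in every rendering, the discrepancies in Table A1 can also be recovered programmatically. The sketch below transcribes the upper bounds of the allele ranges from Table A1 (every range starts at 0) and lists the outbreaks where the two systems differ; the variable and function names are illustrative assumptions.

```python
# Upper bounds of the Table A1 allele ranges (all lower bounds are 0):
# outbreak -> (cgMLST BioNumerics, cgMLST PN 2.0, wgMLST BioNumerics, wgMLST PN 2.0)
TABLE_A1 = {
    "01": (2, 2, 6, 5),
    "02": (0, 0, 4, 2),
    "03": (3, 3, 4, 4),
    "04": (2, 2, 4, 3),
    "05": (2, 2, 2, 4),
    "06": (3, 3, 7, 7),
    "07": (8, 8, 15, 16),
    "08": (3, 3, 4, 5),
    "09": (1, 1, 3, 3),
    "10": (2, 1, 5, 1),
    "11": (3, 2, 7, 5),
}


def discrepancies(table, bn_index, pn_index):
    """Map each outbreak with differing upper bounds to (BioNumerics - PN 2.0)."""
    return {
        outbreak: row[bn_index] - row[pn_index]
        for outbreak, row in table.items()
        if row[bn_index] != row[pn_index]
    }


cg_diff = discrepancies(TABLE_A1, 0, 1)
wg_diff = discrepancies(TABLE_A1, 2, 3)
max_wg_gap = max(abs(gap) for gap in wg_diff.values())

print("cgMLST discrepancies:", cg_diff)
print("wgMLST (chrom) discrepancies:", wg_diff)
print("largest wgMLST (chrom) gap:", max_wg_gap)
```

Positive values indicate additional differences under BioNumerics and negative values indicate additional differences under PulseNet 2.0.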

Table A2. Range of hqSNP and allele-based pairwise genomic differences between outbreak + sporadic/non-outbreak isolates; BioNumerics v.7.6.3 vs. PulseNet 2.0. Red text indicates allele range discrepancies between systems.

Outbreak Number Assigned in Study	Outbreak Code	SNP Differences	cgMLST Differences (BioNumerics)	cgMLST Differences (PN 2.0)	wgMLST (chrom) Differences (BioNumerics)	wgMLST (chrom) Differences (PN 2.0)
01	1601MLEXK-1	0–40	0–21	0–21	0–47	0–39
02	1603VAEXH-1	0–63	0–26	0–26	0–70	0–59
03	1608MIEC5-1	0–106	0–34	0–34	0–83	0–70
04	1912IAEXW-1	0–122	0–61	0–61	0–127	0–105
05	1911MNEXH-1	0–182	0–88	0–88	0–171	0–158
06	1909CAEXH-1	0–150	0–27	0–27	0–60	0–59
07	2206MLEXD-1	0–114	0–54	0–54	0–111	0–100
08	1905MLEXK-1	0–112	0–45	0–45	0–91	0–84
09	1902MLEVC-1	0–85	0–36	0–35	0–103	0–92
10	1808MLEVC-1	0–53	0–27	0–27	0–53	0–44
11	1712MLEXH-1	0–49	0–20	0–20	0–48	0–44

References

Taylor, D.N.; Echeverria, P.; Sethabutr, O.; Pitarangsi, C.; Leksomboon, U.; Blacklow, N.R.; Rowe, B.; Gross, R.; Cross, J. Clinical and microbiologic features of Shigella and enteroinvasive Escherichia coli infections detected by DNA hybridization. J. Clin. Microbiol. 1988, 26, 1362–1366.
About Escherichia coli Infection. Available online: https://www.cdc.gov/ecoli/about/?CDC_AAref_Val=https://www.cdc.gov/ecoli/general/index.html (accessed on 28 February 2025).
Scallan, E.; Hoekstra, R.M.; Angulo, F.J.; Tauxe, R.V.; Widdowson, M.A.; Roy, S.L.; Jones, J.L.; Griffin, P.M. Foodborne illness acquired in the United States—Major pathogens. Emerg. Infect. Dis. 2011, 17, 7–15.
Marshall, K.E.; Hexemer, A.; Seelman, S.L.; Fatica, M.K.; Blessington, T.; Hajmeer, M.; Kisselburgh, H.; Atkinson, R.; Hill, K.; Sharma, D.; et al. Lessons Learned from a Decade of Investigations of Shiga Toxin-Producing Escherichia coli Outbreaks Linked to Leafy Greens, United States and Canada. Emerg. Infect. Dis. 2020, 26, 2319–2328.
Mead, P.S.; Griffin, P.M. Escherichia coli O157:H7. Lancet 1998, 352, 1207–1212.
Gould, L.H.; Demma, L.; Jones, T.F.; Hurd, S.; Vugia, D.J.; Smith, K.; Shiferaw, B.; Segler, S.; Palmer, A.; Zansky, S.; et al. Hemolytic uremic syndrome and death in persons with Escherichia coli O157:H7 infection, foodborne diseases active surveillance network sites, 2000–2006. Clin. Infect. Dis. 2009, 49, 1480–1485.
Swaminathan, B.; Barrett, T.J.; Hunter, S.B.; Tauxe, R.V.; CDC PulseNet Task Force. PulseNet: The molecular subtyping network for foodborne bacterial disease surveillance, United States. Emerg. Infect. Dis. 2001, 7, 382–389.
Gerner-Smidt, P.; Hise, K.; Kincaid, J.; Hunter, S.; Rolando, S.; Hyytiä-Trees, E.; Ribot, E.M.; Swaminathan, B.; Pulsenet Taskforce. PulseNet USA: A five-year update. Foodborne Pathog. Dis. 2006, 3, 9–19.
Ribot, E.M.; Hise, K.B. Future challenges for tracking foodborne diseases: PulseNet, a 20-year-old US surveillance system for foodborne diseases, is expanding both globally and technologically. EMBO Rep. 2016, 17, 1499–1505.
Tolar, B.; Joseph, L.A.; Schroeder, M.N.; Stroika, S.; Ribot, E.M.; Hise, K.B.; Gerner-Smidt, P. An Overview of PulseNet USA Databases. Foodborne Pathog. Dis. 2019, 16, 457–462.
Joseph, L.A.; Griswold, T.; Vidyaprakash, E.; Im, S.B.; Williams, G.M.; Pouseele, H.A.; Hise, K.B.; Carleton, H.A. Evaluation of core genome and whole genome multilocus sequence typing schemes for Campylobacter jejuni and Campylobacter coli outbreak detection in the USA. Microb. Genom. 2023, 9, mgen001012.
2025 National Notifiable Conditions (Historical). Available online: https://ndc.services.cdc.gov/search-results-year/ (accessed on 28 February 2025).
Brown, E.; Dessai, U.; McGarry, S.; Gerner-Smidt, P. Use of Whole-Genome Sequencing for Food Safety and Public Health in the United States. Foodborne Pathog. Dis. 2019, 16, 441–450.
BioNumerics, (Version 7.6): WGS Analysis Software Platform; BioMérieux: Sint-Martens-Latem, Belgium, 2024.
Besser, J.; Carleton, H.A.; Gerner-Smidt, P.; Lindsey, R.L.; Trees, E. Next-generation sequencing technologies and their application to the study and control of bacterial infections. Clin. Microbiol. Infect. 2018, 24, 335–341.
Jackson, B.R.; Tarr, C.; Strain, E.; Jackson, K.A.; Conrad, A.; Carleton, H.; Katz, L.S.; Stroika, S.; Gould, L.H.; Mody, R.K.; et al. Implementation of Nationwide Real-time Whole-genome Sequencing to Enhance Listeriosis Outbreak Detection and Investigation. Clin. Infect. Dis. 2016, 63, 380–386.
Stevens, E.L.; Carleton, H.A.; Beal, J.; Tillman, G.E.; Lindsey, R.L.; Lauer, A.C.; Pightling, A.; Jarvis, K.G.; Ottesen, A.; Ramachandran, P.; et al. Use of Whole Genome Sequencing by the Federal Interagency Collaboration for Genomics for Food and Feed Safety in the United States. J. Food Prot. 2022, 85, 755–772.
Gerner-Smidt, P.; Besser, J.; Concepción-Acevedo, J.; Folster, J.P.; Huffman, J.; Joseph, L.A.; Kucerova, Z.; Nichols, M.C.; Schwensohn, C.A.; Tolar, B. Whole Genome Sequencing: Bridging One-Health Surveillance of Foodborne Diseases. Front. Public Health 2019, 7, 172, Erratum in Front. Public Health. 2019, 7, 365. https://doi.org/10.3389/fpubh.2019.00365.
Nayfach, S.; Rodriguez-Mueller, B.; Garud, N.; Pollard, K.S. An integrated metagenomics pipeline for strain profiling reveals novel patterns of bacterial transmission and biogeography. Genome Res. 2016, 26, 1612–1625.
Thompson, C.C.; Chimetto, L.; Edwards, R.A.; Swings, J.; Stackebrandt, E.; Thompson, F.L. Microbial genomic taxonomy. BMC Genom. 2013, 14, 913.
Yoon, S.H.; Ha, S.M.; Lim, J.; Kwon, S.; Chun, J. A large-scale evaluation of algorithms to calculate average nucleotide identity. Antonie Leeuwenhoek 2017, 110, 1281–1286.
Lindsey, R.L.; Gladney, L.M.; Huang, A.D.; Griswold, T.; Katz, L.S.; Dinsmore, B.A.; Im, M.S.; Kucerova, Z.; Smith, P.A.; Lane, C.; et al. Rapid identification of enteric bacteria from whole genome sequences using average nucleotide identity metrics. Front. Microbiol. 2023, 14, 1225207.
Joensen, K.G.; Tetzschner, A.M.; Iguchi, A.; Aarestrup, F.M.; Scheutz, F. Rapid and Easy In Silico Serotyping of Escherichia coli Isolates by Use of Whole-Genome Sequencing Data. J. Clin. Microbiol. 2015, 53, 2410–2426.
Malberg Tetzschner, A.M.; Johnson, J.R.; Johnston, B.D.; Lund, O.; Scheutz, F. In Silico Genotyping of Escherichia coli Isolates for Extraintestinal Virulence Genes by Use of Whole-Genome Sequencing Data. J. Clin. Microbiol. 2020, 58, e01269-20.
Katz, L.S.; Griswold, T.; Williams-Newkirk, A.J.; Wagner, D.; Petkau, A.; Sieffert, C.; Van Domselaar, G.; Deng, X.; Carleton, H.A. A Comparative Analysis of the Lyve-SET Phylogenomics Pipeline for Genomic Epidemiology of Foodborne Pathogens. Front. Microbiol. 2017, 8, 375.
Timme, R.E.; Rand, H.; Shumway, M.; Trees, E.K.; Simmons, M.; Agarwala, R.; Davis, S.; Tillman, G.E.; Defibaugh-Chavez, S.; Carleton, H.A.; et al. Benchmark datasets for phylogenomic pipeline validation, applications for foodborne pathogen surveillance. PeerJ 2017, 5, e3893.
Uelze, L.; Grützke, J.; Borowiak, M.; Hammerl, J.A.; Juraschek, K.; Deneke, C.; Tausch, S.H.; Malorny, B. Typing methods based on whole genome sequencing data. One Health Outlook 2020, 2, 3.
Coipan, C.E.; Dallman, T.J.; Brown, D.; Hartman, H.; van der Voort, M.; van den Berg, R.R.; Palm, D.; Kotila, S.; van Wijk, T.; Franz, E. Concordance of SNP- and allele-based typing workflows in the context of a large-scale international Salmonella Enteritidis outbreak investigation. Microb. Genom. 2020, 6, e000318.
Munck, N.; Njage, P.M.K.; Leekitcharoenphon, P.; Litrup, E.; Hald, T. Application of Whole-Genome Sequences and Machine Learning in Source Attribution of Salmonella Typhimurium. Risk Anal. 2020, 40, 1693–1705.
Leeper, M.M.; Tolar, B.M.; Griswold, T.; Vidyaprakash, E.; Hise, K.B.; Williams, G.M.; Im, S.B.; Chen, J.C.; Pouseele, H.; Carleton, H.A. Evaluation of whole and core genome multilocus sequence typing allele schemes for Salmonella enterica outbreak detection in a national surveillance network, PulseNet USA. Front. Microbiol. 2023, 14, 1254777.
Rahbar, S. K-Means Clustering Method on Microbiome Data Unsupervised Machine-Learning Method to Group Microbime Data of the Same Characteristics; 2017. Available online: https://www.researchgate.net/publication/322055290_K-Means_Clustering_Method_on_Microbiome_Data_Unsupervised_Machine-Learning_Method_to_Group_Microbime_Data_of_the_Same_Charactristics (accessed on 1 November 2024).
Oyelade, J.; Isewon, I.; Oladipupo, F.; Aromolaran, O.; Uwoghiren, E.; Ameh, F.; Achas, M.; Adebiyi, E. Clustering Algorithms: Their Application to Gene Expression Data. Bioinform. Biol. Insights 2016, 10, 237–253.
PulseNet International SOPs. Available online: https://www.aphl.org/programs/global_health/Pages/PulseNet-International-SOPs.aspx/ (accessed on 28 February 2025).
Prjibelski, A.; Antipov, D.; Meleshko, D.; Lapidus, A.; Korobeynikov, A. Using SPAdes de Novo Assembler. Curr. Protoc. Bioinform. 2020, 70, e102.
Ingle, D.J.; Gonçalves da Silva, A.; Valcanis, M.; Ballard, S.A.; Seemann, T.; Jennison, A.V.; Bastian, I.; Wise, R.; Kirk, M.D.; Howden, B.P.; et al. Emergence and divergence of major lineages of Shiga-toxin-producing Escherichia coli in Australia. Microb. Genom. 2019, 5, e000268.
Rathnayake, I.U.; Graham, R.M.A.; Bayliss, J.; Staples, M.; Micalizzi, G.; Ariotti, L.; Cover, L.; Heron, B.; Graham, T.; Stafford, R.; et al. Implementation of routine genomic surveillance provided insights into a locally acquired outbreak caused by a rare clade of Salmonella enterica serovar Enteritidis in Queensland, Australia. Microb. Genom. 2023, 9, mgen001059.
Krawczyk, P.S.; Lipinski, L.; Dziembowski, A. PlasFlow: Predicting plasmid sequences in metagenomic data using genome signatures. Nucleic Acids Res. 2018, 46, e35.
Stamatakis, A. RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 2014, 30, 1312–1313.
Zhou, Z.; Alikhan, N.F.; Mohamed, K.; The Agama Study Group; Achtman, M. The EnteroBase user’s guide, with case studies on Salmonella transmissions, Yersinia pestis phylogeny and Escherichia core genomic diversity. Genome Res. 2020, 30, 138–152.
Yuan, A. Maximum Likelihood. Brenner’s Encyclopedia of Genetics, 2nd ed.; Academic Press: New York, NY, USA, 2013; Volume 4.
R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2024; Available online: https://www.R-project.org (accessed on 1 November 2024).
Galili, T. Dendextend: An R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics 2015, 31, 3718–3720.
Baker, F.B. Stability of two hierarchical grouping techniques case 1: Sensitivity to data errors. J. Am. Stat. Assoc. 1974, 69, 440–445.
Saraçli, S.; Doğan, N.; Doğan, İ. Comparison of hierarchical cluster analysis methods by cophenetic correlation. J. Inequal. Appl. 2013, 2013, 203.
Shutaywi, M.; Kachouie, N.N. Silhouette Analysis for Performance Evaluation in Machine Learning with Applications to Clustering. Entropy 2021, 23, 759.
Charrad, M.; Ghazzali, N.; Boiteau, V.; Niknafs, A. NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set. J. Stat. Softw. 2014, 61, 1–36.
Kluyver, T.; Ragan-Kelley, B.; Pérez, F.; Granger, B.; Bussonnier, M.; Frederic, J.; Kelley, K.; Hamrick, J.; Grout, J.; Corlay, S.; et al. Jupyter Notebooks—A publishing format for reproducible computational workflows. In Positioning and Power in Academic Publishing: Players, Agents and Agendas; Loizides, F., Schmidt, B., Eds.; IOS Press: Amsterdam, The Netherlands, 2016; pp. 87–90.
Habrun, C.A.; Birhane, M.G.; François Watkins, L.K.; Benedict, K.; Bottichio, L.; Nemechek, K.; Tolar, B.; Schroeder, M.N.; Chen, J.C.; Caidi, H.; et al. Multistate nontyphoidal Salmonella and Shiga toxin-producing Escherichia coli outbreaks linked to international travel-United States, 2017–2020. Epidemiol. Infect. 2024, 152, e17.
Pearce, M.E.; Alikhan, N.F.; Dallman, T.J.; Zhou, Z.; Grant, K.; Maiden, M.C.J. Comparative analysis of core genome MLST and SNP typing within a European Salmonella serovar Enteritidis outbreak. Int. J. Food Microbiol. 2018, 274, 1–11.
Simon, S.; Trost, E.; Bender, J.; Fuchs, S.; Malorny, B.; Rabsch, W.; Prager, R.; Tietze, E.; Flieger, A. Evaluation of WGS based approaches for investigating a food-borne outbreak caused by Salmonella enterica serovar Derby in Germany. Food Microbiol. 2018, 71, 46–54.
Blanc, D.S.; Magalhães, B.; Koenig, I.; Senn, L.; Grandbastien, B. Comparison of Whole Genome (wg-) and Core Genome (cg-) MLST (BioNumericsTM) Versus SNP Variant Calling for Epidemiological Investigation of Pseudomonas aeruginosa. Front. Microbiol. 2020, 11, 1729.
Bernaquez, I.; Gaudreau, C.; Pilon, P.A.; Bekal, S. Evaluation of whole-genome sequencing-based subtyping methods for the surveillance of Shigella spp. and the confounding effect of mobile genetic elements in long-term outbreaks. Microb. Genom. 2021, 7, 000672.
Jagadeesan, B.; Baert, L.; Wiedmann, M.; Orsi, R.H. Comparative Analysis of Tools and Approaches for Source Tracking Listeria monocytogenes in a Food Facility Using Whole-Genome Sequence Data. Front. Microbiol. 2019, 10, 947.
Maiden, M.C.; Jansen van Rensburg, M.J.; Bray, J.E.; Earle, S.G.; Ford, S.A.; Jolley, K.A.; McCarthy, N.D. MLST revisited: The gene-by-gene approach to bacterial genomics. Nat. Rev. Microbiol. 2013, 11, 728–736.
Cody, A.J.; Bray, J.E.; Jolley, K.A.; McCarthy, N.D.; Maiden, M.C.J. Core Genome Multilocus Sequence Typing Scheme for Stable, Comparative Analyses of Campylobacter jejuni and C. coli Human Disease Isolates. J. Clin. Microbiol. 2017, 55, 2086–2097.
Moura, A.; Criscuolo, A.; Pouseele, H.; Maury, M.M.; Leclercq, A.; Tarr, C.; Björkman, J.T.; Dallman, T.; Reimer, A.; Enouf, V.; et al. Whole genome-based population biology and epidemiological surveillance of Listeria monocytogenes. Nat. Microbiol. 2016, 2, 16185.
Zhang, P.; Essendoubi, S.; Keenliside, J.; Reuter, T.; Stanford, K.; King, R.; Lu, P.; Yang, X. Genomic analysis of Shiga toxin-producing Escherichia coli O157:H7 from cattle and pork-production related environments. NPJ Sci. Food 2021, 5, 15.
Zhou, Z.; Alikhan, N.F.; Sergeant, M.J.; Luhmann, N.; Vaz, C.; Francisco, A.P.; Carriço, J.A.; Achtman, M. GrapeTree: Visualization of core genomic relationships among 100,000 bacterial pathogens. Genome Res. 2018, 28, 1395–1404.
Clinical Overview of Shigellosis. Available online: https://www.cdc.gov/shigella/hcp/clinical-overview/index.html (accessed on 28 February 2025).
Charles, H.; Prochazka, M.; Thorley, K.; Crewdson, A.; Greig, D.R.; Jenkins, C.; Painset, A.; Fifer, H.; Browning, L.; Cabrey, P.; et al. Outbreak of sexually transmitted, extensively drug-resistant Shigella sonnei in the UK, 2021–2022: A descriptive epidemiological study. Lancet Infect. Dis. 2022, 22, 1503–1510.
Trees, E.; Carleton, H.A.; Folster, J.P.; Gieraltowski, L.; Hise, K.; Leeper, M.; Nguyen, T.A.; Poates, A.; Sabol, A.; Tagg, K.A.; et al. Genetic Diversity in Salmonella enterica in Outbreaks of Foodborne and Zoonotic Origin in the USA in 2006–2017. Microorganisms 2024, 12, 1563.
Bakleh, M.Z.; Kohailan, M.; Marwan, M.; Alhaj Sulaiman, A. A Systematic Review and Comprehensive Analysis of mcr Gene Prevalence in Bacterial Isolates in Arab Countries. Antibiotics 2024, 13, 958.

Figure 1. PulseNet 2.0 data analysis workflow.

Figure 2. Overview of Escherichia identification and genotyping workflow in PulseNet 2.0. ¹ https://cge.food.dtu.dk/services/SerotypeFinder/; ² https://github.com/ncbi/stxtyper; ³ https://cge.food.dtu.dk/services/VirulenceFinder/. All websites were last accessed on 1 April 2025 for this study.

Figure 3. PulseNet 2.0 allele calling workflow.

Figure 4. PulseNet 2.0 Escherichia schema. The number of loci included in each scheme is shown for the overall scheme (all loci), core genome, whole genome (excluding core), plasmid, and 7-gene, 8-gene, and 15-gene MLST schemes. * indicates that the scheme is hosted on EnteroBase: https://enterobase.warwick.ac.uk/ (last accessed for this study on 1 April 2025).

Figure 5. (A) Scatterplot of hqSNP differences vs. cgMLST differences. (B) Scatterplot of hqSNP differences vs. wgMLST (chrom) differences. (C) Scatterplot of cgMLST vs. wgMLST (chrom) differences. Regression equations and R² values are displayed on the plots. Pearson correlation coefficients for each combination of pairwise matrices are shown below plots.

Figure 6. (A) Baker’s Gamma Indices for outbreak tanglegrams. (B) Cophenetic Correlation Coefficients for outbreak tanglegrams.

Figure 7. (A) Tanglegram of cgMLST and hqSNP clustering using single linkage for one representative outbreak (outbreak 04) and its corresponding sporadic/non-outbreak isolates. (B) Tanglegram of wgMLST (chrom) and hqSNP clustering for the same set of isolate sequences. Outbreak isolates are depicted in orange, and sporadic isolates are depicted in green. The tanglegram links tips with the same label to each other via a straight line. Allele/hqSNP differences are labeled at each node.

Table 1. Summary of outbreaks included in the study.

Outbreak Number (Assigned in Study)	PulseNet Outbreak Code *	Serotype	No. of Outbreak Isolates (Sporadic Isolates) **	Confirmed Source	Range of Collection Dates
01	1601MLEXK-1	O121:H19	68 (5)	flour	2016-01-02 to 2016-09-07
02	1603VAEXH-1	O157:H7	9 (2)	raw milk	2016-03-07 to 2016-03-20
03	1608MIEC5-1	O5:H9	12 (4)	cheese served at restaurant	2016-03-14 to 2016-08-02
04	1912IAEXW-1	O103:H2	26 (3)	clover sprouts	2019-11-26 to 2019-12-23
05	1911MNEXH-1	O157:H7	18 (5)	frozen pizza crust	2019-10-10 to 2019-12-15
06	1909CAEXH-1	O157:H7	22 (6)	romaine lettuce	2019-07-14 to 2019-09-11
07	2206MLEXD-1	O111:H8	11 (4)	international travel	2022-05-24 to 2022-07-07
08	1905MLEXK-1	O121:H19	22 (4)	bison (ground)	2019-03-23 to 2019-08-12
09	1902MLEVC-1	O26:H11	21 (3)	flour	2018-12-28 to 2019-05-29
10	1808MLEVC-1	O26:H11	19 (2)	beef (ground)	2018-07-09 to 2018-09-04
11	1712MLEXH-1	O157:H7	23 (8)	leafy greens	2017-11-10 to 2017-12-14

* PulseNet outbreak codes consist of the 2-digit year and 2-digit month in which the outbreak was detected, the lab ID/state in which the outbreak was detected (“ML” = multi-state), and a 3-character serotype code, followed by the cluster number [8]. If multiple outbreaks meet the same criteria, the cluster number is incremented (from 1 to 2, 2 to 3, etc.). For example, 1601MLEXK-1 represents the 1st multi-state E. coli O121 outbreak detected in January 2016, 1601MLEXK-2 represents the 2nd multi-state E. coli O121 outbreak detected in January 2016, and so on.

** All sporadic isolates were matched to the outbreak by serotype and had collection dates within six months of the outbreak’s median collection date.
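
As an illustration of the code layout described in the footnote above, the following minimal Python sketch splits an outbreak code into its parts. The function and field names are hypothetical, and the pattern assumes a two-letter lab ID and a three-character serotype code, as seen in every code listed in Table 1; it is not a PulseNet tool and may not cover codes outside this study.

```python
import re
from typing import NamedTuple


class OutbreakCode(NamedTuple):
    year: int            # assumes all codes fall in 2000-2099
    month: int
    lab_id: str          # "ML" denotes a multi-state outbreak
    serotype_code: str
    cluster_number: int


# Layout from the Table 1 footnote: YY, MM, lab ID, serotype code, "-", cluster number.
# Assumption: lab ID is two letters and the serotype code is three
# alphanumeric characters, as in the codes listed in Table 1.
_CODE_RE = re.compile(r"^(\d{2})(\d{2})([A-Z]{2})([A-Z0-9]{3})-(\d+)$")


def parse_outbreak_code(code: str) -> OutbreakCode:
    """Split a PulseNet-style outbreak code into its components."""
    match = _CODE_RE.match(code)
    if match is None:
        raise ValueError(f"unrecognized outbreak code: {code!r}")
    yy, mm, lab_id, serotype_code, cluster = match.groups()
    month = int(mm)
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in outbreak code: {code!r}")
    return OutbreakCode(2000 + int(yy), month, lab_id, serotype_code, int(cluster))


if __name__ == "__main__":
    for example in ("1601MLEXK-1", "1608MIEC5-1"):
        print(example, "->", parse_outbreak_code(example))
```

Parsing the code in this way makes it straightforward to group outbreaks by detection month or by multi-state status when summarizing a table such as Table 1.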

Table 2. Range of hqSNP- and allele-based pairwise genomic differences between outbreak isolates using PulseNet 2.0.

Outbreak Number Assigned in Study	Outbreak Code	hqSNP	cgMLST	wgMLST (Chrom)
01	1601MLEXK-1	0–2	0–2	0–5
02	1603VAEXH-1	0–1	0–0	0–2
03	1608MIEC5-1	0–3	0–3	0–4
04	1912IAEXW-1	0–7	0–2	0–3
05	1911MNEXH-1	0–2	0–2	0–4
06	1909CAEXH-1	0–7	0–3	0–7
07	2206MLEXD-1	0–19	0–8	0–16
08	1905MLEXK-1	0–2	0–3	0–5
09	1902MLEVC-1	0–2	0–1	0–3
10	1808MLEVC-1	0–1	0–1	0–1
11	1712MLEXH-1	0–5	0–2	0–5

Table 3. Summary table of metrics (regression analysis).

	Regression Equation; [95% CI for Slope]	R²	Pearson Correlation Coefficient; [95% CI]
cgMLST vs. hqSNP	y = 0.432x + 0.08; [0.426, 0.437]	0.75	0.86; [0.858, 0.870]
wgMLST (chrom) vs. hqSNP	y = 0.966x + 0.29; [0.956, 0.975]	0.82	0.91; [0.904, 0.911]
cgMLST vs. wgMLST (chrom)	y = 1.914x + 0.35; [1.895, 1.933]	0.81	0.90; [0.895, 0.903]
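
The fitted lines in Table 3 can be read as approximate conversion rules between methods. The short Python sketch below evaluates them for a given number of differences. It assumes, following the abstract, that x is the hqSNP difference in the first two rows and the cgMLST difference in the third; the dictionary and function names are hypothetical. The output is only a rough expectation within the range of differences observed in this study, not a cluster-definition threshold.

```python
# (slope, intercept) pairs copied from Table 3, keyed by (x method, y method).
# Assumption: axes are assigned as described in the lead-in paragraph.
FITS = {
    ("hqSNP", "cgMLST"): (0.432, 0.08),
    ("hqSNP", "wgMLST (chrom)"): (0.966, 0.29),
    ("cgMLST", "wgMLST (chrom)"): (1.914, 0.35),
}


def predict(x_method: str, y_method: str, x_value: float) -> float:
    """Expected y-method difference for a given x-method difference."""
    slope, intercept = FITS[(x_method, y_method)]
    return slope * x_value + intercept


if __name__ == "__main__":
    x_value = 10  # illustrative number of pairwise differences
    for x_method, y_method in FITS:
        y_value = predict(x_method, y_method, x_value)
        print(f"{x_value} {x_method} differences ~ {y_value:.2f} {y_method} differences")
```

For example, under these fits a pair of isolates separated by 10 hqSNPs would be expected to differ by roughly 4 cgMLST alleles, which is in line with the narrower cgMLST ranges reported in Table 2.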

Table 4. Summary table of metrics (phylogenetic clustering analysis).

	Range of BGI Values Across Outbreaks	Range of CCC Values Across Outbreaks
cgMLST vs. hqSNP	0.413–0.987	0.981–1.00
wgMLST (chrom) vs. hqSNP	0.354–0.936	0.877–1.00
cgMLST vs. wgMLST (chrom)	0.686–0.964	0.979–1.00
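
To make the CCC values in Table 4 more concrete, the sketch below computes a cophenetic correlation between two single-linkage clusterings using only the Python standard library. It is an illustration of the idea rather than the workflow used in this study: the two distance matrices are invented toy data, the function names are hypothetical, and single linkage is assumed because it is the linkage named in the Figure 7 caption.

```python
from itertools import combinations
from math import sqrt


def single_linkage_cophenetic(dist):
    """Cophenetic distances under single linkage.

    For single linkage, the height at which two tips first join equals the
    smallest possible 'largest step' on any path between them, so a
    minimax variant of Floyd-Warshall recovers the cophenetic matrix.
    """
    n = len(dist)
    coph = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = max(coph[i][k], coph[k][j])
                if via_k < coph[i][j]:
                    coph[i][j] = via_k
    return coph


def pearson(xs, ys):
    """Pearson correlation; undefined if either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def cophenetic_correlation(dist_a, dist_b):
    """Correlate the cophenetic distances of two trees over the same tips."""
    coph_a = single_linkage_cophenetic(dist_a)
    coph_b = single_linkage_cophenetic(dist_b)
    pairs = list(combinations(range(len(dist_a)), 2))
    return pearson([coph_a[i][j] for i, j in pairs],
                   [coph_b[i][j] for i, j in pairs])


# Toy data (not from the study): three outbreak isolates and one sporadic isolate.
toy_hqsnp = [[0, 1, 2, 40], [1, 0, 1, 41], [2, 1, 0, 42], [40, 41, 42, 0]]
toy_cgmlst = [[0, 0, 1, 18], [0, 0, 1, 19], [1, 1, 0, 20], [18, 19, 20, 0]]

if __name__ == "__main__":
    print(f"CCC = {cophenetic_correlation(toy_hqsnp, toy_cgmlst):.4f}")
```

A value close to 1 indicates that the two methods place the same tips at proportionally similar heights, even when the absolute numbers of SNP and allele differences are on different scales.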

Table 5. Summary table of metrics (K-means analysis).

	Range of Maximum Silhouette Scores at K = 2 Across Outbreaks	Range of Average Silhouette Widths for Outbreak Isolate Groups	Range of Average Silhouette Widths for Sporadic Isolate Groups
cgMLST	0.81–0.93	0.92–0.99	0.34–0.93
wgMLST (chrom)	0.81–0.97	0.89–0.99	0.32–0.92
hqSNP	0.87–0.99	0.87–0.99	0.35–0.91
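
The silhouette widths summarized in Table 5 can be illustrated with a small standard-library Python sketch. For each isolate, the width is (b − a) / max(a, b), where a is its mean distance to the other members of its own group and b is its mean distance to the nearest other group. The distance matrix below is invented toy data, the function names are hypothetical, and the group labels are supplied directly rather than inferred by K-means as in the study.

```python
def silhouette_widths(dist, labels):
    """Per-isolate silhouette widths from a distance matrix and group labels.

    Requires at least two distinct labels. Isolates that are alone in
    their group are given a width of 0 by convention.
    """
    groups = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)

    widths = []
    for i, label in enumerate(labels):
        own = [j for j in groups[label] if j != i]
        if not own:
            widths.append(0.0)
            continue
        # a: mean distance to the other members of the same group
        a = sum(dist[i][j] for j in own) / len(own)
        # b: smallest mean distance to any other group
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in groups.items()
            if other != label
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths


def mean_width_by_group(dist, labels):
    """Average silhouette width for each group label."""
    widths = silhouette_widths(dist, labels)
    totals = {}
    for width, label in zip(widths, labels):
        totals.setdefault(label, []).append(width)
    return {label: sum(vals) / len(vals) for label, vals in totals.items()}


# Toy data (not from the study): a tight outbreak group and two
# sporadic isolates that are distant from it and from each other.
toy_dist = [
    [0, 1, 2, 40, 50],
    [1, 0, 1, 41, 51],
    [2, 1, 0, 42, 52],
    [40, 41, 42, 0, 30],
    [50, 51, 52, 30, 0],
]
toy_labels = ["outbreak", "outbreak", "outbreak", "sporadic", "sporadic"]

if __name__ == "__main__":
    for label, width in mean_width_by_group(toy_dist, toy_labels).items():
        print(f"{label}: average silhouette width = {width:.2f}")
```

In this toy example the outbreak group has a high average width because its members are nearly identical, while the sporadic group has a much lower one because its members are unrelated to each other, which mirrors the pattern of wide sporadic-group ranges in Table 5.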

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
