Microbiome of Bacterially Impaired Watersheds: Distribution of Potential Bacterial Pathogens

: Bacterial impairment of freshwater systems is a commonly studied global problem. How-ever, studies on the relative distribution of bacterial pathogens in different impaired aquatic systems have been limited. Frequently, impaired freshwater systems are classiﬁed by the presence of fecal indicator bacteria (FIB) and the identiﬁcation of sources of fecal contamination through microbial source tracking. In this study, we assessed the relative abundance of DNA sequences related to potential human bacterial pathogens along with human fecal indicator bacteria in three impaired watersheds. These watersheds consistently showed a high abundance of FIB for the past several years. Using Illumina paired-end DNA sequencing of 16S rRNA gene amplicons, we observed variation in the relative distribution of DNA sequences related to Legionellaceae , Enterobacteriaceae and Bacteroidaceae families across different sites. We identiﬁed potential hotspots sites in these impaired water systems, which showed a relatively high abundance of pathogen-related DNA sequences. This study demonstrates the signiﬁcance of Next-Gen DNA sequencing for the initial screening of waterborne pathogens and the identiﬁcation of high-risk sites for preferential remediation efforts in impaired water systems. Secondly, the frequent temporal monitoring of speciﬁcally identiﬁed pathogens that are in high abundance in a watershed can help in the accurate prediction and prevention of disease outbreaks.


Introduction
Bacterial contamination of drinking and recreational water is a major environmental health concern [1,2]. According to the World Health Organization [3], every year waterborne diseases cause more than 1.5 million human deaths worldwide. Although most of these deaths are reported in developing countries, there are still many incidences of waterborne diseases in developed countries [4]. For example, in the United States, every year 75,000 deaths and hospitalizations are attributed to waterborne bacterial pathogens [5].
Bacterial pathogens, such as Salmonella, Shigella and pathogenic E. coli from human or animal fecal material can enter drinking and recreational water resources [6]. Fecal material from infected individuals can carry high loads of bacterial pathogens [7], such as up to 10 10 Salmonella cells and 10 9 Shigella cells per gram of human feces [8]. In addition to fecal pathogens, other pathogens can persist in water and cause disease under optimal growth conditions [9]. This highlights the need for continuous monitoring of the water systems for potential pathogens.
The health risks associated with water contamination are frequently tested through the presence of fecal indicator bacteria (FIB) using quick and cost-effective methods, such as IDEXX Colilert (Westbrook, ME, USA) and membrane filtration-based testing [10,11]. The presence of FIB, like E. coli, enterococci and fecal coliforms suggest fecal contamination of water systems. The detection of FIB has been used as the gold standard for determining bacterial impairment of watersheds. However, FIB testing does not provide information on the abundance or relative distribution of various waterborne pathogens [12][13][14][15][16].
Information on the distribution of different bacterial pathogens is critical for water resource managers to identify and prioritize the remediation efforts of high-risk sites. Considering the scarcity of freshwater resources, a pragmatic approach is to accurately assess the actual health risks associated with the potential use of a water resource based on the presence of actual pathogens rather than simply on the presence of FIB. Currently, 946,573 km of rivers and streams in the United States have been identified as impaired, mainly based on the presence of FIB [17], under Section 303 (d) of the United States Clean Water Act [18].
Secondly, the presence or quantification of a specific human pathogen in water is done through simple PCR or qPCR using target-specific primers. However, testing hundreds of different pathogens, such as Salmonella, Campylobacter, Legionella and pathogenic strains of E. coli and Vibrio vulnificus in a water sample without any prior knowledge is a very costly and time-consuming process. Therefore, an initial screening of site-specific pathogens is needed for the long-term temporal monitoring or quantification of concerned pathogens.
In the current study, we assessed the relative abundance of DNA sequences related to potential bacterial pathogens and human fecal indicator microorganisms in 65 water samples from three bacterially impaired watersheds. We used Next-Gen DNA sequencing for the initial screening of bacterial pathogens in impaired watersheds. Two surface water systems, the Little Sac watershed (LSW) and Pearson Creek (PC), were sampled spatially to determine if there were differences in the relative distribution of pathogens at various sites along with these impaired water systems. The third water system was a karst system at Sequiota spring (SEQ spring).
Water samples from the SEQ spring were collected intermittently to determine if variations in the relative abundance of potential pathogens changed temporally. The three water systems were chosen for this study because, previously, these sites consistently showed a high abundance of FIB as determined using the IDEXX Colilert testing approach [19][20][21][22][23][24]. To date, testing of these watersheds has been focused on FIB, which provides little information on the presence of pathogens and the potential health risks that are associated with these bacterially impaired watersheds. This study investigated the relative abundance of potential pathogen-related sequences associated with the three impaired watersheds and how they are distributed spatially and temporally.

Study Area
The ten sites sampled for this project are in three different watersheds surrounding Springfield, Missouri ( Figure 1). Five sites are in the Upper Little Sac watershed, four sites are in the Pearson Creek watershed, and one site is in the Galloway Creek watershed (18.2 km 2 ). Site-specific drainage areas range from 11-609 km 2 , and land uses in the contributing area are a mix of urban and agricultural consisting mainly of livestock and forage crop production.
The SEQ spring differs as it is an outlet for a spring with a recharge area of approximately 12.5 km 2 , draining significant portions of urbanized southeast Springfield. The samples from SEQ spring were collected three times over a period of 1.5 years to evaluate temporal changes in the microbial community (Table S1 in Supplementary Materials).
Two of the three watersheds (LSW and PC) have been classified as bacterially impaired and were placed on 303 (d) of the United States Clean Water Act [18] in 1998 and 2007, respectively [25]. Although SEQ spring is not on the 303 (d) list of bacterial impairments, it has shown a high abundance of indicator microorganisms from 1986 to 2021 [26][27][28][29]. The underlying geology of the region is mostly limestone in which a karst landscape has formed where sinkholes, losing streams, and springs are common [30]. Dye tracing experiments have linked several ponds and losing sections of Galloway Creek to the spring, with flow rates varying from 0.6-11.0 million gallons per day [19,31]. 2007, respectively [25]. Although SEQ spring is not on the 303 (d) list of bacterial impairments, it has shown a high abundance of indicator microorganisms from 1986 to 2021 [26][27][28][29]. The underlying geology of the region is mostly limestone in which a karst landscape has formed where sinkholes, losing streams, and springs are common [30]. Dye tracing experiments have linked several ponds and losing sections of Galloway Creek to the spring, with flow rates varying from 0.6-11.0 million gallons per day [19,31].  (yellow squares) and Sequiota Spring (aqua triangle) in the James River Basin were selected for Next-Gen DNA sequencing. The three sites (PR_102, PC_Cat and Seq spring) that displayed a high abundance of pathogen-related sequences are marked with black asterisk ( * ). Blue arrows indicate the direction of water flow.

Water Collection and Processing
Overall, 65 water samples were collected in different 5-gallon sterilized polypropylene carboys and transported back to the laboratory on ice. A detailed description of sam- * * * 2007, respectively [25]. Although SEQ spring is not on the 303 (d) list of bacterial impairments, it has shown a high abundance of indicator microorganisms from 1986 to 2021 [26][27][28][29]. The underlying geology of the region is mostly limestone in which a karst landscape has formed where sinkholes, losing streams, and springs are common [30]. Dye tracing experiments have linked several ponds and losing sections of Galloway Creek to the spring, with flow rates varying from 0.6-11.0 million gallons per day [19,31]. and Sequiota Spring (aqua triangle) in the James River Basin were selected for Next-Gen DNA sequencing. The three sites (PR_102, PC_Cat and Seq spring) that displayed a high abundance of pathogen-related sequences are marked with black asterisk ( * ). Blue arrows indicate the direction of water flow.

Water Collection and Processing
Overall, 65 water samples were collected in different 5-gallon sterilized polypropylene carboys and transported back to the laboratory on ice. A detailed description of sam- * * * ). Blue arrows indicate the direction of water flow.

Water Collection and Processing
Overall, 65 water samples were collected in different 5-gallon sterilized polypropylene carboys and transported back to the laboratory on ice. A detailed description of sampling locations, sampling dates and volume of water filtered are reported in the Supplemental Table (Table S1). The E. coli and total coliforms were quantified using IDEXX Quanti-Tray/2000 systems (Table S2).

DNA Extraction
Water samples were filtered through 0.22 µm Sterivex filters (Millipore Corporation, Burlington, MA, USA) using a peristaltic pump (Masterflex, Cole-Pamer Co, Vernon Hills, IL, USA). Approximately, 0.75-1.75 L of water were filtered for each sample (Table S1). The filters were stored at −20 • C until further processing. Filters were then cut into small fragments using sterile scissors and placed into 50 mL tubes. Sterile water (25 mL) was added to the tubes containing fragments of filter and vortexed for five minutes to detach the bacterial cells from the filter. Suspended cells in water were harvested by centrifugation at 10,000 rpm for five minutes, and DNA was extracted using Qiagen's DNeasy PowerLyzer PowerSoil kits (Mo Bio, Carlsbad, CA, USA). DNA was eluted with 25 µL sterile water and stored at −20 • C until further processing.

DNA Sequencing
Bacterial communities from each water sample were assessed using Illumina MiSeq paired-end DNA sequencing. A two-step PCR approach was used as described previously [32]. Briefly, in the first PCR, the V3-V5 region of bacterial 16S rRNA gene was amplified using primers F515 (5 -GTGCCAGCMGCCGCGG-3 ) and R907 (5 -CCGTCAATTCM-TTTRAGTTT-3 ). These universal bacterial primers were also attached with the Illumina sequencing primers that were targeted in the second PCR amplification. Each 25 µL PCR reaction contained 1X buffer, 0. The conditions for the first PCR were an initial 5 min denaturation at 95 • C, followed by 30 cycles of 95 • C for 45 s, an annealing step at 56 • C for 45 s, an extension at 72 • C for 45 s and a final extension at 72 • C for 7 min. As a standard PCR procedure, we ran positive control (E. coli DNA) and negative control (PCR grade sterilized water) along with each set of PCR reactions. The successful amplification of PCR samples along both controls were evaluated by gel electrophoresis and staining with ethidium bromide. Amplified PCR products were cleaned using ExoSap-IT PCR Cleanup System (ThermoFisher Scientific, Waltham, MA, USA) as per the manufacture's protocol.
Cleaned PCR products of the first PCR reaction were used as the templates for the second PCR step. In the second PCR, all reagents and their concentrations were the same as described above, except for the PCR primers. The primers used in the second PCR contained the Illumina sequencing adapters A and B along with the standard unique multiplex identifier sequences. The PCR conditions for the second PCR were: initial denaturation for 3 min at 95 • C, followed by 10 cycles of denaturation at 94 • C for 30 s, annealing at 60 • C for 30 s and extension at 72 • C for 30 s, with a final extension at 72 • C for 7 min. PCR products were quantified using a Nanodrop 2000 spectrophotometer (Ther-moFisher Scientific, Waltham, MA, USA) and pooled together in equimolar concentrations. Pooled PCR amplicons from all samples were purified using the Agencourt AMPure beads (Beckman Coulter, Brea, CA, USA). The purified PCR products were sequenced using Illumina MiSeq paired-end DNA sequencing.

Sequence Processing and Phylogenetic Analysis
Raw sequences ( Figure S1) were processed for initial quality control parameters, such as read length, ambiguous bases, removal of chimeric sequences etc. using Ribosomal Database Project II (http://rdp.cme.msu.edu; accessed on 1 September 2021) [33]. Highquality filtered DNA sequences were classified using the RDP naive Bayesian classifier 2.5 [34].
Bacterial sequences related to the Legionellaceae, Enterobacteriaceae and Bacteroidaceae families were extracted using Mothur [35]. These sequences were further assessed for additional quality control parameters, such as the presence of ambiguous bases ('N') and trimmed to a uniform sequence length using Sequencher 5.4.6 (Gene Codes Corp., Ann Arbor, MI, USA). Cleaned DNA sequences along with the reference sequences of these families from GenBank were aligned and clustered into operational taxonomic units (OTUs) at 98% similarity using RDP. Neighbor-joining based phylogenetic trees were then constructed using MEGA version 10.1.8 [36].
In addition to the OTU approach, we also assessed the data using the amplicon sequence variances (ASV). The overall distribution in bacterial communities did not greatly differ regardless of the approach (OUT or ASV; see the Supplemental Material for further details; Figure S2).

Statistical Analysis
Overall, the significance of differences among bacterial communities across different sites/times was assessed by analysis of similarity (ANOSIM, p < 0.05) and visualized with nonmetric multidimensional scaling (NMDS) plots generated in R version 3.6.0 [37]. The Bray-Curtis similarity data were square-root transformed prior to ANOSIM analysis.

Results and Discussion
Overall, Next-Gen DNA sequencing resulted in 2,815,170 good quality DNA sequences across all sites, and the average number of sequences per site are reported in (Table 1). We extracted all classified sequences related to Legionellaceae and Enterobacteriaceae families containing potential waterborne pathogens for detailed phylogenetic analysis. Likewise, we evaluated the distribution of DNA sequences related to the Bacteroidaceae family, which contains the biomarker bacterial species to detect human fecal contamination.

Distribution of Legionellaceae-Related Sequences
We identified 9767 sequences that were closely related to family Legionellaceae across all sampling locations. The overall abundance of Legionella-related sequences across different watersheds were 5776 sequences from Pearson Creek, 3433 from SEQ spring and 558 from Little Sac sites. These sequences were closely related to both pathogenic and non-pathogenic Legionella spp. from GenBank.
Representative sequences from each OTU containing 50 or more sequences were randomly selected to construct a phylogenetic tree to demonstrate the relative abundance of DNA sequences from various Legionella species (both pathogenic and non-pathogenic species) at different sites. In this phylogenetic analysis, we also included reference sequences of 23 Legionella spp. obtained from GenBank ( Figure 2). The phylogenetic analysis suggested the presence of nine distinct phylogenetic clusters (I to IX). Some sequences within clusters I, IV, VII and VIII were closely related to pathogenic species of Legionella, such as L. pneumophila, L. dumoffi, L. anisa, L. parisiensis, L. maceachernii, L. micdadei, L. rubrilucens, L. jordanis, L. hackeliae and L. birminghamensis. (Figure 2). Diversity 2022, 14, x FOR PEER REVIEW Figure 2. Neighbor-joining phylogenetic tree of partial 16S rRNA gene sequences (370 bp) to the Legionellaceae family from three impaired watersheds. Only OTUs (operational tax units) containing ≥ 50 sequences were used in this analysis. The DNA sequences of some pathogenic (*) and non-pathogenic species within the Legionellaceae family were also include analysis. The size (diameter) of the pie chart is proportional to the number of sequences w cluster. The size of a pie chart can only demonstrate the variation in sequences within a giv tershed, e.g., a pie chart for the LSW site cannot be compared with a pie chart from Pearson Numbers at the node reflect the bootstrap values.
In the SEQ spring samples, clusters I, III and IX made up most sequences. O three major clusters, the abundance of Legionella-related sequences was increased winter of 2019 (26% to 92%) compared to the summer sampling events. Frequently dences of Legionnaire's disease/Legionellosis are increased during the summer a seasons due to the preferred warmer temperatures [41][42][43]. Although disease ca crease more in the summer/fall, Legionella species can persist in the environment i tively the same abundance year-round [44].
Previous dye tracer studies at this site [19,31] suggested bacterial contaminati rived from old septic tanks and leaky sewer systems in the area. If the Legionella co nation is originated from the biofilms formed in the old corroded water pipes, th We also observed the presence of sequences related to non-pathogenic Legionella spp. within clusters I, IV, VII, VIII and IX were related to non-pathogenic species of Legionella ( Figure 2). Conversely, clusters III, V and VI were unique, i.e., without any closely related references DNA sequence from GenBank. Overall, 25 of 58 known Legionella species can be infectious to humans; however, the most frequently reported infections have been associated with L. pneumophila [38]. The relative abundance of different Legionella-related sequences greatly varied at different sites or sampling seasons in each watershed (Figure 2). For example, the five sites tested in the Little Sac watershed showed a relatively high abundance of Legionella-related sequences at the site PR_102 (50% to 80%). Site PR_102 resides in the city of Springfield and is under the direct influence of urbanization. The increase in Legionella-related sequences at this site could be due to bacterial contamination originating from the old sanitary sewer systems present in the northern part of the city.
Legionella spp. can survive or grow in corroded pipes by forming biofilms that sustain higher abundances of these bacteria [39]. Two of the most dominant clusters (I and VI) contained sequences that were closely related to three pathogenic species (L. dumoffii, L. anisa and L. bozemanii) and two non-pathogenic species (L. drancourtii and L. worsleiensis). The presence of Legionella-related sequences at the PR_102 site as compared to other locations in the Little Sac Watershed highlights the need for close monitoring of Legionella species at this location to prevent health risks.
Similarly, in the Pearson Creek watershed at site PC_Cat (Figure 2), we retrieved a relatively high abundance of Legionella-related sequences (35% to 100%). The abundance of Legionella-related sequences at an upstream site (PC_SHYY) and the two downstream sites (PC_193 and PC_148) were relatively very low compared to the PC_Cat site. The upstream site is not under urban influence, whereas other sites have moderate levels of urbanization. The relatively high abundance of Legionella sequences at the PC_Cat and not at the other two downstream sites suggests a site-specific source of contamination.
Overall, at the PC_Cat site, cluster I-and VIII-related sequences were in higher abundance and Cluster VIII contained pathogenic species, including L. maceachemii, L. israelensis and L. jordanis (Figure 2). These pathogens frequently cause pneumonia but have also been reported to cause Pontiac fever [40]. Plumbing systems can harbor these bacteria in biofilms and allow them to become abundant [39]. Since this stream runs through residential areas, it is possible that leaky domestic sewer systems or leaking plumbing systems may be contributing to this influx.
In the SEQ spring samples, clusters I, III and IX made up most sequences. Of these three major clusters, the abundance of Legionella-related sequences was increased in the winter of 2019 (26% to 92%) compared to the summer sampling events. Frequently, incidences of Legionnaire's disease/Legionellosis are increased during the summer and fall seasons due to the preferred warmer temperatures [41][42][43]. Although disease cases increase more in the summer/fall, Legionella species can persist in the environment in relatively the same abundance year-round [44].
Previous dye tracer studies at this site [19,31] suggested bacterial contamination derived from old septic tanks and leaky sewer systems in the area. If the Legionella contamination is originated from the biofilms formed in the old corroded water pipes, then the low water flow during winter months might have contributed to their increased concentration in winter. Generally, less water is used for domestic purposes during winter as compared to summer months. Secondly, the availability of protozoan hosts may be influencing the increase in Legionella-related sequences in the winter at this recharge spring [45].
Generally, Legionella spp. are respiratory pathogens that can infect alveolar macrophages. Frequently Legionnaire's cases have been associated with the L. pneumophila. However, nearly 40% of other known Legionella species can cause infections in humans [38]. In aquatic environments, Legionella spp. can be detected in lakes and streams; however, their numbers are relatively very low [46]. The relatively high abundance of Legionella cells in water can become a major health concern by aerosolization of Legionella cells in an environment [47].
In the United States, around 10,000 cases of Legionnaires disease were reported in 2018 with a fatality rate of 10% [48]. In the current study, we observed a relatively high abundance of both pathogenic and non-pathogenic Legionella spp.-related sequences at three locations (PR_102, PC_Cat and SEQ spring) as compared to other sites in these watersheds. These sites need to be further explored for the identification and remediation of the sources of Legionella contamination to eliminate the potential health risks associated with the recreational use of these sites.
It is important to mention that partial 16S rRNA sequencing may not be able to accurately assign phylogeny at the species and strain levels. Further testing of specific pathogenic species through qPCR using species-specific marker genes could help to quantify different Legionella species in impaired aquatic environments. On the other hand, we observed that the assignment of partial reference sequence corroborated well with the phylogenetic placement of Legionella species from other studies based on the complete 16S rRNA gene sequences [49].

Distribution of Enterobacteriaceae-Related Sequences
The phylogenetic analyses of 6760 Enterobacteriaceae-related sequences retrieved from the three different aquatic systems could be divided into three distinct clusters (I to III). The overall abundance of Enterobacteriaceae-related sequences across different aquatic systems were 5367 sequences from Pearson Creek, 703 from SEQ spring and 609 from Little Sac sites. Most of the sequences were classified into two clusters, I and III (Figure 3).  Sequences in clusters II and III were related to references of pathogenic Yersinia spp. and Plesiomonas shigelloides, respectively. Plesiomonas shigelloides causes gastroenteritis infections in humans and is frequently reported to originate from freshwater environments and through the consumption of aquatic foods, such as shellfish and catfish [55]. Likewise, Yersinia spp. (Y. enterocolitica and Y. pseudotuberculosis) cause gastrointestinal infections and are reported to originate from untreated water that is contaminated with fecal material [56,57]. Furthermore, Galindo et al., [58] reported that spreading of Y. enterocolitica through the fecal-oral route can significantly increase the human-human transmission of this pathogen.
Overall, the relative abundance of Enterobacteriaceae-related sequences of the five sites  We used OTUs containing > 50 sequences (3213 sequences) along with the 11 references sequences of known human pathogens in this family from GenBank to construct a phylogenetic tree ( Figure 3). As with the Legionella sequences, we detected variation in site-specific abundance of Enterobacteriaceae-related sequences. At sites PR_102 and PC_Cat, cluster-I-related sequences ranged from 79% to 87% of the Enterobacteriaceae sequences. Similarly, these two sites made up 59% and 85% of the sequences in cluster III (Figure 3).
Cluster I contained references from pathogenic Salmonella spp., Shigella spp. and E. coli. Salmonella species are intracellular parasites that usually invade the gastrointestinal tract and cause salmonellosis. Some species of Salmonella, such as S. enterica cause typhoid fever, which is frequently associated with drinking water contaminated with fecal material [50].
Approximately 0.20 to 1.3 billion infections and about 3 million deaths are caused by nontyphoid species of Salmonella each year [51]. The two serovars S. typhi and S. paratyphi cause 21.7 million infections and 217,000 deaths each year, mostly in developing counties [51,52]. Similarly, Shigella spp. and E. coli cause gastrointestinal diseases, such as diarrhea and dysentery. Both organisms are reported to annually cause about 0.5 billion disease incidences worldwide [53]. Some species of Shigella and E. coli genera in our study were assigned the same OTU because the 16S rRNA gene sequences from these genera were highly similar and could not be reliably distinguished [54].
Sequences in clusters II and III were related to references of pathogenic Yersinia spp. and Plesiomonas shigelloides, respectively. Plesiomonas shigelloides causes gastroenteritis infections in humans and is frequently reported to originate from freshwater environments and through the consumption of aquatic foods, such as shellfish and catfish [55]. Likewise, Yersinia spp. (Y. enterocolitica and Y. pseudotuberculosis) cause gastrointestinal infections and are reported to originate from untreated water that is contaminated with fecal material [56,57]. Furthermore, Galindo et al., [58] reported that spreading of Y. enterocolitica through the fecal-oral route can significantly increase the human-human transmission of this pathogen.
Overall, the relative abundance of Enterobacteriaceae-related sequences of the five sites from LSW suggests that most of these sequences were originated from site PR_102. The influx of Enterobacteriaceae-related sequences at PR_102 site could be originated from human sources of fecal contamination. As discussed above, the site PR_102 is under the direct influence of urbanization and a previously MST-based study of this site suggested the presence of human fecal contamination [22].
On the other hand, other sources of fecal contamination, such as fecal material of deer, birds, rodents, fish and turtles, have also been suggested to contain Enterobacteriaceae-related bacteria, such as Salmonella and Yersinia spp. [59][60][61][62][63]. However, at the other locations in LSW that are under agricultural influence, we did not detect a relatively high abundance of Enterobacteriaceae-related bacteria. Likewise, for Pearson Creek, most Enterobacteriaceae-related sequences originated from site PC_Cat. Overall, PR_102 in LSW and PC_Cat in Pearson Creek showed high abundances of both Legionella and Enterobacteriaceae-related sequences.
Most of the Enterobacteriaceae-related sequences from the SEQ spring samples were related to phylogenetic cluster I, and a few sequences were related to cluster III. During summer 2020, we observed an increase in the number of Enterobacteriaceae sequences (56%) as compared to the summer of 2019 (21%) and the winter of 2019 (23%) (Figure 3). SEQ spring is in southeast Springfield, and this watershed is part of a karst landscape that has several sinkholes, losing streams and springs in the area [30].
Its recharge area is approximately 18.2 km 2 , and it drains a significant portion of urbanized southeast Springfield. Previous studies reported an increase in Enterobacteriaceae abundance during the warmer months of the year, as well as increases due to sewage contamination [64][65][66]. Owen et al. [29] suggested the presence of human fecal contamination in this cave, and the city of Springfield has been working to remediate aging sanitary sewer infrastructure in the recharge area to help improve water quality and reduce human health risks in local streams.

Distribution of Bacteroidaceae-Related Sequences
We also evaluated variation in the relative abundance of Bacteroidaceae-related sequences because species with this family are commonly used as a marker for the identification of fecal contamination sources [67]. Previous studies [22,23,29] evaluated these sites for various sources of fecal contamination. The PR_102 site in LSW, the PC_SHYY site in Pearson Creek and the SEQ spring have shown the presence of human fecal contamination.
We retrieved 4357 Bacteroides-classified sequences from the different watersheds. SEQ spring had 2693 sequences, followed by Pearson Creek (1598 sequences) and LSW (56 sequences) (Figure 4). The phylogenetic analysis suggested the presence of two major phylogenetic clusters within the Bacteroidaceae sequences. Cluster I (OTU1) contained the reference sequence Bacteroides dorei, which is commonly used as a marker gene for human fecal contamination and closely related sequences. This cluster contained most of the sequences retrieved from LSW and SEQ spring. In contrast to this, cluster II contained twice as many sequences as cluster I of the Pearson Creek samples. We also assessed the phylum-level distribution of all good quality classified sequences across different watersheds (Figures S3 and S4) and measured the overall bacterial species richness and diversity across different sites (Table 1).

Figure 4.
Neighbor-joining phylogenetic tree of partial 16S rRNA gene sequences (370 bp) related to the Bacteroidaceae family from three impaired watersheds. Only OTUs (operational taxonomic units) containing ≥ 50 sequences were used in this analysis. The DNA sequences of Bacteroides dorei that is used a marker organism for human fecal contamination in water was also included in the phylogenetic analysis. The size of the pie chart is proportional to the number of sequences within a cluster. The size of a pie chart can only demonstrate the variation in sequences within a given watershed, e.g., a pie chart from the LSW site cannot be compared with pie charts from Pearson creek due to the differences in the sampling time and number of DNA sequences retrieved across different watersheds. Numbers at the node reflect the bootstrap values.

Overall Bacterial Diversity
The difference in bacterial diversity across different sites was evaluated through the Shannon diversity index and Chao 1 estimator of species richness as described previously [69]. Shannon diversity, which considers both the species richness and species relative abundance was significantly higher at the PR_102 site within the Little Sac watershed and site PC_Cat within the Pearson Creek watershed (Table 1). Both sites displayed a relatively high abundance of pathogen-related sequences as compared to other sites. However, overall bacterial diversity was not significantly different across different sites. The Sequiota Spring site displayed significantly lower diversity during the winter sampling events, compared to the summer samplings ( Table 1).
The Chao 1 estimator, which describes the total species richness, was higher at the PR_102 site within LSW and summer 2020 samples from Sequiota Spring ( Table 1). The latter can be explained by Enterobacteriacea-related sequences being higher in summer 2020 samples. Sequences related to Legionellacea were higher during winter 2019.

Conclusions
In summary, using Next-Gen DNA sequencing, we identified three potential high health-risk sites (PR_102, PC_Cat and SEQ spring) that displayed a relatively high abundance of sequences related to Legionellaceae and Enterobacteriaceae families that contained known human pathogenic species. This information can be valuable for water resource  . Neighbor-joining phylogenetic tree of partial 16S rRNA gene sequences (370 bp) related to the Bacteroidaceae family from three impaired watersheds. Only OTUs (operational taxonomic units) containing ≥50 sequences were used in this analysis. The DNA sequences of Bacteroides dorei that is used a marker organism for human fecal contamination in water was also included in the phylogenetic analysis. The size of the pie chart is proportional to the number of sequences within a cluster. The size of a pie chart can only demonstrate the variation in sequences within a given watershed, e.g., a pie chart from the LSW site cannot be compared with pie charts from Pearson creek due to the differences in the sampling time and number of DNA sequences retrieved across different watersheds. Numbers at the node reflect the bootstrap values.
Overall, we observed a similar trend in the distribution of Bacteroidaceae sequences within cluster I at different sites, as seen in the case of Legionella and Enterobacteriaceaerelated sequences. The PR_102 site in LSW, PC_Cat in Pearson creek and summer 2020 samples from SEQ spring contained the most sequences in this cluster. Our results corroborated the previous MST report-based studies for sites PR_102 and SEQ spring samples [22,29]. For SEQ spring, summer 2020 made up 84% of the sequences in cluster I.
A previous MST-based study on the SEQ spring [29] reported a relatively high abundance of the human fecal markers in the summer of 2020 compared to the other two sampling periods. This validates our results, suggesting that the pathogens present in this system could be coming from the leaky sewer lines. In contrast, the PC_Cat site in Pearson Creek did not previously display human fecal contamination [23] despite the presence of high FIB and pathogen-related sequences.

Distribution of Other Genera Containing Pathogenic Species
In addition to the three bacterial families mentioned above, we also evaluated the distribution of other potential pathogenic bacterial genera (Table S3). We observed the presence of Pseudomonas-, Acinetobacter-, Clostridiumand Aeromonas-related sequences. Although some species within these genera are pathogenic, most of these species that can persist in aquatic environments are non-pathogenic [68]. Further testing is needed to determine if pathogenic species within these genera are present at these sampling sites. DNA sequences related to other bacterial genera that contain known pathogens, such as Campylobacter, Peptoclostridium and Helicobacter, were not detected at any of our sites.
We also assessed the phylum-level distribution of all good quality classified sequences across different watersheds (Figures S3 and S4) and measured the overall bacterial species richness and diversity across different sites (Table 1).

Overall Bacterial Diversity
The difference in bacterial diversity across different sites was evaluated through the Shannon diversity index and Chao 1 estimator of species richness as described previously [69]. Shannon diversity, which considers both the species richness and species relative abundance was significantly higher at the PR_102 site within the Little Sac watershed and site PC_Cat within the Pearson Creek watershed (Table 1). Both sites displayed a relatively high abundance of pathogen-related sequences as compared to other sites. However, overall bacterial diversity was not significantly different across different sites. The Sequiota Spring site displayed significantly lower diversity during the winter sampling events, compared to the summer samplings ( Table 1).
The Chao 1 estimator, which describes the total species richness, was higher at the PR_102 site within LSW and summer 2020 samples from Sequiota Spring ( Table 1). The latter can be explained by Enterobacteriacea-related sequences being higher in summer 2020 samples. Sequences related to Legionellacea were higher during winter 2019.

Conclusions
In summary, using Next-Gen DNA sequencing, we identified three potential high health-risk sites (PR_102, PC_Cat and SEQ spring) that displayed a relatively high abundance of sequences related to Legionellaceae and Enterobacteriaceae families that contained known human pathogenic species. This information can be valuable for water resource managers to preferentially focus remediation efforts at those three specific sites. The seven other locations in impaired watersheds pose relatively fewer risks to health due to the absence or very low abundance of pathogen-related sequences.
Considering the scarcity of water resources, these seven sites could be potentially used for recreational purposes, such as fishing, boating and swimming purposes. Future studies in the area will be focused on the temporal monitoring of these pathogens using specific primers in qPCR. In addition to the spatial variation in bacterial contamination, the SEQ spring site also showed the presence of pathogen-related sequences in high abundance that varied over time. Overall, this study demonstrates the significance of Next-Gen DNA sequencing for an initial screening of high-risk sites based on the relative abundance of pathogen-related sequences in bacterially impaired watersheds.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/d14020096/s1; Table S1. Details of sampling sites and volumes of water filtered across different samples; Table S2. Abundance of E. coli cells/100mL of water sample quantified using IDEXX Quanti-Tray/2000 systems. The range of E. coli cells reported here represents the cells counted at two different sampling times that were about two weeks apart (only done for IDEXX testing); Table S3. Sequences related to pathogen-containing genera across the three watersheds; Figure S1. Little sac and Pearson creek raw data files for forward and reverse reads; Figure S2. (A). Operational taxonomic unit based (OTU) based Non-metric multidimensional scaling plot of Pearson creek (triangles) and Little Sac (squares) sites, using Bray-Curtis similarity index. (B). OTU-based Non-metric multidimensional scaling plot of Sequiota spring temporal sampling, based on Bray-Curtis similarity index. (C). Amplicon Sequence Variance (ASV) based Non-metric