Application of oligonucleotide microarrays for bacterial source tracking of environmental Enterococcus sp. isolates.

In an effort towards adapting new and defensible methods for assessing and managing the risk posed by microbial pollution, we evaluated the utility of oligonucleotide microarrays for bacterial source tracking (BST) of environmental Enterococcus sp. isolates derived from various host sources. Current bacterial source tracking approaches rely on various phenotypic and genotypic methods to identify sources of bacterial contamination resulting from point or non-point pollution. For this study, Enterococcus sp. isolates originating from deer, bovine, gull, and human sources were examined using microarrays. Isolates were subjected to BOX PCR amplification and the resulting amplification products were labeled with Cy5. Fluorescent-labeled templates were hybridized to in-house constructed nonamer oligonucleotide microarrays consisting of 198 probes. Microarray hybridization profiles were obtained using the ArrayPro image analysis software. Principal Components Analysis (PCA) and Hierarchical Cluster Analysis (HCA) were compared for their ability to visually cluster microarray hybridization profiles based on the environmental source from which the Enterococcus sp. isolates originated. The PCA was visually superior at separating origin-specific clusters, even for as few as 3 factors. A Soft Independent Modeling (SIM) classification confirmed the PCA, resulting in zero misclassifications using 5 factors for each class. The implication of these results for the application of random oligonucleotide microarrays for BST is that, given the reproducibility issues, factor-based variable selection such as in PCA and SIM greatly outperforms dendrogram-based similarity measures such as in HCA and K-Nearest Neighbor (KNN).


Introduction
As the number of beach closings and advisories continues to rise, so does the public's concern regarding microbial pollution in recreational waters. In a survey of more than 230 U.S. coastal and Great Lakes communities, there were at least 13,410 days of beach closings or advisories during 2001 [1]. The majority of beach closings and advisories were based on the presence of elevated levels of fecal contamination as measured by fecal bacterial indicators, such as Escherichia coli and enterococci. Under section 303(d) of the 1972 Clean Water Act, states, territories, and authorized tribes are required to develop pollutant-specific lists of impaired waters and may be required to establish a total maximum daily load (TMDL) for those impaired waters [2]. TMDLs specify the maximum amount of a pollutant that a water body can receive and still meet water quality standards. Fecal coliforms are frequently listed as an impairment on many states' 303(d) lists of water-quality impairments [3]. While TMDLs have historically focused on chemical impairments, more attention is now being focused on microbial impairments. Recently, the EPA published an extensive protocol for developing pathogen TMDLs [2]. Currently, there are several regional pilot projects underway aimed at establishing fecal coliform TMDLs for impacted watersheds [4].
Reducing the loads of fecal contamination can be problematic because the pollution sources are often unknown or are non-point sources. Non-point sources of microbial fecal pollution are mobilized by rain/snow events and can include urban litter, agricultural runoff, failing sewer lines, malfunctioning septic systems, and domestic and wildlife excrement. Implementation of best management practices (BMPs) for TMDL compliance is dependent upon accurately identifying the source(s) of the impairment. Source tracking of non-point sources of microbial pollution, specifically indicator bacteria, has been generically referred to as bacterial source tracking (BST) [5] or microbial source tracking (MST) [6,7] and can be accomplished using a collection of multidisciplinary bacterial sub-typing methods. In addition to determining the origin of fecal contamination, BST methods can differentiate between human and non-human sources of microbial pollution [6,7], which can aid in generating more accurate risk assessments for managing the risk posed by microbial pollution.
BST methods can be divided into two general groups, 1) phenotypic or biochemical-based methods, and 2) genotypic or molecular-based methods [7]. Of the phenotypic methods, multiple antibiotic resistance (MAR) analysis has been reported the most and has been shown to be successful in 1) discriminating human and animal sources of E. coli or fecal streptococci [8,9,10] and, 2) further discriminating animal sources by animal type [11]. This method involves isolating and culturing target indicator organisms from various sources and locations to create a reference library. These isolates are subsequently replica plated on selective media containing multiple antibiotics at a range of concentrations.
Antibiotic susceptibilities are characterized, subjected to discriminant analysis and compared to a reference antibiotic susceptibility library to determine identity. Reliability of the method is determined by analyzing isolates as both standards and as unknowns. The number of isolates assigned to the correct categories divided by the total number of isolates is referred to as the average rate of correct classification (ARCC) [12]. ARCC values for this method range from 62% to 94% when individuals are compared. Despite the success of this method in simple watersheds [11], some researchers have indicated that MAR lacks the sensitivity, reproducibility, and host specificity that is needed for BST [13].
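The ARCC definition given above is a simple ratio, and a minimal sketch may make it concrete. The function name `arcc` and the eight example labels are hypothetical and are not data from any of the cited studies.

```python
def arcc(true_sources, predicted_sources):
    """Average rate of correct classification: correct assignments / total isolates."""
    if len(true_sources) != len(predicted_sources) or not true_sources:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(t == p for t, p in zip(true_sources, predicted_sources))
    return correct / len(true_sources)

# Hypothetical labels for eight library isolates re-analysed as unknowns.
true_sources = ["human", "human", "bovine", "bovine", "deer", "deer", "gull", "gull"]
predicted = ["human", "bovine", "bovine", "bovine", "deer", "deer", "gull", "deer"]

# Six of the eight assignments match, so this prints "ARCC = 75.0%".
print(f"ARCC = {arcc(true_sources, predicted):.1%}")
```

Because the same isolates serve as both standards and unknowns, a value computed this way describes how well the library classifies itself and may overstate performance on new isolates.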
In contrast to the limited number of phenotypic subtyping methods, numerous genotypic methods have been described including ribotyping [14,15,16], length heterogeneity polymerase chain reaction (LH-PCR), terminal restriction fragment length polymorphism (T-RFLP) PCR [17,18], repetitive PCR (rep-PCR) [19], denaturing gradient gel electrophoresis (DGGE) [20,21], pulsed-field gel electrophoresis (PFGE) [22,23,24], and amplified fragment length polymorphism (AFLP) [25]. Most of these molecular methods rely on PCR to interrogate a fraction of the target organisms' available genetic information. PCR amplification products are subsequently resolved by gel electrophoresis and the resulting banding pattern may be compared to a reference library to determine the identity of the organism. ARCC values can approach 100% when using some of these methods, such as rep-PCR [19]. Despite the success of genotypic methods, there is an ongoing need in BST for increased resolving power to discriminate between closely related microorganisms. Newer technologies, like DNA microarrays, which have been employed for various environmental microbiology applications [26], could potentially increase the resolving power of BST analysis [27]. For example, DNA microarrays interrogate DNA samples at the DNA sequence level. In contrast, gel-based methods rely on DNA fragment sizing, a method in which fragments of similar size but different sequence can co-migrate. Unlike gel-based methods, which rely on size fractionation of banding patterns that are subject to positional variation, DNA microarray profiles are composed of physically immobilized, addressable spots. In addition, the resolving power of the microarray can be further improved by increasing the number of oligonucleotide elements on the microarray.
The methods and data analysis algorithms for the application of DNA microarrays towards BST are just starting to be developed. Recently, oligonucleotide microarrays were evaluated for their ability to differentiate 25 closely related Salmonella isolates [27]. Previously, the same authors used a similar microarray approach to discriminate closely related Xanthomonas pathovars [28]. In this study, we aim to build upon these findings and further the development of oligonucleotide microarrays for use in BST. Here we report the application of a microarray, consisting of 198 oligonucleotide elements, to discriminate 17 unique environmental isolates of Enterococcus sp. based on the host source of the bacteria.

Bacterial Isolates
A collection of 51 Enterococcus sp. isolates originating from bovine, deer, gull, and human sources was provided by Dr. Shiao Wang (University of Southern Mississippi; Hattiesburg, MS). The isolation and characterization of these strains have been described in detail elsewhere [29]. Isolates were routinely propagated in brain heart infusion liquid media (Becton Dickinson, San Jose, CA). High molecular weight genomic DNA for PCR analysis was obtained from each isolate using Qiagen's DNeasy Tissue Kit (Qiagen, Valencia, CA).

PCR Amplification and Labelling
PCR primer BOX A 1R (5' CTA CGG CAA GGC GAC GCT GAC G 3') was custom synthesized by Qiagen and targets repetitive extragenic palindromic BOX sequences [19]. Primer BOX A 1R was used to amplify select portions of the Enterococcus sp. isolate genomes to be used as target DNAs for microarray analysis. All PCR reactions and their subsequent microarray analysis were carried out in triplicate. Final reaction conditions were as follows: 10mM Tris, pH 8.3, 50mM KCl, 4.5mM MgCl2, 0.001 (w/v) gelatin, 0.2mM dNTPs, 2μM BOX A 1R primer, and 5U Taq polymerase (Promega, Madison, WI) in a final reaction volume of 100μl. A total of 100ng of genomic DNA was used as template for each reaction. Amplification was carried out in an MJ Research Tetrad thermocycler (MJ Research, Inc., Waltham, MA) programmed as follows: an initial step at 95°C for 2 min, followed by 35 cycles of 94°C for 3 sec, 92°C for 30 sec, 50°C for 60 sec, and 65°C for 8 min, and finally cooling to 4°C at the end of the last cycle. Ten-microliter portions from each reaction were electrophoresed through a 1.0% agarose gel in 1x TAE (40mM Tris-acetate, 1mM EDTA) running buffer and stained with SYBR Gold (Molecular Probes, Inc., Eugene, OR) for visualization to confirm amplification.
The remaining portions of each amplification reaction were ethanol precipitated with sodium acetate [30] and the resulting air-dried DNA pellets were re-suspended in 20μl Millipore water.
PCR products were aminoallyl (aa)-labeled as described previously [31]. Briefly, 3.3μl (3μg/μl) of random hexamers (Invitrogen, Carlsbad, CA) were added to each of the re-suspended PCR products and the final volume was brought up to 39μl. The sample was heated to 100°C for five minutes and immediately placed in an ice bath. Twenty units of DNA polymerase I Klenow fragment (New England BioLabs, Beverly, MA), 5μl of EcoPol (Klenow) buffer (New England BioLabs), and 2μl of 3mM dNTP/aa-labeling mix [100mM each dNTP, 50 mM aa-dUTP (Ambion, Austin, TX)] were added to the reaction and the reaction was incubated at 37°C overnight. The reaction was stopped by adding 5μl of 0.5M EDTA. Unincorporated aa-dUTPs and free amines were removed from each reaction using the QIAquick PCR purification kit (Qiagen) with the following modifications: PE wash buffer was replaced with a 5 mM KPO4, 80% ethanol solution and elution buffer was replaced with a 4mM KPO4 solution. Purified PCR templates were dried down in a vacuum centrifuge and resuspended in 4.5μl of 0.1M Na2CO3 buffer, pH 9.3. DNA samples were labeled with Cy5 dye by adding 4.5μl of a Cy5 mono-Reactive Dye Pack solution (Amersham Biosciences, Piscataway, NJ) and allowing the reaction to proceed in the dark at room temperature for two hours. The reaction was stopped by the addition of 35μl of 100mM NaOAc. Free dye was removed from the samples using the QIAquick PCR purification kit (Qiagen) according to the manufacturer's instructions. DNA samples were dried down and immediately processed for microarray analysis.

Microarray Oligonucleotide Probes and Fabrication
One hundred ninety-eight 9mers (Table 1), with an amine modification at the 5' end (Sigma-Genosys, The Woodlands, TX), were randomly selected from a list of 102,403 9mer sequences that conform to criteria described previously [28]. Briefly, 9mer sequences had GC contents of 44-55% and could not have: 1) repeats of four nucleotides or more, 2) inverted repeats of three nucleotides or more, 3) dual-terminal inverted repeats of three nucleotides or more, or 4) single-terminal inverted repeats of three nucleotides or more. In addition to these criteria, all 9mer sequence combinations that occurred in Enterococcus sp. rRNA genes present in GenBank as of 5/03 were eliminated. A Cy3-labeled control oligonucleotide, 5' TTG GCA GAA GCT ATG AAA CGA TAT GGG 3', with an amine modification at the 5' end, was used as a positional reference and hybridization control.
Microarrays were fabricated on aldehyde-coated glass microscope slides (Telechem International, Inc., Sunnyvale, CA) using the BioRad VersArray ChipWriter (BioRad, Hercules, CA) equipped with SMP3 Stealth microspotting pins (Telechem International, Inc.). Prior to fabrication, amine-modified oligonucleotides were transferred to a 384-well plate (Whatman, Clifton, NJ) and diluted to a concentration of 80 μM in 50% dimethyl sulfoxide (DMSO). Probes were printed in duplicate, using a 2-pin configuration, at a relative humidity of 60%. The resulting grid pattern and corresponding oligonucleotide probe locations are illustrated in Fig. 1. After printing, slides were baked for 45 minutes at 80°C, briefly washed with 0.2% SDS, and subsequently rinsed with reagent grade water. Free aldehyde groups were chemically blocked by soaking printed slides in a fresh NaBH4 solution [0.75g NaBH4 (Sigma, MO), 225 ml phosphate buffered saline (pH 7.0), 66.5ml 100% ethanol] for five minutes. Following chemical blocking, printed slides were momentarily dipped 3 times in 0.2% SDS, washed for one minute in reagent grade water, and individually spun dry in 50ml Falcon conical tubes (Fisher Scientific, MO) at 700rpm for 10 minutes in a tabletop centrifuge. Microarray substrates were stored at room temperature in a desiccator.
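The probe selection criteria can be sketched as a sequence filter followed by a random draw. This is only an illustration under stated assumptions, not the procedure of reference [28]: "repeats" is read here as single-base runs, an "inverted repeat" as a word followed without overlap by its reverse complement (which also covers the terminal cases), the GC window as 4 or 5 G/C bases out of 9, and the rRNA gene exclusion is not modeled. The candidate list it produces should therefore not be expected to match the 102,403 sequences reported.

```python
import itertools
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def has_homopolymer_run(seq, min_run=4):
    # Assumption: "repeat" means the same base min_run or more times in a row.
    return any(base * min_run in seq for base in "ACGT")

def has_inverted_repeat(seq, arm=3):
    # Assumption: an arm-length word whose reverse complement occurs
    # downstream without overlapping it.
    for i in range(len(seq) - 2 * arm + 1):
        if reverse_complement(seq[i:i + arm]) in seq[i + arm:]:
            return True
    return False

def passes_filters(seq):
    gc = sum(base in "GC" for base in seq)
    return (
        len(seq) == 9
        and gc in (4, 5)  # assumed reading of the 44-55% GC window for a 9mer
        and not has_homopolymer_run(seq)
        and not has_inverted_repeat(seq)
    )

# Enumerate all 4**9 nonamers, keep those passing, and draw 198 at random.
all_9mers = ("".join(p) for p in itertools.product("ACGT", repeat=9))
candidates = [s for s in all_9mers if passes_filters(s)]
probes = random.Random(0).sample(candidates, 198)  # fixed seed for repeatability
print(len(candidates), probes[:3])
```

Each criterion is a separate predicate so that an alternative reading of "repeat" or "inverted repeat" can be substituted without touching the rest.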

Microarray Hybridization
Prior to hybridization, printed slides were pre-hybridized in 0.1% SDS, 4X SSC (1X SSC: 0.15M NaCl, 0.015M trisodium citrate, pH 7.0), and 10mg/ml bovine serum albumin (BSA) in 50ml Falcon conical tubes at 40°C with slight agitation for 2 hours. Pre-hybridized slides were rinsed 5 times in reagent grade distilled water and chilled to 4°C on a solid metal platform. Cy5 aminoallyl-labelled DNA targets were resuspended in 15μl of 4X SSC, heated at 95°C for 5 min, and immediately placed on ice. The Cy3-labelled oligonucleotide, 5' CCC ATA TCG TTT CAT AGC TTC TGC CA 3', was also included in the hybridization reaction (final concentration 0.6μM) as a control to hybridize with the control oligonucleotide attached to the microarray. Chilled hybridization reactions were pipetted onto pre-chilled printed microarray slides, covered with array cover slips (PGC Scientifics, Gaithersburg, MD), and incubated overnight at 4°C as described previously [28]. Hybridized microarrays were gently rinsed 5 times in 4X SSC at 4°C for 1 minute each, followed by a final 30 second rinse in reagent grade water. Microarray slides were spun dry in 50 ml conical tubes as described above prior to scanning.
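The pairing between the array-bound control and the Cy3-labelled oligonucleotide added to the hybridization can be checked by reverse-complementing one against the other. The sketch below uses the two sequences exactly as printed in the text; the variable names are illustrative only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

# Sequences as printed in the text, written 5' to 3' with spaces removed.
array_control = "TTG GCA GAA GCT ATG AAA CGA TAT GGG".replace(" ", "")
cy3_target = "CCC ATA TCG TTT CAT AGC TTC TGC CA".replace(" ", "")

full_complement = reverse_complement(array_control)
print(len(array_control), len(cy3_target))      # 27 26
print(full_complement.startswith(cy3_target))   # True
```

As printed, the Cy3-labelled oligonucleotide is one base shorter than the array-bound control and pairs with all of it except the 5'-terminal T. Whether that reflects the oligonucleotide actually used or a transcription slip cannot be determined from the text.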

Image Analysis and Statistics
Processed microarray slides were scanned at 532nm and 635 nm using the VersArray ChipReader system (BioRad, Hercules, CA) configured at a 5μm resolution. Spot intensity data from the resulting 16-bit TIF images were initially extracted using the ArrayPro Analyzer software (Media Cybernetics, Silver Spring, MD). Background signal was determined locally for each spot using the "local corners" option. Individual spot intensities, minus local backgrounds, were normalized to total spot intensity for all of the spots on each microarray. The mean-normalized datasets were transformed by taking the logarithm of these values. An empirical data reduction process was employed (see Results) to identify which of the 198 probe spots had the most information (for example, spots that were always "on" or "off" for all isolates would carry no information for this dataset) and which spots were too variable within the replicates of the same isolates. Principal Components Analyses (PCA) and cluster and classification analyses were run on the remaining dataset using Pirouette (Infometrix, Inc., Bothell, WA).
Table 1. Control oligonucleotide designated by C.
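The preprocessing described above (local background subtraction, scaling to the array total, log transform, then removal of uninformative or irreproducible probes) might be sketched as follows with NumPy. The text does not give the log base, how non-positive net intensities were handled, or the cut-offs of the empirical data reduction, so the base-10 log, the `floor` value, and both thresholds are assumptions made for illustration.

```python
import numpy as np

def normalize_array(raw, background, floor=1.0):
    """Background-subtract, scale to the array's total signal, then log-transform."""
    # Assumption: net intensities are floored at a small positive value so the log is defined.
    net = np.maximum(np.asarray(raw, float) - np.asarray(background, float), floor)
    return np.log10(net / net.sum())  # assumption: base-10 logarithm

def select_probes(profiles, isolate_ids, max_within_sd=0.15, min_between_sd=0.05):
    """Boolean mask of probes that are reproducible within isolates and vary between them.

    profiles: (n_arrays, n_probes) log-normalized intensities
    isolate_ids: one label per array; replicates share a label
    Both thresholds are hypothetical.
    """
    isolate_ids = np.asarray(isolate_ids)
    groups = [profiles[isolate_ids == iso] for iso in np.unique(isolate_ids)]
    means = np.array([g.mean(axis=0) for g in groups])
    # Replicate spread can only be estimated for isolates with two or more arrays.
    within = np.array([g.std(axis=0, ddof=1) for g in groups if len(g) > 1])
    reproducible = within.max(axis=0) <= max_within_sd
    informative = means.std(axis=0, ddof=1) >= min_between_sd
    return reproducible & informative

# Toy example: two isolates, two replicate arrays each, three probes.
profiles = np.array([
    [-2.00, -3.0, -2.0],
    [-2.05, -3.0, -3.0],
    [-3.00, -3.0, -2.5],
    [-3.05, -3.0, -2.4],
])
mask = select_probes(profiles, ["A", "A", "B", "B"])
print(mask)  # [ True False False]
```

In the toy data the second probe is constant across all arrays (always "on" or always "off"), and the third varies too much between replicates of isolate A, so only the first probe is retained.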

Oligonucleotide Microarray Bacterial Source Tracking
Oligonucleotide microarrays were evaluated for their ability to resolve BOX PCR amplification products derived from environmental Enterococcus sp. isolates originating from deer, bovine, gull, and human sources. Purified genomic DNA from Enterococcus sp. isolates was subjected to BOX PCR amplification and the resulting amplification products were visualized by agarose gel electrophoresis. The results of a typical experiment can be seen in Fig. 2, which represents the subset of samples originating from deer. Agarose gel electrophoresis confirmed amplification as well as the consistency of the BOX PCR reaction. PCR products were fluorescently labelled with aminoallyl dUTP and Cy5, then resolved by hybridization to in-house fabricated 9mer oligonucleotide microarrays (see Materials and Methods). The results of a representative microarray experiment can be seen in Fig. 3, in which replicate BOX PCR reactions from Enterococcus sp. deer isolate 49.1.1 were hybridized to replicate oligonucleotide microarrays. A histogram of fluorescent spot intensities indicates that these randomly selected nonamer intensities follow a lognormal distribution (data not shown). Of the 17 environmental isolates analysed, not all replicate microarrays were usable. For six of these isolates (4 human and 2 deer) a single microarray hybridization replicate, consisting of duplicate microarray spots, was available for analysis. For the remaining 11 isolates and their replicates, spots that exhibited extreme variability in normalized spot intensities among replicates within a specific source were identified and subsequently eliminated from

PCA and HCA Analysis
The dendrogram of a complete-linkage, Euclidean-distance Hierarchical Cluster Analysis (HCA) did not show good origin-specific clustering of the isolates. In particular, the bovine-origin replicates were spread among several clusters (part of the dendrogram is shown as an example in Fig. 4). A K-Nearest Neighbour (KNN) classification confirmed the HCA, misclassifying 8% of the deer, 16% of the human, and 50% of the gull isolates as bovine isolates. The PCA was visually superior at separating origin-specific clusters, even for as few as 3 factors (Fig. 5). A Soft Independent Modelling (SIM) classification confirmed the PCA, resulting in zero misclassifications using 5 factors for each class. Numerical descriptions of the SIM classification model for bovine-origin Enterococcus sp. are presented in Table II. These factors describe the multidimensional subspace within the PCA projection in which the various microarray source profiles exist. Factor numbers indicate the relative linear weights of each probe in each factor. For instance, probes 2 and 16 have the highest weights for the most important factor, Factor 1, which accounts for 30% of the variability. Thus, for this set of isolates, SIM classifications based on 5 factors for each class and 5 linear combinations of the 45 probes sufficed to distinguish the origins of Enterococcus sp. isolates.

Discussion
In an effort towards adapting new defensible methods for assessing and managing the risk posed by microbial pollution, we evaluated the utility of oligonucleotide microarrays for bacterial source tracking. Specifically, we evaluated the ability of oligonucleotide microarrays to visually discriminate 17 unique environmental isolates of Enterococcus sp. based on host origin, i.e. gull, bovine, deer, and human. As observed in an earlier study by Kingsley et al. [28], many of the microarray oligonucleotide probes exhibited high variations in fluorescent spot intensities within a series of replicates. A strong down-selection for reproducible spot intensities within replicates produced a set of 45 probes, and this reduced set proved useful for classifying isolates by source. It should be reiterated that this data reduction was performed in order to improve reproducibility, and had the side effect of improving the classification fit. This is the opposite of the familiar problem of model overfitting, in which the addition of extra variables improves classification at the expense of robustness and reproducibility.
Following data reduction, a number of multivariate statistical analysis procedures are available for evaluating the relationships among microarray hybridization profiles. Previously, PCA was successfully used to visualize relationships among microarray hybridization profiles derived from closely related Xanthomonas pathovars [28]. In this study, PCA and HCA were compared for their ability to visually cluster microarray hybridization profiles based on the environmental source from which the Enterococcus sp. isolates originated. SIM classification based on class analogies consisting of 5 factors was more accurate than classification based on K-Nearest Neighbour (KNN) calculations. This difference is apparent when comparing the PCA, which is a visualization of some of the SIM calculations, to the HCA, which is a visualization of some of the KNN calculations. The implication of these results for the application of random oligonucleotide microarrays for BST is that, given the reproducibility issues, factor-based variable selection such as in PCA and SIM greatly outperforms dendrogram-based similarity measures such as in HCA and KNN. Given any sample, and based strictly on its microarray intensity values, the SIM model outputs the best-fitting class for that sample, with zero misclassifications for the dataset. Further optimization of source classifications may result from the application of information theory to detect patterns in microarray profiles. In particular, bacterial source tracking may benefit from several measures of classification utility, such as those based on mutual information, that have been developed as part of information theory [32]. However, successful application of information theory to microarray analysis will be dependent upon accurately understanding, capturing, and modelling sources of variation in the microarray experimental process.
Some of these sources of variation, such as PCR amplification and microarray fabrication, have been described previously [27]. Once improved microarray experimental protocols and statistical methods have been developed, it will be possible to incorporate microarray technology into the growing toolbox of technologies that is rapidly defining bacterial source tracking. While there is currently no one best method that accomplishes the ambitious goal of source tracking, as demonstrated in the latest study by Stoeckel et al. [33], it is likely that a combination of methods will lead to effective source tracking.

[Figure: PCA projection of isolates of Enterococcus sp., colored by host origin: deer is red, bovine is yellow, human is green, gull is purple. For this 3D view only the first 3 components can be plotted, but clustering is evident.]