1. Introduction
The paradigm of “everything is everywhere, but the environment selects” [
1] suggests that all microbial taxa have the potential to be found everywhere. This largely holds true for the main marine bacteriophage taxa, with the presence of cyanophage-like sequences of the order
Caudovirales dominating all ocean viromes, including the recently sampled Indian Ocean [
2,
3,
4,
5]. The order
Caudovirales is comprised of three families:
Myoviridae (contractile tails),
Siphoviridae (non-contractile tails) and
Podoviridae (short tails) [
6]. During the Global Ocean Sampling (GOS) expedition [
5], myovirus-associated sequences were ubiquitously distributed among sampling sites with the highest prevalence in tropical oligotrophic locations. Podo- and siphoviruses showed site-specific distributions, with the highest abundances recorded in temperate mesotrophic waters and hypersaline lagoons, respectively. Within the Indian Ocean, 32% of the viral fraction (VF) was attributed to known viruses, with 95% of the known viruses identified as belonging to the order
Caudovirales (
Myoviridae, 54.3%;
Podoviridae, 27.6%;
Siphoviridae, 17%) [
4]. The nucleo-cytoplasmic large DNA viruses (NCLDVs) were often the next major lineage present, with the family
Phycodnaviridae representing 83.9% of this group, followed by
Iridoviridae at 8.5% and
Mimiviridae at 7.3%.
Most of the virome-based studies carried out so far do not report on the diversity of the likely hosts that the viruses infect, making it unclear as to whether the viruses present in the water column are the result of active or past infections. An exception is the Tara Oceans expedition, where eukaryotic and prokaryotic diversity [
7,
8] was reported in conjunction with viral diversity [
9,
10,
11]. Global surveys, which include the southwest Indian Ocean, indicate that the α-Proteobacteria dominate the prokaryotic communities in both surface waters and at the deep chlorophyll maxima. The second most represented group are either the Cyanobacteria or γ-Proteobacteria, depending on location [
8]. For the eukaryotic fraction, samples collected during the Tara Ocean expedition showed that the pico- and nano-plankton was dominated by photosynthetic dinoflagellates (of the family Dinophyceae). Parasites of the superphylum Alveolata, specifically the marine alveolates (MALV)-I and MALV-II clusters, routinely infect members of the family Dinophyceae and can account for up to 88% of the eukaryote fraction in some locations. These two MALV clusters have recently been renamed Syndiniales groups I and II, respectively [
12]. Specifically for the southwest Indian Ocean, the eukaryotic fraction was dominated by alveolates including the Dinophyceae and their Syndiniales parasites [
7].
Studies on microbial diversity in aquatic environments rely on sample volumes ranging from tens of litres to as much as a thousand litres of water [
2,
13,
14]. Sampling of large volumes was thought to be a necessity for early sequencing technologies, which required considerable quantities (micrograms) of DNA. Newer technologies, such as linear amplification deep sequencing with Illumina, require smaller quantities (nanograms) of DNA [
15]. Additionally, various sample concentration methods have been developed in order to collect the greatest quantities of DNA possible from water samples [
16]. Standard viral filtration methods involve the use of filters with a pore size of 0.2 µm to remove bacteria from the sample and collect only the virus fraction. However, this 0.2 µm size fraction results in underreporting of giant viruses [
17,
18], as the giant virus particles can have diameters varying from ~0.2 to 1.5 µm, with
Pithovirus sibericum being the largest known member of this group [
19]. In addition, the <0.2 µm size fraction also contains large amounts of dissolved DNA. Jiang and Paul concluded that, in this size fraction, viral particles makes up only a small component of the filterable DNA, the majority being dissolved DNA of bacterial and eukaryotic origin [
20].
Dissolved DNA forms part of environmental DNA (eDNA), derived from cellular debris produced from biota living in that environment [
21]. Therefore, eDNA is being used as a tool to determine whether an invasion has taken place [
22] or to track an endangered species [
23]. The size fraction used to describe eDNA is the size fraction that removes larger eukaryotes (passing through a 0.5 mm mesh) but retains microbes (>0.45 µm filter). Therefore, the eDNA concept excludes the microbial community as they are retained in this size range. To our knowledge no study has yet addressed the question of whether the <0.45 µm size fraction, the microbial environmental DNA (meDNA) fraction, can be used as a proxy to describe the complete biota in any given environment.
In this study, we tested the hypothesis that the volume equivalent to a cup of seawater (250 mL) is sufficient to describe the most abundant microbial taxa (from viruses to protists) in the marine environment. Serendipitously, our study site is within 548 nautical miles of station 64, previously sampled by the Tara Oceans expedition (−29.5333, 37.9117), thereby allowing for a semi-qualitative comparison to be made. Our protocol differed from previous studies, including that of Tara Oceans, as it contained no concentration steps. In addition, only 50 mL of the 0.45 µm 250 mL permeate was used to describe the combined dissolved DNA and viral fraction (meDNA). The 0.45 µm size fraction was chosen because we wanted to limit the removal of giant viruses. Here we report how a relatively small water sample can be used to capture the dominant microbial taxa within any given aquatic system.
2. Materials and Methods
2.1. Sample Collection
The water sample analysed in this study was collected during the second transect of the Great Southern Coccolithophore Belt expedition (GSCB-cruise RR1202) in the southwest Indian Ocean in February 2012 [
24]. The location of the sampling station S1 (−38.314983, 40.958083, water temperature 20.83 °C, pH 8.08) was mapped using RgoogleMaps_1.2.0.7 [
25] under R version 3.3.0 (accessed on 3 May 2016) (
Figure 1a).
One litre of water was gathered from the conductivity, temperature, and depth (CTD) rosette sampler from the chlorophyll maximum layer (5 m). Of this, an aliquot of 250 mL of seawater was filtered through a 0.45-µm polycarbonate filter and the filter was used for the DNA extraction onboard the R/V Roger Revelle using Qiagen DNeasy Blood and Tissue protocol (Qiagen, Valencia, CA, USA). The DNA was stored at −21 °C and subsequently transferred to Plymouth, UK, for further processing. Fifty millilitres of filtered water were set aside, wrapped in tin foil and stored in a fridge. This too was returned to Plymouth, UK, for further processing.
2.2. DNA Extraction, Preparation and Sequencing of the >0.45 µm Fraction
The V4 region, along the prokaryotic 16S ribosomal RNA gene was amplified using the universal primer pair 515F and Illumina tagged primer 806R7, 806R10 and 806R15 (Illumina, San Diego, CA, USA) [
26]. For eukaryotic 18S ribosomal RNA gene, we used the primer pair 1391F and Illumina tagged EukB6, EukB16 and EukB23 to amplify the V9 region [
27]. For all polymerase chain reaction (PCRs), we added 1–5 μL of the eDNA (concentration range from 1.47 to 32.51 ng/μL), to 5X Colourless GoTaq Flexi Buffer (Promega, Madison, WI, USA), 1.5 μL MgCl
2 Solution 25 mM, 2.5 µL dNTPs (10 mM final concentration), 1 μL Evagreen Dye 20X (Biotium, Fremont, CA, USA), 0.1 μL GoTaq DNA Polymerase (5u/μL) and 12.9 µL of sterile water for a final volume of 25 μL for each reaction. This was done to determine the mid-exponential threshold of each reaction, ran on a Corbette Rotor-Gene™ 6000 (Qiagen). The real-time PCR proceeded with an initial denaturation at 94 °C for 3 min, followed by 40 cycles of a three-step PCR: 94 °C for 45 s and 50 °C for 60 s and 72 °C for 90 s. The fluorescence was acquired at the end of each annealing/extension step on the green channel. The cycle threshold of the amplification in the exponential phase was recorded for amplification.
A second standard PCR amplification was carried out in triplicate and run at the same conditions, excluding the addition of the Evagreen Dye. The sample was removed from the machine when it reached the cycle threshold, as previously determined. Products were run on a 1.4% agarose gel to confirm the success of the amplification and the product size of the amplification. The bands were cut out and purified using the Zymoclean Gel DNA Recovery Kit (Zymo Research, Irvine, CA, USA). Quantity and quality was verified on the NanoDrop 1000 (Thermo Scientific, Wilmington, DE, USA) and QuantiFluor E6090 (Promega). V4-16S and V9-18S were prepared mixing an equimolar concentration of each amplicon triplicate into the pool for which concentration was checked on the Bioanalyser (Agilent Technologies, Santa Clara, CA, USA). The final pooled samples were denatured and diluted to 6 pM and mixed with 1 pM PhiX control (Illumina), read 1 sequencing primer was diluted in HT1, before the flowcell was clustered on the cBOT (Illumina). Multiplexing sequencing primers and read two sequencing primers were mixed with Illumina HP8 and HP7 sequencing primers, respectively. The flowcell was sequenced (100 PE) on HiSeq 2000 using sequencing by synthesis (SBS) reagents (Version 3.0). The raw sequences are available at the European Nucleotide Archive (ENA) under accession number PRJEB16346 and PRJEB16674.
2.3. DNA Extraction, Preparation and Sequencing of the <0.45 µm Fraction
The whole 50 mL permeate was used in the nucleic acid extraction procedure. We added 100 μL of proteinase K (10 mg/mL; Sigma-Aldrich, St. Louis, MO, USA) and 200 μL of 10% sodium dodecyl sulfate (SDS) (Sigma-Aldrich) to the permeate and incubated the solution for two hours with constant rotation at 55 °C. The lysate was then collected through multiple centrifugations on a Qiagen DNeasy Blood and Tissue column (Qiagen). The standard Qiagen protocol was followed with 20 µL nuclease-free water (Sigma-Aldrich) used as the elution agent. Quantity and quality was determined using the NanoDrop 1000 (Thermo Scientific) and QuantiFluor E6090 (Promega). Two hundred microliters of DNA (<40 ng) were fragmented using a Bioruptor (Diagenode, Seraing (Ougrée), Belgium) on medium for 15 bursts of 30 s with a 30 s pause and concentrated to 30 µL on a Minelute column (Qiagen). Fragments were made into libraries using the Nextflex ChipSeq library preparation kit (BIOO scientific, Austin, TX, USA) without size selection and with 18 cycles of PCR amplification. Bioanalyser (Agilent Technologies) analysis indicated the final library contained insert between 30 basepairs (bp) to 870 bp. The library was multiplexed with other samples and sequenced (100 paired end) on a HiSeq 2000 (Illumina) using RTA1.9 and CASAVA1.8.
2.4. Bioinformatics Pipeline for the Prokaryotic (16S) and Eukaryotic (18S) Amplicon
The complete bioinformatics pipeline is illustrated in
Figure 1b. The read quality was first assessed using Fast-QC [
28]. FASTX-Toolkit [
29] was utilised for the trimming and filtering steps; the first and last 10 bases were trimmed in order to remove low quality nucleotides. Reads were then filtered in order to retain only reads with more than 95% of nucleotide positions called with a quality score of 20. Trimmed and cleaned reads from each of the triplicate V4-16S and V9-18S PCRs were pooled in order to assign OTUs using Qiime [
30] with 97% similarities for clustering and Swarm analysis [
31], respectively. A taxonomy was assigned using BLASTn implemented in Qiime and Swarm using SILVA Version 119 [
32] with a minimum e-value of 1 × 10
−5.
2.5. Bioinformatics Pipeline of the <0.45 µm Fraction (Metagenome)
As for the amplicon dataset, the quality of the reads was first assessed using Fast-QC [
28]. The FASTX-Toolkit [
29] was used to trim the first last bases to remove low quality nucleotides, and subsequently to filter out reads with fewer than 95% of nucleotide positions called with a quality score of 20. The forward read (R1) of the 100 bp pair-end HiSeq reads have been subjected to random library size normalization using Qiime script subsample_fasta.py; reverse reads (R2) had poor quality and were therefore discarded. The reads were used in a BLASTX [
33] analysis against a Virus database (db; courtesy of Pascal Hingamp) with e-values less than 1 × 10
−5. The Virus database consisted of Refseq curated viral genomes, together with additional new genomes [
11], and 20% of R1 Refseq whole organism db [
34]. In addition, the pair-end reads were assembled into contigs using a de Bruijn de novo assembly program in CLC Genomic Workbench (Version 7.1.5; CLCbio, Cambridge, MA, USA) using global alignment with automatics bubble and word size, minimum contigs length of 250, mismatch cost of 2, insertion and deletion cost of 3, length fraction of 0.5 and similarity threshold of 0.8. The contigs were annotated with the BLASTX as described for the R1 normalised reads. Blast analyses were performed by using the University of Cape Town’s HPC hex cluster.
The top hits from all the blast searches were selected through the use of a parser Perl script (
http://www.bioinformatics-made-simple.com), and then a customised R script was developed to assign taxonomy. A complete viral taxonomy was assigned through a manually curated implementation of the International Committee on Taxonomy of Viruses (ICTV) database 2013 v1 with the National Center for Biotechnology Information (NCBI) taxonomy database.
2.6. Visualization of Community Diversity
Krona tools [
35] were used to visualize community diversity as characterized by the Silva (v119), Refseq and Virus db genes taxonomy assignments. Venn diagrams were created using the R package VennDiagram_1.6.17 on R (Version 3.3.0; 2016-05-03).
2.7. Filters Applied to Annotated Datasets
We performed independent analyses on three independent PCR replicates (V4-16S and V9-18S) and assigned a taxonomy using Silva [
36]. By using replication, we removed the level of noise in the sample introduced by PCR and sequencing artefacts, while retaining rare organisms. Therefore, we considered four levels of stringency at the phylotype level: (1) T0, all phylotypes present across the three replicates; (2) T1, removing singletons from each replicate; (3) T10, a minimum of 10 copies per phylotype had to be present in any one of the replicates, (4) T10-R1, a minimum of 10 copies per phylotype present in any two replicates and (5) T10-R2, a minimum of 10 copies per phylotype present in all three replicates.
4. Discussion
Microbes, from the smallest viruses to the largest unicellular protists, dominate our oceans, playing a central role in ocean food webs and as key drivers of biogeochemical processes [
37]; yet the complex interactions and ecological significance of these relationships within and between biomes are largely unknown. The necessity of studying prokaryotes, eukaryotes and viruses together was highlighted in 2011 when it was estimated that only 11.2% and 2.2% of selected literature utilised two or three microbial groups, respectively [
38]. For this reason, the more recent ocean expeditions sampling efforts include multiple trophic levels and ecosystem components in an attempt to better describe the complex microbial ecosystem structure and dynamics [
39]. Describing and studying the hosts (prokaryotes and eukaryote assemblages) alongside their viruses can help improve our understanding of the roles of microbes in a more holistic way.
Given the patchiness of marine environments, changing rapidly both in time and space, the definition of a unique standard sample volume remains elusive [
38]. Yet fingerprint profiles in the marine environment have shown the absence of significant difference in richness when utilizing from 10 to 1000 mL of seawater [
40] as well as the low variability of the community structure when utilising more than 50 mL [
41]. With this study, we used 250 mL of water, sampling the same seawater mass for all three microbial components (prokaryotes, eukaryotes and viruses). Here we demonstrate that the application of four levels of stringency allowed us to step-wise eliminate OTUs produced by sequencing errors and/or contamination. The removal of singletons resulted in the reduction of the overall phylotypes by around 700, while retaining over 99% of the reads. This step removed sequences of terrestrial origin (e.g.,
Nicotiana and
Cicer), which are not expected to occupy the marine microbiome. Although singleton removal is a common practice, researchers do often retain these taxa under the label of “rare” microbiome. When singletons are removed in conjunction with replication of PCR runs a more stringent and precise description of the microbiota present in the environment can be obtained. This filtering step (T1 on the three replicates combined) allowed us to identify around 23,000 OTUs for the prokaryotic dataset and 3000 for the eukaryotic dataset grouping 834 and 346 as the lowest level of assigned taxa, respectively. Furthermore, the use of replication reduced the overall retained phylotypes when compared to individual replicates, because the duplicate values across the three replicates were removed, leaving only unique annotations, which constituted the dominate phylotypes of the sample. The further application of a more stringent filter, i.e., a phylotype was present with at least 10 reads in each PCR replicate, gave us the confidence that the rare microbiota were not included accidentally in the final dataset. However, this will invariably mean that genuine rare microbiota could be removed. This was the case of taxa such as
Chlorella,
Pedinomonas,
Marinobacter and
Oceanicaulis, which were removed by applying this filter level.
Bacterial composition at the location analysed by
Tara Oceans expedition (station 64), based 548 nautical miles from ours, showed high abundance of α-Proteobacteria followed by Cyanobacteria (chloroplasts), γ-Proteobacteria and Bacteroidetes [
8]. The microbial composition in our sample revealed the dominance of Cyanobacteria (
Synechococcus and
Prochlorococcus) followed by α-Proteobacteria, γ-Proteobacteria and Bacteroidetes. This Cyanobacteria dominance is more consistent with other viral abundance data (discussed further later on). Eukaryotes collected from
Tara Oceans station 64 were dominated by the pico-nanoplankton, the Alveolata (Dinophyceae and Syndiniales clade MALV-I-II), followed in abundance by “other protists” [
7]; our station was also dominated by Alveolata (Dinophyceae and Syndiniales). We hypothesise that the variation in composition from our station S1 and
Tara Oceans’ station 64 can be attributed to sampling different water masses as well as different sampling seasons:
Tara Oceans’ one was sampled in winter (July 2010), while our station S1 was collected in summer (February 2012). Given these differences, it is nonetheless remarkable how similar the microbial communities were, especially when considering the application of vastly different sampling protocols.
Analyses of the metagenomic fraction, 0.45 µm permeate, showed that annotations based on the assembled contigs lead to a more robust description of diversity. We found that the majority (86%) of our data did not match any viral genomes in our curated virus database. This was similar to what was reported by previous studies, i.e., 55% [
42], 91.4% average [
2], 88% [
4] and 64.48% [
43]. Marine viral metagenomics or metabarcoding studies currently apply various biomass or volume concentration methods before the extraction of DNA for sequencing. Such studies applied to our area of interest reported on the dominance of the order
Caudovirales. Members of the family
Phycodnaviridae were the second most abundant viral group, often followed by the family
Mimiviridae. Our study demonstrated that a similar description of viral diversity is achievable from only 250 mL of seawater. The high abundance of Prochlorococcus and Synechococcus phages was consistent with the observed dominance of their host cyanobacteria genera. Both
Prochlorococcus and
Synechococcus co-occurred and dominated the prokaryotic dataset with 30% and 9% of the sequences. It is curious to note that the
Tara Oceans expedition [
8] did not find any barcode sequences matching extant Cyanobacteria lineages despite the high abundance of both Prochlorococcus and Synechococcus phages in this locality. The reason for this anomaly might lie in the differing methodologies applied or indeed the difference in timing of sampling. Future side-by-side methodological comparative studies might resolve the reason behind these inconsistencies.
NCDLVs, such as
Phycodnaviridae and
Mimiviridae, surprisingly coincided with the presence of diatoms and dinoflagellates. These taxa, which constituted more than 90% of the eukaryotic dataset, are considered the most widespread protists on earth and are known to be routinely infected by RNA viruses [
44]. Nevertheless, dinoflagellates are also infected by NCLDVs [
44,
45] and therefore our study suggests that further undescribed host-virus relationships can occur between dinoflagellates, diatoms and NCLDVs.
The 0.45 µm permeate or meDNA contains dissolved genetic material associated with cellular derived exudates (as part of the eDNA fraction [
21]), viruses [
20] or indeed small bacteria [
46]. The comparative analyses of the two sampled size fractions revealed that bacteria and eukaryotes identified in this environment were not the source of the entire meDNA in our sample. On average 10% and 0% of the phylotypes was found in common between the meDNA and the bacterial and eukaryote permeate fractions, respectively. The likely explanation for the source of this DNA could be either the presence of viruses carrying host genes, since host genes have been identified in viral isolates or the presence of small bacteria (>0.45 µm). The latter included genera, identified in both datasets, such as
Pseudomonas,
Flavobacterium,
Serratia and
Vibrio, which are known to pass through 0.45 µm filters [
46]. Nine coding sequences of the <0.45 µm fraction had hits with 16S proteins, six of which corresponded to
Microbacterium (data not shown), and represented the main genera identified in this fraction. Furthermore it has been shown that, in adverse conditions,
Microbacterium can present size reduction, which allowed it to pass through 0.45 µm filters [
47,
48]. Viruses often acquire host genes through horizontal gene transfer and since a large proportion of genetic material with unknown identity was also described, we hypothesise that viruses are the likely source of this meDNA. Whether this will be the case for all locations and situations remain to be determined. Our hypothesis contradicts Jiang and Paul [
20], possibly because of their study locations being more productive. However, their study stopped short of confirming the species identified to actually being present in their water sample. Ultimately, eDNA, or, in our case, meDNA, do not appear to be a good proxy for describing the microbes present in this body of water.