Variation in the Genetic Repertoire of Viruses Infecting Micromonas pusilla Reflects Horizontal Gene Transfer and Links to Their Environmental Distribution

Prasinophytes, a group of eukaryotic phytoplankton, have a global distribution and are infected by large double-stranded DNA viruses (prasinoviruses) in the family Phycodnaviridae. This study examines the genetic repertoire, phylogeny, and environmental distribution of phycodnaviruses infecting Micromonas pusilla, other prasinophytes, and chlorophytes. Based on comparisons among the genomes of viruses infecting M. pusilla and other phycodnaviruses, as well as the genome from a host isolate of M. pusilla, viruses infecting M. pusilla (MpVs) share a limited set of core genes, but vary strongly in their flexible pan-genome, which includes numerous metabolic genes, such as those associated with amino acid synthesis and sugar manipulation. Surprisingly, few of these presumably host-derived genes are shared with M. pusilla; rather, they have their closest non-viral homologue in bacteria and other eukaryotes, indicating horizontal gene transfer. A comparative analysis of full-length DNA polymerase (DNApol) genes from prasinoviruses and their overall gene content demonstrated that the phylogeny of DNApol gene fragments reflects the gene content of the viruses; hence, environmental DNApol gene sequences from prasinoviruses can be used to infer their overall genetic repertoire. Thus, the distribution of virus ecotypes across environmental samples based on DNApol sequences implies substantial underlying differences in gene content that reflect local environmental conditions. Moreover, the high diversity observed in the genetic repertoire of prasinoviruses has been driven by horizontal gene transfer throughout their evolutionary history, resulting in a broad suite of functional capabilities and a high diversity of prasinovirus ecotypes.


Introduction
Prasinophytes are a divergent group of marine eukaryotic phytoplankton within the division Chlorophyta [1]. They have a global distribution and are a major component of coastal and oceanic diversity and phylogeny with amplicon sequences is compromised because of the specificity of the primers. For example, the primers typically used for DNApol [21] amplify MpV sequences [23,24], whereas the primers used for MCP miss them [26]. These differences were highlighted in a freshwater study [29] in which primers for DNApol and MCP favored amplification of prasinovirus and prymnesiovirus sequences, respectively.
Another approach to examine the genetic relatedness among viruses is to build multi-gene phylogenies. This can be done based on selected core genes [11], or by comparing the presence and absence of genes across entire genomes. Gene presence-absence trees provide a rigorous way to examine evolutionary relationships among large DNA viruses [30,31], but the approach is not amenable to comparing viruses based on environmental sequence data. Nonetheless, gene presence-absence trees can be used to construct robust phylogenetic relationships among sequenced virus isolates which, in turn, can serve as a backbone for making predictions about virus gene content from environmental amplicon-based sequencing data. Therefore, a relationship between a phylogeny based on core gene sequences, such as for DNApol, and overall gene content would need to be established. In this way, environmental amplicon data for DNApol can be used to infer the gene content of prasinoviruses in nature. This approach is explored in the present study.
The relationship between gene content in prasinoviruses and their environmental distribution is unexplored, but marine virus communities show biogeographic patterns [32-34], including viruses infecting Ostreococcus tauri, which form distinct communities in contrasting environments [35]. Furthermore, the diversity and composition of prasinovirus communities are influenced by environmental factors, particularly the availability of phosphate [36]. A recent study on cyanophage isolates, which prominently carry a range of auxiliary metabolic genes (AMGs), linked their genome similarity with environmental distribution, and thus proposed a diversification of viruses into ecotypes [34]. This suggests that the gene content of prasinoviruses may reflect their environmental distribution.
In this study, we contrast the genomes of prasinoviruses infecting Micromonas pusilla with those of other phycodnaviruses from a range of hosts and environments, with the goal of describing their genetic composition in the context of their environmental distribution.

Sequencing and Annotation of Micromonas pusilla Viruses
The Micromonas pusilla viruses MpV-PL1 and MpV-SP1 were isolated from the mixed layer in the Gulf of Mexico and coastal waters of California (respectively), and propagated on Micromonas pusilla (UTEX LB991) [37]. The viruses were purified from 15 mL of lysate by filtration through 0.45 and 0.22 µm pore-size Durapore membrane filters (EMD Millipore Corp., Billerica, MA, USA), ultracentrifugation, and subsequent optiprep gradient centrifugation, as described in Fischer et al. [38]. The DNA was extracted and purified using QIAamp MinElute Virus DNA spin kit (Qiagen, Inc., Valencia, CA, USA) prior to sequencing to 10-fold depth and assembly by the Broad Institute, using the 454 GS FLX platform and Newbler 2.7 (454 Life Sciences, Roche Diagnostics, Basel, Switzerland). Read assembly resulted in two contigs per virus that were mapped in Mauve v2.3.1 [39] to MpV1 as a reference genome. Sequencing gaps were closed by PCR amplification with customized primers (PL1 fwd-GAGGGTGGGCACGTTGGAG, rev-GTCTCTAGGACCCCCACCCT; SP1 fwd-GCTAATGACGAGTTCGGTCG, rev-ACTAAGTAACCGAAACTGTCCCC) to bridge the gaps, cloning of the product and subsequent Sanger sequencing (NAPS, University of British Columbia, Vancouver, BC, Canada). Final genomes were assembled in Geneious 6.0.5 (Biomatters Ltd., Auckland, New Zealand) based on sequence overlap.
To annotate the assembled genomes, open reading frames (ORFs) were called in Artemis v14.0.0 [40] using a minimum ORF length of 65 amino acids (195 nt) with start and stop codons. ORFs were translated into amino-acid sequences in Artemis using the standard genetic code in three reading frames. Putative coding sequences were tested for homology in the nr-database (NCBI) with a protein BLAST (BLAST-P). Annotations were manually selected based on an E-value threshold of E−10 and a minimum alignment length of 50%. tRNAs were determined with tRNAscan-SE v1.21 [41].
Viruses 2017, 9, 116
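The annotation thresholds described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline (annotations were selected manually); it only shows how a pre-filter on BLAST-P hits might look. The `Hit` record, its field names, and the reading of "50% alignment length" as the aligned length divided by the query length are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST-P hit; field names are hypothetical, not a BLAST output format."""
    query: str       # ORF identifier
    subject: str     # database sequence identifier
    evalue: float    # BLAST E-value
    align_len: int   # aligned length in amino acids
    query_len: int   # full ORF length in amino acids


def passes(hit, max_evalue=1e-10, min_coverage=0.5):
    """Keep hits with E-value <= 1e-10 that cover at least half of the query (assumed definition)."""
    return hit.evalue <= max_evalue and hit.align_len / hit.query_len >= min_coverage


def best_hits(hits, max_evalue=1e-10, min_coverage=0.5):
    """Return the lowest-E-value passing hit per query ORF."""
    best = {}
    for hit in hits:
        if not passes(hit, max_evalue, min_coverage):
            continue
        current = best.get(hit.query)
        if current is None or hit.evalue < current.evalue:
            best[hit.query] = hit
    return best
```

ORFs without any passing hit are simply absent from the result, which corresponds to ORFs left without a functional annotation.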

Comparing Prasinovirus Genomes and Inferring Viral Phylogeny
Coding sequences (CDS) for MpV-PL1, MpV-SP1, two other sequenced and annotated Micromonas viruses MpV1 and MpV-12T, and M. pusilla UTEX LB991 were clustered in USEARCH (v6.1.544) [42] based on a 50% pair-wise identity at the amino-acid level. Viral clusters were labeled based on the annotation of MpV-PL1 where applicable. Genome contents were compared based on a cluster presence-absence scheme and Venn diagrams produced in R [43]. Core genes in the M. pusilla viruses were defined when a cluster contained CDS from all four genomes, or a CDS could be associated with a cluster based on functional annotation and BLAST-P analysis.
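As a minimal sketch of the cluster presence-absence scheme, the following Python code splits clusters into core and flexible sets. It assumes the clustering output has already been reduced to a mapping from cluster label to the set of genomes contributing a CDS; the function names and the toy cluster labels are illustrative, and the additional assignment of CDS by functional annotation and BLAST-P described above is not modeled.

```python
def presence_absence(clusters, genomes):
    """Build a presence-absence row (list of booleans) per cluster, in the order of `genomes`."""
    return {cid: [g in members for g in genomes] for cid, members in clusters.items()}


def core_and_flexible(clusters, genomes):
    """Core clusters contain a CDS from every genome; all other clusters are flexible."""
    required = set(genomes)
    core = {cid for cid, members in clusters.items() if required <= set(members)}
    flexible = set(clusters) - core
    return core, flexible


# Toy input: which virus genomes contribute a CDS to each cluster.
genomes = ["MpV1", "MpV-12T", "MpV-PL1", "MpV-SP1"]
clusters = {
    "DNApol": {"MpV1", "MpV-12T", "MpV-PL1", "MpV-SP1"},
    "hsp70": {"MpV-12T", "MpV-PL1"},
    "transketolase": {"MpV-SP1"},
}
core, flexible = core_and_flexible(clusters, genomes)
```

The same presence-absence rows can serve as input for Venn diagrams or for distance-based trees.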

Assessing the Prevalence of Prasinoviruses in Environmental Samples
Amplicons of DNApol gene fragments were used to infer prasinovirus diversity in environmental samples. Samples of 20 to 72 L of water were taken from the surface at three sites in the Strait of Georgia, Jericho Pier (JP) and Point Atkinson (PA), the Juan de Fuca Strait (JF), and in the surface layer and at 200 m depth, several times per year in Saanich Inlet (SI) (sampling details are available in Supplementary Table S1). JP, PA, and JF samples were sequentially filtered through 47 mm diameter GC50 glass fiber filters (Advantec MFS Inc., Dublin, CA, USA) and HVLP (Millipore, Billerica, MA, USA) membrane filters (~0.45 µm nominal pore-size for each filter). Similarly, Saanich Inlet samples were filtered through 2.7 µm nominal pore-size GF/D filters (Whatman, Maidstone, UK) and 0.22 µm pore-size Sterivex filters (EMD Millipore). The remaining particulate matter in each filtrate was then concentrated by tangential flow filtration (TFF) with a 30 kDa molecular-weight cutoff cartridge filter (Prep-Scale, Millipore, Billerica, MA, USA) to make a viral concentrate (VC) that was stored at 4 °C in the dark. For DNA extraction, 14 mL VC subsamples were concentrated by ultracentrifugation for 4 h at 124,000× g at 15 °C using a SW40 rotor (Beckman Coulter Life Sciences, Brea, CA, USA), and the pellets resuspended with 500 µL Tris-ethylenediaminetetraacetic acid (EDTA) (TE) buffer (10 mM Tris-HCl; 1 mM EDTA; pH 8.0) at 4 °C overnight. Samples from Saanich Inlet were pooled into surface layer and deep composites. The viral capsids were lysed with Proteinase K (Invitrogen, Carlsbad, CA, USA) (100 µg mL−1) and DNA extracted using phenol-chloroform. Partial DNA polymerase sequences were amplified with AVS1 and AVS2 primers [50], and 500 ng of the PCR products used for library preparation and sequencing with a 454 GS FLX with titanium chemistry (Roche Diagnostics, Basel, Switzerland) at the Broad Institute (Cambridge, MA, USA).
Reverse AVS sequences were de-noised using QIIME v1.4 [51] and chimeras were removed using UCHIME (v4.2.40) [42]. De-noised sequences were translated to amino acids with FragGeneScan v1.16 [52] and dereplicated using USEARCH (v6.1.544). Reads from all environmental samples and reference sequences were pooled and clustered at 97% identity in USEARCH (v8.1). Clusters with only one member were discarded and centroids of the other clusters were aligned with Clustal Omega v1.2.3. Gaps in the alignment were trimmed and a maximum-likelihood tree was built in RAxML 8.0. Environmental reads and reference sequences were parsed using USEARCH (v8.1) at 97% identity. The frequency distribution of parsed environmental reads was rarefied to the lowest number of cumulative reads per sample using the VEGAN package [53] in R. The tree was visualized in iTOL (v3.5) [54].
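Two of the steps above, discarding single-member clusters and rarefying read counts to the smallest sample, can be sketched in Python. This is a stand-in for illustration only: the study used USEARCH and the VEGAN package in R, and the function names, the fixed random seed, and the toy counts below are assumptions.

```python
import random


def drop_singletons(clusters):
    """Discard clusters that contain only one member sequence."""
    return {cid: members for cid, members in clusters.items() if len(members) > 1}


def rarefy(counts, depth, rng):
    """Randomly subsample `depth` reads without replacement from an OTU count table."""
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    subsampled = {otu: 0 for otu in counts}
    for otu in rng.sample(pool, depth):
        subsampled[otu] += 1
    return subsampled


def rarefy_all(samples, seed=0):
    """Rarefy every sample to the lowest total read count among the samples."""
    rng = random.Random(seed)
    depth = min(sum(counts.values()) for counts in samples.values())
    return {name: rarefy(counts, depth, rng) for name, counts in samples.items()}


# Toy read counts per OTU for two samples (values are invented).
samples = {
    "JP": {"otu1": 50, "otu2": 30},
    "SI": {"otu1": 10, "otu2": 5, "otu3": 5},
}
rarefied = rarefy_all(samples)
```

After rarefaction every sample has the same total number of reads, so relative OTU frequencies can be compared across samples without a sequencing-depth bias.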
In situ measurements of temperature (°C) and salinity (Practical Salinity Units, PSU) were made with electrodes mounted on a CTD (Seabird, Bellevue, WA, USA) in Saanich Inlet or a YSI probe (YSI, Yellow Springs, OH, USA) in Jericho Pier, Point Atkinson and Juan de Fuca Strait. Additionally, remote sensing data were extracted from Aqua MODIS data (NASA Goddard Space Flight Center, Ocean Ecology Laboratory, Ocean Biology Processing Group) to estimate Chlorophyll a concentrations (Chl a, mg m−3), daytime sea-surface temperature (SST, 4µ, °C), photosynthetically active radiation (PAR, µmol photons m−2 s−1) and particulate organic carbon (POC, mg m−3 443/555) as a rolling 32-day composite pre-dating the sampling period, at a 4-km resolution. Data were processed and mapped in R.

Origin and Distribution of Genes in Micromonas Viruses
The genome sequences of Micromonas viruses MpV-PL1 and MpV-SP1 were completed using custom-designed primers. Genomes were analyzed and annotated by BLAST-P analysis of MpV ORFs against the nr-database. This improved earlier annotations, although most ORFs still lack a putative function. The present study focused on MpV-PL1 and MpV-SP1, and their comparison to the genomes of MpV1 (NC_014767) and MpV-12T (NC_020864). The viruses were isolated on three strains of M. pusilla (Table 1) and differed in genome size, the number and average length of their ORFs, GC content, and tRNAs. The genome sizes range from 173,350 bp for MpV-SP1 to 205,622 bp for MpV-12T, which does not correspond to the number or size of ORFs; MpV1 possesses the fewest (244) but, on average, longest ORFs (715 bp), while MpV-PL1 has the most (275), but not the shortest, on average (684 bp). MpV-12T also has the lowest GC content (39.8%), while MpV-PL1 has the highest (43.3%). Although six tRNAs are common in Micromonas viruses, MpV-PL1 lacks Leu-tRNA, while MpV-12T carries two copies of Asn-tRNA.
Combining the cluster analysis of putative viral genes with an additional BLAST-P analysis against the nr-database revealed a core genome of 119 genes and 327 genes in a pan-genome (Table 2). Core genes include those essential for viral replication and virion structure, such as DNApol, DNA ligase, transcription initiation factor, and seven capsid proteins. Most putative genes are in the flexible pan-genome, including genes which are functionally of cellular origin, such as those involved in carbon metabolism and DNA repair, yet most have no functional annotation. Other putative genes of presumable cellular origin associated with amino acid synthesis, including acetaldehyde dehydrogenase, acetolactate synthase, and aminotransferase, were found in MpV1 [12] and MpV-PL1, but are absent in the other Micromonas viruses. Heat shock protein 70 was found in MpV-12T and MpV-PL1, and is also shared with M. pusilla UTEX LB991. The DNA methylases/methyltransferases are site-specific and differ among the viruses. Moreover, both MpV-PL1 and MpV-SP1 possess a putative host-derived gene for 6-phosphofructokinase, while MpV1, MpV-PL1 and MpV-SP1 share dTDP-D-glucose 4,6-dehydratase. In contrast, only MpV-12T carries UDP-glucose 6-dehydrogenase and only MpV-SP1 has two transketolase-related genes. Several other genes are shared among MpV-PL1, MpV-SP1, and MpV1, but not with MpV-12T, which also has the most genes without functional annotation.
A BLAST-P analysis against the nr-database of putative coding sequences with a functional annotation revealed that for most core genes the closest hit is to other virus sequences, while for the flexible genome the closest homologues are often cellular (Figure 2). However, few of the sequences of presumed cellular origin have their closest homologue in M. pusilla UTEX LB991; rather, they are more similar to sequences in other eukaryotes, heterotrophic bacteria, cyanobacteria, or archaea.
The annotation is based on MpV-PL1; AA: Amino Acid; PhoH: Phosphate starvation-inducible protein.

Deriving Similarity in Gene Content from DNApol
A neighbor-joining phylogenetic analysis of prasinoviruses and chloroviruses based on the presence and absence of putative genes showed the similarity of the viruses to each other (Figure 3). The more similar the viruses are in their gene content, the closer they are on the tree; the Chlorella, Bathycoccus, and most Ostreococcus viruses form well-defined groups, whereas the Micromonas viruses form three distinct branches, with MpV-PL1 and MpV-SP1 branching together, and MpV1 and MpV-12T on separate branches.
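A distance-based tree such as the one described above needs a pairwise distance derived from gene presence and absence. The sketch below computes one common choice, the Jaccard distance, in Python. The distance measure actually used for the tree is not stated in the text, so Jaccard is an assumption here, as are the function names; the tree-building step itself is omitted.

```python
def jaccard_distance(genes_a, genes_b):
    """1 minus the fraction of shared gene clusters among all clusters present in either genome."""
    union = genes_a | genes_b
    if not union:
        return 0.0
    return 1.0 - len(genes_a & genes_b) / len(union)


def distance_matrix(gene_sets):
    """Return sorted genome names and the symmetric matrix of pairwise Jaccard distances."""
    names = sorted(gene_sets)
    matrix = [[jaccard_distance(gene_sets[a], gene_sets[b]) for b in names] for a in names]
    return names, matrix
```

The resulting matrix could then be passed to any neighbor-joining implementation; genomes sharing most of their gene clusters receive distances close to zero and end up close together on the tree.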

Comparing the phylogenetic relationships among prasinoviruses and chloroviruses from analyses of gene presence and absence and of DNApol sequences drew a congruent picture of the relationships among viruses. This is evident from the topology of phylogenetic trees based on gene presence and absence (Figure 3) and full-length DNApol sequences (Figure 4). Additionally, the phylogenetic distances between pairs of viruses based on gene presence-absence data and full-length DNApol sequences (Table 3) were highly correlated (Mantel test), whether chloroviruses were included in the analysis (r = 0.99) or not (r = 0.96). Pairs of viruses with a high degree of similarity in their gene content, measured by gene presence-absence, also have low phylogenetic distances based on full-length DNApol sequences. Amplicons from environmental samples have to be clustered at an appropriate identity level that is specific to the full-length DNApol. Correlating the variation in phylogenetic distance of full-length DNApol sequences to different levels of amino acid (aa) identity of corresponding amplicons showed decreasing variation with increasing stringency (Figure 5). The variation approached zero when clustering DNApol fragments at 97% identity, which was applied to the environmental sequences in this study.
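The Mantel test mentioned above correlates two distance matrices over the same set of viruses and assesses significance by permuting the labels of one matrix. A minimal Python sketch follows; the number of permutations, the one-sided test, and the fixed seed are assumptions, since the text does not state how the test was configured.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def upper_triangle(matrix, order):
    """Flatten the upper triangle after relabelling rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Return (r, p): matrix correlation and a one-sided permutation p-value."""
    identity = list(range(len(dist_a)))
    flat_a = upper_triangle(dist_a, identity)
    r_observed = pearson(flat_a, upper_triangle(dist_b, identity))
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute the labels of the second matrix
        if pearson(flat_a, upper_triangle(dist_b, order)) >= r_observed:
            at_least_as_large += 1
    p_value = (at_least_as_large + 1) / (permutations + 1)
    return r_observed, p_value
```

Here `dist_a` could hold gene presence-absence distances and `dist_b` DNApol distances for the same viruses in the same order; a value of r near 1 indicates that the two measures rank virus pairs almost identically.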


Environmental Prevalence of Prasinoviruses Show Adaptation to Environmental Conditions
To study the distribution of prasinovirus ecotypes in the five environmental samples, Jericho Pier (Table S1).
Combined over all samples, environmental DNApol fragments from phycodnaviruses of about 129 aa length, pooled at 97% similarity, produced 197 operational taxonomic units (OTUs), including the references. Phylogenetic analysis of these sequences revealed that they clustered into several groups, with most nodes being supported by bootstrap values above 75% (Figure 7). The distribution of reference sequences on the tree matches the topology of trees based on gene presence or absence (Figure 3), and full-length DNApol sequences (Figure 4). Some of the environmental sequence groups were associated with sequences from prasinovirus isolates, while others were distant from known prasinovirus sequences. Moreover, the most abundant environmental sequence from each sample clustered relatively near a sequence from a prasinovirus, with the exception of the most abundant sequence from the SI deep sample, which lies on a distant branch that only contains environmental sequences.

Discussion
This study highlights the similarities and differences among the genomes of M. pusilla viruses and other phycodnaviruses, as well as their distribution in the environment. In particular, the results show that there is substantial overlap in the gene content among viruses infecting the genera Micromonas, Ostreococcus, and Bathycoccus; however, there is also a large "flexible" component to their genomes. Moreover, there is considerable divergence among the Micromonas viruses, with the variation within these viruses being as large as it is among the sequenced prasinoviruses. Finally, an analysis of environmental DNApol sequences reveals an expansive diversity of viruses closely related to prasinovirus isolates and a niche-specific distribution of ecotypes in environmental samples. These findings are discussed in detail below.

Origin and Distribution of Genes in Micromonas Viruses
The Micromonas viruses MpV1, MpV-PL1, and MpV-SP1 show a high degree of genome similarity to each other, as well as to Ostreococcus viruses, in terms of the number of ORFs, ORF length, GC content, and tRNAs, relative to Bathycoccus viruses [11,12] and MpV-12T. Specifically, MpV-12T has a lower GC content and larger ORF length, is more similar to Bathycoccus viruses, and was isolated on a different host strain than MpV-PL1 and MpV-SP1. Additionally, although MpV-12T has a wide host range [14], it does not infect the host of MpV-PL1 and MpV-SP1. Genomes were compared for homologues by clustering at an amino acid identity of 50%. This cut-off was selected based on identities among obvious homologues by annotation and on the sensitivity of the UCLUST algorithm, which applies the same identity definition as BLAST [42]. MpV-PL1, MpV-SP1, and MpV1 shared most of their genes, while more than half of the MpV-12T genome was not shared with the other Micromonas viruses (Figure 1). Having only 80 genes shared among the Micromonas viruses at this identity level is low relative to the seven sequenced Ostreococcus lucimarinus viruses [11], which shared most of their genes and had pairwise nucleotide identities above 60% for their core genes.
The identification of 80 ORFs with functional annotation that were shared among all of the Micromonas viruses was the basis for defining a core genome among this group of viruses, with the rest of the genes being assigned to the "flexible" pan-genome. These high-similarity core genes were supplemented with results from an additional BLAST-P analysis, which increased the total core genome to 119 putative genes. This additional analysis revealed genes for viral replication and virion structure, as well as phoH, a gene which is induced under phosphate stress (Table 2). PhoH is widely distributed in marine phages and has been used as an alternative marker gene [55,56] for phages and also eukaryotic viruses in diversity studies, yet its exact function is not well defined. The core genes associated with viral replication were also found in Ostreococcus viruses, although the set of conserved genes in Micromonas viruses appears smaller than that described for Ostreococcus viruses [10,11,13]; however, it is much larger than that found in the NCLDV super group [8].
In contrast to the core genome, there is also a shared, but flexible, pan-genome that varies among Micromonas viruses. Most of these ORFs have no functional annotation, and those that do have been seen in other prasinovirus genomes. The gene complex for amino-acid synthesis found in MpV1 and Bathycoccus viruses [12] is also present in MpV-PL1. Both MpV-PL1 and MpV-12T carry a copy of heat shock protein 70 despite their otherwise limited genome overlap. Two transketolase genes, part of the Calvin cycle and pentose phosphate pathway, are only found in MpV-SP1, but have phycodnavirus homologues in metagenomes from Yellowstone Lake [57]. A homologue of 6-phosphofructokinase, a key enzyme of glycolysis, is present in MpV-PL1 and MpV-SP1, similar to Ostreococcus viruses [13]. Given that all these genes were expressed during a transcriptional study of M. pusilla UTEX LB991 infected by MpV-SP1 [58], similarly to the expression of transaldolase, glucose-6-phosphate dehydrogenase and 6-phosphogluconate dehydrogenase in cyanophages [59], these genes may influence the host's metabolism during infection to boost viral replication. Furthermore, the presence of dTDP-D-glucose 4,6-dehydratase and glycosyl transferase in MpV1, MpV-PL1, and MpV-SP1 indicates activity in sugar manipulation and potential glycosylation of proteins, similar to findings of glycosyl transferase in the Ostreococcus virus OtV1 [13]. As well, UDP-glucose 6-dehydrogenase found in MpV-12T and MpV-SP1 could feed products of glycolysis into glycosylation of proteins, e.g., of the capsid, similar to suggestions by Weynberg et al. [13] and Wang et al. [60].
Altogether, these presumably cell-derived metabolic genes have the potential to boost critical cell function for viral replication and could, thus, be beneficial to viral production.
The search for host homologues of viral genes resulted in only six ORFs being shared between Micromonas viruses and the host strain for MpV-PL1 and MpV-SP1 at a similarity level of 50%. In contrast, Ostreococcus viruses share 11 genes with their host [13], but often at lower amino acid identities to host homologues. A more detailed BLAST-P analysis of Micromonas virus ORFs that have a functional annotation revealed that most ORFs have the highest similarity to ORFs typically found in other viruses. Moreover, OTUs with high similarity to cellular sequences were from bacteria and eukaryotes that are not potential host taxa (Figure 2). This is similar to findings for the Ostreococcus virus OtV5 [10] and the Mollivirus, an NCLDV that infects Acanthamoeba [30]. Another comparison of prasinoviruses of different hosts also revealed a pattern of shared metabolic genes with an origin outside the host range, suggesting horizontal gene transfer [12]. Furthermore, horizontal gene transfer is believed to be the main mode to acquire novel genes for viruses of Ostreococcus and Micromonas [61], and to be beneficial for the virus [62]. Chlorovirus genomes show a similar pattern with a relatively large flexible pan-genome, a wide range of protein homologues, and evidence of horizontal gene transfer [63]. The data presented here provide putative evidence that horizontal gene transfer from a range of sources is widespread among viruses of Micromonas, possibly under selection pressure to adapt to environmental conditions.

Deriving Similarity in Gene Content from DNA Polymerase
Measuring the prevalence of viruses with specific genetic repertoires (ecotypes) in the environment poses a challenge. This problem was approached by first constructing a phylogenetic tree based on the presence and absence of genes, in order to infer how closely related the viruses were to each other (Figure 3), and then correlating it with the phylogeny based on full-length DNApol sequences (Figure 4) and PCR amplicons.
The phylogeny based on gene presence and absence data presents an overall view of the genetic similarity among the prasinoviruses and its relationship to Chlorella viruses. While Ostreococcus and Bathycoccus viruses form well-defined groups, the Micromonas viruses are more scattered across the tree with MpV-12T on an isolated branch, suggesting substantial gene loss and transfer among these viruses. The relatively low bootstrap values in the gene presence and absence tree are similar to those in other phylogenies based on this technique [30,31]. This reflects that M. pusilla viruses generally share many genes, but MpV-PL1 shares more genes with OtV5, a virus that infects Ostreococcus sp. (125), than it does with MpV-12T (93) (Supplementary Figure S1). Furthermore, the phylogenetic tree based on the presence and absence of genes is similar in topology to the phylogenetic relationship inferred from whole-gene DNApol sequences, as well as others based on DNApol sequences or the presence and absence of genes [8,11,28-30].
Comparing pairwise phylogenetic distances based on gene presence and absence and DNApol showed strong congruency in the Mantel test (Table 3). This implies that DNApol sequences can be used to infer phylogenetic relationships among environmental sequences to assess the diversity of prasinoviruses in environmental samples as has been done [21,28], and that it is a strong proxy to infer the gene content among prasinoviruses.
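For readers unfamiliar with the Mantel test, the following is a minimal pure-Python sketch of a one-sided permutation version using the Pearson correlation. It assumes two symmetric distance matrices with identical taxon order; the function names, the number of permutations, and the toy matrices are illustrative assumptions and do not reproduce the settings or data behind Table 3.

```python
import random
from math import sqrt


def upper_triangle(matrix, order):
    """Flatten the above-diagonal entries after reordering rows and columns."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mantel(dist_a, dist_b, permutations=999, seed=0):
    """Return (observed r, one-sided permutation p-value)."""
    identity = list(range(len(dist_a)))
    flat_a = upper_triangle(dist_a, identity)
    r_obs = pearson(flat_a, upper_triangle(dist_b, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        perm = identity[:]
        rng.shuffle(perm)  # relabel the taxa of the second matrix
        if pearson(flat_a, upper_triangle(dist_b, perm)) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (permutations + 1)


# Hypothetical toy matrices: the second is the first scaled by two,
# so the two distance sets are perfectly correlated.
toy_gene_dist = [[0, 1, 2, 3], [1, 0, 4, 5], [2, 4, 0, 6], [3, 5, 6, 0]]
toy_pol_dist = [[2 * v for v in row] for row in toy_gene_dist]

r, p = mantel(toy_gene_dist, toy_pol_dist)
print(round(r, 3), p)
```

Permuting taxon labels rather than individual matrix entries keeps the dependence among distances that share a taxon, which is why an ordinary correlation test is not used on distance matrices.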
However, because PCR only amplifies a gene fragment, sequences need to be clustered at 97% amino acid identity to be specific to full-length DNApol sequences of viruses with diverse gene content. This is less stringent than the approach of Short and Short [23], who clustered sequences at 97% identity at the nucleotide level, and that of Bellec et al. [35], who considered single-nucleotide differences as defining a distinct Ostreococcus virus haplotype. In contrast, it is more stringent than clustering at 75% identity, which was used in another study on prasinovirus distribution [36]. Overall, the identity level used in this study is appropriate to approximate the similarity and difference in gene content of viruses in environmental samples.
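The effect of an identity threshold can be illustrated with a minimal greedy clustering sketch in Python. It assumes pre-aligned amino acid sequences of equal length and assigns each sequence to the first cluster representative it matches at or above the threshold; this is a simplification for illustration and not the clustering tool used in this study. The sequence names and sequences are hypothetical.

```python
def identity(seq_a, seq_b):
    """Fraction of identical positions between two aligned sequences."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)


def greedy_cluster(sequences, threshold=0.97):
    """Assign each sequence to the first representative it matches.

    sequences: list of (name, aligned_sequence) tuples, processed in order.
    Returns a dict mapping representative name -> list of member names.
    """
    representatives = []
    clusters = {}
    for name, seq in sequences:
        for rep_name, rep_seq in representatives:
            if identity(seq, rep_seq) >= threshold:
                clusters[rep_name].append(name)
                break
        else:
            # No representative is similar enough: start a new cluster.
            representatives.append((name, seq))
            clusters[name] = [name]
    return clusters


# Hypothetical 100-residue fragments differing at one or five positions.
toy_sequences = [
    ("ref", "A" * 100),
    ("one_sub", "A" * 99 + "G"),       # 99% identical to ref
    ("five_sub", "A" * 95 + "G" * 5),  # 95% identical to ref
]

otus = greedy_cluster(toy_sequences, threshold=0.97)
print(otus)
```

With the threshold lowered to 0.75, as in the less stringent approach cited above, all three toy sequences would fall into a single cluster, which shows how the choice of cutoff changes the number of OTUs.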

Environmental Prevalence of Prasinoviruses Shows Adaptation to Environmental Conditions
With a framework to infer the phylogenetic relationship and similarity in the genetic repertoire among prasinoviruses based on DNApol amplicons, the approach was used to determine how well represented the sequenced prasinoviruses were across environmental samples. The four Micromonas viruses examined in this study were isolated from widely separated geographic areas. MpV-SP1 and MpV-PL1 were isolated from water collected from Scripps Pier (San Diego, CA, USA) and Port Aransas (TX, USA), respectively [64], MpV1 was isolated from a eutrophic coastal lagoon in the northwestern Mediterranean [12], and MpV-12T was isolated off the Dutch coast [14]. Although Micromonas viruses occur in the coastal waters of British Columbia [64,65], none of the sequenced isolates were from the region; hence, it was unknown if these genotypes would be well represented in these waters.
Five environmental samples from British Columbia coastal waters that reflect a range of conditions were analyzed for prasinovirus ecotypes and in situ conditions. Saanich Inlet is productive and stratified in spring and summer, and is isolated from deeper waters beyond the inlet because of a shallow sill; this leads to hypoxic deeper waters [66]. JP is strongly stratified, with fresh water influenced by water from English Bay that is adjacent to the city of Vancouver, while PA is more exposed and mixed with a higher salinity. JF is off the coast of Victoria in very exposed and mixed waters of the Juan de Fuca Strait [67]. These differences are reflected in the prevailing salinity, temperature, and chlorophyll concentrations at the sampling locations (Figure 6). While JP and PA are similar in their high SST of 18 °C and Chl a concentrations, JF is a much deeper mixed water body with an SST of only 10 °C and a lower Chl a concentration. However, PA is more similar to JF in terms of salinity, with both being 23 PSU. The combined DNApol sequences from all five environmental samples produced 197 distinct OTUs, which were used to build a diverse and well-supported maximum likelihood DNApol tree displaying the prasinovirus and chlorovirus diversity.
The multitude of well-defined branches on the DNApol tree suggests a great diversity in prasinovirus ecotypes and their genetic repertoires, and the tree visualizes their distribution across environments (Figure 7). The distribution of reference viruses on the tree generally reflects the topology of the reference trees based on full-length DNApol sequences and on the presence and absence of genes, confirming the approach. The environmental OTUs substantially increase the known richness of prasinoviruses, and especially Micromonas viruses, in the environment. Furthermore, the specific distribution of the representative OTUs for each of the five environments suggests a specialization of the corresponding viral ecotypes to prevailing conditions. The Saanich Inlet samples, being long-term integrated samples, should be compared with each other rather than with the other three samples. Despite the Aqua MODIS data showing JP and PA being similar in temperature and JP, PA, and JF having similar Chl a concentrations and PAR levels, PA and JF are more similar environments based on their in situ salinities and presumed mixing. This is also reflected in the dominant prasinovirus genotypes for the samples. Saanich Inlet deep sequences and the stratified, near-shore JP sequences are on separate isolated branches. The dominant sequences in Saanich Inlet surface samples, and especially the two mixed, more saline PA and JF samples, share branches. This specialization of viruses to environments is congruent with findings that prasinovirus communities in the Northwest Mediterranean Sea are affected by environmental variables, and especially nutrient availability [36]. Additionally, considering the relatively wide host range of these viruses within a genus [14,62], the pattern likely represents a response by the prasinovirus community to the specific environmental conditions and not solely the host community.
Altogether this could mean that prasinovirus ecotypes with similar genetic repertoires, approximated by DNApol similarity, dominate in similar environments.
In conclusion, this research highlights the genetic repertoire encoded by prasinoviruses infecting M. pusilla and other prasinophytes. We identified a core set of genes that are shared among Micromonas viruses despite their marked differences, and identified a large set of genes that make up a flexible part of the genome, implying that there is a large "pan-genome" that is shared among prasinoviruses. Furthermore, we set the Micromonas virus genomes in contrast to genomes from other prasinoviruses, phycodnaviruses, and a host genome, elucidating the overlap in gene content. The presumed origin of shared genes and their distribution across viral clades show a complex evolutionary history and horizontal gene transfer. The diversity in prasinovirus genomes is linked to their distribution pattern in nature, implying adaptation of viral ecotypes to their environment.
Supplementary Materials: The following are available online at www.mdpi.com/1999-4915/9/5/116/s1, Figure S1: Venn diagram of the CDS of four prasinoviruses, based on clusters by 0.5 amino-acid identity, Table S1: Sampling details and in situ conditions for environmental amplicon sequences from Jericho Pier, Point Atkinson, Juan de Fuca Strait, and Saanich Inlet.