Prophage Genomics and Ecology in the Family Rhodobacteraceae

Roseobacters are globally abundant bacteria with critical roles in carbon and sulfur biogeochemical cycling. Here, we identified 173 new putative prophages in 79 genomes of Rhodobacteraceae. These prophages represented 1.3 ± 0.15% of the bacterial genomes and had no to low homology with reference and metagenome-assembled viral genomes from aquatic and terrestrial ecosystems. Among the newly identified putative prophages, 35% encoded auxiliary metabolic genes (AMGs), mostly involved in secondary metabolism, amino acid metabolism, and cofactor and vitamin production. The analysis of integration sites and gene homology showed that 22 of the putative prophages were actually gene transfer agents (GTAs) similar to a GTA of Rhodobacter capsulatus. Twenty-three percent of the predicted prophages were observed in the TARA Oceans viromes generated from free viral particles, suggesting that they represent active prophages capable of induction. The distribution of these prophages was significantly associated with latitude and temperature. The prophages most abundant at high latitudes encoded acpP, an auxiliary metabolic gene involved in lipid synthesis and membrane fluidity at low temperatures. Our results show that prophages and gene transfer agents are significant sources of genomic diversity in roseobacter, with potential roles in the ecology of this globally distributed bacterial group.


Introduction
Prophages (phage genomes integrated into bacterial genomes) represent a major driving force in bacterial evolution due to genomic rearrangements and the diversification of genetic repertoires [1][2][3][4][5]. Half of the fully sequenced bacterial genomes contain at least one prophage and prophage-encoded genes can represent up to 35% of a bacterial species pangenome [2,6,7]. Over time, prophages can lose their ability to induce and become genomic islands domesticated by the bacterial hosts [1]. These defective prophages commonly encode genes that have been coopted by the bacterial hosts for other functions, such as cell communication and warfare [8]. The changes in bacterial metabolism and ecological interactions caused by active and defective prophages have the potential to impact the biogeochemical and ecological roles of globally abundant bacterial groups.
Rhodobacteraceae can comprise up to 36% of the bacterial community in marine habitats such as the upper mixed layer of the ocean [9][10][11]. The Rhodobacteraceae family includes members of the Roseobacter genus and is historically referred to as the roseobacter group [12]. Sulfur metabolism and aerobic anoxygenic photosynthesis catalyzed by roseobacters impact global carbon and sulfur biogeochemical cycles [13][14][15]. Roseobacters provide vitamin B 12 to B 12 -auxotrophic eukaryotes including diatoms, dinoflagellates, and coccolithophores, impacting primary production at large scales [16][17][18]. Although most roseobacters are marine, many representatives inhabit soils, freshwater lakes, and hypersaline habitats [19]. abundance, as indicated by a large change in roseobacter-encoded transposases during a phytoplankton bloom [58].
Here, we investigated the genomics and ecology of prophages integrated in the genomes of 79 bacterial species representative of the roseobacter group [59]. Our data show that roseobacter prophages are host-specific, with the exception of one Microviridade prophage that was found in three roseobacter species. Sequences with high similarity to the newly identified prophages were present in metagenomes from the free viral fraction of seawater samples, indicating that they represent active prophages capable of induction. Our study shows that globally abundant prophages encode auxiliary metabolic genes that likely impact their host metabolism and ecology.

Prophage Identification
The roseobacter group was initially characterized as a clade; however, a recent phylogenomic analysis showed that the group is not monophyletic [59]. Results have shown six monophyletic clades containing members of the initial roseobacter group and a seventh clade that was not considered roseobacter [59]. Therefore, roseobacter (not capitalized or italicized) is used here not as a taxonomic group, but as an operational term referring to the Rhodobacteraceae that were historically classified as members of the Roseobacter clade. We selected 79 genomes of bacterial species belonging to 36 genera that are representative of the seven Rhodobacteraceae clades to investigate the genomics and distribution of prophages across the group. Complete genomes were retrieved from the NCBI RefSeq (RefSeq accession numbers, genome length, and number of contigs for each genome are reported in Table S1). We double-checked genome completion using CheckM [60]. All but one (Rubellimicrobium thermophilum, which is 87% complete) of the 79 genomes had more than 90% completion. Prophages encoded in these genomes were identified and annotated using VIBRANT (Virus Identification by IteRative ANnoTation) using default parameters (minimum scaffold length 1000 bp, minimum of four open reading frames) [61]. The bacterial genome files in FASTA format were supplied to VIBRANT, which performs Hidden Markov Model (HMM) searches against the Pfam, KEGG, and VOG databases [62][63][64]. VIBRANT identified 173 putative prophages based on the presence of viral hallmark genes: 5 as complete, 17 as high-quality drafts, 28 as medium-quality, and 123 as low-quality drafts (Table S2). The nucleotide sequences of the predicted prophages are available on FigShare (deposited on 10 April 2021). The contribution of prophages to the total bacterial genome (herein referred to as prophage density) was calculated by summing the genome length of all prophages in a given bacterial genome and dividing the result by the host genome length (Table S1). This analysis was performed separately for high-and medium-quality prophages and for all prophages identified.

Prophage Taxonomy and Phylogenomics
Putative prophages were dereplicated by clustering nucleotide sequences using CD-HIT at 95% identity [65]. Three approaches were utilized for prophage classification. First, gene annotations with high similarity to reference phages according to VIBRANT annotations were used to assign putative Lambda-like, Mu-like, other Caudovirales, and Microviridae prophages. Second, all predicted phage sequences were compared with the NCBI RefSeq viral database (10,003 genomes, accessed and downloaded 18 September 2020) using tBLASTx (E-value ≤ 10 −5 ). The nucleotide sequences of high-and mediumquality phages were also compared to the JGI IMG Viral Database of cultivated and uncultivated viruses version 3.0 (>2,000,000 viral genome fragments, accessed on 4 January 2021) using the JGI web-based BLASTn tool (identity ≥ 30%, alignment length ≥ 90 bp, E-value ≤ 10 −5 ) [66]. The IMG/VR 3 identified 31 viral contigs with similarities to the putative Roseobacter prophages. These sequences were combined with the 50 putative prophages with medium and high quality identified by VIBRANT for a phylogenomic analysis using the GL-UVAB workflow [67]. Briefly, the GL-UVAB database "Vir_DB_Nuc" (Accessed 16 November 2020) contained 195,698 reference (NCBI RefSeq) and metagenomeassembled viral sequences from 10 studies [67]. The database was filtered by excluding sequences shorter than 2 Kb in length and dereplicated at 98% identity, resulting in a database of 159,368 sequences [67,68]. Prodigal in metagenomic mode (-p meta) was used to predict protein encoding genes for both the GL-UVAB database and our predicted phages [69]. An all versus all amino acid comparison of the translated proteins was performed using DIAMOND (more sensitive mode, identity ≥ 30%, bitscore ≥ 30, alignment length ≥ 30 amino acids, and E-value ≤ 0.01) [70]. Contigs from the Vir_DB_Nuc sharing a minimum 3 proteins with putative prophages were extracted from the database and combined with the roseobacter prophage sequences for the phylogenomic analysis. Pairwise distances between genomes (putative roseobacter prophages and sequences from Vir_DB_Nuc and IMG/VR 3) were calculated using the Dice coefficient D AB = 1 − 2 × AB AA+BB where AB represents the sum of valid protein matches between sequences, and AA and BB are protein matches of sequences against themselves [67]. A tree was calculated from the pairwise Dice distances using the Phangorn package in R and visualized on the iTOL interactive web interface [71,72]. The original IDs of genomes and contigs included in the final tree, along with the information on the source database and isolation source, are described in Table S3.

Auxiliary Metabolic Genes and Gene Transfer Agents
Auxiliary metabolic genes identified by VIBRANT using the HMM searches against Pfam, KEGG, and VOG were binned into metabolic pathways by searching the Pfam accession of each gene in the KEGG database. The metabolic pathways of the genes were grouped by the prophage host genera (Table S4). The position of these genes in the phage genomes was visualized using EasyFig and Clinker [73,74]. Predicted prophages with the AMG cysE, the most abundant AMG identified in this dataset, were compared to each other and to the Rhodobacter capsulatus GTA sequence downloaded from the NCBI (GCA_000021865.1) using Clinker [74]. Clinker produces global alignments of amino acid sequences using the BLOSUM62 substitution matrix. The alignments were visualized with an identity threshold of 0.5.

Presence of Prophages in an Algae Bloom
Many roseobacters form symbiotic relationships with marine phytoplankton and macroalgae [13,[75][76][77]. To test if the putative prophages identified were also observed in association with phytoplankton, we mapped metagenome reads from a dinoflagellate bloom against the 44 high-and medium-quality prophages identified here. The metagenomes were obtained from a previous study of a Gymnodinium catenatum dinoflagellate bloom in Shenzhen, China [78,79]. Metagenomes were generated by filtering water samples through a 10 µm filter followed by FeCl 3 flocculation, 0.22 µm filtration, DNA extraction of the filters, and sequencing on an Illumina HiSeq 2000 (San Diego, CA, USA) [78]. Raw metagenome reads in FASTQ format were retrieved from the NCBI Sequence Read Archive (SRA) using the SRA toolkit (Accessed 16 December 2020). These viromes are YT1, YT3 (pre-bloom), YT4, YT5, YT6 (bloom), and YT8, YT19, and YT11 (post-bloom) [78]. The metagenomes were filtered in BBDuk, BBTools version 38.86 using quality trimming of both right and left sides (qtrim = rl), (trimq = 30), adapter trimming of both sides with a k-mer size of 23 (ktrim = l, ktrim = r, k = 23, mink = 11), a hamming distance of one (hdist = 1), and tpe and tpo parameters. Reads with an average quality below 30 (maq = 30), entropy below 0.90 (entropy = 0.90), and matches to PhiX were also removed [80]. Metagenome reads were mapped to the putative prophages using SMALT [81]. A SMALT index was created with a word length of 20 and a sampling step size of 10. The metagenomes were mapped to the index at 80% identity and only those prophage sequences recruiting at least 10 reads from at least one metagenome were kept in further analyses. The abundance of each prophage in the metagenome was visualized using the ANVI'O metagenomic workflow and using the "abundance" display mode in the interactive interface [82]. The abundances of algae and Rhodobacterales (as the closest approximation for the roseobacter group) in the seawater during the bloom were calculated in the original manuscript describing this algae bloom [79]. Briefly, dinoflagellate abundance was obtained by light microscopy counts. Bacterial cells stained with DAPI were quantified using an epifluorescence microscope. Abundance data were retried from the original manuscript using Webplot Digitizer version 4.4 (https://automeris.io/WebPlotDigitizer, accessed on 4 January 2021). To calculate Rhodobacterales abundance in cells/ml, the relative abundances from the metagenomes were multiplied by the total bacterial counts from epifluorescence microscopy.

Global Distribution of Roseobacter Prophages
We queried the TARA Oceans viromes for the presence of sequences closely related to the high-and medium-quality putative prophages identified here [83,84]. Raw reads from 200 TARA Oceans viral metagenomes (seawater filtered in 0.22 µm filters and concentrated using iron chloride flocculation) were mapped to our predicted prophages using Bowtie2 with the sensitive-local alignment mode [83,85,86]. The abundance of each prophage in the virome was expressed as reads per kilobase per million (RPKM) and visualized in R using the mapdata package [87]. The relative abundances of each prophage per virome were used as input for random forest analyses with 1000 permutations supervised by latitude, depth, temperature at 10 m depth, chlorophyll at 10 m depth, and chlorophyll at sample depth [88]. The mean decreasing error was used to rank important variables (prophages) with distributions significantly associated with latitude and temperature.

Putative Prophages in Roseobacter Genomes
The 79 roseobacter genomes analyzed here had an average length of 4.2 ± 0.076 Mbp (Mean ± SE). A total of 173 putative prophages were identified in these genomes by VIBRANT (5 complete, 17 high quality, 28 medium quality, and 123 low quality, Table S2), with at least one prophage identified in 66 roseobacter genomes. Five of the putative prophages had direct terminal repeats and formed complete circular genomes. Roseobacters had 2.19 ± 0.19 (Mean ± SE) prophages per genome ( Figure 1A). If only high-and mediumquality prophages were taken into account, roseobacters had an average of 0.96 ± 0.14 (Mean ± SE) prophages per genome ( Figure S1A). There was no relationship between the number of prophages and bacterial genome completeness (Linear regression p > 0.05). Roseobacter genome lengths were not significantly related to prophage length (Linear regression p > 0.05), and there was a weak relationship between the number of prophages and bacterial genome length (linear regression p = 0.00028, R 2 = 0.158). Similar results were observed when analyzing the subset of high-and medium-quality predicted prophages ( Figure 1A). The fraction of the total bacterial genome represented by putative prophages was defined as the prophage density. Prophage density was not significantly associated with the host genome length (Linear regression p > 0.05, Figure 1B), roseobacter clades as defined by Simon et al. 2017, isolation source (ANOVA p > 0.05 for both clades and isolation source), or host genome completeness (linear regression p > 0.05). Prophage length displayed a bimodal distribution (Hartigans' dip test for unimodality p > 0.05, alternative hypothesis is at least bimodal) ( Figure 1C). A taller peak was observed between 10 and 15 Kbp with a higher proportion of Caudovirus-like phages, and a smaller peak between 35 and 40 Kbp with a higher proportion of Mu-like and unidentified phages ( Figure 1C). This pattern was different for the subset of high-and medium-quality prophages, which showed a unimodal distribution with a peak between 35 and 45 Kbp mostly comprised of Mu-like and unclassified phages (Hartigans' dip test for unimodality p < 0.05, alternative hypothesis is at least bimodal) ( Figure S1C). 10 and 15 Kbp with a higher proportion of Caudovirus-like phages, and a smaller peak between 35 and 40 Kbp with a higher proportion of Mu-like and unidentified phages ( Figure 1C). This pattern was different for the subset of high-and medium-quality prophages, which showed a unimodal distribution with a peak between 35 and 45 Kbp mostly comprised of Mu-like and unclassified phages (Hartigans' dip test for unimodality p < 0.05, alternative hypothesis is at least bimodal) ( Figure S1C). Of the 50 high-or medium-quality prophages, 5 encoded an integrase, 18 a transposase, and 23 a recombinase. Eight prophages encoded no genes for integration and none of them encoded an excisionase. In total, 35% of the high-and medium-quality prophages were integrated into a tRNA gene, including tRNA-Gln, Leu, Ser, Gly, Thr, Val, Arg, Asn, and Met (Table S2). Three circular prophages were identified in five species, indicating complete genomes. The first, Celeribacter halophilus prophage 1, was 44,393 bp long, and the second, Loktanella hongkongensis prophage 1, was 17,661 bp long ( Figure S2) and encoded a Caudovirus-like prohead serine protease (Pfam database, PF04586.17). No auxiliary metabolic genes were present in these two prophages. The third complete circular prophage was present in three roseobacter genomes with 100% sequence identity at the nucleotide level (BLASTn, E-value < 10 −5 ) ( Figure 2). The hosts for this prophage were Jannaschia rubra, Shimia marina, and Nautella italica, which belong to roseobacter clades 7, 1, and 1, respectively. In each host, the prophages had similar lengths, 5421 bp, 5478 bp, and 5441 bp. This prophage was classified as Microviridae and encoded three hallmark genes of Microviridae: the capsid protein (PF02305.17), the bacteriophage replication gene A protein (PF05840.13), and the microvirus J protein (PF04726.13) [89,90]. Of the 50 high-or medium-quality prophages, 5 encoded an integrase, 18 a transposase, and 23 a recombinase. Eight prophages encoded no genes for integration and none of them encoded an excisionase. In total, 35% of the high-and medium-quality prophages were integrated into a tRNA gene, including tRNA-Gln, Leu, Ser, Gly, Thr, Val, Arg, Asn, and Met (Table S2). Three circular prophages were identified in five species, indicating complete genomes. The first, Celeribacter halophilus prophage 1, was 44,393 bp long, and the second, Loktanella hongkongensis prophage 1, was 17,661 bp long ( Figure S2) and encoded a Caudovirus-like prohead serine protease (Pfam database, PF04586.17). No auxiliary metabolic genes were present in these two prophages. The third complete circular prophage was present in three roseobacter genomes with 100% sequence identity at the nucleotide level (BLASTn, E-value < 10 −5 ) ( Figure 2). The hosts for this prophage were Jannaschia rubra, Shimia marina, and Nautella italica, which belong to roseobacter clades 7, 1, and 1, respectively. In each host, the prophages had similar lengths, 5421 bp, 5478 bp, and 5441 bp. This prophage was classified as Microviridae and encoded three hallmark genes of Microviridae: the capsid protein (PF02305.17), the bacteriophage replication gene A protein (PF05840.13), and the microvirus J protein (PF04726.13) [89,90].

Auxiliary Metabolic Genes
Out of the 173 putative prophage regions identified in this study (including all low-, medium-and high-quality drafts), 61 encoded 98 auxiliary metabolic genes (Table S4). The AMGs were grouped into nine KEGG metabolic categories: amino acid metabolism, biosynthesis of secondary metabolites, carbohydrate metabolism, energy metabolism, folding, sorting and degradation, glycan biosynthesis, metabolism of cofactors and vitamins, metabolism of other amino acids, and xenobiotics biodegradation and metabolism ( Figure 3). Energy metabolism, biosynthesis of secondary metabolites, and amino acid metabolism were the most abundant metabolic categories, with 28, 25, and 21 genes, respectively. The KEGG metabolic categories of the AMGs did not cluster by roseobacter clades (Figure 3). The gene cysE (serine O acetyltransferase) in the energy metabolism category was the most abundant AMG in the dataset, encoded by 22 phages. The gene acpP (acyl carrier protein) in the biosynthesis of secondary metabolites category was encoded by 8 phages, and dcm (DNMT1: DNA methyltransferase 1) in the amino acid metabolism category by 17 phages.

Auxiliary Metabolic Genes
Out of the 173 putative prophage regions identified in this study (including all low-, medium-and high-quality drafts), 61 encoded 98 auxiliary metabolic genes (Table S4). The AMGs were grouped into nine KEGG metabolic categories: amino acid metabolism, biosynthesis of secondary metabolites, carbohydrate metabolism, energy metabolism, folding, sorting and degradation, glycan biosynthesis, metabolism of cofactors and vitamins, metabolism of other amino acids, and xenobiotics biodegradation and metabolism ( Figure 3). Energy metabolism, biosynthesis of secondary metabolites, and amino acid metabolism were the most abundant metabolic categories, with 28, 25, and 21 genes, respectively. The KEGG metabolic categories of the AMGs did not cluster by roseobacter clades (Figure 3). The gene cysE (serine O acetyltransferase) in the energy metabolism category was the most abundant AMG in the dataset, encoded by 22 phages. The gene acpP (acyl carrier protein) in the biosynthesis of secondary metabolites category was encoded by 8 phages, and dcm (DNMT1: DNA methyltransferase 1) in the amino acid metabolism category by 17 phages.  Three prophages-Sulfitobacter delicatus prophage 1, Sulfitobacter dubious prophage 2, and Celeribacter indicus prophage 2-encoded the three genes acpP, fabF, and fabG that are together involved in fatty acid synthesis. The gene acpP encodes an acyl carrier protein phosphopanthetine attachment site. The acyl carrier protein (ACP) carries the growing acyl chain for bacterial fatty acid synthesis, and is a target of the class of antimicrobial compounds pantothenate antimetabolites [91]. The gene fabF encodes 3-oxoacyl-[acyl carrier protein] synthase 2 and catalyzes the elongation of acyl-ACP [92]. The gene fabG encodes beta-ketoacyl-acyl carrier protein reductase and performs the first reductive step in the elongation of fatty acids [93]. The genomic organization for these three genes in all three phages from upstream to downstream is fabG, acpP, fabF. Other genes in the fab operon, fabD and fabH, were not present in any of the phages identified in our dataset. Three prophages-Sulfitobacter delicatus prophage 1, Sulfitobacter dubious prophage 2, and Celeribacter indicus prophage 2-encoded the three genes acpP, fabF, and fabG that are together involved in fatty acid synthesis. The gene acpP encodes an acyl carrier protein phosphopanthetine attachment site. The acyl carrier protein (ACP) carries the growing acyl chain for bacterial fatty acid synthesis, and is a target of the class of antimicrobial compounds pantothenate antimetabolites [91]. The gene fabF encodes 3-oxoacyl-[acyl carrier protein] synthase 2 and catalyzes the elongation of acyl-ACP [92]. The gene fabG encodes beta-ketoacyl-acyl carrier protein reductase and performs the first reductive step in the elongation of fatty acids [93]. The genomic organization for these three genes in all three phages from upstream to downstream is fabG, acpP, fabF. Other genes in the fab operon, fabD and fabH, were not present in any of the phages identified in our dataset.

Gene Transfer Agents
The high prevalence of cysE genes in 22 putative prophages prompted further investigation. All the 22 putative prophages encoding this gene were classified as low-quality drafts by VIBRANT, and gene transfer agents (GTAs) in the neighborhood of cysE are highly conserved within Rhodobacterales [94]. To test if the 22 putative prophages were actually GTAs, we compared their genomes with that of a GTA of Rhodobacter capsulatus strain 1003, rcGTA. These genomic regions had high synteny and identity with the core gene cluster of rcGTA, from gene 1 to gene 15 ( Figure 4). Therefore, these sequences were considered to be GTAs and were not included in subsequent analyses.

Prophage Phylogenomics
A phylogenomics approach identified the closest relatives of the putative roseobacter prophages in the GL-UVAB [67] and IMG/VR databases. Only 17 of the putative roseobacter prophages shared at least three genes with viral contigs in both databases and were analyzed for phylogenomic relationships. An all-versus-all comparison of the genomes at the protein level generated a tree with two phylogenomic lineages. Both lineages contained viral genomes from marine ecosystems, freshwater, human samples, wastewater, groundwater, sediments, and terrestrial environments, with no clustering based on environmental source ( Figure 5). The majority of hosts for the viral sequences obtained from the databases was unknown. Among the putative roseobacter prophages, sequences did not cluster based on roseobacter clades from Simon et al. 2017. Shimia haliotis prophage 1 was related to four Myoviridae phages infecting Pseudomonas and Enterobacter species (Enterobacter phage Arya, Escherichia phage vB_EcoM-ep3, Escherichia phage vB_EcoM_ECO1230-10, Pseudomonas phage PPpW-3) and uncultured myoviruses ( Figure 5). The only other Ref-Seq genome related to the prophages identified here was Rhodovulum phage RS1. Despite being the closest known relatives of the roseobacter prophages, these RefSeq viruses were distant from other sequences in the tree ( Figure 5). Other putative roseobacter prophages were related to five metagenome-assembled contigs, the putative hosts of which were also roseobacter: Rhodobacter sphaeroides, Loktanella vestifoldensis, Rhodobacter sp., Oceanicola sp., Sulfitobacter noctilucae, Leisingera daeponensis, Roseovarius sp., and Rugeria mobilis (Table S3).

Roseobacter Prophages in a Phytoplankton Bloom
Roseobacters form mutualistic associations with marine phytoplankton and their abundances frequently follow that of dinoflagellate blooms [13,[75][76][77]95]. To test if the putative roseobacter prophages identified here also associated with phytoplankton blooms, we searched for sequences with similarities to the roseobacter prophages in metagenomes from a bloom of the dinoflagellate Gymnodinium catenatum in the southern China Sea. Sequences with similarities to 17 of the high-and medium-quality putative roseobacter prophages were identified in the metagenomes obtained before, during, and after the bloom ( Figure S3). The three most abundant phages in all eight metagenomes were encoded by Shimia haliotis (Shimia haliotis prophage 1 and prophage 2) and Celeribacter marinus (Celeribacter marinus prophage 2). Each of these prophages recruited a minimum of 1000 reads from each virome. However, prophage abundance in the metagenome did not have a relationship with roseobacter or dinoflagellate abundances through the progression of the bloom. Shimia haliotis prophage 1 contained a cluster of four auxiliary metabolic genes involved in the biosynthesis of secondary metabolites; rfbA, rfbB, rfbC, and rfbD (K01790, K10710, K00067, K00973). The rfbABCD cluster encodes a group of enzymes that synthesize the sugar dTDP-L-Rhamnose. Rhamnose is present in polysaccharides, glycoproteins, and O-antigen lipopolysaccharides. The fifth gene, identified as glycogen synthase, is involved in carbohydrate metabolism (K16150). Celeribacter marinus prophage 2 contained a single AMG, acpP (acyl carrier protein, phosphopanthetine attachment site). Shimia haliotis prophage 2 encoded no AMG.

Prophage Phylogenomics
A phylogenomics approach identified the closest relatives of the putative roseobacter prophages in the GL-UVAB [67] and IMG/VR databases. Only 17 of the putative roseobacter prophages shared at least three genes with viral contigs in both databases and myoviruses ( Figure 5). The only other RefSeq genome related to the prophages identified here was Rhodovulum phage RS1. Despite being the closest known relatives of the roseobacter prophages, these RefSeq viruses were distant from other sequences in the tree ( Figure 5). Other putative roseobacter prophages were related to five metagenomeassembled contigs, the putative hosts of which were also roseobacter: Rhodobacter sphaeroides, Loktanella vestifoldensis, Rhodobacter sp., Oceanicola sp., Sulfitobacter noctilucae, Leisingera daeponensis, Roseovarius sp., and Rugeria mobilis (Table S3).   (Table S3). For viruses with predicted taxonomy and/or predicted hosts, the lowest taxonomic level is shown.

Global Distribution of Roseobacter Prophages
To investigate the global distribution of the putative roseobacter prophages identified here, reads with similarity to the 50 high-and medium-quality prophage sequences were searched in 200 viromes from the TARA Oceans Expeditions ( Figure 6A). Sequences mapped to 41 of these prophages, and their relative abundances in the virome (in reads per kilobase per million), were significantly explained by latitude and temperature ( Figure 6, supervised permutational random forest test, variability explained by temperature and latitude = 76.92 and 71.06%, respectively). The random forest models with depth, chlorophyll at 10 m depth, and chlorophyll at sample depth had low explanation power (26.21, 18.26, and 20.78%, respectively). The variable importance analyses from the supervised random forest identified the three putative prophages with the strongest relationship with both temperature and latitude ( Figure 6B-D). The abundances of Litorimicrobium taeanense prophage 4, Celeribacter marinus prophage 2, and Loktanella atrilutea prophage 1 increased at higher latitudes. Both Celeribacter marinus prophage 2 and Loktanella atrilutea prophage 1 encode the AMG acpP, while Litorimicrobium taeanense prophage 4 does not encode any AMGs.

Prophages Contribute to the Evolution of Roseobacter Genomes
Throughout its evolutionary history, the roseobacter group has undergone a net genome reduction with periods of genome expansion that included biased gene acquisition [96]. These gene acquisitions likely included prophage integration events followed by prophage domestication. The bimodal distribution of phage genome sizes observed here was consistent with previous results obtained from bacteria with similar genome length [1,97,98]. This bimodal distribution is likely a result of genetic degradation of the prophage sequences due to gene loss and purifying selection [2]. This process is

Prophages Contribute to the Evolution of Roseobacter Genomes
Throughout its evolutionary history, the roseobacter group has undergone a net genome reduction with periods of genome expansion that included biased gene acquisition [96]. These gene acquisitions likely included prophage integration events followed by prophage domestication. The bimodal distribution of phage genome sizes observed here was consistent with previous results obtained from bacteria with similar genome length [1,97,98]. This bimodal distribution is likely a result of genetic degradation of the prophage sequences due to gene loss and purifying selection [2]. This process is known as prophage domestication by bacteria, and is a source of genetic novelty, contributing to a large fraction of the bacterial pangenomes [2]. The short putative prophage regions identified here are likely the result of these domestications. However, we found no evidence for a preferential retainment of auxiliary metabolic genes with ecologically or physiologically important functions for the host in the shorter putative prophages [99]. Likewise, no relationship between roseobacter genome length and prophage density (the fraction of the bacterial genome encoded in prophages) was observed ( Figure 1B). These results contrast with a positive relationship between genome length and prophage density in bacteria with genomes up to 5.5 Mb in length [6].

Roseobacter Prophages Have Narrow Host Range
Most phages have narrow host ranges, with some exceptions, such as T4-like cyanophages that infect more than one genus [100]. Here, only one putative prophage was observed in different roseobacter species, indicating that most of these phages are specialists. This is consistent with the infection patterns of isolated roseophages, most of which only infect their original host strain [40]. The only exception to phage specialism identified in this study was the complete circular Microviridae prophage found in the genomes of three roseobacter species belonging to two clades, Jannaschia rubra (clade 7), Shimia marina (clade 1), and Nautella italica (clade 1) (Figure 2) [59]. Prophages belonging to Microviridae have been found in other bacteria, including a marine alphaproteobacterium [101,102]. The microvirus identified here does not belong to the genus Gokushovirinae within the Microviridae, previously described in roseobacters [89,90]. Microviridae contain three core genes, the major capsid protein, the minor spike protein (pilot protein), and the replication initiation protein, typified in the Microviridae ΦX174 [89,90,103]. The generalism in this broad host range prophage could have been selected due to intraspecific competition for bacterial hosts or conserved cell surface targets for phage infection [104,105].

Auxiliary Metabolic Genes and Gene Transfer Agents
Among the 98 auxiliary metabolic genes identified here, none were consistently present across roseobacter genera or clades (Figure 3). Previously identified roseophage AMGs, such as thioredoxin genes (trx) and ribonucleoside triphosphate reductase (rnr), were not present in our dataset [23,44,46]. In a study categorizing the AMGs of isolated roseophages, seven genes were identified at high frequency: trx, grx, RNR, thyX, DCD, phoH, and mazG [106]. These AMGs are more frequent in lytic roseophages than those containing integrase, transposase, or repressor domains [106]. The putative prophages identified here did not encode these AMGs, suggesting that the previously described high-frequency genes are typical of lytic roseophages.
Three prophages in the genomes of Sulfitobacter delicatus, Sulfitobacter dubius, and Celeribacter indicus encode the genes acpP, fabF, and fabG. These three genes are together responsible for steps in the synthesis of fatty acids and phospholipids [107]. The gene acpP (acyl carrier protein) carries fatty acyl intermediates during fatty acid elongation, and fabF and fabG catalyze a condensation and reduction reaction in the elongation steps of synthesis, respectively [108]. In E. coli and Vibrio harveyi, the genes fabG and fabF are upstream and downstream of the gene acpP, respectively [107,109]. The genomic position in the three phages that encode all three genes acpP, fabF, and fabG is the same as that in E. coli, suggesting a conserved organization. The presence of this gene cluster in roseophages indicates that viruses are involved in fatty acid metabolism, analogously to the role of the fatty acid desaturases encoded by cyanophages [110].
The most common AMG in the predicted prophages, cysE, flanks a gene transfer agent (GTA) genomic region in Rhodobacter capsulatus, Rhodobacter sphaeroides, Roseobacter dentrificans, Oceanicola granulosus, Roseobacter sp. and Loktanella vestifoldensis [111]. The comparison between the Rhodobacter capsulatus GTA, rcGTA, with all putative prophage regions encoding cysE, revealed that these genomic regions have high synteny and homol-ogy to rcGTA (Figure 4) [94]. These results suggest that these 22 low-quality drafts initially identified as putative prophages are actually GTAs.

Roseobacter Prophages Represent Novel Viral Groups
Shimia haliotis prophage 1 was related to four Myoviridae phages infecting Pseudomonas and Enterobacter species and uncultured myoviruses ( Figure 5). Two of these phages, PPpW-3 and Arya, encode integrases, but attempts to lysogenize the host in the lab were unsuccessful [112,113]. The other two phages, ECO1230-10 and EcoM-ep3, do not encode integrases. These results and the lack of repressor genes lead to the proposal that this myovirus group is strictly lytic and received the integrase gene through lateral gene transfer [113]. The presence of the Shimia haliotis prophage 1 integrated in the host genome indicated the ability for lysogeny within this group ( Figure 5). Aside from the four Myoviridae phages related to Shimia haliotis prophage 1, the rest of the uncultivated contigs for which taxonomic identification was available belonged to the order Caudovirales.
The prophages identified in this study do not exhibit the genetic composition of the recently proposed Cobavirus group within Podoviridae [44]. This group is composed of two lytic phages infecting Lentibacter sp., Celeribacter marinus phage P12053L, Roseobacter sp. phage SIO1, and metagenome-assembled viral contigs [44]. Cobaviruses are characterized by a cobalamin-dependent ribonucleotide reductase (RNR) and a genomic organization into two arms, with replication genes to the left, and virion structure and morphogenesis on the right. The prophage sequences identified here do not display this organization or possess the cobalamin-dependent RNR.

Roseobacter Prophages Are Abundant in a Phytoplankton Bloom
Symbiotic interactions between roseobacter and microalgae are mediated by the exchange of metabolites, such as vitamin B 12 , nitrogen, DMSP, and carbon, and the ability of roseobacters to form biofilms on the algae surface [13,14,114,115]. By altering roseobacter metabolism, prophage-encoded AMGs may modulate chemical interactions that mediate these symbioses [114]. The presence of sequences with high similarity to the roseobacter prophages in metagenomes from a dinoflagellate bloom indicates that these prophages, or closely related temperate phages, are present in roseobacters that associate with phytoplankton ( Figure S3). The most abundant of these prophages also encoded the gene acpP involved in type II fatty acid synthesis [116]. The second most abundant prophage in the bloom (host Shimia haliotis) encoded the gene cluster rfbABCD ( Figure S3). These four highly conserved genes in Gram-positive and Gram-negative bacteria (also designated rml in Gram-positive bacteria) are involved in the synthesis of dTDP-rhamnose, a precursor to lipopolysaccharides, capsular polysaccharides, and exopolysaccharides [117,118]. The expression of rfbABCD complements O-antigen production mutation in E. coli [117]. Rhamnose is one of the sugar components of the O-repeating unit of the E. coli lipopolysaccharide, and a mutation in the rfb operon leads to loss of the O-antigen [119,120]. The O-antigen is used by some tailed bacteriophages as a receptor for attachment and infection [121]. In the roseobacter Phaeobacter inhibens, these four genes of the rhamnose operon are crucial for biofilm formation, a necessary step in the establishment of algal symbioses [75]. The genes forming the rhamnose operon in roseobacters are typically found in a RepA-I plasmid [22,75]. Loss of the extrachromosomal plasmid containing the rhamnose operon in P. inhibens leaves the bacteria unable to colonize the green algae Ulva lactuca [22]. The rhamnose operon identified here in a prophage in Shimia haliotis potentially implicates this prophage in the biofilm formation capabilities necessary for microalgae symbiosis.

The Global Distribution of Roseobacter Prophages Is Associated with Latitude
Roseophages are highly abundant in temperate and polar oceans, mirroring the distribution of their hosts [122][123][124][125][126]. Specifically, RCA, cobavirus and N4-like phageinfecting roseobacters increase in abundance in coastal environments [23,46,127]. Here, latitude and temperature were significantly associated with the distribution of roseobacter prophages in the TARA Oceans viromes ( Figure 6). The high abundance of these prophages in the viromes from free viral particles suggests that they may be capable of inducing the lytic cycle in their hosts [128]. The three phages with the strongest relationship to latitude were Litorimicrobium taeanense prophage 4, Celeribacter marinus prophage 2, and Loktanella atrilutea prophage 1. Both Loktanella atrilutea prophage 1 and Celeribacter marinus prophage 2 contain the AMG acpP, which could be involved in regulating lipid membrane fluidity in polar temperatures [129].

Conclusions
The diversity of predicted prophages and gene transfer agents identified here in the genomes of roseobacters suggests that prophages and GTAs are a significant source of genomic diversity in this bacterial group. The auxiliary metabolic genes encoded by these prophages with functions in fatty acid metabolism and secondary metabolites are likely involved in the symbioses of their roseophage hosts with primary producers. Of particular importance may be the genes involved in fatty acid metabolism and carbohydrate modification, which may be involved in biofilm formation and nutritional exchange. The high abundance of some of these prophage sequences in the TARA Oceans viromes indicates that many of them may represent active prophages capable of entering the lytic cycle. The data presented here highlight the need for future studies on the metabolic changes incurred by prophage integration in roseobacter genomes.  Figure S1: (A) Bacterial genome length plotted against the number of high and medium quality prophages identified in each of the 79 roseobacter genomes. The horizontal line in each box represents the mean, and the upper and lower bounds of the box represent the 75th and 25th percentiles. (B) Prophage density (total phage genome length divided by bacterial genome length) plotted against bacterial genome length. There is no correlation between the two variables (Pearson test p = 0.522). (C) Frequency of genome lengths for all high and medium quality prophages. The genome frequency is at least bimodal by Hartigans' dip test for unimodality (p > 0.05). Each phage genome is grouped by color for predicted taxonomic assignments; Figure S2. (A) Genome of the predicted complete circular prophage from host Celeribacter halophilus. (B) Genome of predicted complete circular prophage from Loktanella hongkongensis. The circular genomes are represented linearly, with the direction of the arrow representing the direction of transcription. Figure S3: The abundance of high and medium quality prophages mapped at 80 % identity to eight G. catenatum algal bloom metagenomes from Du et al., and Huang et al [78,79]. Phages with greater than ten reads mapped in a metagenome are represented here. The color bar on the left-hand side denotes each phage. Black bars represent the phage abundance in each metagenome. The abundances of algae and Rhodobacterales at each time point were extracted from Huang et. al. 2018 [79]. The genomes of the three most abundant phages (recruiting more than 1,000 total reads) are shown on the right-hand side. These phages are labeled with their host and genome length, as well as the gene names for auxiliary meta-bolic genes. The mapped reads were visualized with ANVI'O version 6.2 using the metagenomic workflow using the abundance visualization, and the genome maps were created with EasyFig version 2.2.2.   Table S1.

Conflicts of Interest:
The authors have no conflict of interest to declare.