Initial Virome Characterization of the Common Cnidarian Lab Model Nematostella vectensis

The role of viruses in forming a stable holobiont has been the subject of extensive research in recent years. However, many emerging model organisms still lack any data on the composition of the associated viral communities. Here, we re-analyzed seven publicly available transcriptome datasets of the starlet sea anemone Nematostella vectensis, the most commonly used anthozoan lab model, and searched for viral sequences. We applied a straightforward, yet powerful approach of de novo assembly followed by homology-based virus identification and a multi-step, thorough taxonomic validation. The comparison of different lab populations of N. vectensis revealed the existence of the core virome composed of 21 viral sequences, present in all adult datasets. Unexpectedly, we observed an almost complete lack of viruses in the samples from the early developmental stages, which together with the identification of the viruses shared with the major source of the food in the lab, the brine shrimp Artemia salina, shed new light on the course of viral species acquisition in N. vectensis. Our study provides an initial, yet comprehensive insight into N. vectensis virome and sets the first foundation for the functional studies of viruses and antiviral systems in this lab model cnidarian.


Introduction
Viruses, the absolute parasites of virtually all living organisms, constitute the most abundant and diverse entity on Earth [1,2]. Owing to their dependency on host organisms and the resulting continuous evolutionary arms race with their hosts, viruses are predominantly studied in the context of a pathogenic state of their host cells [3]. Moreover, much focus is given to deciphering viruses of humans and other economically important species, while the viral diversity of other hosts remains understudied [4]. Importantly, many viruses can remain in a dormant state, which is neutral to the cell environment, or maintain a commensal relationship with the host [5][6][7]. In 1990, Lynn Margulis first introduced the concept of 'holobiont' which referred to a metaorganism formed by a symbiosis of separate living entities, which constitutes an individual unit of selection [8]. Since the recognition that the collective of prokaryotic and eukaryotic viral species present in the host, hereinafter called 'virome', also forms a part of the metaorganism, its complex role in host disease, development, and evolution is a subject of studies and debates [9].
Nematostella vectensis, also known as the starlet sea anemone, is a non-symbiotic emerging cnidarian model species owing to the facility of its culture under laboratory conditions and a wide range of accessible tools for genetic engineering (reviewed in [10]). Cnidaria is a phylum representing early-branching Metazoa and has diverged from its sister group Bilateria, which includes the vast majority of extant animals, approximately 600 million years ago [11], making it an attractive group for wide range of comparative studies. Cnidarians are divided into two major classes: Anthozoa (sea 2 of 15 anemones and corals) and Medusozoa (jellyfish and hydroids) [12]. Among cnidarians, much focus has been placed on deciphering the composition and significance of the virome of corals (reviewed in [13]), their photosynthetic dinoflagellate symbionts from the Symbiodinium genus [14,15], and their symbiotic sea anemone proxy Exaiptasia pallida (formerly called Aiptasia pallida) [16], mainly due to the environmental significance of the coral reef ecosystems. Members of coral virome have been suggested to play a role in some coral diseases [17,18], and a general increase of viral abundance has been observed during the bleaching (loss of dinoflagellate symbionts) of several coral species [19][20][21].
Nematostella is arguably the most commonly used anthozoan lab model [10] and extensive research has been done so far to uncover its microbiome composition, significance and dynamics of the interplay with the host environment [22][23][24]. In contrast to microbial studies, and despite its wide use as a lab model, the composition of a stable viral community forming the holobiont of this sea anemone has not yet been studied. Likewise, no virus capable of infecting Nematostella has been identified so far, impairing our understanding of viral pathogenesis in the starlet sea anemone. Furthermore, lack of any insights into the Nematostella viral community hinders the research on the role of RNA interference (RNAi)-the major sequence-specific antiviral system of plants and invertebrates [25][26][27]-in the innate immune response of Nematostella to viruses.
In this study, we aimed to characterize for the first time N. vectensis virus communities. To this end, we re-analyzed various publicly available RNA-seq datasets and applied a strategy of de novo assembly of putative viral sequences, followed by thorough taxonomic validation. Our approach was expanded by generating novel transcriptome datasets from the primary food source of laboratory sea anemones, the brine shrimp Artemia salina. We have identified a set of unique Nematostella-specific and Artemia-specific viral sequences with sound homology to known viruses, as well as characterized the core N. vectensis virome present in all previously sequenced lab populations. Finally, we observed a lack of viral load in early developmental stages and determined approach-dependent differences between individual datasets which might serve as a guide for future RNA virome research in this species.

Materials and Methods
In this study, we used two types of datasets; publicly available RNA-seq paired-end data from N. vectensis spanning different developmental stages, and two novel RNA-seq datasets from A. salina nauplii, which constitutes the main food source for N. vectensis, generated in-house. Additionally, we analyzed two publicly available RNA-seq single-end libraries of the California mussel Mytilus californianus, which is a supplementary food source for some lab populations of N. vectensis [28,29].

RNA-Extraction and Sequencing
Approximately 100 µl of A. salina nauplii were used for each of the two biological replicates. Total RNA was extracted with Tri-Reagent (Sigma-Aldrich, St. Louis, MO, USA) according to manufacturer's protocol, treated with 2 µL of Turbo DNAse (Thermo Fisher Scientific, Waltham, MA, USA) and re-extracted with Tri-Reagent. The quality of total RNA was assessed on a Bioanalyzer Nanochip (Agilent, Santa Clara, CA, USA), although no RNA Integrity Number (RIN) was available due to the presence of a single peak representing 18S rRNA subunit, commonly observed in arthropods [30]. RNA-seq libraries were constructed using SENSE Total RNA-seq Library Prep Kit v2 (Lexogen, Vienna, Austria) following the manufacturer's protocol and sequenced on NextSeq 500 (Illumina, San Diego, CA, USA) with 75 nt read length. The raw data have been deposited at the NCBI SRA database (accession number PRJNA601424).

N. vectensis Transcriptome Datasets
We used previously published RNA-seq datasets of 50-100 nt paired-end reads from six different studies. The first dataset was reported by Babonis et al. 2016 [31] (NCBI BioProject accession: PRJEB13676) and includes three transcriptome replicates of nematosomes, mesenteries, and tentacles of  [32]. The next dataset, by Oren et al. 2013 (NCBI BioProject accession: PRJNA246707), reports the circadian rhythm transcriptome of adult sea anemones [33]. Two datasets generated at Vienna University, partially published by Schwaiger et al. 2014 (NCBI BioProject accession: PRJNA200689 and PRJNA213177), were depleted from duplicates and merged [34]. Samples spanning all Nematostella developmental stages were further divided into two groups encompassing polyA-selected and rRNA-depleted samples. The fifth transcriptomic adult sample came from the study by Fidler et al. 2014 (NCBI BioProject accession: PRJNA200318) [35]. The last dataset was reported by Warner et al. 2018 and includes samples spanning the first 144 h of regeneration of 6-week-old juveniles (NCBI BioProject accession: PRJNA419631) [36]. A detailed list of NCBI accession numbers, and raw and filtered read counts of all samples used in this study is shown in Supplementary File S1, Table S1.

Raw Reads Processing and Filtering
The quality of raw reads was assessed by FastQC software [37]. The reads were trimmed and the quality was filtered by Trimmomatic with the following parameters (LEADING:5 TRAILING:5 SLIDINGWINDOW:4:20 MINLEN:36) [38]. Only paired-end reads were used for downstream analysis. Bowtie2 [39], with the following parameters (-local -D 20 -R 3 -L 10 -N 1 -p 8 -mp 4) was used to align the reads to the Nematostella vectensis genome (NCBI accession: GCA_000209225.1) [40], as well as to all sequences of transfer, mitochondrial and cytoplasmic ribosomal RNA, retrieved from the RNAcentral database [41], retaining unmapped reads after each step. The same mapping strategy was used to remove external RNA Controls Consortium (ERCC) spike-ins whenever used during library construction. In order to further filter the retained datasets of all Nematostella-derived short reads, we performed a local stringent BLASTn (version 2.3.0+) search against the Nematostella genome. Next, we removed from the datasets all reads below the e-value cutoff of 1e-15 by a custom Python script (script available at https://github.com/yael-hzn/Nematostella-viruses-project/blob/master/NV_viruses_filter_ reads_out_script.ipynb). Of note, at these stages we removed all possible viral sequences that might have been falsely assembled or incorporated into the official version of the Nematostella genome [40].

Sequence Assembly and Viral Sequence Identification
The remaining reads of each dataset were de novo assembled by Trinity (version trinityrnaseq_r20140717) with default parameters [42]. The assembly was repeated by inputting merged filtered reads from all datasets. This assembly was processed equally to other datasets and treated henceforth as a general N. vectensis viral dataset. Next, we employed a thorough three-step BLAST-based filtering process in order to retrieve only the sequences of high, certain virus homology. After the assembly, we ran simultaneous local BLAST searches-first, BLASTn against the Nematostella genome to classify sequences which were previously too short for generating a significant homology score, and two BLASTx searches against the viral protein database (RefSeq release 86, Swiss-Prot Release 2018_02) and the prokaryote protein database (Swiss-Prot Release 2018_02) with e-value 1e-5 and 1e-10 as cutoffs, respectively. Only contigs longer than 200 nt and of unambiguous viral origin were retained for downstream analysis and were trimmed to recover only sequences with known homology. Custom Python script for comparison of multiple BLAST searches and trimming of putative virus-derived contigs is available at https://github.com/yael-hzn/Nematostella-viruses-project/blob/ master/Nv_viruses_alignment_summary_script.ipynb.
The second step of virus identification was a local BLASTx search against all proteins available in the SwissProt database (Release 2018_02), in order to remove sequences of clear non-viral origin. After the search with e-value 1e-10 as a cutoff, we retrieved all putative viral sequences, as well as contigs without any clear homology to any protein in the database. As a final filtering step, we performed a remote BLASTx 2.7.1+ search against the RefSeq Protein database (Release 89, e-value cutoff was 1e-10) to select only those sequences with identifiable homology to the known eukaryotic viruses. It is worth noting that this step of sequence filtering overlooks novel species without any clear homology to previously annotated viruses.

Taxonomic Annotation
The retrieved viral sequences were taxonomically annotated following the approach of Goodacre et al. [43] In brief, taxonomic identifiers obtained during the final BLASTx search were used for climbing the taxonomic tree by using NCBI parent-child taxonomic identifier definitions (file available at ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxcat.zip;nodes.dmpfile) until the family level was reached. Family names were recovered by mapping a family taxonomic identifier to the taxonomic name (file available at ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxcat.zip;names.dmpfile). Taxonomic annotation was manually checked to be consistent with the International Committee on Taxonomy of Viruses (ICTV) Master Species List 2018a v1.

Reads Quantification
Filtered reads from each dataset (after non-coding RNA/spike-ins removal) and Artemia transcriptome were remapped to the general N. vectensis viral dataset by Bowtie2, applying the following parameters (-N 1 -L 15). Additionally, we aligned filtered reads from two replicates of the California mussel M. californianus RNA-seq libraries to the general N. vectensis viral dataset [44]. To run a statistical comparison of the datasets, we downloaded two single-end replicates from BioProject PRJNA419631 [36] and performed the same remapping. Next, we established the core virome by selecting viral sequences from the general assembly, to which reads from all datasets were mapped (excluding reads from embryogenesis dataset due to a very low level of viral load [32]). A Venn diagram was generated with the online tool jvenn [45]. A relative abundance of viral sequences was measured by calculating transcript per million reads (TPM) divided by 1000. A heatmap of relative abundance was done in Trinity [42]. A detailed list of NCBI accession numbers of the single-end datasets is presented in Supplementary File S1, Table S1.

Validation of Candidate Viruses
To confirm the presence of viral sequences identified in RNA-seq libraries, we selected 10 contigs from the Nematostella core virome, for which we performed reverse transcription polymerase chain reaction (RT-PCR) assays. RNA was extracted from the adult female sea anemone and a 2-day-old planula following the same protocol used for Artemia RNA extraction. cDNA was constructed using SuperScript III (Thermo Fisher Scientific) according to the manufacturer's protocol. cDNA was amplified with Q5 ® Hot Start High-Fidelity DNA Polymerase (New England Biolabs, Ipswich, MA, USA) in a 25 µl reaction with thermocycling conditions as follows: 98 • C for 30 sec, followed by 35 cycles of 98 • C for 10 s, 60 • C for 20 sec, 72 • C for 20 sec and final extension at 72 • C for 2 min. A fragment of the Nematostella NVE5273 gene was amplified under the same conditions as a positive control. PCR products were analyzed on 1.5% agarose gel. Sequences of primers and length of amplified fragments are shown in Supplementary File S1, Table S2.

Statistical Analysis
To test whether datasets have significantly different viral composition, we compared normalized remapping results between PRJEB13676 and PRJNA419631 (Babonis et al. [31], Warner et al. [36], respectively), for which biological triplicates were available. PCA factor analysis (max no. of factors = 5) revealed two factors with an eigenvalue > 2, which together explained 90.66% of the observed variance (Supplementary File S2). After extracting component loadings for these factors, we ran a separate t-test for each factor. All the calculations were done in SYSTAT version 13.2. Mean and standard deviation (SD) from mean were calculated in RStudio [46].

Results
A total number of 1,908,174,590 reads pairs distributed over seven publicly available paired-end read RNA-seq datasets and 54 samples were used for de novo assembly and homology-based viral sequences search (Table 1). We identified 94 unique viral sequences in the merged dataset which served as a viral database for all further analyses, while the number of unique viral sequences in individual datasets varied from 6 to 76 (Table 1). Viral contigs in the general viral assemblage ranged in length from 200 to 7731 nt. All sequences from general viral assembly and individual dataset assemblies are presented in Supplementary File S1, Tables S3, S6-S12, and Text Files S1, S3, S5, S7, S9, S11, S13, S15. Furthermore, we retained all contigs of viral origin which were not trimmed to contain only fragments homologous to known viruses i.e., sequences with stretches of both viral and unknown homology (Supplementary File S1, Text Files S2, S4, S6, S8, S10, S12, S14, S16). Table 1. Summary of sequence data used for virus identification in RNA-seq studies. Retained reads refer to read counts after quality filtering, trimming, and removal of the Nematostella genome, transfer, mitochondrial and cytoplasmic rRNA, as well as ERCC spike-ins. Identified final viral reads were filtered from the total number of de novo assembled contigs and cleared from duplicates.

Reference
No. of Samples

Raw Read Pairs
Retained Reads Analysis of the dataset generated by Tulin et al., which captures the stage of Nematostella embryonic development, revealed only six short unique sequences with homology to Rous sarcoma virus, a representative of the Retroviridae family. Similar scarcity of viral sequences in early developmental stages was also observed in individual samples assemblage (unfertilized egg, blastula and gastrula samples from polyA-selected and rRNA-depleted libraries from Schwaiger et al., data not presented). Therefore, when searching for the common Nematostella virus community, we decided to exclude this dataset and focus on viromes from the non-embryonic developmental stages.

Viral Community Classification
Our homology-based identification of viral sequences in N. vectensis RNA-seq datasets revealed sequences belonging to 11 viral families, 2 unassigned orders and 3 unclassified groups ( Figure 1). Detected viral families included Baculoviridae, Iridoviridae, Marseilleviridae, Mimiviridae, Phycodnaviridae, Pithoviridae, Reoviridae, Retroviridae, Rhabdoviridae and Yueviridae. The most prevalent viral family was dsDNA Iridoviridae (23.26%), which was present in five out of six non-embryonic transcriptomes ( Figure 1c). However, the highest abundance of viral sequences falls into a group of unclassified RNA viruses (41.86%, Figure 1a), which is entirely composed of a group of novel viruses captured in a wide range of invertebrate species by Shi et al. (denoted in our results as "unclassified RNA viruses ShiM-2016") [4]. Similarly, this is also the most abundant group when analyzing the genomic composition of detected viruses (41.86%), with almost equal contribution of dsDNA viruses (37.21%, Figure 1b). this dataset of a representative of the most common family Iridoviridae, as a result of insufficient sequencing depth and underrepresentation of reads coming from less abundant viruses. Similarly, the sequence of Beihai picorna-like virus 57 was found only in the dataset from Babonis et al., which reported the transcriptome of nematosomes, mesenteries and tentacles of adult Nematostella, and comprised 52.5% of all viral reads in this lab population (Figure 2, Supplementary File 1, Table S13). To discern Nematostella-specific viruses from those derived from the primary source of food, A. salina nauplii, we generated two replicates of Artemia RNA-seq libraries (21,787,250 and 37,055,827 raw single-end reads). Raw reads were quality-filtered, trimmed and directly mapped to the constructed viral database composed of 94 viral contigs. Mapping to N. vectensis virome instead of de  Table S13).
To discern Nematostella-specific viruses from those derived from the primary source of food, A. salina nauplii, we generated two replicates of Artemia RNA-seq libraries (21,787,250 and 37,055,827 raw single-end reads). Raw reads were quality-filtered, trimmed and directly mapped to the constructed viral database composed of 94 viral contigs. Mapping to N. vectensis virome instead of de novo assembly of A. salina viruses was motivated by finding an overlap between the sea anemone and its food source at the lab, rather than revealing complete A. salina virome. Mapping to our general viral database identified 7 contigs which were shared between N. vectensis and A. salina (found in both RNA-seq replicates). Those viral sequences ranged in length from 279 to 5952 nt and represented the families Yueviridae, Rhabdoviridae and unclassified RNA viruses (Supplementary File S1, Table S4). To find viral Viruses 2020, 12, 218 7 of 15 sequences which could be derived from ovaries of the California mussel, a supplementary food source used at a lower frequency in several lab populations of N. vectensis, we mapped filtered reads from two publicly available RNA-seq libraries of M. californianus [44]. Interestingly, we found no reads mapping to our viral dataset, suggesting that all of the food-derived viruses originated from the most commonly used food source, A. salina. and its food source at the lab, rather than revealing complete A. salina virome. Mapping to our general viral database identified 7 contigs which were shared between N. vectensis and A. salina (found in both RNA-seq replicates). Those viral sequences ranged in length from 279 to 5952 nt and represented the families Yueviridae, Rhabdoviridae and unclassified RNA viruses (Supplementary File 1, Table S4). To find viral sequences which could be derived from ovaries of the California mussel, a supplementary food source used at a lower frequency in several lab populations of N. vectensis, we mapped filtered reads from two publicly available RNA-seq libraries of M. californianus [44]. Interestingly, we found no reads mapping to our viral dataset, suggesting that all of the food-derived viruses originated from the most commonly used food source, A. salina.

N. vectensis Core Virome
In order to establish a core virome of N. vectensis i.e., a collective of viral species present in all, but embryonic studied datasets, we decided to use the data from the remapping stage, rather than assembled contigs. We assumed that all viral fragments detected in a sample are more reliable representations of the true virome, since such an approach takes into account lowly expressed or incomplete viral sequences, which might not be assembled into contigs or may be partially undetected when sequencing depth is insufficient. The obtained set is composed of 21 viral sequences ( Figure 3, Supplementary File 1, Table S5), six of which are A. salina-derived viruses. In total, 61.9% of the sequences from the core virome represent Iridoviridae, the family of dsDNA viruses, spanning three genera-Chloriridovirus, Iridovirus and Lymphocystivirus [47]. Among all Iridoviridae sequences, none of them were mapped in Artemia libraries. This places Iridoviridae as the most common viral family, specific to N. vectensis rather than derived from the food source. Interestingly, the highest

N. vectensis Core Virome
In order to establish a core virome of N. vectensis i.e., a collective of viral species present in all, but embryonic studied datasets, we decided to use the data from the remapping stage, rather than assembled contigs. We assumed that all viral fragments detected in a sample are more reliable representations of the true virome, since such an approach takes into account lowly expressed or incomplete viral sequences, which might not be assembled into contigs or may be partially undetected when sequencing depth is insufficient. The obtained set is composed of 21 viral sequences (Figure 3, Supplementary File S1, Table S5), six of which are A. salina-derived viruses. In total, 61.9% of the sequences from the core virome represent Iridoviridae, the family of dsDNA viruses, spanning three genera-Chloriridovirus, Iridovirus and Lymphocystivirus [47]. Among all Iridoviridae sequences, none of them were mapped in Artemia libraries. This places Iridoviridae as the most common viral family, specific to N. vectensis rather than derived from the food source. Interestingly, the highest contribution of the population-specific virus to a total detected population virome was observed in more noise-prone rRNA-depleted libraries (Schwaiger et al. [34]) and in the tissue-specific dataset (Babonis et al. [31]). We further validated the presence of 10 randomly chosen sequences from the core virome set in cDNA preparations from A. salina, adult female sea anemone and two-day-old planula originating from our lab population. The RT-PCR analysis confirmed a complete lack of viral load in the planula stage and the presence of N. vectesis-specific and A. salina-derived viruses in our samples ( Figure S1). core virome set in cDNA preparations from A. salina, adult female sea anemone and two-day-old planula originating from our lab population. The RT-PCR analysis confirmed a complete lack of viral load in the planula stage and the presence of N. vectesis-specific and A. salina-derived viruses in our samples ( Figure S1).  [31,[33][34][35][36] revealed by remapping filtered reads to the viral contigs assembled from the merged dataset. For simplicity, only numbers of contigs common to all datasets and specific to each dataset are shown. Values in brackets indicate the total number of different viral contigs to which reads from each dataset were successfully remapped.

Interpopulation Comparison
In order to characterize the general pattern of inter-population differences in the viral load, we compared the normalized counts of viruses-mapped reads of each dataset. As expected, we observed a prominent enrichment in viral sequences in the rRNA-depleted libraries (SDs from mean = 2.218, Figure 4a, Table 2) when compared to the rest of the polyA-selected libraries. Next, we compared the contribution of viruses derived from the food source to the total viral load of Nematostella. Similarly to the general viral load pattern, rRNA-depleted libraries displayed a significantly higher load of A. salina-derived viruses in the total captured virome (SDs from mean = 2.2676, Figure 4b, Table 2), suggesting that the majority of these viruses are not polyadenylated when captured in the host sequencing. Interestingly, we noticed a significant variation in the percentage of A. salina-derived viruses in polyA-selected libraries, with a mean of 14.06%, which could be a result of differences in the sample preparation prior to library generation, i.e., how long the animals were deprived of food for before RNA extraction. Of note, we also detected fragments of four Artemia-derived viruses in the embryonic dataset (Fig 2, Supplementary File 1 , Table S13), however, the very low overall yield of mapped reads suggests that these fragments might be parentally deposited products of viral sequence degradation.  [31,[33][34][35][36] revealed by remapping filtered reads to the viral contigs assembled from the merged dataset. For simplicity, only numbers of contigs common to all datasets and specific to each dataset are shown. Values in brackets indicate the total number of different viral contigs to which reads from each dataset were successfully remapped.

Interpopulation Comparison
In order to characterize the general pattern of inter-population differences in the viral load, we compared the normalized counts of viruses-mapped reads of each dataset. As expected, we observed a prominent enrichment in viral sequences in the rRNA-depleted libraries (SDs from mean = 2.218, Figure 4a, Table 2) when compared to the rest of the polyA-selected libraries. Next, we compared the contribution of viruses derived from the food source to the total viral load of Nematostella. Similarly to the general viral load pattern, rRNA-depleted libraries displayed a significantly higher load of A. salina-derived viruses in the total captured virome (SDs from mean = 2.2676, Figure 4b, Table 2), suggesting that the majority of these viruses are not polyadenylated when captured in the host sequencing. Interestingly, we noticed a significant variation in the percentage of A. salina-derived viruses in polyA-selected libraries, with a mean of 14.06%, which could be a result of differences in the sample preparation prior to library generation, i.e., how long the animals were deprived of food for before RNA extraction. Of note, we also detected fragments of four Artemia-derived viruses in the embryonic dataset ( Figure 2, Supplementary File S1, Table S13), however, the very low overall yield of mapped reads suggests that these fragments might be parentally deposited products of viral sequence degradation.     Comparison of viral communities in individual non-embryonic datasets suggested that while the overall composition of the virome displays a stable pattern across the samples, different lab populations might carry unique viruses, not found in other Nematostella groups. To test this assumption, we compared two datasets from adult animals, for which three biological replicates were available (Babonis et al. [31] denoted as group one and Warner et al. [36] denoted as group seven). A PCA factor analysis revealed two factors with eigenvalues higher than 2 (3.161 and 2.278), which explained 52.69% and 37.97% of observed variance, respectively (Supplementary File S2). Statistical analysis of the extracted component loadings revealed a dual character of the observed variation between the two sets. The first factor showed no statistically significant differences and clustered these datasets together (t 2.854 = 0, P = 1), while the analysis of the second factor suggested strong implicit variance between the studied groups (t 2.159 = −12.3, P = 0.005). The result of partial overlap in viral sequences between two lab populations is not surprising in light of the previously described Nematostella core virome. However, the strong variation we detected between the two datasets seems to confirm the significant contribution of the unique viruses in the studied lab populations.

Discussion
In the current study, we identified 94 different sequences with sound homology to the known viruses from several viral families. Multiple-step removal of N. vectesis-mapping reads from the RNA-seq libraries resulted in the exclusion from the study of genome-integrated virus-derived sequences, such as retroviruses and endogenous-viral elements (EVEs). As no enrichment techniques for viral particles, such as ultracentrifugation or size-based filtration, have been applied to any of the analyzed datasets, our search for fragments of exogenous viral genomes was relatively unbiased [4]. Nevertheless, it is important to note that the majority of analyzed datasets have been generated through mRNA enrichment by oligo(dT) selection, as these datasets were primarily designed for the whole transcriptome analysis. This, in turn, biased our search towards an increase in polyadenylated sequences present in some ssRNA(+) viruses [48], mRNA of RNA and DNA viruses [49,50], viruses targeted by a host for degradation [51,52], or products of other uncharacterized host-virus interface [53]. As expected, we have observed a significantly higher number of virus-mapping reads in the adult female library where rRNA was depleted with RiboMinus TM treatment (dataset "Schwaiger et al. rRNA-depleted") when compared to the rest of the datasets. However, it needs to be taken into account that the commercially available probes used for rRNA depletion are not designed for non-bilaterian animals and are less efficient in depleting the 5S small ribosomal subunit due to its high sequence variability between different animal phyla [54], which might compromise the library depth available to low-frequency viruses. Therefore, a truly unbiased search for RNA viruses would certainly gain from an rRNA depletion method custom-fitted for N. vectensis sequences.
Despite these limitations, we were able to retrieve a relatively broad representation of N. vectensis viruses, composed of 94 viral sequences, 21 of which were common to all non-embryonic datasets. The most common family present in almost all non-embryonic datasets was Iridoviridae, which belongs to the group of linear dsDNA viruses. Known hosts of Iridoviridae include amphibians, fish, reptiles, insects and crustaceans [47]. None of the identified sequences of Iridoviridae representatives was either mapped in Artemia libraries, or amplified from Artemia cDNA, which confirms that these dsDNA viruses are actively expressed and specific to Nematostella, or organisms comprising this holobiont, rather than food-derived. Interestingly, members of the Iridoviridae family were missing from all viromes available for other sea anemones species: E. pallida [16], Actinia equina [55] and Bolocera sp. [4]. It seems plausible that such a major difference in the viral community composition between sea anemones might stem from the very distinct environmental conditions of these marine animals and therefore, a different ensemble of neighboring viral species. While Bolocera is an open-sea sea anemone [56], both Actinia and Exaiptasia occupy predominantly the intertidal zones which experience recurring but short-term fluctuations of water level and exposure to air [57]. In contrast, Nematostella inhabits mostly brackish lagoons of the east coast of North America, where it can be found burrowed into sand and mud [58]. Moreover, shallow waters of this habitat possess reduced buffering properties and expose Nematostella to strong shifts in environmental conditions throughout the year, as well as quite unique biota [59,60], which altogether might result in the altered susceptibility of N. vectensis to different viral species.
An overall comparison between the four available viromes of the sea anemones revealed a general similarity of N. vectensis to A. equina and Bolocera sp., while the viral community of E. pallida displayed considerable differences. For instance, in the A. equina dataset we found two novel viruses which display the highest homology to viral sequences found in N. vectensis (Caledonia beadlet anemone dicistro-like virus 3 isolate B and A, homology to Wenzhou picorna-like virus 28 and Beihai picorna-like virus 71, respectively) [55]. The same number of viral sequences were common between our data dataset and Bolocera sp. virome (Beihai picorna-like virus 70 and Beihai picorna-like virus 118) [4]. In both cases, sequences identified in our study only partially covered the described viral genomes. Another similarity emerges from the distribution of viral families across studies. In the data from A. equina and Bolocera the most prevalent group are picorna-like viruses (50% and 59.1%, respectively), which fall into a novel Picorna-Calici clade established in Shi et al. [4] Similarly, we identified in the Nematostella dataset 37.2% viral sequences which belong to the Picorna-Calici clade, although the vast majority of them are classified here as "other viruses" due to the applied ICTV classification. On the contrary, E. pallida had more diverse viral community composition, in which among 40 identified viral families Picornaviridae constitute only 9.87% [16]. Moreover, we did not observe any viruses with obvious homology shared between N. vectensis and E. pallida. Finally, the most common family of Exaiptasia virome, Herpesviridae, was not found in any of the remaining sea anemones. Herpesviruses have been previously associated with other cnidarian species [3,61,62] and with Symbiodinium microadriaticum, found in corals [14]. Given that E. pallida is also a host to several members of the Symbiodinium family [63] and the contribution of Herpesviridae decreases in aposymbiotic state when compared to a fully symbiotic Exaiptasia (8.1% and 12.9%, respectively) [16], it is possible that this viral family is associated with the presence of these symbionts and hence not found in sea anemone species which do not harbor zooxanthellae.
Among 21 viral sequences present in all non-embryonic datasets which we denominated as the core virome of Nematostella, six sequences homologous to four different viruses were identified in its primary source of food, A. salina nauplii. Unsurprisingly, known hosts of those viruses include insects and crustaceans, as well as insect and vertebrate parasitic nematodes [4]. Interestingly, we were able to amplify by the RT-PCR fragments of two RNA viruses included in our core genome (Sanxia water strider virus 10 and Hubei sobemo-like virus 41) from the cDNA of A. salina, while not detecting any matching reads in either of the two replicates of Artemia RNA-seq libraries. The most plausible explanation is that the lack of polyA tail on the 3 end of the RNA molecule would hinder their detection in the transcriptome analysis but would not bias a cDNA preparation constructed with random hexamers. Therefore, we cannot exclude the fact that the overlap between food-derived viruses and Nematostella holobiont-specific virome may be more significant than described here, and a less biased sequencing approach is needed to fully characterize it. However, the fact that the majority of viral sequences targeted by RT-PCR were amplified from adult Nematostella but not from Artemia nauplii ( Figure S1) is a strong indication that many of the viruses we detected in the RNA-seq are not food-derived. The presence of persistent or prevalent viruses in lab populations of model animals was shown before for Drosophila [64] and very recently was also reported for zebrafish [65].
Besides the presence of a stable core virome of Nematostella, we have detected several population-specific viruses. Namely, five out of seven analyzed datasets possess unique fragments of viral genomes, not found in any other dataset. Such specificity was previously reported on a species level within the cnidarian genus of Hydra [3], although the species-specific diversity was associated with an extensive ensemble of bacteriophages, which were not the subject of our study. In the case of two of the datasets analyzed here, the contribution of population-specific viruses was remarkable and reached 52.5%-93.03% of the total reads mapping to viruses (datasets "Babonis et al." and "Oren et al.", respectively). Interestingly, the unique virus detected in the tissue-specific dataset ("Babonis et al.") exhibits the highest homology to a virus identified previously in tunicates [4] and it is unevenly clustering in only one library replicate generated from the mesentery tissue. In natural populations, such a virome diversity could reflect unique environmental conditions. However, to the best of our knowledge, there are no significant differences in N. vectensis culture between different lab populations as they originate from the same population from Rhode River, MD, USA, which was cultured and used for the genome sequencing [40]. Overall, we cannot exclude the possibility that this unique virus might not represent a stable population-unique viral community, but instead, it was acquired from other species cultured in the research facility where the Nematostella polyps were kept.
Finally, our analysis of the Tulin et al. dataset, which spans 24 h of embryonic development, revealed that the viral load in this early life stage library was negligible. None of the sequences from the core virome specific to N. vectensis was present either in the Tulin et al. RNA-seq dataset [32] or in our early planulae cDNA preparation. Comparison to the individual assemblages of available early developmental stages, i.e., from an unfertilized egg, blastula and gastrula, confirmed this pattern, suggesting that the lab Nematostella is free of viruses in its early developmental stage and acquires them throughout life, both by food ingestion and uncharacterized ways of entry. Unfortunately, the viral datasets of other sea anemones do not span multiple developmental stages, which impedes a direct comparison of their embryonic viral load. Unexpectedly, the only viral sequences assembled from the data by Tulin et al. 2013, which are similar to the Rous sarcoma virus [66], exhibited remarkable homology (99% identity at the nucleotide level) to transcriptomic sequences from the reef-building coral Acropora millepora [67] and the stalked jellyfish Haliclystus sanjuanensis. Such an unusually high level of homology among species that separated more than 600 million years ago [68] raises the possibility of contamination. Of note, the homology level of the N. vectensis sequences to the Rous sarcoma virus and other closely-related vertebrate viruses (e.g., Avian leukosis virus) was lower (<97%) than the homology among these three far-related cnidarians. However, as these sequences failed to be filtered out by multiple steps of mapping to Nematostella genome and were missing from individual early developmental assemblages, this strengthens our prediction that they might represent a contamination of the RNA-seq libraries rather than a genome-integrated cnidarian retrovirus.
Although most studies are focusing on the role of viruses in the pathogenesis of vertebrates, there is an increasing understanding of the significance of viral communities in the formation of a stable holobiont among all living organisms. Here, we re-analyzed several high-throughput RNA-seq datasets available for a cnidarian model organism, N. vectensis, and we developed a straightforward approach of de novo assembly followed by a multi-step, homology-based virus identification. Our study revealed a diverse set of eukaryotic, non-integrated viruses spread across seven different lab populations, among which we identified both the core virome present in all datasets and several population-unique viruses. The observed absence of viral community during the early stages of development and the identification of viruses shared with the primary food source of N. vectensis (A. salina) provide an initial insight into the course of viral community acquisition in N. vectensis. Further research combining both non-targeted and virus-enriched deep sequencing approaches is essential for a full characterization of the Nematostella viral community.