Deep Sequencing Reveals the Complete Genome and Evidence for Transcriptional Activity of the First Virus-Like Sequences Identified in Aristotelia chilensis (Maqui Berry)

Here, we report the genome sequence and evidence for transcriptional activity of a virus-like element in the native Chilean berry tree Aristotelia chilensis. We propose to name the endogenous sequence as Aristotelia chilensis Virus 1 (AcV1). High-throughput sequencing of the genome of this tree uncovered an endogenous viral element, with a size of 7122 bp, corresponding to the complete genome of AcV1. Its sequence contains three open reading frames (ORFs): ORFs 1 and 2 shares 66%–73% amino acid similarity with members of the Caulimoviridae virus family, especially the Petunia vein clearing virus (PVCV), Petuvirus genus. ORF1 encodes a movement protein (MP); ORF2 a Reverse Transcriptase (RT) and a Ribonuclease H (RNase H) domain; and ORF3 showed no amino acid sequence similarity with any other known virus proteins. Analogous to other known endogenous pararetrovirus sequences (EPRVs), AcV1 is integrated in the genome of Maqui Berry and showed low viral transcriptional activity, which was detected by deep sequencing technology (DNA and RNA-seq). Phylogenetic analysis of AcV1 and other pararetroviruses revealed a closer resemblance with Petuvirus. Overall, our data suggests that AcV1 could be a new member of Caulimoviridae family, genus Petuvirus, and the first evidence of this kind of virus in a fruit plant.


Introduction
Similar to all eukaryotic organisms, plant genomes commonly have inserted mobile genetic elements and viruses [1]. New high throughput sequencing technologies (DNA and RNA-seq) and bioinformatics methods have accelerated the sequencing and analysis of new genomes and transcriptomes of complex organisms, allowing the identification of new virus species integrated within complete or draft genome sequences [2,3].
Maqui (Aristotelia chilensis [Molina] Stuntz) is an endemic Chilean tree. It has a high economic value due to the characteristics of its fruits, which possess a high antioxidant capacity [4]. This fruit's ability to scavenge reactive oxygen species (ROS) is conferred by its extremely elevated anthocyanin content, which surpasses the levels of other popular berries, i.e., strawberries, blueberries and grapes [4,5]. Moreover, it has been reported that Maqui Berry has anti-inflammatory properties [6], cardioprotective activity [7], inhibits LDL oxidation in vitro and protects human endothelial cells against oxidative stress [8]. To date, there is not genetic or phenotypic evidence of viral species in this kind of plants. The goal of this study was to demonstrate the presence of a virus in Maqui, using genomics and bioinformatics tools.
We used next generation sequencing (NGS) to obtain the first draft genome sequence of Maqui berry. This initiative allowed us to uncover for the first time the presence of a new virus inserted in the Maqui genome, named here as: Aristotelia chilensis Virus 1 (AcV1). Comparative genomics revealed a strong phylogenetic relationship of AcV1 with the Caulimoviridae family, genus Petuvirus. Viruses from this genus commonly infect plants and replicate by reverse transcription (RT) [9]. The unique member of this genus is the Petunia vein clearing virus (PVCV), which is able to integrate within the host's Petunia genome, and is classified as an endogenous pararetrovirus (EPRV). It affects plants, causing a variety of different phenotypes, like chlorotic tissues, leaf malformation, vein-clearing symptoms and other affections [10]. Besides the strong sequence similarities and phylogenetic relationship between AcV1 and PVCV, these symptoms were not observed in Maqui plants.
Finally, we decided to perform RNA-seq assays in order to obtain the overall transcriptional activity of the virus in different plant tissues. Our results showed that AcV1 genes have a low transcriptional activity, which might be associated with the absence of an infective phenotype. This study constitutes the first report evidencing the presence of a virus in Maqui berry and an important step towards the molecular characterization of mobile genetic elements in this native Chilean tree.

Deep Sequencing of Aristotelia Chilensis Genome Results in the Discovery of AcV1
Next generation sequencing approaches were used to study the genome of the native Chilean fruit tree A. chilensis (sample M152). Data analysis of assembled contigs allowed us to identify the initial signs of viral sequences integrated within the genome. Based on sequence similarities searches using BLASTn against the NCBI nucleotide NT database, we identified a contig sequence with high similarity with viral sequences. This contig contains a length of 12,109 nucleotides (nt), a GC content of 38.04% and is composed of 12,032 reads. This initial similarity analysis showed top 10 BLASTn hits with complete pararetrovirus and EPRV sequences (e-values ≤ 2e-05), suggesting the presence of a complete virus inserted in Maqui genome, named here as Aristotelia chilensis Virus 1 (AcV1), containing a full size estimated in 7122 bp. In order to check if AcV1 was present in a single copy in Maqui berry, we mapped the complete viral nucleotide sequence against all contigs and scaffolds from Maqui genome. It confirmed that the virus was present once in the genome (query coverage ≥ 50%).

Bioinformatics Characterization of AcV1
Open reading frame (ORF) predictions using the software Artemis [11] and BLASTx similarity searches against the NR protein database of NCBI, revealed the presence of three different ORFs along the contig sequence (Table 1). ORF1, the longest, starts on the +3 reading frame of the sense strand of the virus genome sequence and its predicted gene product is composed of 820 amino acids (aa), with a predicted molecular weight (MW) of 92.22 kDa. ORF2 is located at the +2 reading frame of the sense strand and is composed of 758 aa, with a predicted MW of 88.69 kDa. ORF3, the smallest one, is located at the −1 frame of the antisense strand, and is composed of 266 aa, with a predicted MW of 29.91 kDa. The stop codon of ORF1 (TGA, genomic coordinates: 2541-2543) overlaps with the start codon of ORF2 (ATG, genomic coordinates: 2540-2542). Similar characteristics have been previously described in pararetroviruses, such as Caulimoviruses [12], Badnaviruses [13] and Tungrovirus [14]. In order to predict the putative functions of the virus ORFs, we used their translated amino acid sequences to perform BLASTp searches. This analysis indicated that ORF1 encodes a Movement protein, whereas ORF2 corresponds to a Reverse transcriptase (RT). On the contrary, ORF3 showed no amino acid sequence similarity with any known virus protein sequence available in the NCBI database (unknown protein). Besides its absence in public databases, together with ORF1 and ORF2, the ORF3 is part of the virus due to its localization between key features available in viral elements: the tRNA binding site in one side, and the poly(A) signal on the other one. Therefore, we conclude that ORF3 is potentially exclusive of AcV1.
Our alignment searches revealed that ORF1 has high similarity with the movement proteins of PVCV (e-value = 8e-106) [15], followed by Citrus endogenous pararetrovirus (CitPRV) (e-value = 4e-94) [16] and Figwort mosaic virus (FMV) (e-value = 5e-06) [17]. A deeper exploration using multiple sequence alignments with CLUSTAL OMEGA [18] showed that ORF1 displays a movement protein family I domain, which is conserved in all Caulimoviridae family's movement proteins (plant pararetrovirus) [19]. ORF1 alignments also revealed a 66%, 63% and 20% identity with the movement proteins of PVCV, CitPRV and Carnation etched ring virus (CERV) respectively [12] ( Figure 1A). Additionally, we found two highly conserved residues Gly (G) and Asp (D), characteristic of MPs, constituting the representative evolution site of the MP protein in Caulimoviridae family and several other virus groups [19]. These proteins modify plasmodesmata by forming intra-channel tubules that can accommodate viral particles, comprising one of the most important transport mechanisms used by several plant viruses [20].
In addition, ORF2 showed a conserved RNase H domain according to BLASTp analysis. Multiple sequence alignments of this domain showed a 64% identity with RNase H protein from PVCV and 27% from Ty3 retrotransposon [23]. RNase H is an endonuclease that cleaves the RNA strand of RNA/DNA hybrid in a sequence nonspecific manner. This characteristic is widely present in various organisms and it is clearly important for its relation with the origins of viruses [21]. The conserved amino acids Asp, Glu, Asp, His (H) and Asp are known to be related to the catalytic region of the protein ( Figure 1C). Additionally, we showed that a histidine (His) motif was highly conserved with the pararetrovirus and not conserved with the retrotransposon. This may suggest that the His motif from Rnase H potentially has key roles in viral sequences [24]. were aligned with caulimoviruses previously selected by homology searches: CitPRV, PVCV, FMV, Rice tungro bacilliform virus (RTBV) [25], Dahlia mosaic virus (DMV), DMV-10-NZ, DMV-10-DR [26], Cauliflower mosaic virus CaMV, and CERV; (B) AcV1 ORF2 (RT) were aligned with reverse transcriptases (RT) from Caulimoviruses and a selected retrotransposon. Invariant motifs distinctive of RT's were underlined and numbered as described by Xiong and Eickbush [22]; (C) AcV1 ORF2 were aligned with Ribonuclease H regions from different Caulimoviruses and Ty3 retrotransposon [23]. Numbers located next to the retroviruses names indicate the amino acid position within the designated ORF. Conserved amino acid residues within a group are highlighted with a black box. Different amino acids containing similar chemical properties are shown in a gray box. Invariant amino acids are highlighted in boldface above the sequences. G = Gly, D = Asp, R = Arg, Y = Tyr, F = Phe, E = Glu, P = Pro, K = Lys, L = Leu, H = His.

Phylogenetic Analysis Reveals a Strong Relationship between AcV1 and the Caulimoviridae Family
Phylogenetic analysis is a powerful tool commonly used to make evolutionary comparisons in order to identify relationships between sequences and different species. Aiming to classify the viral family that AcV1 belongs, we built a phylogenetic tree based on the RT protein sequence from AcV1 in comparison with different members of two endogenous Dahlia viruses [26], CitPRV [16], one representative member of Florendoviruses based on the Family of different host plants [27] and Ty3 retrotransposon of S. cerevisiae (Figure 2).
Our results showed that AcV1 has a high phylogenetic relatedness with Caulimoviridae family, especially with Petuvirus. It gives an additional support that AcV1 is a viral sequence, represented as an endogenous viral sequence. In addition, this analysis revealed that each genus of the Caulimoviridae family clustered separated in different branches. AcV1 clustered together with PVCV and CitPRV, suggesting that they can be part of the same genus: Petuvirus ( Figure 2). However, it is necessary to have access to a bigger number of related viruses sequences and experimental assays in order to validate this hypothesis.
Also, in order to verify if each one of the viruses represented on the phylogenetic tree was also inserted in Maqui berry, we compared their complete sequence against all contigs and scaffolds from the Maqui genome. It showed that only AcV1 was present in Maqui. However, when we compared only the nucleotide sequences of AcV1 ORF2, containing both RT and RNase H domains, against Maqui genome, we found 66 copies using a query coverage bigger than 80% and 87 copies with a coverage above 60%. It might represent rearrangements of AcV1 or the RT and RNase H domain sequences of other still uncharacterized endogenous pararetroviral sequences available in the Maqui genome, that conserved the RT and RNase H domains.
Phylogenetic analysis was implemented using Maximum Likelihood [28]. The evolutionary distances were computed using the Dayhoff matrix-based approach [29]. The bootstrap consensus tree representing the evolutionary history of the taxa analyzed was inferred from 1000 replicates. Nodes below the bootstrap threshold were collapsed [30]. Evolutionary analyses were conducted using MEGA 5 [28]. Genbank accession numbers for each Caulimoviridae family virus and Ty3 retroelement are shown. Florendovirus sequences are still not available on GenBank, their sequences were extracted from Geering et al. [27], and are available on the Supplementary File 1.

General Genomic Organization of AcV1
Using a bioinformatics approach we identified that AcV1 is a single-copy endogenous pararetrovirus inserted in Maqui berry genome. The full length of AcV1 was estimated in 7122 bp, similar to other pararetroviruses from the Caulimoviridae family ( Figure 3A). We found that AcV1 is smaller in 87 bp than PVCV (length: 7206 bp) [15]. Also, it is 1036 bp smaller than CsVMV virus (length: 8159 bp) [31]. Amongst plant retroviruses, the smallest viruses reported are the Badnavirus Cacao swollen shoot virus (CSSV), 7161 bp long [13] and the Petuvirus PVCV, 7205 bp long. Based on previous reports, AcV1 could be classified as a small plant virus [13,15,31].
Furthermore, we found a short non-coding sequence, located 80 bp upstream of the ORF1, that has complementarity with the first 14 nucleotides of the 3' strand methionine tRNA, which was previously described in plants (TGGTATCAGAGCCA, AcV1 genomic coordinates: 1-14, Figure 3A) [12,32]. This region is predicted to be the first nucleotides of the priming binding site for replication by reverse transcription, representing the beginning of AcV1 sequence in the genome of A. chilensis. In addition, we found a TATA box region (TATATAA, AcV1 genomic coordinates: 7038-7044) and a GAGA motif (AGAGAGAGAGAGAGAG, AcV1 genomic coordinates: 6697-6712) located in the antisense strand, downstream to ORF3. This promoter structure is similar to the downstream promoter sequence (dps) of RTBV, which is known to contribute with its transcriptional activity. The efficacy of this promoter in RTBV is associated with the genomic localization of the GAGA motif, which might be located immediately downstream to the TATA-box [33]. Also, we found a poly(A) signal (AATAAAA, AcV1 genomic coordinates: 7025-7031) located downstream of ORF3.
The genomic coordinates of these motifs and the localization of each predicted gene were used in the estimation of the whole viral genome length and its organization. A general whole genome conservational analysis between AcV1 and three related virus from the Caulimoviridae family (PVCV, CaMV and RTBV) was performed ( Figure 3B). Alignments using MLAGAN [34], through VISTA tool [35] and displayed as "peaks and valleys" graphics showed a concordantly genomic structure, especially with PVCV. The conserved region between coordinates 1 and 2000 bp showed to be specific of the Petuvirus genus, especially due to the similar structural organization and the presence of the MP sequence. The genomic coordinates between 2500 and 4000 bp is conserved between all pararetroviruses, it is the region that hosts the RT and RNAse H domains. Finally, the region between 5000 and 7122 bp is not conserved. Interestingly, this region hosts the predicted ORF3, which was shown in our similarity analysis to be exclusive of AcV1.

Evidence of AcV1 Transcriptional Activity Based on Expression Levels Estimated by RNA-seq
A. chilensis has become an important case of study for plant research in Chile. Numerous studies reporting different insights of its biological properties and high antioxidant capacity have been published [6][7][8]. However, by now, there are no reports demonstrating phenotypic alterations in A. chilensis caused by viral activity or infection. Therefore, in order to find the viral expression in Maqui plants and obtain a first insight of its activity, we performed Illumina RNA-seq assays in two different specific tissues: fruits under three different developmental stages (green, half-ripe and ripe) and leaf.
A total of 43 million high quality reads from all libraries were mapped against the virus genomic region. Of these, 21 mapped to ORF1 (movement protein), 23 to ORF2 (reverse transcriptase), 11 to ORF3 (unknown protein) ( Figure 3C). As observed, the RNA-seq assays in all four samples revealed similar expression levels based on reads count. The results indicated that all tissues and samples had similarly low expression levels, concordantly to the absence of symptoms and the optimal growing conditions of these plant samples. Since this is the first virus described in A. chilensis, the absence of symptoms described to date makes it difficult to identify its real effects in the plant.

Discussion
We describe here the first virus inserted in the genome of the Chilean plant Maqui berry, AcV1, using deep sequencing and bioinformatics approaches. We provided evidence of its phylogenetic relationships, genomic structure and the first support of a potential expression activity. Our data, together with the known characteristics of PRV and EPRV viruses, are important evidence that support the classification of A. chilensis Virus 1 as an endogenous pararetrovirus, part of the Caulimovirus family, and specially related with PVCV from Petunia hybrid [15] and CitPRV from Citrus citrange [16]. This relationship can suggest that these three viruses (AcV1, CitPRV and PVCV) are part of the same genus: Petuvirus. However, additional genome sequences of other viruses and further experimental assays are necessary in order to support this hypothesis.
AcV1 sequences were identified inserted in the Maqui berry genome, suggesting an integration event, available in an episomal single copy identified in the contig sequence. These integration events were previously described for endogenous pararetroviruses, such as Banana Strike Virus (eBSV) [36], Tobacco Vein Clearing Virus (eTVCV) [37] and the endogenous Petunia vein-clearing virus (ePVCV), among others [38][39][40]. Based on comparative genomics approaches, we found a strong relationship between AcV1 and pararetrovirus, especially with the genome of PVCV. This relationship is higher mainly on the regions hosting the main proteins (MP, RT and the RNase H domains), commonly founded in all plant viruses. The main differences are on its genome structure organization, in which some domains, like the Protease (PR) and the Zinc finger motif, are present in PVCV and absent in AcV1; and the region containing the ORF3 in AcV1, which is possible exclusive of this virus. It could be part of integration/deletion events during the endogenous pararetrovirus evolution, as previously observed in different EPRVs [41,42], that are enriched in the Caulimoviridae family, and its presence is also strongly related to plants evolution [24,39]. Based on these similarities and differences, AcV1 could also be called eAcV1 (endogenous Aristothelia chilensis Virus 1) and classified as a Petuvirus. However, Southern blotting assays and transmission experiments using isolated full-length AcV1 clones could give a definitive support to classify AcV1 as a pararetrovirus.
The observed AcV1 transcriptional activity (low expression) is an additional signal of the viral presence and activity. The evidenced small transcriptional activity can be due to the fact that the sequenced Maqui plant was asymptomatic under normal growth conditions, and should be verified by PCR/qPCR or Northern blot assays. The low expression level of viral transcripts is probably associated with the absence of symptomatic phenotype. This small expression in EPRV viruses has been shown associated with an epigenetic control regulated by DNA methylation of integrated EPRV genome regions in Solanaceae healthy plants [39]. Its de-methylation results in phenotypic alterations associated to viral expression. Studies with ePVCV showed a relationship between DNA methylation and the production of small interfering RNAs. The authors of the research proposed that viral symptoms might be associated to this a de-methylation stage that increases viral transcripts and siRNA expression levels [39]. Bioinformatics exploration of ESTs databases and experimental validation by RT-PCR of various EPRVs in different plant species revealed that its low-level transcription is associated to asymptomatic plants under normal growth conditions [39]. This is the first described virus sequence and EPRV inserted in Maqui berry Genome. Until now, the only described similarities between these hosts plants, P. hybrida and A. chilensis, were their characteristic dark purple colors, directly associated to anthocyanin production [5,43], and the fact that both are endemic from South America, Argentina and Chile, respectively [8,38]. Additionally, this work showed the importance and application of next generation sequencing and bioinformatics in the identification and characterization of new viral sequences inserted in plants genomes.

DNA Sequencing and Data Processing
The Aristotelia chilensis virus 1 (AcV1) (GenBank accession number KJ094314) was identified from the recently sequenced genome of Maqui (Aristotelia chilensis) sample M152, plant #6 [44], which was collected at the 10th region of Chile. Total Maqui DNA was extracted from young leaves using a modified grapes method [45]. A total of 5 µg of DNA were used to prepare an Illumina paired-end (IPE) shotgun library using the Nextera kit (Illumina, San Diego, California, USA). This library was sequenced using the MiSeq Illumina platform (220 bp paired-end reads). The de novo assembly of the draft genome was performed using the MIRA assembler tool [46]. Viral sequences were identified using BLASTx, comparing the obtained nucleotides sequences against the non-redundant (NR) protein database from NCBI, followed by a taxonomic classification using MEGAN software [47].

Sequence Analysis and Genome Comparison
For the annotation process, all viral contigs were used as input for the Artemis tool [11]. All identified ORFs were annotated using BLASTp tool against the NR protein database of NCBI. Sequence comparisons against plant pararetroviruses of Caulimoviridae family and Ty3 retrotransposon were performed using multiple alignments tool available in CLUSTAL OMEGA [18]. The visualization was performed using the multiple alignment sequences editor Jalview [48]. The VISTA [35] comparison tool and MLAGAN align [34] were used to compare AcV1 against virus with similar phylogenetic characteristics. The alignment was built using a conservation threshold of 70% as identity, over a window of 100 bp between viral genomes in comparison to AcV1.
A phylogenetic tree was built with the multiple alignment of RT under bootstrap analysis (1000 replicates) [30], using Molecular Evolutionary Genetics Analysis (MEGA) software [28]. For the evolutionary comparisons, we used the RT region of the Caulimoviridae family and Ty3 retrotransposon. This approach was constructed using the Maximum Likelihood method [28]. The length of the branches is proportional to the evolutionary distances computed using the Dayhoff matrix based approach [29].

RNA Sequencing and Data Processing
Total RNA from Maqui berry samples was isolated using TRIzol ® Reagent (Invitrogen, San Diego, California, Country). RNA quality was determined by the quotient of the 28S to 18S ribosomal RNA electropherogram peak using a Bioanalyzer Chip RNA 7500 series II (Agilent, Santa Clara, California, USA). Four libraries were prepared using the TruSeq RNA Sample Preparation Kit v2 (Illumina) according to the manufacturer's instructions. First, magnetic beads containing poly-T molecules were used to isolate poly(A) mRNA from 1 μg of total RNA. Then, samples were fragmented by fragmentation buffer and reverse transcribed into cDNA with a PrimeScript™ 1st Strand cDNA Synthesis Kit (TaKaRa, Dalian, China). Next, end repair and A-base tailing was purified using the QIAquick PCR Purification Kit (QIAGEN, Hilden, Germany). Finally, Illumina adapters were ligated to the cDNA fragments. Each library had an insert size of 300 bp sequenced in paired-end reads of 150 bp using a MiSeq Desktop Sequencer.
Raw reads were used to identify expression levels of all three genes associated to AcV1. In order to align raw reads against our virus genome, Bowtie2 [49] was employed, followed by the software IGV (Integrative Genomics Viewer) [50]. This allowed the identification of reads distribution along the viral whole genome sequence and to obtain a general overview of AcV1 genomic structure.