biomolecules

In this study, a previously little-studied group of viruses, the virophages, was searched for and identified in the viromes of the ancient oligotrophic Lake Baikal. Virophages are small dsDNA viruses that parasitize giant viruses (e.g., Mimiviridae), which in turn infect unicellular eukaryotes. We analyzed eight viromes obtained in different seasons from the deep-water areas of three basins of Lake Baikal and from the shallow-water strait Maloye More. Virophage sequences were detected in all viromes and were the most abundant group after bacteriophages and algal viruses. Sixteen putative complete virophage genomes were assembled, all of which contained four conserved genes encoding the major capsid protein (MCP), minor capsid protein (mCP), maturation cysteine protease (PRO), and FtsK-HerA family DNA-packaging ATPase (ATPase). MCP-based cluster analysis separated the sequences by season; no dependence on geographic location was detected.


Introduction
In 2008, Acanthamoeba polyphaga mimivirus (APMV) was isolated by inoculating A. polyphaga with water from a cooling tower. Using transmission electron microscopy, a small virus with icosahedral virions 50 nm in size was observed alongside the giant virus APMV; its genome was sequenced and found to be 18 kbp in length. The authors first proposed the term "virophage" by analogy with bacteriophage and named the virus Sputnik [1].
Virophages were proposed to belong to the family Lavidaviridae, with two genera, Sputnikvirus (Sputnik and Zamilon) and Mavirus, based on six conserved proteins: major capsid protein (MCP), minor capsid protein (mCP, also known as penton), FtsK-HerA family DNA-packaging ATPase (ATPase), maturation cysteine protease (PRO), primase-superfamily 3 helicase (S3H), and a zinc-ribbon domain protein [19]. Later, only four of them were found to be "core", and two (S3H and the zinc-ribbon domain protein) were reclassified as "near-core" [16]. Next, 328 new virophage genomes containing all four main genes (MCP, mCP, ATPase, and PRO) were identified in 14,000 different publicly available microbiomes. Based on the data obtained, the classification of the family Lavidaviridae was revised [20].
Biomolecules 2023, 13, 1773
As mentioned above, virophages are found in various ecosystems and are associated with viruses related to Mimiviridae or with viruses infecting phytoplankton. The eukaryotic hosts of giant viruses have been shown to be algae and protists [22]. Some authors hypothesize that virophages play an important ecological role in regulating the abundance of giant viruses, thereby increasing the survival rates of eukaryotic hosts [1,2,10]. For example, virophages in an Antarctic lake are predicted to stimulate algal production by reducing overall mortality and thereby increasing the frequency of blooms [10].
In 2015, G. Blanc and colleagues discovered provirophages by identifying about 300 putative genes of virophage origin in the nuclear genome of the unicellular alga Bigelowiella natans [27]. The authors hypothesized that the integration of virophages into the genome of B. natans could be beneficial for both: the alga is protected from infection, and the virophages benefit from an increased chance of encountering the giant virus. Virophage integration is also found in Acanthamoeba polyphaga (Lentille virus) [4] and in the self-synthesizing mobile element of Maverick/Polinton [2].
The ability of Mavirus to insert itself into the genome of the cultured protist Cafeteria burkhardae was tested previously. Eight different types of endogenous virophages (endogenous mavirus-like elements, EMALEs) were discovered based on dot plots and phylogenetic analyses related to maviruses. EMALEs can potentially reactivate and replicate in the presence of giant viruses [28].
We provide only brief and basic information on virophages here because recent review articles fully reflect the current state of virophage research [18,22,29-31]. To date, there is no detailed information on the presence of virophages in the ancient oligotrophic Lake Baikal, just as there are no data on virophages in other ancient and large lakes of the Earth. Previously, we and our colleagues conducted a detailed study of the DNA- and RNA-containing viral communities of Lake Baikal using amplicon and high-throughput metagenomic sequencing [32-35], thereby showing the diversity of these communities.
The aim of this study was to identify virophages in the metagenomes of the viral fraction (smaller than 0.2 µm) from Lake Baikal using bioinformatic methods.

Sampling Sites
In 2018, water samples were taken 7 km from the settlement of Listvyanka (BVP1), 3 km from the settlement of Listvyanka (BVP2), 3 km from the settlement of Turka (BVP3), 3 km from Elokhin Cape (BVP4), at the central station in Maloye More Strait (BVP5), at the central station of the transect between the settlement of Listvyanka and the settlement of Tankhoy (BVP6), at the central station of the transect Ukhan Cape-Tonky Cape (BVP7), and at the central station of the transect Elokhin Cape-the settlement of Davsha (BVP8). The dates and coordinates of the sampling can be found in Table 1. Sampling was carried out on board the research vessels of the Limnological Institute, Siberian Branch of the Russian Academy of Sciences (LIN SB RAS) using the SBE-3 bathometer system (Carousel Water Sampler, Sea Bird Electronics Inc., Bellevue, WA, USA). From each horizon (0, 5, 10, 15, 20, 25, and 50 m), 3.5 L was sampled, and the subsamples were mixed to obtain an integral sample of the 0-50 m layer (a total of ~25 L for each sample). During the ice-cover period, sample BVP1 was collected under ice using Niskin bottles.
To obtain free virus particles, the sample was treated with DNase (Thermo Fisher Scientific, Waltham, MA, USA) at 37 °C for 30 min. The DNase was inactivated by adding 20 µL of 50 mM EDTA and incubating at 65 °C for 10 min. The sample was checked for residual bacterial DNA by PCR using the universal bacterial primers 27L (5′-AGAGTTTGATCATGGCTCAG-3′) and 1542R (5′-AAGGAGGTGATCCAGCCS-3′). Agarose gel analysis showed no bands.

DNA Extraction and Library Preparation
DNA was isolated using the standard phenol-chloroform method. The DNA concentration was measured using the Qubit 2.0 fluorimeter (Invitrogen, Carlsbad, CA, USA) according to the manufacturer's instructions (~50 ng of DNA was extracted). The extracted DNA was stored at −72 °C until further analysis.
DNA was fragmented on a Covaris S2 (Woburn, MA, USA), and libraries were prepared using the NEBNext Ultra II reagent kit (New England Biolabs, Ipswich, MA, USA). The resulting DNA libraries were sequenced on a MiSeq instrument using the reagent kit v3 2×300 (Illumina, San Diego, CA, USA).

Bioinformatic Analysis
Initial quality control was performed with FastQC [36]. The reads were then processed with Trimmomatic v. 0.36 [37] using the parameter SLIDINGWINDOW:4:20, and sequences shorter than 50 nucleotides were removed from the analysis. De novo assembly was performed with the SPAdes assembler v. 3.13.0 (Saint Petersburg, Russia) [38] in metaSPAdes mode with default settings.
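The trimming step can be illustrated with a minimal Python sketch. It assumes Phred scores have already been decoded to integers and only approximates what a sliding-window trimmer does; it is not Trimmomatic's implementation. The function names are hypothetical, while the window of 4 bases, the quality threshold of 20, and the minimum length of 50 nucleotides come from the text above.

```python
def sliding_window_trim(quals, window=4, min_avg=20):
    """Return how many leading bases to keep.

    Scans from the 5' end and cuts at the start of the first window
    whose mean quality drops below min_avg (simplified sketch).
    """
    for start in range(0, len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            return start
    return len(quals)


def trim_and_filter(seq, quals, window=4, min_avg=20, min_len=50):
    """Trim a read and drop it (return None) if it ends up shorter than min_len."""
    keep = sliding_window_trim(quals, window, min_avg)
    trimmed = seq[:keep]
    return trimmed if len(trimmed) >= min_len else None


# Toy read (made-up values): 60 good bases followed by 10 poor ones.
read = "A" * 70
scores = [30] * 60 + [10] * 10
print(trim_and_filter(read, scores))
```

A read whose quality collapses early is discarded entirely, because the retained prefix falls below the 50-nucleotide cutoff.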
To identify virophage MCPs, we used hmmsearch from the HMMER 3.2.1 package [42] with 15 previously published models [20] and an e-value threshold of 10⁻⁶. For the analysis of complete MCP genes, sequences shorter than 500 amino acids were excluded. According to published data, the average MCP size is 593 ± 40.1 amino acids (mean ± 1 standard deviation) [20]. Identical sequences within each sample were removed using Usearch v. 9.2.64 [43]. Proteins identified as major capsid proteins were manually verified with online blastp (NR). We also searched for the closest relatives among the HQ-virophage MCP proteins [20] by performing a local blastp analysis with an e-value threshold of 10⁻⁵.
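The filtering logic described above (e-value cutoff, minimum length, removal of identical sequences within a sample) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: the hits are assumed to be already parsed into (identifier, e-value, protein sequence) tuples for a single sample, and the function name and toy records are hypothetical.

```python
def filter_mcp_candidates(hits, max_evalue=1e-6, min_len=500):
    """Keep hits that pass the e-value and length cutoffs, dropping exact duplicates.

    hits: iterable of (seq_id, evalue, protein_sequence) from one sample.
    Returns the identifiers of the retained sequences, in input order.
    """
    seen = set()
    kept = []
    for seq_id, evalue, protein in hits:
        if evalue > max_evalue:
            continue  # not significant at the chosen threshold
        if len(protein) < min_len:
            continue  # too short to be treated as a complete MCP
        if protein in seen:
            continue  # identical sequence already kept for this sample
        seen.add(protein)
        kept.append(seq_id)
    return kept


# Toy records (made-up): one good hit, its duplicate, a short hit, a weak hit.
long_a = "M" * 600
toy_hits = [
    ("orf1", 1e-50, long_a),
    ("orf2", 1e-40, long_a),
    ("orf3", 1e-30, "M" * 300),
    ("orf4", 1e-3, "K" * 650),
]
print(filter_mcp_candidates(toy_hits))
```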
For the phylogenetic tree based on the MCP proteins, the sequences were aligned using MAFFT v. 7.407 [44] with the parameter --auto. The alignment was visually inspected to remove partial and non-homologous sequences. TrimAl v. 1.2 (-gappyout) [45] was used to remove ambiguous regions. Trees were computed using IQ-TREE v. 1.6.9 [46]; model selection was performed using ModelFinder [47], and branch supports were determined using the approximate likelihood ratio test (1000 repetitions) [48] and the ultrafast bootstrap (1000 repetitions) [49]. The resulting trees were visualized and edited in iTOL [50].
Contigs longer than 10,000 nucleotides from each sample were analyzed for the presence of the virophage "core" genes (MCP, mCP, ATPase, and PRO), and their affiliation was determined using the automatic classifier ICTV_VirophageSG (https://github.com/simroux/ICTV_VirophageSG, accessed 21 June 2023). Polinton-like viruses (PLVs) were also detected.
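As a rough illustration of this screening step, the sketch below reports which of the four core genes are present on each contig longer than 10,000 nucleotides. It does not reproduce the ICTV_VirophageSG classifier; the input structure (contig name mapped to a length and a list of gene labels) and the function name are assumptions made for the example.

```python
CORE_GENES = ("MCP", "mCP", "ATPase", "PRO")


def screen_contigs(contigs, min_len=10_000):
    """Report the core genes found on each sufficiently long contig.

    contigs: dict mapping contig name -> (length_nt, iterable of gene labels).
    Returns a dict mapping contig name -> list of core genes present.
    """
    report = {}
    for name, (length, genes) in contigs.items():
        if length <= min_len:
            continue  # the text analyzes only contigs longer than 10,000 nt
        found = set(genes)
        report[name] = [gene for gene in CORE_GENES if gene in found]
    return report


# Toy annotation table (made-up contig names and labels).
toy = {
    "contig_a": (12_000, ["MCP", "PRO", "hypothetical"]),
    "contig_b": (9_000, ["MCP", "mCP", "ATPase", "PRO"]),
    "contig_c": (15_000, ["PRO", "ATPase", "mCP", "MCP"]),
}
print(screen_contigs(toy))
```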
Reads were mapped to the virophage genomes with Bowtie 2 [57], and the alignments were processed with Samtools v. 1.13 [58]. The samples were normalized to the lowest number of reads among the samples using SeqKit v. 2.3.0 [59].
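Normalizing to the smallest sample amounts to randomly subsampling every read set down to the same size. The Python sketch below shows the idea on in-memory lists; it is not the SeqKit invocation used in the study, and the function name, the fixed seed, and the toy read names are assumptions.

```python
import random


def subsample_to_minimum(samples, seed=0):
    """Randomly subsample each read set to the size of the smallest one.

    samples: dict mapping sample name -> list of reads.
    A fixed seed keeps the subsampling reproducible.
    """
    target = min(len(reads) for reads in samples.values())
    rng = random.Random(seed)
    return {name: rng.sample(reads, target) for name, reads in samples.items()}


# Toy read sets (made-up): three samples of different depth.
toy = {
    "BVP1": [f"r{i}" for i in range(10)],
    "BVP2": [f"r{i}" for i in range(4)],
    "BVP3": [f"r{i}" for i in range(7)],
}
normalized = subsample_to_minimum(toy)
print({name: len(reads) for name, reads in normalized.items()})
```

After this step, read counts mapped to a given genome can be compared between samples without a depth correction.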

Taxonomy of Viruses in Lake Baikal Viromes
According to the taxonomic analysis at the class level, bacteriophages of the class Caudoviricetes dominated in all samples (80-94.6%). The second most abundant class was Megaviricetes, which includes the giant DNA viruses and accounted for 5.6-17.9%; the exception was sample BVP1, where Maveriviricetes ranked second. The third most abundant class was Maveriviricetes, which contains the virophages (0.9-2.7%).
Analysis at the family level showed that Kyanoviridae (Caudoviricetes) was the most abundant family in samples BVP1 and BVP2 (30% each), while in the remaining samples, the family Phycodnaviridae (Megaviricetes) was the most abundant (26.3 to 57.3%) (Figure 1). According to the currently accepted classification, the virophage family Lavidaviridae (Maveriviricetes) accounted for 5 to 23.5% in our data. Thus, virophages were among the most highly represented groups in the viromes.
Analysis of Virophage MCP Genes
Using hmmsearch, 974 MCP-like sequences were identified. After the removal of short sequences (less than 500 aa), 319 (32.8%) amino acid sequences remained. Removing sequences that did not align or lacked conserved regions left 294 MCP proteins. The average length of the remaining sequences was 602 ± 36 (sd) amino acid residues, with a maximum length of 709 aa.
Their similarity to the proteins in the NCBI NR database ranged from 23.8% to 90.1% (Table S1). The greatest similarity was observed with the Dishui Lake virophage (AMF83737), which had an aa similarity of 90.1% with a coverage of 99.7% (sequence BVP7_NODE_1419_ORF7). The most frequent close relatives were Dishui Lake virophage 2 (QIG59351), which corresponded to 24.8% of the putative MCPs from all samples with an aa similarity ranging from 36.7% to 80.4%, and Yellowstone Lake virophage 5 (YP_009177804), which corresponded to 20.1% of the sequences with a similarity ranging from 38.5% to 63.5%.
We also compared the MCP proteins obtained in our study with those of HQ-virophages (Table S2). The greatest similarity was found for Ga0114980_10001820 (aa identity 96.2%, coverage 99%), which corresponded to the sequence BVP1_NODE_8945_ORF1. According to the description of this sequence [20], it was extracted from freshwater microbial communities from Lake Simoncouche (an oligotrophic lake), Canada. The most frequent closest relatives were B570J40625_100003451 (aa identity 79.2-87.3%) and Ga0133913_10009135 (aa identity 48.5-51.8%), each corresponding to 8.2% of the Baikal MCP sequences. Virophage B570J40625_100003451 was extracted from freshwater microbial communities from Lake Mendota (a eutrophic lake), and Ga0133913_10009135 was extracted from lakes in northern Canada (co-assembly).

Phylogenetic analysis with known MCP proteins from the complete genomes of virophages showed their division into groups according to the new classification proposed by S. Roux et al. [21]. In total, we could identify seven groups with a support of more than 80% in the clade nodes calculated with two methods (Figure 2). The largest clades (those containing the largest number of Baikal sequences) were SW01 virophages (38%), named after the first isolated member; Aquatic virophages 1 (33.7%), a group that includes representatives from a wide geographic range of freshwater lakes; and Aquatic virophages 2 (20.1%), most members of which are related to large virophages.
One sequence (BVP8_NODE_602_ORF13) is located in the Sputnik virophages cluster and forms a separate branch. According to the blastp analysis, its closest relative is Sputnik virophage 2 (AUG85006, isolate Rio Negro), with an aa identity of 27.2% and a coverage of 92.6% (Table S1). It should be noted that the other two identified virophage proteins in the contig from which this MCP originated also show low similarity to known proteins: an aa identity of 23.9% with the hypothetical protein ASQ67_gp08 (YSLV 7) and an aa identity of 33.7% with the DNA packaging protein (Zamilon virus).
No Baikal MCP sequences were included in the clades of Mavirus virophages, Large virophages, or Rumen virophages. Yellowstone Lake virophage 2 [21], which was not included in any group in the work of S. Roux et al., formed a joint cluster in the tree with 10 sequences from Lake Baikal from different samples (BVP1, BVP2, BVP3, BVP4, BVP5). The position of Yellowstone Lake virophage 7 (YSLV7), with which 13 sequences from Lake Baikal formed a common cluster, also remains unclear. In the original study, the MCP of YSLV7 was the most distant from other virophages in the phylogenetic analysis, suggesting a new lineage [14]. The taxonomic classification of virophages is very difficult, so further research is needed in this area, especially the analysis of genome structure.
Cluster analysis of the samples based on MCP phylogeny resulted in clustering by season. BVP1, BVP2, BVP3, and BVP4 were sampled in winter and spring. According to M. Kozhov [60], the beginning of June at Lake Baikal corresponds to biological spring, so we distinguished a strict "spring" cluster. BVP6, BVP7, and BVP8 formed an "autumn" (September) cluster, while BVP5 (August) fell into the clade with the spring samples as a separate branch, although the bootstrap support was relatively low (Figure 3). Thus, we can assume that the virophage community is seasonally influenced, which may be explained by the pronounced seasonal fluctuations in the composition and structure of the planktonic community characteristic of Lake Baikal [61], including the virophages' hosts.

Identification of Complete or Nearly Complete Genomes of Virophages
In each set of contigs longer than 10,000 nucleotides, the automatic classifier ICTV_VirophageSG identified between 5 and 32 contigs per sample belonging to putative virophages (Table 2); in addition, polinton-like viruses (PLVs) were identified (Table S3). Most of the putative virophage contigs belonged to the recently proposed families Dishuiviroviridae (42.5% of virophage contigs) and Omnilimnoviroviridae (31.9%). Of all 159 identified virophage sequences, only 7 had direct terminal repeats (DTRs), indicating circular complete genomes. Inverted terminal repeats (ITRs) were detected in two sequences identified as PLVs, BVP1_NODE_129 and BVP2_NODE_281.
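The two kinds of terminal repeats can be told apart with a simple string comparison, sketched below in Python. This is a minimal illustration, not the tool used in the study: exact matching only, a hypothetical minimum repeat length of 20 nt, and made-up function names and test sequences.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an A/C/G/T sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def dtr_length(seq, min_len=20):
    """Length of the longest exact direct terminal repeat (start == end), or 0."""
    seq = seq.upper()
    for k in range(len(seq) // 2, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


def itr_length(seq, min_len=20):
    """Length of the exact inverted terminal repeat, or 0.

    The start of the contig must match the reverse complement of its end,
    which is the same as sharing a prefix with the reverse-complemented contig.
    """
    seq = seq.upper()
    rc = reverse_complement(seq)
    k = 0
    while k < len(seq) // 2 and seq[k] == rc[k]:
        k += 1
    return k if k >= min_len else 0


# Toy contigs (made-up sequences).
repeat = "ACGTTGCAAGGCTTAACCGG"
with_dtr = repeat + "T" * 50 + repeat
with_itr = repeat + "T" * 50 + reverse_complement(repeat)
print(dtr_length(with_dtr), itr_length(with_itr))
```

A contig whose two ends carry the same sequence in the same orientation is consistent with a circular genome assembled past its starting point, which is why DTRs are read as a sign of completeness.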
Transfer RNA (tRNA) genes were encoded in LBV4 and LBV13; both tRNAs were specific for methionine (Met, anticodon CAT). We found no similarity between these sequences and known sequences in the NCBI NR or IMG/VR databases.
Trees with known virophages were constructed for all 4 genes from 16 contigs (Figure 4). In general, two highly supported clusters of Baikal sequences were observed, and they were conserved across the trees for each protein. In the MCP and PRO trees, three sequences form one cluster (LBV2, LBV4, LBV7), while in the penton and ATPase trees, the LBV7 sequence is located in a neighboring branch. Two clusters contained a large share of the sequences: six sequences clustered with OLV, YSLV1, YSLV4, YSLV6, QLV, and DSLV2, and four sequences clustered with YSLV5; according to the classification proposed by S. Roux, these belong to Aquatic virophages 1 and Aquatic virophages 2, respectively.
Figure 4 abbreviations: LBV, Lake Baikal virophage; DSLV, Dishui Lake virophage; YSLV, Yellowstone Lake virophage; OLV, Organic Lake virophage; RV, rumen virophage; ALM, Ace Lake Mavirus; Spezl, Maverick-related virus strain Spezl; QLV, Qinghai Lake virophage.
A total of 16 putative complete genomes were obtained (Figure 5) that met the completeness criteria (longer than 25 kbp, four "core" genes, and CheckV completeness ≥ 90%). The number of ORFs in these contigs ranged from 18 to 34, and the GC content was 26.6-45.2%. Virophage genomes are known to have a low GC content [31]. LBV5, LBV7, LBV8, LBV14, and LBV15 had DTRs. In addition to the four identified "core" genes, the genomes contained ORFs similar to those of eukaryotes, bacteria, archaea, and viruses other than virophages.
Of the 41 ORFs whose hits in the NR database were bacterial, 2 were annotated as phage proteins (tail fiber domain-containing protein and phage tail protein) and probably derive from prophages. The remaining ORFs were similar to bacterial proteins such as an HNH endonuclease (WP_105774808), a primosomal protein (WP_066855374), and a transcriptional regulator (MBT4479066). A complete list can be found in Table S5. The amino acid similarity ranged from 25 to 73.9%. One ORF was similar to an archaeal hypothetical protein (MCX6749161), with an aa identity of 72.4%. Five ORFs were similar to eukaryotic sequences (aa identity of 32.1-53.3%). In addition, 66.7% of all non-virophage hits were to hypothetical or uncharacterized proteins. The similarity to eukaryotes, bacteria, and archaea is probably due to the presence of metagenome-assembled genomes (MAGs) in the databases, which makes it difficult to determine the exact affiliation of the sequences.
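The completeness criteria and the GC content reported above are simple to express in code. The Python sketch below is illustrative only: the thresholds are taken from the text, the CheckV completeness is assumed to be supplied as a precomputed percentage, and the function names and toy values are hypothetical.

```python
CORE_GENES = {"MCP", "mCP", "ATPase", "PRO"}


def gc_content(seq):
    """GC content in percent, computed over unambiguous A/C/G/T bases only."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def is_putative_complete(length_nt, genes, completeness_pct,
                         min_len=25_000, min_completeness=90.0):
    """Apply the three criteria from the text to one contig.

    length_nt: contig length; genes: labels of annotated genes;
    completeness_pct: completeness estimate (assumed to come from CheckV).
    """
    return (
        length_nt > min_len
        and CORE_GENES <= set(genes)
        and completeness_pct >= min_completeness
    )


# Toy values (made-up).
print(gc_content("GGCCAATT"))
print(is_putative_complete(26_500, ["MCP", "mCP", "ATPase", "PRO"], 95.2))
```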
Ten hits belonged to the phylum Nucleocytoviricota (giant viruses), among which Phaeocystis globosa virus, Paramecium bursaria Chlorella virus CVM-1, Klosneuvirus KNV1, and Organic Lake phycodnavirus 2 were identified. The amino acid similarity of these sequences ranged from 32.1 to 65.3%, which probably indicates horizontal gene transfer.
Among the ORFs similar to virophage proteins, an integrase (CAI9421294) derived from the Maverick-related virus strain Spezl was found only in the genome LBV9 (ORF_8), with an aa similarity of only 27.2%.
It should be noted that 34.5% of the ORFs from the 16 genomes had no significant hit, suggesting that virophages are underrepresented in the databases.
Three clusters are formed in the proteome tree, corresponding to three new families: Burtonviroviridae, Dishuiviroviridae, and Omnilimnoviroviridae. As expected, YSLV7 forms a separate branch. Because ViPTree contains a limited number of virophage sequences, only YSLV5, YSLV6, and YSLV7 appeared as the closest relatives (Figure 6).
Figure 6. Proteomic tree constructed with the online service ViPTree. Black branches show the closest relatives; red branches are from this study. For sequences from Lake Baikal, the sample to which the genome corresponds is given in parentheses; for the closest relatives, the accession number is given.
Mapping of reads to the 16 Lake Baikal virophage genomes showed that 4 of them (LBV6, LBV7, LBV10, LBV12) were present in all seasons with more than 100 reads (Figure 7). LBV6 was the most strongly represented in all samples. LBV9 was characteristic only of the BVP5 sample (Maloye More Strait). LBV8 and LBV10 were more prevalent in the summer and autumn samples (BVP5, BVP6, BVP7, BVP8). LBV14 had most of its reads in the sample from which it was extracted (BVP8). LBV16 was identified in BVP7 and BVP8, with the lowest numbers of reads in the other seasons, as was the case for LBV13 and LBV14.
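The "present in all seasons" statement can be read as a threshold applied to a genome-by-sample table of mapped read counts. The Python sketch below shows one such reading, assuming that a genome counts as present only if it has more than 100 mapped reads in every sample; the function name and the toy counts are hypothetical and are not the study's data.

```python
def genomes_in_all_samples(read_counts, min_reads=100):
    """Return genomes with more than min_reads mapped reads in every sample.

    read_counts: dict mapping genome -> dict mapping sample -> read count.
    A sample missing for a genome is treated as zero reads.
    """
    samples = set()
    for per_sample in read_counts.values():
        samples.update(per_sample)
    return sorted(
        genome
        for genome, per_sample in read_counts.items()
        if all(per_sample.get(sample, 0) > min_reads for sample in samples)
    )


# Toy counts (made-up numbers for illustration only).
toy_counts = {
    "genome_a": {"BVP1": 500, "BVP5": 300},
    "genome_b": {"BVP1": 0, "BVP5": 250},
    "genome_c": {"BVP1": 101, "BVP5": 100},
}
print(genomes_in_all_samples(toy_counts))
```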

Discussion
Virophages are currently a little-studied element in the viral community, but in recent years, the number of studies focusing on this topic has increased.
Here, for the first time we successfully detailed searched for and identified virophages in the metagenomes of the viral fraction from Lake Baikal.Virophages were found to be present in all seasons, and they were not confined to a greater extent to either the deepwater pelagic basins or the shallow water strait.Based on the MCP cluster analysis, it was shown that virophages from different seasons form their own clusters.Thus, we can assume that the virophage community is subject to seasonal influences, which could be obviously explained by pronounced seasonal fluctuations in the composition and structure of the planktonic community characteristic of Lake Baikal [61], including their hosts.
Following the newly proposed classification of virophages, we performed a phylogenetic analysis of all the obtained MCPs and found that clustering by groups generally reveals a clear distribution pattern. In the Baikal samples, sequences of four groups were identified, and no sequences belonging to the groups Mavirus virophages, Large virophages, or Rumen virophages of S. Roux et al. [21] were found. At the same time, we observed an expansion of the clusters containing YSLV2 and YSLV7, which apparently represent new lineages.
The trees constructed on the basis of four conserved proteins (MCP, penton, ATPase, and PRO) showed three groups, conserved for all four proteins (except for certain sequences).
By applying the automatic classifier and genome completeness criteria, we identified 16 putative complete virophage genomes containing the four conserved proteins. Some virophage genomes, such as those from the sheep rumen metagenome, appear to lack the penton gene [15], but, as mentioned above, we found no similarities with members of this clade. Some genes found in virophage genomes show diverse similarities with giant viruses, phages, bacteria, eukaryotes, and mobile genetic elements [31]. It has been repeatedly shown that some virophage genes are similar to genes of other viruses (giant viruses, bacteriophages) and bacteria, for example, in [1,10,11]. We also observed such similarities with non-virophage sequences in our genomes. It has previously been suggested that the integrase gene shared by Sputnik and archaeal viruses (plasmids) could have been isolated independently from an ancestral virus or may reside in an archaeal endosymbiont located in a eukaryotic cell [1]. Mavirus, for example, is closely related to the Maverick/Polinton virus-like mobile elements [2], which in turn are found in a wide range of eukaryotes [62]. Notably, polintons most likely originated from bacteriophages and gave rise to most major groups of eukaryotic dsDNA viruses, as well as several groups of plasmids and transposons [63]. Moreover, the polinton-like virus Gezel-14T was recently shown to be capable of forming virions [64]. In our opinion, the presence of similar sequences in giant viruses and eukaryotic hosts reflects their close relationship and the process of horizontal gene transfer. On the other hand, the similarities with eukaryotes, bacteria, and archaea may stem from MAGs present in the databases, in which sequences may be incorrectly assigned.
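The completeness criterion based on the four conserved genes can be illustrated with a minimal sketch. It is not the pipeline used in this study: the gene labels, the function name contigs_with_all_core_genes, and the example annotations are hypothetical, and a real workflow would also take contig length and terminal repeats into account.

```python
# Core genes named in the text; the exact label strings are an assumption.
CORE_GENES = frozenset({"MCP", "penton", "ATPase", "PRO"})


def contigs_with_all_core_genes(annotations):
    """Return the sorted IDs of contigs annotated with every core gene."""
    return sorted(
        contig
        for contig, genes in annotations.items()
        if CORE_GENES <= set(genes)  # subset test: all four genes present
    )


# Hypothetical annotations, not data from this study.
example = {
    "contig_A": ["MCP", "penton", "ATPase", "PRO", "integrase"],
    "contig_B": ["MCP", "ATPase", "PRO"],  # lacks the penton gene
    "contig_C": ["PRO", "ATPase", "penton", "MCP"],
}

complete = contigs_with_all_core_genes(example)
print(complete)  # ['contig_A', 'contig_C']
```

A contig such as contig_B, which lacks the penton gene, is excluded by this filter, which is how genomes resembling the rumen virophages would be handled under this assumption.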
Predicting the host virus of a virophage is difficult, partly because of the small number of cultured virophages. The study by S. Roux et al. [16], for example, used a set of co-occurrence analyses, but the authors warned that the results of this analysis should be interpreted with caution. In our dataset, we identified potential hosts based only on sequence similarity with known Nucleocytoviricota, i.e., potential hosts of virophages, but this showed only a possible range of hosts. As for eukaryotic hosts, it is difficult to draw conclusions because we used the fraction smaller than 0.2 µm, and the main pool of eukaryotic DNA is lost during filtration through the removal of phyto- and zooplankton. However, in all samples, blastp analysis of all ORFs against the RefSeq database revealed a few ORFs similar to those of picophytoplanktonic green flagellates such as Micromonas commoda (aa similarity of 23.7-95.2%) and Ostreococcus lucimarinus CCE9901 (aa similarity of 21.9-93.3%). Although these are marine species, closely related species probably occur in Lake Baikal. To date, the prasinophytes of Lake Baikal have not been described by morphological criteria, but high-throughput sequencing of 18S rRNA has revealed amplicon sequences belonging to Mamiellophyceae (Prasinophyceae, Chlorophyta) in the plankton [65]. An ORF similar to Fusarium oxysporum Fo47 (Ascomycota) was detected only in the PLV sequences, and the remaining sequences obtained after applying ICTV_VirophageSG did not contain any ORF similar to eukaryotic sequences.
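The kind of summary reported above (a range of similarity values per organism) can be derived from tabular BLAST output, as in the following sketch. It is an illustration under stated assumptions, not the procedure used in this study: it assumes the standard 12-column tabular format (-outfmt 6), an e-value cutoff of 1e-5, and a hypothetical lookup table organism_of that maps subject IDs to organism names. The pident column is percent identity, which is not necessarily the similarity measure reported in the text.

```python
import csv
import io

# Column order of the standard BLAST tabular format (-outfmt 6).
OUTFMT6 = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def identity_range_by_organism(handle, organism_of, max_evalue=1e-5):
    """Return {organism: (min_pident, max_pident)} for hits passing the cutoff."""
    ranges = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(OUTFMT6, row))
        if float(hit["evalue"]) > max_evalue:
            continue  # hit is not significant enough
        organism = organism_of.get(hit["sseqid"])
        if organism is None:
            continue  # subject is not in the lookup table
        pident = float(hit["pident"])
        low, high = ranges.get(organism, (pident, pident))
        ranges[organism] = (min(low, pident), max(high, pident))
    return ranges


# Hypothetical hits and subject IDs, not data from this study.
rows = [
    ["orf1", "subjA", "90.0", "200", "20", "0", "1", "200", "1", "200", "1e-50", "300"],
    ["orf2", "subjA", "30.0", "150", "105", "2", "1", "150", "5", "154", "1e-6", "60"],
    ["orf3", "subjA", "50.0", "40", "20", "0", "1", "40", "1", "40", "0.01", "30"],
    ["orf4", "subjB", "40.0", "180", "108", "1", "1", "180", "3", "182", "1e-10", "80"],
    ["orf5", "subjX", "99.0", "300", "3", "0", "1", "300", "1", "300", "1e-80", "500"],
]
table = io.StringIO("\n".join("\t".join(r) for r in rows))
organism_of = {"subjA": "Organism A", "subjB": "Organism B"}

result = identity_range_by_organism(table, organism_of)
print(result)
```

In this example, the hit for orf3 is dropped by the e-value cutoff and the hit for orf5 is dropped because its subject is absent from the lookup table.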
tRNA genes have been predicted in ssDNA (single-stranded DNA) viruses, ssRNA (single-stranded RNA) viruses, and many dsDNA viruses [66,67]. The presence of tRNA genes in viral genomes is thought to compensate for differences in codon and/or amino acid usage between the virus and its host, thus promoting efficient protein synthesis and/or expanding the host range [67,68]. The origin of these tRNAs in virophages remains unclear. Previously, tRNAs were identified in HQ-virophages, and seven of these genomes contained the integrase gene, supporting the hypothesis that they could integrate into the host genome [20]. In our data, we did not identify an integrase in the genomes containing tRNA (LBV4 and LBV13). Therefore, the function of tRNA in virophages needs to be unraveled in further studies.
The presence of most genotypes in different basins and seasons indicates that virophages are widely distributed in Lake Baikal throughout the year. The high similarity of the sequences we obtained with those of virophages from Lake Dishui (China), Yellowstone Lake (USA), Ga0114980_10001820 (Lake Simoncouche, Canada), and B570J40625_100003451 (Lake Mendota, USA) points to their global distribution.
Our data expand knowledge of virophages both in general and in freshwater lakes in particular, especially in lakes of ancient origin.

Conclusions
Virophages represent a unique and still relatively understudied group of viruses. Our data, obtained from the deepest and oldest freshwater lake on the planet, will contribute to the understanding of the distribution, genetic composition, and host relationships of these viruses. By analyzing eight metagenomes of the viral fraction (smaller than 0.2 µm) obtained in different seasons from different basins and the shallow-water strait, we detected virophages in all seasons. Determining the taxonomic affiliation of new viruses is a difficult task. The identification of virophages in different habitats and the accumulation of a data pool should eventually clarify their genetic diversity and possibly reveal patterns in the composition of virophage communities. In addition, we identified 294 MCP genes that potentially extend new lineages, as shown for YSLV2 and YSLV7 in the phylogenetic trees.

Figure 1. Taxonomic representation of viral ORFs at the family level detected in Lake Baikal according to the NCBI NR database, blastp (e-value 10⁻⁵). Virophages are in bold. Others: less than 1%.

Figure 5. Genome maps of putative complete/near-complete genomes of virophages identified in Lake Baikal viromes. Arrows indicate ORFs; arrow direction indicates coding on the + or − strand. The arrows below represent ORFs with overlapping reading frames.

Figure 6. Proteomic tree constructed with the online service VipTree. Black branches show the closest relatives; red branches are from this study. For sequences from Lake Baikal, the sample to which the genome corresponds is given in parentheses; for the closest relatives, the accession number is given.

Figure 7. Mapping of reads to 16 putative virophage genomes. The number of aligned reads was normalized. Value: the number of reads. Blue circles mark the sample whose reads were used to assemble each genome.

Table 1. Dates and coordinates of sample collection from Lake Baikal.

Table 2. Number of detected contigs of putative virophages and PLVs in samples.