Extended Evaluation of Viral Diversity in Lake Baikal through Metagenomics

Lake Baikal is a unique oligotrophic freshwater lake with unusually cold conditions and amazing biological diversity. Studies of the lake’s viral communities have begun recently, and their full diversity is not elucidated yet. Here, we performed DNA viral metagenomic analysis on integral samples from four different deep-water and shallow stations of the southern and central basins of the lake. There was a strict distinction of viral communities in areas with different environmental conditions. Comparative analysis with other freshwater lakes revealed the highest similarity of Baikal viromes with those of the Asian lakes Soyang and Biwa. Analysis of new data, together with previously published data allowed us to get a deeper insight into the diversity and functional potential of Baikal viruses; however, the true diversity of Baikal viruses in the lake ecosystem remains still unknown. The new metaviromic data will be useful for future studies of viral composition, distribution, and the dynamics associated with global climatic and anthropogenic impacts on this ecosystem.


Introduction
Viruses (mainly bacteriophages) are the most numerous and diverse components of aquatic ecosystems. They significantly affect the biodiversity, composition, and productivity of aquatic ecosystems [1][2][3][4].
The rapid progress in aquatic virology several decades ago has become possible thanks to the development of methods and devices for virus counting, filtering, and concentration of virus-like particles (VLPs), and most importantly, to new genomic technologies, including high-throughput sequencing and new approaches to library construction [5,6]. Metagenomics, i.e., sequencing of the genomic content of an uncultured community, now represents a powerful approach for investigating the diversity and relationships of viruses in aquatic ecosystems.
Since the first metavirome reports [7,8], numerous studies have been carried out in various biomes; each of them has provided new information about the diversity and roles of viruses in natural ecosystems. Marine studies encompass a wide variety of seas and oceans around the world, with a large number of dedicated stations and samples from different depths, seasons, etc. [6]. In contrast to marine viruses, freshwater viruses remain less studied; however, the interest in freshwater ecosystems, including large water bodies of great social importance, is gradually increasing [9][10][11][12][13][14][15][16][17][18].
Baikal is a unique, ancient, oligotrophic freshwater lake-the world's deepest, oldest, and largest by volume. The lake is characterized by an unusual climatic environment and amazing biological diversity (mainly endemic flora and fauna) [19,20]. Some parameters (oligotrophy, large depth, and oxygen concentration) and dynamic processes occurring in the lake (horizontal cyclonic currents, stratification, thermal bar, spring and autumn homothermy, upwellings and downwellings, and the deep convection) [21] are similar to oceanic parameters. Lake Baikal has the shape of a giant narrow crescent stretching from southwest to northeast for 636 km. The lake consists of three basins (southern, central, and northern) that are separated by underwater elevations. There are many large and small bays in the water area of Lake Baikal. The distinctive geomorphological, climatic, hydrophysical, and hydrochemical parameters are typical of the various areas and basins of the lake. Lake Baikal freezes in the first half of January (ice thickness up to 130 cm); the ice breaks in early May. During the warm period (July to September) the temperature of the surface-water layer in open Baikal is 5 to 19 • C, and the temperature of the deep-water layer is circa 4 • C. [20].
The first viral genetic studies showed a high phylogenetic diversity of T4-like bacteriophages and cyanophages from the Myoviridae family in the pelagic and coastal zones of Lake Baikal based on g23 and g20 marker-gene sequencing [22,23]. The phage gene assemblages are different in the lake basins; the environmental factors specifying the trophic status of the lakes influence the diversity of viral communities and may be of greater importance than geographical patterns [24]. Previous studies [25,26] demonstrated through the UniFrac method that g23 and g20 assemblages from freshwater lakes, including Baikal ones, were more closely related to those from terrestrial aquatic environments (wetlands, paddy fields, and upland soils) than to those from marine ones.
The first metagenomic study of the Lake Baikal virome was carried out in 2013 [27]. We investigated the sample of coastal water from the southern basin of the lake and revealed a large potential diversity of viruses of different families and genera, affecting a wide range of bacteria, archaea, algae, amoeba, flagellates, fish, crustaceans, insects, etc. [27]. Recently, Potapov et al. [28] examined two more viromes from the photic layer of the pelagic zone of the lake during the under-ice and late-spring periods and revealed some differences in the taxonomic and functional composition of viruses in these datasets. A comparative analysis of viral communities from different environments indicated different distribution patterns in soil, marine, and freshwaters. Baikal viromes formed a separate clade with those from the Great Lakes of North America (Michigan, Erie, and Ontario).
The aim of this study was to more broadly estimate the viral diversity in Lake Baikal and compare viral communities (mainly dsDNA viruses) in different areas of the lake. We assumed and confirmed that the wide variety of habitats and conditions within the Baikal ecosystem determines the formation of diverse and distinct viral communities. We also expected to identify the special compositions of Baikal viruses, due to the extreme environment and high levels of endemism in the lake. The low level of similarity of the detected viral reads and scaffolds with known viral genomes partially confirmed the uniqueness of the viral populations. However, the chosen approach based on comparing viromes with known viral genomes did not allow us to fully verify this hypothesis, and this issue requires additional research.

Description of Study Sites and Sample Processing
Water samples were collected at deep-water and shallow stations of Lake Baikal in September 2014. Deep-water stations were located in the southern (central site of the Listvyanka-Tankhoy section and Listvennichny Bay) and central (the Ukhan-Tonky section) basins of the lake ( Figure 1, Table 1). The southern and central basins are the deepest, 1423 m and 1637 m, respectively [20]. Listvennichny Bay is one of the largest (with a depth over 1000 m); it is located near the Angara River source (Figure 1). Two shallow stations (Olkhonskiye Vorota Strait and Kurkut Bay) were located in the area of the Maloye More Strait (Figure 1, Table 1). The Maloye More Strait is one of the largest shallow areas (maximum depth 200 m) located between the west coast of the lake and the largest Olkhon island. The Olkhonskiye Vorota Strait (depth down to 100 m) [20] connects the Maloye More Strait with the central basin of Lake Baikal (Figure 1). There are many sunny days in this region; in summer, the water in numerous bays warms up to +20-24 • C. Figure 1. Lake Baikal and sampling sites (marked with a black circle). The lake shoreline contours and additional designations were reproduced using Google Earth Pro software (https://www.google. com/intl/ru/earth/versions/#earth-pro; accessed on 21 July 2020) with the free graphics editor "Inkscape" (https://inkscape.org/; accessed on 1 May 2020). At deep-water stations, the sampling was carried out from the board of the research vessel "G.Yu. Vereshchagin" using an SBE 32 Carousel Water Sampler bathometer system (USA) at depths of 0, 5, 10, 15, 25, 50, 100, 250, and 500 m; 2.5 liters of water were taken from each depth at these stations. The water samples were filtered to remove bacteria and other organisms larger than viruses through 0.45 µm nitrocellulose membrane filters (47 mm diameter, Vladisart, Vladimir, Russia) using a vacuum pump. Then, the water samples from different depths of each station were combined in a 25 l glass bottle, separately for each station (samples V1, V2, and V4; Table 1), and VLPs were concentrated in the resulting integrated samples (the total volume of each sample was approximately 22 l) with the Sartocon Slice tangential filtration pilot system (30 kDa, Sartorius, Germany) to 100 mL, filtered through the 0.2 µm syringe filters (PES, Sartorius) and additionally concentrated with the Amicon-15 ultrafiltration centrifugal devices, 30 kDa (Millipore, Darmstadt, Germany) to 1.5 mL as previously [27]. At shallow stations, the sampling was carried out using a five-liter bathometer and a winch. The sampling depths at shallow stations were as follows: 0, 5, 10, 15, 25 m in the Olkhonskiye Vorota Strait (three liters from each depth), and 0 m at the central site of Kurkut Bay. The prefiltered water samples (through 0.45 µm nitrocellulose membrane filters) from different depths at these two sites were combined (sample V3; Table 1), and the integrated sample (approximately 18 l) were processed as described above for the samples V1, V2, and V4.
Combining samples from different depths allowed us to identify the most complete variety of viral communities from different stations with the least number of samples and resources for their processing and analysis. However, this approach cannot trace the vertical distribution of viral communities and assess the differences in their composition in the surface and deep layers of the lake, which requires additional research.

DNA Extraction
The concentrates of VLPs were treated with DNase I (50 U mL −1 ; Thermo Fisher Scientific, Carlsbad, CA, USA) for 1.5 h to remove contaminating external nucleic acids. DNA was isolated using sodium dodecyl sulfate (SDS), proteinase K, and the phenol-chloroform extraction method [29]. The absence of bacterial contamination in the VLPs concentrate was checked by PCR amplification with the 16S rRNA universal primers, 27F and 1093R, as described in [30]. The concentration and quality of the extracted DNA were measured with a NanoDrop spectrophotometer (Thermo Fisher Scientific, Carlsbad, CA, USA).

Library Preparation and Sequencing
The total viral DNA was sheared in a microTUBE AFA Fiber Snap-cap using a Covaris S2 instrument (Covaris, Woburn, MA, USA) with a medium-size distribution of fragments of approximately 500 bp. The paired-end libraries were prepared using a NEBNext Ultra DNA library prep kit for Illumina (New England Biolabs, Ipswich, MA, USA).
Sequencing of the libraries was performed on a MiSeq genome sequencer using the MiSeq Reagent Kit v.3 (2 × 300 cycles, Illumina) in the Genomics Core Facility (Institute of Chemical Biology and Fundamental Medicine, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, Russia).

Preparation of Reads
The general scheme of bioinformatic data processing is shown in Supplementary Figure S1.
For comparative analysis, we also used the published datasets on freshwater viromes, including Baikal ones [27,28], sequenced using the same library preparation and sequencing techniques as in our study (the Illumina platform) and available in the NCBI SRA database at the time of our analysis ( Table 2).  SRA archives of data on other freshwater bodies (FASTQ files) were downloaded from the NCBI database using the "fastq-dump" utility. All data, including those from Lake Baikal and other lakes, were grouped into one FASTQ dataset for future analysis.
The visualization of the quality of the reads was carried out using the FASTQC program (http://www.bioinformatics.babraham.ac.uk/projects/fastqc; accessed on 8 January 2019).
Trimming by the quality of reads was carried out in the Trimmomatic V 0.39 program [31] using the adaptive quality trimmer options (MAXINFO: 40: 0.1); the reads of 100 bp or more were used for further analysis.

Taxonomic Analysis of Viral Reads
For taxonomic identification, the metagenomic reads were compared with the NCBI RefSeq complete viral genome database and the NCBI RefSeq complete viral proteome database [32] (released on 9 February 2020). A double comparison was necessary because some complete viral genomes of NCBI RefSeq were not annotated, and reads similar to these genomes would have been left out of the analysis (as unidentified) if the RefSeq viral proteome database only was used. The comparison was carried out in two stages. At the first stage, each read was compared with the DIAMOND program [33] against the NCBI RefSeq complete viral proteome. For the DIAMOND program, we used the "more-sensitive" option. Then, we selected reads with proteomic similarity ≥ 35%. If there was no similarity of a read with any on the NCBI RefSeq complete viral proteome database, such read passed to the second stage. At the second stage, the read was compared with the NCBI RefSeq complete viral genome database by the BLASTn algorithm [34]. The BLASTn parameters used were as follows: cost to open a gap, 2; gap extension cost, 1; word size, 9; penalty for a nucleotide mismatch, 1; reward, 1; e-value of ≤ 0.00001, bit score ≥ 50. If no similarity with the database was found at this stage, the read was not identified as viral. The reads similar to the same viral genome (ID) from the NCBI RefSeq database, were summarized and assigned to one virotype (virus species from the NCBI RefSeq database). The count of the virotypes in a sample was defined as the number of reads in the sample associated with this virotype by the DIAMOND or BLASTn program. Consequently, the count table of the various virotypes in the samples was obtained.
The count table of virotypes (number of hits per each virotype in the sample) was normalized to the genome length of the virus, representing the virotypes in the RefSeq database. Normalization was carried out according to the following algorithm: (1) The average length of virotype genomes was calculated by Formula 1: where Ls is the average length of genomes in a sample, N-number of virotypes in the sample, L i genome length of the i-th virotype.
(2) The normalization coefficient S i for each virotype in the sample was calculated by Formula 2: (3) Count of each virotype in the sample was recalculated using Formula 3: where n i is the initial value of the virotype count in the sample; nk i -the virotype count in the sample normalized to genome length. Normalization to the genome length was used because viruses with longer genomes obviously make a greater contribution to the DNA pool at the same concentration of viral particles in the sample. A similar RPKM normalization is used in the transcriptomic analysis.
The algorithm of taxonomic identification used in our study was similar to that of the known METAVIR 2 service [35], which identifies a read based on comparison with the complete viral proteomes. The advantage of our algorithm is the combined analysis of complete viral genomes and proteomes that allowed us to compare the reads with viral genomes, for which the annotation of their proteome was not represented in the NCBI RefSeq database, and, at the same time, to identify the distant similarity of the reads, comparing translated reads with proteins. Moreover, the scheme used allowed us to apply available computing resources with a large number of computing cores and a small amount of RAM.
For further statistical calculations, we took into account only 95% of virotypes, mostly represented by the number of reads.
The pipeline for processing the BLASTn and DIAMOND results were created using the R programming language (R Project for Statistical Computing: https://www.r-project.org/; accessed on 1 June 2020) [36].

Statistical Analysis of Taxonomic Diversity
A virotype count table was used for comparative statistical analysis of viral community samples. We evaluated the potential (underestimated) number of virotypes (α-diversity or species richness) in communities, with the increasing reading depth using Chao1 [37]. Shannon and Simpson indices [38] of biodiversity were also calculated for each sample.
For multivariate statistical analyses, the taxonomic composition based on the virotype count table was normalized to the relative abundance of reads per sample. To equalize the effect of virotypes with different counts per sample (from the highest to the lowest ones), the values ranged between 0 and 1.
The taxonomic composition similarity of the samples was visualized using the nonmetric multidimensional scaling (NMDS) ordination method with the Euclidian distance metric. Gradient vectors of the viral family composition and community biodiversity indices were fitted on the NMDS scatter plot. The reliability of linear approximation for gradient vectors was assessed by multivariate linear regression analysis.
Biodiversity analysis and NMDS were carried out in the "vegan" package for the R programming language [39], according to the tutorials [40].
The samples were clustered by similarity counts of virotypes using the Euclidian distance metric and "average" hierarchical cluster method by standard functions of the R programming language.
Dominant virotypes from the virotype count table in each sample were visualized with the heat map generated using the "gplots" [41] package in R. Rows (virotypes) and columns (samples) were clustered and grouped in similarity order (i.e., Euclidean distance metric and the complete-link clustering method).
The significance of the difference between the samples in counts of virotype reads was assessed using the chi-square test for independence. The p-value for the chi-square test was adjusted by the Bonferroni correction for multiple hypothesis testing.

De Novo Assembly of Metagenomic Reads and Identification of Viral Scaffolds
The SPAdes 3.13.1 metagenomics assembler, metaSPAdes [42], with parameters of paired-end reads and K-mer lengths of 21, 33, 55, and 77 were used for the de novo cross-assembly of Baikal samples.
The Burrows-Wheeler Aligner (BWA) software [43] was used to map paired-end reads on scaffolds and calculate the total coverage of scaffolds in the assembly and coverage of scaffolds by reads from each sample. The scaffolds with total coverage of more than five and a length of ≥5000 nucleotides were used for analysis.
The BWA results were used to determine the number of reads mapped on each predicted viral scaffold from each sample. Counts of the predicted viral proteins (ORFs) in samples were defined as the number of reads mapped on a scaffold containing a given protein. Consequently, the count table of viral scaffold representation in the analyzed samples was constructed.

Taxonomic Identification of Viral Scaffolds
Taxonomic identification for the viral scaffolds represented in the Baikal samples was carried out by comparisons of predicted viral proteins in scaffolds with the NCBI RefSeq complete viral proteome database. The comparison was carried out by the BLASTp algorithm with the following parameters: word size, 6; gap open cost, 6; gap extension cost, 2; e-value ≤ 0.00001; bit score ≥ 50, and identity ≥ 35%. For each protein in the scaffold, the best match in terms of the bit score value was selected. If a single scaffold had multiple proteins that matched different taxa (NCBI RefSeq ID), the one with the largest number of matching proteins was chosen as the most closely related virus taxon (virotype) of this scaffold. If the proteins were not repeated in the match list, the level of similarity of the matched proteins was taken into account, and the NCBI RefSeq taxon (ID) with the highest percentage of protein similarity was selected as the virotype.

Functional Annotations of the Viral Communities
The predicted viral proteins were matched with the UniProtKB/Swiss-Prot database [45] by the BLASTp algorithm with the following parameters: word size for word finder algorithm, 6; cost to open a gap, 6; gap extension cost, 2. Viral proteins were considered identified if the best hits had e-value ≤ 0.00001, bit score ≥ 50, and identity ≥ 35%. KEGG (Kyoto Encyclopedia of Genes and Genomes) Orthology (KO) protein identifiers were taken from annotations uploaded from UniProtKB/Swiss-Prot [46], and processed in the «KEGGREST» package [47] for R programming language to obtain the KEGG pathway classification (https://www.genome.jp/kegg/pathway.html; accessed on 4 July 2020) of the predicted viral proteins. The count of the predicted viral proteins in samples was transformed in counts of the KEGG pathway classification groups that were normalized for the average number of hits on the viral proteins in each sample.
The predicted viral proteins were matched with the clusters of orthologous groups-COG [48] database by the BLASTp algorithm. For BLASTp, we used the following parameters: word size for word finder algorithm, 6; cost to open a gap, 6; gap extension cost, 2. Viral proteins were considered identified if the best hits had e-value ≤ 0.00001, bit score ≥ 50, and identity ≥ 35%. Counts of predicted viral proteins in each sample were used to calculate the general functional COG categories. Counts of COG protein categories in each sample were normalized for the average number of hits on the predicted viral proteins.
KEGG pathway classification groups and counts of the COG protein category in each sample were visualized with a heat map generated using the «gplots» package [41] in R. Rows (protein functional categories) and columns (samples) in the protein classification count tables were clustered and grouped in similarity order (i.e., Euclidean distance metric and the complete-link clustering method).
VIBRANT v. 1.2.1 software (https://github.com/AnantharamanLab/VIBRANT; accessed on 2 March 2021) [49] was used to determine auxiliary metabolic genes (AMGs) among predicted viral proteins that affect the metabolic activity of the host community. VIBRANT was run with a single thread and all settings set to default as recommended in the manual. AMG functional gene categories were classified according to the KEGG functional identifier. The count of the AMG viral proteins in samples was transformed in counts of the KEGG pathway classification groups that were normalized for the average number of hits on the viral proteins in each sample.

Host Prediction
Host prediction for Baikal viral scaffolds was carried out by two methods. The first method was based on the Virus-Host database [50]. After taxonomic identification of predicted viral scaffolds (see section "Taxonomic identification of viral scaffolds"), the spectrum of corresponding hosts from the Virus-Host database was obtained. The second method was based on the VirHostMatcher-Net software [51] with default settings in "short contig" mode. Data preparation and running were performed according to the manual (https://github.com/WeiliWw/VirHostMatcher-Net; accessed on 4 March 2021). This software provides a prediction of the prokaryotic host based on 62,493 genomes of bacteria and archaea with a series of models: CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) sequences and alignment-free similarity measures developed previously [52]. The use of two identification methods ensured a more detailed search for hosts among prokaryotic organisms (VirHostMatcher-Net software) and identification of potential hosts among eukaryotic organisms (Virus-Host database).
For each of the performed identification methods, the count of the predicted viral scaffolds was transformed into tables representing DNA viruses that infect a certain host species.

Comparison of Viral Scaffold Taxonomic Composition, Predicted Hosts, and AMGs
Comparative statistical analysis of viral taxonomy, predicted host composition, and AMGs based on scaffold count tables was carried out by the NMDS ordination method with the algorithm described above.

Identification and Taxonomic Annotation of Viral Reads
In this study, we obtained four virome datasets (V1-V4) from deep-water (pelagic zone) and shallow stations (area of the Maloye More Strait) of Lake Baikal (Table 1, Figure 1). After primary processing and filtering, our datasets contained from 2,218,572 to 3,573,602 reads, ranging from 100 to 301 bp ( Table 3). The proportions of sequences identified as viral ones with significant matches, with the known sequences (e-value < 10 −5 , bit score ≥ 50) varied from 7.6 to 15.2% (168 557-423 054 of the reads); it was comparable to the number of viral reads in datasets from other freshwater ecosystems (Table 3).  Overall, 24 viral families of DNA viruses were revealed in the V1-V4 Baikal samples-19 families of dsDNA and 5 families of ssDNA viruses (Table 4; Supplementary Figure S2). The families of bacteriophages, Siphoviridae (37.7-49.5% of reads), Podoviridae (26.9-52.0%), and Myoviridae (5.8-13.4%), were the most abundant. Other recently identified bacteriophage families of the order Caudovirales (Ackermannviridae and Herelleviridae) were minor (0.05-0.21% and 0.02-0.1%, respectively). The families Lavidaviridae, Phycodnaviridae, and Mimiviridae were also numerous (up to 2.4%). Totally, the listed dominant families accounted for more than 90% of the identified sequences. Some percentage of viral reads (2.89-7.67%) was similar to unclassified viruses (mainly to viruses of bacteria and archaea, Supplementary Tables S1-S3). The composition of the families in samples V1-V4 was the same, but the proportions of the families significantly differed. The water sample from the Maloye More Strait (V3) was the most distinctive: there was a much larger proportion of reads related to the Podoviridae and much fewer reads of Myoviridae compared to other samples. The smallest number of Phycodnaviridae and Mimiviridae was observed in sample V3, whereas the largest number was found in sample V1 (Table 4; Supplementary Figure S2).

Composition of Virotypes in the Lake Baikal Datasets
Using selected options for taxonomic identification of viral reads (Section 2.5), we revealed 936, 937, 958, and 967 different virotypes in samples V1-V4, respectively ( Table 3). The numbers of virotypes for the previously characterized Baikal samples, 6C, BVP1, and BVP2, were 958, 967, and 983, respectively. The Simpson diversity index for Baikal samples was more than 0.98, and the Shannon index ranged from 5.07 (for V3) to 5.6 (V4) ( Table 3).
A large number and diversity of cyanophages that infected unicellular Cyanobacteria, Synechococcus sp., were among the most represented in all Baikal viromes (Figure 2; Supplementary Table S3). The virotypes of cyanophages, Synechococcus phage S-CBS4, Synechococcus phage S-CBS3, Synechococcus phage S-CBS1, and Cyanophage KBS-S-2A (representatives of the Siphoviridae family), dominated (>1% of virome reads). The myocyanophages were previously reported to be the most diverse and abundant in marine ecosystems [8,53]. In our study, cyanophages of the family Siphoviridae, including the recently isolated and characterized Synechococcus phages, S-CBS1, S-CBS3, and S-CBS4 [54], were among the dominant ones in all viromes. The sequences similar to Synechococcus phage S-SM2 of the family Myoviridae were abundant in sample V1 (1.03%) and the published 6C and BVP2 samples (6.96% and 1.4%, respectively) ( Figure 2; Supplementary Table S3).
The virotypes related to virophages Yellowstone Lake virophages 5, 6, and 7, which were obtained by de novo assembly of metagenomic shotgun sequencing databases from different locations of Yellowstone Lake [56] were among the most numerous in the Lake Baikal datasets, especially in the sample BVP1 (Figure 2; Supplementary Table S3). Previously, it was shown that virophages (Lavidaviridae family) are genetically diverse and widespread; they were found in various unique environments and geographical areas [56,57].

Analysis of Read Assemblies
As a result of metagenomic assembly using the SPAdes software, 4716 scaffolds were created with lengths from 5000 to 628,329 bp and coverage of more than 10× (Supplementary File S1); of these, 1386 viral scaffolds with lengths from 5000 to 138,502 bp were identified with the VirSorter tool (Supplementary File S2). Taxonomic affiliation (as virotype) was assigned for 1093 (78.9%) scaffolds (Supplementary Table S4). The resulting assembly, the viral scaffolds, and the predicted open reading frames (FASTA files) are available in the Figshare repository [58] (Supplementary File S1-S3). Table 5 contains a set of viral scaffolds most represented in the Baikal viromes (the first five scaffolds arranged in descending order of the number of reads per scaffold in each sample were selected). The similarities of the predicted ORFs with those from the RefSeq proteome database averaged between 36.7 and 67.1% ( Table 5). The dominant virotypes mainly included a large number of cyanophages of Synechococcus spp. and other bacteriophages, as well as the Acanthocystis turfacea chlorella virus 1 and Acanthamoeba polyphaga mimivirus. As can be seen from Table 5, the composition of prevailing scaffolds varied in different samples. The longest scaffold, NODE_24_length_120194, that is closely related (in the number of similar proteins) to Caulobacter phage Cr30 predominated in sample V1. The most abundant NODE_310_length_34107 (related to Acinetobacter phage YMC11/11/R3177), NODE_206_length_42308 (Caulobacter phage Percy), NODE_257_length_37045 (Ralstonia phage RSK1), and NODE_2600_length_10244 (not identified) prevailed to a large extent only in one or two samples. The NODE_424_length_29748 (attributed to Pelagibacter phage HTVC008M) were the most represented in all summer samples (V1-V4). It is noteworthy that the number of similar ORFs in this scaffold was extremely low (one of 39/40). The detected reading frames of NODE_2600_length_10244, NODE_4064_length_7673, and NODE_4202_length_7505, which mainly predominated in spring samples (BVP1 and BVP2, [28]), had no similarities with known viral proteins from the RefSeq database. In general, the number of identified proteins in the presented scaffolds was small, except for some scaffolds close to cyanophages (for example, NODE_696_length_22189 where the ratio of detected ORFs and predicted proteins was 25/21); the maximum identity (85.8 and 89.4%) of detected ORFs among dominant scaffolds was also observed with proteins of known cyanophage. This indicates a wide distribution of closely related viruses of picocyanobacteria and emphasizes the better study of the cyanophages in comparison with other bacteriophages inhabiting aquatic ecosystems.  [27,28]); the maximum and average similarity (in %) of predicted viral proteins with the NCBI RefSeq viral proteome database and related virotypes (the five largest sets of reads corresponding to specific virotype in each sample are marked in bold).

Functional Analysis of the Baikal Viral Communities
In this study, we used an assembled dataset (Supplementary Files S2 and S3) for functional analysis of viral communities from different areas of Lake Baikal. The revealed groups of genes in the general set of assembled metavirome reads of Lake Baikal, according to the classification of COG and KEGG databases, are presented in Figure 3, Supplementary Figure S3 and Supplementary Tables S5 and S6. The most representative functional categories, according to the COG database, included mobile genomic elements (prophages and transposons), the proteins and enzymes of cell wall/membrane/envelope biogenesis, as well as of replication, recombination, and repair, and unknown or only generally predicted functions. Structural viral proteins, such as portal, major capsid, tail tip/sheath, and other proteins, as well as enzymes, involved in the synthesis, modification, and packaging of DNA/RNA (polymerase, reductase, terminase, and helicase) were among the most numerous ones in the composition of the Baikal viromes (Supplementary Table S7). The KEGG pathway database revealed the genes of all five highest hierarchical categories, and the predominance of "metabolism" functions. The proteins of replication and repair ("genetic information processing" level), aging ("organismal systems"), signal transduction ("environmental information processing"), and cell growth and death ("cellular processing") also prevailed in summary data on Lake Baikal. Listed categories predominated in all Baikal samples, but the ratio of other functional groups varied ( Figure 3). Dendrograms based on the distribution of general functional categories demonstrated the similarity of the viral communities from deep-water columns in the pelagic zone of Southern and Central Baikal (V1 and V2), as well as of Listvennichny Bay (V4), and their difference from the shallow water samples from the Maloye More Strait area (V3) and the littoral zone of Southern Baikal (6C). Samples BVP1 and BVP2 (photic zone, early ice cover, and late spring sampling period, respectively) also differed from the other samples, similar to obtained taxonomic data ( Figure 2).
The auxiliary metabolic genes (AMGs) from 11 secondary KO categories were defined by the VIBRANT program ( Figure 4; Supplementary Table S8). Overall, 61 AMGs with different functions were identified in Baikal virome samples. Previously, [59] created the "global ocean virome" dataset (of dsDNA viruses) and identified 243 viral-encoded auxiliary metabolic genes. The AMGs composition in both datasets partially overlapped; a total of 44 common genes were identified, but 21 genes of them were assigned to the category "others" (not to AMGs) in the study [59]. Furthermore, 17 genes were identified in the Baikal sets (such as dcyD, D-cysteine desulfhydrase; csn, chitosanase; nadM, nicotinamidenucleotide adenylyltransferase, and others); however, it should be noted that for many of them no similarities were found with the known Pfam domains, according to which the identification of AMGs was carried out in [59]. Generally, most of the enzymes encoded by the identified AMGs are known to be present in the genomes of viruses.
The identified enzymes from the category "glycan biosynthesis and metabolism", are mainly involved in lipopolysaccharide biosynthesis (such as waaC, waaF, and waaQ, heptosyltransferases I-III, and others). The category "metabolism of cofactors and vitamins" is represented by the enzymes of the porphyrin and chlorophyll pathways (cobaltochelatases cobS and cobT), riboflavin (acid phosphatases phoN and ACP5), nicotinate, and nicotinamide metabolism (nadM; nicotinamide-nucleotide adenylyltransferase and others) as well as the folate (7-cyano-7-deazaguanine synthases queC, queE, and reductase queF, and others), ubiquinone and other terpenoid-quinone (ubiG; 2-polyprenyl-6-hydroxyphenyl methylase/3-demethylubiquinone-9 3-methyltransferase) biosynthesis. The enzymes classified as "energy metabolism" are associated with the sulfur pathway (cysC, adenylylsulfate kinase; cysD, sulfate adenylyltransferase subunit 2; and cysH, phosphoadenosine phosphosulfate reductase) and photosynthesis (psbA; photosystem II P680 reaction center D1 protein). The psbA gene in the genomes of cyanophages support the photosynthetic activity of infected cells, which in turn contributes to an increase in viral replication. The domains of phosphoadenosine phosphosulfate reductase (cysH) were regularly found in the genomes of viruses from pelagic freshwater habitats. This is one of the enzymes that are likely involved in the protection of their hosts from reactive oxygen species and in the oxidative burst in protist phagolysosomes [60].

Host Range Prediction
The potential host range (at the phylum level) for the revealed Baikal viruses ( Figure 5A) was estimated, based on known hosts for the identified virotypes according to the Virus -Host database [50], and using the VirHostMatcher-Net software [51]. The former approach allowed us to estimate the putative hosts for a wide range of viruses (not only bacteriophages but also other viruses). Using this method, the putative hosts were determined for 1009 viral scaffolds (72.8%). In total, seven bacterial, one archaeal, and seven eukaryotic phyla, as well as one viral family (Mimiviridae affected by virophages during coinfection of protists), were identified ( Figure 5B Although the number and ratio of bacterial host groups predicted by different methods varied, the same five microbial phyla dominated, which are Proteobacteria, Bacteroidetes, Cyanobacteria, Actinobacteria, and Firmicutes. Interestingly, the samples were clustered in the same way, according to the taxonomic affiliation of the viral scaffolds and AMGs profiles ( Figure 5D), and based on their predicted hosts. The investigated Baikal viromes were divided into three group that included: (i) summer samples from this study (V1-V4), (ii) spring samples (BVP1-BVP2), and (iii) an autumn sample 6C from the littoral zone ( Figure 5). Thus, a comprehensive analysis of the virome data clearly demonstrated the non-random nature of the distribution and formation of viral communities in the lake ecosystem.

Comparative Analysis of Viromes from Different Lake Ecosystems
The list of viral families identified in Lake Baikal (Table 4) and other lake ecosystems considered in this study (Table 2) were generally similar. In addition to the families listed in Table 4, we identified viruses of the families Parvoviridae (in all lakes, except for Baikal and Soyang) and Polydnaviridae (in small numbers in lakes Michigan and Ontario only) (Supplementary Table S3). Some families, especially those with a low number of reads, were absent in some lakes or samples. For example, reads close to Astroviridae viruses were revealed in all analyzed samples of Lake Michigan but only in one sample of Lake Soyang (Soyang_31) and most samples from Lake Baikal (except for samples V3 and 6C); Poxviridae were present in all viromes of Lake Baikal but only in some viromes of other lakes (Ontario_11, Erie_12, LoughN_17, Soyang_31, and Soyang_32). The virotypes of the Baculoviridae family were absent in Lake Biwa, in one Baikal sample (V3), and some samples from Lough Neagh (LoughN_16 and LoughN_21), etc. (Supplementary Table S3). To compare the Baikal and other selected freshwater viromes in terms of the similarity of the taxonomic diversity (the composition and ratio of virotypes), we used hierarchical clustering and identified four clusters of the samples ( Figure 6A). The first two clusters included the samples of Great Lakes: (1) Ontario_11; and (2) all samples of Lake Michigan. The third cluster (3) contained the samples Ontario_7, Ontario_9, Erie_12, Erie_14, Erie_15, and Baikal_V3. The fourth, most numerous, cluster (4) consisted of 21 samples from the lakes Baikal, Biwa, Soyang, and Lough Neagh. This cluster had several sub-clusters, one of those was formed by the samples from lakes Baikal, Biwa, and Soyang. Samples V1, V2, and V4 were the closest related among all Baikal samples; samples BVP1 and BVP2 were grouped separately. The separate sub-clusters composed all samples from Lough Neagh. The most distant samples in the fourth cluster were Biwa_40 and Soyang_29.  Table 2) based on virotype counts (similarity of the taxonomic composition of viral communities). Blue arrows correspond to the vectors of gradient counts for ten dominant viral families in the samples (unreliable vector for the family Lavidaviridae is marked with a dotted line); pink arrows show biodiversity index gradients of viral communities; samples from Lake Baikal are marked in red.
On the NMDS biplot ( Figure 6B), the samples from cluster 4 formed a scattered group in the coordinate plane and partially intersected with cluster 3, specifically, samples Erie_12 and Baikal_V3 were closely located to cluster 4. Clusters 1 and 2 were the most distant from each other and clusters 3 and 4; accordingly, the samples of clusters 1 and 2 differed significantly in composition and virotype ratio from each other and the other clusters. Figure 6 also shows the gradient vectors of ten dominant families in the samples, their species richness, and the Simpson and Shannon species diversity indices obtained in the NMDS analysis. Based on the direction of the gradient vectors, the samples of clusters 1 and 3 showed low values of species richness and the Simpson and Shannon indices, as well as high virotype counts of the families Phycodnaviridae, Myoviridae, Mimiviridae, and especially Parvoviridae and Bacilladnaviridae, and low virotype counts of the families Podoviridae and Siphoviridae. Cluster 2 had high virotype counts of the families Poxviridea and Inoviridae and low virotype counts of Lavidaviridae in contrast to the samples of cluster 4. The direction of the gradient vectors suggests that an increase in species richness and diversity is accompanied by an increase in the representation of viruses of the families Podoviridae and Siphoviridae and a decrease in viruses of the families Phycodnaviridae, Myoviridae, Mimiviridae, Parvoviridae, and Bacilladnaviridae in the samples.
On dendrogram and NMDS biplot, the Baikal samples, V1, V2, and V4, from the deep-water column taken during the same period were the most closely related, but sample V3 from the shallow area (Maloye More Strait) was distant. Among all available samples from Lake Baikal, viromes 6C and BVP1 were the most distant. The distinction of virome 6C from others is additionally associated with a specific sampling site (surface coastal waters) and season (November, biological autumn at Lake Baikal according to Kozhov [61]). Sample BVP1 from the photic layer of the pelagic zone (Listvyanka-Tankhoy section) was taken during early spring [61] in the under-ice period. Thus, cluster analysis, as well as data on taxonomic and functional analysis, likely represent both spatial and temporal differences of viral communities in Lake Baikal.

Discussion
In the present study, we investigated the composition and diversity of viral communities in Lake Baikal using a metagenomic approach. The new metavirome datasets provide an opportunity for a more complete assessment of the diversity and functional potential of viral communities of a large and ancient freshwater lake with unique characteristics and properties [19].
The bulk of viral sequences (99.3-99.9%) identified in our study were similar to ds-DNA viruses. This predominance is associated with the library preparation technique for Illumina where, during linker ligation and amplification steps, dsDNA has an advantage [62]. Overall, only a small group of reads matched with single-stranded (ss) DNA viruses (0.10-0.62%). Recently, Roux et al. [63] have shown that ssDNA viruses are usually less abundant than dsDNA viruses in aquatic samples.
The three families of bacteriophages from the order Caudovirales-Siphoviridae, Podoviridae, and Myoviridae-were the most abundant in the Baikal samples (Table 4, Supplementary Figure S2). Unexpectedly, the Siphoviridae family was dominant in samples V1, V2, and V4 (Table 4), as well as in previously published viromes from Lake Baikal (6C, BVP1, and BVP2), although the Myoviridae family dominated all samples of previous studies [27,28]. There are some causes of the change in the ratio of dominant families. Firstly, the RefSeq database is constantly enriching; in the present analysis, we used the updated version of early 2020. Secondly, as shown previously [27], the BLAST analysis parameters influence potential viral taxonomy; in this study, unlike previous studies [27,28], we used more strict conditions for the BLAST hit search. Thirdly, in our analysis, normalization to the length of the genome strictly shifted the read ratio towards the Siphoviridae family. For example, before normalization, the ratio of the Myoviridae and Siphoviridae families in samples V1 and 6C was 30.3% vs. 28.8% and 58.0% vs. 23.8%, respectively, but after normalization, the ratio became 13.7% vs. 40.2% and 29.6% vs. 43%, respectively (Supplementary Tables S1 and S2).
The viruses closely related to known cyanophages of unicellular Cyanobacteria, mainly Synechococcus sp., predominated in all Baikal viromes (Figure 2; Supplementary  Table S3). On the one hand, cyanophage dominance is a consequence of the high content of picocyanobacteria in Lake Baikal [64]; on the other hand, this is due to the great interest of researchers in cyanophages and the wide representation of their genomes in the NCBI RefSeq database. Hence, the proportion of cyanophages may be overestimated in our result. It should be noted that Proteobacteria and Actinobacteria are usually the most representative bacterial phyla in Lake Baikal as in other freshwater ecosystems [65,66], especially in deep-water samples; the contribution of the phylum Cyanobacteria to the analysis of the 16S rDNA genes is usually small (<10%) [27,[67][68][69], except for the summer period (≥70%) [70]. In our previous study of the viral metagenome [27], cyanophages were also the most numerous in the sample; however, the contribution of Cyanobacteria to the bacterial community according to the analysis of the 16S rRNA gene was not great (only 8.7%). This indicates the bias of the resulting data towards well-studied viruses, as also mentioned previously [71].
It should be taken into account that most of the known cyanophages, as well as other viruses of aquatic origin, were isolated from marine ecosystems; therefore, the list of virotypes mainly includes the marine viruses (Supplementary Tables S1-S3). However, this situation is gradually changing, and the database of complete viral genomes is constantly enriched with freshwater viruses [32], including Cyanobacterial ones. Thus, in the future, bioinformatic analysis of available viral metagenomes from freshwater ecosystems will be more accurate.
In addition to a wide variety of bacteriophages, some hypothetical eukaryotic viruses were identified in our datasets. Among them, viruses of the families Phycodnaviridae (viruses of small algae), Mimiviridae (viruses of protists), and Poxviridae (viruses of invertebrates and vertebrates) were the most represented by the number of viral reads. The Phycodnaviridae viruses and the other giant nucleo-cytoplasmic large DNA viruses (NCLDVs) of the family Mimiviridae might be potentially affected by virophages [72] during co-infection of eukaryotic cells. The virophages (similar to Yellowstone Lake virophages 5, 6, and 7) also predominated in the Baikal viromes ( Figure 2). Yellowstone Lake phycodnaviruses 1, 2, and 3, as well as Yellowstone lake mimivirus [73], were also the common virotypes in the Baikal and other freshwater viromes (Supplementary Table S3). We can assume the presence in the freshwater lakes of closely related eukaryotic hosts and their giant viruses that serve as the likely hosts for these virophages.
Based on the results of the taxonomic and functional analysis we revealed a high diversity of viral sequences, a greater similarity of viral communities of integral summer samples from deep-water columns (pelagic zone of southern and central basins and deepwater bay) of Lake Baikal and their difference from shallow areas (littoral zone and shallow strait), which is not surprising because of the significant difference in conditions and depth range of the stations. The littoral zone of the lake differs from the open pelagic area in more intensive biogeochemical processes, as high oxygen concentration, light penetration, and temperature conditions predetermine an increased content of organic matter and organism biomass [19,61]. The area of the Maloye More Strait, as mentioned above, has a special microclimate that is warmer than open Baikal. This is one of the most favorite and popular tourist places, leading to a high level of anthropogenic pollution, which most likely affects the composition of bacterioplankton and, accordingly, viral communities. Interestingly, the highest percentage of identified viral reads was observed for the samples taken from the area of the Maloye More Strait (maximum depth 25 m) and the coastal zone (sample 6C), i.e., in shallow areas: 15.2% and 19.5%, respectively (Table 3). We can assume that in the deep-water columns of Lake Baikal, in comparison with shallow ones, there are more unknown viruses that have no analogues in the NCBI RefSeq database.
In our study, we did not aim to trace seasonal variations or spatial distribution of viral communities (samples V1-V4 were taken at the same time and integral water samples collected from different depths were used); however, during the comparison of all known Baikal viromes, these trends are clearly observed. Most likely, in communities similar to bacterial [69,70,74], the greatest variety and the strongest seasonal changes in viral abundance and diversity occur in the upper layers of the lake where the temperature, light, and other factors play a significant role. However, the general processes that occur in the lake (deep convection, currents, seasonal developing, subsequent die-off, sedimentation of phyto-and bacterioplankton biomasses from the photic zone, and others) also make a certain contribution to the formation and changes of viral communities in the deep layers of the lake. The mixing processes in Lake Baikal determine the high diversity and some similarity of viral taxonomic composition throughout the lake; however, the local environmental conditions strongly affect the structure and diversity of plankton communities. Therefore, the viral communities in shallow and deep-water regions, as well as in the southern and central basins, also significantly differ (p-value < 0.05). A recent study [75] showed a vertical distribution and significant changes between epipelagic and bathypelagic Baikal viral community composition recovered from cellular metagenomes (fraction more than 0.22 µm). Accordingly, the difference between deep-water samples V1, V2, and V4 from other Baikal ones is partly explained by the presence of bathypelagic viral communities in them. However, based on the data on the greatest abundance of viruses at a depth of 0-25 m [76], we believe that in viromes from deep-water columns, the genomes of viruses from the photic surface layer of the lake water are the most represented and mainly determine the differences between samples.
Prediction of viral hosts based on metagenomic reads is a valuable tool for defining uncultured viruses, evaluation of virus-host interactions, the role of viruses in the regulation of various microbial populations, and the functioning of aquatic ecosystems in general. Using the chosen approaches, the different microbial taxa known as abundant in freshwater environment (Proteobacteria, Bacteroidetes, Cyanobacteria, Actinobacteria, Firmicutes, Verrucomicrobia, and others) were predicted as dominant viral hosts ( Figure 5; Supplementary Figure S4). These bacterial phyla were previously recorded based on the analysis of 16S ribosomal genes and shotgun sequencing [55,69,70,74,77]. Our results agree with the known data on the diversity of Baikal bacterial communities and demonstrate the presence of viruses infecting various taxonomic groups of microorganisms. Host prediction analysis, as well as taxonomic and functional (AMGs) analysis of viral scaffolds, revealed three groups of Baikal viromes, divided mainly by season ( Figure 5). So far as the samples also differed in the depths and sites of sampling, it also most likely also influenced their clustering. Previously, the differences in the temporal (seasonal and interannual), vertical, and geographic distribution of microbial taxa in the pelagic and littoral zones of the lake were also shown [69,74,77]. Our results demonstrate a clear difference in composition of samples of both viral and microbial communities, and reflect the dynamics of these populations in Lake Baikal depending on habitat conditions. Unfortunately, it is not possible to identify the main factors of difference between the available viromes because the analyzed samples differ in many characteristics (sampling season, sampling site, depth range, etc.), and this issue requires more targeted studies. However, the results of taxonomic, functional and host prediction analyses are generally consistent and indicate that the distribution and diversity of Baikal viruses are not accidental and depend on various environmental factors.
Sampling for our study was carried out in early September (late summer, according to [61]) during the high abundance of phyto-(including blue-green algae) and bacterioplankton. Another peak of algal bloom and subsequent development of bacterioplankton occurs in March-April during the under-ice period. The different groups of diatoms and other algae dominate during these two periods [78]. Thus, we can assume that during these contrasting seasons, completely different groups of viruses of algae and metabolically related bacteria actively develop in the lake water, which is observed during a comparison of dsDNA viral communities (mainly bacteriophages) from spring (BVP1 and BVP2) and summer (V1-V4) Baikal samples. Unfortunately, viruses of eukaryotic phytoplankton (including diatoms) are less studied than prokaryotic ones. Furthermore, studies of RNA viruses, many of which affect phytoplankton, are necessary for a more complete comparative analysis of phytoplankton viruses. At present, the Refseq database contains 19 diatom viruses, all of which belong to RNA-containing viruses of the families Totiviridae, Flaviviridae, and Narnaviridae [32]. Among the DNA-containing viruses, the family Bacilladnaviridae also includes the viruses of diatoms. Bacillariodnavirus LDMD-2013, the genome of which is assembled from the marine virome [79], was revealed as the closest relative virotype of this family for all freshwater viromes.
The functional analysis of viral scaffolds revealed the different auxiliary metabolic genes involved in the cycles of carbon, nitrogen, sulfur, phosphorus, and their compounds (Figure 4; Supplementary Table S8); many of the identified AMGs have been previously found in viromes from aquatic environments (marine and freshwater) [59,60,[80][81][82]. These metabolic genes stimulate the metabolic functions and defense mechanisms in host cells under various conditions, including rapidly changing and stressful ones, and, in general, contribute to the maintenance of the vital activity of the hosts and phage production (cyanophages, pelagiphages, and others) during infection. The sets of AMGs in the Baikal samples varied, as well as the diversity of viruses and their potential hosts. Thus, our results clearly demonstrate differences in the diversity and composition of viral communities, as well as in functional profiles, and rearrangements of their metabolic functions depending on environmental conditions. Long-term studies of the ecosystem of Lake Baikal have shown the presence of stable seasonal dynamics of the phytoplankton community. As mentioned above, the bulk of the primary photosynthetic production of phytoplankton in Baikal is formed in the under-ice spring period (from March to May) due to the bloom of diatoms [19,83]. By the end of summer, the active phytoplankton growth ends; gradual cell death and active destruction of most of the accumulated organic matter by organotrophic bacteria occur [83]. In our study, the proportion of AMGs responsible for the metabolism of carbohydrates in the samples taken in the summer period (V1-V4) was significantly higher than in other samples ( Figure 4). Thus, it is likely that during this period, dying phytoplankton provided organotrophic bacteria with a carbohydrate-rich substrate; and the proportion of viral species containing genes for carbohydrate metabolism increased. During infection, these phages stimulate the host metabolism and thereby accelerate the utilization of the carbohydrate substrate from the environment. In the samples from the ice, late spring and autumn periods (BVP1, BVP1, and 6C), the phages containing genes for the metabolism of complex substrates, amino acids, cofactors, vitamins, and nucleotides dominated in viral communities. In spring, phytoplankton is actively developing, and cell death has not yet occurred. In autumn the abundance of phytoplankton in the water is minimal; therefore, carbohydrates are not available in a large amount. Perhaps, in the spring and autumn seasons, the organotrophic bacterial community switches to the assimilation of amino acids, cofactors, vitamins, and nucleotides, and viral AMGs help them in these pathways.
Here, we also compared the Baikal virome data with those from other freshwater lakes. We used the same conditions of processing, identification, and description of virome reads for the correct comparison of different samples. It should be noted that we used a limited set of samples of freshwater lakes for our comparative analysis. In the future, we plan to expand the number of samples from various freshwater ecosystems and carry out a more complete and detailed comparison of freshwater viral communities. Based on our data, lakes Michigan, Ontario, and Erie were the most distinctive from Lake Baikal and other analyzed lacustrine ecosystems. Baikal samples (especially deep-water ones) tend to be closely related to lakes Soyang and Biwa, which is most likely due to the similarity of some hydrological parameters and a relatively close geographic location. Previously, we have indicated that the trophic status of lakes has a more significant effect on the structure of viral communities than their distance [24]. Therefore, we believe that the former factor is more important. Similar to the original studies [10,17] strong temporal dynamics were also observed in lakes Soyang (Korea) and Biwa (Japan) (Figure 6). In Lake Soyang, Soyang_29 (January) and Soyang_33 (September) were the most distinctive samples. Interestingly, the water quality in Lake Soyang strongly depends on monthly rainfall: during the monsoon period (July-August) the status of the water body changes from oligotrophic to highmesotrophic and even low-eutrophic [84]. In ancient mesotrophic Lake Biwa, sample Biwa_40 (September, depth 65 m) was the most different. The wide NMDS biplot dispersion of samples from the same ecosystem indicated significant rearrangements in the structure of viral communities and their hosts in freshwater lakes, depending on environmental conditions. The smallest variations in viral communities were observed in Lough Neagh, all samples from which were closely related [16,18] (Figure 6). Lough Neagh is one of the well-known hypertrophic lakes and, unlike other lakes [66], it is characterized by low levels of Actinobacteria but the stable predominance of Cyanobacteria (in particular Planktothrix) [16,85].
Based on the environmental conditions, Lake Baikal is an extreme habitat; this explains the great variety of unique flora and fauna here. Comparative taxonomic analysis based on the identification of metagenomic reads using the NCBI RefSeq revealed a similar list of virotypes in Lake Baikal and other lakes (Supplementary Table S3). The question arises why the samples from Lake Baikal appeared to be close to other lakes (oligotrophic and mesotrophic). Organisms of Lake Baikal can be divided into two groups. The first group is typical Palaearctic flora and fauna, and the second group is endemics (endemic species, genera, and even families) [19]. Endemic species of Lake Baikal form the bulk of the biomass in the ecosystem [86]; endemic phytoplankton species provide a significant proportion of primary production [87,88]. The NCBI RefSeq database mainly contains viruses that infect widespread organisms of aquatic ecosystems (Palaearctic species). The viral communities of Lake Baikal (like other lakes) are in the initial phase of the study. The genomes of viruses that infect Baikal endemic species are not represented in the NCBI RefSeq database; thus, the DNA of unique Baikal viruses most likely has not yet been identified. In general, the variety and composition of viruses in Lake Baikal, especially in deep-water zones of the lake remains largely unexplored.

Conclusions
Here, we analyzed only several Baikal areas (the pelagic zone, the deep-water bay, and the shallow zone of the lake) and identified a high diversity and significant differences in the composition of their viral communities. Lake Baikal is characterized by various habitat conditions (different basins, bays and sors, a wide range of depths, strong seasonal changes, etc.). Therefore, future studies will require expansion of the range of stations, analysis of the vertical distribution of viral communities, identification of the relationship of viral and other planktonic communities, determination of the functional role of viruses in the ecosystem in more detail, etc.
Studies of freshwater viruses are relevant in light of global climate change, the accumulation of adverse anthropogenic factors for water bodies, and the deterioration of freshwater quality. Unfortunately, Lake Baikal is no exception. Currently, the lake is undergoing serious changes, which is associated with a climatic shift and an increase in anthropogenic pressure. The irreversible phenomena in Lake Baikal began after the construction of large industrial plants [89]; the signs of strong negative changes have been recording here since 2011 [90][91][92]. The highest eutrophication (changes in the structure and vertical zonation of benthic algae; the overgrowth of the lake bottom by filamentous green algae; mass development of species of the genus Spirogyra, which are normally not typical for the lake; the presence of the eutrophic diatom indicator Fragilaria capucina var. vaucheriae; disease and mass mortality of sponges; and other changes) is observed in the littoral zone [87,[90][91][92][93]. The pelagic zone of the lake retains the characteristics of a highly oligotrophic reservoir; however, there is a tendency towards negative change [88]. Water samples for this study were taken in 2014, at the very beginning of observed changes in the ecosystem. Therefore, the viromes obtained in our study can be useful for future studies, especially for assessing the dynamics and long-term temporary changes in viral communities of the lake, and tracking the processes that occur in the coastal and pelagic areas of the lake.  [58], and at https://www.mdpi. com/article/10.3390/microorganisms9040760/s1, Figure S1: A complete scheme of bioinformatic analysis. The stages of the analysis are highlighted with red boxes, the resulting datasets are blue with a dashed stroke, and the databases (DB) used are purple. Figure S2: Dominated viral families in the investigated Baikal viromes. The percentages greater than one percent are shown in the diagrams. Figure S3: The general functional annotation of the Baikal virome datasets using the COG (A) and KEGG pathway (B) databases. Figure S4: The phyla of the hosts predicted for revealed Baikal viruses using the Virus-Host database (A) and the VirHostMatcher-Net software (B). Table S1: The list and taxonomy of virotypes revealed in analyzed freshwater viromes (initial number of reads per virotype). Table S2: The list and taxonomy of virotypes revealed in analyzed freshwater viromes (number of reads per virotype normalized to genome length). Table S3: The list and taxonomy of virotypes revealed in analyzed freshwater viromes (number of reads per virotype normalized to genome length for 95% dominant pool). Table S4: Taxonomic identification of the Baikal viral scaffolds, the similarity of detected ORFs and RefSeq proteins, and the number of virome reads per scaffold. Table S5: Main and secondary KO (KEGG Orthology) functional categories of predicted viral proteins and the number of reads related to these functions in the Baikal samples. Table S6: COG (Clusters of Orthologous Groups) classification of viral proteins revealed in the Baikal viromes (the columns indicate the number of reads per COG functional category for each sample). Table S7: The predicted viral proteins or enzymes identified with the COG database in the Baikal viromes (the columns indicate the number of reads for each protein in the samples). Table S8: The auxiliary metabolic genes (AMGs) revealed among predicted proteins in the Baikal viromes. Table S9: The known hosts for the Baikal virotypes identified using the Virus-Host database. Table S10: The list of the hosts predicted for the revealed Baikal viruses using the VirHostMatcher-Net software. File S1: Scaffolds obtained using the metaSPAdes software. File S2: Viral scaffolds identified with the VirSorter tool. File S3: Viral proteins identified with the VirSorter tool.