RNA-Seq Virus Fraction in Lake Baikal and Treated Wastewaters

In this study, we analyzed the transcriptomes of RNA and DNA viruses from the oligotrophic water of Lake Baikal and the effluent from wastewater treatment plants (WWTPs) discharged into the lake from the towns of Severobaikalsk and Slyudyanka located on the lake shores. Given the uniqueness and importance of Lake Baikal, the issues of biodiversity conservation and the monitoring of potential virological hazards to hydrobionts and humans are important. Wastewater treatment plants discharge treated effluent directly into the lake. In this context, the identification and monitoring of allochthonous microorganisms entering the lake play an important role. Using high-throughput sequencing methods, we found that dsDNA-containing viruses of the class Caudoviricetes were the most abundant in all samples, while Leviviricetes (ssRNA(+) viruses) dominated the treated water samples. RNA viruses of the families Nodaviridae, Tombusviridae, Dicistroviridae, Picobirnaviridae, Botourmiaviridae, Marnaviridae, Solemoviridae, and Endornaviridae were found in the pelagic zone of three lake basins. Complete or nearly complete genomes of RNA viruses belonging to such families as Dicistroviridae, Marnaviridae, Blumeviridae, Virgaviridae, Solspiviridae, Nodaviridae, and Fiersviridae and the unassigned genus Chimpavirus, as well as unclassified picorna-like viruses, were identified. In general, the data of sanitary/microbiological and genetic analyses showed that WWTPs inadequately purify the discharged water, but, at the same time, we did not observe viruses pathogenic to humans in the pelagic zone of the lake.


Introduction
Aquatic microorganisms, such as bacteria, archaea, and viruses, are numerous and diverse components of freshwater and marine ecosystems. They are integral to nutrient cycling and energy turnover, making the study of their diversity, dynamics, and interactions critical to understanding aquatic ecology [1][2][3]. Viruses are considered significant but still poorly studied members of the microbial community.
Viruses are obligate intracellular parasites consisting of RNA or DNA molecules of various forms, with or without a capsid protein; some of them have an additional lipid membrane shell. Viral communities in aquatic ecosystems are represented by viruses of eukaryotes, bacteria, and archaea; viruses of the latter two domains are commonly referred to as phages. In aquatic environments, viruses are present both as free particles and inside host cells.
The first metagenomic approach to study DNA-containing viral communities in Lake Baikal was used in 2019 [4], and it included the study of a 0.2 µm fraction (approximately 30 m offshore at Bolshie Koty, southern basin). Taxonomic and functional diversity was later studied in a fraction below 0.2 µm from a pelagic layer of 0-50 m near the Listvyanka settlement in spring (ice cover) and summer [5]. The viral diversity was described in the epipelagic (photic), mesopelagic (aphotic), and bathypelagic (aphotic) zones in winter and summer. Significant changes in the composition of different viral communities between the epipelagic and bathypelagic zones of Lake Baikal have been reported [6]. The virome of the surface water near the Bolshie Koty settlement was found to be dominated by Siphoviridae and Myoviridae bacteriophages [7]. Later, the authors described and compared previously identified and new viromes from the pelagic zone collected during ice cover and three spring and summer periods. A cluster analysis indicated that viromes from Lake Baikal formed a cluster with viromes from the world's largest (Michigan, Ontario, and Erie) and oldest (Biwa) freshwater lakes, which allowed for the identification of the World's Largest Lakes (WLL) clade [8]. In 2021, a metagenomic analysis of DNA viruses on integral samples from four different deepwater and shallow stations evidenced a clear isolation of viral communities within different ecological zones [9]. The first transcriptome analysis of viruses in Lake Baikal revealed the predominance of transcripts belonging to DNA-containing bacteriophages, as well as representatives of a nucleocytoplasmic large DNA virus (NCLDV), including the closest relative of Pithovirus sibericum. Moreover, we identified a small number of sequences belonging to RNA viruses [10].
RNA viral communities are currently considered to be more diverse and complex than previously known, and their effect on community composition and global carbon flux into aquatic environments appears to be underestimated [11]. The initial evidence suggesting that RNA viruses may have significant ecological impacts was obtained by isolating viruses that infect various phytoplankton species, a basis for aquatic food webs. By now, researchers have identified RNA viruses that infect most of the basic photosynthetic Protist taxa. They include diatoms [12][13][14][15], dinoflagellates [16], Raphidophyceae [17], Prasinophyceae [18], and Thraustochytriaceae [19].
Autochthonous viruses are permanent inhabitants of pure oligotrophic waters, but their composition can change due to the transmission of allochthonous (alien) viruses by migratory birds; influx with river runoff; and anthropogenic pollution, especially from the influx of untreated sewage. One of the most important sources for the spread of viruses that pose a risk to human and animal health is birds. For example, by collecting cloacal swabs in the delta of the Selenga River, an important tributary of Lake Baikal, the authors identified several subtypes of avian influenza, as well as paramyxoviruses. The Baikal basin provides an important habitat for migratory birds from Northeast Asia, as three spring migration routes converge here, attracting millions of birds [20].
Viruses in wastewater have the potential to cause many diseases in humans and animals, for example, gastroenteritis, hepatitis, and some respiratory infections. The monitoring of water-transmitted viruses has a long history [21,22]. As previously reported, outbreaks of human diseases in aquatic environments are mainly caused by noroviruses (46%) and adenoviruses (24%) [23]. Thus, in the Great Lakes region, enteroviruses have been isolated from recreational beach substrates and ground waters for municipal use, indicating a higher risk to human health [24]. Waste and surface waters contaminated by sewage fluxes were found to contain high concentrations of adenoviruses and were not acceptable for recreation in the lower reaches of the Grand River (Michigan) [25]. Wastewater treatment plants are known to remove and inactivate pathogenic and conditionally pathogenic bacteria. Meanwhile, viruses such as noroviruses and rotaviruses are more resistant to processing than fecal bacteria [26].
The study of wastewater samples using transcriptomics methods allowed for the monitoring of intestinal infections [33,34]. Moreover, this method has allowed for the identification of the composition of the viral community as a whole. For example, untreated wastewater samples collected in the United States, Spain, and Ethiopia contained 234 known viruses. The virome of the untreated wastewater was dominated by bacteriophages. Approximately 85% of the sequence reads were from 18 different species of the family Virgaviridae, and a number of representatives of the families Dicistroviridae, Iridoviridae, Nodaviridae, and Parvoviridae that infect insects were also identified. Three wastewater samples contained 17 viruses that can infect humans (Adenoviridae, Astroviridae, Caliciviridae, Papillomaviridae, Parvoviridae, Picobirnaviridae, Picornaviridae, and Polyomaviridae) [33]. A metagenomic analysis of viral communities from untreated wastewater in the USA (San Francisco, CA, USA), Nigeria (Maiduguri), Thailand (Bangkok), and Nepal (Kathmandu) revealed sequences belonging to 29 eukaryotic virus families infecting vertebrates, invertebrates, and humans [34]. In Kampala (Uganda), the same approach was applied to the study of wastewater and a swamp, detecting several human and vertebrate viruses, including Herpesvirales, Iridoviridae, Poxviridae, Circoviridae, Parvoviridae, and Bunyaviridae [35]. The virome from municipal untreated wastewater in San Adrian de Besos (Barcelona) had 41 viral communities, including pathogenic species of the families Caliciviridae, Adenoviridae, Astroviridae, Picornaviridae, Polyomaviridae, Papillomaviridae, and Hepeviridae [36]. F-specific (F+) RNA phages, a group of single-stranded RNA bacteriophages belonging to the family Leviviridae, are widely used as indicators of fecal contamination and/or intestinal viruses in water in several countries [37][38][39][40].
Lake Baikal is the largest oligotrophic water body, an enormous freshwater reserve (23,615 km³) [41], and a habitat for endemic hydrobionts [42]. In 1996, Lake Baikal was listed as a UNESCO World Heritage Site (https://whc.unesco.org/en/list/754/, accessed on 4 April 2023). It was found that the littoral zone of Lake Baikal and its shallow bays, currently exposed to growing human activities, suffer from extremely high fecal contamination from groundwaters [43].
Based on the above, we focused on performing a virus composition analysis on treated wastewater and Baikal water in an attempt to trace the route of viruses entering Lake Baikal with treated sewage water and to gain insights into the potential risks to human and hydrobiont health.

Assessment of Microbial Quality of Effluent Wastewater and Water in Lake Baikal
An analysis of the treated water in all samples showed deviations from Russian sanitation regulations and hygienic standards [44]. In the town of Slyudyanka, the total coliforms (TCs) in the water discharged from the wastewater treatment plants in May exceeded the standards fourfold. In August, all microbial concentrations were above the guide limits: total coliforms by 1600 times, Escherichia coli by 8000 times, enterococci by 800 times, and coliphages by 162 times. In August, the number of total coliforms in the treated water of the town of Severobaikalsk exceeded the standard limits by 1.8 times (Table 1). A sanitary/microbiological examination of the waters processed at the wastewater treatment plants of the towns of Slyudyanka and Severobaikalsk confirmed the ineffective disinfection of municipal wastewater. The sanitary indicator microorganisms at the central sites of the Listvyanka-Tankhoy, Ukhan-Tonkiy, and Elokhin-Davsha hydrological sections were consistent with the hygienic standards for surface waters. The studied bacteria were not detected in most of the samples; they were present in the integral sample from the Ukhan-Tonkiy site (TC, E. coli, and enterococci: 1 CFU/100 cm³), in the surface layer at the Elokhin-Davsha site (TC: 3 CFU/100 cm³), and at a depth of 5 m (TC and E. coli: 1 CFU/100 cm³).

General Information
The number of reads after each processing step and the resulting contigs after assembly with SPAdes are shown in the Supplementary Materials in Table S1. The ORFs determined using the NR database were predominantly represented by bacterial transcripts (87.5-96.9%), and viral ORFs accounted for 0.6-5.1% (Figure 1). The proportion of unidentified sequences reached 25%.
Cyanophage ORFs were found solely in pelagic samples and were represented in the largest amounts by the following taxa: Synechococcus phage S-SRP02 (isolated from a tropical freshwater lake), Synechococcus phage S-CBP2 (isolated from Chesapeake Bay water), and Synechococcus phage S-SRP01 (Singapore Serangoon Reservoir), with the most similar proteins being photosystem II protein D2 (QPB08139; 94% amino acid (aa) identity), endonuclease (YP_009103176; 79.7% aa identity), and primase/helicase (YP_009103177; 79.4% aa identity). At the same time, hypothetical proteins comprised 45% of identifications with a 30% to 60.3% aa identity. All cyanophage ORFs belonged to the families Autographiviridae and Kyanoviridae, as well as to unclassified Caudoviricetes.

IMG/VR Database
The ORFs were determined based on the contigs identified using VirSorter 2 and GeneMarkS. A comparison of the amino acid sequences against IMG/VR revealed the dominance of the dsDNA phages belonging to the class Caudoviricetes in all samples, with the Autographiviridae family being the second in number in RVP4 (Figure 2). Most Caudoviricetes sequences found in the IMG/VR database were classified to the class level only. Their number appeared to be higher than that in NR with similar Diamond settings; therefore, the choice of database was critical for the taxonomical identification. We used the contigs identified using VirSorter 2 to avoid false positives. Small amounts of Crassvirales representatives were found in the samples RVP4, RVP6, and SRVP22_05. CrAss-like phages are a diverse group of viruses that include some of the most abundant viruses of the human gut [45]. The presence of these phages in the pelagic samples was probably due to untreated sewage from vessels during the navigation season, a period of intense recreational activity. According to the State Small Vessels Inspectorate, approximately 300 heavy-lift vessels (http://geol.irk.ru/baikal/activ/mactiv2017, accessed on 4 April 2023) and 5.5 thousand small vessels are now in operation on Baikal, some of which discharge raw or untreated sewage into the lake.
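The annotation step described above, in which each ORF is assigned its closest database match, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: it assumes the aligner output is in the standard 12-column tabular layout (BLAST/DIAMOND output format 6), the function name `best_hits` and the e-value cutoff of 1e-5 are hypothetical, and all ORF and subject identifiers in the toy input are invented.

```python
import csv
import io

# Standard 12-column BLAST/DIAMOND tabular layout (output format 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-5):
    """Return {query ORF: best hit}, keeping the highest bitscore per query.

    max_evalue is an assumed cutoff chosen for illustration only.
    """
    best = {}
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue  # skip empty lines
        row = dict(zip(COLUMNS, fields))
        for key in ("pident", "evalue", "bitscore"):
            row[key] = float(row[key])
        if row["evalue"] > max_evalue:
            continue  # hit too weak to be used for annotation
        current = best.get(row["qseqid"])
        if current is None or row["bitscore"] > current["bitscore"]:
            best[row["qseqid"]] = row
    return best


# Toy input with invented identifiers; not data from this study.
example = io.StringIO(
    "orf_1\tsubject_A\t94.0\t300\t18\t0\t1\t300\t1\t300\t1e-150\t550\n"
    "orf_1\tsubject_B\t79.7\t290\t59\t0\t1\t290\t5\t294\t1e-90\t400\n"
    "orf_2\tsubject_C\t35.0\t120\t78\t2\t1\t120\t10\t129\t0.5\t30\n"
)
hits = best_hits(example)
for orf, row in hits.items():
    print(orf, row["sseqid"], row["pident"])
```

Because the outcome depends on which reference sequences are available, running the same best-hit selection against NR and against IMG/VR can yield different taxonomic assignments for the same ORF, which is the database effect noted in the text.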
A comparison of the results obtained by searching IMG/VR showed that RVP4, RVP5, and RVP6 shared 106 similar sequences (sequences from the same Uncultivated Viral Genome (UVIG)); SevRVP22_08, SluRVP22_08, and SRVP22_05 shared only 2; SluRVP22_08 and SevRVP22_08 shared 142; and SRVP22_05 and SluRVP22_08 shared 13 (Figure S2). As for the DNA virus composition, SevRVP22_08 and SluRVP22_08 shared 88 and 89 amino acid sequences, respectively, with sequences from pelagic samples, whereas SRVP22_05 had the lowest number of similar sequences: 4 with RVP4, 6 with RVP6, and 1 with RVP5. Judging by the DNA virus compositions, SRVP22_05 appeared to be the most divergent, whereas SevRVP22_08 and SluRVP22_08 showed the highest similarity. Most of the viral sequences of SRVP22_05 had hosts in the phylum Proteobacteria, and RVP4, RVP5, RVP6, SevRVP22_08, and SluRVP22_08 were dominated by phage sequences that infect Actinobacteria.
Negative-strand (ssRNA(−)) viruses assigned to the family Qinviridae, which infect invertebrates, were found in the lowest amount. The amino acid sequences matched the RdRp of Fitzroy Crossing qinvirus 1 (QLJ83493) and a hypothetical protein of Wuhan insect virus 15 (YP_009342457); these viruses were detected only in SRVP22_05.
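The counts of sequences shared between samples in the IMG/VR comparison can be thought of as set intersections over the reference identifiers hit in each sample. The sketch below is a minimal illustration of that idea; the helper names `shared_counts` and `shared_by_all` and the identifiers in the toy data are hypothetical and do not reproduce the values reported here.

```python
from itertools import combinations


def shared_counts(hits_by_sample):
    """Count reference identifiers shared by every pair of samples."""
    return {
        (a, b): len(hits_by_sample[a] & hits_by_sample[b])
        for a, b in combinations(sorted(hits_by_sample), 2)
    }


def shared_by_all(hits_by_sample, samples):
    """Return the identifiers found in all of the listed samples."""
    return set.intersection(*(hits_by_sample[s] for s in samples))


# Toy data: invented reference identifiers per sample.
toy = {
    "RVP4": {"uvig_1", "uvig_2", "uvig_3"},
    "RVP5": {"uvig_2", "uvig_3", "uvig_4"},
    "RVP6": {"uvig_3", "uvig_5"},
}
pairs = shared_counts(toy)
core = shared_by_all(toy, ["RVP4", "RVP5", "RVP6"])
print(pairs)
print(core)
```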
The dsRNA viruses represented in the datasets included five families: Picobirnaviridae (animals and humans), Totiviridae (fungi and protozoa), Partitiviridae (plants, fungi, and protozoa), Birnaviridae (salmonid fish, birds, and insects), and Cystoviridae (bacteria), which were mainly present in the samples of SluRVP22_08 and SRVP22_05. In SluRVP22_08, Picobirnaviridae dominated (32% of all RNA viruses identified to the family level).
In addition, unclassified viruses, i.e., viruses identified only to the "virus" level or classified as Riboviria, comprised 18.5% to 100% of the samples.
In 2020, the family Leviviridae was renamed Fiersviridae [47], but some sequences in the NR database remained Leviviridae. Therefore, we used the sequence names from the NR database.

IMG/VR Database
The annotation of ORFs with IMG/VR yielded 861 sequences belonging to 23 RNA virus families, as well as 422 unclassified (to the family level) amino acid sequences. Sequences belonging to the family Fiersviridae were found predominantly in the treated wastewater samples SluRVP22_08 and SRVP22_05, but none was found in the pelagic samples (Figure 3).
The families Nodaviridae, Tombusviridae, Dicistroviridae, Picobirnaviridae, Botourmiaviridae, Solemoviridae, and Endornaviridae, and unclassified (to the family level) Riboviria, Orthornavirae, Kitrinoviricota, Durnavirales, Picornavirales, Ghabrivirales, and Nodamuvirales were detected in the RVP4 sample. The sequences in the RVP5 sample were represented by the families Picobirnaviridae and Nodaviridae and unclassified Riboviria, Orthornavirae, Durnavirales, Picornavirales, and Ghabrivirales. RVP6 had the smallest number of ORFs associated with RNA viruses, with only three sequences represented by unclassified Durnavirales and Marnaviridae.
The families Virgaviridae, Steitzviridae, Duinviridae, Solspiviridae, Partitiviridae, and Hepeviridae were identified only in SluRVP22_08 and SRVP22_05. The largest number of ORFs corresponding to representatives of the order Picornavirales was found in the sample SevRVP22_08. Sequences belonging to the order Durnavirales (dsRNA infecting eukaryotes) were detected only in the pelagic samples.

RNA-Dependent RNA Polymerase Analysis
In the pelagic samples, 63.9% of the sequences corresponded to ssRNA(+) viruses, and 36.1% of the identified RNA viruses corresponded to dsRNA viruses. In the treated wastewater samples, 97.5% of the ORFs were assigned to ssRNA(+) viruses and 2.5% to dsRNA viruses. It should be noted that the ssRNA(−) ORFs (family Qinviridae) detected according to the NR database were not identified with the IMG/VR database because ORF-containing contigs were not identified using VirSorter 2.
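The genome-type shares reported above amount to mapping each family-level assignment to its genome type and computing percentages. A minimal sketch is given below; the lookup table `GENOME_TYPE` covers only a few of the families named in this paper, the function name `genome_type_shares` is hypothetical, and the toy input is invented rather than taken from the study.

```python
from collections import Counter

# Partial lookup of genome type by family; only a few families are listed.
GENOME_TYPE = {
    "Fiersviridae": "ssRNA(+)",
    "Nodaviridae": "ssRNA(+)",
    "Dicistroviridae": "ssRNA(+)",
    "Tombusviridae": "ssRNA(+)",
    "Picobirnaviridae": "dsRNA",
    "Totiviridae": "dsRNA",
    "Partitiviridae": "dsRNA",
    "Cystoviridae": "dsRNA",
    "Qinviridae": "ssRNA(-)",
}


def genome_type_shares(families):
    """Return the percentage of ORFs per genome type, rounded to 0.1%."""
    counts = Counter(GENOME_TYPE.get(f, "unassigned") for f in families)
    total = sum(counts.values())
    return {gtype: round(100.0 * n / total, 1) for gtype, n in counts.items()}


# Toy input: one family label per annotated ORF (invented counts).
toy_orfs = ["Nodaviridae"] * 5 + ["Picobirnaviridae"] * 3
shares = genome_type_shares(toy_orfs)
print(shares)
```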
Despite the low number of RNA viral sequences from the pelagic zone of Lake Baikal, this is the first report of the detected RNA viruses in the lake water.
The number of unique sequences in SRVP22_05 was 325, and the largest number of common taxonomic units was recorded between SRVP22_05 and SluRVP22_08 (45), i.e., in the treated wastewater samples collected in the town of Slyudyanka in May and August. Leviviricetes comprised 53% of the 45 common taxa.
In addition, the identified 672 RdRp sequences were annotated using the RefSeq database. A total of 554 RdRp were identified, with the majority of sequences (313) belonging to the family Fiersviridae and derived primarily from the samples SRVP22_05 (88.9%), SluRVP22_08 (10.8%), and SevRVP22_08 (0.3%). In addition to bacteriophages (Leviviricetes), we also registered representatives of Pisuviricota, namely, Picobirnavirus dog/KNA/2015 (53.5-75.4% aa identity), Drosophila C virus (71.8-85% aa identity), and Human picobirnavirus (56-75.3% aa identity), and representatives of Kitrinoviricota, such as Tobacco mosaic virus (98-100% aa identity), Pepper mild mottle virus (PMMoV) (98-99% aa identity), Cucumber green mottle mosaic virus (97.9-100% aa identity), and Tomato brown rugose fruit virus (99.9-100% aa identity). PMMoV is a plant virus found in human feces, treated and untreated wastewaters, and aquatic environments contaminated by human feces [49]. Over the past decade, PMMoV has been proposed as a potential viral indicator of fecal pollution in marine and riverine waters [50,51].
In general, a set of intact RdRp sequences from the RVMT database showed more matches, as the sequences from the database (77,510 RdRp) outnumbered those from the RefSeq database (5767 RdRp).
In this work, a taxonomic tree based on the RdRp protein of the most representative RNA bacteriophage group (Leviviricetes) showed that the sequences obtained were distributed throughout the tree, demonstrating wide diversity (Figure 5). It should be noted that we found no geographical confinement of the samples. For instance, the sequences from the NR database included representatives sampled elsewhere, namely, in Japan, China, the USA, and Austria, which may attest to the cosmopolitan distribution of these viruses and to the different sources in which they persist, such as rice fields, soil, pond sediments, and activated sludge. Only 135 of the 375 RdRp sequences obtained in this study, annotated as Leviviricetes in the RVMT database, were selected for tree construction because they overlapped the conserved region, which varied in length from 200 to 268 aa.
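The selection of RdRp sequences that overlap a conserved region before tree construction can be illustrated with a simple column filter on a multiple sequence alignment. This is a sketch under assumptions: the column window, the residue threshold, and the helper names `residues_in_region` and `select_for_tree` are hypothetical, and the toy alignment is invented and far shorter than the 200 to 268 aa region mentioned in the text.

```python
def residues_in_region(aligned_seq, start, end):
    """Count non-gap characters of an aligned sequence in columns [start, end)."""
    return sum(1 for ch in aligned_seq[start:end] if ch not in "-.")


def select_for_tree(alignment, start, end, min_residues):
    """Keep sequences with at least min_residues residues inside the region."""
    return {
        name: seq
        for name, seq in alignment.items()
        if residues_in_region(seq, start, end) >= min_residues
    }


# Toy alignment (invented); the conserved window is columns 6 to 10.
toy_alignment = {
    "contig_a": "MKT--GDDLLVAAG",
    "contig_b": "------DDLLV---",
    "contig_c": "MKTAAGD-------",
}
kept = select_for_tree(toy_alignment, start=6, end=11, min_residues=5)
print(sorted(kept))
```

Fragments such as `contig_c`, which barely reach the window, are dropped so that every sequence in the tree is compared over the same homologous positions.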

Search for the Complete Genomes of Viruses
VirSorter 2 was used to identify 3049 contigs in all samples and 1174 contigs in the SRVP22_05 sample. Twenty-six contigs were determined to be the most complete compared with their closest relatives using the blastn program, taking into account similarity, coverage, and aligned region length. Information on the closest relatives, coverage, and similarity can be found in Table S3.
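The screening of contigs for completeness against their closest relatives, based on similarity, coverage, and aligned length, could be expressed as a filter like the one below. The thresholds (70% identity, 90% coverage), the class `BlastnHit`, and the function `near_complete` are assumptions made for illustration; the source does not state the cutoffs used, and coverage is approximated here as aligned length divided by reference length.

```python
from dataclasses import dataclass


@dataclass
class BlastnHit:
    contig: str
    reference: str
    identity: float        # percent nucleotide identity
    aligned_length: int    # alignment length in nt
    reference_length: int  # length of the closest relative in nt

    @property
    def reference_coverage(self):
        """Approximate percent of the reference covered by the alignment."""
        return 100.0 * self.aligned_length / self.reference_length


def near_complete(hits, min_identity=70.0, min_coverage=90.0):
    """Keep contigs whose best hit passes both (assumed) thresholds."""
    return [
        h for h in hits
        if h.identity >= min_identity and h.reference_coverage >= min_coverage
    ]


# Toy hits with invented names and values.
toy_hits = [
    BlastnHit("contig_1", "ref_X", 99.0, 6300, 6400),
    BlastnHit("contig_2", "ref_Y", 85.0, 1200, 9000),
    BlastnHit("contig_3", "ref_Z", 60.0, 4000, 4100),
]
selected = near_complete(toy_hits)
print([h.contig for h in selected])
```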
In view of the data on the closest relative, these sequences referred to such RNA virus families as Dicistroviridae, Marnaviridae, Blumeviridae, Virgaviridae, Solspiviridae, Nodaviridae, and Fiersviridae; an unassigned genus Chimpavirus; and unclassified picorna-like viruses. All whole or nearly whole genomes belonged to the realm Riboviria. The sequence length varied from 3035 nucleotides (nt) to 9467 nt. We also found a Boolarra virus segment RNA1 (AF329080), with a nucleotide identity of 99%. The contigs shown in Figure 6 presumably represent the complete genomes of the identified viruses. The highest similarities were found with Tomato brown rugose fruit virus isolate Tom1-Jo (KT383474; 99.9% nt identity), Tomato mosaic virus isolate Queensland (AF332868; 99.5% nt identity), Cucumber green mottle mosaic virus (D12505; 98.9% nt identity), and Pepper mild mottle virus (M81413; 98.2% nt identity). The trees based on deduced amino acid RdRp sequences from 26 genomes and GenBank sequences are given in Figure S1, generally showing high node support values. A treated wastewater sample from the town of Severobaikalsk (SevRVP22_08, August) contained Astrovirus VA4 (YP_006905856) with an aa identity of 26.3%.
No human-associated virus sequences were identified in the samples from the pelagic zone of Lake Baikal, based on the Virus-Host DB database.

Animal Viruses
ORFs with a similarity to the proteins of viruses known to infect animals, identified from the NR database, generally had low resemblance to the known ones (under 50% at the aa level). This probably indicates the lack of analogous sequences in the databases. Nonetheless, the highly similar amino acid sequences detected in this study give a first insight into their presence.

Cluster Analysis of Transcriptomes
Clustering performed with the Metafast program showed that SRVP22_05 (Slyudyanka, May) was the most distant from the samples studied (Figure 7), consistent with the taxonomic analysis shown previously. The difference in the number of reads in the samples does not affect the analysis because the program provides the normalization of the data based on the total number of k-mers. The samples RVP5 and RVP6 from the pelagic zone of Lake Baikal proved to be the closest. The sample of treated wastewater from Severobaikalsk (SevRVP22_08) was part of a cluster with the pelagic samples. It is likely that clustering is caused by the fact that water for the Severobaikalsk central water supply comes from the nearby Tyya River, and, hence, municipal wastewater, as part of river water, contains allochthonous viruses similar in composition to those of the lake water. In addition, the treated wastewater from Severobaikalsk had a lower concentration of fecal bacteria; i.e., it was purified more effectively. In contrast, the water supplied to Slyudyanka is from underground water sources.
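The point that sequencing depth does not affect the comparison once k-mer counts are normalized by their total can be illustrated with a small sketch. This is not Metafast's actual algorithm; it only assumes per-sample k-mer frequency profiles compared with the Bray-Curtis dissimilarity, and the function names, the value of k, and the toy reads are all hypothetical.

```python
from collections import Counter


def kmer_profile(reads, k):
    """Return k-mer frequencies normalized by the total number of k-mers."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two frequency profiles (0 to 1)."""
    keys = set(p) | set(q)
    num = sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)
    den = sum(p.get(x, 0.0) + q.get(x, 0.0) for x in keys)
    return num / den if den else 0.0


# Toy reads: sample_b has the same composition as sample_a at 10x depth.
sample_a = kmer_profile(["ACGTACGT"], k=3)
sample_b = kmer_profile(["ACGTACGT"] * 10, k=3)
sample_c = kmer_profile(["TTTTTTTT"], k=3)

d_ab = bray_curtis(sample_a, sample_b)
d_ac = bray_curtis(sample_a, sample_c)
print(d_ab, d_ac)
```

In this toy case the two samples with identical composition but different depth are at zero distance, while the sample with no shared k-mers is at the maximum distance; a distance matrix of this kind can then be passed to any hierarchical clustering routine.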
Clustering was verified using reads mapped with Bowtie 2 (human DNA removal) and SortMeRNA (ribosomal gene removal). The clustering characteristics remained unchanged.
The only sample grouped with other wastewater samples was SRVP22_05, which formed a cluster with the untreated wastewater collected in Southern California (the USA). This is probably because of the presence of RNA-containing viruses similar to those observed in these samples; for example, in a previous study, as well as in ours, tomato brown rugose fruit virus (up to 66%), pepper mild mottle virus (10.6%), cucumber green mottle mosaic virus (10.4%), tomato mosaic virus (4.8%), and others were detected [31].
The ORFs similar to those of the ATP-binding protein HI0361 (RVP4: 219, RVP5: 338, RVP6: 327, SevRVP22_08: 269, SluRVP22_08: 259, and SRVP22_05: 145) were compared to the amino acid sequences from the NCBI NR database and showed similarities with ABC transporters. The same sequences were also compared against the viral sequences in GenBank to find the most similar viral proteins.
Viral ORFs annotated using the VOG database include probable membrane antigen 75, the major capsid protein, the tail fiber assembly protein, the minor capsid protein P30, RNA-directed RNA polymerase, the coat protein, and packaging_enzyme_P4.

KEGG Database
The KEGG pathway classification of the transcriptomes revealed that the following metabolic categories are the most abundant: "Carbohydrate metabolism", "Protein families: genetic information processing", "Protein families: signaling and cellular processes", and "Translation" (Figure 8).
In the category "Carbohydrate metabolism", the pelagic samples were dominated by the following:
• Glutamine synthetase (an essential enzyme in cellular nitrogen metabolism);
• Acetyl-CoA C-acetyltransferase (an enzyme that catalyzes the final step of fatty acid oxidation).
The transcripts in the effluent samples were dominated by the following:
• The 2-oxoglutarate dehydrogenase E1 component (involved in the tricarboxylic acid cycle);
• Acetolactate synthase I/II/III large subunit (a protein found in plants and microorganisms that catalyzes the first step in the synthesis of branched-chain amino acids).
In the category "Protein families: genetic information processing", the following were predominant:
• RVP4: chromosome partitioning protein (required for efficient plasmid and chromosome partitioning in many bacterial species);
• RVP5, SevRVP22_08: DNA gyrase subunit A (belongs to the group of topoisomerases);
• RVP6: DNA segregation ATPase FtsK/SpoIIIE (mediates proper chromosome segregation in dividing bacteria);
• SRVP22_05: elongation factor G (prokaryotic elongation factor involved in protein translation);
• SluRVP22_08: DNA gyrase subunit B.
The most abundant proteins were also determined for the category "Protein families: signaling and cellular processes".
In SRVP22_05, "Cell motility", including RNA polymerase primary sigma factor, flagellar hook-associated protein 2, and chemotaxis protein MotB, and "Neurodegenerative disease", with the dominant dynein axonemal heavy chain, were the most divergent categories.
In general, a functional analysis unveils the active metabolic patterns in bacterial life and their crucial role in the biological decomposition of organic substances.

Discussion
As expected, the number of bacterial transcripts was high (up to 96.9%). Despite the methodological complexity of separating virus particles or viral nucleic acid from bacterial nucleic acid, we were able to identify the viral sequences of both DNA and RNA viruses in the transcriptomes studied. Different approaches to virus identification were applied using the NCBI NR and IMG/VR databases. RNA viruses were identified according to the RVMT and RefSeq databases, comparing ORFs and RdRp. In our opinion, this method enhances the chances of achieving a more precise identification of viral sequences. At present, the databases are rapidly expanding, and their selection is of great importance for data interpretation, since the taxon is identified by performing a comparison with the already known sequences.
Here, we compared viral communities from effluent wastewaters discharged into Lake Baikal and viromes from its pelagic zone and searched for complete genomes. The samples collected were dominated by DNA viruses belonging to the class Caudoviricetes (dsDNA bacteriophages), mainly to the families Siphoviridae, Autographiviridae, and Myoviridae. Caudoviricetes are the most abundant and widespread viruses in natural ecosystems [53,54].
Untreated discharge in different localities (Pennsylvania, Barcelona, and Addis Ababa) was also dominated by DNA bacteriophages belonging to the families Microviridae, Siphoviridae, Myoviridae, Podoviridae, and Inoviridae. Approximately 85% of the sequence reads classified as identified viruses belonged to RNA viruses, namely, 18 different species of the family Virgaviridae [33]. Leviviricetes (ssRNA(+)) bacteriophages, along with Caudoviricetes, dominated the effluent flow running from the River Conwy catchment area (North Wales, UK) [28].
The effluent wastewater samples from the town of Slyudyanka located on the shore of Lake Baikal (SRVP22_05 and SluRVP22_08) contained RNA viruses (ssRNA(+)) similar to plant viruses (Tobamovirus), especially those infecting vegetables: tomatoes, pepper, and cucumbers. The obtained data are consistent with those in reports on the diversity and largescale distribution range of plant viruses in wastewater, the possible sources of which reside in agricultural discharge or human feces [31,33].
Here, we report an extremely diverse composition of ssRNA(+) viruses. Negative-strand viruses (ssRNA(−)) were present in negligible amounts, and dsRNA viruses accounted for 2.5% and 36.1% in treated wastewater and pelagic zone water, respectively. The samples SRVP22_05 and SluRVP22_08 were dominated by sequences belonging to the class Leviviricetes, apparently originating from a microbial community of activated sludge, since their closest relatives registered in GenBank are its inhabitants. Equal proportions of ORFs detected in the sample SevRVP22_08 belonged to Pisuviricota, Kitrinoviricota, and, to a lesser extent, Lenarviricota.
For the pelagic samples, the search in the IMG/VR database revealed Nodaviridae, Tombusviridae, Dicistroviridae, Picobirnaviridae, Botourmiaviridae, Marnaviridae, Solemoviridae, and Endornaviridae. Unfortunately, identifying the host is generally difficult because the databases lack sequences of cultivated viruses. For instance, representatives of Nodaviridae are known to infect insects and fish in freshwater ecosystems [55] and cause the white tail disease of freshwater prawn [56]. The families Dicistroviridae, Picobirnaviridae, Marnaviridae, and Nodaviridae were encountered in freshwater Lake Tai (China) during an outbreak of a Microcystis spp. bloom, and RNA viruses comprised 42.5% of the total number of virus transcripts [57]. In Antarctica, Lake Limnopolar was dominated by the order Caudovirales (dsDNA viruses) and ssRNA(+) viruses of the family Dicistroviridae and the genus Bacillarnavirus (the order Picornavirales), and Secoviridae, Marnaviridae, Iflaviridae, Potyviridae, and Tombusviridae were found in smaller amounts [58]. Representatives of Marnaviridae are acknowledged pathogens of marine diatoms [14]. Until recently, our knowledge of RNA viruses in freshwater environments, especially in ancient waters, was very limited.
A taxonomic analysis showed that the viromes in the effluent wastewater samples SRVP22_05 and SluRVP22_08 were significantly different from other viromes based on the content of RNA viruses. In contrast, Baikal viromes (RVP4, RVP5, and RVP6) and treated wastewater collected in August in the town of Severobaikalsk (SevRVP22_08) and Slyudyanka (SluRVP22_08) demonstrated a close similarity based on the composition of DNA viruses.
Of the 3049 contigs identified using VirSorter 2, only 26 were putative complete genomes (including a segment similar to the Boolarra virus segment RNA1) whose closest relatives were RNA viruses; they included terminal untranslated regions (UTRs) and genes encoding RdRp, capsid proteins, and maturation proteins. The maximum similarity was observed with the viruses belonging to the families Dicistroviridae, Marnaviridae, Blumeviridae, Virgaviridae, Solspiviridae, Nodaviridae, and Fiersviridae; the unassigned genus Chimpavirus; and unclassified picorna-like viruses. All genomes detected belonged to ssRNA(+) viruses 3 kb to 9.5 kb long infecting bacteria, plants, invertebrates, and, presumably, protozoans. One genome was not classified to the taxon level.
To compare transcriptomes as a whole, we used the Metafast program to cluster them. The samples from the pelagic zone of the lake were positioned separately on the tree. A previous UPGMA analysis of transcriptomes from Lake Baikal, based on a comparison of taxa from various natural sources (lakes, sea, and bays), supported the same isolated position of the Baikal samples in the dendrogram [10]. Given the unique origin of the lake, the composition of its hydrobionts, and its hydrochemical and hydrophysical characteristics, the pronounced genetic divergence of most Baikal viruses from the representatives of other aquatic ecosystems seems highly likely. The most distinct SRVP22_05 sample shared a clade with influent wastewater samples from Southern California (SB, OC, SJ, JWPCP, PL, NC, and HTP) [31].
In the effluent wastewater, we found several human-infecting viruses, among them enteric viruses (Aichi virus 1, Human astrovirus, and Astrovirus VA4) and viruses infecting the skin and mucosal epithelium (Human papillomavirus 4). Additionally, we detected the following human-associated viruses, whose ability to cause disease remains poorly studied: Human picobirnavirus, Husavirus sp., Human fecal virus Jorvi3, Salivirus FHB, and Circular ssDNA virus sp.
Aichi virus 1 (ssRNA(+), the genus Kobuvirus, the family Picornaviridae) is a human gastroenteritis agent transmitted via the fecal-oral route with contaminated food or water. This worldwide-spread virus is detected in various media: wastewater, river water, groundwater, and mollusks. The virus is found with a higher frequency and in larger numbers than any of the human enteroviruses. Aichi virus 1 may serve as an appropriate gastroenterovirus indicator [34,59,60].
Human astrovirus and Astrovirus VA4 (ssRNA(+), Astroviridae). The fact that astroviruses are pathogenic to humans was widely acknowledged by researchers in Thailand in 1991 [61]. Abundant evidence proved that astroviruses were a major cause of severe acute gastroenteritis in children, elderly people, and immunocompromised individuals (persons with a weakened immune system) [62][63][64].
Human picobirnavirus (dsRNA, Picobirnaviridae) was extracted from the stool of an infant with acute non-bacterial gastroenteritis in Thailand [65]. Picobirnaviruses have been found in different animal species, including invertebrates, and environmental samples. Since picobirnaviruses are ubiquitous in the feces/intestinal contents of humans and other animals with or without diarrhea, they are considered opportunistic pathogens of mammals and avian species, but the actual host remains unknown so far [66].
Husavirus sp. (ssRNA(+), Picornavirales) has been found in human feces and identified globally in different samples. Knowledge of the epidemiological and molecular features is hitherto fragmentary. Despite this limited understanding, all Husavirus sequences described have been detected in the stools of humans with different clinical manifestations: patients with a HIV-1-positive status, trachoma, acute diarrhea, and clinical silence. It is not yet known whether these viruses directly affect humans or other organisms in the human body; for instance, the hypothesis that helminths are natural Husavirus hosts was not supported [67,68].
Salivirus FHB (ssRNA(+), Picornaviridae) was present in the feces of 3.5% of ill children with diarrhea and 2.8% of clinically silent control patients. Saliviruses were observed from June to September during the warmest days and never on cooler days. No direct relationship between saliviruses and gastroenteritis has been reported. All known viruses may be found in the feces of children with or without gastroenteritis; however, the authors admit that gastroenteritis is caused by saliviruses at high viral loads [69]. Saliviruses were found in wastewater [34].
Human fecal virus Jorvi3 (ssDNA, the family Circoviridae) has been detected in human feces, yet there is no evidence that the presence of the virus is associated with human illness. The authors suggest that infection with the virus may be beneficial to the host by preventing the development of autoimmune diabetes, but these findings need to be confirmed because of the small sample sizes [70].
Circular ssDNA virus sp. has been isolated from pericardial fluid, but the authors found no association between this virus and pericarditis. It is suggested that the abovementioned viruses may replicate in human cells, presumably as opportunistic pathogens [72], although some circular replication-encoding single-stranded DNA viruses (CRESS-DNA) are animal pathogens and high amounts of their representatives have been detected in samples of sick humans [72].
As in a previous study of wastewater [28], coronavirus sequences were not detected in our dataset. At the same time, the use of respiratory-virus-enriched library preparation and sequencing has allowed for improvements in the detection of influenza A and coronaviruses [31]. Coronaviruses are able to survive in wastewater for hours or days while remaining contagious. The survival of coronaviruses in wastewater depends on many factors, such as temperature, pH, and treatment procedures. Wastewater treatment processes may inactivate or remove viruses, but viral RNA may still be detected in treated wastewater after a long time [73].
Human pathogenic viruses have not been identified at the central stations at Lake Baikal. Apparently, the concentration of fecal viruses decreases significantly with distance from the WWTP discharge point due to the dilution of the treated effluent flow by the lake water. As previously noted, viruses may spread over a large area from the discharge point [74]. However, when some of the virus particles reach a water body via wastewater, they degrade, disperse, and accumulate in hydrobionts, for instance, in mollusks [28]. Further observations of virome composition along a transect of sites at different distances from the effluent discharge points would allow for an evaluation of the variations in virus composition and distribution throughout the lake.
The microbiological examination of water on special seed culture media and a transcriptome analysis revealed the inadequate operation of WWTPs in the towns located on the shore of Lake Baikal. As previously reported, the concentrations of microorganisms serving as sanitary indicators were high in the coastal zone of the studied localities (the towns of Slyudyanka and Severobaikalsk) [75,76]. The sanitary indicators, including coliphages, registered at the pelagic sites of the lake complied with SanPiN requirements (the standards and sanitary rules of Russia), according to which the water from the surface sources should be free of enteroviruses.
In summary, despite the inadequate treatment of wastewater discharged into Lake Baikal from the municipal WWTPs located on the lake shores, the present microbiological and virological parameters of the water in the deep pelagic zone meet the regulatory requirements and pose no threat to human health.
Promoting recreation in the various bays of Lake Baikal demands further analysis that implies the use of complex methods, including microbiology, metagenomics, and metatranscriptomics approaches.

Sample Collection
Water samples of 100 L were collected from three pelagic sites of Lake Baikal in July and August 2022 at depths from 1 to 15 m (integral, 25 L from 1, 5, 10, 15 m depths). The sampling sites were located in three lake basins: RVP4-central station "Listvyanka settlement-Tankhoy settlement" (51 •

Sample Preparation
To obtain the virome, the samples were sequentially filtered through 0.4 and 0.2 µm polycarbonate filters (Sartorius, Göttingen, Germany; Reatrack-Filter, Obninsk, Russia) to remove detritus and zoo-, phyto-, and bacterio-plankton. The filtrates from each sample were concentrated with a tangential flow filtration VivaFlow 200 (Sartorius, Göttingen, Germany) to a volume of 100 mL, and then centrifuge concentrators (50 kDa) were used to further concentrate them to 1 mL, at 4 °C and 3000 rpm using VivaSpin Turbo 15 (Sartorius, Göttingen, Germany). The concentrate was frozen in liquid nitrogen and stored at −70 °C until further analysis.
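The two concentration steps can be checked with simple arithmetic. The following minimal sketch assumes a starting volume of 100 L (the pelagic sample volume given under Sample Collection) and ideal recovery with no loss of particles; the function name is hypothetical and used only for illustration.

```python
# Volumes taken from the protocol text, expressed in millilitres.
# Assumption: 100 L starting volume and no loss of virus particles.
initial_ml = 100 * 1000   # 100 L of filtered water
after_tff_ml = 100        # after tangential flow filtration
final_ml = 1              # after the 50 kDa centrifugal concentrators


def fold_concentration(volume_in_ml: float, volume_out_ml: float) -> float:
    """Return how many times the sample volume was reduced."""
    return volume_in_ml / volume_out_ml


tff_fold = fold_concentration(initial_ml, after_tff_ml)   # first step
spin_fold = fold_concentration(after_tff_ml, final_ml)    # second step
total_fold = tff_fold * spin_fold                         # overall reduction

print(f"TFF: {tff_fold:.0f}x, spin: {spin_fold:.0f}x, total: {total_fold:.0f}x")
# prints "TFF: 1000x, spin: 100x, total: 100000x"
```

Under these assumptions the nominal volume reduction is 100,000-fold; actual virus recovery would be lower.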
Total RNA was isolated using ExtractRNA (Evrogen, Moscow, Russia) according to the manufacturer's protocol. To prepare the RNA-seq library according to the MGIEasy RNAseq Library Prep Set protocol (MGI Tech, Shenzhen, China), 100-200 ng of isolated RNA was used. The following steps were performed: RNA fragmentation, reverse transcription, second-strand synthesis, the polishing of dsDNA fragment ends, and adapter ligation (containing 10-nucleotide single-end indexes). Sequencing was run on the DNBSEQ-400 platform (MGI Tech, Shenzhen, China) with paired-end reads (2 × 150 bp).
VirSorter 2 was used to identify the contigs belonging to RNA and DNA viruses [81]; only contigs above 500 nt were taken for analysis, as was performed in [82].
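The length cut-off described above can be sketched as follows. This is a minimal illustration written with the Python standard library, not the pipeline used in the study; the function names and the example contigs are hypothetical, and "above 500 nt" is interpreted as strictly longer than 500 nt.

```python
from typing import Dict, Iterable, Iterator, Tuple

MIN_CONTIG_LENGTH = 500  # nt; contigs must be longer than this (assumed strict)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(lines: Iterable[str],
                   min_length: int = MIN_CONTIG_LENGTH) -> Dict[str, str]:
    """Keep only contigs strictly longer than min_length."""
    return {h: s for h, s in read_fasta(lines) if len(s) > min_length}


# Hypothetical input: contig_1 is 600 nt (two lines), contig_2 is exactly 500 nt.
example = [">contig_1", "A" * 300, "C" * 300, ">contig_2", "G" * 500]
kept = filter_contigs(example)
print(sorted(kept))  # ['contig_1']
```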
The open reading frames were identified using GeneMarkS (v. 3.36) [83]. A local installation of blastn 2.12.0+ (e-value 10⁻⁵) and the databases NCBI NT (release 250) and RefSeq (release v. 214) were used for the taxonomic annotation of the contigs. Diamond (v. 2.0.15) was used to annotate the ORFs [84], with the parameters --more-sensitive, --min-score 50, and --evalue 10⁻⁵, using the database NCBI NR (release 250). The amino acid sequences predicted by GeneMarkS from the contigs assembled in SPAdes were taken for analysis. In addition, the taxonomic assignment was based on IMG/VR v.4 using Diamond with the same parameters (--more-sensitive, --min-score 50, --evalue 10⁻⁵); the amino acid sequences obtained after processing in VirSorter 2 and GeneMarkS were taken for analysis.
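For illustration, the Diamond search described above might be assembled as a command line like the one below. This is a sketch under stated assumptions: the blastp mode is assumed because amino acid sequences were searched, the tabular output format (`--outfmt 6`) is an assumption, and all file names are hypothetical. The command is only built and printed, not executed.

```python
import shlex
from typing import List


def diamond_blastp_command(query_faa: str, database: str,
                           output_tsv: str) -> List[str]:
    """Build (but do not run) a Diamond protein search command."""
    return [
        "diamond", "blastp",
        "--query", query_faa,      # predicted ORFs (amino acids)
        "--db", database,          # pre-built Diamond database
        "--out", output_tsv,
        "--outfmt", "6",           # assumption: BLAST-style tabular output
        "--more-sensitive",        # sensitivity mode named in the text
        "--min-score", "50",       # minimum bit score named in the text
        "--evalue", "1e-5",        # e-value threshold named in the text
    ]


# Hypothetical file names, for illustration only.
cmd = diamond_blastp_command("orfs.faa", "nr.dmnd", "orfs_vs_nr.tsv")
print(shlex.join(cmd))
```

According to the Diamond documentation, as far as we are aware, setting `--min-score` takes precedence over the e-value setting, so the bit-score cut-off may be the effective filter here.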
The search for RdRp genes in transcriptomes (in the amino acid sequences of the obtained ORFs) was performed using a local installation of blastp 2.12.0+ (e-value 10⁻⁵) against the RVMT database of intact RdRp genes [89]. Sequences longer than 150 amino acids were used; this minimum threshold was chosen based on the database of intact RdRp genes and corresponds approximately to the minimum length of the sequences found in it. The identified RdRp gene sequences were used for a phylogenetic analysis; alignment with sequences from NR, RefSeq, and RVMT was performed in the program MAFFT v. 7.407 [90], using the algorithm E-INS-i. The aligned sequences were manually checked, and the beginning and the end of the alignment were trimmed using the program Mega 7 [91]. The tree was created using the program MrBayes (v. 3.2) [92].
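The two filters applied to the RdRp search (e-value and minimum length) can be sketched as follows. The sketch assumes the standard 12-column BLAST tabular output, in which the e-value is the eleventh column, and assumes that the 150-amino-acid threshold applies to the length of the query ORF; the function name and the example rows are hypothetical.

```python
from typing import Dict, Iterable, List

MAX_EVALUE = 1e-5      # e-value threshold from the text
MIN_LENGTH_AA = 150    # ORFs must be longer than this (assumed to apply to the query)


def rdrp_candidates(blast_rows: Iterable[str],
                    orf_lengths: Dict[str, int]) -> List[str]:
    """Return ORF ids with an RdRp hit passing both filters, in input order."""
    kept: List[str] = []
    for row in blast_rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or comment lines
        query_id, evalue = fields[0], float(fields[10])
        long_enough = orf_lengths.get(query_id, 0) > MIN_LENGTH_AA
        if evalue <= MAX_EVALUE and long_enough and query_id not in kept:
            kept.append(query_id)
    return kept


# Hypothetical hits: orf_2 is too short, orf_3 has a weak e-value.
rows = [
    "orf_1\trdrp_A\t45.0\t300\t160\t5\t1\t300\t10\t310\t1e-40\t150",
    "orf_2\trdrp_B\t30.0\t90\t60\t3\t1\t90\t5\t95\t1e-8\t55",
    "orf_3\trdrp_C\t28.0\t200\t140\t4\t1\t200\t1\t200\t0.01\t35",
]
lengths = {"orf_1": 420, "orf_2": 120, "orf_3": 400}
print(rdrp_candidates(rows, lengths))  # ['orf_1']
```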
A cluster analysis of transcriptomes with heat maps was performed in Metafast (v. 1.3.0) [99] with default parameters (SortMeRNA-processed reads were used as an input file). We then performed a comparative analysis based on the available transcriptome data [28][29][30][31] following the above-described procedure.
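The general idea of reference-free sample comparison can be illustrated with a much simpler stand-in: counting k-mers in the reads of each sample and computing a pairwise Bray-Curtis dissimilarity. This is not Metafast's algorithm and does not reproduce its results; the value of k, the function names, and the toy reads are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List


def kmer_profile(reads: List[str], k: int = 4) -> Counter:
    """Count all overlapping k-mers across the reads of one sample."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def bray_curtis(a: Counter, b: Counter) -> float:
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for disjoint ones."""
    shared = sum(min(a[kmer], b[kmer]) for kmer in a.keys() & b.keys())
    total = sum(a.values()) + sum(b.values())
    return 1.0 - 2.0 * shared / total if total else 0.0


# Toy reads for three hypothetical samples.
samples: Dict[str, List[str]] = {
    "lake_A": ["ACGTACGTAC"],
    "lake_B": ["ACGTACGTAC"],
    "effluent": ["TTTTGGGGCC"],
}
profiles = {name: kmer_profile(reads) for name, reads in samples.items()}
distances = {
    (x, y): bray_curtis(profiles[x], profiles[y])
    for x, y in combinations(sorted(profiles), 2)
}
for pair, value in distances.items():
    print(pair, value)
```

A matrix of such pairwise dissimilarities is the kind of input that hierarchical clustering and heat maps are built from; in this toy case the two lake samples are identical (0.0) and both are fully distinct from the effluent sample (1.0).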

Conclusions
In this study, we analyzed DNA and RNA viruses from samples collected at three central stations on Lake Baikal and effluent wastewater discharge points. No human pathogenic viruses were registered in the samples from the pelagic zone, whereas the wastewater contained viruses known to infect humans. The treated wastewater samples contained plant viruses that may serve as complementary indicators of fecal contamination. In general, the RNA-seq method is suggested as an alternative approach for the detection of hazardous viruses.