The Metagenomic Analysis of Viral Diversity in Colorado Potato Beetle Public NGS Data

The Colorado potato beetle (CPB) is one of the most serious insect pests due to its high ecological plasticity and ability to rapidly develop resistance to insecticides. The use of biological insecticides based on viruses is a promising approach to control insect pests, but the information on viruses which infect leaf feeding beetles is scarce. We performed a metagenomic analysis of 297 CPB genomic and transcriptomic samples from the public National Center for Biotechnology Information Sequence Read Archive (NCBI SRA) database. The reads that were not aligned to the reference genome were assembled with metaSPAdes, and 13314 selected contigs were analyzed with BLAST tools. The contigs and non-aligned reads were also analyzed with Kraken2 software. A total of 3137 virus-positive contigs were attributed to different viruses belonging to 6 types, 17 orders, and 32 families, matching over 97 viral species. The annotated sequences can be divided into several groups: those that are homologous to genetic sequences of insect viruses (Adintoviridae, Ascoviridae, Baculoviridae, Dicistroviridae, Chuviridae, Hytrosaviridae, Iflaviridae, Iridoviridae, Nimaviridae, Nudiviridae, Phasmaviridae, Picornaviridae, Polydnaviriformidae, Xinmoviridae etc.), plant viruses (Betaflexiviridae, Bromoviridae, Kitaviridae, Potyviridae), and endogenous retroviral elements (Retroviridae, Metaviridae). Additionally, the full-length genomes and near-full length genome sequences of several viruses were assembled. We also found sequences belonging to Bracoviriform viruses and, for the first time, experimentally validated the presence of bracoviral genetic fragments in the CPB genome. Our work represents the first attempt to discover the viral genetic material in CPB samples, and we hope that further studies will help to identify new viruses to extend the arsenal of biopesticides against CPB.


Introduction
The metagenomic approach has proven to be a highly efficient way of detecting viruses in a variety of environments [1,2]. Unlike traditional cultural and diagnostic methods, viral metagenomics uses high-throughput next-generation sequencing, which makes it possible to identify rare and new viruses and to establish ecological links between microorganisms and their natural habitat [3]. Among the various routes of virus transmission, insect transmission is one of the most widespread. Insects are the largest group of animals on the planet and are of significant ecological, agricultural, and medical importance [2]. However, our knowledge of insect viruses is still limited. Establishing an insect virome will provide new insights into the ecology and evolution of viruses and help address public health issues in the fight against arbovirus infections. In addition, pathogenic insect viruses can be used for the biological control of insect pests, and, thus, the identification of new insect viruses can expand the arsenal of potential biocontrol agents [2].

Metagenome Assembly and Sequence Classification
We constructed an analytical pipeline to analyze the genomic and transcriptomic data of Leptinotarsa decemlineata obtained from the NCBI SRA database ( Figure 1). The pipeline, written in Snakemake [21], involves several approaches for classifying genetic sequences and is freely available at: https://github.com/starchevskayamaria17/uncoVir, accessed on December 2022.
In the first step, low-quality and short reads (less than 20 nucleotides in terms of quality and less than 30 nucleotides in terms of length), overrepresented sequences, and adapters were removed using the Trimmomatic tool [22]. For contamination control, the reads aligned to the human genome (GRCh38.p12) using the Bowtie2 tool were deleted [23]. The reads of synthetic origin (identified with BLAST against the UniVec database) were also deleted. The reads aligned to the reference L. decemlineata genome (assembly GCA_000500325.2 Ldec_2.0) using the Bowtie2 tool were also excluded from the subsequent analysis.
In the next step, viral diversity was assessed using the k-mer spectrum analysis. The unaligned reads were classified with Kraken2 [24] using our custom compiled viral sequence database BigViralDB. The results were visualized using Pavian [25]. To assess the completeness of particular viral genomes, we also aligned the reads using Bowtie2 to the viral genomic sequences contained in our BigViralDB database.
GCA_000500325.2 Ldec_2.0) using the Bowtie2 tool were also excluded from the subsequent analysis.
In the next step, viral diversity was assessed using the k-mer spectrum analysis. The unaligned reads were classified with Kraken2 [24] using our custom compiled viral sequence database BigViralDB. The results were visualized using Pavian [25]. To assess the completeness of particular viral genomes, we also aligned the reads using Bowtie2 to the viral genomic sequences contained in our BigViralDB database. The reads unaligned on the L. decemlineata genome were assembled into contigs using the SPAdes assembler tool [26]. Contigs longer than 500 nucleotides and with coverages of more than 20 were selected and analyzed using BLASTn against the NCBI Nucleotide database. The unclassified contigs and contigs annotated with BLASTn as virus-positive The reads unaligned on the L. decemlineata genome were assembled into contigs using the SPAdes assembler tool [26]. Contigs longer than 500 nucleotides and with coverages of more than 20 were selected and analyzed using BLASTn against the NCBI Nucleotide database. The unclassified contigs and contigs annotated with BLASTn as virus-positive were then analyzed using BLASTx against the NCBI nonredundant protein database. The obtained annotation results were analyzed and visualized using custom scripts written in the Python programming language. The details of the running parameters of the programs were included in the pipelines and are given in the source code in the repository (https://github.com/starchevskayamaria17/uncoVir, accessed on December 2022).

Analysis of Reference Genomes of Selected Coleopterans
The genome sequence of Leptinotarsa decemlineata (assembly GCA_000500325.2 Ldec_2.0) used as a reference for alignment was split into 31-nucleotide long k-mers and analyzed using the BLASTn tool against the NCBI Nucleotide database. Similarly, other reference genomes of Coleoptera obtained from the NCBI Genome database were examined: Diabrotica virgifera (GCA_003013835.

Biological Sample Collection and Preparation
The experiments were carried out on biological samples (L. decemlineata adults, larvae, and eggs) collected from private potato fields free from treatment by biological insecticides (Novosibirsk region, Russian Federation; 53 • 44 3.534 N, 77 • 39 0.0576 E). Due to these potato fields not being located in protected areas, there was no need for special permission to collect beetles. The landowners did not prevent access to the fields. Endangered or protected species were not used in this work. The insects were kept in a ventilated laboratory room for a 12:12 h dark/light period at a constant temperature of 25 • C. The insects were fed with fresh Solanum tuberosum foliage. Fourth-instar larvae (2-4 h postmolt in IV instar) were used for cuticle and hemolymph sample collection. The cuticle was cleared of the fat body in a cold phosphate buffer and frozen in liquid nitrogen, and each sample contained the cuticles of six insects. Hemolymph samples were collected as follows. A puncture was made with a sterile needle, and 25 µl of hemolymph was taken. The samples were cooled on ice and frozen in liquid nitrogen, with each being taken from a single insect [27]. For a more detailed description of biological sample collection and preparation, one can refer to the following papers [28,29].
Colorado potato beetle egg samples were prepared as follows. Ten eggs from three clutches were pulled in a single 1.5 ml Eppendorf tube (one sample contained thirty eggs). Then, 0.5 ml of 6% hydrogen peroxide was added to the sample, the sample was stirred for 5 minutes at 500 rpm on a TS-100C (BioSan) shaker, and the free liquid was removed. Then, 0.5 ml of autoclaved water was added to the sample, the tube was inverted several times, and the remaining liquid was removed with a sterile filter paper. The procedure was then repeated with 2.5% sodium hypochlorite. All manipulations were performed at room temperature with room temperature reagents, with the shaker was used without heating. The prepared samples were frozen in liquid nitrogen. This method was adapted from Rotskaya et al. [27].

Confirmation of the Presence of Bracovirus Genetic Fragments in the Genetic Material of L. decemlineata
For the two fragments classified as bracovirus fragments, the oligonucleotide primers were designed using Primer3 software [30]. The sequences of the fragments and primers are presented in the supplementary materials in Supplementary Table S2. DNA was extracted from 20 mg of sterile eggs, cuticle, hemolymph, and whole insect (imago) of the Colorado potato beetle using the PureLink genomic DNA mini kit (Thermo Fisher Scientific, Waltham, MA, USA). The amplification reaction was performed in a GeneAmp PCR System 9700 thermal cycler (Thermo Fisher Scientific, USA) in a volume of 50 µL. The reaction mixture contained Taq-DNA polymerase buffer (Thermo Fisher Scientific, USA), 1.5 mM MgCl2, 0.2 mM dTTP, 0.2 mM dGTP, 0.2 mM dATP, 0.2 mM dCTP, 10 pmol of each oligonucleotide primer, 1.25 Taq-DNA polymerase activity units (Thermo Fisher Scientific, USA), and 2-10 ng matrix DNA. The amplification was performed for 30 cycles using a stepwise program (94 • , 45 s; 55 • , 45 s; and 72 • , 2 min). The PCR products were purified using a QIAquick PCR Purification Kit (250) (QIAGEN, Germany). The amplification products were separated in a 1.2% agarose gel. The fragments obtained were confirmed using Sanger sequencing. The reaction was performed in a GeneAmp PCR System 9700 thermal cycler (Thermo Fisher Scientific, USA) in a volume of 5 µL. The reaction mixture contained prepurified DNA matrix, 2 µL of BigDye-v.3.1 sequencing solution (Thermo Fisher Scientific, USA), and a 4 pM oligonucleotide primer for 30 cycles using a stepwise program (94 • , 20 s; 50 • , 20 s; and 60 • , 4 min).

Analysis of the k-Mer Spectra
The genomic and transcriptomic data of 297 L. decemlineata samples were analyzed in this study. The quality and contamination control of the reads for each sample resulted in no more than 5% of the reads being removed. Up to 10% of the reads were not aligned with the L. decemlineata genome. Among the reads not aligned with the L. decemlineata genome, 1.36% of the reads from the genomic samples (8,461,508 of 620,747,359) and 3.04% of the reads from the transcriptomic samples (5,905,253 of 193,959,828) were analyzed using Kraken2 and our custom BigViralDB database. The results of the analysis down to the family level are visualized in Figure 2. In the transcriptomic samples, the viral reads were classified as mainly belonging to the viral families of Betaflexviridae, Baculoviridae, Peribunyaviridae, Potyviridae, Orthomyxoviridae, Paramyxoviridae, Dicistroviridae, Polydnaviridae, and others. In contrast to the transcriptomic data, the genomic data lacked the reads attributed to viral families of Betaflexviridae, Potyviridae, and Dicistroviridae. The sequences attributed as retroviral or bacteriophage-related are not represented in the plots.

Contig Analysis
A total of 31,069,568 contigs were collected from all samples as a result of the assembly of reads unaligned to the L. decemlineata genome. Contig filtering by length and coverage resulted in the selection of 122,506 contigs (with an average length of 1861 bp and

Contig Analysis
A total of 31,069,568 contigs were collected from all samples as a result of the assembly of reads unaligned to the L. decemlineata genome. Contig filtering by length and coverage resulted in the selection of 122,506 contigs (with an average length of 1861 bp and maximum contig length of 618,970 bp), which were further classified using BLAST tools. The final table (Supplementary Table S3) with all the results of contigs annotated with BLASTx contains 5,175,489 lines (the e-value < 10 −3 was chosen as the filtering criterion; matches with bacteriophage proteins were excluded from the analysis). This table provides information on 13,314 unique contigs, with the superkingdom Eukaryota accounting for 13,113 unique contigs, the superkingdom Bacteria accounting for 6220, Archaea accounting for 269, and viruses accounting for 3137 contigs. A total of 719 contigs annotated with BLASTn as viral were then annotated with BLASTx. Among the contigs that could not be annotated with BLASTn, 3395 contigs were assembled from Colorado potato beetle transcriptomic reads and 9203 from genomic reads, with these then being annotated with BLASTx.
The following tables were created based on the results of BLASTx annotation: • Among the viral protein BLASTx hits for which the virus family was known, the best match was selected for each contig based on the e-value (minimum) and the bitscore (maximum)-2954 rows (Supplementary Table S4); • Among the viral protein BLASTx hits for which the virus family was unknown, the best match was selected for each contig based on the e-value (minimum) and the bitscore (maximum)-1662 rows (Supplementary Table S5); • Among the non-viral protein BLASTx hits, the best match was selected for each contig based on the e-value (minimum) and the bitscore (maximum)-13,252 rows (Supplementary Table S6).
Supplementary Table S7 presents the results of the virus-positive contig annotation after filtering, including 1652 contigs assigned to different virus species with an unspecified virus family and 2954 contigs with an established virus family (with the information on contigs the protein products of which have homology to bacteriophage proteins that were excluded). The table also provides the identifiers of genomic and transcriptome data projects, the information on protein products, the distribution of contig and alignment lengths, identity scores, and e-value values. Table 1 presents the most interesting results, specifically virus-positive contigs belong to six virus types: Artverviricota, Kitrinoviricota, Negarnaviricota, Nucleocytoviricota, Pisuviricota, and Preplasmiviricota, distributed in 17 orders, 32 viral families, and 97 species. In addition, at least 48 virus species with unknown viral families were also matched. Figure 3 shows the distributions of viruspositive contig numbers derived from genomic and transcriptomic samples by virus family or species (if family is unknown). If homology with viral sequences was not taken into account, then most of these contigs were attributed to insects (Coleoptera), as is shown in Supplementary Figures S1 and S2.
The highest number of annotated virus-positive contigs belonged to the endogenous viral elements of retroviral origin: species of Aedes aegypti To virus 2 (AAToV2), Aedes aegypti To virus 1 (AAToV1), and Atrato Retro-like virus with unspecified taxonomic affiliations and a number of virus representatives of the families Metaviridae and Retroviridae. The identity percentage ranges from 31.5% to 85.7% and is higher for the members of the Metaviridae family, AAToV1 and AAToV2. For Trichoplusia ni TED virus (Metaviridae), the maximum alignment length reaches 1157 aa. More than 1000 contigs obtained mainly from genomic data were annotated as belonging to different members of the family Adintoviridae. The alignments indicate homology not only in terms of retrovirus-like integrase and polymerase B but also in terms of structural proteins of the family Adintoviridae.  Many virus-positive contigs were found to be associated with viruses infecting insects. Twenty-six contigs with a similarity to Lampyris noctiluca iflavirus 1 (family Iflaviridae) were identified. The maximum length of alignments between the proteins encoded in these contigs and the polyprotein of this virus was 1642 aa (median 1638 aa), with a median identity of 33.8% (maximum identity percentage 39.7%). The genome size of the Iflaviridae family members varies from 9 to 11 kb, with a median length among the corresponding virus-positive contigs of 9870 nucleotides (maximum 9934). Two of the four contigs assigned to the family Dicistroviridae were attributed to the species Aphis gossypii virus and Aphid lethal paralysis virus, with the contig lengths being 6635 and 3468 nucleotides, respectively, and the genome sizes of representatives of the family Dicistroviridae ranged from 8 to 10 kb. The percentage of alignment identity predicted for these contigs is 93.9% for the ORF1 protein (alignment length 1039 aa) and 92.8% for the capsid protein (alignment length 808 aa). Along with the families Iflaviridae and Dicistroviridae, other viruspositive contigs homologous to the genetic sequences of the order Picornavirales were assigned to insect-infecting members of the Picornaviridae, Solinviviridae, and Caliciviridae families. The identity percentage of the proteins encoded by these contigs versus the polyproteins of these virus families does not exceed 30%. Many of these contigs are extended and range from 3350 to 11648 nucleotides in length. The protein products of 36 contigs annotated as belonging to insect-infecting members of the family Orthomyxoviridae (Photinus pyralis orthomyxo-like virus 1, Quaranjavirus araguariense, Coleopteran orthomyxo-related Many virus-positive contigs were found to be associated with viruses infecting insects. Twenty-six contigs with a similarity to Lampyris noctiluca iflavirus 1 (family Iflaviridae) were identified. The maximum length of alignments between the proteins encoded in these contigs and the polyprotein of this virus was 1642 aa (median 1638 aa), with a median identity of 33.8% (maximum identity percentage 39.7%). The genome size of the Iflaviridae family members varies from 9 to 11 kb, with a median length among the corresponding virus-positive contigs of 9870 nucleotides (maximum 9934). Two of the four contigs assigned to the family Dicistroviridae were attributed to the species Aphis gossypii virus and Aphid lethal paralysis virus, with the contig lengths being 6635 and 3468 nucleotides, respectively, and the genome sizes of representatives of the family Dicistroviridae ranged from 8 to 10 kb. The percentage of alignment identity predicted for these contigs is 93.9% for the ORF1 protein (alignment length 1039 aa) and 92.8% for the capsid protein (alignment length 808 aa). Along with the families Iflaviridae and Dicistroviridae, other virus-positive contigs homologous to the genetic sequences of the order Picornavirales were assigned to insect-infecting members of the Picornaviridae, Solinviviridae, and Caliciviridae families. The identity percentage of the proteins encoded by these contigs versus the polyproteins of these virus families does not exceed 30%. Many of these contigs are extended and range from 3350 to 11648 nucleotides in length. The protein products of 36 contigs annotated as belonging to insect-infecting members of the family Orthomyxoviridae (Photinus pyralis orthomyxo-like virus 1, Quaranjavirus araguariense, Coleopteran orthomyxo-related viruses, etc.) showed homology with various structural and nonstructural proteins, with identities of 25 to 55% and alignment lengths reaching 768 aa.
A total of 55 contigs derived from genomic samples were annotated as belonging to the family Baculoviridae, with most contigs encoding polypeptides homologous to Fprotein (with a maximal alignment length of 474 aa and maximal identity of 30.9%). In addition, translation frames were identified, with their products homologous to vp80 (156 aa alignment length, 35.9% identity), ORF27 (220 aa alignment length, 26.8% identity), and PxORF26 (271 aa alignment length, 21% identity). Additionally, more than 70 contigs containing translation frames homologous to glycoproteins from various members of the families Chuviridae and Phasmaviridae were also detected. The identity of the encoded amino acid sequences to the Coleopteran chu-related virus glycoprotein OKIAV127 (family Chuviridae) was as high as 72.8%. A total of 62 contigs up to 6 kb in length were identified, with their predicted proteins being homologous with various amino acid sequences of representatives of the genus Bracoviriform (family Polydnaviriformidae). Additionally, a contig with a length of 17638 nucleotides encoding a protein 28.6% identical to the protein P74 of Penaeus monodon nudivirus (family Nudiviridae) was found, with an alignment length of 655 aa.
Additionally, we also identified contigs attributed to plant viruses. Within 66 extended contigs exceeding 8 kb in length, translation frames encoding proteins with 100% identity to Potato virus S (family Betaflexiviridae) were found, with an alignment length of 1975 aa. A contig was found with a protein product 99.6% homologous to the Potato virus H (family Betaflexiviridae) shell protein, with an alignment length of 292 aa. Twenty-five contigs longer than 9 kb were found to be assigned to Potato virus Y (family Potyviridae), with an alignment length of 3061 aa (polyprotein). Given that the genomes of these potato viruses range from 5.4 kb to 12 kb in length, we obtained almost complete genome assemblies. In addition, we identified 63 extended (up to 8.5 kb) contigs belonging to the Papaya mottle associated virus, with their protein products being homologous to RdRp of the virus specified (maximal identity was 64.4%, maximal alignment length was 1017 aa). Additionally, we identified single contigs of plant viruses with protein products homologous to those of Garlic common virus, Bromoviridae, and Kitaviridae (with identity less than 50%, and the alignment length less than 200 aa).
In addition, contigs were found encoding protein products homologous to viral proteins of the replication system, such as RdRp, and to uncharacterized viral proteins, with identities of 30-40% and alignment lengths not exceeding 300 aa. In particular, 19 contigs identified in the study were attributed to Trichoplusia ni ascovirus 2c (family Ascoviridae), with the alignment of the identity of the encoded protein products to various uncharacterized viral proteins being 40.6% and the maximum alignment length being 137 aa. In the genomic samples, we also identified two contigs whose protein products were 31% identical to the short amino acid sequences, 168 aa in length, of uncharacterized Iridovirus LCIVAC01 proteins (family Iridoviridae). Extended contigs of more than 10 kb in length were also detected, with alignment lengths of 1930 aa and with them being 30% identical to the RdRp of Xinmoviridae family representatives. Extended contigs encoding proteins homologous to RdRp, ORF3, and hypothetical proteins of the Virgaviridae family were also identified. Moreover, several contigs exhibited homology in terms of their protein products to uncharacterized proteins of various representatives of the Poxviridae, Hytrosaviridae, Partitiviradae, Bornaviridae, Rhabdoviridae, Parvoviridae families and several viruses of unknown viral families.
Protein products encoded by the same contig tend to exhibit significant homology with related viruses. To understand how viral annotations are related to each other, we additionally clustered contigs based on the homology of the encoded proteins with proteins from different viral families (Figure 4). The maximum alignment length of the proteins encoded by each contig was calculated for each contig relative to the proteins of each of the presented viral families. The alignment lengths of the proteins encoded by each of the contigs with the amino acid sequences of each of the viral families are presented as values in the heatmap cells. Figure 4 shows the contigs with only a unique homology pattern. values in the heatmap cells. Figure 4 shows the contigs with only a unique homology pattern. It can be seen that contigs are clustered largely based on the phylogenetic kinship of viral groups. In particular, contigs with proteins homologous to those of viruses of the Dicistroviridae family are clustered with RNA-containing viruses, including other families of the Picornavirales order (Picornaviridae, Iflavirudae, Secoviridae, Caliciviridae, and Marnaviridae) as well as with virus families of the Martellivirales order (Virgaviridae, Kitaviridae, and Closteroviridae). Contigs encoding proteins that are homologous to proteins of phylogenetically close orders, such as Mononegavirales (Xinmoviridae, Bornaviridae, and Rhabdoviridae) and Jingchuvirales (Qinviridae and Chuviridae), are also clustered. A separate cluster is formed by contigs encoding proteins homologous to those of viruses of the families Xinmoviridae, Bornaviridae, Rhabdoviridae, Artoviridae, Mymonaviridae, Lispiviridae, and Nyamiviridae of the order Mononegavirales. A number of contig clusters encode proteins that are homologous to proteins of Nucleocytoviricota representatives (families Iridoviridae, Ascoviridae, Phycodnaviridae, Pithoviridae, Mimiviridae, and Marseilleviridae). A full version of the table containing the data used in Figure 4 is presented in Supplementary  Table S8. An interactive visualization of all the contigs analyzed can be found at the project repository at https://github.com/starchevskayamaria17/uncoVir. Most contigs whose protein products were not classified as viral proteins were found to have homology with the amino acid sequences of arthropods and mainly with insect proteins. It can be seen that contigs are clustered largely based on the phylogenetic kinship of viral groups. In particular, contigs with proteins homologous to those of viruses of the Dicistroviridae family are clustered with RNA-containing viruses, including other families of the Picornavirales order (Picornaviridae, Iflavirudae, Secoviridae, Caliciviridae, and Marnaviridae) as well as with virus families of the Martellivirales order (Virgaviridae, Kitaviridae, and Closteroviridae). Contigs encoding proteins that are homologous to proteins of phylogenetically close orders, such as Mononegavirales (Xinmoviridae, Bornaviridae, and Rhabdoviridae) and Jingchuvirales (Qinviridae and Chuviridae), are also clustered. A separate cluster is formed by contigs encoding proteins homologous to those of viruses of the families Xinmoviridae, Bornaviridae, Rhabdoviridae, Artoviridae, Mymonaviridae, Lispiviridae, and Nyamiviridae of the order Mononegavirales. A number of contig clusters encode proteins that are homologous to proteins of Nucleocytoviricota representatives (families Iridoviridae, Ascoviridae, Phycodnaviridae, Pithoviridae, Mimiviridae, and Marseilleviridae). A full version of the table containing the data used in Figure 4 is presented in Supplementary Table S8. An interactive visualization of all the contigs analyzed can be found at the project repository at https://github.com/starchevskayamaria17/uncoVir, accessed on December 2022. Most contigs whose protein products were not classified as viral proteins were found to have homology with the amino acid sequences of arthropods and mainly with insect proteins.

Non-Viral Protein Hits Are Significantly Enriched with Unknown and Uncharacterized Proteins as Compared to Viral Protein Hits
The next step was to analyze the enrichment of the contigs under study with unknown and uncharacterized proteins. The statistical analysis was performed using Fisher's exact test for unpaired sets (nonoverlapping sets of contigs) and the McNemar test for paired sets (contigs whose proteins have homologs among both viral and non-viral proteins). The entire set of identified homologous proteins was divided into the following groups: "viral" and "non-viral" and "known" and "unknown." Proteins with the following keywords in their name were considered unknown: unnamed, uncharacterized, unknown, or hypothetical. The set of contigs were also categorized into groups: "viral" and "non-viral" contigs whose proteins have and lack homology with viral proteins, respectively ( Figure 5).

Non-Viral Protein Hits Are Significantly Enriched with Unknown and Uncharacterized Proteins as Compared to Viral Protein Hits
The next step was to analyze the enrichment of the contigs under study with unknown and uncharacterized proteins. The statistical analysis was performed using Fisher's exact test for unpaired sets (nonoverlapping sets of contigs) and the McNemar test for paired sets (contigs whose proteins have homologs among both viral and nonviral proteins). The entire set of identified homologous proteins was divided into the following groups: "viral" and "non-viral" and "known" and "unknown." Proteins with the following keywords in their name were considered unknown: unnamed, uncharacterized, unknown, or hypothetical. The set of contigs were also categorized into groups: "viral" and "non-viral" contigs whose proteins have and lack homology with viral proteins, respectively ( Figure 5). The analysis was performed for all "viral" contigs, which were found to be homologous with any of the viral families, and "non-viral" contigs, whose proteins were found to be homologous only with non-viral proteins. A total of 2817 "known" and 137 "unknown" proteins were found for "viral" contigs, and 3144 and 7033 were found for "nonviral" contigs, respectively. Thus, the enrichment with "unknown" proteins among the non-viral proteins was shown to be significant (p < 10 −6 , one-way Fisher's exact test).
We also performed unknown protein enrichment analysis for contigs with proteins shown to be homologous with viral and non-viral proteins. In this case, 2817 "known" and 137 "unknown" proteins were found among "viral" proteins, and 427 and 2484 were found among "non-viral" proteins, respectively. Thus, the "unknown" protein enrichment among non-viral proteins was shown to be significant (p < 10 −6 , one-way McNemar test). This finding can be explained by the fact that many eukaryotic genes encoding uncharacterized and hypothetical proteins may be of viral origin. The analysis was performed for all "viral" contigs, which were found to be homologous with any of the viral families, and "non-viral" contigs, whose proteins were found to be homologous only with non-viral proteins. A total of 2817 "known" and 137 "unknown" proteins were found for "viral" contigs, and 3144 and 7033 were found for "non-viral" contigs, respectively. Thus, the enrichment with "unknown" proteins among the non-viral proteins was shown to be significant (p < 10 −6 , one-way Fisher's exact test).
We also performed unknown protein enrichment analysis for contigs with proteins shown to be homologous with viral and non-viral proteins. In this case, 2817 "known" and 137 "unknown" proteins were found among "viral" proteins, and 427 and 2484 were found among "non-viral" proteins, respectively. Thus, the "unknown" protein enrichment among non-viral proteins was shown to be significant (p < 10 −6 , one-way McNemar test). This finding can be explained by the fact that many eukaryotic genes encoding uncharacterized and hypothetical proteins may be of viral origin.

Alignment Results
The mapping of reads to viral genomes and the extraction of consensus sequences yielded the near full-length or extended genomic sequences for a number of viruses. Table 2 shows the sizes of the reference genomes and the most extended consensus sequences (in nucleotides and % relative to the reference size). Extended information about the alignments is provided in Supplementary Table S9. Table 2. Characteristics of reads' alignments to viral genomes. This table specifies the family, genus, and species of the virus, genome coverage information (maximum, average, and median, in nucleotides and percentages), the minimum and maximum genome lengths stored in the BigViralDB database for the specified viral family, and the number of L. decemlineata genomic and transcriptomic projects where these viral sequences were found. The sequences representing 99.63% and 99.95% of the corresponding reference genomes were obtained for Potato virus Y (family Potyviridae) and Potato virus S (family Betaflexiviridae), respectively. The size of the consensus sequences obtained as a result of the alignment for five members of the family Dicistroviridae ranged from 10.39% to 70.84% of their genome size. Additionally, the alignment allowed us to obtain the consensus sequences of genome fragments of other viruses known to affect insects, such as the Wuhan insect virus 33, and the representatives of the families Virgaviridae, Baculoviridae, and Polydnaviriformidae.

Analysis of Selected Coleopteran Species Reference Genomes to Detect Bracoviral Genetic Fragments
The k-mer analysis and the contigs annotation revealed numerous sequences belonging to members of the genus Bracoviriform (family Polydnaviriformidae). The analysis of the reference genome Leptinotarsa decemlineata revealed numerous sequences homologous to the genus Bracoviriform, with a total length of approximately 25 kbp. Several other coleopteran genomes were also analyzed. Figure 6 shows a heatmap of the distribution of k-mers classified as Bracoviriform depending on the e-value. and 2100 nucleotides long were homologous to Cotesia sesamiae Mombasa bracovir (EF710639.1, the coordinates of the reference sequence alignment: 57046-58557, identi percentage 100%) and Cotesia vestalis bracovirus segment 35 (HQ009558.1, the coordinat of the reference sequence alignment: 1797-3455, identity percentage of 98.13%), respe tively. These contigs were also found to be aligned to the Leptinotarsa decemlineata genom The first bracovirus contig was aligned to NW_019290920.1, and the coordinates of th reference sequence alignment were 2663-4169; the second contig was aligned NW_019296350.1, and the coordinates of the reference sequence alignment were 997 11595. The alignment allows us to assume the insertion of viral DNA of Bracoviriform g nus into the genome of Leptinotarsa decemlineata. Figure 6. Distribution of bracovirus k-mers depending on e-value in reference genomes of sever members of Coleoptera, including Leptinotarsa decemlineata. The color indicates the number bracovirus k-mers identified within the specified e-value interval (BLASTn short). The absence the color designates the absence of a match.
We calculated the primers so as to confirm the presence of bracovirus fragments the genetic material of Leptinotarsa decemlineata, to prove the insertion into the Leptinotar decemlineata genome, and to exclude the presence of fragments containing regions flan ing the insertion in the corresponding bracovirus genome. The PCR analysis of DNA o tained from different tissues of Leptinotarsa decemlineata (eggs, sterile eggs, cuticle, hem lymph, and whole imago insect) confirmed the presence of both bracovirus fragments an revealed the absence of fragments containing the regions flanking the insertion in th bracovirus genome. The nucleotide sequence of the fragments was confirmed by Sang sequencing. The insertion into the Leptinotarsa decemlineata genome was not confirmed b Two sequences with homology to the segments of the bracovirus genomes were selected for the subsequent experimental studies. The nucleotide sequences of contigs 1513 and 2100 nucleotides long were homologous to Cotesia sesamiae Mombasa bracovirus (EF710639.1, the coordinates of the reference sequence alignment: 57046-58557, identity percentage 100%) and Cotesia vestalis bracovirus segment 35 (HQ009558.1, the coordinates of the reference sequence alignment: 1797-3455, identity percentage of 98.13%), respectively. These contigs were also found to be aligned to the Leptinotarsa decemlineata genome. The first bracovirus contig was aligned to NW_019290920.1, and the coordinates of the reference sequence alignment were 2663-4169; the second contig was aligned to NW_019296350.1, and the coordinates of the reference sequence alignment were 9973-11595. The alignment allows us to assume the insertion of viral DNA of Bracoviriform genus into the genome of Leptinotarsa decemlineata.
We calculated the primers so as to confirm the presence of bracovirus fragments in the genetic material of Leptinotarsa decemlineata, to prove the insertion into the Leptinotarsa decemlineata genome, and to exclude the presence of fragments containing regions flanking the insertion in the corresponding bracovirus genome. The PCR analysis of DNA obtained from different tissues of Leptinotarsa decemlineata (eggs, sterile eggs, cuticle, hemolymph, and whole imago insect) confirmed the presence of both bracovirus fragments and revealed the absence of fragments containing the regions flanking the insertion in the bracovirus genome. The nucleotide sequence of the fragments was confirmed by Sanger sequencing. The insertion into the Leptinotarsa decemlineata genome was not confirmed by this experiment, which may be due to assembly errors or high variability of corresponding regions of the L. decemlineata genome (Supplementary Figures S3-S6).

Insect Viruses
Numerous sequences of insect-infecting viruses were found in the data analyzed. Special attention should be given to the viruses of the families Iflaviridae and Dicistroviridae (the members of the order Picornavirales), whose genetic sequences were detected at different stages of the analysis. The members of the families in question are characterized by a rather small ss(+)RNA genome measuring 8-10 kb for the family Dicistroviridae and 9-11 kb for the family Iflaviridae. It is worth noting that new virus species of these families are often identified by metagenomic analysis in the NGS data of various insects, particularly mosquitoes [31]. The Homalodisca coagulata dicistrovirus also identified in this way is the virus of a glassy-winged sharpshooter Homalodisca coagulata (Cicadellidae), which feeds on hundreds of plant species and carries plant pathogens, causing enormous damage to agriculture [32]. Here, we obtained numerous sequences attributed to these two families of viruses, with contig lengths commensurate with their complete genome lengths, some of which are likely the genetic sequences of new viruses of these families. The contigs annotated as homologous to Lampyris noctiluca iflavirus 1 reach a length of 10 kb. However, the amino acid sequence of the polyprotein encoded by these contigs is only 30% identical to that of the iflavirus. In this study, we identified sequences homologous to Homalodisca coagulata dicistrovirus (comprising over 40% of the genome), Aphid lethal paralysis virus (up to 70% of the genome and an extended contig with a length of 3468 nucleotides, encoding a polypeptide that is 92% identical to the viral protein), Aphid glycines virus 3 (over 70% of the genome), Aphis gossypii virus (a 6635-nucleotide-long extended contig encoding a protein product that is 93% identical to the viral polyprotein), and others. The majority of known members of the family Dicistroviridae have been shown to infect only a limited number of related insect species (belonging to the same order). However, there are viruses in this family that infect a vast range of hosts. For example, Cricket paralysis virus can infect more than 20 insect species from different orders [33].
Earlier studies suggested that the Colorado potato beetle could be infected by viruses of the Baculoviridae and Iridoviridae families [34]. Various baculoviruses are known to affect a wide variety of insect taxonomic groups. Currently, the members of the family Baculoviridae are used as safe and effective biopesticides to protect forests, fields, and horticultural crops against lepidopteran pests [35]. Most of the contigs found in the genomic data and annotated as baculoviruses exhibit homology to the F-protein gene of various members of the genera Alphabaculovirus and Betabaculovirus and will be described in more detail below in the "Endogenous viral elements" section. Despite a number of contigs being found to be homologous with baculovirus sequences different from the F-protein gene, the low length of such fragments did not allow us to annotate them reliably.
Among the members of the Nucleocytoviricota type, a number of viruses of the families Iridoviridae, Poxviridae, and Ascoviridae are known to affect insects. However, despite the presence of contigs assigned to these families, the low percentage of identity (30-40%) and alignment length (64 to 191 aa), as well as the short length of contigs (from 500 to 2044 nucleotides), did not allow conclusions to be drawn about the presence of genetic material of specific viruses of these families in the samples analyzed. These sequences may also be fragments of endogenous viral elements integrated into the L. decemlineata genome over the course of evolution. The presence of endogenous viral elements of iridovirus origin in insect genomes is indicated, in particular, by the fact that certain representatives of leaf-feeding beetles, such as Gastrophysa atrocyanea, were shown to produce a diapausespecific peptide that acts as a blocker of potential-dependent calcium channels, provides antifungal activity, and is homologous to the sequence of the peptide encoded by Iridoviridae. The sequences encoding these peptides have been suggested to be suitable as probes for analyzing insects and this group of viruses [36].
For several other viruses affecting insects, such as members of the families Ortomyxoviridae, Virgaviridae, Xinmoviridae, and Partiviridae, and viruses with unspecified families, such as Wuhan insect virus 33, Hubei picorna-like virus 15, and Allermuir Hill virus 1, extended amino acid alignments with various structural and nonstructural viral proteins were found, with the corresponding virus-positive contigs being highly extended.

Plant Viruses
Plant pathogenic viruses pose a serious threat to livestock as well as agricultural, horticultural, and ornamental crops [37,38]. The Potato virus Y detected in this study belongs to the RNA-containing virus family Potyvirus of the family Potyviridae, which represents approximately 30% of all known plant pathogenic viruses. Potato virus Y, usually transmitted by aphids, can significantly reduce potato yields [39]. Potato virus S, the fulllength genome of which was also identified in this work, belongs to the RNA-containing viruses of the Betaflexviridae family. Despite having less pronounced disease symptoms in the affected plants, Potato virus S can destroy up to 20% of crops [40]. These results confirm the presence of vast amounts of potato viruses in L. decemlineata tissues and especially in gut and head samples.
In addition, leaf-feeding beetles transmit a variety of plant viruses [41,42]. Thus, Leptinotarsa decemlineata might also be involved in the mechanistic transmission of potato viruses. It is interesting to note that the Potato virus Y reduces the production of potato defense factors, such as sesquiterpenes, namely beta-barbatene, thus increasing the growth of Colorado potato beetle larvae feeding on infected plants [39], and, as a result, the plants are exposed to a combination of severe biotic stressors, increasing crop losses [39].
In addition, we identified 63 extended (up to 8.5 kb) contigs belonging to the Papaya mottle associated virus, with their protein products being homologous to RdRp. Additionally, we identified several plant virus-positive contigs with protein products that are homologous to those of Garlic common virus, Bromoviridae, and Kitaviridae with alignment lengths of less than 200 aa and identities of less than 50%. Thus, it was not possible to make any strong conclusions on these contigs.

Endogenous Virus Elements
The ability of many viruses to integrate their genetic material into the host genome is widely known. It was reported that 0.59% of the genome of L. decemlineata contains LTR regions, including Copia, Gypsy, Gypsy-Cigr, and Pao [8]. This study identified numerous retroviral sequences homologous to the members of the families Retroviridae and, to a greater extent, Metaviridae, which are characteristic to insects (LTR retrotransposon, Ty3/gypsy family), as well as sequences homologous to the unclassified viruses AAToV1 and AAToV2. Along with integrated retroviral sequences, we discovered viral sequences of non-retroviral origin, called endogenous viral elements (EVEs). The currently known insect EVEs belong to at least 28 viral families [43]. The endogenization of viral sequences into the host genome and EVE formation can occur via nonhomologous recombination or via the participation of the reverse transcriptase and integrase of endogenous retrotransposons. EVEs can also contribute to the emergence of new functional host genes and even play an antiviral role by generating small RNA, thereby preventing the maturation of viral particles at various stages of the viral life cycle [44][45][46].
Recent studies of insect genomes have uncovered numerous genetic sequences of negative sense RNA-containing viruses [43,47]. Such studies confirm the fact that arthropods are one of the main reservoirs of viral genetic diversity and undoubtedly play an essential role in virus evolution. Genetic sequences of the members of the family Chuviridae, characterized by an unusual ordering of genes, have been found in many genomic sequences of various invertebrates [48]. In particular, integrated genes of Chuviridae glycoproteins were found in the genomes of Dendroctonus ponderosae and Tribolium castaneum (Coleoptera) [47]. For the mosquitoes Aedes aegypti and Aedes albopictus, the presence of endogenous Chuviridae-like genes in Bel/Pao elements (family Belpaoviridae) was described. In our study of genomic data samples, we have found contigs encoding proteins similar to the glycoproteins of viruses of the Chuviridae family, including the Coleopteran chu-related virus. In addition, in the genomic data of L. decemlineata, we also found sequences homologous to the genes of viruses of the families Phasmaviridae, Rhabdoviridae, Xinmoviridae, Bornaviridae, and Orthomyxoviridae. These viruses have previously been shown to be present as EVEs within the genomes of various insect families [43].
Of interest is the fact that sequences homologous to the baculovirus F-protein gene were found in the genomic data of L. decemlineata. It is believed that the ancestral baculoviruses could replicate only in the midgut epithelial cells of insects, and the F-protein gene acquired during the evolution allowed the virus to penetrate the hemocoel and cause generalized infection. Previously, the orthologs of the baculovirus F-protein gene were found in the endogenous retroviral sequences of lepidopterans, dipterans, and other insect species, indicating that the regions encoding F-protein homologs identified in our study can also be incorporated into the genome of L. decemlineata [49,50], with this possibly indicating that the Colorado potato beetle can also be affected by baculoviruses.
We also identified numerous extended fragments homologous to the genetic sequences of viruses of the genus Bracoviriform. The members of the genus Bracoviriform belong to the family Polydnaviriformidae and are associated with parasitoid wasps, including approximately 18,000 species of five subfamilies of the family Braconidae (Microgastrinae, Cardiochilinae, Miracinae, Khoikholinae, and Cheloninae). The genomes of Polydnaviriformidae family viruses are segmented; their segments have multiple copies and present in virions non-equimolarly in the form of circular supercoiled double-stranded DNA measuring from 2 to 31 kbp. The cumulative non-redundant size of the bracovirus genome varies from 190 to 500 kbp [51]. The viral genome is transmitted as proviral DNA in parasitoid wasp cells and circular episomal DNA within virions. The replication of the virus genome and virion assembly occurs in ovarian cells, with viral particles being released and accumulated in the lumen of the oviduct and entering the host larvae or eggs during wasps laying their own eggs. In other wasp cells, the virus does not replicate and exists predominantly in a proviral form. Interestingly, the expression of viral genes causes changes in the physiology of the infected host and suppresses the encapsulation of wasp eggs, promoting the development and emergence of wasps [52]. Wasps parasitize on the larvae and eggs of lepidopterans, coleopterans, and hymenopterans. Despite the absence of replication in the cells of lepidopterans, the integration of bracovirus genome segments has been demonstrated both in vitro, in cell cultures, and in vivo [51]. Certain bracovirus genes were shown to contribute to protecting insects against baculoviruses-lethal pathogens, which explains why these genes have been conserved in the genomes of many lepidopterans [53].
We found extended bracovirus sequences in the reference genome of L. decemlineata and in the genomes of several other representatives of Coleoptera. We also demonstrated the presence of bracovirus sequences in the genetic material isolated not only from imago and larva insect tissues but also from sterile eggs of L. decemlineata, suggesting the integration of bracovirus genomic sequences within the Colorado potato beetle genome. To our knowledge, we are the first to report on the detection of bracovirus sequences in the genomes of the Colorado potato beetle and leaf-feeding beetles. The presence of these fragments in the Colorado potato beetle genome also appears to be associated with parasitoid wasps, such as the wasp Edovium puttleri, which parasitizes the eggs of Leptinotarsa decemlineata [54].
Additionally, the other sequences identified in our study, which are homologous to the genetic sequences of Nudiviridae family viruses, are also likely associated with bracovirus genetic fragments, given that the genus Bracoviriform originated from the Nudiviridae family [55].
Thus, the metagenomic studies of accumulated public high-throughput sequencing data serve as an important source of information for the search for new viruses. This work highlights the significance of metagenomic studies of genomic and transcriptomic data obtained from insect pests for a better understanding of their ecology and evolution. The discovery of genetic sequences potentially belonging to insect-infecting viruses points to the possibility of expanding the arsenal of biological methods to control the Colorado potato beetle population based on entomopathogenic viruses.
Supplementary Materials: The following supporting information can be downloaded at: https://www. mdpi.com/article/10.3390/v15020395/s1, Figure S1 Supplementary Table S2. Table S1: The summary of bioprojects used in this research (NCBI SRA); Table S2: Bracoviral genetic sequences and primers used to confirm the presence of certain viral fragments and inserts in the genome of Leptinotarsa decemlineata; Table S3: Contigs annotations with BLASTx (e-value < 10**-3, matches with bacteriophage proteins were excluded ); Table S4: Annotation BLASTx of virus-positive contigs with a known family (minimum e-value, maximum bitscore); Table S5: Annotation BLASTx of virus-positive contigs with a unknown family (minimum e-value, maximum bitscore); Table S6: Annotation BLASTx of virus-negative contigs (minimum e-value, maximum bitscore); Table S7: The summary of ST4 and ST5 tables. The distribution of virus-positive contigs by viral taxonomic groups; Table S8: Clusterization (Heat map); Table S9: Read alignment metrics for viral genomes in the BigViralDB database.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.