Bioinformatic Surveillance Leads to Discovery of Two Novel Putative Bunyaviruses Associated with Black Soldier Fly

The black soldier fly (Hermetia illucens, BSF) has emerged as an industrial insect of high promise because of its ability to convert organic waste into nutritious feedstock, making it an environmentally sustainable alternative protein source. As global interest rises, rearing efforts have also been upscaled, which is highly conducive to pathogen transmission. Viral epidemics have stifled mass-rearing efforts of other insects of economic importance, such as crickets, silkworms, and honeybees, but little is known about the viruses that associate with BSF. Although BSFs are thought to be unusually resistant to pathogens because of their expansive antimicrobial gene repertoire, surveillance techniques could be useful in identifying emerging pathogens and common BSF microbes. In this study, we used high-throughput sequencing data to survey BSF larvae and frass samples, and we identified two novel bunyavirus-like sequences. Our phylogenetic analysis grouped one in the family Nairoviridae and the other with two unclassified bunyaviruses. We describe these putative novel viruses as BSF Nairovirus-like 1 and BSF uncharacterized bunyavirus-like 1. We identified candidate segments for the full BSF Nairovirus-like 1 genome using a technique based on transcript co-occurrence and only a partial genome for BSF uncharacterized bunyavirus-like 1. These results emphasize the value of routine BSF colony surveillance and add to the number of viruses associated with BSF.


Introduction
Insects have become a point of interest as sustainable alternative sources of food and feed. Insect rearing is less environmentally impactful than traditional food and feed production, but insects must be reared in mass to produce a substantial amount of biomass [1]. Mass rearing of animals is highly conducive to pathogen transmission, as many are reared in close quarters. Insects are no exception, and the production of insects of economic importance, such as crickets (Orthoptera: Gryllidae), silkworms (Lepidoptera: Bombycidae), and honeybees (Hymenoptera: Apidae), has been historically stifled by viral epidemics [2].
The black soldier fly (BSF) (Hermetia illucens) is an insect of emerging industrial importance. This is because of its ability to convert organic waste to highly nutritious biomass for animal feed and potentially food for humans [3]. It has also been investigated as a source of antimicrobial compounds, antioxidants, and biodiesel [4][5][6]. As a result, there is global interest in upscaling the rearing of the species; however, relatively little is known about the microbes that could help or hinder its production [7]. It has been proposed that BSF are unusually resistant to pathogens because of their large repertoire of antimicrobial peptides, and until recently, no viruses were known to be associated with BSF [8][9][10][11]. Pienaar et al. [11] mined publicly available BSF data for endogenous viral elements (EVEs) and viruses and found multiple EVEs and one exogenous toti-like virus. This suggests that there is a history of BSF-virus interactions and emphasizes the need for investigating BSF-virus interaction.
In this study, we use a computational approach to survey RNA-seq datasets for viruses associated with BSF. We use transcriptomic data generated from previous BSF studies to search for putative novel viral-like sequences associated with BSF. We use a sequence similarity-based approach to identify putative viral sequences in our data, infer phylogenetic relationships based on the RNA-dependent RNA polymerase (RdRp) gene, and identify candidate sequences of diverged genome segments of these viruses based on the co-occurrence of transcripts between samples [12]. Our results emphasize the potential of high-throughput sequencing technology as a biological surveillance tool that can also identify emerging pathogens of BSF. In addition, this type of data provides a foundation to develop molecular diagnostic tools to actively monitor colonies for viral infection [13].

Sample Preparation and Sequencing
Larvae (n = 19), BSF digestate substrate (frass) (n = 17), and a sample of the diet (n = 1) with no larvae (spent grain) were sampled from industrial-scale rearing enclosures. Larvae were surface sterilized by dipping in 10% bleach solution for 1 min, followed by two successive deionized water washes for one minute. Following this, the intestinal tract (gut) was dissected from the larvae. RNA was purified from the larval guts, frass, and the initial spent grain diet using a Direct-zol™ RNA MiniPrep kit (Zymo, Irvine, CA), following the manufacturer's protocol, and RNA products were quantified with a Qubit ® 2.0 fluorometer (Invitrogen, Waltham, MA), and then ran on a gel to determine the RNA quality. A NEB Ultra RNA Library Kit (New England Biolabs, Ipswich, MA) was used to convert the RNA to cDNA, avoiding steps involving rRNA depletion and mRNA enrichment while preserving total RNA concentrations. cDNA libraries were multiplexed using NEBNext Oligos for Illumina (New England Biolabs, Ipswich, MA) to create five pooled libraries. The resulting multiplexed libraries were quality verified using a 4150 TapeStation System (Agilent, Santa Clara, CA) and then submitted for shotgun whole metatranscriptome sequencing using an Illumina HiSeq 2000 instrument (Illumina, San Diego, CA) generating 151 bp reads. Sequences were deposited in the Sequence Read Archive database under BioProject ID: PRJA976287.

Identification of Conserved Viral Sequences
To identify the putative viral sequences, all transcripts in the assembled transcriptomes were renamed by sample ID and a unique identifier and concatenated. Next, the concatenated transcriptome was clustered using CD-hit-EST [20] at a threshold of 95% identity (calculated as the number of identical bases divided by the length of the shorter sequence), a word size of 8, and a minimum length of 500 nt. The representative sequences from these clusters were then annotated using DIAMOND + BLASTx using the -very-sensitive flag and an e-value threshold of 1e-2 [21,22]. Sequences matching RNA viruses were considered of probable viral origin. To eliminate false positives, these sequences were further inspected to establish the presence of ORFs of viral origin using NCBI's ORFfinder (https://www.ncbi.nlm.nih.gov/orffinder/ accessed 5 March 2023). All large ORFs were translated and aligned to NCBI's nr protein database using the BLASTp webserver (https://blast.ncbi.nlm.nih.gov/Blast.cgi accessed 5 March 2023). Viral conserved protein domains were verified using NCBI's conserved domain database (https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml accessed 5 March 2023) and InterProScan (https://www.ebi.ac.uk/interpro/search/sequence/ accessed 5 March 2023) [23]. Sequences matching RNA viruses that passed the criteria listed above were considered as viral conserved sequences.

Identification of Diverged Viral Fragments
To find diverged genomic segments of viruses in the assembled transcriptomes, we implemented a method to identify transcripts that co-occur with the viral conserved sequences [12]. We assumed that diverged genomic segments would not have significant BLAST hits, so we only considered unannotated transcripts in the ensuing analysis. We wrote a Python script that determined all the samples where a viral conserved sequence is found. Next, we used two metrics that assessed how often any other transcripts co-occur with the viral conserved sequence: a viral co-occurrence metric (V co ) and a transcript co-occurrence metric (T co ). V co measures how frequently a transcript is found in the same sample as a viral conserved sequence and is used to filter out transcripts that are not frequently found in the sample as the viral conserved sequence. Mathematically, V co is equal to the cardinal number of the intersection of V and T divided by the cardinal number of V, where V is the set of samples where a viral conserved sequence is found, and T is the set of samples that a transcript occurs in (Equation (1)).
Because n(V ∩ T) ≤ n(V), V co ranges from 0 (where the transcript and viral conserved sequence of interest are never found in the same sample) to 1 (where the transcript and viral conserved sequence of interest are only found together). Intermediate values of V co (0 < V co < 1) indicate that the viral conserved sequence and the transcript co-occur in some samples but not in all.
The transcript co-occurrence metric (T co ) is a complementary metric that assesses whether a transcript is found in samples without the viral conserved sequence. T co is described by Equation (2) and is the cardinal number of the intersection of V and T divided by the cardinal number of T.
Because n(V ∩ T) ≤ n(T), T co ranges from 0 (where the transcript and the viral conserved sequence of interest are never found in the same sample) to 1 (where the transcript is only found with the viral conserved sequence). Intermediate values of T co (0 < T co < 1) indicate that the transcript is found in samples with the viral conserved sequence but also in additional samples. Ideally, T co and V co will be close to 1. In our case, we used a threshold of V co = 0.75 and T co = 0.5 to designate a transcript as putatively viral. The remaining transcripts were filtered by an ORF size > 500 nucleotides and inspected for other viral attributes using NCBI's conserved domain database, BLAST, and InterProscan.

Phylogenetic Analysis
To confirm the identity of the novel viruses, we performed a phylogenetic analysis using the RNA-dependent RNA-polymerase (RdRp) gene from a diverse selection of bunyaviruses. The amino acid sequence for each bunyavirus RdRp was downloaded from NCBI (https://www.ncbi.nlm.nih.gov/ accessed 5 March 2023). The ORF containing the novel viruses' RdRp was extracted and translated using NCBI's ORFfinder. All resulting amino acid sequences were aligned using mafft v7.490, using the E-INS-I algorithm and the -maxiterate option set to 1000 [24]. ModelFinder [25] was used through the IQ-TREE2 v2.0.7 to find the best evolutionary model, and IQ-TREE2 v2.0.7 was used to infer the most likely tree [26]. Branch support was measured using ultrafast bootstrap with 1000 replicates, the Shimodaira-Hasegawa-like approximate likelihood ratio test (SH-aLRT) with 1000 replicates, and the aBayes test [27][28][29].

Incorporation of Public Data
All publicly available BSF transcriptomic data (n = 116, Table S4) was downloaded using the SRA toolkit v.3.0.0 (https://hpc.nih.gov/apps/sratoolkit.html accessed 27 October 2022) and mapped to the candidate viral segments using BWA-mem v2.2.1 [30]. Resulting sequence alignment files (.SAMs) were converted to binary (.BAMs), sorted, and indexed using SAMtools v.1.6 [31]. The presence of viral transcripts in the public data was inspected manually using Integrative Genomics Viewer v.2.14.1 [32]. If a viral conserved sequence was found, the corresponding SRA dataset was processed using the same process described in Section 2.2, added to the concatenated metatranscriptome, and the co-occurrence analysis was run again.

Results
We mapped 13,352,390 reads generated from 37 BSF larvae and frass samples to the BSF reference genome (GCA_905115235.1) [16]. An overview of the samples and the percentage of the reads that mapped to the BSF genome is shown in Table 1. For each sample, we assembled the reads that did not map to the BSF genome, resulting in 37 metatranscriptomes. CD-hit clustering of the metatranscriptomes resulted in 7962 distinct sequence clusters, all of which shared greater than 95% sequence identity. The representative sequence from each cluster was annotated using diamond BLASTx, resulting in 7830 (98.3%) annotations. We filtered these results for viral hits and found two transcripts with significant hits to RdRps of arthropod-infecting bunyaviruses: one transcript was 5696 nucleotides in length, and the other was 4542 nucleotides. We hereon refer to the virus with the longer RdRp transcript as BSF uncharacterized bunyavirus-like 1, and the other we refer to as BSF Nairovirus-like virus 1. We detected BSF uncharacterized bunyavirus-like 1 sequences in the samples T6WA1, T6WA2, T6WC1, T6WR3, and T6WR4, and BSF Nairovirus-like 1 sequences in samples T6WA1, T6WA2, T6WC1, and T6WR4.  (Table S1). We confirmed the presence of the bunyaviral RdRp domain using the NCBI conserved domain database and Inter-Proscan. We found the BSF uncharacterized bunyavirus-like 1 encodes for a bunyavirus-like RdRp conserved domain between nucleotides 3950-4294 (E-value = 2.43 × 10 −6 ), and BSF Nairovirus-like 1 encodes for a bunyavirus-like RdRp conserved domain between nucleotides 2052-2888 (E-value = 1.10 × 10 −15 ). Other putative viral sequences, such as RNA phages, were found and were not included in our analyses because they are likely associated with microbes (Table S2) [33].
To phylogenetically analyze the two novel putative viruses, we translated their coding sequences and aligned them with 55 known bunyavirus RdRp amino acid sequences retrieved from NCBI to construct a phylogenetic tree using the VT+F+R7 amino acid substitution model [34], which was chosen as the best fit by ModelFinder (Figure 1). The sequences were chosen to represent all bunyavirus families documented in NCBI [35]. The phylogeny shows that one putative virus groups with two taxonomically uncharacterized arthropod-infecting bunyaviruses: Kristianstad virus (detected in mosquitoes) [36] and Beihai barnacle virus 6 (detected in barnacles) [37]. Interestingly, BSF Nairovirus-like virus 1 groups with the RdRp of Abu Hammad virus, which is in the Nairoviridae family that was isolated from a tick. Many of the viruses in the Nairoviridae family also infect vertebrates and can be transmitted by arthropod vectors [38]. We hereon refer to this putative virus as BSF Nairovirus-like 1. The full phylogenetic tree with no collapsed nodes is shown in Figure S1. Generally, bunyaviruses have a segmented genomic architecture with a large (L) segment containing the conserved RdRp domain, a medium (M) segment, and a small (S) segment. Because of the error-prone replication by RdRp and extreme selection pressures from host-virus interactions, the M and S segments of bunyaviruses can become very diverged and sometimes impossible to detect based on sequence similarity. We did not find any BLAST results that matched other viral M or S segments, so we employed an approach based on transcript co-occurrence across samples, similar to the approach taken by Batson et al. (2021) [12]. We assumed that if an L segment occurred in multiple samples, the M and S segments would also occur in those samples. Because BSF uncharacterized bunyavirus-like 1 occurred in five samples (out of 37) (Figure 2), and BSF Nairovirus-like 1 occurred in four samples (out of 37) (Figure 2), we were able to employ this method.
We wrote a Python script that takes a defined list of viral conserved sequences (the two novel bunyavirus RdRp transcripts in our case), determines the samples where the viral conserved sequence occurred, and returns a viral co-occurrence metric (Vco, Equation (1)) and a transcript co-occurrence metric (Tco, Equation (2)). Vco is a metric to ensure that a transcript occurs in the same samples as the viral conserved sequence, while Tco is a metric to assess whether a transcript co-occurs with the viral conserved sequence and no other samples. Ideally, Vco and Tco should equal 1 (i.e., a transcript perfectly occurs with a viral conserved sequence and only in those samples), but because of limited sequencing depth, we set the Vco threshold to 0.75 and the Tco threshold to 0.5 to identify candidate M and S segments. We assumed that diverged viral genome segments had no blast hits with Generally, bunyaviruses have a segmented genomic architecture with a large (L) segment containing the conserved RdRp domain, a medium (M) segment, and a small (S) segment. Because of the error-prone replication by RdRp and extreme selection pressures from host-virus interactions, the M and S segments of bunyaviruses can become very diverged and sometimes impossible to detect based on sequence similarity. We did not find any BLAST results that matched other viral M or S segments, so we employed an approach based on transcript co-occurrence across samples, similar to the approach taken by Batson et al. (2021) [12]. We assumed that if an L segment occurred in multiple samples, the M and S segments would also occur in those samples. Because BSF uncharacterized bunyavirus-like 1 occurred in five samples (out of 37) (Figure 2), and BSF Nairovirus-like 1 occurred in four samples (out of 37) (Figure 2), we were able to employ this method.
We wrote a Python script that takes a defined list of viral conserved sequences (the two novel bunyavirus RdRp transcripts in our case), determines the samples where the viral conserved sequence occurred, and returns a viral co-occurrence metric (V co , Equation (1)) and a transcript co-occurrence metric (T co , Equation (2)). V co is a metric to ensure that a transcript occurs in the same samples as the viral conserved sequence, while T co is a metric to assess whether a transcript co-occurs with the viral conserved sequence and no other samples. Ideally, V co and T co should equal 1 (i.e., a transcript perfectly occurs with a viral conserved sequence and only in those samples), but because of limited sequencing depth, we set the V co threshold to 0.75 and the T co threshold to 0.5 to identify candidate M and S segments. We assumed that diverged viral genome segments had no blast hits with an evalue higher than our 1 × 10 −2 threshold, so only unannotated transcripts were considered. We found five candidate genomic segments for uncharacterized BSF bunyavirus-like 1, and seven candidate genomic segments for BSF Nairovirus-like 1. These transcripts were filtered by the presence of complete ORFs greater than 500 nucleotides in length. This left us with four candidates for each viral genome, but they were the same for both BSF uncharacterized bunyavirus-like 1 and BSF Nairovirus-like 1.
Viruses 2023, 15, x FOR PEER REVIEW 7 of 11 an e-value higher than our 1 × 10 −2 threshold, so only unannotated transcripts were considered. We found five candidate genomic segments for uncharacterized BSF bunyaviruslike 1, and seven candidate genomic segments for BSF Nairovirus-like 1. These transcripts were filtered by the presence of complete ORFs greater than 500 nucleotides in length. This left us with four candidates for each viral genome, but they were the same for both BSF uncharacterized bunyavirus-like 1 and BSF Nairovirus-like 1. Figure 2. Viral genome occurrence across samples. Occurrence matrix showing the samples in which the putative viral sequences discovered in this study were found. Black boxes indicate that a viral genome segment was present, and grey boxes indicate that a viral genome segment was absent. The full sample occurrence matrix is shown in Figure S2.
To determine which candidate genomic segments go with each virus, we mined all available BSF transcriptomic data by mapping their reads to the putative L segments. No SRA reads mapped to BSF Nairovirus-like 1, but reads from six datasets mapped to BSF uncharacterized bunyavirus-like 1 (Table S3). Notably, all the public data this virus occurred in were processed in the same lab as the samples used in this study, but for a different experiment (BioProject Accession: PRJNA542977). The datasets containing BSF uncharacterized bunyavirus-like 1 were processed in the same way as the reads generated in our study (i.e., trimmed with Trimmomatic, quality checked with FastQC, and assembled with Trinity), and the transcripts were added to the co-occurrence analysis. After repeating the co-occurrence analysis with the SRA data included, BSF uncharacterized bunyavirus-like 1 only had one candidate that met our criteria. This sequence was also present in the first co-occurrence analysis. Two candidate diverged viral segments appeared with BSF Nairovirus-like 1 with Vco = 1 and Tco = 1. Thus, we assigned these two transcripts as the M and S segments for BSF Nairovirus-like 1, which are 2837 and 922 nucleotides long, respectively ( Figure 3A). We assigned the transcript identified with BSF uncharacterized bunyavirus-like 1 as its S segment, as it is 853 nucleotides long ( Figure 3B). the putative viral sequences discovered in this study were found. Black boxes indicate that a viral genome segment was present, and grey boxes indicate that a viral genome segment was absent. The full sample occurrence matrix is shown in Figure S2.
To determine which candidate genomic segments go with each virus, we mined all available BSF transcriptomic data by mapping their reads to the putative L segments. No SRA reads mapped to BSF Nairovirus-like 1, but reads from six datasets mapped to BSF uncharacterized bunyavirus-like 1 (Table S3). Notably, all the public data this virus occurred in were processed in the same lab as the samples used in this study, but for a different experiment (BioProject Accession: PRJNA542977). The datasets containing BSF uncharacterized bunyavirus-like 1 were processed in the same way as the reads generated in our study (i.e., trimmed with Trimmomatic, quality checked with FastQC, and assembled with Trinity), and the transcripts were added to the co-occurrence analysis. After repeating the co-occurrence analysis with the SRA data included, BSF uncharacterized bunyaviruslike 1 only had one candidate that met our criteria. This sequence was also present in the first co-occurrence analysis. Two candidate diverged viral segments appeared with BSF Nairovirus-like 1 with V co = 1 and T co = 1. Thus, we assigned these two transcripts as the M and S segments for BSF Nairovirus-like 1, which are 2837 and 922 nucleotides long, respectively ( Figure 3A). We assigned the transcript identified with BSF uncharacterized bunyavirus-like 1 as its S segment, as it is 853 nucleotides long ( Figure 3B).

Discussion
In this study, we discovered two novel putative bunyaviruses associated with BSF. To our knowledge, they are only the second and third viruses associated with BSF. Although we cannot confirm if these viruses are pathogenic to BSF, diagnostic molecular tools can be used to investigate the prevalence of these viruses and if they coincide with BSF disease symptoms. We show that both viruses share most recent common ancestors with other insect viruses, although both viruses were only found in frass samples. We suspect two reasons why these viruses were only found in frass samples. (1.) The virus might infect tissues other than the gut, which are the only BSF tissues used in this study. (2.) Sequencing depth might not have been sufficient to detect these viruses in larval tissues. Alternatively, these viruses may infect components of the BSF frass, as bunyaviruses are diverse and can infect a broad range of hosts (Abudurexiti et al. 2019). Either way, our phylogeny suggests that these are arthropod-infecting viruses, as their closest phylogenetic relatives also infect arthropods. Interestingly, BSF Nairovirus-like 1 is phylogenetically related to viruses that infect vertebrates, which could be a point of concern for food and feed safety. Additionally, we detected BSF uncharacterized bunyavirus-like 1 in 11 datasets (five from this study, six from a previous study, BioProject Accession: PRJNA542977, Table S3). These datasets were all produced in a single lab, though from different BSF experiments conducted in differing years and in different rearing locations. However, the BSFs utilized for each of these experiments were from the same original colony. It will therefore be important to understand the broader distribution of this virus across different BSF strains. Future studies should be conducted to confirm the host specificity and pathogenicity of these viruses to ensure that they do not affect BSF rearing.

Discussion
In this study, we discovered two novel putative bunyaviruses associated with BSF. To our knowledge, they are only the second and third viruses associated with BSF. Although we cannot confirm if these viruses are pathogenic to BSF, diagnostic molecular tools can be used to investigate the prevalence of these viruses and if they coincide with BSF disease symptoms. We show that both viruses share most recent common ancestors with other insect viruses, although both viruses were only found in frass samples. We suspect two reasons why these viruses were only found in frass samples. (1.) The virus might infect tissues other than the gut, which are the only BSF tissues used in this study. (2.) Sequencing depth might not have been sufficient to detect these viruses in larval tissues. Alternatively, these viruses may infect components of the BSF frass, as bunyaviruses are diverse and can infect a broad range of hosts (Abudurexiti et al. 2019). Either way, our phylogeny suggests that these are arthropod-infecting viruses, as their closest phylogenetic relatives also infect arthropods. Interestingly, BSF Nairovirus-like 1 is phylogenetically related to viruses that infect vertebrates, which could be a point of concern for food and feed safety. Additionally, we detected BSF uncharacterized bunyavirus-like 1 in 11 datasets (five from this study, six from a previous study, BioProject Accession: PRJNA542977, Table S3). These datasets were all produced in a single lab, though from different BSF experiments conducted in differing years and in different rearing locations. However, the BSFs utilized for each of these experiments were from the same original colony. It will therefore be important to understand the broader distribution of this virus across different BSF strains. Future studies should be conducted to confirm the host specificity and pathogenicity of these viruses to ensure that they do not affect BSF rearing.
The main limitations of this study relate to the data used, as our sequencing data had poor depth. This may be why we were unable to obtain a V co and T co of one for BSF uncharacterized bunyavirus-like 1, hence why we could not identify a candidate M segment. On the other hand, we show that novel viruses can be discovered even at low sequencing depths, especially when frass samples are used. Finally, this method can be easily applied to any system, although it is optimal to have a high-quality genome to reduce the amount of host reads.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/v15081654/s1, Figure S1: Fully uncollapsed bunyavirus phylogenetic tree of the Bunyavirus RdRps used in this study; Figure S2: Transcript co-occurrence across samples; Table S1: Top five BLAST hits of the two novel bunyavirus L segments discovered in this study; Table S2: Putative microbe-associated viruses found within BSF and frass samples; Table S3: SRA data that were positive for BSF uncharacterized bunyavirus-like 1; Table S4: List of all SRA data used in this study by accession number.  Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: All reads generated from this study were deposited in the NCBI sequence read archive under BioProject ID: PRJNA976287. Python code for co-occurrence analysis can be found at https://github.com/hunterkwalt/virus_co-occur accessed on 20 June 2023.