A Divergent Articulavirus in an Australian Gecko Identified Using Meta-Transcriptomics and Protein Structure Comparisons

The discovery of highly divergent RNA viruses is compromised by their limited sequence similarity to known viruses. Evolutionary information obtained from protein structural modelling offers a powerful approach to detect distantly related viruses based on the conservation of tertiary structures in key proteins such as the RNA-dependent RNA polymerase (RdRp). We utilised a template-based approach for protein structure prediction from amino acid sequences to identify distant evolutionary relationships among viruses detected in meta-transcriptomic sequencing data from Australian wildlife. The best predicted protein structural model was compared with the results of similarity searches against protein databases. Using this combination of meta-transcriptomics and protein structure prediction we identified the RdRp (PB1) gene segment of a divergent negative-sense RNA virus, denoted Lauta virus (LTAV), in a native Australian gecko (Gehyra lauta). The presence of this virus was confirmed by PCR and Sanger sequencing. Phylogenetic analysis revealed that Lauta virus likely represents a newly described genus within the family Amnoonviridae, order Articulavirales, that is most closely related to the fish virus Tilapia tilapinevirus (TiLV). These findings provide important insights into the evolution of negative-sense RNA viruses and structural conservation of the viral replicase among members of the order Articulavirales.


Introduction
The development of next-generation sequencing (NGS) technologies, including total RNA sequencing (meta-transcriptomics), has revolutionized studies of virome diversity and evolution [1][2][3]. Despite this, the discovery of highly divergent viruses remains challenging because of the often limited (or no) primary sequence similarity between putative novel viruses and those for which genome sequences are already available [4][5][6]. For example, it is possible that the small number of families of RNA viruses found in bacteria, as well as their effective absence in archaeabacteria, in reality reflects the difficulties in detecting highly divergent sequences rather than their true absence from these taxa [3].
The conservation of protein structures in evolution and the limited number of proteins folds (fold space) in nature form the basis of template-based protein structure prediction [7], providing a powerful way to reveal the origins and evolutionary history of viruses [8,9]. Indeed, the utility of protein

Sample Collection
A total of seven individuals corresponding to the reptile species Carlia amax, Carlia gracilis, Carlia munda, Gehyra lauta, Gehyra nana, Heteronotia binoei, and Heteronotia planiceps were collected alive in 2013 from Queensland, Australia. Specimens were identified by mtDNA typing and/or morphological data. Livers were harvested and stored in RNAlater at -80 • C before downstream processing. All sampling was conducted in accordance with animal ethics approval (#A2012/14) from the Australian National University and collection permits from the Parks and Wildlife Commission of the Northern Territory (#45090), the Australian Government (#AU- COM2013-192), and the Department of Environment and Conservation (#SF009270).

Sampling Processing and Sequencing
RNA extraction was performed using the RNeasy Plus minikit (Qiagen, Hilden, Germany) following the manufacturer's instructions. Each of the seven livers was extracted individually and then pooled in equal amounts. For RNA sequencing, ribosomal RNA (rRNA) was depleted using the RiboZero (epidemiology) depletion kit and libraries were prepared with the TruSeq stranded RNA library prep kit before sequencing on an Illumina HiSeq 2500 platform (100 bp paired end reads). Library preparation and sequencing was performed by the Australian Genome Research Facility (AGRF), generating a total of 22,394,787 paired end reads for the pooled liver RNA library.

Protein Structure Prediction for Virus Detection
To further screen the meta-transcriptomic data, all the assembled sequences below the assigned threshold (e-value ≥ 10 −5 ) were assigned as "orphan" contigs (n = 293,586). These were then analysed using a protein structure-informed approach. Specifically, orphan contigs were translated into all six open reading frames (ORFs) using the getorf program [29] to identify continuous ORFs of at least 1000nt in length (n = 57). To detect distant sequence homologies and predict viral protein structures, this subset of translated ORFs was then analysed using a template-based modelling approach as implemented in Phyre2 (http://www.sbg.bio.ic.ac.uk/phyre2) [30]. In brief, target proteins were compared against proteins of known structure via homology modelling and fold recognition, followed by loop modelling and sidechain fitting [30]. In total, 6 of 14 confident (i.e., confidence values > 90%) matches to known viral structures were identified. These included a single match to the RdRp of a vertebrate-associated virus, and the queried contig was selected for downstream analyses. Annotations from the predicted model were used as preliminary data for tentative taxonomic assignment and protein classification. The structural alignment between the PDB of the predicted model and the PDB of the template was performed using TM-align v.20190822 [31] with default settings, and visualized using PyMOLv.2.3.5 [32].

Phylogenetic Analysis
The predicted contig encoding the RdRp of the newly discovered virus was aligned with reference protein sequences of the order Articulavirales (Table S3). A multiple-amino acid sequence alignment was performed using the E-INS-i algorithm as implemented in the MAFFT v7.450 program [33]. Selection of the best-fit model of amino acid substitution was carried out using the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) with the standard model selection option (-m TEST) in IQ-TREE [34]. Phylogenetic analysis of these data was then performed using the maximum likelihood (ML) method available in IQ-TREE, with node support estimated with the ultra-fast bootstrap (UFBoot) approximation (1000 replicates) and the Shimodaira-Hasegawa approximate likelihood ratio test (SH-aLRT). Sequencing reads are available at the NCBI Sequence Read Archive (SRA) under the Bioproject PRJNA626677 (BioSample: SAMN14647831; sample name: VERT7; SRA: SRS6507258). The assembled sequence for the newly determined Lauta virus was deposited in GenBank under the accession number MT386081.

PCR Validation
To validate the presence of the novel gecko amnoonvirus, and to identify the putative host species, we screened the individual liver RNA using RT-PCR. Briefly, cDNA was prepared using Superscript IV VILO master mix and RT-PCR was performed with the Platinum SuperFi Green PCR master mix and two primers sets targeting the gecko RdRp contig-F2V7 and F3V7 (Table S4). The resultant RT-PCR products were analysed by agarose gel electrophoresis and validated by Sanger sequencing.

Virus Discovery Using Meta-Transcriptomics and Protein Structural Features
We employed a meta-transcriptomic approach to screen a single pooled library containing liver RNA of seven Australian native reptile species (Gehyra lauta, Carlia amax, Heteronotia binoei, Gehyra nana, Carlia gracilis, Carlia munda, and Heteronotia planiceps). We focused on the de novo-assembled contigs that had no significant hits using initial searches against the NCBI nucleotide and non-redundant databases. Accordingly, of 293,586 orphan contigs, 57 contained translatable ORFs of more than 1000 nt in length, and because we hypothesized that some may correspond to undetected virus sequences, we interrogated them using a protein structure prediction approach with template-based modelling (TBM) in Phyre2 [30]. From the 57 queried contigs, we obtained a 3D model of a 407 amino acid (1227 bp) contig with a high confidence hit (98.3%) to the RdRp catalytic subunit of a bat influenza A virus (family Orthomyxoviridae) (Table 1, Figure 1a,b). This level of confidence is indicative of a high probability of modelling success. Predicted secondary structures for the modeled protein corresponded to α-helix (50%) and β-strand (9%) conformations. In addition, the alignment coverage between our query and the viral template (PDB identifier: 4WSB) corresponded to 52% (213 residues) of the query Viruses 2020, 12, 613 5 of 10 sequence, while the proportion of identical amino acids (i.e., sequence identity) was 19% (Table 1). Despite this low sequence similarity, we observed common folding patterns in the palm domain of the RdRp between the aligned protein structures (Figure 1a). To corroborate these findings, the structural results were compared with those obtained from other analyses based on primary sequence similarity searches against public databases (Table 1). This revealed matches to the RdRp subunit (PB1 gene segment) of different members of the order Articulavirales, including influenza A virus (FLUAV), TiLV, and infectious salmon anaemia virus (ISAV). Comparisons of the assembled contigs against a custom database containing only members of the Articulavirales were then performed to improve sequence alignments. Accordingly, the best hit matches were obtained to TiLV (e-values < 10 −15 ) ( Table 1). To identify additional viral segments, the assembled contigs were aligned to the ten segments of TiLV using DIAMOND. A total of 87 contigs were scored across the genome, although we did not recover any significant hit for segments 2-10 likely because they are so divergent in sequence (Table S1).

Sequence Alignment and Phylogenetic Relationships
We tentatively name the new virus identified here as Lauta virus (reflecting the species name of the gecko in which it was identified), abbreviated as LTAV. Multiple sequence alignment of the RdRp between Lauta virus and other members the order Articulavirales identified a number of well conserved amino acid motifs (I-IV) ranging in length from 5 to 11 amino acids in length (Figure 2). Phylogenetic analysis of the aligned RdRp region revealed that LTAV falls within the order Articulavirales and, along with TiLV (family Amnoonviridae), comprises a distinct monophyletic group. The close relationship between LTAV and TiLV was supported by high UFBoot/SH-aLRT values (99%/99%) (Figure 1c).
Likewise, estimates of the amino acid identity in the RdRp showed a closer (but still distant) sequence similarity (15.35%) with TiLV than other members of the order Articulavirales (Table 2).

Host Association and In Vitro Validation
Lauta virus was initially identified in the pooled sequencing library comprising a mix of several Australian reptile species. To identify the exact host species, we screened each individual species sample separately using RT-PCR and Sanger sequencing. As a result, we detected the presence of the novel Lauta virus RdRp sequence in liver tissue of G. lauta (paratype QM J96622) ( Figure S1), a gecko species native to north-western Queensland and the north-eastern Northern Territory in Australia [35].

Host Association and In Vitro Validation
Lauta virus was initially identified in the pooled sequencing library comprising a mix of several Australian reptile species. To identify the exact host species, we screened each individual species sample separately using RT-PCR and Sanger sequencing. As a result, we detected the presence of the novel Lauta virus RdRp sequence in liver tissue of G. lauta (paratype QM J96622) ( Figure S1), a gecko species native to north-western Queensland and the north-eastern Northern Territory in Australia [35].

Discussion
Advances in protein modelling and sequence analysis based on structural comparisons with well-characterized protein templates constitute an attractive approach for the identification of highly divergent RNA viruses [30]. The RdRp is ubiquitous in RNA viruses with different genomic architectures and replication strategies, showing a conserved core with sequence motifs that adopt specific folds. The protein is critically required for RNA synthesis and replication in RNA viruses (i.e., template recognition, initiation, elongation and regulation) [15]. As proteins such as the RdRp play such a central role in the life-cycle of RNA viruses, it is expected that structures and key motifs for catalytic functionality will be relatively well conserved through evolutionary history [36,37]. Based on this premise, it is expected that template-based protein structure modelling could be a powerful tool in the identification of highly divergent viruses [7,30,38]. Accordingly, we used protein structural similarity in combination with sequence and a profile similarity to identify a novel and divergent RNA virus in an Australian gecko (G. lauta).
We obtained a confident predicted 3D model for the RdRp of Lauta virus based on its structural similarity with the RdRp subunit PB1 of influenza virus (family Orthomyxoviridae) (Figure 1a,b; Table 1). Although the structural data suggested that Lauta virus belonged to the family Orthomyxoviridae (order Articulavirales) [30], additional sequence analysis revealed a closer relationship to members of the Amnoonviridae (Figure 1c). In this context, it is important to recall that biases in taxonomic assignment can occur because of the limited number of available proteins with known structures in the PDB. Although this is clearly a limitation, template-based approaches offer a tractable starting point for virus discovery and its taxonomic classification.
Although compromised by the large evolutionary distances involved, phylogenetic analysis among members of the order Articulavirales revealed that Lauta virus was most closely related to TiLV, in turn suggesting that it represents a novel and divergent (and as yet unnamed) genus within the Amnoonviridae. To date, members of the Amnoonviridae have only been detected in fish [17], such that the discovery of Lauta virus expands the host range of this family. Indeed, given the huge genetic distance between TiLV and LTAV, we expect that further uncharacterized phylogenetic diversity exists in the Amnoonviridae especially in fish and reptiles, and that more studies using the form of genomic surveillance performed here will capture a far greater diversity of negative-sense RNA viruses [6,39].
Comparisons of the RdRp subunit PB1 from different articulaviruses revealed the presence of four well conserved motifs in Lauta virus, broadly consistent with observations made for TiLV [17]. As suggested by several studies, motifs I-IV are critically implicated in the catalytic activity of PB1 [40,41]. Despite minor variations, we identified the serine-aspartic acid-aspartic acid (SDD) sequence in motif III that is presumed to be essential for protein functionality in FLUV [40,41]. Hence, the presence of well conserved motifs I-IV across the order Articulavirales may constitute effective molecular fingerprints for these viruses. Unfortunately, the marked lack of sequence similarity meant that we did not recover any conclusive evidence regarding the presence of other genome segments in Lauta virus. Further studies that include sequencing, microscopy, and cell culture techniques are, therefore, required to fully characterize the genome of this novel virus.
The identification of a novel virus in an Australian gecko (G. lauta) highlights the importance of virus surveillance in native species. Although Lauta virus was detected in liver tissue, we currently cannot draw any conclusions regarding its pathogenic potential and impact on the health of G. lauta, particularly since a limited number of individuals were collected and all were apparently healthy. Additional research is therefore needed to establish the type of biological interaction between Lauta virus and G. lauta. While a previous study reported the isolation of the arbovirus Charleville virus (family Rhabdoviridae) in G. australis (possibly G. dubia based on its distribution) collected in Queensland [36,37], this is the first report of a divergent articulavirus in reptiles. Taken together, these findings hint at a hidden diversity of RNA viruses in reptiles that remains to be characterized.

Supplementary Materials:
The following are available online at http://www.mdpi.com/1999-4915/12/6/613/s1. Figure S1: PCR detection and host association of Lauta virus. (a,b) Agarose gels electrophoresis showing PCR products from two sets of primers that target a region in the PB1 gene segment (RdRp). Samples correspond to (c) liver tissue from seven different reptile species. A 355 bp PCR product was only amplified in G. lauta. Table  S1: Summary of the contig alignment to genomic segments of TiLV using DIAMOND. The relative abundance of each transcript was also calculated (see Methods). Table S2: Summary of hits recovered after alignment of the untranslated contigs with reference protein sequences of the RdRp subunit PB1. The custom database included virus reference sequences from the order Articulavirales.