Bioinformatics Analysis Identifies a Small ORF in the Genome of Fish Nidoviruses of Genus Oncotshavirus Predicted to Encode a Novel Integral Protein

Abstract: Genome sequence analysis of Atlantic salmon bafinivirus (ASBV) revealed a small open reading frame (ORF) predicted to encode a 110-amino-acid (aa) Type I membrane protein with an N-terminal cleaved signal sequence, likely an envelope (E) protein. Bioinformatic analyses showed that the predicted protein is strikingly similar to the coronavirus E protein in structure. This is the first report to identify a putative E protein ORF in the genome of members of the genus Oncotshavirus (subfamily Piscavirinae, family Tobaniviridae, order Nidovirales); if the protein is expressed, Tobaniviridae would be the third family (after Coronaviridae and Arteriviridae) within the order to have the E protein as a major structural protein.


Introduction
Nidoviruses that infect fish are classified into two families, Coronaviridae and Tobaniviridae, order Nidovirales [1]. The virus order designation is derived from the Latin "nidus", referring to the 3′-coterminal nested set of subgenomic mRNAs that characterizes its genome transcription [2]. Nidovirus particles are enveloped and contain long, single-stranded, positive-sense, polycistronic RNA genomes of ~12-41 kb, the largest among RNA viruses [3,4]. These viruses cause important diseases in many hosts, including humans, other mammals, birds, pythons, shrimp, and fish. Their relevance has skyrocketed with the emergence of coronavirus disease 2019 (COVID-19), an extremely infectious pandemic disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). As of 01 October 2021, this pandemic had infected ~234.8 million people worldwide, with more than 4.8 million deaths [5]. While most currently known nidovirus species are associated with terrestrial hosts, the greatest phylogenetic diversity of nidoviruses is associated with hosts living in aquatic environments [6].
We isolated a novel salmonid nidovirus from farmed Atlantic salmon; its ultrastructural and genomic characteristics placed it in the genus Oncotshavirus (subfamily Piscavirinae, family Tobaniviridae, order Nidovirales), and we named it Atlantic salmon bafinivirus (ASBV) [7,8]. Oncotshaviruses are among the least studied nidoviruses, with very little known about their host range, pathogenicity, geographical distribution, and replication kinetics. To date, genomic sequences of seven oncotshaviruses isolated from various freshwater fish species have been deposited in the GenBank database (Table 1). All the genomes show an organization characteristic of the family Tobaniviridae [9], with a 5′ leader sequence followed by four major open reading frames (ORFs) that encode the putative replicase polyprotein (pp1ab) and the putative structural proteins spike (S), membrane (M), and nucleocapsid (N) [10,11]. Like the subfamily Torovirinae, the subfamily Piscavirinae is known to lack a homolog of the coronavirus (CoV) E protein [12]. The difference in the presence of E protein has been used to explain the structural differences between the coronaviruses and toroviruses, and E gene mutants of mouse hepatitis virus (a coronavirus) were shown to display bacilliform morphology [13] resembling that of members of the subfamilies Torovirinae and Piscavirinae [14]. Here, we describe the structure and topology of a novel integral membrane protein encoded by a small ORF in the ASBV genome hitherto unknown in Tobaniviridae. We show that this ORF is present in members of the genus Oncotshavirus but not in Bafinivirus. Furthermore, the predicted protein is strikingly similar to the CoV E protein in structure. Biological validation of these predictions and elucidation of the role of E protein in the life cycle of oncotshaviruses are planned in future experiments.
Thus, in future experiments, we will establish whether the ASBV E protein is produced during virus replication by investigating the kinetics of both transcript and protein production by virus-infected fish cell lines and cells transfected with recombinant plasmids encoding the ASBV E protein.
The ASBV E transcripts would be quantified by RT-qPCR whereas the viral proteins would be detected by Western blotting or immunoprecipitation with antibody reagents to ASBV or ASBV E protein.

Materials and Methods
Nucleotide and amino acid sequences of 15 selected nidoviruses were obtained from the GenBank database [19]. Similarity analysis of the nucleotide sequences was performed using BLAST programs available via the National Center for Biotechnology Information [20]. The phylogenetic analysis was performed using the CLUSTAL X package [21-23]. Different sequence sets, both amino acid and nucleotide, were explored to find stable conserved areas. The sequence of Equine arteritis virus (EAV) (genus Alphaarterivirus, family Arteriviridae; GenBank accession number: NC_002532) was chosen as the outgroup to determine the root of the phylogenetic trees. Bootstrapping was performed to estimate the confidence level of the branches of these phylogenetic trees. The number of bootstrapping trials was 1000; random numbers were used as seeds to simulate the random processes. Transmembrane topology predictions were obtained using DeepTMHMM [24] and TOPCONS [25], two web-based predictors that separate signal peptides from N-terminal transmembrane domains [26,27]. Any predicted signal peptides and the locations of their cleavage sites were confirmed using SignalP 5.0 [28]. NetNGlyc-1.0 [29] was used to check for N-linked glycosylation sites. GPS-Palm [30] was used for the prediction of S-palmitoylation sites. Phyre2 and AlphaFold2 [31] were used to build a 3D model of the predicted protein structure.
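The bootstrapping step described above resamples the columns of a multiple sequence alignment with replacement, once per trial. The following is a minimal sketch of that resampling only, not of the CLUSTAL X implementation; the helper name `bootstrap_alignment`, the toy alignment, and the use of consecutive integers as seeds are all assumptions made for illustration.

```python
import random


def bootstrap_alignment(alignment, seed):
    """Resample alignment columns with replacement (one bootstrap trial).

    `alignment` maps a sequence name to its aligned sequence; all aligned
    sequences must have the same length. Hypothetical helper, for
    illustration only.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must have equal length")
    n_cols = lengths.pop()
    rng = random.Random(seed)  # a fixed seed makes the trial reproducible
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}


# Toy alignment (made-up sequences, not data from this study).
toy = {"seqA": "MKT-LLV", "seqB": "MRTALLV", "seqC": "MKSALIV"}

# 1000 trials, as in the Methods; seeds 0..999 are an assumption.
replicates = [bootstrap_alignment(toy, seed) for seed in range(1000)]
```

A tree would then be built from each replicate, and the support for a branch is the percentage of replicate trees in which that branch appears.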

Results and Discussion
In the current study, we further analyzed the genomic sequence of ASBV VT01292015-09. This sequence was obtained by next-generation sequencing (NGS) on the Illumina® HiSeq 2000 platform (LC Sciences, Houston, Texas, USA) and completed by filling in the gaps between the assembled contigs using conventional RT-PCR, with 5′-RACE and 3′-RACE used to obtain the 5′ and 3′ terminal sequences, respectively [7]. The full-length viral genome is 26,492 nt. It begins with GCA at the 5′ terminus, followed by a 5′ untranslated region of 846 nt, and ends with a 3′ untranslated region of 220 nt, including a poly(A) tail of 23 nt. The four major ORFs identified using the ORF Finder program [32], which are shown in Figure 1, are in the following order: ORF 1ab encodes the putative replicase polyprotein, nt 847 to 14,361 for pp1a and nt 14,496 to 21,338 for pp1b (pp1ab, 6778 aa); S, nt 21,341 to 24,958 (1205 aa); M, nt 24,975 to 25,712 (245 aa); and N, nt 25,736 to 26,272 (178 aa). In addition, a small ORF (marked as E), nt 21,510 to 21,846 (110 aa), is present between the pp1ab and M ORFs. Similar to coronaviruses [33,34] and toroviruses [35], ASBV uses a "slippery" heptanucleotide, 14,341 TTTAAAC 14,347, at a ribosomal frameshift (RFS) site, resulting in a −1 frameshift from ORF1a to ORF1b (Figure 1).

Figure 1. Boxes represent major open reading frames (ORFs). The proteins encoded by the ORFs are indicated within, above, or below the boxes (pp1a and pp1b, putative replicase polyproteins 1a and 1b; E, putative envelope protein (p12.7); S, spike protein; M, membrane protein; and N, nucleocapsid protein). The arrow indicates the position of the putative ribosomal frameshift (RFS) "slippery" sequence. The 5′ capped mRNA with a leader sequence is depicted by a small black box. The poly(A) tail is indicated by A(n).
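The slippery heptanucleotide reported above is given in 1-based, inclusive genome coordinates (14,341 to 14,347 spans exactly seven nucleotides). As a minimal sketch of how such a motif can be located and reported in the same coordinate convention, the snippet below scans a sequence for TTTAAAC; the function name `find_motif` and the toy sequence are assumptions for illustration and do not represent the ASBV genome.

```python
def find_motif(genome, motif="TTTAAAC"):
    """Return 1-based inclusive (start, end) coordinates of every motif hit.

    RNA input is accepted: U is converted to T before searching.
    """
    genome = genome.upper().replace("U", "T")
    hits = []
    pos = genome.find(motif)
    while pos != -1:
        # Python offsets are 0-based; genome annotations are 1-based.
        hits.append((pos + 1, pos + len(motif)))
        pos = genome.find(motif, pos + 1)
    return hits


toy_genome = "GCAAGGTTTAAACGGGTTCA"  # made-up sequence for illustration
print(find_motif(toy_genome))  # [(7, 13)]
```

Applied to the full genome sequence, a hit at (14341, 14347) would correspond to the site described in the text; a motif match alone does not establish that frameshifting occurs there.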
The predicted structure of the ASBV E protein is shown, consisting of three domains: the amino-terminal domain (N-terminal, aa positions 1-50), the putative transmembrane α-helical hydrophobic domain (aa positions 51-79), and the carboxy-terminal domain (C-terminal, aa positions 80-110). The ASBV E protein is predicted to be a Type I membrane protein with an N-terminal cleaved signal sequence. The residues of the two hydrophobic domains are in green: one, in italics (aa positions 4-22), lies in the signal peptide region (pink, aa positions 1-25), and the other forms the transmembrane domain (green, aa positions 51-79). Together they would impose a hairpin topology on the ASBV E protein before the N-terminal signal peptide is cleaved off. A signal peptidase cleavage site is marked with a star (*) at residue R26 in purple; it is proposed to be in the ER lumen/virion exterior, resulting in a Type I membrane protein (i.e., a mature protein with the N-terminus on the ER lumen/virion exterior and the C-terminus exposed to the cytoplasmic side/virion interior). The predicted palmitoylation site at residue C84 is marked with a star (*) in orange.
The small ORF in the ASBV genome is predicted to encode an α-helical transmembrane protein (110 aa). It is an integral viral membrane protein [36], likely an E protein that was previously overlooked in genomes of members of the family Tobaniviridae, probably because of its small size. This ORF is present in members of the genus Oncotshavirus but not in Bafinivirus. Moreover, the gene order is conserved in oncotshaviruses, as expected of a structural protein, although expression of the different structural protein ORFs of oncotshaviruses has not yet been experimentally verified. Relative to the ASBV E protein (110 aa), the other six oncotshavirus isolates (Table S1) have a truncated E protein (86 aa, i.e., a shorter C-terminus). However, all the essential features of the putative integral membrane protein are conserved in the seven oncotshaviruses. As shown in Fig . Some of these mutations may be due to the complex transcription characteristics of nidoviruses [37-39], or they could be sequencing errors.
Several lines of evidence lend support to the view that the ASBV small ORF is authentic and encodes an integral membrane protein. Bioinformatic analyses of the amino acid sequence using the latest and most accurate software tools, DeepTMHMM [24] and TOPCONS [25], show that the predicted protein is strikingly similar to the CoV E protein in structure [40-42], albeit unique to members of the genus Oncotshavirus. The CoV E protein is a Type I membrane protein with a single TMD [40,43,44] and does not have a canonical cleaved signal sequence [40,45], whereas, as defined by Goder and Spiess [46], the ASBV E protein is predicted to be a Type I membrane protein with an N-terminal cleaved signal sequence, as shown in Figures 1 and 2. Thus, the N-terminus consists of a signal peptide (SP) (aa positions 4-24) followed by 25 amino acids, then a TMD of 29 residues (aa positions 51-79) and a C-terminus of approximately 30 amino acids (aa positions 80-110) (Figure 1). The signal peptide was confirmed using SignalP 5.0 software [28]. The proposed topology of the putative ASBV E protein is illustrated in Figure 2B. The hydrophobic regions in the SP and the TMD impose a hairpin topology on the ASBV E protein before the N-terminal SP is cleaved off at residue R26. The cleavage site at R26 was confirmed using PeptideCutter software [47] and is +3 amino acids from the hydrophobic segment of the SP, which conforms to the structure of a typical cleavable amino-terminal signal sequence [28]. Moreover, the location and size of the predicted SP region are characteristic of signal sequences in eukaryotes (usually 16 to 30 amino acid residues in length and comprising a hydrophilic, usually positively charged N-terminal region, a central hydrophobic domain, and a C-terminal region with the cleavage site for signal peptidase [48]).
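The two hydrophobic regions discussed above (in the SP and the TMD) are the kind of feature that a hydropathy scan is designed to reveal. The sketch below uses a classical Kyte-Doolittle sliding window, which is a far simpler technique than DeepTMHMM or TOPCONS and is not the method used in this study. The window of 19 residues and the cutoff of 1.6 are commonly used values taken here as assumptions, and the function names and toy sequence are hypothetical.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=19):
    """Mean hydropathy of every full-length window along the sequence.

    Raises KeyError on non-standard residue codes (e.g., X or gaps).
    """
    scores = [KD[aa] for aa in seq.upper()]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


def hydrophobic_segments(seq, window=19, threshold=1.6):
    """Merge overlapping windows scoring >= threshold into 1-based segments."""
    segments = []
    for i, mean in enumerate(hydropathy_profile(seq, window)):
        if mean < threshold:
            continue
        start, end = i + 1, i + window
        if segments and start <= segments[-1][1] + 1:
            # Window overlaps or touches the previous segment: extend it.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments


# Made-up sequence: a 19-residue hydrophobic stretch at positions 6-24,
# flanked by polar and charged residues.
toy = "MKRSD" + "LIVALLAVIGLLVAAIFLL" + "KRDEQSNKRD"
print(hydrophobic_segments(toy))
```

Because windows are merged, a reported segment can extend beyond the hydrophobic core itself; dedicated predictors refine such boundaries and also distinguish signal peptides from transmembrane helices, which this scan cannot do.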
The mature protein would have the N-terminus on the ER lumen/virion exterior and the C-terminus exposed to the cytoplasmic side/virion interior, as illustrated by Goder and Spiess [46] and in Figure 2B. The predicted 3D structure of the ASBV E protein shown in Figure 2C removes any ambiguity because it is similar to that of the SARS-CoV-2 E protein [49].

Figure 2. (B) The proposed topology of the ASBV E protein, which is predicted to be a Type I membrane protein with an N-terminal cleaved signal sequence. The hydrophobic regions in the signal peptide and the transmembrane domain impose a hairpin topology on the ASBV E protein before the N-terminal signal peptide is cleaved off. The mature protein would have the N-terminus on the ER lumen/virion exterior and the C-terminus exposed to the cytoplasmic side/virion interior, with a putative palmitoylation site at residue C84 in the C-terminus. The proposed signal peptidase cleavage site at residue R26 is marked with X. (C) The predicted 3D structure of the ASBV E protein, built using Phyre2 and AlphaFold2 [31] and homology modelling to the SARS-CoV-2 E protein structure [49]. The ASBV E protein model includes the cleavable signal peptide, which is absent in SARS-CoV-2 E. Image coloured by rainbow from N to C terminus. Model dimensions (Å): X: 45.023, Y: 73.238, Z: 39.549.
The E protein amino acid sequence found in different CoVs is variable [40], but the predicted structure and functional properties are highly conserved [42]. In this study, we were unable to construct phylogenetic trees that included the E gene sequences of all 15 selected nidoviruses (Figure 2A) because we could not find evidence that they were related. However, we were able to generate two almost identical phylogenetic trees, one based on amino acid sequences and another on nucleotide sequences, revealing two major clades consisting of the seven members of genus Oncotshavirus (Table 1) and two members of genus Betacoronavirus, lineage B (SARS-CoV and SARS-CoV-2) [1] (Figure 3). The oncotshavirus group of seven sequences and the betacoronavirus group of two sequences are stable, and we are confident in both. For the rest of the sequences, the relationships with these two groups cannot be determined through analyses of the E protein. Such relationships might be found by analyzing other protein or nucleotide sequences, such as the replicase polyprotein and nucleoprotein sequences, and any failures demonstrate the limitations of these phylogenetic approaches in dealing with the diversity of nidoviruses [6].
The ASBV E protein is predicted to be non-glycosylated. To date, among the CoV E proteins, only the SARS-CoV E protein has been shown to be glycosylated (residue N66), although this constituted an alternative minor form [42,50]; the glycosylation of SARS-CoV E protein during actual infection and its biological function remain to be further investigated [43]. A post-translational modification of more functional importance for E proteins is palmitoylation. Bioinformatic analysis of the ASBV E amino acid sequence using the GPS-Palm software, which predicts palmitoylation sites in proteins [30], indicated that, under the high threshold, residue C84 in the C-terminus is a palmitoylation site. Of the CoV E proteins, only those of infectious bronchitis virus (IBV), SARS-CoV, and mouse hepatitis virus (MHV) have been shown to be palmitoylated [51-53].
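N-linked glycosylation requires the sequon N-X-S/T, where X is any residue except proline, so a sequence with no sequon has no candidate site at all. The snippet below is a minimal sketch of that motif scan only; it is not the NetNGlyc method, which additionally scores the sequence context of each sequon. The function name `find_sequons` and the toy sequence are assumptions for illustration.

```python
import re

# The lookahead allows overlapping sequons (e.g., "NNSS") to be reported.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_sequons(seq):
    """Return (1-based position of N, tripeptide) for each N-X-S/T sequon."""
    return [(m.start() + 1, m.group(1))
            for m in SEQUON.finditer(seq.upper())]


toy = "MNGTKLNPSANASW"  # made-up sequence, not the ASBV E protein
print(find_sequons(toy))  # [(2, 'NGT'), (11, 'NAS')]
```

In the toy sequence, N7 is followed by proline and is therefore skipped. The presence of a sequon is necessary but not sufficient for glycosylation, which also depends on whether that part of the protein is exposed to the ER lumen.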
Similar to the CoV E protein [48], the ASBV E protein may belong to the class of small viral integral proteins called 'viroporins' that oligomerize within the membrane bilayer to form channels that facilitate the transport of ions or small molecules. They include the M2 protein of influenza virus, the 6K protein of alphaviruses, Vpu of HIV-1 [36], p7 ion channel of Hepatitis C virus (HCV) [54,55], as well as protein 3A of CoVs [56]. They are thought to function in various ways to facilitate the assembly and release of new viral particles from the infected cells [36].
In summary, this is the first report to identify a putative E protein ORF in Tobaniviridae; if the protein is expressed, Tobaniviridae would be the third family (after Coronaviridae and Arteriviridae) within the order Nidovirales to have the E protein as a major structural protein. Experiments to biologically validate these predictions and elucidate the role of E protein in the life cycle of oncotshaviruses are planned.

Figure 3. Phylogenetic trees of the E sequences of the seven members of genus Oncotshavirus (Table 1) and two members of genus Betacoronavirus, lineage B (SARS-CoV and SARS-CoV-2), with the Equine arteritis virus (EAV) E protein sequence (GenBank accession number: NC_002532) as the outgroup to determine the root. The graphical editor for phylogenetic trees, TreeGraph 2 [57], was used to produce the figures. Each bootstrapping value corresponds to the branch on the same vertical level. Only bootstrapping supports higher than 70% are marked in the phylogenetic trees. Abbreviations at the end of each sequence correspond to the virus names: ASBV = Atlantic salmon bafinivirus VT01292015-09; CSBV-NIDO = Chinook salmon bafinivirus NIDO; CSBV Cefas-W054; CSBV WHQSR4345; CSBV HB93; YCBV = Yellow catfish bafinivirus Shaoxing; PFO-1 = Pelteobagrus fulvidraco oncotshavirus-1 ZJLH18531; SARS-CoV = Severe acute respiratory syndrome coronavirus Tor2; SARS-CoV-2 Wuhan-Hu-1; and EAV = Equine arteritis virus Bucyrus.

Conflicts of Interest:
The authors declare no conflict of interest.