The Potential Functions of Protein Domains during COVID Infection; An Analysis and a Review

Coronaviruses (CoVs) are a large viral family that can evolve rapidly, giving rise to new strains that cause outbreaks and loss of life, including SARS-CoV, MERS-CoV, and SARS-CoV-2 (the cause of COVID-19). CoVs encode a varying number of proteins, ranging from 5 proteins in bat CoV to 14 in SARS-CoV, which could have implications for viral tropism and pathogenicity. Here, we highlight the functional protein motifs (domains) that could contribute to coronavirus infection and severity, including that of SARS-CoV-2. For this purpose, we used datasets of experimentally validated domains (motifs) that are known to be crucial for viral infection. Then, we highlight the potential molecular pathways and interactions of SARS-CoV-2 proteins within human cells. Interestingly, the C-terminal of the SARS-CoV-2 nsp1 protein encodes an MREL motif, which is a signature motif of the tubulin superfamily and regulates tubulin expression. The C-terminal region of nsp1 protein can bind to ribosome and regulation viral

Phylogenetic analysis shows that coronaviruses are conserved within the same subgroup [1]. However, coronaviruses encode a varying number of proteins, ranging from 5 proteins in bat CoV to 14 in SARS-CoV (Supplementary Table S1). They evolve rapidly, giving rise to new strains that cause outbreaks and loss of life. Despite the high homology among CoV genomes, mutation of one or more nucleotides may lead to significant changes in short protein motifs or domains. In particular, newly isolated viruses (despite their evolutionary relationships with other CoVs) may encode new proteins of unknown function and utilize different molecular interactions and pathways within the host cell [5][6][7][8][9]. These domains could contribute to viral virulence and the ability to infect a wide range of host cells.

Materials and Methods
Here, we highlight the role of functional protein domains and the potential pathways that could be triggered during CoV infection. To perform this analysis, the full protein sequences of 32 coronaviruses, including SARS-CoV and MERS-CoV, in addition to 10 protein sequences of the newly isolated SARS-CoV-2, were obtained from the NCBI database during 2020 and are listed in Supplementary Table S1. It is worth noting that, for precise genome coordinates, only sequences that are reviewed and annotated in the UniProt database were selected. Additionally, some proteins are too small for functional or structural motifs to be detected; these proteins were therefore excluded.
The functional motifs in the coronavirus proteomes were identified using exact text-finding (mining) as implemented in the Shetti and Shetti-Motif tools; for the detailed method, see [6]. In short, we used datasets of functional motifs that have been experimentally validated to be involved in viral infection and virulence; see reviews [7][8][9]. Additional protein domains were downloaded from the PROSITE database (https://prosite.expasy.org/ (accessed on April 2020)). Together, these form a dataset of over 1500 experimentally validated motifs and domains (Supplementary Table S2). It is worth noting that the patterns of the protein motifs are automatically listed and loaded into the Shetti-Motif tool. The whole proteomes were loaded into the tool as well. The tool searches for the motifs within the whole proteomes of the CoVs [6]. The output of the tool was collected and used to construct a table in which the columns represent the CoV proteomes and the rows represent the patterns (motifs), together with the number of occurrences of each pattern. If a protein encodes (harbors) multiple instances of a pattern, it is counted as one instance (Table S3). If no protein in the proteome encodes the motif, the cell is filled with zero. The number of motif-containing proteins was then normalized to the total number of proteins in the proteome (expressed as a percentage of the total number of proteins). The results are shown as a matrix of proteomes and the number of motif-containing proteins (in percent) in each proteome (Tables S3-S5). Details of the motif-containing proteins, protein names, and locations are shown in Tables S6-S8. Secondly, we aimed to visualize the differences between SARS-CoV-2 and its evolutionarily close relatives, such as bat CoV, human CoV, SARS-CoV, and MERS-CoV. Together, these form a matrix of 17 proteomes. We collected the number of occurrences of the motifs in each of these 17 proteomes (Table S4).
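The counting and normalization steps described above can be sketched in a few lines of Python. This is only an illustration, not the Shetti-Motif implementation: the proteomes, sequences, and the function name `percent_motif_containing` are invented for the example, and the motifs are written directly as regular expressions in which `.` plays the role of `x` (any residue).

```python
import re

# Hypothetical toy proteomes: virus name -> {protein name: sequence}.
# The sequences are invented for illustration, not real CoV proteins.
PROTEOMES = {
    "virus_A": {"S": "MKPPAYLLRGDTT", "N": "MSDNGPAQATK", "E": "MYSFVSEETG"},
    "virus_B": {"S": "MKLLVTRGDRGD", "N": "MSTTPPEYAA"},
}

# Motifs written as regular expressions; "." stands for "x" (any residue).
MOTIFS = {"PPxY": r"PP.Y", "RGD": r"RGD", "PxQxT": r"P.Q.T"}


def percent_motif_containing(proteomes, motifs):
    """Return {motif: {virus: percent of proteins with at least one match}}."""
    table = {}
    for motif_name, pattern in motifs.items():
        regex = re.compile(pattern)
        row = {}
        for virus, proteins in proteomes.items():
            # A protein counts once, however many times the motif occurs in it.
            hits = sum(1 for seq in proteins.values() if regex.search(seq))
            # Normalize to the total number of proteins in the proteome.
            row[virus] = 100.0 * hits / len(proteins)
        table[motif_name] = row
    return table


TABLE = percent_motif_containing(PROTEOMES, MOTIFS)
```

In this toy input, the S protein of `virus_B` carries RGD twice but contributes a single count, so `TABLE["RGD"]["virus_B"]` is 50.0 (one of two proteins).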
The matrix consists of motif-proteome enrichment (i.e., motifs as rows and their number of occurrences in each proteome as columns). The data were clustered using hierarchical clustering, and the heat map was constructed using the default settings of the MeV tool (http://mev.tm4.org/ (accessed on April 2020)). The heat map clusters the viruses that harbor similar motif patterns and visualizes the number of occurrences in each proteome. In principle, viruses that encode similar motifs (domains) could share the same viral tropism and molecular pathways. Finally, as a proof of concept, we visualize the structures of some domain-containing viral proteins bound to host proteins.
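To make the clustering step concrete, the following is a minimal sketch of agglomerative hierarchical clustering on motif-occurrence profiles. It is a stand-in rather than a reproduction of the MeV analysis: average linkage and Euclidean distance are assumptions (the MeV defaults may differ), and the virus names and counts are invented.

```python
from itertools import combinations
from math import dist

# Hypothetical motif-occurrence profiles (one count per motif, same motif order).
PROFILES = {
    "virus_A": [2, 0, 1, 3],
    "virus_B": [2, 0, 1, 2],
    "virus_C": [0, 4, 0, 0],
    "virus_D": [0, 3, 1, 0],
}


def average_linkage(profiles):
    """Agglomerative clustering with average linkage.

    Returns the merge order as a list of (cluster_1, cluster_2, distance),
    where each cluster is a sorted tuple of virus names.
    """
    clusters = [(name,) for name in sorted(profiles)]
    merges = []

    def cluster_distance(a, b):
        # Mean Euclidean distance over all cross-cluster pairs of members.
        pairs = [(x, y) for x in a for y in b]
        return sum(dist(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters, then repeat on the reduced list.
        a, b = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        merges.append((a, b, cluster_distance(a, b)))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return merges


MERGES = average_linkage(PROFILES)
```

With these toy profiles, `virus_A` and `virus_B` merge first (they differ by one count in a single motif), then `virus_C` and `virus_D`, and the two pairs join last. This is the same logic by which viruses with similar motif content end up in one clade of the heat map.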

Coronaviruses Encode Wide-Range of Functional Motifs
In the first part, we aimed to highlight the different types of protein motifs (domains) in each coronavirus, which could help to identify the domain(s) that contribute to viral virulence, as described in the Materials and Methods and shown in Figure 1A. The results suggest that CoVs encode a wide range of functional motifs that differ from one virus to another; even closely related viruses encode different motifs (Tables S3-S5). Clustering based on the number of occurrences of the motifs in the proteome could reveal the potential virus-host protein interactions. In addition, viruses that encode similar motifs could share similar viral tropism. We found that SARS-CoV-2 and bat-SL-CoVZC45 clustered in one clade close to SARS-CoV-1 (Figure 1B,C). This finding is consistent with the fact that SARS-CoV-1 and 2 are evolutionarily closely related, infect the same host, and could have been transmitted from the same zoonotic animal. Details of these motifs and proteins are found in Tables S3-S8. Interestingly, although SARS-CoV-1 and 2 are evolutionarily related, they encode some different motifs, which suggests that they have different viral tropism within the host cells. For example, two motifs are required for cell signaling transduction, which can recognize the TRAF2 protein (PxQxT motif) or PDZ domain-containing proteins (KTxxx[W/I]), where x means any residue, and [W/I] or [WI] means a W or I residue. These two motifs are deleted in SARS-CoV-2 but encoded in the closely related bat-SL-CoVZC45 virus.
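The motif notation defined above (x for any residue, [W/I] or [WI] for alternatives) maps directly onto regular expressions. The helper below is a hypothetical sketch of that translation, not part of the Shetti-Motif tool; the function name `motif_to_regex` and the test peptides are invented for illustration.

```python
import re


def motif_to_regex(motif):
    """Translate the motif notation used in the text into a regex string.

    'x' means any residue; '[W/I]' and '[WI]' both mean W or I.
    """
    out = []
    in_brackets = False
    for ch in motif:
        if ch == "[":
            in_brackets = True
            out.append(ch)
        elif ch == "]":
            in_brackets = False
            out.append(ch)
        elif ch == "/" and in_brackets:
            continue  # the slash only separates alternatives
        elif ch == "x" and not in_brackets:
            out.append(".")  # any single residue
        else:
            out.append(re.escape(ch))
    return "".join(out)


TRAF2_REGEX = motif_to_regex("PxQxT")     # "P.Q.T"
PDZ_REGEX = motif_to_regex("KTxxx[W/I]")  # "KT...[WI]"

# Invented peptides: the first ends the motif with W, the second with G.
HAS_PDZ = re.search(PDZ_REGEX, "AAKTAAAWAA") is not None
LACKS_PDZ = re.search(PDZ_REGEX, "AAKTAAAGAA") is None
```

Lower-case `x` is treated as the wildcard because amino acid residues are written in upper case throughout the text.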
However, SARS-CoV-2 encodes multiple domains that can recognize ubiquitination proteins, such as E3 ubiquitin ligases (E3 Ub), Elongin C (ELOC), TRAF6, and SIAH1 (SLxxxLxxxI, PxExxE, or PxAxV motifs). Note that the closely related viruses do not harbor these motifs. Additionally, the canonical PPxY motif is widely encoded and utilized by viruses to hijack the cellular machinery, reviewed in [7]. The PPxY motif is needed to recruit NEDD4 E3 ubiquitin ligases for protein degradation and the endosomal sorting complexes required for transport (ESCRT) pathway. The ESCRT pathway is crucial for the budding of HIV-1 and paramyxoviruses and their exit from the cell. Additionally, adenoviruses utilize the PPxY motif during cell entry and cellular trafficking. In coronaviruses, the surface glycoprotein (S) of SARS-CoV-2 harbors a PPxY motif. Two proteins of MERS-CoVs and three proteins of Erinaceus CoV harbor the PPxY motif, whereas other human CoVs and SARS-CoV-1 do not encode these motifs (Table S3).
Coronaviruses encode other canonical ESCRT-interacting motifs, such as P[T/S]AP, [F/I/L/V]PxV, YxxL, and LYPxL. Although the ORF1ab polyprotein of SARS-2 harbors LYPTL, LPGV, and VPFV motifs, the virus does not encode the P[T/S]AP motif, which is encoded only by MERS-CoV EMC/2012 (Table S3). The P[T/S]AP motif recruits the TSG101 protein [7,10-14]. In fact, quinolone antiviral therapeutics, such as FGI-104, FGI-103, FGI-106, and chloroquine, can target the ESCRT pathway and viral budding.
SARS-1 and 2, but not MERS, encode a motif that recognizes host cell factor 1 (HCFC1), which is crucial for regulating the cell cycle. Additionally, two Cys-rich motifs are predicted to link the spike and envelope proteins [15]. SARS-1 and 2, but not the MERS subgroup, encode these two motifs. Other C-rich motifs, which are needed for baculovirus virion production and nucleocapsid assembly, are encoded by multiple coronaviruses (Table S3). On the other hand, coronaviruses harbor multiple integrin-binding (RGD) motifs. In the absence of RGD, viruses may utilize other motifs to attach to cellular receptors and enter host cells, such as KGE, LDV, LDI, and SDI (Table S3).

The Potential CoV-Human Protein Interactions Pathways Based on Functional Motif
We used the functional motifs encoded by each of the SARS-CoV-2 proteins to predict the potential molecular interactions and pathways within the host cell. For this part of the analysis, we searched for the motif patterns within the CoV proteomes using the Shetti-Motif tool, as described before. Then, we constructed a matrix of SARS-2 proteins (columns) versus the motifs and their number of occurrences (rows) (Table S9). The table was used to construct a manually curated schematic diagram, which represents the SARS-CoV-2 proteins, the motifs they encode, and the potential interactions that have been validated to be utilized during viral infections (Figure 2). We observed that SARS-2 encodes multiple Ub- and SUMO-binding motifs, including the PPxY motif, which is localized on the surface (S) protein. Recruiting ubiquitin proteins is essential to degrade antiviral proteins and to hijack the immune response mounted by the cells. SARS-2 encodes motifs that recognize heparan sulfate (HS), which is required for post-internalization events of viral entry, discussed in [7]. The lung epithelium and endothelium are covered with a layer of heparin sulfate (closely related in structure to heparan), glycoproteins, and glycolipids, the so-called endothelial glycocalyx. Our results support the hypothesis that SARS-CoV-2 could adhere to the heparin sulfate of the glycocalyx layer, which could be a potential drug target [16,17]. As mentioned above, ORF1ab harbors an MREI-like tubulin motif, which will be discussed later. In addition, it harbors the canonical motif for protein trafficking and a nuclear localization signal (NLS), which is essential for trafficking through the nuclear membrane.
On the other hand, the furin endoprotease belongs to a group of proprotein convertases that cleave precursor proteins into their activated form. It binds to the canonical motif RxRK/R||x, where || denotes the cleavage site. Coronaviruses encode an R||S motif, such as the RRRR||S motif, which interacts with furin, leading to proteolytic activation of the spike protein and viral entry into host cells [18]. In addition, the R||S motif is essential for syncytium formation [18]. ORF1ab is the largest polyprotein encoded by SARS-2 and is auto-proteolytically processed into 16 non-structural proteins (nsp1 to nsp16), reviewed in [4] (Figure 3A). ORF1ab harbors at least seven Rx(0-3)RS motifs (an R, followed by zero to three residues of any kind, followed by RS), which could correspond to the cleavage of the polyprotein into the non-structural proteins.
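A scan for the Rx(0-3)RS pattern can be sketched as a plain string search. The code below reads the pattern as an R, then zero to three arbitrary residues, then R||S, and reports the index of the S that follows each putative cut; this reading, the function name `candidate_cut_indices`, and the short test peptides are assumptions made for illustration. A match only marks a candidate site and says nothing about whether cleavage actually occurs.

```python
def candidate_cut_indices(sequence):
    """Return 0-based indices of the S residue in every Rx(0-3)RS match.

    A position qualifies when 'RS' starts at index i and at least one more R
    lies within the four residues before it (i-4 .. i-1), which covers gaps
    of zero to three residues between the two R residues.
    """
    cuts = []
    for i in range(len(sequence) - 1):
        if sequence[i:i + 2] == "RS" and "R" in sequence[max(0, i - 4):i]:
            cuts.append(i + 1)  # the cut (||) lies between index i and i + 1
    return cuts


# Invented test peptides; indices refer to positions within each peptide only.
SITES_ONE = candidate_cut_indices("NSPRRARSVAS")
SITES_TWO = candidate_cut_indices("RRSRS")
SITES_NONE = candidate_cut_indices("GRSAA")
```

Scanning every "RS" and looking backwards avoids a pitfall of a single greedy regular expression such as `R.{0,3}RS`, which can skip a shorter match that starts at the same residue as a longer one.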
SARS-2 harbors clathrin-binding motifs and clathrin adaptor protein (AP)-binding motifs, which are required for endocytosis. Coronaviruses enter cells by fusion or endocytosis, which may require clathrin for some strains, reviewed in [19]. The integrins (ITGs) and heparan sulfates could have a role in entry into the host cell. Notably, HIV-1 utilizes AP-binding motifs to direct anti-tetherin (BST2) to the lysosome and antagonize the antiviral immune response, reviewed in [7]. Regarding cellular signaling, the ORF1b polyprotein harbors motifs involved in multiple cellular signaling pathways, including JAK-, MAPK-, TRADD-, TRAF6-, and caspase-binding motifs. Caspases and TRAFs are linked with inflammation and apoptosis [20,21]; therefore, multiple viruses (e.g., herpesviruses and influenza) hijack the caspase pathways to regulate programmed cell death.
Among the interesting motifs, ORF1ab harbors the canonical motif for binding palmitoyl acyltransferase, in addition to multiple thiol disulphide and Cys-rich motifs, which are needed for protein palmitoylation [22]. Notably, the envelope (E) proteins of some coronavirus strains have been shown to be palmitoylated [23][24][25]. Myristoylation is another lipid post-translational modification. ORF1ab harbors the canonical MGxxxS motif for binding N-myristoyltransferase (NMT1), which adds a myristoyl group to the N-terminal glycine residue of proteins [26]. Notably, myristoylation is crucial for the egress of some viruses, including coronaviruses [27], as well as for activation or inhibition of the immune response by phosphorylation of tyrosine residues in ITAM and ITIM (immunoreceptor tyrosine-based activation and inhibition motifs, respectively) [26]. SARS-2 proteins encode both ITAM and ITIM motifs.
SARS-2 proteins harbor multiple domain-interacting motifs, such as motifs recognizing PDZ, SH2, and SH3 domains [29-31,33] (Figure 3B). These motifs are essential for cellular signaling and protein trafficking within host cells. For example, the PDZ-binding and DLLV motifs in the SARS-1 E protein can influence the subcellular localization of PALS1 (MPP5), which may disrupt the tight junctions and apicobasal polarity of the cell [30,31,34] (Figure 3B). The L-rich and PDZ-binding motifs are used by multiple viruses to recruit mTORC1 kinase signaling to initiate translation [7]. In SARS-2, the S, N, M, ORF1ab, and ORF7b proteins harbor additional L-rich motifs (Table S9). ORF7b harbors an LxxLL motif, which is crucial for HIV retro-transposition and for papillomavirus-induced oncogenesis and cell transformation. Additional motifs such as [RKY]xxPxxP or RxxK can interact with host SH3 domains, leading to endosome sorting. For example, HIV and HCV interact with tyrosine kinases through SH2 and SH3 domain-interacting motifs, e.g., PxxPxR [8,9]. Although some human CoVs and MERS-CoV encode PxxPxR or [RKY]xxPxxP motifs, SARS-CoV-1 and 2 do not encode these motifs.
An interesting phenomenon is the ability of a virus to disturb host transcription for the sake of viral gene expression. An example is the adenovirus E1A oncoprotein, which regulates host transcription by binding to the transcription regulators histone acetyltransferase/CREB-binding protein (p300/CBP) through the Fx[DE]xxxL motif [35,36]. Similarly, the adenoviral E1A and the Epstein-Barr virus EBNA2 oncoproteins interact with the C-terminal MYND domain of the ZMYND11 (BS69) transcription regulator, which is facilitated by the PxLxP motif in the viral proteins [37]. The transcription regulation motifs Fx[DE]xxxL and PxLxP are observed in the ORF1ab and N protein sequences (but not in ORF1a), suggesting potential roles in virus replication. Furthermore, oncoproteins with a conserved LxCxE motif can phosphorylate and inactivate the retinoblastoma protein (RB1), leading to initiation of gene expression and virus replication, reviewed in [7]. Although the motif is only five residues long, ORF1ab is the only protein that harbors it. An additional contributor to viral replication is the ORF3a protein, which harbors the HCFC1-binding motif. HCFC1 regulates the cell cycle by recruiting the regulator p300 and histone deacetylases (HDACs).

The SARS-CoV-2 Nsp1 Protein Encodes for a Tubulin MREL Motif
The ORF1ab polyprotein is a long protein that is cleaved in host cells into 16 non-structural proteins (nsp1 to nsp16). Among all coronaviruses, only nsp1 of SARS-2 harbors the tubulin-beta mRNA autoregulation signal motif (PROSITE accession: PS00228) in the C-terminal. Tubulins are able to auto-regulate their expression through the binding of polymerized tubulin protein to the MREI motif of nascent tubulin peptides, which, by an unknown mechanism, can terminate the translation of tubulin [38]. Recent studies suggest that polymerized tubulin may bind to an unknown mediator or to ribonucleases, which causes ribosome stalling and terminates translation, keeping the level of tubulin in the cell constant [38]. The exact biochemical and structural mechanism by which tubulins recognize the motif and regulate translation remains to be discovered.
It is worth noting that all tubulins harbor the MR[E/D][I/L] motif at the N-terminal end, which is thought to be a signature of the tubulin superfamily. The motif is found in few proteins: it is detected in 219 proteins that belong to the tubulin family, 11 of which are human tubulins. The motif is also found in 269 other proteins (which do not belong to the tubulins), seven of which are encoded by humans, including the RNA-binding protein TNRC6A, the tyrosine-protein kinase MUSK, and the phospholipid phosphatase PLPP4.
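Because the position of the motif matters here (N-terminal in tubulins, C-terminal in nsp1), a search can label each match by where it falls in the sequence. The sketch below is illustrative only: the function name `motif_locations`, the `tail` cutoff of 10 residues for what counts as C-terminal, and the test sequences are all assumptions, not values from the analysis.

```python
import re

# MR[E/D][I/L] written as a regular expression.
TUBULIN_MOTIF = re.compile(r"MR[ED][IL]")


def motif_locations(sequence, tail=10):
    """Label each MR[E/D][I/L] match as N-terminal, C-terminal, or internal.

    'tail' (an assumed cutoff) is how many final residues count as C-terminal.
    Returns a list of (0-based start index, label).
    """
    labels = []
    for m in TUBULIN_MOTIF.finditer(sequence):
        if m.start() == 0:
            labels.append((m.start(), "N-terminal"))
        elif m.end() > len(sequence) - tail:
            labels.append((m.start(), "C-terminal"))
        else:
            labels.append((m.start(), "internal"))
    return labels


# Invented sequences: one starts with the motif, one carries it near the end,
# one carries it near the start but not at position 0.
AT_START = motif_locations("MREIVHIQAGQCGNQ")
NEAR_END = motif_locations("AAAAAAAAAAAAAAAAMRELNGGAA")
INSIDE = motif_locations("AAMRDLAAAAAAAAAAAAAAAA")
```

A stricter tubulin-style check would accept only the match at index 0, whereas the nsp1 observation corresponds to a match that falls within the final stretch of the protein.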
To test the hypothesis that this motif is ubiquitous in viruses, we searched for the tubulin motif in herpesvirus, iridovirus, and poxvirus proteins. The search shows that some of these large viruses (whose genomes are 10-20 times larger than that of CoV) do not encode the tubulin motif, while some of these large proteomes harbor only one copy. Together, this shows that the motif is not widely encoded by either humans or viruses. Moreover, the motif could have regulatory functions other than the regulation of tubulin expression; for example, it could have a role in the regulation of translation.
On the other hand, the nsp1 protein is thought to have a role in viral RNA translation. It can bind to the 40S ribosome complex to inhibit host RNA translation via endonucleolytic cleavage at the 5′ UTR [32,39]. Structural analysis shows binding between the MREL domain and the P residue of the S30 protein (Figure 3C). Interestingly, MRELNGG is conserved among some coronaviruses, including SARS-CoV-2. However, closely related coronaviruses, such as SARS-CoV and MERS-CoV, encode T, L, or I residues instead of M (Figure 3D,E). The function of the C-terminal domain and the MREL tubulin motif, and their roles in viral virulence, deserve to be addressed in future studies.

Discussion
This analysis highlights the fact that coronaviruses encode a wide range of motifs that could help the virus to trigger new molecular functions and interactions. Therefore, the protein domain/motif datasets could help future research to discover new aspects of MERS, SARS-1, and SARS-2 infection. Furthermore, we used functional motifs to predict the potential interactions of SARS-CoV-2 proteins (Figure 2).
Consistent with our analysis, a recent interactomics study shows that SARS-CoV-2 proteins interact with ubiquitin ligases, kinases, and lipid modification proteins, as well as with proteins that contain zinc finger, SH2, SH3, and PDZ domains [28] (Supplementary Table S10). Additionally, PDZ-binding motifs (PBMs) are crucial during SARS-CoV infection. To study the implication of the PBM for infection, the PBM located in the SARS-1 E protein was mutated and deleted [40]. Astonishingly, the virus restored the motif after several passages in vitro or in mice; however, a mutated PBM in the nsp1 protein leads to virus attenuation. We noticed in our results that a virus could harbor multiple copies of the same motif, such as PDZ-binding motifs and motifs that bind SH2 and SH3 domains, which may contribute to the severity of the virus. Notably, ORF1ab is the largest protein and contains a copy of almost all binding motifs. One could suggest that multiple copies of the same motifs are kept on the ORF1ab protein to restore these motifs in case of their drastic loss; however, this possibility remains to be validated.
Finally, the advantage of our in silico analysis is the use of a dataset of motifs (domains) that have been experimentally validated by multiple methods for other viruses, which increases its robustness, discussed in [5][6][7]. The resulting datasets can facilitate future functional studies and can enrich attempts to understand coronavirus infection and to find antiviral drugs. Most of the domains used in the analysis conform to a structure that increases their importance during binding with protein domains of host cells. Interestingly, viruses could acquire motifs from an evolutionarily distant virus or organism. It is of interest to study the molecular mechanisms that govern the transfer of protein motifs and domains, which would help to predict future and emerging pathogens.

Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/covid1010032/s1, Table S1. List of proteomes included in the analysis; Table S2. List of pattern motifs downloaded from the ExPASy database, or motifs that were experimentally validated and reviewed (Davey, et al. 2011, Hraber, et al. 2020, Sobhy 2016 and Sobhy 2017); Table S3. Results showing the percentage of proteins harbouring certain motifs per coronavirus proteome; Table S4. The data used to construct Figure 1B,C; Table S5. Results showing the percentage of proteins harbouring certain motifs per SARS-CoV-2 proteome; Table S6. Details of the motifs encoded by SARS coronavirus NS-1; Table S7. Details of the motifs encoded by human betaCoV 2c EMC/2012 (MERS-CoV); Table S8. Details of the motifs encoded by SARS-CoV-2 isolate Wuhan-Hu-1; Table S9. The matrix showing the presence of the motifs in the SARS-2 proteins. The data used to construct Figure 1D; Table S10. Examples of SARS-CoV-2 interacting proteins from Gordon et al. A SARS-CoV-2 protein interaction map reveals targets for drug repurposing. Nature (2020).