Ontological Analysis of Coronavirus Associated Human Genes at the COVID-19 Disease Portal

The COVID-19 pandemic prompted a parallel upsurge in the scientific literature about SARS-CoV-2 infection and its health burden. The Rat Genome Database (RGD) created a COVID-19 Disease Portal to leverage information from the scientific literature. In the COVID-19 Portal, gene-disease associations are established by manual curation of PubMed literature. The portal contains data for nine ontologies related to COVID-19, an embedded enrichment analysis tool, and links to a toolkit. Using this information and these tools, we performed analyses on the curated COVID-19 disease genes. As expected, Disease Ontology enrichment analysis showed that the COVID-19 gene set is highly enriched with coronavirus infectious disease and related diseases. However, other less related diseases were also highly enriched, such as liver and rheumatic diseases. Using the comparison heatmap tool, we found that about 60 percent of the COVID-19 genes were associated with nervous system disease and about 40 percent were associated with gastrointestinal disease. Our analysis confirms the role of the immune system in COVID-19 pathogenesis, as shown by substantial enrichment of immune system related Gene Ontology terms. The information in RGD’s COVID-19 Disease Portal can generate new hypotheses to potentiate novel therapies and prevention of acute and long-term complications of COVID-19.


Introduction
Coronavirus Disease 2019 (COVID-19) was declared a global pandemic in March 2020 and now, two years later, the disease still plagues the globe due to rapid mutation of the causal virus, SARS-CoV-2, and inadequate preventive measures. Individuals infected with the virus exhibit a wide spectrum of symptoms, from asymptomatic infection to acute respiratory distress requiring hospitalization [1]. The major manifestation of COVID-19 is in the respiratory system, where the virus infects the nasal mucosa and spreads into the host body [2,3]; however, non-respiratory organs such as the liver, heart, kidney and brain are also involved in certain patients and, in severe cases, this involvement results in multiple organ failure and death [4]. ACE2, the receptor for SARS-CoV-2, has a broad distribution in tissues such as blood vessels [5], small intestine, heart, kidney, thyroid, adipose and testis [6]. These ACE2-expressing organs become targets of SARS-CoV-2 infection. On cell entry, binding of the viral spike protein to ACE2 initiates signaling pathways that promote inflammatory mediator production [7]. Other non-ACE2 mediated infection

Materials and Methods
Targeted curation of literature related to coronavirus infection was performed at RGD using the in-house curation tool [18] integrated with the OntoMate [21] literature searching tool. A standalone OntoMate tool is also accessible at https://rgd.mcw.edu/QueryBuilder/. The prioritized disease gene list was constructed as previously described [22], with added COVID-19-related genes from other sources such as the Gene Ontology Consortium (GOC) (http://geneontology.org/covid-19.html (accessed on 1 June 2020)) and LitCovid (https://www.ncbi.nlm.nih.gov/research/coronavirus/ (accessed on 1 May 2020)). In brief, three databases, MalaCards (https://www.malacards.org/ (accessed on 1 May 2020)), DisGeNET (https://www.disgenet.org/ (accessed on 1 May 2020)) and PhenoPedia (https://phgkb.cdc.gov/PHGKB/startPagePhenoPedia.action (accessed on 1 May 2020)), were queried for coronavirus disease, related viral diseases, and other infectious diseases. The purpose of the queries was to find human genes associated with those diseases in the biomedical literature. A prioritized list of genes was made based on the appearance of the genes in multiple databases and the combined number of publications connected to each gene-disease association across those databases. A small subset of unique coronavirus disease-related genes not found in multiple databases was curated first, before curation of the main infectious disease gene list began. The gene symbol and the disease term "coronavirus infectious disease" were used to find publications associated with coronavirus disease. Using ontological approaches, OntoMate retrieves publications tagged with coronavirus infectious disease or any of its child terms, such as COVID-19, Middle East respiratory syndrome and severe acute respiratory syndrome. The resulting publication list was ranked by relevance or sorted by publication date. In the curation process, the relationship between a gene and a disease is indicated by evidence codes [23].
The evidence code IDA (inferred from direct assay) is used to indicate direct involvement of a gene product in causing or treating a disease. IMP (inferred from phenotype manipulation) is used in cases where gene expression or function is artificially altered and a genetic or mechanistic connection to a disease is implied. IAGP (Inferred by Association of Genotype from Phenotype) is used for an association of a disease with genetic mutations or polymorphisms of a gene. IEP (Inferred from Expression Pattern) or HEP (expression changes measured by high throughput assays) is used when a gene changes its expression pattern during the disease course. In addition to in-house manual annotations, RGD regularly imports annotations from other data resources, including the Gene Ontology Consortium, ClinVar, Mouse Genome Informatics, Online Mendelian Inheritance in Man, Online Mendelian Inheritance in Animals and the Comparative Toxicogenomics Database (CTD) [22]. The evidence codes of imported annotations are assigned with the same criteria, except for the EXP evidence code used in the annotations imported from CTD. EXP indicates that a gene may be a biomarker of a disease or play a role in the etiology of a disease. RGD also propagates annotations from other organisms to human orthologous genes by using the ISO (Inferred from Sequence Orthology) evidence code. These imported annotations are organized into categories such as 'Disease', 'Human Phenotype', 'Mammalian Phenotypes' for non-human organisms, and others, as shown in the COVID-19 Disease Portal (Figure 1A).
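The gene prioritization step described above (appearance in multiple databases, then combined publication count) can be sketched as follows. This is a minimal illustration, not the procedure from [22]: the exact ordering rule, the function name `prioritize_genes`, and the publication counts in the example records are all assumptions made for demonstration.

```python
from collections import defaultdict


def prioritize_genes(records):
    """Rank genes by number of source databases, then by combined publications.

    records: iterable of (database, gene_symbol, publication_count) tuples.
    The tie-breaking rule (alphabetical) is an assumption of this sketch.
    """
    databases = defaultdict(set)    # gene -> databases that list it
    publications = defaultdict(int)  # gene -> publications summed over databases
    for database, gene, pub_count in records:
        databases[gene].add(database)
        publications[gene] += pub_count
    return sorted(
        databases,
        key=lambda g: (-len(databases[g]), -publications[g], g),
    )


# Hypothetical query results; the publication counts are invented.
records = [
    ("MalaCards", "ACE2", 12),
    ("DisGeNET", "ACE2", 30),
    ("PhenoPedia", "ACE2", 8),
    ("DisGeNET", "IL6", 25),
    ("PhenoPedia", "IL6", 40),
    ("MalaCards", "TMPRSS2", 5),
]
ranked = prioritize_genes(records)
print(ranked)
```

In this toy input, a gene found in three databases outranks one found in two even though the latter has more publications overall, which reflects the assumption that database coverage is the primary key.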
The genes derived from the COVID-19 Disease Portal were further evaluated using tools available at RGD. The Gene Annotator tool retrieves all functional annotations for a gene list or a chromosomal region and visualizes the gene count distribution across disease terms in a Comparison Heat Map (https://rgd.mcw.edu/rgdweb/ga/start.jsp (accessed on 1 June 2022)). Two publicly available enrichment tools, MOET (the Multi Ontology Enrichment Tool; https://rgd.mcw.edu/rgdweb/enrichment/start.html (accessed on 1 June 2022)) at RGD and Set Analyzer (http://ctdbase.org/tools/analyzer.go (accessed on 1 June 2022)) at the CTD, were used to perform enrichment analysis of COVID-19 disease genes. Both are web-based analysis tools that generate a list of ontology terms statistically over-represented in the input gene symbol list. The Set Analyzer finds enriched disease, Gene Ontology, pathway, and gene-gene interaction terms for human genes, while MOET is capable of performing enrichment analyses in multiple species (including rat, mouse, human, bonobo, squirrel, dog, pig, chinchilla, naked mole-rat and vervet) and multiple ontologies (including Disease, GO, Pathway, Phenotype, and Chemical entities (ChEBI)). The Ancestor chart from QuickGO (https://www.ebi.ac.uk/QuickGO/ (accessed on 1 June 2022)) [24] was used to visualize the relationship among enriched GO terms.
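To make "statistically over-represented" concrete, the sketch below runs a generic over-representation test on toy data. The text does not state which statistic or correction MOET or Set Analyzer applies; the one-sided hypergeometric test and Bonferroni correction used here are assumptions chosen because they are common for this kind of analysis. The function names, gene identifiers and term names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of k or more annotated genes among n drawn from N,
    when K of the N background genes carry the term."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return hits / total


def enrich(gene_list, term_to_genes, background):
    """Return (term, overlap, corrected p-value) sorted by corrected p-value."""
    genes = set(gene_list) & background
    raw = []
    for term, annotated in term_to_genes.items():
        annotated = annotated & background
        k = len(genes & annotated)
        if k == 0:
            continue  # terms with no overlap are not tested
        raw.append((term, k, hypergeom_pvalue(k, len(genes), len(annotated), len(background))))
    tests = len(raw)  # Bonferroni: multiply by the number of tested terms
    corrected = [(term, k, min(1.0, p * tests)) for term, k, p in raw]
    return sorted(corrected, key=lambda row: row[2])


# Toy data: 20 background genes and two hypothetical ontology terms.
background = {f"g{i}" for i in range(1, 21)}
term_to_genes = {
    "term_A": {"g1", "g2", "g3", "g4"},
    "term_B": {f"g{i}" for i in range(5, 15)},
}
results = enrich(["g1", "g2", "g3", "g5"], term_to_genes, background)
for term, overlap, p_adj in results:
    print(term, overlap, f"{p_adj:.4f}")
```

Here `term_A` ranks first because three of its four genes are in the input list, whereas a single hit in the much larger `term_B` is what chance would predict.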

The COVID-19 Disease Portal
The landing page of the COVID-19 Disease Portal (https://rgd.mcw.edu/rgdweb/portal/home.jsp?p=14 (accessed on 1 June 2022)) (Figure 1) provides access to all the integrated COVID-19 data. The disease browser (Figure 1B) links to the Annotations page, where annotations can be downloaded for analysis. The annotations associated with COVID-19 and its child terms were downloaded from the Annotations page and the associated genes were sent to the Gene Annotator and MOET tools [19] for further analysis. The disease gene list can also be obtained from the OLGA tool (https://rgd.mcw.edu/rgdweb/generator/list.html) using the target disease term as a keyword. The "Gene Set Enrichment" section (Figure 1C) at the bottom of the page sends genes curated with the highlighted ontology term (COVID-19) to the enrichment tool MOET for analysis. Seven ontologies are available in MOET for enrichment analysis.


Human COVID-19-Associated Gene Analysis
COVID-19 is a member of the coronavirus infectious disease family. This family also includes 'Middle East respiratory syndrome' (DOID:0080642) (MERS) and 'severe acute respiratory syndrome' (DOID:2945) (SARS) (Figure 1B). There are 1257 human genes associated with COVID-19, 19 genes associated with MERS and 90 genes associated with SARS, totaling 1338 coronavirus infectious disease genes at RGD (accessed on 1 June 2022). Their overlapping coverage is visualized in the Venn diagram in Figure 2. The intensity of COVID-19 research is reflected by the more than one thousand disease-related genes identified in just over two years of the pandemic. These genes comprise more than 90% of the disease genes associated with coronaviruses.
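The relationship between the per-disease counts and the 1338 total follows from set overlap, which the sketch below illustrates. The three counts and the total are taken from the text; the gene sets used to demonstrate the Venn regions are hypothetical placeholders (G1 to G5), not RGD data, and the function name `venn_regions` is introduced here for illustration only.

```python
def venn_regions(a, b, c):
    """Split three sets into the seven mutually exclusive Venn regions."""
    return {
        "a_only": a - b - c,
        "b_only": b - a - c,
        "c_only": c - a - b,
        "a_b": (a & b) - c,
        "a_c": (a & c) - b,
        "b_c": (b & c) - a,
        "a_b_c": a & b & c,
    }


# Hypothetical placeholder gene sets, for illustration only.
covid = {"G1", "G2", "G3", "G4"}
mers = {"G3", "G4"}
sars = {"G1", "G3", "G5"}
regions = venn_regions(covid, mers, sars)

# The regions partition the union, so their sizes add up to the distinct gene count.
distinct = len(covid | mers | sars)
print({name: sorted(genes) for name, genes in regions.items()}, distinct)

# Counts reported in the text.
reported = {"COVID-19": 1257, "MERS": 19, "SARS": 90}
reported_total = 1338
# Memberships counted more than once when the three counts are simply added.
shared_memberships = sum(reported.values()) - reported_total
covid_share = reported["COVID-19"] / reported_total
print(shared_memberships, f"{covid_share:.1%}")
```

Adding the three reported counts gives 1366, so 28 gene memberships fall in the overlapping regions of the diagram, and the COVID-19 share of 1257/1338 is consistent with the "more than 90%" statement.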

Figure 2. The coronavirus disease gene distribution among the parent term (Coronavirus infectious disease) and its three child terms: COVID-19, Middle East respiratory syndrome (MERS) and severe acute respiratory syndrome (SARS). The numbers in each area represent the gene count of that section and its percentage of all the coronavirus infectious disease genes. There are 1257 genes associated with COVID-19, 19 genes associated with MERS and 90 genes with SARS, totaling 1338 coronavirus infectious disease genes on display.

Gene Disease Association
Most of the COVID-19 annotations were identified by IEP and HEP evidence codes. They account for more than 95% of the total 1407 COVID-19 annotations. The COVID-19 annotations are associated with 1257 unique genes (Table 1A). Among these genes, only ACE2 is associated with four types of evidence codes (IAGP, IDA, IMP and EXP), whereas most of them (1221 genes) carry a single type of evidence code (Table 1B).

Table 1. Evidence code analysis of COVID-19 disease annotations and associated genes.

(A)
ECO Type | Annotation Count | Gene Count
IAGP | 21 | 19
IDA | 4 | 3
IMP | 3 | 3
IEP | 145 | 63
HEP | 1195 | 1175
EXP | 37 | 37
ISO | 2 | 2
Total | 1407 | 1257

(B)
Gene/Unique ECO | Gene Count
1 | 1221
2 | 28
3 | 7
4 | 1

A. The breakdown of 1407 COVID-19 disease annotations and 1257 associated disease genes according to evidence code types. B. The breakdown of gene counts by their association with unique evidence codes.
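The two breakdowns in Table 1 can be derived from a flat list of (gene, evidence code) annotations, as in the sketch below. The annotation records are hypothetical placeholders and `summarize` is a name introduced here; only the totals checked at the end come from Table 1 and the accompanying text.

```python
from collections import Counter, defaultdict


def summarize(annotations):
    """annotations: iterable of (gene, evidence_code) pairs, one per annotation."""
    annotations = list(annotations)
    # Panel A, column 1: annotations per evidence code.
    annotation_count = Counter(code for _, code in annotations)
    codes_by_gene = defaultdict(set)
    for gene, code in annotations:
        codes_by_gene[gene].add(code)
    # Panel A, column 2: distinct genes per evidence code.
    gene_count = Counter(code for codes in codes_by_gene.values() for code in codes)
    # Panel B: genes grouped by how many distinct codes they carry.
    genes_by_code_number = Counter(len(codes) for codes in codes_by_gene.values())
    return annotation_count, gene_count, genes_by_code_number


# Hypothetical annotations with placeholder gene names.
toy = [
    ("G1", "HEP"), ("G1", "HEP"), ("G1", "IEP"),
    ("G2", "HEP"),
    ("G3", "IAGP"), ("G3", "IDA"), ("G3", "IMP"), ("G3", "EXP"),
]
ann, genes, by_number = summarize(toy)
print(dict(ann), dict(genes), dict(by_number))

# Consistency checks on the values reported in Table 1.
table_a = {"IAGP": 21, "IDA": 4, "IMP": 3, "IEP": 145, "HEP": 1195, "EXP": 37, "ISO": 2}
table_b = {1: 1221, 2: 28, 3: 7, 4: 1}
expression_share = (table_a["IEP"] + table_a["HEP"]) / sum(table_a.values())
print(sum(table_a.values()), sum(table_b.values()), f"{expression_share:.1%}")
```

The per-code gene counts in panel A add up to more than 1257 because a gene with several evidence codes is counted once under each code, whereas panel B counts each gene exactly once.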
These COVID-19 associated genes are also involved in other diseases, as seen in the annotation distribution heatmap in the Gene Annotator tool (Figure 3A). More than half (703) of the COVID genes are associated with 'developmental disease.' In the 'developmental disease' branch, 650 genes are associated with 'congenital, hereditary and neonatal disease' and 334 with 'neurodevelopmental disorders' (Figure 3B). There are 911 COVID genes associated with 'disease of anatomical entity.' The breakdown of COVID-19 disease genes by anatomical entity is shown in Figure 3C and discussed in the 'COVID-19 affected organ systems' section later.
Figure 3. (A) COVID disease genes were visualized by their association with high-level disease terms. (B) COVID genes associated with developmental disease were expanded to show their association with more granular terms under the branch. (C) COVID genes associated with disease of anatomical entity were expanded to show their association with more granular terms under the branch.


Disease Term Enrichment Analysis
COVID-19 is a disease with a broad spectrum of symptoms, including some that are atypical of respiratory diseases, such as loss of smell and neurological symptoms [25]. Existing knowledge of how these COVID-19 genes are involved in other human diseases could shed light on the pathogenesis of COVID-19 and facilitate development of therapeutic strategies. To find these relationships, we next looked at the disease enrichment patterns of the COVID-19 disease genes using MOET [19], developed at RGD. Several high-level disease terms such as 'coronavirus infectious disease', 'RNA virus infection' and 'viral infectious disease' are highly enriched since they are parent terms of COVID-19. The enriched disease table was downloaded from MOET, and the top 40 enriched diseases, from 'respiratory tract infections' to 'lung injury', were selected and are listed in Table 2A. As expected, the term 'respiratory tract infections' was at the top of the list. Surprisingly, there were several enriched terms in the liver disease branch, including liver neoplasms, hepatobiliary system cancer and others. Additional enriched terms included rheumatic disease, autoimmune disease of musculoskeletal system, allergic disease, pneumonia, and immune/inflammatory diseases of non-respiratory systems. The same COVID-19 disease genes were sent to the Set Analyzer, another enrichment tool at CTD, and the top 40 enriched diseases are listed in Table 2B. Most of the diseases enriched in MOET were also enriched in the Set Analyzer; however, some of the rankings shifted. 'Respiratory tract infections' was at the top of the MOET list but ranked 17th on the list from the Set Analyzer. These differences could be attributed to different ways of integrating data and to the two different disease vocabularies used by RGD [22] and CTD [20]. Overall, liver diseases, immune system diseases, autoimmune diseases and respiratory tract diseases were highly enriched in both tools.
The enriched list from the Set Analyzer includes more organ system disease terms, while the MOET list contains more granular terms. Taking nervous system diseases as an example, 'autoimmune disease of central nervous system' (DOID:0060004) is ranked 38th on the MOET enrichment list, whereas its high-level parent term 'Nervous System Diseases' (MESH: D009422) is ranked 15th on the enrichment list from the Set Analyzer. On the MOET list, several granular kidney disease terms such as nephritis, glomerulonephritis, and glomerular diseases are ranked 14th, 17th, and 21st, respectively, while on the Set Analyzer list, 'Urologic Diseases' (MESH: D014570) (parent of kidney diseases) is ranked 26th and 'Nephritis' (MESH: D009393) 32nd.
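The rank shifts between the two tools can be tabulated with a few lines of code. In the sketch below, the four ranks are the ones quoted in the text; matching terms across the two vocabularies by lower-cased name is an assumption of this sketch (the tools use different disease vocabularies, so a real comparison would need a curated term mapping), and `rank_shifts` is a hypothetical helper name.

```python
def rank_shifts(ranks_a, ranks_b):
    """Return (term, rank_a, rank_b, rank_b - rank_a) for terms present in both lists."""
    shared = ranks_a.keys() & ranks_b.keys()
    rows = [(t, ranks_a[t], ranks_b[t], ranks_b[t] - ranks_a[t]) for t in shared]
    return sorted(rows, key=lambda row: row[1])


# Ranks quoted in the text; term names are lower-cased to allow name matching.
moet = {"respiratory tract infections": 1, "nephritis": 14}
set_analyzer = {"respiratory tract infections": 17, "nephritis": 32}

shifts = rank_shifts(moet, set_analyzer)
for term, rank_moet, rank_ctd, delta in shifts:
    print(f"{term}: MOET {rank_moet}, Set Analyzer {rank_ctd}, shift {delta:+d}")
```

A positive shift means the term ranks lower in the Set Analyzer list than in the MOET list.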

COVID-19 Affected Organ Systems
We now look at the target organ distribution of these COVID-19 associated genes (Figure 3). COVID-19 associated disease genes are associated with nervous system disease (772), gastrointestinal system disease (513), endocrine system disease (498), musculoskeletal system disease (498), skin & connective tissue disease (479) and immune & inflammatory disease (431) (Figure 3C). For each of these six organ systems, we drilled down into the disease branch and listed 10 granular disease terms, selected by gene count from the Comparison Heat Map in the Gene Annotator tool (Table 3). Among COVID-19 genes, over sixty percent (772/1257) are also involved in nervous system diseases. Of these 772 COVID/nervous system disease genes, over 650 are involved in central nervous system disease, followed by sensory system disease, neurologic manifestations, and neurodegenerative disease. Among gastrointestinal system diseases and endocrine system diseases, liver diseases, including liver neoplasms and cancers, are highly represented. This correlates with the results of the enrichment analysis, in which liver diseases were highly enriched by both tools. Only 431 genes are involved in the immune & inflammatory disease branch; however, immune and inflammatory diseases are present in all six organ systems listed in Table 3. They are autoimmune disease of the nervous system, autoimmune disease of gastrointestinal tract, autoimmune disease of endocrine system, autoimmune disease of musculoskeletal system and dermatitis.
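The proportions quoted above follow directly from the gene counts. The short calculation below uses only the counts given in the text (1257 COVID-19 genes in total and the six organ system counts from Figure 3C); the variable names are illustrative.

```python
TOTAL_COVID_GENES = 1257

# Gene counts per organ system disease branch, as reported for Figure 3C.
organ_system_counts = {
    "nervous system disease": 772,
    "gastrointestinal system disease": 513,
    "endocrine system disease": 498,
    "musculoskeletal system disease": 498,
    "skin & connective tissue disease": 479,
    "immune & inflammatory disease": 431,
}

shares = {
    name: 100 * count / TOTAL_COVID_GENES
    for name, count in organ_system_counts.items()
}
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")

# A gene can be annotated to several branches, so the counts overlap.
print(sum(organ_system_counts.values()) > TOTAL_COVID_GENES)
```

Because the branches overlap, the six shares do not sum to 100%.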

Gene Ontology Enrichment of COVID Genes
The Gene Ontology enrichment patterns of COVID genes were examined using the MOET tool. The Biological Process (BP) enrichment list is heavily concentrated in the 'immune system process' branch, which includes immune response, immune effector process, leukocyte activation, and their child terms (Table 4 and Figure S1). Another highly represented branch is 'response to stimulus', where the fourth enriched term, 'defense response', resides. The Cellular Component annotations of COVID genes are highly enriched in the branches of 'immunoglobulin complex', 'extracellular region', 'membrane' and 'cell periphery' (Table 4 and Figure S2). Most of the top 40 Molecular Function (MF) terms are either 'binding' or its child terms, such as antigen binding, protein binding or carbohydrate binding. The rest of the terms are 'molecular function regulator' and its child terms (Table 4 and Figure S3).
Table 3. The six organ system diseases associated with the highest number of COVID-19 disease genes were drilled down to more granular diseases within the branch. The number in parentheses next to the disease term shows the number of COVID-19 associated genes that are associated with the disease term. These numbers were taken from the Comparison Heat Map in the Gene Annotator tool. The six organ system diseases with the highest disease gene counts were selected from Figure 3C.
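Assigning enriched GO terms to branches, as in the Biological Process description above, amounts to walking each term's ancestors in the ontology graph. The sketch below shows the idea on a deliberately simplified, hand-written parent map; it is not the real GO graph (which has intermediate terms and more parents), and the function names are hypothetical.

```python
# Simplified, hand-written parent map for illustration; not the full GO hierarchy.
PARENTS = {
    "immune response": ["immune system process", "response to stimulus"],
    "immune effector process": ["immune system process"],
    "leukocyte activation": ["immune system process"],
    "defense response": ["response to stimulus"],
    "immune system process": ["biological_process"],
    "response to stimulus": ["biological_process"],
}


def ancestors(term, parents):
    """Collect every ancestor of a term by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def group_by_branch(terms, branches, parents):
    """List, for each branch, the enriched terms that are the branch or sit beneath it."""
    grouped = {branch: [] for branch in branches}
    for term in terms:
        lineage = ancestors(term, parents) | {term}
        for branch in branches:
            if branch in lineage:
                grouped[branch].append(term)
    return grouped


enriched = ["immune response", "immune effector process",
            "leukocyte activation", "defense response"]
grouped = group_by_branch(
    enriched, ["immune system process", "response to stimulus"], PARENTS
)
print(grouped)
```

A term with more than one parent, such as 'immune response' in this toy map, is counted under every branch it descends from.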

Discussion
COVID-19 was declared a pandemic in March 2020, less than three months after its first identification. Resources have been allocated worldwide to study COVID-19 in the hope of finding preventive and therapeutic measures to control the pandemic. These immense efforts have produced over 280,000 publications and related datasets. From these available resources, the RGD team was able to curate and integrate COVID-19 associated data and to release the COVID-19 Disease Portal just four months later, in July 2020, with regular updates since. In this manuscript, COVID-19 genes were taken from the portal and analyzed with tools developed in-house and with other publicly available tools.
Most of the disease genes were curated with the evidence codes HEP or IEP, which identify changes in the gene expression pattern during COVID-19 disease (Table 1). These genes can serve as biomarkers to monitor the disease course and devise treatment plans. Currently, immune/inflammatory cytokine patterns are known to be useful in predicting disease progression [10], and blocking excessive cytokine release has been proposed as a treatment regimen [26]. The COVID-19 associated genes were examined by their distribution among anatomical entities and by the over-represented diseases. Over sixty percent (772/1257) of the COVID-19 disease genes are also involved in 'nervous system diseases', and 'autoimmune disease of central nervous system' was among the top 40 enriched diseases. According to the breakdown of nervous system disease gene counts in Table 3, the central nervous system, peripheral nervous system and sensory system are affected by the COVID genes, and activation of autoimmunity in these systems is the major disease mechanism. More than likely, the central nervous system is the most significant target, since its autoimmune disease is among the top 40 most enriched diseases. These results suggest that immune/inflammatory attacks on the nervous system play roles in neurological manifestations such as loss of smell, headache, nausea, and impaired consciousness experienced by some COVID-19 patients [8,25,27]. The immune system, which is important for fighting off infection, becomes destructive when dysregulated and causes severe disease complications such as the 'cytokine storm' in severe COVID cases [10,28]. Our enrichment analysis shows that several immune/inflammatory diseases were on the top 40 list, and immune dysregulation was implicated in all six organ systems examined in Table 3. The involvement of COVID-19 genes in the immune system was further confirmed by the Gene Ontology enrichment profiles shown in Table 4. All three aspects pointed to terms associated with the immune system. It has been shown previously that different disease gene sets exhibit unique GO enrichment profiles which reflect the unique pathophysiology of the disease [29]. Our analysis of COVID-19 disease genes suggests that dysregulation of immune function is a common mechanism affecting the pathogenesis of COVID-19, especially in severe cases where multiple organ systems are affected.
COVID-19 was first identified as a respiratory disease that, in severe cases, causes serious lung injury, particularly to type II alveolar cells with high ACE2 expression, and resulting respiratory distress [2,30]. In addition to the lungs, ACE2 has a broad distribution in tissues and vasculature, and its expression provides viral entry to the organ system. Viral damage to the liver, brain, and kidney has been documented in biopsies from patients [2,5,6]. However, because COVID-19 is a newly evolving disease, it is not clear whether direct SARS-CoV-2 entry through ACE2 leads to multiple organ damage. The unexpected finding by two enrichment tools that liver diseases are highly enriched among genes associated with COVID-19 could point to an important direction for understanding the pathogenesis of COVID-19. It has been shown that liver organoids express the viral receptor ACE2, can be infected by SARS-CoV-2, and can become a replication reservoir for the virus [2]. The percentage of COVID-19 patients with liver injury varied among patient groups [31]; however, liver damage appeared to be linked to severe cases and poor disease outcome [32,33]. The liver damage caused by SARS-CoV-2 infection is attributed to several mechanisms, such as augmented expression of the viral receptor ACE2 during disease, uncontrolled immune cell infiltration, and cytokine storms after infection in the hepatobiliary system [32–34]. These mechanisms might also be involved in the multiple organ injuries observed in COVID-19 patients. What is unique to liver injury in COVID-19 is its association with unbalanced coagulation control. It has been shown that COVID-19 patients exhibit a hypercoagulable state and that this condition is related to impaired liver function resulting from liver injury [34]. Since the liver is the major organ producing coagulation factors [35], damage to the liver would impair coagulation control, thus exacerbating liver failure or even multiple organ failure in severe cases. However, alteration in hemostasis control could be secondary to cytokine dysregulation, since our enrichment analyses did not show overrepresentation of blood coagulation diseases.
Here, we performed a detailed analysis of COVID-19 genes by their association with organ systems, disease enrichment, and GO enrichment using the resources available at RGD. Some of the results are consistent with the clinical features of COVID-19, such as the involvement of the nervous system and immune system in the disease. Additionally, the enrichment results point to a link between COVID-19 and liver diseases. As a globally accessible disease bioinformatics resource, RGD strives to provide researchers with utilities to resolve the complexity of disease research. During the challenging time of the COVID-19 pandemic, producing, organizing, and integrating data sets related to the disease is a timely contribution to the disease research community.