Interoperability of COVID-19 Clinical Phenotype Data with Host and Viral Genetics Data

The outbreak of the COVID-19 epidemic has focused enormous attention on the genetics of viral infection and related disease. Since the beginning of the pandemic, we have focused on the collection and integration of SARS-CoV-2 databases, which contain information on the structure of the virus and on its ability to spread, mutate, and evolve; data are made available from several open-source databases. In the past, we gathered experience on human genomics data by building models and integrated databases of genomic datasets (representing, e.g., mutations, gene expression profiles, epigenetic signals). We also coordinated the development of a data dictionary describing the clinical phenotype of the COVID-19 disease, in the context of a very large consortium. The main objective of this paper is to describe the content of the data dictionary and the process of data collection and organization. We also argue that, in the context of the COVID-19 disease, interoperability between the three domains of viral genomics, clinical phenotype, and human host genomics is essential for empowering important analysis processes and results. We call for actions that could be performed to link these data.


Introduction
The outbreak of COVID-19 has presented novel challenges to the research community, driven by the intent of rapidly mitigating the effects of the pandemic. During this period, we have observed the production of an extraordinary amount of data; the total number of SARS-CoV-2 sequences available worldwide went from a few hundred in March 2020 to about one hundred thousand in August 2020 and more than 5 million in December 2021. Inspired by our work on genomic data integration [1,2], we searched for effective ways to help investigate the new phenomenon. We produced concise models to understand and organize data as a basis for building search, visualization, and analysis systems.
In this paper, we overview human and viral genomic data systems in place (Section 2), clarifying our position in the COVID-19 data modeling and tool development community; then, we explore the conceptual model proposed for capturing COVID-19 disease phenotype information within an important global initiative and briefly discuss currently available alternatives (Section 3). Finally, we overview efforts that link COVID-19-related datasets on the viral/host genotype and phenotype levels (Section 4). Limitations of the possible solutions are discussed before we conclude (Section 5).

Background
Understanding viruses from a conceptual modeling perspective is very important. In April 2020, we designed the Viral Conceptual Model (VCM, [3]): the sequence of the virus is the central entity, described by three views regarding (i) the information on the virus and on the infected host; (ii) details of the technology and process used for extracting sequences; (iii) metadata on the project and laboratory managing the sampling, sequencing, and analysis pipelines. Additionally, we modeled the annotated parts of sequences (known genes, coding and untranslated regions, etc.) and their nucleotide/amino acid mutations, computed with respect to the reference sequence of the viral species. An enriched model, resulting from the ontological unpacking of the initial VCM, was also proposed [4]. We then described an abstract model that represents both the data (thus embedding the VCM) and the external knowledge that is being collected about SARS-CoV-2. This includes notions on variants, their effects (in terms of disease severity, transmissibility, vaccine escape, etc.), their composition (in terms of sets of mutations), the peculiarities of mutations due to their original and alternative nucleotide or amino acid residues, and the definition of particular regions of the genome with given functions. The proposal is preliminarily sketched in CoV2K [5], but we are working towards a much broader definition that supports targeted searches interconnecting data and knowledge.
After our modeling effort, we built solid pipelines for extracting data from the original deposition portals and integrating them within our global model VCM. In doing so, the most difficult aspect we had to consider was the growth of data (which reached more than 5 million genomes in December 2021, in only 1.5 years). Handling such data continuously requires increasingly powerful computing resources with several logical and physical optimizations. ViruSurf [6] is our first system, designed for collecting sequences from the two biggest, completely open, SARS-CoV-2 data sources, i.e., GenBank [7] and COG-UK [8]. We also implemented the dual system ViruSurf-GISAID, storing sequences from GISAID [9], currently hosting the major deposition database (accessing GISAID data is subject to a Data Agreement that is typically granted to users from research institutes and academia). ViruSurf offers a practical interface where each drop-down menu is a metadata attribute describing viral sequences. Possible values are paired with the number of available sequences in the database. Different conditions can be built on the presence of specific mutational patterns (predicating on either nucleotides or amino acid residues). EpiSurf and EpiSurf-GISAID [10] are companion systems for analyzing sequence mutations in the context of specific viral genomic regions, i.e., epitopes. Epitopes, extracted from the Immune Epitope Database (https://www.iedb.org/, accessed on 26 December 2021), are strings of amino acid residues from a virus protein that can be recognized by antibodies or other host receptors. In the mentioned interfaces, the results are produced as browsable tables of sequences and epitopes, described by their metadata. They can be downloaded as textual files that are easily embedded in bioinformatic pipelines.
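The style of search described above (metadata attributes whose values are paired with sequence counts, combined with conditions on mutations) can be sketched in a few lines. This is only a minimal illustration over in-memory records and does not reflect the actual ViruSurf implementation; the field names (`country`, `lineage`, `aa_changes`), the helper functions, and the sample records are all hypothetical.

```python
from collections import Counter

# Hypothetical sequence records; field names and values are illustrative only.
SEQUENCES = [
    {"accession": "S1", "country": "Italy", "lineage": "B.1.1.7",
     "aa_changes": {"S:N501Y", "S:D614G"}},
    {"accession": "S2", "country": "Italy", "lineage": "B.1",
     "aa_changes": {"S:D614G"}},
    {"accession": "S3", "country": "UK", "lineage": "B.1.1.7",
     "aa_changes": {"S:N501Y", "S:D614G"}},
]


def value_counts(records, attribute):
    """Pair each value of a metadata attribute with its number of sequences."""
    return Counter(r[attribute] for r in records)


def search(records, metadata=None, required_changes=()):
    """Keep records that match every metadata condition and carry
    every required amino acid change."""
    metadata = metadata or {}
    required = set(required_changes)
    return [
        r for r in records
        if all(r.get(key) == value for key, value in metadata.items())
        and required <= r["aa_changes"]
    ]


counts = value_counts(SEQUENCES, "country")
hits = search(SEQUENCES, metadata={"country": "Italy"},
              required_changes=["S:N501Y"])
```

Here `counts` plays the role of a drop-down menu annotated with the number of sequences per value, and `hits` is the filtered result table.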
A more visual and interactive support is provided by VirusViz [11], which can be opened directly from sets of sequences defined within ViruSurf or EpiSurf, as well as from a user-input file of sequences. VirusViz allows partitioning the population of interest into groups and comparatively visualizing their mutation distributions, with several options for highlighting positions, mutational patterns, and regions of interest. VirusViz has a dual closed-source solution, i.e., VirusLab [12], commercialized by the Quantia Consulting S.R.L. company and developed in collaboration with our group at Politecnico di Milano, within the EIT project N. 20663. We have just finalized ViruClust [13], a tool for comparing SARS-CoV-2 genomic sequences and lineages in space and time without requiring any computational background. We are currently developing VariantHunter (https://github.com/DEIB-GECO/Docker_VariantHunter, accessed on 26 December 2021), an application that indicates possible emerging variants, and CoV2K-API (http://gmql.eu/cov2k/api/redoc, accessed on 26 December 2021), a flexible API for exploring the interplay between SARS-CoV-2 data and knowledge, an ever-growing source of information for variants, their effects, and genetic characteristics.
All mentioned systems are part of the broad architecture illustrated in Figure 1, showing different areas: data sources (orange background), databases (pink), data models (blue), and systems (gray), such as web applications and tools. Our work on viral resources was very efficient thanks to our background in human genomics. We previously proposed another conceptual model focused on human genomics [1], based on a central entity representing files of genomic regions, similarly described from various dimensions. We next developed and implemented an integrated database, searchable through the GenoSurf interface [2]. The methods for data loading, integration, and cleaning of viral sequences were adapted from META-BASE [14], a pipeline for data ingestion developed for GenoSurf. Datasets are organized by using the Genomic Data Model [15], which couples the outcome of a biological experiment with clinical and biological information of the studied sample. Open data are retrieved from different sources such as ENCODE [16], TCGA [17], Roadmap Epigenomics [18], and 1000 Genomes [19] and can be used for answering complex biological queries with the GenoMetric Query Language system [20]. The models and systems are general enough to consider many different signals of the human genome, including studies that may be useful to represent COVID-19-related problems.
Figure 1. Comprehensive overview of SARS-CoV-2 data sources, models, databases, and tools. Undirected links in the schema represent relationships of use between the different actors. For instance, the data extracted from the GenBank source are loaded within the ViruSurf relational database, which is queried by systems such as ViruSurf. In turn, VirusViz, VirusLab, ViruClust, and CoV2K-API employ abstractions defined in the CoV2K abstract model. VariantHunter is the most recent tool, directly using GISAID data retrieved by users that have a specific data access agreement.
The main challenges that need to be addressed at the data modeling level are: (i) associating concepts to terminology standard codes provided by organizations (e.g., SNOMED, LOINC, ATC, and ICD); (ii) building appropriate questionnaires to collect data (WHO has provided some guidelines for Case Report Forms at https://www.who.int/teams/health-care-readiness-clinical-unit/covid-19/data-platform, accessed on 26 December 2021); (iii) building cohorts with statistical significance [25]; (iv) defining phenotypes (shared set of phenotypes to be combined with genomic data for standard GWAS and further meta-analysis).
The collection of homogenized data enables a series of studies (e.g., [26-28]). However, such studies become more powerful when the data are associated with genetic information of the human host or of the virus, as discussed next.

The COVID-19 Phenotype Data Dictionary
In addition to the already cited cooperative efforts, we wish to mention the COVID-19 Host Genetics Initiative [29] (https://www.covid19hg.org/, accessed on 26 December 2021), which aims at gathering an open community of thousands of researchers who produce, share, and analyze data to learn the genetic determinants of COVID-19 susceptibility, severity, and outcomes. Within this international group, whose motivations we fully share, we engaged in the design, structuring, and harmonization of a comprehensive data dictionary to help with the submission of individual-level data. For the collection of requirements and cooperative design of the dictionary, we employed a Slack channel dedicated to "covid19-hg-phenotypes" with 1489 members, started on 18 March 2020 by the leaders of the Initiative. The channel hosted 210 posts, plus all the derived public and private replies between active members. In this process, we coordinated about 50 clinicians in the cooperative design of the patients' clinical phenotype definition. The phenotype refers to patients with severe disease who were hospitalized; it has about 200 clinical variables that have been progressively consolidated and annotated, describing demographics, exposure, risk factors, co-morbidities, hospitalization admission and course, and longitudinal encounters with symptoms, treatments, and lab data.
The data dictionary was released on 16 April 2020 (FREEZE 1) and updated on 16 August 2020 (FREEZE 2); both versions are available at http://gmql.eu/phenotype/ (accessed on 26 December 2021). Genetic and related clinical phenotype data are currently being collected and hosted by EGA [30], the European Genome-Phenome Archive of EMBL-EBI. The initiative recommends that clinical phenotype data be submitted following the mentioned data dictionary [31,32]. It has already collected a considerable amount of results, currently reaching ~9.4 K critically ill cases, 25 K hospitalized cases, and 125 K reported cases of SARS-CoV-2 infection with almost 3 M controls, as an update [33] of the flagship statement published in Nature at the beginning of 2021 [32], where only 6 K critically ill cases, 13 K hospitalized cases, 50 K reported cases, and 2 M controls were considered.

Cooperative Construction of the Dictionary
An initial draft of the survey was inspired by about 15 COVID-19 questionnaires used across different studies, including the UCSF CHIRP clinical intake, the Canada CanPATH questionnaire, the Columbia University COVID-19 questionnaire, the case report form for Confirmed Novel Coronavirus COVID-19 by WHO, handouts of the NIH Intramural, 23andMe, All of Us, and Helix research programs, the University of Michigan and Universities of Chile COVID-19 surveys, and the Finnish Institute for Health and Welfare. A document was circulated among all groups participating in the initiative, requesting active feedback from those who were designing studies that required recording patient information. A first COVID-19 Phenotype Definition was drafted. Then, variables were organized in sets that correspond to specific clinical questions. In the following, we outline the principles that were used to evolve the first draft into the FREEZE 1 and FREEZE 2 versions; we acted as moderators of the process.
General Principles. Every variable had a unique Variable Name that could be used to cross-relate variables across multiple studies. Variables were grouped according to the clinical question that they related to; some groups described general variables that should always be present. Categorical variables had a list of possible values. Variables with plural names were used to indicate that they could have multiple values. Variables could be entered by a Contributor (P = Patient, MD = Healthcare Professional, Any) and were divided into the categories One-Time (associated with the patient, never changing) and Visit-Time (associated with a visit/follow-up event, progressively collected after the first patient identification).
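The general principles above can be made concrete with a small sketch of how a dictionary variable might be represented and checked. It is not taken from the released dictionary: the class and field names, the two example variables, and their value lists are hypothetical, and only the Contributor codes (P, MD, Any) and the One-Time/Visit-Time categories come from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Contributor(Enum):
    PATIENT = "P"
    HEALTHCARE_PROFESSIONAL = "MD"
    ANY = "Any"


class Timing(Enum):
    ONE_TIME = "One-Time"      # associated with the patient, never changing
    VISIT_TIME = "Visit-Time"  # associated with a visit/follow-up event


@dataclass(frozen=True)
class Variable:
    name: str                  # unique Variable Name
    group: str                 # clinical question the variable relates to
    contributor: Contributor
    timing: Timing
    allowed_values: Optional[Tuple[str, ...]] = None  # None if not categorical
    multi_valued: bool = False  # plural names indicate multiple values

    def validate(self, value) -> bool:
        """Check cardinality and, for categorical variables, allowed values."""
        values = list(value) if isinstance(value, (list, tuple, set)) else [value]
        if len(values) > 1 and not self.multi_valued:
            return False
        if self.allowed_values is None:
            return True
        return all(v in self.allowed_values for v in values)


# Hypothetical example variables (names and values are assumptions).
symptoms = Variable("symptoms", "Encounter symptoms", Contributor.ANY,
                    Timing.VISIT_TIME,
                    allowed_values=("fever", "cough", "dyspnea"),
                    multi_valued=True)
smoking = Variable("smoking_status", "Risk factors", Contributor.PATIENT,
                   Timing.ONE_TIME,
                   allowed_values=("never", "former", "current"))
```

With such a representation, a submission tool could reject values outside the agreed lists before records reach the shared archive.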
Identification. We noted that, without an identifier, patients cannot be related to an organization, nor can their clinical progression be traced. We thus considered the important issue of creating identification mechanisms for both patients and organizations:

• For the hospital taking care of the patient, we chose a simple identification, i.e., the pair country code-city code of the hospital phone contact. Note that this strategy left space for possible extensions, e.g., contributors could add a number or ZIP code for cities having multiple hospitals, or a doctor ID for contributors who would enable doctors to collect records in the territory.
• For the anonymized patient identifier, we assumed that each contributor would provide their own method. When such information was omitted, all the records input for a single patient were treated as uncorrelated.
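The two identification mechanisms can be illustrated with a short sketch. The separator, the function names, and the record fields (`hospital_id`, `patient_id`) are assumptions made for illustration; the dictionary itself only prescribes the country code-city code pair and leaves the patient identifier to each contributor.

```python
def hospital_id(country_code: str, city_code: str, extension: str = "") -> str:
    """Build a hospital identifier from the phone country and city codes.

    The optional extension stands for a ZIP code, a hospital number,
    or a doctor ID, as allowed by the identification strategy.
    """
    parts = [country_code.strip().lstrip("+"), city_code.strip()]
    if extension:
        parts.append(extension.strip())
    return "-".join(parts)


def group_by_patient(records):
    """Group records that share hospital and patient identifiers.

    Records without a patient identifier cannot be correlated,
    so each of them forms a group of its own.
    """
    groups = {}
    uncorrelated = []
    for record in records:
        patient = record.get("patient_id")
        if patient:
            key = (record["hospital_id"], patient)
            groups.setdefault(key, []).append(record)
        else:
            uncorrelated.append([record])
    return list(groups.values()) + uncorrelated


# Illustrative records only.
hid = hospital_id("+39", "02")
records = [
    {"hospital_id": hid, "patient_id": "a1", "visit": 1},
    {"hospital_id": hid, "patient_id": "a1", "visit": 2},
    {"hospital_id": hid, "patient_id": None, "visit": 1},
    {"hospital_id": hid, "patient_id": None, "visit": 2},
]
grouped = group_by_patient(records)
```

In this toy input, the two records of patient `a1` form one traceable clinical course, while the two records lacking an identifier remain separate.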
Document construction. The process of data contribution was very open and collaborative. A "dataspace" was available to contributors, who could add clinical questions, variables, or even values to existing variables, but could not change or drop existing content without moderation. Contributors were also asked to avoid adding variables about overlapping topics, or values whose meaning overlapped with existing values.

Proposed Model
The final dictionary is illustrated by the entity relationship diagram [34] in the green central rectangle of Figure 2; the Patient is the central concept, whose phenotype information is collected at admission and during the course of hospitalizations, hosted by a given Hospital. For ease of visualization, attributes are clustered within Attribute Groups, and attributes within groups can be further clustered within Subgroups, denoted by white squares (for brevity, these are not further expanded into specific attributes). Groups connected directly to the PATIENT describe: Each patient is characterized by multiple instances of Encounter; attribute groups of encounters describe an EncounterSymptom (with their subgroup SymptomsDef), a Treatment (with a TreatmentsDef subgroup that has a start and an end date), and a LaboratoryResult (with a test date, name of the laboratory, and a subgroup of MeasuredParameters with specific measure units and their typical ranges).
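The encounter-centered part of the model can be sketched as nested data classes. This is an illustrative reading of the description above, not the released schema: attribute names such as `encounter_date`, the flattening of subgroups into plain fields, and the example values (including the parameter name, unit, and range) are assumptions.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Tuple


@dataclass
class Treatment:
    name: str
    start: date
    end: Optional[date] = None  # open-ended while the treatment is ongoing


@dataclass
class MeasuredParameter:
    name: str
    value: float
    unit: str
    typical_range: Tuple[float, float]  # (low, high) in the same unit

    def in_range(self) -> bool:
        low, high = self.typical_range
        return low <= self.value <= high


@dataclass
class LaboratoryResult:
    test_date: date
    laboratory: str
    parameters: List[MeasuredParameter] = field(default_factory=list)


@dataclass
class Encounter:
    encounter_date: date  # assumed attribute, not stated in the text
    symptoms: List[str] = field(default_factory=list)
    treatments: List[Treatment] = field(default_factory=list)
    lab_results: List[LaboratoryResult] = field(default_factory=list)


@dataclass
class Patient:
    patient_id: str
    hospital_id: str
    encounters: List[Encounter] = field(default_factory=list)


# Placeholder values for illustration only.
lab = LaboratoryResult(
    test_date=date(2020, 4, 20),
    laboratory="Lab A",
    parameters=[MeasuredParameter("parameter_x", 12.0, "mg/L", (0.0, 5.0))],
)
patient = Patient(
    patient_id="a1",
    hospital_id="39-02",
    encounters=[Encounter(date(2020, 4, 20), symptoms=["fever"],
                          lab_results=[lab])],
)
out_of_range = [p.name for e in patient.encounters
                for r in e.lab_results for p in r.parameters
                if not p.in_range()]
```

Keeping typical ranges next to each measured parameter lets downstream analyses flag abnormal values without consulting an external reference table.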
Researchers can extract the patient phenotype and differentiate cases and controls in a number of ways. For example, one of the COVID-19 Host Genetics Initiative analyses discriminates between mild, severe, or critical COVID-19 disease severity based on a set of EncounterSymptoms and HospitalizationCourse conditions, whilst another analysis distinguishes cases and controls based on Comorbidities and AdmissionSymptoms.

Host Genotype and Host Phenotype
To find important genotype-phenotype correlations, well-defined phenotypes need to be ascertained in a quantitative and reproducible way [35]. Since the very first months of the COVID-19 pandemic, large efforts have been under way to link the genotype of the human host to the COVID-19 phenotype.
Genome-wide association studies (GWAS) [36] and multi-omic approaches [37] (e.g., gene expression and proteomics) can uncover common variants and networks underlying the host response. In particular, GWASs have reached interesting results [38], associating human genetic variants with severe COVID-19 [39], COVID-19 requiring intensive care unit admission [40], or respiratory failure (the 3p21.31 gene cluster and the ABO locus, a marker identifying type A blood [38]). Zeberg and Pääbo focused on genomic regions that protect against severe COVID-19 [41], finding that the gene cluster identified in [38] is located in an area of chromosome 3 inherited from Neanderthals, possibly explaining statistical differences pointing to population groups (European, Asian) that are hit harder by COVID-19 than others. Several studies have been conducted by exploiting the UK Biobank data (https://www.ukbiobank.ac.uk/, accessed on 26 December 2021), showing, e.g., variants that increase the risk of COVID-19-related mortality [36], lifestyle factors such as obesity (associated with impaired pulmonary function) that are related to illness severity [42], specific genotypes (ApoE e4e4) associated with COVID-19 test positivity [43], and the impact of sex-related genetic differences on the disease [44,45].
Several international organizations were set up to foster coordination among researchers and studies around the world. The COVID-19 Host Genetics Initiative has contributed several important findings among the mentioned ones [38,40]; in their flagship paper [32], they described three GWAS meta-analyses comprising 49,562 COVID-19 patients from 46 studies across 19 countries worldwide. The results included 15 genome-wide significant loci associated with SARS-CoV-2 infection or severe illness, several in genes previously associated with autoimmune and inflammatory diseases. The implications of these findings for treatment or prevention will require further evaluation.
The most recent results concern the definition of phenotypes related to COVID-19 outcomes (focusing on susceptibility, given exposure, mild clinical manifestations, and an aggregate score of symptom severity) [48] by the AncestryDNA Science Team [49]. Evidence has been gathered that the expression of the ACE2 gene has an influence on COVID-19 risk and is also predictive of severe disease [50]. The GenOMICC (Genetics Of Mortality In Critical Care) study performed a GWAS in 2244 critically ill patients with COVID-19 from 208 UK intensive care units, identifying robust genetic signals that relate to key host antiviral defense mechanisms and act as mediators of inflammatory organ damage in patients with the disease [40].
Hospitalized COVID-19 patients have been characterized by means of clinical and molecular parameters [51]. An organized and systematic approach to biobanking [52] has allowed a broad coverage of clinical and genetic data for the GEN-COVID Multicenter Study, massively advancing COVID-19 research. A post-Mendelian genetic model has been proposed in [53], designing an Integrated PolyGenic Score to measure the combined effect of common and rare variants. Outcome severity has been associated with blood type and a gene-rich locus on chr3p21.31 in [54]; life-threatening COVID-19 has been connected to inborn errors of type I IFN immunity [47]; other conditions have been linked to male and female characteristics [55-57], from an age-dependent [58] or an age-independent [59] perspective. COVID-19 susceptibility genetic factors have been studied using symptom-based case predictions [31].
Although research on COVID-19 host genomics is still in progress, all these collaborative efforts show that the human genetics scientific community has been able to gather from around the world to jointly address the pandemic. As a consequence, the identification of new host genetic factors associated with COVID-19 is driven and encouraged by new collaborative platforms, data sharing, joint analyses, and publications of findings.

Viral Genotype and Host Conditions
GISAID [9], the most widely adopted database for viral sequence depositions, provides an annotation type regarding the "patient status" (e.g., "ICU; Serious", "Hospitalized; Stable", "Released", "Discharged") for a restricted number of sequences (only 4% of the entire collection in December 2021). In addition, 2019nCoVR [61] has only stored 208 clinical records (https://bigd.big.ac.cn/ncov/clinic, accessed on 26 December 2021) related to specific sequences since the beginning of the pandemic; these include key-value pairs about the onset date, travel/contact history, clinical symptoms, and tests. Some early findings connected virus sequences with human phenotype, although on very small datasets (e.g., [62-65]). We are currently not aware of open data sources or datasets sharing viral sequences linked to the phenotype data of the host organism. Such a combination would enable interesting and comprehensive queries concerning the impact of sequence variants on the clinical phenotype of patients affected by COVID-19. From a data-driven perspective, the only information that is shared concerns the general effects of mutations on epidemiological or immunological aspects of the SARS-CoV-2 infection, without a link to specific patients. This general information is collected, for instance, by COG-UK Mutation Explorer [8] (http://sars2.cvr.gla.ac.uk/cog-uk/, accessed on 26 December 2021) in tabular format for each specific mutation, and by CoVariants [66], CDC [67], and ECDC [68] in relation to given Variants of Concern or Interest or to variants that have attracted the research community's attention. These are provided in textual form. Our own work [5] has attempted to systematize such effects into a taxonomy described in https://github.com/DEIB-GECO/cov2k_data_collector/blob/master/CoV2K_Effects_Taxonomy.pdf (accessed on 26 December 2021).
In Figure 2, in the yellow rectangle, each Patient can be connected to the information on the ViralSequence by which he/she is infected; such sequence may hold mutations on two different levels, i.e., Amino Acid Changes or Nucleotide Mutations, which yield Effects on multiple levels.

Host Genetics, Host Clinical Phenotype, and Viral Genome
A viral disease is a complex system including the virus's genotype and the host's genotype and phenotype, as captured in Figure 2. The databases for each of these systems are so far curated by different communities of scientists; links connecting patients to their phenotype and viral sequences are normally missing. If such links existed in public databases, patients could be connected to their genetic profiles (including several signals such as mutations and gene expressions) and to the sequence of the virus that infected them, yielding at most one-to-one relationships between the corresponding databases (see Figure 2). However, genome evolution can be traced, as it happens for tumors, and longitudinal studies of viral sequences are starting to emerge for COVID-19, showing how the virus mutates when repeatedly sampled from the same human [69]. Therefore, the clinical course of a patient should be linked to multiple sequencing events of both the human genome and the virus.
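The one-to-many link between a patient and repeated viral sequencing events can be sketched as follows. This is a toy illustration under stated assumptions: the `ViralSample` record, its fields, and the sample data are hypothetical, and mutations are compared as plain sets of labels rather than through any real variant-calling pipeline.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class ViralSample:
    patient_id: str           # link to the clinical phenotype record
    collected: date
    mutations: FrozenSet[str]


def samples_by_patient(samples) -> Dict[str, List[ViralSample]]:
    """Group viral samples per patient, ordered by collection date."""
    grouped: Dict[str, List[ViralSample]] = {}
    for sample in samples:
        grouped.setdefault(sample.patient_id, []).append(sample)
    for timeline in grouped.values():
        timeline.sort(key=lambda s: s.collected)
    return grouped


def gained_mutations(timeline):
    """For each consecutive pair of samples, list the mutations present
    in the later sample but absent from the earlier one."""
    return [later.mutations - earlier.mutations
            for earlier, later in zip(timeline, timeline[1:])]


# Illustrative samples only.
samples = [
    ViralSample("a1", date(2020, 5, 10), frozenset({"S:D614G", "S:E484K"})),
    ViralSample("a1", date(2020, 4, 20), frozenset({"S:D614G"})),
    ViralSample("b2", date(2020, 4, 22), frozenset({"S:D614G"})),
]
timelines = samples_by_patient(samples)
gained = gained_mutations(timelines["a1"])
```

Joining such timelines with the dates of clinical encounters would allow a patient's clinical course to be read alongside the evolution of the infecting virus.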
Very few works relate the three systems of viral genomics and host genotype and phenotype in the literature. An interesting approach to study the interactome of viruses with their host was proposed by Messina et al. [70] and further studied in [71].
Infrastructures that provide support for handling all these different kinds of data are already emerging. Among these, we mention the National COVID Cohort Collaborative (N3C [72], https://github.com/National-COVID-Cohort-Collaborative, accessed on 26 December 2021), which leverages an organized and inclusive workstream of data extraction, harmonization, and analysis, and the secure SCOR infrastructure [73], ready to be deployed to support trackable data sharing and facilitate automated, legally compliant federated analyses on an international scale. A broad collection of COVID-19-related datasets has been assembled on FAIRsharing [74] and can be reached at https://fairsharing.org/collection/TDRCOVID19Participantleveldatasharingplatformsregistries (accessed on 26 December 2021).

Discussion
In general, there is a strong need to connect human genotype, human phenotype, and viral genotype so as to build a complete and fully encompassing scenario for data analysis. When links are absent in the data, they could be learned through computational methods. With such connections available, viral mutations could be linked to the effects in specific organs or to global disease severity (e.g., requiring intensive care) and be seen as an aspect of clinical practice. While some ongoing studies are connecting human genetics to COVID-19 (e.g., [75]), few studies have so far connected COVID-19 to the viral sequence, each of them based on very few patients (e.g., [62,64]); studies encompassing combinations of variables from the three systems have not been performed yet.
We conclude with the belief that better linking among databases (and, correspondingly, improved communication between specialists in the various disciplines) will help us to better understand infectious diseases and to empower a richer precision medicine. For example, it has been shown that the co-occurrence of certain lab-generated mutations modifies the antigenicity of the SARS-CoV-2 virus and, therefore, its sensitivity to specific neutralizing monoclonal antibodies [76]. This kind of knowledge, when properly structured and applied in a hospital context dealing with COVID-19 patients, can practically inform the treatment decisions of clinicians on given sets of patients whose clinical profile or infecting viral strain is known. Based on our contributions described in Section 2, we built a strong background for the three described systems: we developed both GenoSurf and ViruSurf and, within the COVID-19 Host Genetics Initiative, we coordinated the cooperative design of the COVID-19 patient phenotype. We envision an environment where a tool such as GenoSurf could be fueled with the results of research on the genetic and genomic determinants of the COVID-19 disease and, in particular, with datasets highlighting results from GWAS (linking mutations to specific traits of COVID-19 disease) and from exome sequencing studies, revealing the role and impact of mutations localized within specific genes, considered either individually or collectively (e.g., sets of co-occurring mutations). Then, ViruSurf would include up-to-date sequences of SARS-CoV-2 as deposited by laboratories all over the world. Finally, a clinical phenotype database, based on the described data dictionary, would collect information on the patients and the course of their disease.
In the context of the COVID-19 disease, interoperability between the three domains of viral genomics, clinical phenotype, and human host genomics is essential. We call for actions that can be performed to link these data and make them widely accessible.
Author Contributions: Data dictionary conceptualization, A.B. and S.C.; original draft preparation, A.B.; funding acquisition, S.C. All authors have read and agreed to the published version of the manuscript.

Funding:
Human and viral genomic research reported in this paper was funded by the ERC AdG GeCo (data-driven Genomic Computing) n. 693174.

Acknowledgments:
We are grateful to Arif Canakoglu and Pietro Pinoli for their long-lasting collaboration on human and viral genomics, to Andrea Ganna (Univ. Helsinki) for giving us the coordinating role in the data dictionary construction, to Francesca Mari and Alessandra Renieri (Univ. Siena) for sharing the coordination decisions, and to Sulggi Lee (UCSF), Kathrin Aprile von Hohenstaufen Puoti (independent), Sara Pigazzini (UniMiB), and Catherine Moermans (CHU Liege) for their contributions to the data dictionary production.

Conflicts of Interest:
The authors declare no conflict of interest.