A Methodology for Open Information Extraction and Representation from Large Scientific Corpora: The CORD-19 Data Exploration Use Case

Featured Application: Open Information Extraction on the COVID-19 Open Research Dataset (CORD-19).
Abstract: The usefulness of automated information extraction tools in generating structured knowledge from unstructured and semi-structured machine-readable documents is limited by challenges related to the variety and intricacy of the targeted entities, the complex linguistic features of heterogeneous corpora, and the availability of computational resources for scaling to large amounts of text. In this paper, we argue that handling the redundancy and ambiguity of subject–predicate–object (SPO) triples in open information extraction systems has to be treated as an equally important step in order to ensure the quality and precision of the generated triples. To this end, we propose a pipeline approach for information extraction from large corpora, encompassing a series of natural language processing tasks. Our methodology consists of four steps: i. in-place coreference resolution, ii. extractive text summarization, iii. parallel triple extraction, and iv. entity enrichment and graph representation. We demonstrate our methodology on a large medical dataset (CORD-19), relying on state-of-the-art tools to fulfil the aforementioned steps and extract triples that are subsequently mapped to a comprehensive ontology of biomedical concepts. We evaluate the effectiveness of our information extraction method by comparing it in terms of precision, recall, and F1-score with state-of-the-art OIE engines and demonstrate its capabilities on a set of data exploration tasks.
infectivity was measured on human 293 cells expressing the ALV receptor Tva (293-Tva) {2}. The effect of PMB treatment of these particles {1} Purified ALV-A virus particles {1} was comparable to native MLV particles. These findings suggest that Tva {2} the ALV receptor Tva (293-Tva) {2} binding creates or exposes a functionally important cysteine thiolate target for PMB in ALV-A Env.
FCoVs and CCoVs {1} are common pathogens and readily evolve. It is necessary to pursue epidemiological surveillance of these viruses {2} FCoVs and CCoVs {1}, so as to detect the emergence of new variants, which may have increased pathogenicity and/or a new host range, as early as possible.


Introduction
Open information extraction (OIE) systems aim at distilling structured representations of information from natural language text, usually in the form of triples or n-ary propositions. Contrary to ontology-based information extraction (OBIE) systems, which rely on pre-defined ontology schemas, OIE systems follow a relation-independent extraction paradigm tailored to massive and heterogeneous corpora. Therefore, they can play a key role in many natural language processing (NLP) applications, such as natural language understanding and knowledge base construction, by extracting phrases that indicate novel semantic relationships between entities. Although there are many approaches for extracting triples in the form of {subject, predicate, object} from unstructured text, there is no standardized way of efficiently generating, mapping, and representing these triples in a manner that facilitates end-user applications. These limitations are most prevalent in larger corpora, where the high number of duplicate and/or low-quality triples resulting from topic-irrelevant sentences or complex syntactic phenomena (e.g., coreference) further hinders robust triple extraction, ultimately discouraging the extensive deployment of OIE systems.
The goals of this paper are twofold: first, to present a methodology for efficiently extracting information from large corpora, covering all phases from natural language text pre-processing to triple extraction, intuitive visualization, and querying; and second, to concretize this methodology on a set of downstream tasks relying on state-of-the-art tools and pretrained deep learning models to demonstrate its effectiveness in a real-world scenario involving the CORD-19 dataset, which represents the most extensive machine-readable coronavirus-related collection of literature available for data mining to date.

Related Work
There is an abundance of proposed strategies for transforming raw text to a structured representation in order to populate a knowledge graph [1][2][3][4]. However, especially in the case of OIE approaches and due to concerns on scaling, the use of syntactic or semantic relation extraction techniques has been relatively sparse, with the exception of a few recent examples aiming at domain-specific knowledge extraction [5][6][7][8][9]. Most domain-specific information extraction approaches are focused primarily on evaluating the efficiency of different triple extraction tools on raw data, not taking useful pre-processing and post-processing strategies into account, thus resulting in a large number of potentially uninformative triples [10][11][12]. There exist a few systems that go beyond triple extraction by implementing a more thorough preprocessing strategy, including coreference resolution or discourse analysis to improve the quality of the extracted triples; however, these do not address the scalability issues that arise from processing large corpora [13,14]. By treating each sentence in the corpus equally, we run the risk of overflowing the graph database with unrelated information compared to the documents' true scope, seriously impeding data exploration tasks. On the other hand, by complying only with a strictly defined ontology schema, we are likely to lose all information that is not covered by the existing ontology properties [15]. Finally, the available bibliography lacks a clear triple representation strategy that would resolve duplication issues and would equip the user with a set of data enrichment processes for visualizing latent information such as the temporal dimension (continuity) of the extracted triples, connections to existing ontologies (entity linking), sentence polarity, and hidden interconnections between different corpora based on similar extracted entities. 
Our methodology encompasses a number of pre-processing (coreference resolution, text summarization) and post-processing (entity enrichment, graph representation) tasks coupled with a core parallel triple extraction process, combining different approaches to enhance the contextual connectivity of the extracted information.

Advances in Coreference Resolution
Coreference resolution is the task of finding and grouping all expressions (mentions) that refer to the same entity in a text. Two noun phrases are said to be co-referring if both of them unambiguously resolve to a unique referent. In many cases, the term "coreference" is used interchangeably with the term "anaphora", denoting the non-symmetric syntactic phenomenon of a noun phrase being the anaphoric antecedent of another noun phrase (i.e., only the former is required for the interpretation of the latter) [16]. A key challenge of coreference resolution is that entity information may be spread across multiple mentions over the corpus, thus requiring information to be aggregated from all mentions [17]. Over the last decades, several approaches to tackling coreference problems have emerged, spanning from early, rule-based, and linguistically motivated approaches [18,19], which are based on the syntactic constraints of the language, to modern deep learning techniques that rely on pairwise scoring of entity mentions [20][21][22]. The latest research in the field leverages fine-tuning of existing state-of-the-art language models (e.g., span-based pretraining of BERT models), which are repurposed for the task of coreference resolution [17,23].

Advances in Text Summarization
Automatic text summarization is the process of computationally shortening a set of data to create a subset that represents the most important information within the original content [24]. Summarization is an increasingly demanded task in natural language processing, as a means of surfacing the key information hidden in large volumes of textual data. At present, there exist two main methods for text summarization: abstractive and extractive [25]. In abstractive summarization, the summary is generated from novel sentences that paraphrase the original content, while in extractive summarization, the summary is composed of unmodified sentences from the original text.
Recent work in abstractive summarization involves the use of sequence-to-sequence frameworks based on attentional recurrent neural network (RNN) encoder-decoder [26][27][28] or transformer [29,30] architectures to generate concise summaries from input documents. With regard to extractive summarization, proposed approaches employ lexical features [31,32], statistical methods such as TF-IDF [33,34], or unsupervised learning techniques [35] to extract keywords and phrases from large corpora, with the most recent ones also involving transformer architectures [36]. While implementations following the abstractive approach more closely emulate human summarization, even those based on ANNs (and considered state-of-the-art) rely on large training corpora, have limited generalization on the document level, and usually suffer from semantic and grammatical errors [37]. Extractive summarization, on the other hand, has reached maturity and, although most extractive summaries may lack readability and cohesion, they generally succeed at capturing the key points of the digested text [38,39]. At the same time, since they just highlight portions of the original content, there is no danger of generating sentences with irrelevant or wrong interpretations, which is especially important in sensitive domains like biology or medicine.

Advances in Open Information Extraction
Open information extraction (OIE) systems aim at converting the unstructured information expressed in natural language into a more structured representation, in the form of relational triples consisting of a set of arguments (subject, object) and their semantic relation (predicate), e.g., <subject, predicate, object> [40]. Unlike closed information extraction approaches that are limited to a narrow set of predefined target relations, OIE systems are able to extract any kind of relation, providing increased scalability and usability over heterogeneous corpora [41]. In order to extract OIE triples, most approaches try to identify linguistic extraction patterns, which may be either hand-crafted or automatically learned from annotated data. Rule-based approaches that rely on hand-crafted extraction rules focus on syntactic constraints expressed as part-of-speech (POS)-based regular expressions [42,43]. Self-supervised learning approaches usually leverage annotated data sources (e.g., Wikipedia infoboxes) to train classifiers [44,45] or bootstrap a large training set over which they learn a set of extraction POS pattern templates [46]. Some recent OIE systems are clause-based, using linguistic knowledge about the grammatical and syntactic properties of the language to identify clause constituents, thus restructuring larger complex sentences to many simple ones [47,48]. The emergence of annotated corpora for OIE evaluation paved the way for supervised neural-based models that further pushed the state-of-the-art in this domain [49,50]. Latest approaches are extending the use of deep BIO taggers used for semantic role labeling (where B is assigned to the beginning of named entities, I is assigned to the interior, and O is assigned to other) by leveraging the word embeddings of the processed sentences in deep neural networks (e.g., bi-LSTMs) to produce probability distributions over possible BIO tags [51,52].

Advances in Entity Linking, Enrichment and Representation
Over the years, various entity enrichment approaches have been introduced, aiming at augmenting the usefulness of the extracted triples, including entity linking and polarity detection processes before representing them through a graph database. Linking the identified entity mentions in text to an ontology or dictionary is considered an essential step in creating informative triples, with various knowledge bases (KBs) pertaining to general knowledge being employed for this purpose based on morphological similarity, such as DBpedia [53], Freebase [54], and Yago [55]. There are cases, however, where many of the entities in these general-interest KBs are irrelevant for certain applications, therefore domain-specific ontologies for semantic information brokering, based on inter-ontology relationships such as synonyms, hyponyms, and hypernyms of the extracted entities are used [56]. In order to further increase the links between morphologically dissimilar extracted entities and KB-related objects, neural-based methods are also implemented, exploiting word embeddings to represent semantic spaces [57,58], also allowing for domain-agnostic entity resolution [59]. With regard to sentiment analysis (neutral vs. emotionally loaded) and polarity (positive vs. negative) detection of a text [60], lexicon-based [61], ML-based [62], and neural-based [63] classifiers are commonly used to identify the polarity of a relation within a sentence. In the field of graph representation, the two main implementations of graph models include resource description framework (RDF) triple stores [64] and labeled property graphs (LPG) [65]; both provide ways to explore and graphically depict connected data.

Materials and Methods
Our proposed methodology introduces a processing pipeline that takes as input a large natural language corpus consisting of many documents (e.g., scientific articles) and provides a structured representation of the extracted information in the form of open information triples as output, allowing for interactive data exploration. The system comprises the following components:
1. an in-place neural coreference resolution component,
2. an extractive text summarizer that isolates the key points of the ingested text,
3. a parallel triple extraction component as our core information extraction method, and
4. a toolkit of entity enrichment and representation techniques built around a graph engine.
More information about the data used to demonstrate our methodology and the technical specifications of each component are given in Sections 3.1 and 3.2, respectively. An overview of our pipeline is depicted in Figure 1.
Appl. Sci. 2020, 10, 5630

Data
For the purposes of our work and in response to the recent COVID-19 pandemic, we leveraged the COVID-19 Open Research Dataset (CORD-19) [66], provided by the Allen Institute for AI. The dataset consists of different subsets for commercial and non-commercial usage, collectively including over 40,000 full-text articles about the coronavirus family of viruses, to be used by the global research community. We specifically focused on the "Commercial use subset", which at the time of access (8 April 2020) contained 9365 articles from PubMed Central (PMC), a free full-text archive of biomedical and life sciences journal literature at the U.S. National Institutes of Health's National Library of Medicine. Each paper is represented as a single JSON object, whose schema contains a unique paper ID, the paper's title and authors list, an abstract, the main body text, and its corresponding bibliographic entries. The dataset is constantly updated, aiming at facilitating the development of text mining and information retrieval systems over its rich collection of metadata and structured full-text papers, in support of the fight against the COVID-19 disease.
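As a minimal sketch of how one such JSON object might be flattened for the later pipeline steps, the snippet below reads the fields listed above. The key names (`paper_id`, `metadata`, `abstract`, `body_text`, `bib_entries`) and the sample record are assumptions made for illustration; the text above names the schema's contents but not its exact keys.

```python
import json

# Hypothetical record mimicking the described schema; key names are assumptions.
SAMPLE = json.dumps({
    "paper_id": "0000example",
    "metadata": {"title": "An example title",
                 "authors": [{"first": "A.", "last": "Author"}]},
    "abstract": [{"text": "Short abstract."}],
    "body_text": [
        {"section": "Introduction", "text": "First paragraph."},
        {"section": "Methods", "text": "Second paragraph."},
    ],
    "bib_entries": {"BIBREF0": {"title": "A cited work"}},
})


def load_paper(raw: str) -> dict:
    """Flatten one paper's JSON into the fields used downstream."""
    paper = json.loads(raw)
    return {
        "paper_id": paper["paper_id"],
        "title": paper["metadata"]["title"],
        # Abstract and body are assumed to be lists of paragraph objects.
        "abstract": " ".join(p["text"] for p in paper.get("abstract", [])),
        "body_text": " ".join(p["text"] for p in paper.get("body_text", [])),
        "n_references": len(paper.get("bib_entries", {})),
    }


paper = load_paper(SAMPLE)
print(paper["paper_id"], "-", paper["title"])
```

Joining the paragraphs into a single string keeps the sketch short; a real loader would probably preserve section boundaries.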

Implementation
The following subsections (Sections 3.2.1-3.2.4) are devoted to each of the four steps of our proposed methodology. Each subsection follows the same pattern: a brief description of the process followed by a detailed analysis of the technical implementation. Finally, a number of examples based on the CORD-19 dataset are given to illustrate how this pipeline may be operationalized for other real-world scenarios.

In-Place Coreference Resolution
Given that our information retrieval task requires the extraction of dependency relations from sentences, i.e., sets of the form {subject, predicate, object}, and that in many cases the entity is replaced with its coreferential pronoun (e.g., "Mary is a nice person, I like hanging out with her" rather than "Mary is a nice person, I like hanging out with Mary"), we consider in-place coreference resolution on each article's body text a crucial pre-processing step for improving the quality of the extracted triples. Therefore, in creating our information extraction pipeline, we leveraged the pretrained neural coreference resolution tool from AllenNLP [67], which implements a variant of the end-to-end coreference resolution model of Lee et al. [68] using SpanBERT embeddings [23]. The model had been trained on the OntoNotes 5.0 dataset (the largest coreference-annotated corpus) [69], achieving an F1-score of 78.87% on the test set.
Each article of the CORD-19 dataset was pre-processed by the in-place coreference resolution component, where all noun phrases (mentions) referring to the same entity were substituted with that entity. The pretrained neural model from AllenNLP provided good results even in complex situations containing challenging pronoun disambiguation problems, thus facilitating the creation of more informative triples. Indicative examples of the performed coreference substitutions on article extracts are provided in Table 1.
Table 1. In-place coreference resolution on CORD-19 dataset extracts. The patterns of coreference are annotated with subscripts. The anaphors (orange) are replaced by the antecedent to which they refer (green).

Article ID Extract
Two articles in the top ten cited articles discussed the emergence of New Delhi metallo-β-lactamase (NDM) gene {1}
Official health linkages {1} have served to promote good will in some otherwise difficult relationships, as has been the case with Indonesia. They {1} Official health linkages {1} have also helped to promote a positive international image for Australia.
However, one should also note that the experiment is based on labelling and quantifying proteins about 4 h post-infection. This relatively early time point {1} allows one to minimize potentially confounding influences of virion particle assembly and production on cytoplasmic levels of viral proteins, but it {1} This relatively early time point {1}
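The substitution step itself can be illustrated independently of the neural model. The following is a minimal sketch, assuming the coreference model has already produced clusters as lists of inclusive `[start, end]` token spans (this span format and the function name `resolve_in_place` are assumptions for illustration). It replaces every later mention in a cluster with the cluster's first mention and does not handle nested or overlapping spans.

```python
from typing import List


def resolve_in_place(tokens: List[str], clusters: List[List[List[int]]]) -> str:
    """Replace each later mention of a cluster with the cluster's first mention.

    `clusters` holds, per entity, a list of [start, end] token spans
    (end inclusive). The span format is an assumption of this sketch.
    """
    replacements = {}  # start index -> (end index, antecedent tokens)
    for spans in clusters:
        ordered = sorted(spans)
        first_start, first_end = ordered[0]
        antecedent = tokens[first_start:first_end + 1]
        for start, end in ordered[1:]:
            replacements[start] = (end, antecedent)

    out, i = [], 0
    while i < len(tokens):
        if i in replacements:
            end, antecedent = replacements[i]
            out.extend(antecedent)   # write the antecedent instead of the anaphor
            i = end + 1              # skip the tokens of the replaced mention
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


tokens = ("Official health linkages have served to promote good will . "
          "They have also helped .").split()
clusters = [[[0, 2], [10, 10]]]  # "Official health linkages" <- "They"
resolved = resolve_in_place(tokens, clusters)
print(resolved)
```

Taking the first mention as the antecedent is a simplification: when the first mention is itself a pronoun, a more careful selection (for example, the longest noun phrase in the cluster) would be needed.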

Extractive Text Summarization
We consider text summarization as the second step of our information extraction pipeline for two reasons: i. extracting triples directly from large corpora (such as the CORD-19 dataset) is both costly and time-inefficient due to the increased computational resources required by information extraction engines, and ii. in most cases, only a small fraction of the extracted triples is useful and/or relevant to the discussed topic. Relying on the articles' abstracts, whenever applicable, is a reasonable alternative; however, we run the risk of processing text that contains short descriptions of the article's purpose, along with broad, high-level specifications of little contextual value. Hence, we argue that an extractive summarization method is the optimal way to reduce a document's text length by omitting peripheral or inappropriate information, while highlighting its key features for triple extraction. To this end, we relied on the HuggingFace extractive summarizer library based on Miller's transformer-based approach [36]. For our implementation, and since we were dealing mainly with biomedical articles, we substituted the vanilla BERT model with AllenAI's SciBERT pretrained model for scientific text, which significantly outperforms BERT-base on most NLP-related tasks, including named entity recognition, sequence labeling, and text classification [70].
We iteratively passed the CORD-19 articles (previously submitted to in-place coreference resolution) through the summarizer, where all sentences were embedded into the multi-dimensional space using SciBERT embeddings. Subsequently, k-means clustering was used on the sentence representations to identify those sentences that were closest to the cluster's centroids for summary selection. The ratio of sentences comprising each summary to the original text was provided as parameter (0.2) to the model. The result was an extractive summary for each one of the 9365 ingested articles, with an average count of 843 words per summary, which resulted in a significantly more condensed corpus, compared to an average count of 4482 words per original body text. An example of the text summarization process is provided in Table A1 of Appendix A, while detailed information on the word reduction per article is depicted in Figure 2 below. The generated summaries, along with respective metadata (paper_id, title), comprised the corpus for the next phase of our pipeline, parallel triple extraction.
Figure 2. Word reduction per paper using extractive summarization. The original body text of each article (blue) is replaced by its summary (green). This preprocessing step aims at selecting the most representative sentences of the given documents, despite their original length.
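The centroid-based selection described in this subsection can be sketched in a few lines. The example below is an illustration only, not the summarizer library's code: it uses a plain, deterministic k-means written in pure Python, and two-dimensional toy vectors stand in for the SciBERT sentence embeddings. The function names `kmeans` and `summarize` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _dist(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points: List[Vector], k: int, iters: int = 20) -> List[List[float]]:
    """Plain Lloyd's k-means; centroids start at evenly spaced points."""
    step = len(points) / k
    centroids = [list(points[int(i * step)]) for i in range(k)]
    for _ in range(iters):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: _dist(p, centroids[c]))
            groups[nearest].append(p)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids


def summarize(sentences: List[str], embeddings: List[Vector],
              ratio: float = 0.2) -> List[str]:
    """Keep the sentence nearest to each centroid, in document order."""
    k = max(1, round(len(sentences) * ratio))
    chosen = set()
    for centroid in kmeans(embeddings, k):
        chosen.add(min(range(len(sentences)),
                       key=lambda i: _dist(embeddings[i], centroid)))
    return [sentences[i] for i in sorted(chosen)]


# Toy data: ten "sentences" forming two well-separated groups.
sentences = [f"s{i}" for i in range(10)]
embeddings = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
              (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
summary = summarize(sentences, embeddings, ratio=0.2)
print(summary)
```

With a ratio of 0.2, ten sentences yield two clusters, and the sentence lying at each cluster's centre is kept. If two centroids select the same sentence, the summary is shorter than the requested ratio.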

Parallel Triple Extraction
Our approach aims at distilling knowledge from texts that can be used directly for end-user applications such as structured searches (e.g., "Find all triples containing SARS-CoV-2 as a Subject along with all related papers"). Therefore, triple extraction is the third and core step of our knowledge extraction pipeline, in an effort to discover relations between the entities of the CORD-19 corpus. Instead of restricting ourselves to a subset of relations in a pre-defined ontology (closed information extraction), we leveraged an open information extraction (OIE) approach, where triple formulation is defined as the task of generating a structured, machine-readable representation of the information in text, with the length of each triple element varying from a single word to a short text phrase. Moreover, instead of limiting ourselves to a single engine, we combined the three most popular OIE systems, namely the Open IE system from the University of Washington (UW) and Indian Institute of Technology, Delhi (IIT Delhi, India) [71], ClausIE from Max Planck Institute (MPI) [72], and the neural OIE model from AllenNLP [73]. A brief explanation of the intuition behind each system is provided below:

1. Open IE 5.1 from UW and IIT Delhi is a successor to the Ollie learning-based information extraction system [74]. The latest version is based on the combination of four different rule-based and learning-based OIE tools; namely CALMIE (specializing in triple extraction from conjunctive sentences) [75], RelNoun (for noun relations) [76], BONIE (for numerical sentences) [77], and SRLIE (based on semantic role labeling) [78].

2. ClausIE from MPI follows a clause-based approach, first identifying the clause type of each sentence and then applying specific proposition extraction based on the corresponding grammatical function of the clause's constituents. It also considers nested clauses as independent sentences. Because ClausIE detects useful pieces of information expressed in a sentence before representing them in terms of one or more extractions, it is especially useful in splitting complex sentences into many individual triples [41].

3. AllenNLP OIE system formulates the triple extraction problem as a sequence BIO tagging problem and applies a bi-LSTM transducer to produce OIE tuples, which are grouped by each sentence's predicate [72]. Given that it relies on supervised learning and contextualized word embeddings to produce independent probability distributions over possible BIO tags for each word, it has the potential of discovering richer and more complex relations. On the downside, it is not guaranteed that the neural sequence tagger will produce exactly two arguments for a given predicate (i.e., a subject and an object), thus complicating the triple extraction process.
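To make the last point concrete, the sketch below turns a BIO tag sequence into a triple and rejects predicates that do not come with exactly two arguments. The tag labels (`ARG0`, `ARG1`, `V`) and the function names are assumptions chosen for illustration, and the tags are written by hand rather than produced by a tagger.

```python
from typing import Dict, List, Optional, Tuple


def spans_from_bio(tokens: List[str], tags: List[str]) -> Dict[str, str]:
    """Group tokens into labelled phrases from a BIO tag sequence."""
    spans: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new phrase begins
            current = tag[2:]
            spans[current] = [token]
        elif tag.startswith("I-") and current == tag[2:]:
            spans[current].append(token)    # continue the open phrase
        else:                               # "O" or an inconsistent I- tag
            current = None
    return {label: " ".join(words) for label, words in spans.items()}


def to_triple(spans: Dict[str, str]) -> Optional[Tuple[str, str, str]]:
    """Return (subject, predicate, object) only if exactly two arguments exist."""
    args = [spans[label] for label in sorted(spans) if label.startswith("ARG")]
    if "V" not in spans or len(args) != 2:
        return None
    return (args[0], spans["V"], args[1])


tokens = "The new genome sequence was obtained".split()
tags = ["B-ARG0", "I-ARG0", "I-ARG0", "I-ARG0", "B-V", "B-ARG1"]
triple = to_triple(spans_from_bio(tokens, tags))
print(triple)

# A predicate with a single argument cannot form a triple.
incomplete = to_triple(spans_from_bio(["It", "rains"], ["B-ARG0", "B-V"]))
print(incomplete)
```

Dropping incomplete extractions is only one possible policy; merging extra arguments into the object phrase would be an alternative.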
We implemented a parallel triple extraction sequence based on the aforementioned OIE systems, where the extractive summary of every CORD-19 article from the previous step was passed through each one of the three engines. Given that Open IE 5.1 is based on handcrafted extraction heuristics and automatically constructed triple extractors, that ClausIE follows a rule-based approach which exploits linguistic knowledge about the grammar (clause types) of the English language, and that the AllenNLP OIE system is dependent on the context's vector representation to detect parts of speech, we relied on the complementarity between the different approaches to ensure maximum recall. The only shortcoming of this method is that it inevitably leads to a higher duplication rate compared to using a single engine; however, this issue was effectively tackled in the next and final stage of our pipeline. Example cases with extracted triples showcasing the complementarity of the different OIE engines are shown in Table 2.
Table 2. Parallel triple extraction using different OIE engines. The left column shows the processed sentence and the Source ID of the corresponding CORD-19 article. The middle column shows the derived triples, while the right column denotes the engine(s) that discovered each triple (O: Open IE, C: ClausIE, A: AllenNLP OIE).

Sentence | Extracted Triples (S/P/O) | Engine

"RA and PBD blocked the attachment of IAV and 3C-like protease (3CLP) of severe acute respiratory syndrome-associated coronaviruses plays a pivotal role in viral replication and is a promising drug target", Source: c85ca5217f9051f83911569eed1eb52cf992f7dd

"Involvement of polyamines, possibly due to loss of epigenetic control of X-linked polyamine genes/is suspected in SjS since the appearance of acrolein conjugated proteins is related to the intensity of SjS and acrolein is an oxidation product of polyamines (88)", Source: 8495f7c65f4a6cbce0e0d53c0900f10f6740826e
  Involvement of polyamines possibly due to loss of epigenetic control of X-linked polyamine genes/is suspected/in SjS since the appearance of acrolein conjugated proteins is related to the intensity of SjS

"The new genome sequence was obtained by first mapping reads to a reference SARS-CoV-2 genome using BWA-MEM 0.7.5a-r405 with default parameters to generate the consensus sequence.", Source: b5d303cbcfe6be92d733ec593118b388db77452e
  The new genome sequence/obtained/by first mapping reads to a reference SARS-CoV-2 genome | O, C, A
  a reference SARS-CoV-2 genome/be using/BWA-MEM 0.7.5a-r405 with default parameters to generate the consensus sequence | O
  BWA-MEM 0.7.5a-r405/to generate/the consensus sequence | C
  a reference SARS-CoV-2 genome/using/BWA-MEM 0.7.5a-r405 to generate the consensus sequence | C
  The new genome sequence/was/obtained | A

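The parallel step and the bookkeeping of which engine found which triple can be sketched as follows. The three extractor functions are hypothetical stand-ins that return hard-coded triples rather than calling Open IE 5.1, ClausIE, or AllenNLP; only the fan-out and merge logic is the point of the example, and threads are used purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]

T1 = ("The new genome sequence", "obtained",
      "by first mapping reads to a reference SARS-CoV-2 genome")
T2 = ("BWA-MEM 0.7.5a-r405", "to generate", "the consensus sequence")
T3 = ("The new genome sequence", "was", "obtained")


# Hypothetical stand-ins for the real engines (hard-coded outputs).
def fake_open_ie(text: str) -> List[Triple]:
    return [T1]


def fake_clausie(text: str) -> List[Triple]:
    return [T1, T2]


def fake_allennlp(text: str) -> List[Triple]:
    return [T1, T3]


def extract_parallel(text: str,
                     engines: Dict[str, Callable[[str], List[Triple]]]
                     ) -> Dict[Triple, List[str]]:
    """Run every engine on the same text and record which engines found each triple."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {code: pool.submit(fn, text) for code, fn in engines.items()}
        results = {code: future.result() for code, future in futures.items()}
    merged: Dict[Triple, List[str]] = {}
    for code, triples in results.items():   # engine order is preserved
        for triple in triples:
            merged.setdefault(triple, []).append(code)
    return merged


engines = {"O": fake_open_ie, "C": fake_clausie, "A": fake_allennlp}
merged = extract_parallel("The new genome sequence was obtained ...", engines)
for triple, codes in merged.items():
    print("/".join(triple), "|", ", ".join(codes))
```

Keeping the engine codes alongside each triple mirrors the Engine column of Table 2 and makes exact duplicates across engines visible before any later cleaning.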
Entity Enrichment & Graph Representation
Our methodology concludes with a series of post-processing activities, including linking the extracted entities to an existing ontology, performing polarity detection on the phrases related to each triple, and cleaning the duplicate triples that were extracted via the parallel execution of the aforementioned OIE engines, before representing them through a graph data modeling process. More details about the technical implementation of each subtask are given below:
1. Entity linking: We leveraged the EntityLinker component from SciSpacy [79], a Python package containing models for processing biomedical, scientific, or clinical text. The component was used to perform a string overlap-based search (character 3-grams) on named entities, comparing them with the concepts of the UMLS (Unified Medical Language System) knowledge base, using an approximate nearest neighbors search. The UMLS knowledge base contains over four million concepts along with additional information (e.g., definitions, hierarchies, concept-concept relations) from many health and biomedical vocabularies and standards (including CPT, ICD-10-CM, LOINC, MeSH, RxNorm, and SNOMED CT), enabling interoperability between computer systems [80]. For each triple subject or object with one or more entities linked to UMLS concepts, the coded concept name, concept description, and confidence score of the linking process were added along with the existing triple information. In order to address the ambiguity of biomedical terminology, we exploited the parametrization capabilities of the SciSpacy EntityLinker component, mainly by using the resolve_abbreviations parameter to resolve any abbreviations identified in the corpus before performing the linking and by tuning the threshold that a mention candidate must reach to be linked to a specific UMLS concept. Of course, this barely scratches the surface of biomedical term disambiguation, which remains a challenging task [81]. Overall, this entity enrichment process not only increases the contextual value of the extracted triples, but also facilitates the search for specific ontology concepts by mapping the existing entities to their normalized lexical variants (aliases).

2. Polarity detection: By implementing polarity detection at the sentence level, we were able to classify triples that were inherently bearing a positive or a negative value. We relied on AllenNLP's RoBERTa-based binary sentiment analyzer [63], which achieves 95.11% accuracy on the Stanford Sentiment Treebank test set.

3. Triple cleaning: This process is aimed at reducing the redundant triples that resulted from the parallel triple extraction process, as described in Section 3.2.3, while also reducing the number of non-informative triples with little contextual value. To this end, we considered only "fully-linked" triples, i.e., triples in which both the subject and the object were linked to at least one UMLS concept. For the remaining triples, we additionally implemented a deduplication process to keep only the unique ones, based on the concepts mentioned in each sentence's subjects and objects. In this manner, only one triple containing the same UMLS concepts was stored for each sentence.

4. Continuity representation: In order to enhance the readability of the extracted information, we implemented a script that connected the extracted triples with each other, based on their order of appearance in the original text. This way, the user is able to unravel the scope of the targeted content by interacting with its structured representation.

5. Graph representation: This is the final step of our proposed pipeline, aiming at the practical interaction with the ingested data (e.g., visualizations and queries) that will facilitate data exploration tasks. Due to the rich internal structure that characterizes labeled property graphs, allowing each node or relationship to store more than one property (thus reducing the graph's size), we used the Neo4j graph database, which is based on labeled property graphs, for storing, representing, and enriching the extracted relationships [82]. Neo4j is a Java-based graph database management system, described by its developers as an ACID-compliant transactional database with native graph storage and processing. It provides a web interface allowing relationships to be queried via Cypher, a declarative graph query language that allows for expressive and efficient data querying in property graphs.
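Steps 1 and 3 of the list above can be illustrated with a minimal Python sketch. It is not the SciSpacy implementation: Jaccard similarity over character 3-grams stands in for SciSpacy's approximate nearest neighbors search, whitespace tokenization stands in for named entity recognition, and the knowledge base, its concept IDs (X001 to X003), the function names, and the sample triples are all hypothetical. The deduplication key (sentence number, subject concepts, object concepts) is one possible reading of the cleaning rule described in step 3.

```python
from typing import Any, Dict, FrozenSet, List, Set

# Toy knowledge base: hypothetical concept IDs mapped to aliases (not real UMLS codes).
TOY_KB: Dict[str, List[str]] = {
    "X001": ["chloroquine"],
    "X002": ["sars-cov-2", "2019-ncov"],
    "X003": ["remdesivir"],
}


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a lower-cased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def link(mention: str, threshold: float = 0.7) -> Set[str]:
    """Link a mention to every concept with an alias whose 3-gram overlap reaches the threshold."""
    grams = char_ngrams(mention)
    linked: Set[str] = set()
    for concept_id, aliases in TOY_KB.items():
        for alias in aliases:
            other = char_ngrams(alias)
            union = grams | other
            score = len(grams & other) / len(union) if union else 0.0
            if score >= threshold:
                linked.add(concept_id)
    return linked


def link_phrase(phrase: str, threshold: float = 0.7) -> FrozenSet[str]:
    """Link each whitespace token of a phrase (crude stand-in for NER)."""
    concepts: Set[str] = set()
    for token in phrase.split():
        concepts |= link(token, threshold)
    return frozenset(concepts)


def clean(triples: List[Dict[str, Any]], threshold: float = 0.7) -> List[Dict[str, Any]]:
    """Keep fully-linked triples, one per (sentence, subject concepts, object concepts)."""
    seen = set()
    kept = []
    for triple in triples:
        subj = link_phrase(triple["subject"], threshold)
        obj = link_phrase(triple["object"], threshold)
        if not subj or not obj:
            continue  # drop triples that are not fully linked
        key = (triple["sent_num"], subj, obj)
        if key in seen:
            continue  # same concepts already stored for this sentence
        seen.add(key)
        kept.append({**triple, "subj_concepts": sorted(subj), "obj_concepts": sorted(obj)})
    return kept


# Hypothetical output of two OIE engines ("A" and "B") for two sentences.
SAMPLE_TRIPLES = [
    {"sent_num": 1, "subject": "Chloroquine", "predicate": "inhibits",
     "object": "SARS-CoV-2 infection", "engine": "A"},
    {"sent_num": 1, "subject": "chloroquine", "predicate": "is effective against",
     "object": "SARS-CoV-2", "engine": "B"},
    {"sent_num": 2, "subject": "We", "predicate": "evaluated",
     "object": "remdesivir", "engine": "A"},
]

cleaned = clean(SAMPLE_TRIPLES)
```

In this toy run, the second triple is dropped as a duplicate of the first (same sentence, same linked concepts), and the third is dropped because its subject "We" links to no concept. Raising the threshold makes linking stricter, which reduces wrong links at the cost of missing lexical variants.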
The final structure of the enriched triple extraction output following the above processes is shown in Table 3 for a single triple. The graph representation consists of three distinct types of nodes, which are linked via three types of edges:
• Corpus nodes (purple): they signify the ingested text (e.g., scientific articles), are represented by the article_title and the unique article_id properties, and are connected to one or more triples via the contains triples one-directional edge.
• Subject nodes (green): they denote the subject of the extracted triple and are connected to one or more corpora (as mentioned above), as well as to one or more Objects, via predicate one-directional edges. The Subject node has the following properties: value: the natural language text of the subject; subj_entity_name: the name(s) of the entities linked to UMLS concepts; subj_entity_coded: the coded ID(s) of the entities linked to UMLS concepts; subj_entity_description: a short description of the linked entities; subj_entity_confidence: the reported confidence of the entity linking process; article_id: the unique ID of the extracted article; article_title: the title of the extracted article; sentence_text: the text of the processed sentence; sent_num: the serial number of the sentence in the whole corpus; triple_num: the serial number of the triple extracted from the given sentence; and engine: the OIE engine from which the triple was extracted.
• Object nodes (red): they denote the object of the extracted triple and are connected to their respective Subject through the predicate edge (as mentioned above), as well as to the Subject of the following triple in the corpus (if one exists) via the followed by one-directional edge. Their properties are identical to those of the Subject nodes.
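The node and edge types described above can be sketched as an in-memory labeled property graph. This is only an illustration of the data model, not the Neo4j loading code used in the pipeline: the Graph and Node classes, the build_graph function, the underscore spelling of the edge names, the choice to store the predicate text as an edge property, and the sample triples are all assumptions, and only a few of the listed node properties are carried over.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple


@dataclass
class Node:
    label: str  # "Corpus", "Subject" or "Object"
    props: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Graph:
    nodes: List[Node] = field(default_factory=list)
    # Each edge is (source index, edge type, target index, edge properties).
    edges: List[Tuple[int, str, int, Dict[str, Any]]] = field(default_factory=list)

    def add_node(self, label: str, **props: Any) -> int:
        self.nodes.append(Node(label, props))
        return len(self.nodes) - 1

    def add_edge(self, src: int, kind: str, dst: int, **props: Any) -> None:
        self.edges.append((src, kind, dst, props))


def build_graph(article_id: str, article_title: str,
                triples: List[Dict[str, Any]]) -> Graph:
    """Build the nodes and edges of one article; triples must be in text order."""
    graph = Graph()
    corpus = graph.add_node("Corpus", article_id=article_id, article_title=article_title)
    previous_object: Optional[int] = None
    for triple in triples:
        subj = graph.add_node("Subject", value=triple["subject"], sent_num=triple["sent_num"])
        obj = graph.add_node("Object", value=triple["object"], sent_num=triple["sent_num"])
        graph.add_edge(corpus, "contains_triples", subj)
        graph.add_edge(subj, "predicate", obj, value=triple["predicate"])
        if previous_object is not None:
            # Continuity: the previous Object points to the next Subject.
            graph.add_edge(previous_object, "followed_by", subj)
        previous_object = obj
    return graph


# Hypothetical triples for a hypothetical article.
example_graph = build_graph(
    "article-1",
    "An example article",
    [
        {"sent_num": 1, "subject": "chloroquine", "predicate": "functions at",
         "object": "viral entry"},
        {"sent_num": 2, "subject": "remdesivir", "predicate": "targets",
         "object": "RdRp"},
    ],
)
```

With two triples, the sketch yields one Corpus node, two Subject nodes, and two Object nodes, plus a single followed_by edge from the first Object to the second Subject.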
An example of the graph representation, showing the aforementioned nodes and edges for one CORD-19 article, is given in Figure 3.
Table 3. Sample output of the information extraction pipeline. The triple (subject-predicate-object) extracted from the sentence denoted in the sentence_text field is linked with its corresponding UMLS entities (subj_entity_coded, obj_entity_coded fields).

Results
After submitting the CORD-19 articles to our information extraction pipeline, we acquired 411,189 triples that contain subjects and objects linked to at least one UMLS entity. A compressed file containing the extracted triples is available online (https://github.com/lighteternal/CORD-19-OIEtriple-extraction). In the following subsections, we evaluate the validity of our information extraction process and present a number of indicative data exploration tasks on the CORD-19 dataset, which are enabled by the structured representation of the extracted information.

Evaluation of the Information Extraction Process
The evaluation of OIE systems on domain-specific corpora is generally challenging, mainly due to the lack of gold extractions (i.e., valid, manually annotated triples) for the specific domain. The common approach to tackle this lack of gold standards and metrics is to annotate a small subset of the extracted triples for correctness, thus yielding a precision measure as the ratio of valid extracted triples over the total number of extracted triples [83]. This approach, however, does not measure the extent to which actual valid triples are being overlooked (sensitivity), since that requires the total population of potentially valid triples. In order to acquire indicative evaluation metrics for our methodology, we manually generated all possible triples from a subset of 50 sentences (one sentence usually generates more than one triple) and calculated both the precision (as defined above) and the recall as the proportion of valid triples extracted by our pipeline to the total number of valid triples (automatically extracted and hand-crafted). We also calculated the F1-score as the harmonic mean of precision and recall. It should be noted that during the manual triple generation process, we only considered triples that could be potentially linked to UMLS entities (e.g., triples from sentences such as "We discovered a mutation of the virus" would not count as valid, because the subject "we" does not correspond to any UMLS entity). Evaluation results are provided in Table 4.
Table 4. Evaluation metrics of the information extraction pipeline, on a subset of 50 randomly selected sentences of the CORD-19 corpus. Precision is the ratio of valid triples from our approach among the total generated triples, while recall is the ratio of valid triples from our approach among the total valid ones (generated and hand-crafted). F1-score is the harmonic mean of precision and recall.

Metric      Value
Precision   0.78
Recall      0.76
F1-score    0.77

Although a direct comparison with other approaches (e.g., end-to-end methods, OIE engines) is infeasible, as it would require their implementation on the same dataset, these results suggest that the complementarity of our methodology leads to better performance, at least compared to standalone OIE engines. This is also supported by the experimental results of several OIE systems on different datasets, with precision, recall, and F-measure barely surpassing the 0.7 threshold for a large number of extractions [47,84].
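The metric definitions used in this evaluation can be written down as a short Python sketch. The function names and the toy triples are hypothetical; the set of valid triples is assumed to be the union of the valid automatically extracted triples and the hand-crafted ones, as described above.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(extracted: Set[Triple], valid: Set[Triple]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of the extracted triples against all valid triples."""
    true_positives = len(extracted & valid)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(valid) if valid else 0.0
    return precision, recall, f1_score(precision, recall)


# Toy example: three of the four extracted triples are valid, one valid triple is missed.
toy_extracted = {("a", "r", "b"), ("c", "r", "d"), ("e", "r", "f"), ("g", "r", "h")}
toy_valid = {("a", "r", "b"), ("c", "r", "d"), ("e", "r", "f"), ("i", "r", "j")}
toy_metrics = evaluate(toy_extracted, toy_valid)

# Consistency check of the values reported in Table 4.
reported_f1 = round(f1_score(0.78, 0.76), 2)
```

Plugging the reported precision (0.78) and recall (0.76) into the harmonic mean gives 0.77 after rounding, which matches the F1-score in Table 4.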

Data Exploration Tasks
We present a number of indicative data exploration tasks on the CORD-19 dataset to demonstrate the capabilities of our information extraction pipeline.

1. The first task focuses on visualizing triples from the CORD-19 bibliography whose subject refers to the SARS-CoV-2 virus, the strain of coronavirus that causes the COVID-19 disease. The domain expert can either query the database for subjects containing the name of the virus or use the UMLS coded ID of SARS-CoV-2 (C5203676) to acquire results containing all the corresponding aliases of the entity. These results are available in both tabular and graphical form (Table 5, Figure 4).

2. The second task attempts to discover useful relationships regarding IL-6, a pleiotropic proinflammatory cytokine that is found at increased levels in patients with COVID-19 and similar viral infections. By performing a targeted search on subjects containing the "virus" entity and objects containing the "IL-6" entity, we get the results shown in Table 6 (in natural language form) and Figure 5 (as UMLS coded entities). This time, we are also interested in the title of the scientific article from which the triple was extracted.

3. The final data exploration task allows us to focus on one of the articles and exploit the continuity representation functionality (followed by edges) to traverse through the generated triples. The result is akin to a graphical summary of the processed article. The generated chain consists of alternating subject/object nodes, depicting the sequence of their appearance in the original text (Figure 6).
Table 6. Sample of extracted triples showing relations between the entities "IL-6" and "virus" (some properties of the related nodes have been omitted for better visibility).

Article | Subject | Predicate | Object | UMLS in Subj.
Distinct Regulation of Host Responses by ERK and JNK MAP Kinases in Swine
Figure 5. Sample of the Neo4J graph visualization of the triples showing relations between the coded UMLS IDs of "virus" (green) and "IL-6" (red) found in two CORD-19 articles (purple). Other UMLS entities (e.g., "H5N1", "MCP-1") are also present in the Subject and Object nodes.
Figure 6. A sample sequence of extracted triples from an article's (purple) sentences as alternating subject (green) and object (red) nodes, linked with predicate (grey) and followed_by (yellow) edges.
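The first and third exploration tasks can be mimicked without a database. The sketch below is an in-memory stand-in for the Cypher queries, which are not listed in the text. The field names follow the node properties described earlier, C5203676 is the SARS-CoV-2 concept ID quoted in the first task, and everything else (record values, the X003 identifier, the article ID, and the function names) is hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical enriched triple records for one hypothetical article.
RECORDS: List[Dict[str, Any]] = [
    {"article_id": "article-1", "sent_num": 1, "triple_num": 1,
     "subject": "SARS-CoV-2", "subj_entity_coded": ["C5203676"],
     "predicate": "belongs to", "object": "Betacoronavirus"},
    {"article_id": "article-1", "sent_num": 3, "triple_num": 1,
     "subject": "remdesivir", "subj_entity_coded": ["X003"],
     "predicate": "targets", "object": "RdRp"},
    {"article_id": "article-1", "sent_num": 2, "triple_num": 1,
     "subject": "2019-nCoV", "subj_entity_coded": ["C5203676"],
     "predicate": "uses", "object": "ACE2"},
]


def subjects_with_concept(records: List[Dict[str, Any]], concept_id: str) -> List[Dict[str, Any]]:
    """Task 1: triples whose subject is linked to the concept, whichever alias appears in the text."""
    return [r for r in records if concept_id in r["subj_entity_coded"]]


def triple_chain(records: List[Dict[str, Any]], article_id: str) -> List[str]:
    """Task 3: alternating subject/object values in order of appearance in the article."""
    ordered = sorted(
        (r for r in records if r["article_id"] == article_id),
        key=lambda r: (r["sent_num"], r["triple_num"]),
    )
    chain: List[str] = []
    for record in ordered:
        chain.extend([record["subject"], record["object"]])
    return chain


sars_cov_2_triples = subjects_with_concept(RECORDS, "C5203676")
article_chain = triple_chain(RECORDS, "article-1")
```

Searching by concept ID rather than by surface string returns both the "SARS-CoV-2" and the "2019-nCoV" subjects, which is the benefit of the alias mapping described in the first task.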

Discussion
This section illustrates the impact of our approach on domain-specific information extraction tasks through an end-to-end example (Table 7), which showcases the added value of each component of our pipeline compared to a standalone OIE process. For better readability, we have included only a small extract of the processed article body.
As seen in the upper part of Table 7, the coreference resolution component correctly replaces an entity mention ("the bats") with its more semantically informative antecedent ("horseshoe bats"). Subsequently, the extractive summarization component identifies a set of salient points and groups them into a much shorter, concise corpus, which is subjected to the parallel triple extraction process. The extracted triples are finally passed through an entity enrichment and cleaning process, which results in storing only those linked to one or more UMLS concepts. Other provided information, including polarity, UMLS coded IDs, matching confidence scores, concept descriptions, and the Neo4J graph visualization of the extracted triples, is omitted in this example.
For the sake of comparison, a small subset of the triples derived from the application of a standalone, state-of-the-art OIE engine (ClausIE) on the raw article body is provided in the lower part of Table 7. It is apparent from the triple samples that the lack of pre-processing (coreference resolution, summarization) and post-processing (entity resolution, cleaning) methods usually leads to a large number of uninformative triples, even though some of them are syntactically correct. For example, the triple {"This finding", "implies", "a possible recombination event"}, although valid, carries little contextual value with regard to the aims of the article. In contrast, our approach seems to increase the expressiveness and informativeness of the derived triples, ensuring that they remain relevant to the context of the article.
It should be noted that the inclusion of the aforementioned pre-processing and post-processing operations increases the computational cost of our pipeline compared to adopting a standalone OIE approach; however, this is partially countered by the fact that the core triple extraction process is applied to a significantly shorter and more concise corpus. Furthermore, it is evident that, although the results seem encouraging, none of the pipeline components are guaranteed to produce optimal results in every complex situation. More specifically, the extractive summarization component may fail to capture all the key points of the article resulting in loss of information, the in-place coreference resolution component may fail to find the correct antecedent of a mention, the parallel triple extraction process may miss some triples involving compound syntactic phenomena, and the entity enrichment tool may perform a wrong linking to a knowledge base concept. This is to be expected in real-world machine learning applications, where extractive summarization can be conceived as a feature selection method, in-place coreference resolution can be considered as a feature transformation technique, and entity linking can be seen as a data filtering method, each of them contributing to the usefulness of the overall approach. As discussed in Section 4, the added value of our methodology stems from the effective combination of different NLP tasks, benefiting from their distinct characteristics in an attempt to provide a robust outcome.
Table 7. End-to-end example of information extraction on a CORD-19 article body. The highlighted text composes the extractive summary of the article. The coreference resolution component replaces the anaphors (orange) by their antecedent (green). The parallel information extraction component provides triples, which are linked with existing UMLS concepts via the entity enrichment component. The triples extracted by our pipeline are compared to those extracted using a standalone OIE engine directly on the article body.

CORD-19 Article ID: 85783a36e7e787302307f42460839435d665f4e7
Article Title: SARS-CoV-2: an Emerging Coronavirus that Causes a Global Threat
Article body: [...] Subsequently, coronaviruses with high similarity to the human SARS-CoV or civet SARS-CoV-like virus were isolated from horseshoe bats {1}, concluding the bats {1} horseshoe bats {1} as the potential natural reservoir of SARS-CoV whereas masked palm civets are the intermediate host [53][54][55][56]. It is thus reasonable to suspect that bat is the natural host of SARS-CoV-2 considering its similarity with SARS-CoV. The phylogenetic analysis of SARS-CoV-2 against a collection of coronavirus sequences from various sources found that SARS-CoV-2 belonged to the Betacoronavirus genera and was closer to SARS-like coronavirus in bat [19]. By analyzing genome sequence of SARS-CoV-2, it was found that SARS-CoV-2 felled within the subgenus Sarbecovirus of the genus Betacoronavirus and was closely related to two bat-derived SARS-like coronaviruses, bat-SL-CoVZC45 and bat-SL-CoVZXC21, but were relatively distant from SARS-CoV [15,18,[57][58][59]. Meanwhile, Zhou and colleagues showed that SARS-CoV-2 had 96.2% overall genome sequence identity throughout the genome to BatCoV RaTG13, a bat coronavirus detected in Rhinolophus affinis from Yunnan province [14]. Furthermore, the phylogenetic analysis of full-length genome, the receptor binding protein spike (S) gene, and RNA-dependent RNA polymerase (RdRp) gene respectively all demonstrated that RaTG13 was the closest relative of the SARS-CoV-2 [14]. However, despite SARS-CoV-2 showed high similarity to coronavirus from bat, SARS-CoV-2 changed topological position within the subgenus Sarbecovirus when different gene was used for phylogenetic analysis: SARS-CoV-2 was closer to bat-SL-CoVZC45 in the S gene phylogeny but felled in a basal position within the subgenus Sarbecovirus in the ORF1b tree [57]. This finding implies a possible recombination event in this group of viruses.
Of note, the receptor-binding domain of SARS-CoV-2 demonstrates a similar structure to that of SARS-CoV by homology modelling but a few variations in the key residues exist at amino acid level [15,19]. Despite current evidences are pointing to the evolutional origin of SARS-CoV-2 from bat virus [15,57], an intermediate host between bats and human might exist. Lu et al. raised four reasons for such speculation [15]: First, most bat species in Wuhan are hibernating in late December; Second, no bats in Huanan Seafood market were sold or found; Third, the sequence identity between SARS-CoV-2 and bat-SL-CoVZC45 or bat-SL-CoVZXC21, the closest relatives in their analyses, is lower than 90%; Fourth, there is an intermediate host for other human-infecting coronaviruses that origin from bat. For example, masked palm civet and dromedary camels are the intermediate hosts for SARS-CoV [49] and MERS-CoV respectively [60]. A study of the relative synonymous codon usage (RSCU) found that SARS-CoV-2, bat-SL-CoVZC45, and snakes had similar synonymous codon usage bias, and speculated that snake might be the intermediate host [61]. However, no SARS-CoV-2 has been isolated from snake yet. Pangolin was later found to be a potential intermediate host for SARS-CoV-2. The analysis of samples from Malytan pangolins obtained during anti-smuggling operations from Guangdong and Guangxi Customs of China respectively found novel coronaviruses representing two sub-lineages related to SARS-CoV-2 [62]. The similarity of SARS-CoV-2 to these identified coronaviruses from pangolins is approximately 85.5% to 92.4% in genomes, lower than that to the bat coronavirus RaTG13 (96.2%) [14,62]. However, the receptor-binding domain of S protein from one sub-lineage of the pangolin coronaviruses shows 97.4% similarity in amino acid sequences to that of SARS-CoV-2, even higher than that to RaTG13 (89.2%) [62].
Interestingly, the pangolin coronavirus and SARS-CoV-2 share identical amino acids at the five critical residues of RBD of S protein, while RaTG13 only possesses one [62]. The discovery of coronavirus close to SARS-CoV-2 from pangolin suggests that pangolin is a potential intermediate host. However, the roles of bat and pangolin as respective natural reservoir and intermediate host still need further investigation. As an emerging virus, there is no effective drug or vaccine approved for the treatment of SARS-CoV-2 infection yet. Currently, supportive care is provided to the patients, including oxygen therapy, antibiotic treatment, and antifungal treatment, extra-corporeal membrane oxygenation (ECMO) etc. [21,22]. To search for an antiviral drug effective in treating SARS-CoV-2 infection, Wang and colleagues evaluated seven drugs, namely, ribavirin, penciclovir, nitazoxanide, nafamostat, chloroquine, remdesivir (GS-5734) and favipiravir (T-750) against the infection of SARS-CoV-2 on Vero E6 cells in vitro [63]. Among these seven drugs, chloroquine and remdesivir demonstrated the most powerful antiviral activities with low cytotoxicity. The effective concentration (EC50) for chloroquine and remdesivir were 0.77 µM and 1.13 µM respectively. Chloroquine functions at both viral entry and post-entry stages of the SARS-CoV-2 infection in Vero E6 cells whereas remdesivir does at post-entry stage only. Chloroquine is a drug used for an autoimmune disease and malarial infection with potential broad-spectrum antiviral activities [64,65]. An EC90 (6.90 µM) against the SARS-CoV-2 in Vero E6 cells is clinically achievable in vivo according to a previous clinical trial [66]. Remdesivir is a drug currently under the development for Ebola virus infection and is effective to a broad range of viruses including SARS-CoV and MERS-CoV [67,68]. Functioning as an adenosine analogue targeting RdRp, Remdesivir can result in premature termination during the virus transcription [69,70].
The EC90 of remdesivir against SARS-CoV-2 in Vero E6 cells is 1.76 µM, which is achievable in vivo based on a trial in nonhuman primate experiment [63,69]

Conclusions
In this paper, a pipeline for efficient open information extraction, entity enrichment, and graph representation from unstructured text was presented. We analyzed the rationale and functioning of each preprocessing component comprising our methodology, namely in-place coreference resolution, which was applied on the raw CORD-19 corpus to replace pronouns with their original references, and extractive summarization, which was subsequently used to reflect the diverse topics of each article while keeping redundancy to a minimum. We integrated a parallel triple extraction process based on different OIE engines, relying on the complementarity of different information retrieval approaches (clause-based, learning-based, embeddings-based, etc.) to counter the loss of structural and semantic information. Finally, we combined a number of post-processing and information enrichment tasks, including entity linking with the Unified Medical Language System, polarity detection, triple deduplication, and continuity representation, to enhance the usefulness and readability of the extracted triples before visualizing them in a graph database.
We implemented our approach on a large research dataset (CORD-19) consisting of thousands of scientific articles to illustrate its efficiency, and we acquired more than 400,000 valid triples linked with UMLS entities and other relevant metadata, ensuring that the generated information remains relevant to the context of the ingested corpus and free of syntactic variations that would lead to triples of low contextual value. Our information extraction pipeline was evaluated in terms of precision, recall, and F1-score and, while a direct comparison would require the evaluation of similar systems on the same dataset, it has shown promising results compared to standalone OIE engines and other end-to-end frameworks. Finally, we demonstrated its capabilities through a series of indicative data exploration tasks for retrieving different types of information by querying or visually interacting with the graph database. In terms of design, we adopted a modular approach, basing each component of our pipeline on state-of-the-art tools and pretrained deep-learning models, in the hope that this contributes to its flexibility and longevity.
In the future, we plan to further expand our methodology by introducing additional pre-processing and post-processing features, including mapping the extracted triples to formal semantic schemas such as the Comparative Toxicogenomics Database (CTD) COVID-19 database [77], in order to render it compliant with the existing ontology-guided storage systems. Moreover, we plan to exploit the modularity of our approach to experiment with other state-of-the-art tools and compare it on benchmark datasets and shared tasks [85]. Finally, with regard to the validity of the pipeline's output, we plan on involving human experts from the biomedical domain to assess the informativeness of the extracted triples and the usefulness of the produced graphs.
Table A1. Extractive text summarization example: The highlighted text composes the summary of the processed article.

CORD-19 Article ID: d99dbae98cc9705d9b5674bb6eb66560b4434305
The current epidemic of a new coronavirus disease (COVID-19), caused by a novel coronavirus (2019-nCoV), recently officially named severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has reopened the issue of the role and importance of coronaviruses in human pathology (1) (2) (3) (4) (5). This epidemic definitively confirms that this heretofore relatively harmless family of viruses, Coronaviridae, includes major pathogens of epidemic potential. The COVID-19 epidemic has clearly demonstrated the power of infectious diseases, which have been responsible for many devastating epidemics throughout history. The epidemiological potential of emerging infectious diseases, especially zoonoses, is affected by numerous environmental, epidemiological, social, and economic factors (6,7). Emerging zoonoses pose both epidemiological and clinical challenges to health care professionals. Since the 1960s, coronaviruses have caused a wide variety of human and animal diseases. In humans, they cause up to a third of all community-acquired upper respiratory tract infections, such as the common cold, pharyngitis, and otitis media. However, more severe forms of bronchiolitis, exacerbations of asthma, and pneumonia in children and adults have also been described, sometimes with fatal outcomes in infants, the elderly, and the immunocompromised. Some coronaviruses are associated with gastrointestinal disease in children. Sporadic infections of the central nervous system have also been reported, although the role of coronaviruses in infections outside the respiratory tract has not been completely clarified (8). Most coronaviruses are adapted to their hosts, whether animal or human, although cases of possible animal-to-human transmission and adaptation have been described in the past two decades, causing two epidemics. The first such outbreak originated in Guangdong, a southern province of the People's Republic of China, in mid-November of 2002.
The disease was named severe acute respiratory syndrome (SARS). The cause was shown to be a novel coronavirus (SARS-CoV), an animal virus that had crossed the species barrier and infected humans. The most likely reservoir was bats, with evidence that the virus was transmitted to a human through an intermediate host, probably a palm civet or raccoon dog (8,9). In less than a year, SARS-CoV infected 8098 people in 26 countries, of whom 774 died (10,11). Approximately 25% of the patients developed organ failure, most often acute respiratory distress syndrome (ARDS), requiring admission to an intensive care unit (ICU), while the case fatality rate (CFR) was 9.6%. However, in elderly patients (>60 years), the CFR was over 40%. Poor outcomes were seen in patients with certain comorbidities (diabetes mellitus and hepatitis B virus infection), patients with atypical symptoms, and those with elevated lactic acid dehydrogenase (LDH) values on admission. Interestingly, the course of the disease was biphasic in 80% of the cases, especially those with severe clinical profiles, suggesting that immunological mechanisms, rather than only the direct action of SARS-CoV, are responsible for some of the complications and fatal outcomes (8,9). Approximately 20% of the reported cases during this epidemic were health care workers. Therefore, in addition to persons exposed to animal sources and infected family members, health care workers were among the most heavily exposed and vulnerable individuals (9, 10). During 2004, three minor outbreaks were described among laboratory personnel engaged in coronavirus research. Although several secondary cases, owing to close personal contact with infected patients, were described, there was no further spread of the epidemic. It is not clear how the SARS-CoV eventually disappeared and if it still circulates in nature among animal reservoirs. Despite ongoing surveillance, there have been no reports of SARS in humans worldwide since mid-2004 (11). 
In the summer of 2012, another epidemic caused by a novel coronavirus broke out in the Middle East. The disease, often complicated with respiratory and renal failure, was called Middle East respiratory syndrome (MERS), while the novel coronavirus causing it was called Middle East respiratory syndrome coronavirus (MERS-CoV). Although a coronavirus, it is not related to the coronaviruses previously described as human pathogens. However, it is closely related to a coronavirus isolated from dromedary camels and bats, which are considered the primary reservoirs, albeit not the only ones (8,12). From 2012 to the end of January 2020, over 2500 laboratory-confirmed MERS cases, including 866 associated deaths, were reported worldwide in 27 countries (13). The largest number of such cases has been reported among the elderly, diabetics, and patients with chronic diseases of the heart, lungs, and kidneys. Over 80% of the patients required admission to the ICU, most often due to the development of ARDS, respiratory insufficiency requiring mechanical ventilation, acute kidney injury, or shock. The CFR is around 35%, and even 75% in patients >60 years of age. However, MERS-CoV, unlike its predecessor SARS-CoV, did not disappear, but still circulates among animal and human populations, occasionally causing outbreaks, either in connection with exposure to camels or infected persons (12). Overall, 19.1% of all MERS cases have been among health care workers, and more than half of all laboratory-confirmed secondary cases were transmitted from human to human in health care settings, at least in part due to shortcomings in infection prevention and control (12,13). Post-exposure prophylaxis with ribavirin and lopinavir/ritonavir decreased the MERS-CoV risk in health care workers by 40% (14).
The emergence of COVID-19 caused by SARS-CoV-2
In mid-December of 2019,
a pneumonia outbreak erupted once again in China, in the city of Wuhan, the province of Hubei (1). The outbreak spread during the next two months throughout the country, with currently over 80 000 cases and more than 2400 fatal outcomes (CFR 2.5%), according to official reports. Exported cases have been reported in 30 countries throughout the world, with over 2400 registered cases, of which 276 are in Europe. On February 25, the first case of COVID-19 was confirmed in Zagreb, Croatia, and was linked to the current outbreak in the Lombardy and Veneto regions of northern Italy (15). The case definition was first established on January 10 and modified over time, taking into account both the virus epidemiology and clinical presentation. The clinical criteria were expanded on February 4 to include any lower acute respiratory diseases, and the epidemiological criterion was extended to the whole of China, with the possibility of expansion to some surrounding countries (16,17). At the early stage of the outbreak, patients' full-length genome sequences were identified, showing that the virus shares 79.5% sequence identity with SARS-CoV. Furthermore, 96% of its whole genome is identical to bat coronavirus. It was also shown that this virus uses the same cell entry receptor, ACE2, as SARS-CoV (18). The full clinical spectrum of COVID-19 ranges from asymptomatic cases, mild cases that do not require hospitalization, to severe cases that require hospitalization and ICU treatment, and those with fatal outcomes. Most cases were classified as mild (81%), 14% as severe, and 5% as critical (ie, respiratory failure, septic shock, and/or multiple organ dysfunction or failure). The overall CFR was 2.3%, while the rate in patients with comorbidities was considerably higher: 10.5% for cardiovascular disease, 7.3% for diabetes, 6.3% for chronic respiratory diseases, 6.0% for hypertension, and 5.6% for cancer. 
The CFR in critical patients was as high as 49.0% (4). It is still not clear which factors contribute to the risk of transmitting the infection, especially by persons who are in the incubation stage or asymptomatic, as well as which factors contribute to the severity of the disease and fatal outcome. Evidence from various types of additional studies is needed to control the epidemic (19). However, it is certain that the binding of the virus to the ACE2 receptor can induce certain immunoreactions, and the receptor diversity between humans and animal species designated as SARS-CoV-2 reservoirs further increases the complexity of COVID-19 immunopathogenicity (20). Recently, a diagnostic RT-PCR assay for the detection of SARS-CoV-2 has been developed using synthetic nucleic acid technology, despite the lack of virus isolates and clinical samples, owing to its close relation to SARS. Additional diagnostic tests are in the pipeline, some of which are likely to become commercially available soon (21). Currently, randomized controlled trials have not shown any specific antiviral treatment to be effective for COVID-19. Therefore, treatment is based on symptomatic and supportive care, with intensive care measures for the most severe cases (22). However, many forms of specific treatment are being tried, with various results, such as with remdesivir, lopinavir/ritonavir, chloroquine phosphate, convalescent plasma from patients who have recovered from COVID-19, and others (23) (24) (25) (26). No vaccine is currently available, but researchers and vaccine manufacturers have been attempting to develop the best option for COVID-19 prevention. So far, the basic target molecule for the production of a vaccine, as well as therapeutic antibodies, is the CoV spike (S) glycoprotein (27,28). 
The spread of the epidemic can only be contained, and SARS-CoV-2 transmission in hospitals prevented, by strict compliance with infection prevention and control measures (contact, droplet, and airborne precautions) (22,29). During the current epidemic, health care workers have been at an increased risk of contracting the disease and consequent fatal outcome owing to direct exposure to patients. Early reports from the beginning of the epidemic indicated that a large proportion of the patients had contracted the infection in a health care facility (as high as 41%), and that health care workers constituted a large proportion of these cases (as high as 29%). However, the largest study to date on more than 72 000 patients from China has shown that health care workers make up 3.8% of the patients. In this study, although the overall CFR was 2.3%, among health care workers it was only 0.3%. In China, the number of severe or critical cases among health care workers has declined overall, from 45.0% in early January to 8.7% in early February (4). This poses numerous psychological and ethical questions about health care workers' role in the spread, eventual arrest, and possible consequences of epidemics. For example, during the 2014-2016 Ebola virus disease epidemic in Africa, health care workers risked their lives in order to perform life-saving invasive procedures (intravenous indwelling, hemodialysis, reanimation procedures, mechanical ventilation), and suffered high stress and fatigue levels, which may have prevented them from practicing optimal safety measures, sometimes with dire consequences (30). This third coronavirus epidemic, caused by the highly pathogenic SARS-CoV-2, underscores the need for the ongoing surveillance of infectious disease trends throughout the world. The examples of pandemic influenza, avian influenza, but also the three epidemics caused by the novel coronaviruses, indicate that respiratory infections are a major threat to humanity. 
Although Ebola virus disease and avian influenza are far more contagious and influenza currently has a greater epidemic potential, each of the three novel coronaviruses requires urgent epidemiologic surveillance. Many infectious diseases, such as diphtheria, measles, and whooping cough, have been largely or completely eradicated or controlled through the use of vaccines. It is hoped that developments in vaccinology and antiviral treatment, as well as new preventive measures, will ultimately vanquish this and other potential threats from infectious diseases in the future.