COVID-19-Related Scientific Literature Exploration: Short Survey and Comparative Study

Simple Summary The COVID-19-related literature has seen a surge since the beginning of the pandemic. This surge prompted the creation of multiple literature exploration systems that help automate the exploration of scientific articles. In this work, we survey multiple COVID-19 literature exploration systems by examining their most discriminative characteristics, give general design principles for these systems, and describe some of their limitations. Abstract The urgency of the COVID-19 pandemic caused a surge in the related scientific literature. This surge made the manual exploration of scientific articles time-consuming and inefficient. Therefore, a range of exploratory search applications have been created to facilitate access to the available literature. In this survey, we give a short description of certain efforts in this direction and explore the different approaches that they used.


Introduction
Due to the vast expansion of the COVID-19 literature (according to the LitCOVID [1] website, more than 258,000 unique papers had been published on PubMed before 10 July 2022), there was a need to create information management and retrieval systems for the COVID-19 literature. The data science community responded to this urgent need by creating and deploying dozens of applications that provide researchers with easy access to the COVID-19 literature. These applications mainly focus on text mining [2] and its related tasks (e.g., document retrieval [3], question answering [4], passage retrieval [5], summarization [6], etc.) in order to organize and access relevant knowledge effortlessly. Several public competitions and common tasks, such as the CORD-19 and TREC-COVID initiatives [7,8], further encouraged such efforts.
In this work, we explore the COVID-19 literature exploration applications, which we can classify into one of two categories relative to the format of the search results: (a) textual search engines and (b) visual search engines. The first category comprises query-oriented applications that extract information from the COVID-19 literature using queries. The second class of applications is used mainly for the bibliometric study of the COVID-19 literature, coupled with visual interactive or static summarization graphs. Each one of these applications goes through the same development phases. Figure 1 shows the most common phases that an application would go through. First, the text data needed by the system must be collected. All the applications explored in this work use the CORD-19 [7] dataset (either a version of it or a subset of a version of CORD-19). Second, the raw data collected may need to be transformed in some cases to meet certain specifications. This can be achieved by enriching the data in order to make it more representative, or by structuring the available data differently. Third, given the available data and the basic application specifications, a set of learning problems (i.e., question answering, document retrieval, passage retrieval) needs to be defined. Fourth, given the defined learning problems, machine learning models are developed and trained to achieve the learning tasks. Fifth, the models are evaluated, either by a human or by an automated evaluation process. Sixth, after evaluating the models, they need to be deployed to ensure their accessibility to a larger number of users, by providing an easy-to-use user interface with a reliable model execution backend architecture.
Although a previous survey [9] explored the COVID-19 literature search engines, it has certain limitations that we try to remedy in this work. First, whereas ref. [9] focuses primarily on textual search engines, we also explore visual search engines. Second, ref. [9] included a plethora of applications that are not associated with any research papers or technical reports. Consequently, we discarded these applications and focused on applications with research papers in order to gain and express a deeper understanding of the methods that they employed. Third, we try to infer some design principles that the authors of the works used to create their systems. This work is organized as follows: In Section 2, we describe some datasets that were used in the explored works for various purposes. In Section 3, we explore the characteristics and design principles of the COVID-19 exploratory literature search applications. In Section 4, we explore certain methods that were used to evaluate the systems. In Section 5, we show certain limitations of the examined works. Section 6 concludes our work, and Section 7 discusses the limitations of this work.

Datasets
In this section, we list some of the datasets that were used in the works that we explored. We categorized the datasets relative to their structure into three categories: (a) unstructured, (b) structured and (c) hybrid.

Unstructured Datasets
Unstructured data are information that does not have a predefined data model. This type of data is mainly textual in nature. The structured and hybrid datasets described below were built from unstructured data: they were automatically or manually curated and annotated from different literature databases (e.g., arXiv, DBLP, PubMed, bioRxiv, medRxiv), which contain unstructured documents, often in hard-to-read formats such as PDF.

Structured Datasets
We can recognize two kinds of structured data: (a) data with a tabular structure, where every example shares the same set of variables and examples are independent of each other, and (b) data with a relational structure, where examples do not necessarily share the same set of variables, are inherently typed (that is, each example belongs to a predefined group of examples) and have dependencies between them, which are implemented practically in the form of links.
The first category mainly contains annotated datasets intended for machine learning purposes, such as training, fine-tuning, or evaluating the created models on specific tasks. The works that we explored use multiple datasets. A later section defines some of the main tasks that the works try to solve. All of these tasks are text-oriented and fall under the umbrella of information retrieval in general. Annotated datasets such as TREC-COVID [8] and BioASQ [10] were used for document retrieval. These datasets are generally constructed by a set of human curators who were provided with a list of queries (or questions) and a set of supposedly relevant documents; the goal was to select the most pertinent documents for each query. In addition, multiple datasets have been used to train question answering models, such as COVIDQA [11], COVID-19 Questions [12], COVID-QA [13], InfoBot Dataset [14], MS-MARCO [15], Med-MARCO [16], Natural Questions [17], SQuAD [18], BioASQ [10], M-CID [19] and QuAC [20]. Other datasets were used to train document summarization models. For example, DUC 2005 [21], DUC 2006 [22] and Debatepedia [23] were used by [24] to train document summarization models. Other datasets, such as GENIA [25], JNLPBA [26], CHEMDNER [27], NCBI Disease Corpus [28], CHEMPROT [29], BC5CDR [30] and COV19_729 [31], were used for the named entity recognition (NER) of multiple types of entities, namely, chemicals, genes, proteins, diseases and other biomedical entities. Relation extraction (RE) was also a task of interest in [31], which was achieved using the CHEMPROT [29] and BC5CDR [30] datasets. NER and RE tasks are generally used in knowledge graph construction, where the extracted entities represent nodes and the relations represent edges between nodes. Some of these datasets were curated using data from COVID-19-related source documents, e.g., COVIDQA [11], COVID-19 Questions [12], COVID-QA [13], InfoBot Dataset [14] and TREC-COVID [8].
Table 1 summarizes the previously mentioned datasets.
Concerning data with a relational structure, some works used knowledge graphs constructed from the COVID-19-related literature. In general, the graphs contain four types of entities with multiple properties: (1) a paper entity, which represents a research paper and can be described by a Digital Object Identifier (DOI), title, publication date and other properties; (2) an author entity, which represents a publication's author and can be described by an identifier, a first, middle and last name and other properties of interest; (3) an affiliation entity, which represents the research structure (lab, university, company, etc.) to which the author is affiliated and can be described by an identifier, a name and other properties of interest; and (4) a concept entity, which represents a domain-knowledge-related notion that exists in a paper. A concept can be represented by one word or a series of words. Concepts can have multiple types of relationships between them, depending on the types of the concepts. For example, concepts of biomedical types, such as genes, diseases, chemicals, organisms and proteins, can be linked by semantic biomedical relationships [31][32][33][34] or by syntactic relationships based on their co-occurrence in the same sentence [35]. Tables 2 and 3 offer a more detailed description of these entities and of how they are related. Table 4 lists multiple KGs and their descriptions. It is worth pointing out that not all knowledge graphs respect this schema. Some implement it totally (e.g., CKG [36]), and some implement it partially (e.g., CovEx KG [37]), as shown in Tables 2 and 3. Furthermore, it has been observed that the design of certain knowledge graphs depends on the tasks they are used for. For instance, for the task of document retrieval, a knowledge graph is generally designed with documents as the central nodes to which other nodes may be linked [36,37].
On the other hand, for the task of question answering, even though the same base data are used, no node holds the document data; instead, documents are ignored, and only concept nodes are presented and interlinked [34]. In addition, the granularity of the relationships and entities is also important, as demonstrated in [33,34], where two types of relationships and entities were extracted: (a) coarse-grained and (b) fine-grained. The latter was needed in a question-answering task to accommodate the specificity of the entities expressed in user queries, which is not required in other tasks, as shown in [31] for the task of link prediction, where the authors discarded fine-grained relationships in favor of more general ones to reduce noise that can hinder the performance of certain models. In the case of network visualization, ref. [35] adopted a more flexible approach to KG construction by extracting a set of entities and saving them so that they could later be aggregated to create domain-specific networks, which can be visualized. Some tasks, such as information extension, which aims at enriching certain information constructs such as queries or KGs, do not need directed edges, which is the case, for example, in Vapur KG [32] and Citation KG [44]. In fact, having undirected edges helps explore more complex and unexpected relationships among entities, as illustrated in a fact-checking application in [45].
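The generic four-entity schema described above (paper, author, affiliation, concept, plus typed relationships) can be sketched as a small property graph. A minimal sketch in Python; the node types, property names and example values are illustrative assumptions, not taken from any specific KG in the survey.

```python
from dataclasses import dataclass

# Illustrative node types for the generic KG schema described above.
@dataclass
class Paper:
    doi: str
    title: str

@dataclass
class Author:
    author_id: str
    name: str

@dataclass
class Concept:
    name: str
    concept_type: str  # e.g., "gene", "disease", "chemical", "organism"

# Edges stored as (source, relation, target) triples.
graph = []

p = Paper(doi="10.1000/xyz", title="A COVID-19 study")       # hypothetical
a = Author(author_id="a1", name="J. Doe")                    # hypothetical
c1 = Concept(name="ACE2", concept_type="gene")
c2 = Concept(name="SARS-CoV-2", concept_type="organism")

graph.append((a, "authored", p))
graph.append((p, "mentions", c1))
graph.append((p, "mentions", c2))
# Concept-concept edge, e.g., derived from sentence co-occurrence.
graph.append((c1, "co_occurs_with", c2))

relations = [r for _, r, _ in graph]
```

Depending on the target task, the same triples could be stored with documents as central nodes (document retrieval) or with only the concept-concept edges kept (question answering), mirroring the design choices discussed above.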

Hybrid
Hybrid datasets have some structure, which can be in the form of tags, but most if not all of the tagged elements have no structure, which generally means that these elements are in a textual format. An example of such a dataset is CORD-19, the centerpiece of the COVID-19 literature exploration applications. The CORD-19 dataset [7] is a curated set of articles from multiple resources that were collected to help efforts against the COVID-19 pandemic. This dataset was used in a common document retrieval task, TREC-COVID, where a set of CORD-19 articles were curated and annotated for their relevance to certain user queries. The dataset is ever-expanding, with new articles added to it intermittently. The dataset is available online (as of 4 April 2022) at https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge.
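Since CORD-19's structured part is essentially a metadata table over unstructured article texts, a typical first step for an application is parsing that metadata. A minimal sketch; the column names below (cord_uid, title, abstract, publish_time) reflect commonly reported CORD-19 metadata fields but are assumptions here, and the two rows are fabricated toy examples.

```python
import csv
import io

# A toy stand-in for a few rows of a CORD-19-style metadata file.
sample = """cord_uid,title,abstract,publish_time
ug7v899j,Clinical features of COVID-19,We describe clinical features.,2020-03-01
02tnwd4m,ACE2 receptor binding,The spike protein binds ACE2.,2020-04-15
"""

def load_metadata(fileobj):
    """Parse the metadata CSV into a list of dicts, one per article."""
    return [row for row in csv.DictReader(fileobj)]

records = load_metadata(io.StringIO(sample))
```

In a real pipeline the same reader would be pointed at the downloaded metadata file, and the abstracts (unstructured text) would then feed the text mining tasks described in the next section.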

Textual Exploratory Search
Research related to COVID-19 knowledge management and information retrieval (KM&IR) has gained tremendous attention over the past year. Here, we try to present a concise summary of the research in this area. The development of search engines goes through certain common steps, which are illustrated in Figure 2. A search engine's development process begins with the base data, i.e., the data that are relevant to the search queries. Second, the raw textual data are processed to extract and transform certain elements of interest. The same raw data can also be reorganized in the form of a knowledge graph to satisfy certain specifications, such as fast question answering. Afterward, the tasks that are intended for the search engine should be defined and implemented, followed by an assessment of the efficiency of the system in performing those tasks. Finally, the implemented system needs to be deployed for public access.
COVID-19 literature knowledge management and information retrieval systems have multiple axes along which we can study, survey and compare them. We list some of these characteristics in what follows: • Tasks: The tasks are related to textual data; hence, we suppose that we have a text database (or collection, or corpus) T, a string of N symbols drawn from an alphabet Σ (a finite set of symbols). A vocabulary V is the set of unique words used in T. T is partitioned into n documents {d_1, d_2, ..., d_n}. A document d can be represented as (w_{d,1}, w_{d,2}, ..., w_{d,n_d}), comprising n_d words from V. Queries are also strings (or sets of strings) composed of symbols drawn from Σ. Symbols in Σ may be letters, bytes or even words, and the documents may be articles, chromosomes or any other texts in which we need to search. In general, tokens are extracted using tokenizers and further processed using lemmatization, stemming and other techniques that help normalize tokens. In the explored systems, we can identify the following tasks.
1. Document Listing: given a query Q = {q_1, ..., q_m | q_i ∈ Σ*, ∀i} and a text T ∈ Σ* partitioned into n documents {d_1, d_2, ..., d_n}, the aim of this task is to return a list of the documents in which one or multiple tokens of Q appear at least once.
2. Document Ranking: given a query Q = {q_1, ..., q_m | q_i ∈ Σ*, ∀i}, an integer 0 < k ≤ N, and a text T ∈ Σ* partitioned into n documents {d_1, d_2, ..., d_n}, the aim of this task is to return the top-k documents ordered by a similarity measure S(Q, d_i).
3. Passage Retrieval (Indexing, Ranking): given a query Q and a set of documents D, where each document is partitioned into passages, the aim of this task is to find relevant passages for the query [5]. Passage retrieval can also be used for sentence highlighting.
4. Question Answering: given a query Q = {q_1, q_2, ..., q_m} made of m tokens and a passage P = {p_1, p_2, ..., p_k} made of k tokens, the aim of this task is to find an answer span A = {a_start, a_end} in P [4].
5. Summarization: we opt for the definition presented in [6]. Given a set of source documents D = {d_i}, summarization aims to generate a coherent text s (called a summary) that contains a significant amount of relevant information from the source text. Ref. [6] considered a good summary to have a compression rate τ = c(s)/c(D) (where c(x) is the word count in x, and x can be a sentence, a document or any grouping of words) of less than a third of the length of the original document.
6. Topic Modeling: the aim of topic modeling is to infer a set of K topics capturing a lower-dimensional representation suitable for summarization and prediction tasks [46]. According to [47], given a text corpus T with a vocabulary of size V and a predefined number of topics K, the major tasks of topic modeling are: (a) learning the word representation of topics α, where a topic α in a given collection T is defined as a multinomial distribution over the vocabulary V, i.e., p(w|α), w ∈ V; and (b) learning the sparse topic representation of documents θ, where the topic representation of a document d, θ_d, is defined as a multinomial distribution over the K topics, i.e., p(α_k|θ_d), k = 1, ..., K. In general, topic modeling aims to find K salient topics α_1, ..., α_K from T and the topic representation θ_d of each document d.
7. FAQ Matching: let F denote a set of question-answer pairs; given F and a user query Q, this task aims to rank the question-answer pairs in F. The top-k QA pairs with the highest scores are returned to the user [48].
8. Recommendation: given the set of all users C and the set of all items that can be recommended S, let u be a utility function that measures the usefulness of item s to user c, i.e., u : C × S → R, where R is a totally ordered set (e.g., non-negative integers or real numbers within a certain range). The goal of this task is to choose the item(s) s ∈ S that maximize the utility for each user c ∈ C [49].
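To make the document ranking definition above concrete, the following sketch ranks documents by a similarity measure S(Q, d_i), here instantiated as TF-IDF cosine similarity. This is a generic textbook scorer, not the model used by any particular surveyed system, and the example documents are fabricated.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build TF-IDF vectors for a list of tokenized documents."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))          # document frequency
    idf = {t: math.log(n / df[t]) for t in df}             # inverse doc. freq.
    return [{t: tf * idf[t] for t, tf in Counter(d).items()} for d in docs], idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query_tokens, docs, k=2):
    """Return indices of the top-k documents ordered by S(Q, d_i)."""
    vectors, idf = tfidf_vectors(docs)
    tf = Counter(query_tokens)
    q = {t: tf[t] * idf.get(t, 0.0) for t in query_tokens}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]

docs = [
    "covid incubation period estimates".split(),
    "mask efficacy randomized trial".split(),
    "incubation period of the virus".split(),
]
top = rank("incubation period".split(), docs, k=2)
```

Document listing would be the degenerate case of the same loop: return every index whose score is non-zero, unordered.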
• Feedback Loop: This characteristic is related to the use of user feedback data in any of the mentioned tasks.
• Representation Level for Text: In general, text can be represented in two distinct spaces: (a) a bag-of-words space and (b) a vector space. These representations can apply to one or multiple levels of granularity of textual documents, that is, the document, paragraph, sentence and word levels.
• Representation Levels for Graphs: Graphs can also be represented in a frequentist space or a low-dimensional vector space. These representations can apply to one or multiple levels of granularity of graphs, that is, the full-graph, sub-graph, node and edge levels. An example of graph representation in COVID-19 literature search engines is document sub-graph embedding: in order to build document-level embeddings, refs. [36,38] combined document-level textual embeddings with embeddings of the documents' related sub-graphs from the larger KG to recommend similar documents.
• Novelty: A research paper is said to have novelty if the authors explored uncharted territory to solve old or new problems. Specifically, we consider papers to have novelty if they contain new contributions to the design of models, learning objectives or data processing. We ignored the data aspect of this characterization because all the papers can be considered novel with respect to data alone.
• Data Enrichment: Data enrichment refers, in general, to the process of adding more data to the already existing training data. Data enrichment methods can take two main forms: (a) data augmentation and (b) data supplementation. The former characterizes the set of methods that use the already existing data to generate more data, while the latter encapsulates methods that use external resources to supplement the available data. The latter is easy to accomplish as long as external resources exist. There are various data augmentation methods. For example, in CO-Search [39], in order to train a Siamese network, the authors generated negative (paragraph, reference) pairs based on positive pairs extracted from documents.
• Search Type: We can distinguish four types of search:
  - Keyword: searching using queries composed of one specific word.
  - Regular Expression: the query takes the form of a regular expression that describes the textual patterns to retrieve. For example, ref. [50] used this search strategy to look for drugs with certain properties in a drug re-purposing database.
  - Open Questions: using natural language queries with simple or complex structures.
  - Keyphrase Search: using queries composed of one or multiple keywords, where word order is taken into consideration.
• KG Traversal: This refers to the use of knowledge graphs to search for entities or relationships that are relevant to achieving one or multiple tasks.
• Representation Combination (Rep.Comb.): This characteristic exists in one of two cases: (a) the combination of multiple levels of representation to achieve a task, or (b) the combination of KG and textual representations to achieve a task.
• Fast Prototyping and Deployment: Given the urgent nature of most of the applications, the researchers opted mainly for off-the-shelf technologies that are easy to work with. In addition, except for one application, all the other applications used existing models and algorithms, which can also be attributed to the urgency of the task.
• Textual Representation Methods: There are two categories of methods: (a) bag-of-words (BOW) models and (b) vector space models (VSMs). The major difference is that VSMs capture more of the contextual elements of text than BOW methods but are computationally more expensive during training and inference. Some works struck a balance by applying both categories of methods, e.g., [37,44,51,52], generally through a multi-stage ranking scheme that performs a first ranking using BOW models, followed by a reranking of its output using a VSM. Some works compensate for the latency of neural language models [12] by pre-indexing documents offline.
• Granularity/Levels of Representation: We also noticed that the works used different levels of granularity, depending on the intended tasks and the available computational resources. For example, to achieve the task of document retrieval, some works opted for simple document-level representations [53], while other works either used more granular representations [12,32,37,40,50,54-56] or a mix of more granular representations with document-level representations [16,24,38,39,44,51,52,57].
• Using KGs: Knowledge graphs were used in multiple works for different purposes.
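The multi-stage ranking scheme mentioned above (a cheap BOW first pass over the whole collection, followed by a more expensive reranking of the shortlist) can be sketched as follows. The second-stage scorer here is a character n-gram overlap used purely as a cheap, runnable stand-in for a neural vector space model; all names and example data are illustrative.

```python
from collections import Counter

def bow_score(query, doc):
    """First-stage score: simple term-overlap count (a BOW model)."""
    q, d = set(query.split()), Counter(doc.split())
    return sum(d[t] for t in q)

def char_ngrams(text, n=3):
    """Character n-gram profile of a string."""
    text = text.replace(" ", "_")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def dense_score(query, doc):
    """Second-stage score: character 3-gram overlap, a toy stand-in
    for an expensive vector-space (e.g., neural) similarity model."""
    q, d = char_ngrams(query), char_ngrams(doc)
    return sum(min(q[g], d[g]) for g in q)

def two_stage_rank(query, docs, first_k=2):
    # Stage 1: cheap BOW ranking over the whole collection.
    shortlist = sorted(docs, key=lambda d: bow_score(query, d),
                       reverse=True)[:first_k]
    # Stage 2: the expensive model reranks only the shortlist.
    return sorted(shortlist, key=lambda d: dense_score(query, d),
                  reverse=True)

docs = [
    "incubation period study",
    "vaccine trial results",
    "period of incubation",
]
ranked = two_stage_rank("incubation period", docs)
```

The design point is the same one the surveyed systems exploit: the costly scorer only ever sees `first_k` candidates, so its latency does not grow with collection size.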

Visual Exploratory Search
While exploring the COVID-19 literature, researchers can face two kinds of challenges: (a) the quantity and (b) the quality of the research papers. Even though textual exploratory search is a useful literature exploration tool, it is targeted and requires researchers to know in advance what they are looking for, which is not always evident. Consequently, many visual exploratory search tools have been developed to explore the COVID-19 literature in a visual, interactive and general manner, rather than having to go through the tedious process of manually curating the literature. In the context of scientific literature, this can also be used to explore latent structures within the data, which may be related to co-authorship networks, citation networks and other important bibliometric dimensions.
In light of the reviewed literature, we can infer a general process that exploratory visual search applications follow. This process is presented in Figure 3. The two most important phases of this process are (a) indicator specification and (b) indicator representation. The former is where one or multiple qualitative (e.g., entity types, topics, affiliation, etc.) or quantitative characteristics (e.g., occurrence/co-occurrence frequency or count) of the data are chosen to be presented, and their method of presentation is also specified. The latter phase is where a significant visual representation is chosen for those indicators; for example, qualitative indicators can be presented using colors, and quantitative indicators can be presented using distance, surface or volume variations. The data used for the exploratory search applications are either CORD-19 [7] or one of the knowledge graphs presented previously. The frequency and count indicators are the most predominantly used, although other indicators are also used. For example, ref. [58] uses topic similarity vectors to cluster similar topics. Multiple plots and visualization tools were used to visualize the indicators (see Figure 4); these are summarized in Table 6. In addition, some works apply certain tasks in the data transformation phase in order to extract more relevant data from the raw text. The tasks mentioned in the works are information extraction (IE), which generally refers to basic textual information extraction; topic modeling, which was used in [58]; NER, which was used in [35,43,59] to extract named entities and use their count as an indicator; and network analysis [35,59]. In [35], network analysis was used to solve two problems faced during network traversal, namely the problem of network size and the search for deep connections, using a breadth-first search technique on the network structure.
In [35], network analysis was also used to detect communities within a co-authorship network, motivated by the need to keep track of what other groups were doing in order to explore new fields and potential collaborations. Figure 5 shows the interface proposed by [35]. Reactivity is also an important feature in these tools, since it simplifies interactive visual manipulation, which makes the exploration more flexible. Public availability was also examined, and links to the tools are provided where they exist.
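The occurrence and co-occurrence count indicators described above can be computed directly from per-sentence entity extractions before being mapped to a visual encoding. A minimal sketch; the per-sentence entity lists stand in for hypothetical NER output and are fabricated for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical per-sentence entity extractions (e.g., NER output).
sentences_entities = [
    ["ACE2", "SARS-CoV-2"],
    ["ACE2", "SARS-CoV-2", "spike protein"],
    ["remdesivir", "SARS-CoV-2"],
]

# Occurrence counts: how often each entity appears overall.
entity_counts = Counter(e for ents in sentences_entities for e in ents)

# Co-occurrence counts over unordered entity pairs in the same sentence;
# these would become weighted edges in a co-occurrence network.
cooccurrence = Counter()
for ents in sentences_entities:
    for a, b in combinations(sorted(set(ents)), 2):
        cooccurrence[(a, b)] += 1
```

A visualization layer would then map `entity_counts` to, e.g., node size and `cooccurrence` to edge thickness, following the indicator-representation phase described above.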

Evaluation Methods
In general, machine learning models are composed of two main modules, (a) a representation module and (b) a decision module. The former is responsible for transforming the data from a complex multidimensional space with latent spatial and temporal dependencies to a lower-dimensional and more abstract space. The second module is used to process the representational modules' output to achieve a task. The training of these modules can be performed independently; that is, the representational module can be trained separately in an unsupervised or self-supervised manner, while the combination of the two modules can be trained in a self-supervised, semi-supervised or fully supervised manner.
The machine learning (ML) models used in the previously explored works, be it search engine-related ML models or knowledge graph creation ML models (e.g., named entity recognition models), have to be evaluated to get empirical evidence on their viability. While exploring the literature, we noticed that there are two main evaluation techniques: human evaluation and automatic evaluation. The former bases its evaluation on the relevance judgment of the users, and the latter focuses on information needs in order to evaluate the results. The latter also has two sub-categories of evaluation measures: intrinsic evaluation measures and extrinsic evaluation measures.

Human Evaluation
Human evaluation is based on quantifying human feedback towards the evaluated application. This type of evaluation is advantageous because of its holistic character. Indeed, humans can evaluate more complex applications with multiple interacting modules. For example, in the case of a search engine, a human evaluator can assess the information relevance of the search results in addition to some representational aspects, such as highlighting, which are not easy to evaluate automatically [51,57]. However, the downside of human evaluation is its lack of replicability: human evaluation is inherently biased and depends on the evaluators' needs, their fields of expertise and what they expect from the application. For example, an experienced researcher may find longer spans of text more reliable as answers to a query, while a novice would generally prefer direct short answers [40]. This makes the performance comparison of multiple applications based on human evaluation generally unreliable.

Automatic Evaluation
Automatic evaluation is the de facto evaluation method in the machine learning literature. It is based on evaluation metrics that quantify the discrepancy between the model output and the desired output. This is advantageous since it puts multiple applications on an equal footing during evaluation. On the other hand, automatic evaluation is monolithic, meaning that it only evaluates one aspect of an application at a time (e.g., QA, DR, IR, etc.) and not the application as a whole, as is the case in human evaluation [51,57]. Furthermore, some aspects, such as ease of use and interface interactivity, cannot be evaluated automatically. In addition, the evaluation metrics used can suffer from certain biases that lessen the validity of the evaluation. For example, ref. [40] found that automatic metrics such as F1 heavily penalize long answers, as they overlap poorly with the gold annotations, which are mostly short, factual answers.
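The length bias noted in [40] is easy to reproduce with the standard token-overlap F1 used for extractive QA. The sketch below uses made-up gold and predicted answers; a long answer that contains the gold answer verbatim still scores poorly because its precision collapses.

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap F1, as commonly used for extractive QA evaluation."""
    p, g = prediction.split(), gold.split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

# Fabricated example: a short gold annotation vs. a long, correct answer.
gold = "5 days"
short_answer = "5 days"
long_answer = "the median incubation period was estimated to be 5 days overall"
```

Here `short_answer` scores a perfect 1.0, while `long_answer`, despite containing the gold span, scores well below 0.5 because nine of its eleven tokens are treated as spurious.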
As was mentioned before, automatic evaluation measures can be categorized into (a) intrinsic evaluation measures (IEMs) and (b) extrinsic evaluation measures (EEMs). The former measures are generally used to evaluate representation modules separately, and the latter measures are used to evaluate the combined representation and decision downstream model.

Intrinsic Evaluation
In the explored works, we only found one example of intrinsic evaluation [31], where KG node embeddings are evaluated by computing the Pearson and Spearman correlations between human similarity ratings and the cosine similarity scores of entity embeddings.
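This kind of intrinsic evaluation can be sketched as follows: compute cosine similarities between entity embeddings and correlate them with human ratings via Spearman's rank correlation (the Pearson case is analogous). The embeddings and ratings below are fabricated for illustration, and the Spearman implementation assumes no tied values.

```python
import math

def rank_values(xs):
    """1-based ranks of a sequence, assuming no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(xs, ys):
    """Spearman correlation via the rank-difference formula (no ties)."""
    rx, ry = rank_values(xs), rank_values(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hypothetical human similarity ratings for three entity pairs, and
# toy 2-d embeddings for the entities in each pair.
ratings = [0.9, 0.2, 0.6]
pairs = [([1.0, 0.1], [0.9, 0.2]),   # rated very similar
         ([1.0, 0.0], [0.0, 1.0]),   # rated dissimilar
         ([1.0, 0.5], [0.5, 1.0])]   # in between
sims = [cosine(u, v) for u, v in pairs]
```

A correlation near 1 would indicate that the embedding geometry agrees with human similarity judgments, which is what this toy data is constructed to show.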

Extrinsic Evaluation
In contrast to IEMs, EEMs are more frequently used. The works that we explored use a plethora of EEMs, which depend on the kind of task to be evaluated; this type of evaluation is performed through multiple task-specific evaluation metrics and their variants. For example, the ROUGE evaluation metric [61] and its variants were used in [24] to evaluate summarization models. The Match method [62,63] was used in [41] to evaluate QA and IR. Other more standard evaluation metrics, such as recall and precision, were used for IR tasks [64].
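The set-based precision and recall mentioned above can be computed for a single query as follows; the document identifiers and relevance judgments are fabricated for illustration.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judged run: the system returned 4 documents,
# 3 of which are among the 5 relevant ones.
p, r = precision_recall(["d1", "d2", "d3", "d7"],
                        ["d1", "d2", "d3", "d4", "d5"])
```

In a full evaluation these per-query scores would be averaged over all topics of a judged collection such as TREC-COVID.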

Discussion and Future Research Directions
In general, the explored works have certain common limitations. In what follows, we summarize a few of them:
• Evaluation: Most of the applications (e.g., [40,51,57]) suffer from a monolithic evaluation scheme that focuses on one task in particular and ignores other aspects of the application, especially those related to visual aspects.
• Feedback Loop: Some applications (e.g., [51,57]) expressed the importance of including human input in the process of information retrieval, as it tends to balance information need and information relevance.
• Fact Checking: The rapid expansion of the COVID-19 literature and the existence of many contradictory claims, concerning, for example, the incubation period of the virus or the optimal social distancing protocol, stress the importance of fact-checking applications for COVID-19 claims. The authors of [65] created a claim verification application for the COVID-19 literature, which takes a passage and a claim as input and outputs whether the claim is true given the passage. This type of application needs huge amounts of annotated data, which is particularly cumbersome to obtain in the case of COVID-19, since it requires skilled specialists for annotation. Developing semi-supervised or unsupervised techniques would be useful.
• Extending Data: Most of the applications (e.g., [53,54]) used limited amounts of data (labeled or not) to perform tasks, either because of the lack of labeled data or because of the lack of computational resources. More data would certainly improve performance.
• Data Bias: Some applications (e.g., [54]) can also benefit from reducing data bias, especially gender bias.
• Smart Querying: Some applications [56] use query functionalities that tend to be limited to simple word matching. This can be problematic in cases where the intent of the user is not evident in the query. It can be remedied by using embedding-based query matching, which uses contextual information to match queries to results.

Conclusions
This work represents an exploration of COVID-19 literature exploration applications, with emphasis on their design principles and concepts. There are two main types of literature exploration applications, (a) exploratory textual search and (b) exploratory visual search. The former uses textual queries made by end-users in order to explore the knowledge base and send the most relevant documents back to the users, while the latter type of application uses visual summaries to offer a structured view of the existing literature.

Limitations of This Work
An empirical quantitative evaluation of the systems explored in this work was of interest, but discrepancies in the evaluation results reported for the same systems in multiple sources (e.g., the results given in [38] differ from those given in [51] for the same system, COVIDex), together with the unavailable implementation details of some systems, discouraged us from pursuing that objective in this work.

Data Availability Statement:
The data presented in this study are available in the article.