Article

Expanding Our Understanding of COVID-19 from Biomedical Literature Using Word Embedding

Future Technology Analysis Center, Korea Institute of Science and Technology Information, 66, Hoegi-ro, Dongdaemun-gu, Seoul 02456, Korea
* Author to whom correspondence should be addressed.
Int. J. Environ. Res. Public Health 2021, 18(6), 3005; https://doi.org/10.3390/ijerph18063005
Submission received: 10 February 2021 / Revised: 9 March 2021 / Accepted: 10 March 2021 / Published: 15 March 2021

Abstract

A better understanding of the clinical characteristics of coronavirus disease 2019 (COVID-19) is urgently required to address this health crisis. Numerous researchers and pharmaceutical companies are working on developing vaccines and treatments; however, a clear solution has yet to be found. The current study proposes the use of artificial intelligence methods to comprehend biomedical knowledge and infer the characteristics of COVID-19. A biomedical knowledge base was established via FastText, a word embedding technique, using PubMed literature from the past decade. Subsequently, a new knowledge base was created using recently published COVID-19 articles. Using this newly constructed knowledge base from the word embedding model, a list of anti-infective drugs and proteins of either human or coronavirus origin were inferred to be related, because they are located close to COVID-19 on the knowledge base. This study attempted to form a method to quickly infer related information about COVID-19 using the existing knowledge base, before sufficient knowledge about COVID-19 is accumulated. With COVID-19 not completely overcome, machine learning-based research in the PubMed literature will provide a broad guideline for researchers and pharmaceutical companies working on treatments for COVID-19.

1. Introduction

Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), was first identified in Wuhan, China, in December 2019 [1]. The rapid spread of COVID-19 has caused a severe health crisis worldwide and gravely impacted human life and society [2]. The urgent need to develop effective therapeutics and vaccines against COVID-19 is driving numerous clinical studies worldwide. Efforts by several scientists have led to the design of effective antiviral agents based on an understanding of SARS-CoV-2 [3,4], its viral genome structure and pathogenicity [5,6], as well as the host response and protein–protein interactions [7,8,9]. A few vaccines have now been developed; however, safety concerns remain and the supply is insufficient. A therapeutic agent with a definite effect has not yet been developed [10].
In addition to clinical-based novel drug development studies, such as antibody therapeutics and plasma therapy, drug repurposing is receiving considerable attention as an alternative for developing COVID-19 treatments [11,12,13]. Several computational drug repurposing studies, including network-based or machine learning-based studies, were conducted to predict drug–target interactions by understanding or utilizing the structural properties of SARS-CoV-2, such as in silico docking and analysis [14], network proximity analysis of drug targets and coronavirus–host interactions in the human interactome [15], and therapeutic target-based virtual ligand screening [16].
Bibliometrics has played a large role as a tool for knowledge discovery. Although traditional bibliometric techniques based on statistics and citation analysis are still widely used for measuring and visualizing the impact of knowledge from the scientific literature [17], new techniques that are better at inferring knowledge are being developed. Combined with recent advances in deep learning, bibliometrics has been reborn as a data mining technology with an enhanced ability to infer new knowledge from a latent knowledge base.
The knowledge graph, a graph-based machine-readable data structure, was originally developed to describe interactions between entities and has recently been used as a network-based knowledge discovery tool for understanding COVID-19 and finding a therapy for the disease [18,19,20,21].
Most existing studies extract the structure of the knowledge contained in accumulated databases. Therefore, for their results to become accurate, a significant quantity of data has to be accumulated. In this study, we try to determine a way to infer the characteristics of COVID-19 using the biomedical knowledge base accumulated so far without waiting for further knowledge to be significantly accumulated.
Word embedding, a machine learning technique, can extract knowledge by processing text and keywords, and can suggest new knowledge through relational reasoning and inference between keywords. This is possible because word embedding projects keywords onto a vector space and expresses them as vectors [22]. Inference and analogy between keywords, such as France − Paris = Korea − Seoul, or France − Paris + Seoul = Korea, therefore become possible mathematically. If we have information on France, Paris, and Seoul, it becomes possible to find Korea via word embedding. Owing to these characteristics, word embedding is being studied in many fields, and it is also widely used to understand biomedical entities [23].
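The analogy arithmetic described above can be illustrated with a minimal sketch. The two-dimensional vectors below are invented for illustration and do not come from any trained model; the helper names `cosine` and `analogy` are likewise hypothetical.

```python
import math

# Toy 2-D vectors, invented for illustration only (not from a trained model).
# Axis 0 loosely encodes "which country", axis 1 encodes "country vs. capital".
vectors = {
    "France": [1.0, 1.0],
    "Paris": [1.0, 0.0],
    "Korea": [3.0, 1.0],
    "Seoul": [3.0, 0.0],
    "Tokyo": [5.0, 0.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(positive, negative, vectors):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    excluded = set(positive) | set(negative)
    candidates = [w for w in vectors if w not in excluded]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# France - Paris + Seoul should land on Korea in this toy space.
result = analogy(positive=["France", "Seoul"], negative=["Paris"], vectors=vectors)
```

In a trained embedding the vectors have 100 or more dimensions and the match is only approximate, so the nearest neighbour by cosine similarity is taken as the answer.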
When COVID-19 was first discovered, little was known about it, but studies on similar viruses, such as other coronaviruses and RNA viruses, had already accumulated. Using this knowledge to infer the characteristics of COVID-19 may accelerate the discovery of solutions for COVID-19.
In this study, we use word embedding with the PubMed literature as the knowledge base (Figure 1). Over the past decades, a huge number of studies on viruses, drugs, proteins, and other biological entities have accumulated in PubMed. We apply word embedding inference to the PubMed knowledge base to interpret the characteristics of COVID-19, even while knowledge of COVID-19 is still insufficient. To this end, we strive to establish a knowledge base that fully represents the biomedical knowledge of the 2010s, that is, a balanced knowledge base that is not biased towards a specific area. A modified knowledge base is then built by adding a small initial collection of early COVID-19-related articles (the new topic). The knowledge base and the modified knowledge base are built into the pretrained model and the final model, respectively, through the word embedding technique. If the pretrained model expresses the knowledge base well, the relationships that the modified knowledge base infers between the new term and pre-existing words will be meaningful for understanding the characteristics of the new topic, i.e., COVID-19. To infer the characteristics of COVID-19, we analyze the relationship between COVID-19 and two types of biomedical entities, namely drugs (chemicals) and proteins interacting with COVID-19. Because only limited studies on COVID-19 and SARS-CoV-2 have been reported, we attempt to enhance our understanding of the virus using the existing knowledge stock on coronaviruses through the modified knowledge base. This study aims to examine the potential of drug repurposing by applying word embedding to the PubMed literature. The relationships between COVID-19 and drugs as well as between COVID-19 and proteins can then be deduced by the trained model.

2. Materials and Methods

2.1. Data: PubMed Literature

SARS-CoV-2 is a novel coronavirus, and the detrimental impact of the disease is too great to wait until enough research has been conducted to find a solution. Our aim is to derive information on SARS-CoV-2 from the knowledge accumulated on known viruses, particularly coronaviruses, in PubMed, which is the largest and most up-to-date literature database for research in the biomedical and life sciences. The PubMed literature holds a plethora of information on various subjects, which can be identified using the Medical Subject Headings (MeSH) field and the Substance Name (SN) field, which covers substances registered with Unique Ingredient Identifiers (UNII) and the Chemical Abstracts Service (CAS). MeSH and SN provide information sources that can be analyzed by extracting the subject keywords of publications. Using all the sentences in the abstract of each publication to construct the word embedding model would make it possible to extract more keywords and more relationships between keywords. However, doing so would require significant noise removal and keyword refinement, which is time-consuming. The choice is between a richer keyword dictionary and a refined, noise-free one, and we chose the latter for accurate inference. To prevent data contamination at the source and to process and analyze the data efficiently, we used only the controlled subject vocabulary from MeSH and SN. Using the PubMed literature tagged in these two fields, we attempted to identify associations between COVID-19 and drugs as well as between COVID-19 and proteins.
The analyzed dataset included 7,804,687 articles from PubMed published between 2010 and 2019; these articles were tagged with MeSH and SN terms. To infer the characteristics of COVID-19, all COVID-19-related articles published before 18 March 2020, were downloaded from PubMed. COVID-19-related articles that were not tagged with MeSH or SN terms were included using the Other Term (OT) field, which refers to the author keyword field. Unlike MeSH or SN terms, the OT category does not have a controlled vocabulary; thus, we further cleaned the terms. Keywords referring to COVID-19, such as “SARS-COV-19,” “2019 Novel Coronavirus,” and “Corona Virus disease 2019,” were all combined as COVID-19. The rest of the Other Terms were also appropriately refined. A total of 539 COVID-19-related articles were included in the analysis using OT.
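The keyword cleaning step for the uncontrolled OT field can be sketched roughly as follows. The synonym set contains only the three variants quoted above; the function name `normalize_terms` and the whitespace and case handling are assumptions, not the authors' actual procedure.

```python
# Variants named in the text, stored lower-cased; a real list would be longer.
COVID_SYNONYMS = {
    "sars-cov-19",
    "2019 novel coronavirus",
    "corona virus disease 2019",
}


def normalize_terms(terms):
    """Collapse whitespace, map COVID-19 variants to one label, and drop duplicates."""
    cleaned = []
    for term in terms:
        tidy = " ".join(term.split())  # collapse runs of whitespace
        label = "COVID-19" if tidy.lower() in COVID_SYNONYMS else tidy
        if label not in cleaned:  # keep first occurrence, preserve order
            cleaned.append(label)
    return cleaned


example = normalize_terms(["SARS-COV-19", "Pneumonia", "2019 Novel  Coronavirus"])
```

Mapping every variant to a single token matters here because the embedding treats each distinct string as a separate vocabulary entry.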

2.1.1. Medical Subject Headings (MeSH)

Each article on PubMed has several MeSH tags that represent the nature of the article’s subject. MeSH terms fall into 16 large categories: (a) Anatomy; (b) Organisms; (c) Diseases; (d) Chemicals and Drugs; (e) Analytical, Diagnostic and Therapeutic Techniques, and Equipment; (f) Psychiatry and Psychology; (g) Phenomena and Processes; (h) Disciplines and Occupations; (i) Anthropology, Education, Sociology, and Social Phenomena; (j) Technology, Industry, and Agriculture; (k) Humanities; (l) Information Science; (m) Named Groups; (n) Health Care; (o) Publication Characteristics; and (p) Geographicals. Each article can have as few as one or as many as 40 or more tags. If two MeSH terms are tagged in the same article, the two terms are defined as being associated with one another. MeSH terms with a known pharmacological action are indexed as Pharmacological Action terms in the MeSH vocabulary system.

2.1.2. Substance Name (SN)

When an article in PubMed mentions substances registered in the Unique Ingredient Identifier (UNII) system or with the Chemical Abstracts Service (CAS), the substances are tagged in the Registry Number/EC Number and Substance Name fields. Registry Number and EC Number are codes registered in UNII and CAS, respectively, whereas Substance Name identifies the substance. Each article may have more than 20 substances tagged. If two substances are tagged in the same article, they are assumed to be associated with one another. Substance Names occasionally overlap with MeSH terms, but this is rare. Of note, protein names are listed as broad terms in the MeSH vocabulary system, whereas in the Substance Name system they are listed in detail, along with the source, such as human, mouse, rat, or virus.
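The association rule used for both MeSH and SN terms (two terms are associated if they are tagged in the same article) amounts to counting co-occurring pairs. The sketch below illustrates that idea only; the function name `cooccurrence_counts` and the example tags are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(term_lists):
    """Count, for every unordered term pair, the number of articles tagging both."""
    counts = Counter()
    for terms in term_lists:
        # sorted(set(...)) gives each pair once per article, in a stable order
        for first, second in combinations(sorted(set(terms)), 2):
            counts[(first, second)] += 1
    return counts


# Hypothetical tag lists for three articles.
articles = [
    ["Coronavirus", "Antiviral Agents", "Pneumonia"],
    ["Antiviral Agents", "Coronavirus"],
    ["Pneumonia"],
]
pair_counts = cooccurrence_counts(articles)
```

A skip-gram embedding does not store these counts explicitly, but it learns from the same signal: which terms appear together inside the context window.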

2.1.3. Vocabulary Combining MeSH, SN, and OT Terms

In the present study, a knowledge base was established using literature from PubMed. COVID-19-related articles were used to extract the relationship between COVID-19 and drugs and COVID-19 and proteins. The MeSH and SN terms, which efficiently express the subject of the article with little noise, were used to structure the knowledge base. For each article, the MeSH and SN terms were merged to create the combined vocabulary. The word-embedding model, a machine-learning technique, was then generated using co-occurrence relation information. To build the final model from the COVID-19-related article set, the OT terms of the COVID-19-related articles were added to expand the vocabulary further.
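As a rough sketch of the merging step, each article can be turned into one list of terms that later serves as a training "sentence". The dictionary keys `mesh`, `sn`, and `ot` and the function name `build_term_lists` are assumptions made for this example, not the layout of the authors' data.

```python
def build_term_lists(articles, include_other_terms=False):
    """Merge each article's MeSH and SN tags (optionally OT keywords) into one list."""
    corpus = []
    for article in articles:
        terms = list(article.get("mesh", [])) + list(article.get("sn", []))
        if include_other_terms:
            terms += article.get("ot", [])
        merged = list(dict.fromkeys(terms))  # drop duplicates, keep order
        if merged:  # skip articles without any usable tag
            corpus.append(merged)
    return corpus


# Hypothetical records: one tagged article and one with author keywords only.
records = [
    {"mesh": ["Coronavirus", "Pneumonia"], "sn": ["Antiviral Agents"]},
    {"mesh": [], "sn": [], "ot": ["COVID-19", "Pneumonia"]},
]
pretrained_corpus = build_term_lists(records)
final_corpus = build_term_lists(records, include_other_terms=True)
```

With `include_other_terms=True`, the second record contributes its author keywords, which mirrors how the COVID-19 article set expands the vocabulary for the final model.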

2.2. Model: Word Embedding with FastText

To broaden our understanding of COVID-19 and to infer new information about this disease, a new knowledge base needs to be established using existing knowledge bases. This study aims to produce a word-embedding model using an already established knowledge base, and to create a new knowledge base that allows the effective comparison and inference of the relationship between newly added information and the existing information. Knowledge base refers to the stock of knowledge that has been accumulated by researchers over the years. As COVID-19 is a novel issue, we aimed to build a knowledge base using the PubMed literature from the past ten years. Using the word-embedding model, every term within the knowledge base can be expressed as a vector; consequently, the relationship between terms can be calculated by vector computation.
Word embedding converts the sparse matrix that expresses relationships among numerous keywords (in which the number of dimensions equals the number of keywords) into a dense matrix with far fewer dimensions (e.g., 100–200). This allows keyword characteristics to be expressed as vectors. All keywords within the vocabulary are expressed as vectors of the chosen dimension, enabling the analysis of relationships among keywords using vector algebra. In addition, keyword analogy becomes possible, allowing a more efficient display of keyword relationships.
Common word-embedding models include Word2Vec [24] and FastText [25] for word-level embedding, and BERT (bidirectional encoder representations from transformers) [26] for sentence-level embedding. In this study, word-level embedding was used to embed the biomedical terminology tagged in articles, such as MeSH and SN terms, with minimal noise and without requiring natural language processing or named entity recognition on sentences. We also built our own pretrained and final models so that the two knowledge bases would be organically related. Word2Vec and FastText employ very similar embedding methods; however, FastText was selected for this study because of its superior subword-level analysis and its ability to handle out-of-vocabulary terms. Moreover, FastText is available both in the package released by Facebook and in the Python-based Gensim package. For this study, the Gensim implementation of FastText was used.

2.2.1. Pretrained Model

FastText can use continuous-bag-of-words and skipgram models to infer relationships between words; in this study, the latter was used. MeSH and SN terms tagged in PubMed literature between 2010 and 2019 were used as data for FastText. The vocabulary consisted of 53,216 terms.
The three hyperparameters that have major impacts on the model characteristics in FastText (vector size, window size, and number of epochs) were tested for model optimization, whereas default values were used for the other parameters. Vector size, which refers to the dimension of a word vector, was tested at 100, 150, and 200. Window size describes the size of the context window used to measure word pair relationships when building the word-embedding model. Because a single article can carry more than 60 MeSH and SN terms, window size was tested at 40, 50, and 60. The number of epochs was tested at 10, 15, and 20.
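The search space described above contains 27 combinations, which can be enumerated as in the sketch below. The keyword names `vector_size`, `window`, `epochs`, and `sg` follow Gensim 4.x; the version the authors used is not stated, and older Gensim releases name some of these parameters differently. The helper `train_fasttext` is hypothetical and imports Gensim lazily so that the grid can be inspected without it.

```python
from itertools import product

VECTOR_SIZES = (100, 150, 200)
WINDOW_SIZES = (40, 50, 60)
EPOCH_COUNTS = (10, 15, 20)


def hyperparameter_grid():
    """Enumerate every combination of the three tested hyperparameters."""
    return [
        {"vector_size": size, "window": window, "epochs": epochs}
        for size, window, epochs in product(VECTOR_SIZES, WINDOW_SIZES, EPOCH_COUNTS)
    ]


def train_fasttext(term_lists, params):
    """Train a skip-gram FastText model on per-article term lists (Gensim 4.x names)."""
    from gensim.models import FastText  # lazy import; requires Gensim to be installed

    # sg=1 selects skip-gram; all other parameters keep their Gensim defaults.
    return FastText(sentences=term_lists, sg=1, **params)


grid = hyperparameter_grid()
```

Each candidate model would then be scored with the evaluations described in the next paragraph, and the best-scoring configuration retained.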
As the FastText model in the present study was built by unsupervised training, the following evaluation methods were applied for the model optimization test. First, the evaluate_word_pairs method provided by the Gensim FastText implementation was used to validate the plausibility of the medical term relations in the model. This method is similar to the one used by the National Center for Biotechnology Information (NCBI) of the US National Library of Medicine. According to [27], NCBI built a word embedding model of PubMed and MeSH data using FastText and evaluated it by measuring word pair similarity against the Medical Resident Relatedness Set (UMNSRS) medical term pairs [28] from the University of Minnesota Pharmacy Informatics Lab. UMNSRS was developed by experts who manually rated the relatedness of 588 medical concept pairs. Out of these, the authors selected the 145 pairs that were MeSH terms and used them for the pretrained model evaluations. The evaluate_word_pairs method from Gensim calculates the Pearson and Spearman correlation coefficients between the similarities produced by the FastText model and the ratings in the list of UMNSRS medical term pairs. The model by [27] at NCBI showed a similarity of 0.660 to the UMNSRS medical term pairs. In this study, the similarity to the UMNSRS medical term pairs had a Pearson correlation coefficient above 0.667 and a Spearman correlation coefficient above 0.663, as summarized in Table 1. Second, the country–capital pair list from Google’s question-answer.txt, which is a widely used list for evaluating word embeddings of common words and contains terms that appear in the PubMed literature, was also assessed. This method uses the analogy between word vectors in the word-embedding model and measures the accuracy of the country–capital analogy relationship. As summarized in Table 1, the accuracy was above 0.785.
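To make the first evaluation concrete, the sketch below computes Pearson and Spearman correlations between human relatedness ratings and model similarities for the same term pairs. It is a plain-Python illustration of the statistics involved, not Gensim's implementation, and the four ratings and similarities are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            result[order[position]] = mean_rank
        start = end + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(xs), ranks(ys))


# Invented example: expert ratings vs. model cosine similarities for four term pairs.
human_ratings = [1.0, 2.0, 3.0, 4.0]
model_similarities = [0.1, 0.2, 0.4, 0.8]
pearson_r = pearson(human_ratings, model_similarities)
spearman_r = spearman(human_ratings, model_similarities)
```

In this toy case the model orders the pairs exactly as the experts do, so the Spearman coefficient is 1 while the Pearson coefficient is lower because the relationship is not linear.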
Based on the two evaluations, the authors determined a vector size of 200, a window size of 50, and 10 epochs as the optimal settings for the pretrained word-embedding model. A different model that exhibited higher accuracy in the second evaluation was considered; however, the results from the first evaluation were considered more relevant because this is an embedding model for biomedical terms, and the Q-A accuracy of the selected model was still high (above 0.928).

2.2.2. Final Model

The pretrained model was a word-embedding model built from the MeSH and SN terms of the PubMed literature between 2010 and 2019. The final model was built by adding the set of COVID-19 articles published in 2020 to the pretrained model. As the number of COVID-19 articles tagged with MeSH and SN terms is not large, OT terms were used for the articles lacking these tags. The final model is a modified model in which a new topic, namely COVID-19, was added to the pretrained model; its root is the same as that of the pretrained model. Therefore, of the three hyperparameters, vector size and window size were fixed in the final model at the values used for the pretrained model, and only the number of epochs was varied in evaluating the final model. Further, the final model requires a different evaluation method from the one used for the pretrained model because the objectives of the two models differ. The pretrained model aims to build a knowledge base from the 2010s, whereas the final model aims to infer the characteristics of COVID-19. Word pair evaluation was applied to the pretrained model to confirm that it structures the biomedical knowledge base using biomedical terms. In contrast, the final model needed to be evaluated on whether it accurately predicts the characteristics of COVID-19 on top of the pretrained model. However, in the early stage of research on COVID-19, there were few publications on the disease; hence, a model that overfits the very small part of what humanity had learned about COVID-19 would not be adequate. One well-established fact about COVID-19 is that it is caused by an RNA virus. Therefore, we selected the most effective model based on the measured similarity of the COVID-19 term to RNA virus terms. As summarized in Table 2, the number of epochs ranged from 10 to 150, and the similarity between the COVID-19 term and RNA virus terms was measured.
As the number of epochs increases, training is repeated, yielding a word embedding model that describes the COVID-19 data added to the final model well; at some point, however, overfitting may occur, which hinders inference about COVID-19. Therefore, the appropriate number of epochs has to be determined according to the evaluation method before building the final model. The average similarity increases with the number of epochs, reaches its maximum at 110 epochs, and then tends to saturate. Therefore, the final model used 110 epochs, and its vocabulary ultimately consisted of 53,316 terms, owing to the addition of the OT terms extracted from the COVID-19 article set to the pretrained model’s vocabulary.
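The selection rule for the number of epochs can be sketched as picking the setting with the highest mean similarity between the COVID-19 term and a set of RNA virus terms. The similarity values below are invented for illustration and are not the figures from Table 2; the function name `select_epochs` is hypothetical.

```python
def select_epochs(similarities_by_epochs):
    """Return the epoch count with the highest mean similarity, plus all the means."""
    averages = {
        epochs: sum(values) / len(values)
        for epochs, values in similarities_by_epochs.items()
    }
    best = max(averages, key=averages.get)
    return best, averages


# Invented similarities of "COVID-19" to two RNA virus terms per candidate model.
measured = {
    10: [0.30, 0.40],
    50: [0.45, 0.55],
    110: [0.60, 0.62],
    150: [0.59, 0.61],
}
best_epochs, mean_similarity = select_epochs(measured)
```

Using an external fact (the RNA virus relationship) as the selection criterion is what keeps the choice from simply rewarding the model that fits the small COVID-19 article set most tightly.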

3. Results

The following COVID-19-related drugs and proteins were extracted from the final model. From the list of drugs available, the authors focused on anti-infective drugs. In MeSH, Pharmacological Action keywords are provided along with the drugs. The authors selected the following Pharmacological Action terms to filter for anti-infective drugs: Anti-Bacterial Agents; Antibiotics, Antifungal; Antibiotics, Antineoplastic; Antibiotics, Antitubercular; Anti-Infective Agents; Anti-Infective Agents, Local; Anti-Infective Agents, Urinary; Antimalarials; Antiprotozoal Agents; Antitubercular Agents; Anti-HIV Agents; Antiviral Agents; HIV Fusion Inhibitors; HIV Integrase Inhibitors; and HIV Protease Inhibitors. Using these terms, a total of 401 anti-infective drugs were identified. Within the final model, the similarity between each anti-infective drug and COVID-19 was measured to assess their relationship. Table 3 lists the top 100 out of the 401 anti-infective drugs that were related to the COVID-19 vaccine or to the treatment drugs currently being developed. The drugs in Table 3 that are highlighted in gray showed low relevance to COVID-19 compared with the top 100 drugs; however, they are currently being studied as potential vaccines or treatments. Excelra [29], the ReDO Project [30], and DrugBank [31] summarize the drugs that are being repurposed as potential COVID-19 vaccines or treatments. The authors compared these three drug repurposing databases with the final model results from the current study; the comparison results are listed in the Reference column of Table 3. Of the 401 anti-infective drugs selected, 64 were identified as being in current development as COVID-19 vaccines or treatments. Based on relevance to COVID-19, 33 repositioning candidate drugs were identified among the top 100 drugs.
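The filter-then-rank procedure can be sketched as follows. The vectors, the drug names `drug_a` to `drug_c`, and the action assignments are placeholders invented for this example; only the two Pharmacological Action labels in the filter set are taken from the list above, and cosine similarity is assumed as the relevance measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_by_similarity(target, candidates, vectors, top_n=None):
    """Rank candidate terms by cosine similarity to the target term, highest first."""
    scored = [
        (name, cosine(vectors[target], vectors[name]))
        for name in candidates
        if name in vectors
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Placeholder embeddings and pharmacological actions, invented for illustration.
vectors = {
    "COVID-19": [1.0, 0.2],
    "drug_a": [0.9, 0.3],
    "drug_b": [0.1, 1.0],
    "drug_c": [1.0, 0.0],
}
actions = {
    "drug_a": {"Antiviral Agents"},
    "drug_b": {"Anti-Bacterial Agents"},
    "drug_c": {"Antihypertensive Agents"},
}
ANTI_INFECTIVE = {"Antiviral Agents", "Anti-Bacterial Agents"}

# Keep only drugs with at least one selected action, then rank them.
anti_infective_drugs = [d for d, acts in actions.items() if acts & ANTI_INFECTIVE]
ranking = rank_by_similarity("COVID-19", anti_infective_drugs, vectors)
```

Here `drug_c` is excluded by the filter even though its vector is close to the target, which is the kind of candidate the limitations section later suggests revisiting.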
The imipenem and cilastatin drug combination (brand name Primaxin), which showed the highest similarity, is a treatment for severe infections affecting the heart, lungs, bladder, kidney, skin, blood, bones, stomach, and the female reproductive organs. With the spread of COVID-19, the U.S. FDA approved the antibiotic combination of imipenem–cilastatin and relebactam (Recarbrio) for the treatment of hospital-acquired bacterial pneumonia and ventilator-associated bacterial pneumonia. Oseltamivir and chloroquine, the two drugs most frequently mentioned in the media in the first half of 2020, also showed very high relevance to COVID-19. The amoxicillin and clavulanate potassium combination, more commonly known under the trade name Augmentin, is an antibiotic that is widely used for sinusitis, bronchitis, pneumonia, ear infections, and urinary tract and skin infections. Clinical trials using amoxicillin/clavulanate alone or in combination with azithromycin are currently ongoing. The trimethoprim–sulfamethoxazole drug combination (Bactrim), which has excellent antibacterial activity against gram-negative bacteria and staphylococcus, is an antibiotic used for the treatment of ear infections, urinary tract infections, bronchitis, traveler’s diarrhea, shigellosis, and Pneumocystis jirovecii pneumonia. The drug is currently in clinical trials for use with anakinra, an IL-1 receptor antagonist, in the treatment of COVID-19-induced hyperimmune respiratory failure (also known as cytokine storm).
Most of the potential drugs with the highest relevance (top 100) to COVID-19 were drugs for bacterial infections (antibiotics). Several drugs for viral infections were also on the list. Various anti-retrovirals (used in HIV/AIDS) and anti-malarial drugs were also shown to have high relevance to COVID-19.
To indirectly confirm the robustness of our final model, we compared the drug lists of 10 models trained with different numbers of epochs. The top of the list barely changed, and only the bottom of the list (about 10%) showed small changes, indicating that our final model is robust and that the list of potential drugs with the highest relevance (top 100) is a stable result.
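One simple way to quantify such a robustness comparison is the fraction of terms shared by the top-k lists of two models. This is an illustrative measure chosen for the sketch, not necessarily the comparison the authors performed, and the drug names are placeholders.

```python
def top_k_overlap(ranking_a, ranking_b, k):
    """Fraction of the top-k entries of two ranked lists that appear in both."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(top_a & top_b) / k


# Placeholder rankings from two hypothetical models trained with different epochs.
model_1 = ["drug_a", "drug_b", "drug_c", "drug_d"]
model_2 = ["drug_b", "drug_a", "drug_d", "drug_c"]

overlap_top2 = top_k_overlap(model_1, model_2, k=2)
overlap_top3 = top_k_overlap(model_1, model_2, k=3)
```

An overlap near 1 for the upper part of the list, with lower values only towards the bottom, would correspond to the behaviour reported above.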
Using a similar method, protein terms with high relevance to COVID-19 were extracted from the final model. Only the 5366 proteins that are of either human or coronavirus origin were extracted, and their relevance to COVID-19 was then analyzed. Table 4 lists the top 100 proteins relevant to COVID-19. The proteins highlighted in gray in Table 4 showed low relevance to COVID-19 but are known to be human proteins that interact with COVID-19. Information on the human proteins that are known to interact with COVID-19 and on known proteins of COVID-19 can be found in [32] and in the study by [7]. The protein description, gene name, and COVID-19 bait columns in Table 4 also identify the COVID-19-interacting proteins. In particular, COVID-19 viral proteins were identified as proteins with high relevance to COVID-19 in the final model, along with angiotensin converting enzyme 2, which is known as the COVID-19 entry receptor. Among the top 100 highly relevant proteins, the following were identified: six SARS-CoV-2 viral proteins listed in The Human Protein Atlas (M protein, coronavirus; nsp1 protein, SARS coronavirus; nsp14 protein, SARS coronavirus; 3C-like proteinase, coronavirus; nonstructural protein 3, SARS coronavirus; Nsp16 protein, SARS virus) and three human proteins (angiotensin converting enzyme 2; NARS2 protein, human; ALG8 protein, human).
The drugs and proteins listed in Table 3 and Table 4 constitute the list of COVID-19-related terms extracted from the word embedding model based on PubMed MeSH and SN terms. When these results were compared with the latest references reflecting current research trends, some were consistent, while others highlighted information not being investigated in current research.

4. Discussion and Conclusions

This research aimed to understand the characteristics of COVID-19, a novel disease that humanity is currently facing, using the PubMed database, a knowledge base that has been established over a long period. To accomplish this, information from the PubMed literature of the past decade, including that pertaining to coronaviruses, was structured in a word embedding model, and the relationships between COVID-19 terms and other biomedical terms were subsequently inferred. As a result, proteins and drugs with high relevance to COVID-19 were deduced.
The word embedding technique used in this study advances knowledge discovery from the biomedical literature, previously the domain of bibliometrics, by enabling inference where knowledge is in demand but highly uncertain, as with COVID-19. This helps to understand and discover new knowledge. The vector calculation and mathematical modeling techniques of word embedding can help advance drug development, which is time-consuming and costly, by adding inference capabilities to the still insufficient knowledge in the medical literature.
The results of this study align closely with the biomedical needs of research and development efforts to overcome the COVID-19 crisis. We expect that this list of drugs and proteins, and their relevance to COVID-19, will help in identifying potential vaccine or treatment candidates. This word embedding research model also provides an in silico method for drug repurposing that can drastically reduce the time and cost of drug development. With the urgent need to identify drug candidates for COVID-19, various data, tools, and methods for drug repurposing are being introduced and analyzed. The results of this study also provide a computational method to predict potential drug–target interactions (DTIs).
This study has three limitations. First, it used only MeSH and SN terms for word embedding, which has both advantages and limitations. On the one hand, these terms are controlled vocabularies, and only technical terms were used to establish the model, which virtually eliminates noise. On the other hand, this choice might have excluded new terms that appear only in plain text. If plain text such as abstracts were included, natural language processing and named entity recognition would be required; in that case, the BERT model could be considered. Second, for drug repositioning, pharmacological actions beyond anti-infective ones should have been considered. Recently, drugs have been investigated for entirely different indications; for example, anti-tumor drugs and anti-parasitic drugs are also being studied as potential COVID-19 treatments. As this study aims to expand our knowledge of COVID-19, a broader range of drugs may need to be examined for relevance to COVID-19. Third, adding databases beyond PubMed could provide more information. In particular, adding clinical trial databases could enrich the information with data on the latest commercial drugs.
Follow-up research is needed to overcome these limitations. Future research should analyze the entire list of drug substance terms, not only anti-infective drugs, in order to produce helpful results for drug repositioning for COVID-19. This is because, as in past cases in which new indications were added to drugs with completely different indications, the possibility that a seemingly irrelevant drug will emerge as a therapeutic candidate for COVID-19 cannot be ruled out. Furthermore, a word embedding model that uses clinical trial databases in addition to the PubMed literature needs to be established. With the addition of pharmacokinetic prediction, the list of potential vaccine or treatment candidates could become more meaningful and more useful. Studies that apply a clustering technique to the COVID-19-related drug and protein lists to understand the interactions between drugs and proteins, or studies applying the BERT model, would also be meaningful follow-ups. If we approach the pandemic from the perspective of an X-event like a major accident [33], machine learning-based modeling studies of complex systems for the spread of infectious disease will also help broaden our understanding of COVID-19 and of new infectious diseases caused by novel viruses [34]. These efforts will help make more accurate information on COVID-19 available rapidly, which will help overcome new pandemics.

Author Contributions

Conceptualization, methodology, analysis: H.Y.; Interpretation, writing: H.Y. and E.S. Both authors have read and agreed to the published version of the manuscript.

Funding

This work has been supported by the Korea Institute of Science and Technology Information.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The pretrained and final models used in this study are available at https://github.com/hyyangkisti/bio_embedding. The complete lists of drugs and proteins related to COVID-19 are available at https://hyyangkisti.github.io/ddppr/covid-19 (both sites accessed on 15 March 2021).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhu, N.; Zhang, D.; Wang, W.; Li, X.; Yang, B.; Song, J.; Tan, W. A novel coronavirus from patients with pneumonia in China, 2019. N. Engl. J. Med. 2020, 382, 727–733.
  2. WHO Coronavirus Disease (COVID-19) Dashboard. Available online: https://covid.who.int/ (accessed on 11 March 2021).
  3. Calligari, P.; Bobone, S.; Ricci, G.; Bocedi, A. Molecular investigation of SARS–CoV-2 proteins and their interactions with antiviral drugs. Viruses 2020, 12, 445.
  4. Prajapat, M.; Sarma, P.; Shekhar, N.; Avti, P.; Sinha, S.; Kaur, H.; Medhi, B. Drug targets for corona virus: A systematic review. Indian J. Pharmacol. 2020, 52, 56.
  5. Astuti, I. Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2): An overview of viral structure and host response. Diabetes Metab. Syndr. Clin. Res. Rev. 2020, 14, 407–412.
  6. Chen, Y.; Liu, Q.; Guo, D. Emerging coronaviruses: Genome structure, replication, and pathogenesis. J. Med. Virol. 2020, 92, 418–423.
  7. Gordon, D.E.; Jang, G.M.; Bouhaddou, M.; Xu, J.; Obernier, K.; White, K.M.; Krogan, N.J. A SARS-CoV-2 protein interaction map reveals targets for drug repurposing. Nature 2020, 583, 459–468.
  8. Yang, S.; Fu, C.; Lian, X.; Dong, X.; Zhang, Z. Understanding human-virus protein-protein interactions using a human protein complex-based analysis framework. MSystems 2019, 4.
  9. Yoshimoto, F.K. The proteins of severe acute respiratory syndrome coronavirus-2 (SARS CoV-2 or n-COV19), the cause of COVID-19. Protein J. 2020, 39, 198–216.
  10. Tu, Y.F.; Chien, C.S.; Yarmishyn, A.A.; Lin, Y.Y.; Luo, Y.H.; Lin, Y.T.; Chiou, S.H. A review of SARS-CoV-2 and the ongoing clinical trials. Int. J. Mol. Sci. 2020, 21, 2657.
  11. Gómez-Ríos, D.; López-Agudelo, V.A.; Ramírez-Malule, H. Repurposing antivirals as potential treatments for SARS-CoV-2: From SARS to COVID-19. J. Appl. Pharm. Sci. 2020, 10, 1–9.
  12. Guy, R.K.; DiPaola, R.S.; Romanelli, F.; Dutch, R.E. Rapid repurposing of drugs for COVID-19. Science 2020, 368, 829–830.
  13. Serafin, M.B.; Bottega, A.; Foletto, V.S.; da Rosa, T.F.; Hörner, A.; Hörner, R. Drug repositioning is an alternative for the treatment of coronavirus COVID-19. Int. J. Antimicrob. Agents 2020, 55, 105969.
  14. Liu, S.; Zheng, Q.; Wang, Z. Potential covalent drugs targeting the main protease of the SARS-CoV-2 coronavirus. Bioinformatics 2020, 36, 3295–3298.
  15. Zhou, Y.; Hou, Y.; Shen, J.; Huang, Y.; Martin, W.; Cheng, F. Network-based drug repurposing for novel coronavirus 2019-nCoV/SARS-CoV-2. Cell Discov. 2020, 6, 14.
  16. Wu, C.; Liu, Y.; Yang, Y.; Zhang, P.; Zhong, W.; Wang, Y.; Li, H. Analysis of therapeutic targets for SARS-CoV-2 and discovery of potential drugs by computational methods. Acta Pharm. Sin. B 2020, 10, 766–788.
  17. Cernile, G.; Heritage, T.; Sebire, N.J.; Gordon, B.; Schwering, T.; Kazemlou, S.; Borecki, Y. Network graph representation of COVID-19 scientific publications to aid knowledge discovery. BMJ Health Care Inform. 2021, 28, e100254.
  18. Domingo-Fernández, D.; Baksi, S.; Schultz, B.; Gadiya, Y.; Karki, R.; Raschka, T.; Hofmann-Apitius, M. COVID-19 Knowledge Graph: A computable, multi-modal, cause-and-effect knowledge model of COVID-19 pathophysiology. Bioinformatics 2020.
  19. Chen, C.; Ebeid, I.A.; Bu, Y.; Ding, Y. Coronavirus knowledge graph: A case study. arXiv 2020, arXiv:2007.10287.
  20. Reese, J.T.; Unni, D.; Callahan, T.J.; Cappelletti, L.; Ravanmehr, V.; Carbon, S.; Mungall, C.J. KG-COVID-19: A framework to produce customized knowledge graphs for COVID-19 response. Patterns 2020, 2, 100155.
  21. Wang, Q.; Li, M.; Wang, X.; Parulian, N.; Han, G.; Ma, J.; Onyshkevych, B. COVID-19 literature knowledge graph construction and drug repurposing report generation. arXiv 2020, arXiv:2007.00576.
  22. Jurafsky, D. Speech Language Processing; Pearson Education: Tamil Nadu, India, 2000.
  23. Habibi, M.; Weber, L.; Neves, M.; Wiegandt, D.L.; Leser, U. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 2017, 33, i37–i48.
  24. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.; Dean, J. Distributed representations of words and phrases and their compositionality. arXiv 2013, arXiv:1310.4546.
  25. Mikolov, T.; Grave, E.; Bojanowski, P.; Puhrsch, C.; Joulin, A. Advances in pre-training distributed word representations. arXiv 2017, arXiv:1712.09405.
  26. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
  27. Zhang, Y.; Chen, Q.; Yang, Z.; Lin, H.; Lu, Z. BioWordVec, improving biomedical word embeddings with subword information and MeSH. Sci. Data 2019, 6, 52.
  28. Pakhomov, S.; McInnes, B.; Adam, T.; Liu, Y.; Pedersen, T.; Melton, G.B. Semantic similarity and relatedness between clinical terms: An experimental study. In AMIA Annual Symposium Proceedings; American Medical Informatics Association: Bethesda, MD, USA, 2010; Volume 2010, p. 572.
  29. Excelra, COVID-19 Drug Repurposing Database. Available online: https://www.excelra.com/covid-19-drug-repurposing-database/ (accessed on 11 March 2021).
  30. ReDO Project, Covid19_DB. Available online: http://www.redo-project.org/covid19db/ (accessed on 11 March 2021).
  31. DrugBank, Covid-19 Information. Available online: https://www.drugbank.ca/ (accessed on 11 March 2021).
  32. The Human Protein Atlas. Available online: https://www.proteinatlas.org/ (accessed on 11 March 2021).
  33. Di Nardo, M.; Madonna, M.; Murino, T.; Castagna, F. Modelling a Safety Management System Using System Dynamics at the Bhopal Incident. Appl. Sci. 2020, 10, 903.
  34. Galea, S.; Riddle, M.; Kaplan, G.A. Causal thinking and complex system approaches in epidemiology. Int. J. Epidemiol. 2010, 39, 97–106.
Figure 1. Conceptual diagram of research design.
Table 1. Word-embedding model evaluation results based on hyperparameter combinations.
Hyperparameters: vector size, window size, number of epochs. Evaluation_1: Pearson and Spearman correlation similarity. Evaluation_2: Q-A accuracy.

| No. | Vector Size | Window Size | Number of Epochs | Pearson Coefficient | Pearson p-Value | Spearman Coefficient | Spearman p-Value | Q-A Accuracy |
|---|---|---|---|---|---|---|---|---|
| 1 | 100 | 40 | 10 | 0.6774 | 2.13 × 10^−20 | 0.6774 | 2.10 × 10^−20 | 0.8810 |
| 2 | 100 | 40 | 15 | 0.6708 | 6.71 × 10^−20 | 0.6776 | 2.02 × 10^−20 | 0.8095 |
| 3 | 100 | 40 | 20 | 0.6687 | 9.62 × 10^−20 | 0.6667 | 1.34 × 10^−19 | 0.8095 |
| 4 | 100 | 50 | 10 | 0.6718 | 5.61 × 10^−20 | 0.6759 | 2.77 × 10^−20 | 0.8333 |
| 5 | 100 | 50 | 15 | 0.6634 | 2.34 × 10^−19 | 0.6632 | 2.42 × 10^−19 | 0.7857 |
| 6 | 100 | 50 | 20 | 0.6688 | 9.43 × 10^−20 | 0.6713 | 6.10 × 10^−20 | 0.8810 |
| 7 | 100 | 60 | 10 | 0.6747 | 3.41 × 10^−20 | 0.6820 | 9.25 × 10^−21 | 0.8810 |
| 8 | 100 | 60 | 15 | 0.6715 | 5.89 × 10^−20 | 0.6736 | 4.10 × 10^−20 | 0.8810 |
| 9 | 100 | 60 | 20 | 0.6745 | 3.51 × 10^−20 | 0.6810 | 1.11 × 10^−20 | 0.8571 |
| 10 | 150 | 40 | 10 | 0.6711 | 6.38 × 10^−20 | 0.6816 | 9.99 × 10^−21 | 0.9048 |
| 11 | 150 | 40 | 15 | 0.6749 | 3.29 × 10^−20 | 0.6864 | 4.15 × 10^−21 | 0.9762 |
| 12 | 150 | 40 | 20 | 0.6718 | 5.65 × 10^−20 | 0.6802 | 1.29 × 10^−20 | 0.9762 |
| 13 | 150 | 50 | 10 | 0.6775 | 2.07 × 10^−20 | 0.6931 | 1.21 × 10^−21 | 0.9286 |
| 14 | 150 | 50 | 15 | 0.6781 | 1.88 × 10^−20 | 0.6914 | 1.66 × 10^−21 | 0.9524 |
| 15 | 150 | 50 | 20 | 0.6772 | 2.19 × 10^−20 | 0.6891 | 2.56 × 10^−21 | 0.9762 |
| 16 | 150 | 60 | 10 | 0.6708 | 6.70 × 10^−20 | 0.6817 | 9.79 × 10^−21 | 0.9524 |
| 17 | 150 | 60 | 15 | 0.6780 | 1.89 × 10^−20 | 0.6928 | 1.27 × 10^−21 | 0.9762 |
| 18 | 150 | 60 | 20 | 0.6695 | 8.39 × 10^−20 | 0.6796 | 1.42 × 10^−20 | 0.9286 |
| 19 | 200 | 40 | 10 | 0.6714 | 6.05 × 10^−20 | 0.6849 | 5.52 × 10^−21 | 0.9524 |
| 20 | 200 | 40 | 15 | 0.6718 | 5.66 × 10^−20 | 0.6856 | 4.81 × 10^−21 | 0.9524 |
| 21 | 200 | 40 | 20 | 0.6700 | 7.66 × 10^−20 | 0.6833 | 7.39 × 10^−21 | 0.9048 |
| 22 | 200 | 50 | 10 | 0.6783 | 1.81 × 10^−20 | 0.6936 | 1.09 × 10^−21 | 0.9286 |
| 23 | 200 | 50 | 15 | 0.6761 | 2.64 × 10^−20 | 0.6908 | 1.86 × 10^−21 | 0.8810 |
| 24 | 200 | 50 | 20 | 0.6716 | 5.86 × 10^−20 | 0.6843 | 6.13 × 10^−21 | 0.9286 |
| 25 | 200 | 60 | 10 | 0.6708 | 6.65 × 10^−20 | 0.6847 | 5.71 × 10^−21 | 0.9524 |
| 26 | 200 | 60 | 15 | 0.6755 | 2.96 × 10^−20 | 0.6875 | 3.39 × 10^−21 | 1.0000 |
| 27 | 200 | 60 | 20 | 0.6669 | 1.30 × 10^−19 | 0.6774 | 2.11 × 10^−20 | 0.9524 |
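The 27 rows of Table 1 correspond to a full grid over three vector sizes, three window sizes, and three epoch counts, each scored by Pearson and Spearman correlation. The following is a minimal sketch of how such a grid could be enumerated and how the two coefficients could be computed with the Python standard library only. The function names are illustrative assumptions and this is not the authors' code; p-values and the Q-A accuracy evaluation are not reproduced here.

```python
from itertools import product
from math import sqrt

# Grid values as listed in Table 1.
VECTOR_SIZES = (100, 150, 200)
WINDOW_SIZES = (40, 50, 60)
EPOCHS = (10, 15, 20)


def hyperparameter_grid():
    """Return (vector_size, window_size, epochs) tuples in Table 1 row order."""
    return list(product(VECTOR_SIZES, WINDOW_SIZES, EPOCHS))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(xs), ranks(ys))
```

In use, `xs` would hold the model's similarity scores for a set of term pairs and `ys` the corresponding reference scores, with one pair of coefficients computed per grid entry.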
Table 2. Final model evaluation results based on the number of epochs.

| No. | Number of Epochs | COVID-19 with RNA Virus Term Similarity |
|---|---|---|
| 1 | 10 | 0.4112 |
| 2 | 20 | 0.4524 |
| 3 | 30 | 0.4696 |
| 4 | 40 | 0.4779 |
| 5 | 50 | 0.4778 |
| 6 | 60 | 0.4802 |
| 7 | 70 | 0.4822 |
| 8 | 80 | 0.4835 |
| 9 | 90 | 0.4850 |
| 10 | 100 | 0.4862 |
| 11 | 110 | 0.4873 |
| 12 | 120 | 0.4867 |
| 13 | 130 | 0.4851 |
| 14 | 140 | 0.4853 |
| 15 | 150 | 0.4854 |
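The values in Table 2 rise up to 110 epochs and decline slightly afterwards. As a small sketch, and assuming the selection criterion is simply the highest similarity between COVID-19 and RNA virus terms (an assumption made here for illustration), the epoch count could be picked as follows. The dictionary is transcribed from Table 2; the function name is hypothetical.

```python
# Number of epochs -> similarity between COVID-19 and RNA virus terms (Table 2).
SIMILARITY_BY_EPOCHS = {
    10: 0.4112, 20: 0.4524, 30: 0.4696, 40: 0.4779, 50: 0.4778,
    60: 0.4802, 70: 0.4822, 80: 0.4835, 90: 0.4850, 100: 0.4862,
    110: 0.4873, 120: 0.4867, 130: 0.4851, 140: 0.4853, 150: 0.4854,
}


def best_epochs(scores):
    """Return the epoch count with the highest similarity score."""
    return max(scores, key=scores.get)


print(best_epochs(SIMILARITY_BY_EPOCHS))  # 110
```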
Table 3. List of drugs extracted from the final model with high relevance to COVID-19.

| No. | Drug Medical Subject Heading (MeSH) Term for Anti-Infective Pharmacological Action from the Final Model | Original Indication | Reference |
|---|---|---|---|
| 1 | Cilastatin, Imipenem Drug Combination | Bacterial infection | |
| 2 | Oseltamivir | Influenza virus infection | PMID: 12690091, NCT04345419 et al., 7 cases |
| 3 | Chloroquine | Malaria | PMID: 32074550, NCT04286503 et al., 29 cases |
| 4 | Amoxicillin-Potassium Clavulanate Combination | Bacterial infection | NCT04363060 |
| 5 | Trimethoprim, Sulfamethoxazole Drug Combination | Bacterial infection | NCT04357366, NCT03489629 |
| 6 | Emtricitabine, Rilpivirine, Tenofovir Drug Combination | HIV/AIDS | |
| 7 | Colistin | Bacterial infection | ChiCTR2000032242 (China) |
| 8 | Interferons | Viral infection, Cancer | NCT04379518 et al., 9 cases |
| 9 | Artemether, Lumefantrine Drug Combination | Malaria | |
| 10 | Penicillin | Bacterial infection | |
| 11 | Bacteriocins | Bacterial infection | |
| 12 | Amdinocillin | Bacterial infection | |
| 13 | Tigecycline | Bacterial infection | PMID: 28700943 |
| 14 | Streptogramins | Bacterial infection | |
| 15 | Teicoplanin | Bacterial infection | IRCT20161204031229N3 (Iran) |
| 16 | Palivizumab | Viral infection | |
| 17 | Aztreonam | Bacterial infection | |
| 18 | Meropenem | Bacterial infection | |
| 19 | Azlocillin | Bacterial infection | |
| 20 | Silver Proteins | Antiseptics | |
| 21 | Imipenem | Bacterial infection | |
| 22 | Ribavirin | Viral infection | PMID: 22555152, NCT04392427 |
| 23 | Lincosamides | Bacterial infection | |
| 24 | Piperacillin, Tazobactam Drug Combination | Bacterial infection | NCT02735707 |
| 25 | Polymyxins | Bacterial infection | |
| 26 | Emtricitabine, Tenofovir Disoproxil Fumarate Drug Combination | HIV/AIDS | NCT04329520 |
| 27 | Mefloquine | Malaria | NCT04347031 |
| 28 | Methicillin | Bacterial infection | |
| 29 | Zanamivir | Influenza A virus infection | PMID: 15200845 |
| 30 | Rimantadine | Influenza A virus infection | PMID: 31133031, 15288617 |
| 31 | Valganciclovir | Viral infection | |
| 32 | Amantadine | Dyskinesia associated with parkinsonism, influenza infection | |
| 33 | Cephalosporins | Bacterial infection | |
| 34 | Ampicillin | Bacterial infection | |
| 35 | Doripenem | Bacterial infection | |
| 36 | Simeprevir | HCV infection | |
| 37 | Lopinavir | HIV/AIDS | NCT04372628 et al., 37 cases |
| 38 | Cefamandole | Bacterial infection | |
| 39 | Ceftriaxone | Bacterial infection | NCT02735707 |
| 40 | Thienamycins | Bacterial infection | |
| 41 | Penicillic Acid | Bacterial infection | |
| 42 | Sisomicin | Bacterial infection | |
| 43 | Ganciclovir | Cytomegalovirus retinitis | PMID: 32166607 |
| 44 | Primaquine | Malaria | NCT04349410 |
| 45 | Sulfalene | Bacterial infection | |
| 46 | Azithromycin | Bacterial infection | NCT04332107 et al., 67 cases |
| 47 | Vancomycin | Bacterial infection | NCT02667418 |
| 48 | Spectinomycin | Bacterial infection | |
| 49 | Efavirenz, Emtricitabine, Tenofovir Disoproxil Fumarate Drug Combination | HIV/AIDS | |
| 50 | Minocycline | Bacterial infection | NCT03489629 |
| 51 | Leucomycins | Bacterial infection | |
| 52 | Ticarcillin | Bacterial infection | |
| 53 | Linezolid | Bacterial infection | PMID: 16127068, 16723564, 22094260 |
| 54 | Ertapenem | Bacterial infection | |
| 55 | Clindamycin | Bacterial infection | NCT04349410 |
| 56 | Chloramphenicol | Bacterial infection | PMID: 23148581 |
| 57 | Doxycycline | Bacterial infection | NCT04370782 et al., 6 cases |
| 58 | Hydroxychloroquine | Malaria | NCT04358068 et al., 177 cases |
| 59 | Famciclovir | Viral infection | |
| 60 | Tyrocidine | Bacterial infection | |
| 61 | Acyclovir | Viral infection | |
| 62 | Nisin | Bacterial infection | |
| 63 | Nebramycin | Bacterial infection | |
| 64 | Penicillanic Acid | Bacterial infection | |
| 65 | Elvitegravir, Cobicistat, Emtricitabine, Tenofovir Disoproxil Fumarate Drug Combination | HIV/AIDS | |
| 66 | Pristinamycin | Bacterial infection | |
| 67 | Nevirapine | HIV/AIDS | |
| 68 | Lamivudine | HIV/AIDS | |
| 69 | Piperacillin | Bacterial infection | NCT04394182 |
| 70 | Valacyclovir | Viral infection | |
| 71 | Viomycin | Bacterial infection | |
| 72 | Emtricitabine | HIV/AIDS | NCT04334928 |
| 73 | Ceftazidime | Bacterial infection | |
| 74 | Artemisinins | Malaria | |
| 75 | Josamycin | Bacterial infection | |
| 76 | Telbivudine | Viral infection | |
| 77 | Fidaxomicin | Bacterial infection | NCT02667418 |
| 78 | Edeine | Bacterial infection | |
| 79 | Cefoxitin | Bacterial infection | |
| 80 | Proguanil | Malaria | |
| 81 | Fosfomycin | Bacterial infection | |
| 82 | Methacycline | Bacterial infection | |
| 83 | Tylosin | Bacterial infection | |
| 84 | Sulbactam | Bacterial infection | |
| 85 | Amikacin | Bacterial infection | |
| 86 | Ritonavir | HIV/AIDS | NCT04372628 et al., 43 cases |
| 87 | Sulfadoxine | Malaria | |
| 88 | Dihydrostreptomycin Sulfate | Bacterial infection | |
| 89 | Cefotaxime | Bacterial infection | |
| 90 | Cefotetan | Bacterial infection | |
| 91 | Hexetidine | Bacterial infection | |
| 92 | Atovaquone | Pneumocystis pneumonia, toxoplasmosis, malaria | NCT04339426 |
| 93 | Oxacillin | Bacterial infection | |
| 94 | Daptomycin | Bacterial infection | |
| 95 | Rilpivirine | HIV/AIDS | |
| 96 | Sofosbuvir | Hepatitis C virus infection | NCT04443725 |
| 97 | Streptomycin | Bacterial infection | |
| 98 | Artesunate | Malaria | NCT04387240 |
| 99 | Hepcidins | Antimicrobial peptide | |
| 100 | Sparsomycin | Bacterial infection | |
| 108 | Tenofovir | Viral infection | IRCT20200421047155N1 |
| 134 | Mupirocin | Impetigo and secondary skin infection | NCT03489629 |
| 137 | Inosine Pranobex | Viral infection | NCT04360122, NCT04383717 |
| 142 | Cytarabine | Leukemia | NCT02310321 |
| 149 | Clarithromycin | Bacterial infection | NCT04398004 |
| 150 | Itraconazole | Fungal infection | 2020-001243-15 (Belgium) |
| 154 | Amoxicillin | Bacterial infection | NCT04363060 |
| 157 | Tazobactam | Bacterial infection | NCT04394182 |
| 160 | Cobicistat | HIV-1 infection | NCT04425382 et al., 3 cases |
| 175 | Quinacrine | Malaria | PMID: 23301007, 31307979, 32194980 |
| 186 | Darunavir | HIV-1 infection | NCT04435587 et al., 4 cases |
| 191 | Iodine | Breast disorders and pain | NCT04344236 |
| 194 | Indinavir | HIV/AIDS | PMID: 15144898 |
| 199 | Clavulanic Acid | Bacterial infection | NCT04363060 |
| 202 | Mycophenolic Acid | Organ rejection | PMID: 5799033 |
| 204 | Maraviroc | HIV infection | NCT04435522, NCT04441385 |
| 220 | Chlorhexidine | Antiseptics | NCT04344236, NCT03489629 |
| 235 | Trimethoprim | Bacterial infection | NCT04357366, NCT03489629 |
| 247 | Sulfamethoxazole | Bacterial infection | NCT04357366, NCT03489629 |
| 253 | Acetylcysteine | Mucolytics | NCT04419025 et al., 4 cases |
| 257 | Dactinomycin | Cancer | PMID: 1335030, 32194980 |
| 269 | Atazanavir Sulfate | HIV-1 infection | NCT02016924 |
| 277 | Idarubicin | Acute Myeloid Leukemia | NCT02310321 |
| 284 | Hydrogen Peroxide | Disinfectant and Sterilizer | NCT04409873 |
| 291 | Povidone-Iodine | Infection | NCT04410159 et al., 7 cases |
| 294 | Sirolimus | Organ rejection | NCT04374903 et al., 3 cases |
| 298 | Methylene Blue | Methemoglobinemia | NCT04376788, NCT04370288 |
| 350 | Pyrazinamide | Tuberculosis | NCT04349241 |
| 362 | Camphor | Coughing | PMID: 27823881, 32194980 |
| 367 | Cetylpyridinium | Bacterial infection | NCT04409873 |
| 374 | Daunorubicin | Cancer | PMID: 9647783 |
Table 4. List of coronavirus or human proteins extracted from the final model with high relevance to COVID-19, with protein description, gene name, and COVID-19 bait information.

| No. | Protein Substance Name (SN) Term of Human or Coronavirus Origin from the Final Model | Protein Description | Gene Name | COVID-19 Bait |
|---|---|---|---|---|
| 1 | M protein, Coronavirus | SARS-CoV-2 Viral Protein (M) | | |
| 2 | Nsp1 protein, SARS coronavirus | SARS-CoV-2 Viral Protein (nsp1) | | |
| 3 | nsp14 protein, SARS coronavirus | SARS-CoV-2 Viral Protein (nsp14) | | |
| 4 | 3C-like proteinase, Coronavirus | SARS-CoV-2 Viral Protein (nsp5) | | |
| 5 | dynorphin converting enzyme | | | |
| 6 | COG2 protein, human | | | |
| 7 | nonstructural protein 3, SARS coronavirus | SARS-CoV-2 Viral Protein (nsp3) | | |
| 8 | angiotensin converting enzyme 2 | SARS-CoV-2 entry receptors | ACE2 | |
| 9 | COX6A1 protein, human | | | |
| 10 | poly U polymerase | | | |
| 11 | CORIN protein, human | | | |
| 12 | COX8C protein, human | | | |
| 13 | Nsp16 protein, SARS virus | SARS-CoV-2 Viral Protein (nsp16) | | |
| 14 | COX5A protein, human | | | |
| 15 | COQ5 protein, human | | | |
| 16 | GBE1 protein, human | | | |
| 17 | transmembrane serine protease 2, human | | | |
| 18 | sfericase | | | |
| 19 | CPVL protein, human | | | |
| 20 | COX4I1 protein, human | | | |
| 21 | LARS2 protein, human | | | |
| 22 | COX5B protein, human | | | |
| 23 | NARS2 protein, human | SARS-CoV-2 interacting protein | NARS2 | SARS-CoV-2 nsp8 |
| 24 | UL49A protein, Human herpesvirus 2 | | | |
| 25 | COX6B1 protein, human | | | |
| 26 | PARS2 protein, human | | | |
| 27 | hydrogenase maturating endopeptidase HYBD | | | |
| 28 | VARS2 protein, human | | | |
| 29 | human airway trypsin-like protease | | | |
| 30 | ERI1 protein, human | | | |
| 31 | Myxobacter alpha-lytic proteinase | | | |
| 32 | AARS2 protein, human | | | |
| 33 | RARS2 protein, human | | | |
| 34 | Tli polymerase | | | |
| 35 | ADAM29 protein, human | | | |
| 36 | HPN protein, human | | | |
| 37 | O-antigen polymerase | | | |
| 38 | SPEG protein, human | | | |
| 39 | CLPB protein, human | | | |
| 40 | FONG protein, human | | | |
| 41 | ERManI protein, human | | | |
| 42 | PDIK1L protein, human | | | |
| 43 | ALG8 protein, human | SARS-CoV-2 interacting protein | ALG8 | SARS-CoV-2 orf9c |
| 44 | NVL protein, human | | | |
| 45 | HFM1 protein, human | | | |
| 46 | HARS2 protein, human | | | |
| 47 | COASY protein, human | | | |
| 48 | TMPRSS13 protein, human | | | |
| 49 | C1RL protein, human | | | |
| 50 | COX20 protein, human | | | |
| 51 | ECEL1 protein, human | | | |
| 52 | NARFL protein, human | | | |
| 53 | GANAB protein, human | | | |
| 54 | AFG3L2 protein, human | | | |
| 55 | TSEN54 protein, human | | | |
| 56 | ERAL1 protein, human | | | |
| 57 | m-AAA proteases | | | |
| 58 | KY protein, human | | | |
| 59 | TMEM129 protein, human | | | |
| 60 | KEL protein, human | | | |
| 61 | APH1B protein, human | | | |
| 62 | MGME1 protein, human | | | |
| 63 | ATL3 protein, human | | | |
| 64 | oxacillinase | | | |
| 65 | COX10 protein, human | | | |
| 66 | MYORG protein, human | | | |
| 67 | hemagglutinin-protease | | | |
| 68 | Tiki1 protein, human | | | |
| 69 | FIGN protein, human | | | |
| 70 | ATL1 protein, human | | | |
| 71 | RLGP protein, human | | | |
| 72 | FbxL4 protein, human | | | |
| 73 | hemorrhagic metalloproteinase | | | |
| 74 | 3C proteases | | | |
| 75 | HEXB protein, human | | | |
| 76 | GNPTG protein, human | | | |
| 77 | ADAM23 protein, human | | | |
| 78 | NSF protein, human | | | |
| 79 | RNA polymerase SP6 | | | |
| 80 | ADAM22 protein, human | | | |
| 81 | IntS9 protein, human | | | |
| 82 | SERAC1 protein, human | | | |
| 83 | RPL41 protein, human | | | |
| 84 | pokeweed antiviral protein | | | |
| 85 | COX15 protein, human | | | |
| 86 | small cardioactive peptide A | | | |
| 87 | DARS2 protein, human | | | |
| 88 | AGBL5 protein, human | | | |
| 89 | LARGE1 protein, human | | | |
| 90 | COX4I2 protein, human | | | |
| 91 | NHLH1 protein, human | | | |
| 92 | MINDY2 protein, human | | | |
| 93 | DHX29 protein, human | | | |
| 94 | RNA polymerase Esigma(38) | | | |
| 95 | ADAM30 protein, human | | | |
| 96 | DLG2 protein, human | | | |
| 97 | Ric-8b protein, human | | | |
| 98 | UST protein, human | | | |
| 99 | Deep Vent DNA polymerase | | | |
| 100 | PIGL protein, human | | | |
| 143 | EXOSC8 protein, human | SARS-CoV-2 interacting protein | EXOSC8 | SARS-CoV-2 nsp8 |
| 163 | PITRM1 protein, human | SARS-CoV-2 interacting protein | PITRM1 | SARS-CoV-2 M |
| 172 | NGLY1 protein, human | SARS-CoV-2 interacting protein | NGLY1 | SARS-CoV-2 orf8 |
| 177 | ALG11 protein, human | SARS-CoV-2 interacting protein | ALG11 | SARS-CoV-2 nsp4 |
| 281 | TMPRSS2 protein, human | SARS-CoV-2 entry associated proteases | TMPRSS2 | |
| 345 | PCSK6 protein, human | SARS-CoV-2 interacting protein | PCSK6 | SARS-CoV-2 orf8 |
| 352 | MDN1 protein, human | SARS-CoV-2 interacting protein | MDN1 | SARS-CoV-2 orf7a |
| 360 | ERMP1 protein, human | SARS-CoV-2 interacting protein | ERMP1 | SARS-CoV-2 orf9c |
| 384 | QSOX2 protein, human | SARS-CoV-2 interacting protein | QSOX2 | SARS-CoV-2 nsp7 |
| 392 | HectD1 protein, human | SARS-CoV-2 interacting protein | HECTD1 | SARS-CoV-2 nsp8 |
| 497 | USP54 protein, human | SARS-CoV-2 interacting protein | USP54 | SARS-CoV-2 nsp12 |
| 502 | NDUFB9 protein, human | SARS-CoV-2 interacting protein | NDUFB9 | SARS-CoV-2 orf9c |
| 577 | NEU1 protein, human | SARS-CoV-2 interacting protein | NEU1 | SARS-CoV-2 orf8 |
| 664 | PRIM1 protein, human | SARS-CoV-2 interacting protein | PRIM1 | SARS-CoV-2 nsp1 |
| 685 | Cwc27 protein, human | SARS-CoV-2 interacting protein | CWC27 | SARS-CoV-2 E |
| 691 | NDUFAF1 protein, human | SARS-CoV-2 interacting protein | NDUFAF1 | SARS-CoV-2 orf9c |
| 696 | AASS protein, human | SARS-CoV-2 interacting protein | AASS | SARS-CoV-2 M |
| 716 | FKBP10 protein, human | SARS-CoV-2 interacting protein | FKBP10 | SARS-CoV-2 orf8 |
| 740 | ATP6V1A protein, human | SARS-CoV-2 interacting protein | ATP6V1A | SARS-CoV-2 M |
| 756 | Mov10 protein, human | SARS-CoV-2 interacting protein | MOV10 | SARS-CoV-2 N |
| 773 | TCF12 protein, human | SARS-CoV-2 interacting protein | TCF12 | SARS-CoV-2 nsp12 |
| 785 | TBK1 protein, human | SARS-CoV-2 interacting protein | TBK1 | SARS-CoV-2 nsp13 |
| 935 | DDX21 protein, human | SARS-CoV-2 interacting protein | DDX21 | SARS-CoV-2 N |
| 938 | DDX10 protein, human | SARS-CoV-2 interacting protein | DDX10 | SARS-CoV-2 nsp8 |
| 1010 | UPF1 protein, human | SARS-CoV-2 interacting protein | UPF1 | SARS-CoV-2 N |
| 1026 | ACAD9 protein, human | SARS-CoV-2 interacting protein | ACAD9 | SARS-CoV-2 orf9c |
| 1080 | ADAMTS1 protein, human | SARS-CoV-2 interacting protein | ADAMTS1 | SARS-CoV-2 orf8 |
| 1118 | GFER protein, human | SARS-CoV-2 interacting protein | GFER | SARS-CoV-2 nsp10 |
| 1120 | RNF41 protein, human | SARS-CoV-2 interacting protein | RNF41 | SARS-CoV-2 nsp15 |
| 1145 | ADAM9 protein, human | SARS-CoV-2 interacting protein | ADAM9 | SARS-CoV-2 orf8 |
| 1217 | PPT1 protein, human | SARS-CoV-2 interacting protein | PPT1 | SARS-CoV-2 orf10 |
| 1300 | LOX protein, human | SARS-CoV-2 interacting protein | LOX | SARS-CoV-2 orf8 |
| 1450 | MYCBP2 protein, human | SARS-CoV-2 interacting protein | MYCBP2 | SARS-CoV-2 nsp12 |
| 1481 | CTSL protein, human | SARS-CoV-2 entry associated proteases | CTSL | |
| 1483 | CYB5R3 protein, human | SARS-CoV-2 interacting protein | CYB5R3 | SARS-CoV-2 nsp7 |
| 1621 | NEK9 protein, human | SARS-CoV-2 interacting protein | NEK9 | SARS-CoV-2 nsp9 |
| 1693 | COMT protein, human | SARS-CoV-2 interacting protein | COMT | SARS-CoV-2 nsp7 |
| 1709 | MARK3 protein, human | SARS-CoV-2 interacting protein | MARK3 | SARS-CoV-2 orf9b |
| 1798 | HS6ST2 protein, human | SARS-CoV-2 interacting protein | HS6ST2 | SARS-CoV-2 orf8 |
| 1816 | MARK2 protein, human | SARS-CoV-2 interacting protein | MARK2 | SARS-CoV-2 orf9b |
| 1840 | Rab14 protein, human | SARS-CoV-2 interacting protein | RAB14 | SARS-CoV-2 nsp7 |
| 1845 | G3BP1 protein, human | SARS-CoV-2 interacting protein | G3BP1 | SARS-CoV-2 N |
| 1871 | Rab10 protein, human | SARS-CoV-2 interacting protein | RAB10 | SARS-CoV-2 nsp7 |
| 1957 | MARK1 protein, human | SARS-CoV-2 interacting protein | MARK1 | SARS-CoV-2 orf9b |
| 2109 | RAB8A protein, human | SARS-CoV-2 interacting protein | RAB8A | SARS-CoV-2 nsp7 |
| 2279 | USP13 protein, human | SARS-CoV-2 interacting protein | USP13 | SARS-CoV-2 nsp13 |
| 2287 | RAB5C protein, human | SARS-CoV-2 interacting protein | RAB5C | SARS-CoV-2 nsp7 |
| 2346 | PRKACA protein, human | SARS-CoV-2 interacting protein | PRKACA | SARS-CoV-2 nsp13 |
| 2369 | PLAT protein, human | SARS-CoV-2 interacting protein | PLAT | SARS-CoV-2 orf8 |
| 2436 | PTGES2 protein, human | SARS-CoV-2 interacting protein | PTGES2 | SARS-CoV-2 nsp7 |
| 2598 | BRD2 protein, human | SARS-CoV-2 interacting protein | BRD2 | SARS-CoV-2 E |
| 2722 | PLOD2 protein, human | SARS-CoV-2 interacting protein | PLOD2 | SARS-CoV-2 orf8 |
| 2744 | RALA protein, human | SARS-CoV-2 interacting protein | RALA | SARS-CoV-2 nsp7 |
| 2847 | DPP4 protein, human | SARS-CoV-2 entry receptors | DPP4 | |
| 3039 | NSD2 protein, human | SARS-CoV-2 interacting protein | NSD2 | SARS-CoV-2 nsp8 |
| 3204 | MIB1 ligase, human | SARS-CoV-2 interacting protein | MIB1 | SARS-CoV-2 nsp9 |
| 3456 | CTSB protein, human | SARS-CoV-2 entry associated proteases | CTSB | |
| 4470 | RHOA protein, human | SARS-CoV-2 interacting protein | RHOA | SARS-CoV-2 nsp7 |
| 4471 | SIRT5 protein, human | SARS-CoV-2 interacting protein | SIRT5 | SARS-CoV-2 nsp14 |
| 4540 | DNMT1 protein, human | SARS-CoV-2 interacting protein | DNMT1 | SARS-CoV-2 orf8 |
| 4569 | HMOX1 protein, human | SARS-CoV-2 interacting protein | HMOX1 | SARS-CoV-2 orf3a |
| 4663 | IMPDH2 protein, human | SARS-CoV-2 interacting protein | IMPDH2 | SARS-CoV-2 nsp14 |
| 4780 | RIPK1 protein, human | SARS-CoV-2 interacting protein | RIPK1 | SARS-CoV-2 nsp12 |
| 4929 | HDAC2 protein, human | SARS-CoV-2 interacting protein | HDAC2 | SARS-CoV-2 nsp5 |
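Tables 3 and 4 order terms by their closeness to COVID-19 in the embedding space. The sketch below illustrates that kind of ranking under the assumption that closeness is measured by cosine similarity, which is common for word embeddings but is an assumption here. The function names and the two-dimensional toy vectors are hypothetical and stand in for real model vectors.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, vectors, top_n=None):
    """Return (term, similarity) pairs sorted from most to least similar to query."""
    scored = [(term, cosine_similarity(query, vec)) for term, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]  # top_n=None keeps the full ranking


# Toy vectors for illustration only; real embeddings have 100 or more dimensions.
toy_vectors = {
    "term_c": (0.0, 1.0),
    "term_a": (2.0, 0.0),
    "term_b": (1.0, 1.0),
}
query_vector = (1.0, 0.0)  # stands in for the COVID-19 term vector

for rank, (term, score) in enumerate(rank_by_similarity(query_vector, toy_vectors), 1):
    print(rank, term, round(score, 4))
```

The "No." column in Tables 3 and 4 would then correspond to the 1-based position of each term in such a ranking.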
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Yang, H.; Sohn, E. Expanding Our Understanding of COVID-19 from Biomedical Literature Using Word Embedding. Int. J. Environ. Res. Public Health 2021, 18, 3005. https://doi.org/10.3390/ijerph18063005

AMA Style

Yang H, Sohn E. Expanding Our Understanding of COVID-19 from Biomedical Literature Using Word Embedding. International Journal of Environmental Research and Public Health. 2021; 18(6):3005. https://doi.org/10.3390/ijerph18063005

Chicago/Turabian Style

Yang, Heyoung, and Eunsoo Sohn. 2021. "Expanding Our Understanding of COVID-19 from Biomedical Literature Using Word Embedding" International Journal of Environmental Research and Public Health 18, no. 6: 3005. https://doi.org/10.3390/ijerph18063005
