KnowVID-19: A Knowledge-Based System to Extract Targeted COVID-19 Information from Online Medical Repositories
Abstract
1. Introduction
2. Description of the Approach
2.1. KnowVID-19—Knowledge Base
2.1.1. Data Source and Collection
- (2019-ncov OR coronavirus OR coronavirus disease 2019 OR COVID-19 OR novel coronavirus OR novel coronavirus pneumonia OR SARS-CoV-2) AND (2019 [Date—Publication]: 2022 [Date—Publication]) AND (English [Language])
- “2019 NCOV” [all fields] or “COVID-19 nucleic acid testing” [all fields] or “COVID-19 nucleic acid testing” [MeSH term] or “COVID-19 serology testing” [all fields] or “COVID-19 Serological testing” [MeSH term] or “COVID-19 serotherapy” [all fields] or “COVID-19 testing” [all fields] or “COVID-19 testing” [MeSH term] or “COVID-19 vaccine” [all fields] or “COVID-19 vaccine” [MeSH term] or “COVID-19” [all fields] or “COVID-19” [MeSH term] or “NCOV” [all fields] or “SARS-CoV-2” [all fields] or “SARS-CoV-2” [MeSH term] or “Severe Acute Respiratory Syndrome Coronavirus 2” [all fields]
- Topic: vaccination, drugs, non-drugs, intervention, and mental health
2.1.2. Web Crawling
- The web crawler employs incremental crawling, allowing the system to detect and retrieve only newly added or updated content from previously indexed sources. This approach minimizes redundant data retrieval, which significantly reduces the processing load and enables the system to stay up-to-date more efficiently.
- The crawler incorporates an automated scheduling mechanism through a Python script that performs regular database refreshes, ensuring that new publications are integrated into KnowVID-19 without requiring manual intervention.
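The incremental step described above can be sketched as follows. This is a minimal illustration, not the authors' crawler: the `Record` shape, the `select_changed` and `refresh` helpers, and the use of a last-modified date to detect updates are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A fetched publication record (hypothetical shape)."""
    pmid: str
    last_modified: str  # ISO date string, e.g. "2024-01-25"


def select_changed(fetched, index):
    """Return records that are new or updated relative to the stored index.

    `index` maps pmid -> last_modified value seen on the previous run.
    """
    changed = []
    for rec in fetched:
        previous = index.get(rec.pmid)
        # ISO dates in the same format compare correctly as plain strings.
        if previous is None or rec.last_modified > previous:
            changed.append(rec)
    return changed


def refresh(fetched, index):
    """One refresh run: keep only changed records, then update the index."""
    changed = select_changed(fetched, index)
    for rec in changed:
        index[rec.pmid] = rec.last_modified
    return changed


index = {"7591699": "2022-03-01"}
batch = [
    Record("7591699", "2022-03-01"),  # unchanged since the last run, skipped
    Record("8704728", "2022-05-10"),  # not yet indexed, retrieved
]
print([r.pmid for r in refresh(batch, index)])  # ['8704728']
```

How `refresh` is triggered on a schedule (a cron entry, a long-running loop, or another mechanism) is left open here, since the text only states that a Python script performs the regular refreshes.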
2.2. Inference Engine (Rules of Engine)
2.2.1. Data Cleaning and Normalization
2.2.2. Data Transformation
2.2.3. Text Classification
2.2.4. Text Classification Based on Manual Keywords Generation
- Topics: Broad categories related to a domain, such as Vaccination, Pandemics, Diagnostics, and Drugs (e.g., Vaccination in Figure 5).
- Subtopics: Specific subdomains within a topic; for instance, the subtopics under Vaccination include RNA vaccines, DNA vaccines, and recombinant vaccines.
- Keywords: Detailed discussions within subtopics, such as the production, delivery, and clinical trials of RNA vaccines (e.g., Vaccination > RNA Vaccine > Production, delivery, clinical trial).
- Subkeywords: Further scientific breakdowns of the keywords, providing a granular discussion, such as clinical trial phases (e.g., Vaccination > RNA Vaccine > Clinical Trial > Phase I, Phase II, Phase III).
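One way to picture the four levels is as a nested mapping that can be walked to produce paths such as the ones quoted in the list. The sketch below is only illustrative: the `HIERARCHY` fragment reuses the examples from the text, and `iter_paths` and `find` are hypothetical helper names, not part of KnowVID-19.

```python
# Hypothetical fragment of the hierarchy: topic > subtopic > keyword > subkeywords.
HIERARCHY = {
    "Vaccination": {
        "RNA Vaccine": {
            "Production": [],
            "Delivery": [],
            "Clinical Trial": ["Phase I", "Phase II", "Phase III"],
        },
        "DNA Vaccine": {},
        "Recombinant Vaccine": {},
    },
}


def iter_paths(tree, prefix=()):
    """Yield every path from a topic down to its deepest defined entry."""
    if isinstance(tree, dict):
        if not tree:
            yield prefix  # subtopic without keywords yet
        for name, child in tree.items():
            yield from iter_paths(child, prefix + (name,))
    else:
        if not tree:
            yield prefix  # keyword without subkeywords
        for leaf in tree:
            yield prefix + (leaf,)


def find(term):
    """Return all paths in which one level equals `term` (case-insensitive)."""
    wanted = term.lower()
    return [p for p in iter_paths(HIERARCHY) if wanted in (x.lower() for x in p)]


for path in find("Phase III"):
    print(" > ".join(path))  # Vaccination > RNA Vaccine > Clinical Trial > Phase III
```

Matching on whole level names rather than substrings keeps a search for "Phase I" from also returning "Phase II" and "Phase III".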
2.2.5. Automatic Text Classification (ML Based)/Keywords Generation
- Trial Pattern: Trial types reported in publications, such as clinical trials or randomized clinical trials.
- Quantity Pattern: Quantities used for different scientific processes, like vaccination.
- Participant Pattern: Number of participants participating in biomedical research.
- Age Pattern: Age range of participants participating in biomedical research.
- Main Keywords: Important keywords and topics, like vaccination.
- Vaccination Type: Vaccine types, like mRNA vaccine (mRNA-1273).
- Phase Keywords: Phases describing the stage of the vaccination or study (e.g., phase 1).
- duplicated or similar keywords (example: randomized trials/randomized trials).
- the same keyword written in different forms (example: controlled clinical trial/control clinical trial).
- meaningless or technical and scientific terms (example: yoga trial/young trial).
- grammar errors or unwanted symbols in the extracted keywords.
- total number of keywords in the text, paragraph, or document.
- importance of the keyword related to the content of the document.
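The patterns and the clean-up issues listed above can be illustrated with a small rule-based sketch. It uses plain regular expressions as a stand-in and does not reproduce the ML-based extraction of this section; the label names, the `PATTERNS` table, and the `normalize` and `extract` helpers are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical regular-expression stand-ins for some of the listed patterns.
PATTERNS = {
    "TRIAL_TYPE": re.compile(
        r"\b(?:(?:randomi[sz]ed|controlled|clinical|double-blind|open-label"
        r"|placebo-controlled)\s+)+trials?\b",
        re.IGNORECASE,
    ),
    "AGE": re.compile(r"\b\d{1,3}\s*(?:to|-)\s*\d{1,3}\s+years?\b", re.IGNORECASE),
    "TRIAL_PHASE": re.compile(r"\bphase\s+(?:[1-4]|I{1,3}|IV)\b", re.IGNORECASE),
    "PARTICIPANTS": re.compile(
        r"\b\d[\d,]*\s+(?:participants|volunteers|patients)\b", re.IGNORECASE
    ),
    "QUANTITY": re.compile(r"\b\d+(?:\.\d+)?\s*(?:ug|µg|mg)\b", re.IGNORECASE),
}


def normalize(keyword):
    """Merge near-duplicate keywords: case, stray symbols, spelling, plural."""
    kw = keyword.lower()
    kw = re.sub(r"[^a-z0-9µ,.\- ]+", " ", kw)  # drop unwanted symbols
    kw = re.sub(r"\s+", " ", kw).strip()       # collapse whitespace
    kw = kw.replace("randomised", "randomized")
    kw = re.sub(r"\btrials\b", "trial", kw)
    return kw


def extract(text):
    """Return label -> Counter of normalized keywords found in `text`."""
    return {
        label: Counter(normalize(m.group(0)) for m in pattern.finditer(text))
        for label, pattern in PATTERNS.items()
    }


sample = (
    "In this Randomized  clinical trial, 30,000 volunteers aged 18 to 55 years "
    "received 100 ug of mRNA-1273. Randomised clinical trials in phase 3 are ongoing."
)
for label, counts in extract(sample).items():
    print(label, dict(counts))
```

Simple normalization of this kind only merges spelling and plural variants. Pairs such as controlled clinical trial/control clinical trial, or meaningless matches such as yoga trial, would still need lemmatization, a curated mapping, or manual review.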
2.2.6. Text Classification Based on Term Frequency and Document Frequency
2.3. Interface
2.3.1. Network Generation
2.3.2. Visualization
2.3.3. Application
3. Results
3.1. Search Results
3.2. Filtering and Sorting Mechanisms
3.3. Network Visualization
4. Discussion
4.1. Functionalities
- It offers a quick and comprehensive KBS of COVID-19 medical information and helps keep track of the latest research and findings on COVID-19.
- The COVID-19 dataset is represented in a detailed table format that includes publication titles, authors’ names and their institutions, paper sections, and annotated references.
- It allows the users to easily search for specific information related to COVID-19 without going to the manuscript.
- The system can be used to generate specialized queries and search results.
- It can extract keywords from publications, articles, and other biomedical sources and analyze which keywords these sources use most frequently.
- It provides real-time monitoring of publications, articles, and other sources.
- By combining semantics and grammar, the dataset, and the context of the words, information can be extracted and analyzed.
- It saves storage and computing resources by working with a reduced set of words instead of the whole text.
4.2. Limitations
Proposed Improvements
4.3. Example Scenarios Demonstrating the Capacity of KnowVID-19
- Interactive Node-Based Navigation: Using KnowVID-19’s online interface, researchers can click a node labeled “mRNA Vaccines” to drill down into subnodes such as “Pfizer-BioNTech” and “Moderna”. Because the data are arranged hierarchically, large datasets become easier to browse.
- Instant Access to Papers: Selecting one of these nodes instantly returns a curated list of papers from PubMed. This saves researchers a great deal of time and effort, since they no longer have to sort manually through thousands of publications.
- Focused Information: The system groups pertinent information into categories such as “symptoms”.
- Keyword Search and Visualization: To locate certain information in the dataset, researchers can make use of the keyword search tool.
- Trial Quantities: Researchers can also see the quantities of each component used in the vaccine trials, providing detailed insights into dosage usage and formulation specifics.
- Age Range of Participants: The system includes data on the age range of participants in the vaccine trials, allowing researchers to understand the demographics of the study groups.
- Symptom and Trial Nodes: When a user clicks on a node that represents a symptom, such as “fatigue”, “brain fog”, or “shortness of breath”, linked trials are displayed right away. This methodical technique facilitates finding the needed information quickly.
- Detailed Visualizations: By using network graphs to show the relationships between trials and symptoms, it is simpler to determine which trials mentioned particular symptoms. These graphic aids assist in deciphering intricate linkages and spotting trends in the frequency of symptoms among several trials.
- Keyword Search and Visualization: Using a keyword, medical practitioners can look for certain symptoms or trials.
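The symptom and trial scenario above maps naturally onto a node and edge list. The sketch below builds such a list in the element format that Cytoscape.js accepts (objects with a `data` field holding `id`, and `source`/`target` for edges). The `LINKS` data, the `kind` attribute, and both helper functions are hypothetical and only illustrate the idea; they are not taken from KnowVID-19.

```python
import json

# Hypothetical symptom -> trial links; the trial ids are placeholders, not real data.
LINKS = {
    "fatigue": ["trial-A", "trial-B"],
    "brain fog": ["trial-B"],
    "shortness of breath": ["trial-C"],
}


def build_elements(links):
    """Build a Cytoscape.js-style element list (nodes first, then edges)."""
    nodes, edges = {}, []
    for symptom, trials in links.items():
        nodes[symptom] = {"data": {"id": symptom, "kind": "symptom"}}
        for trial in trials:
            nodes.setdefault(trial, {"data": {"id": trial, "kind": "trial"}})
            edges.append(
                {"data": {"id": f"{symptom}->{trial}", "source": symptom, "target": trial}}
            )
    return list(nodes.values()) + edges


def neighbors(elements, node_id):
    """Return the ids linked to `node_id`, as a click handler might list them."""
    linked = []
    for element in elements:
        data = element["data"]
        if data.get("source") == node_id:
            linked.append(data["target"])
        elif data.get("target") == node_id:
            linked.append(data["source"])
    return sorted(linked)


elements = build_elements(LINKS)
print(neighbors(elements, "fatigue"))  # ['trial-A', 'trial-B']
print(json.dumps(elements[0]))
```

Serializing the list with `json.dumps` gives a payload that a web front end could pass to a graph widget; looking up neighbors on the server side is one option, and the same lookup could equally be done in the browser.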
4.4. Evaluation vis-à-vis Established Paradigms
5. Conclusions and Future Work
6. Future Plans
- Implementation of a feature to read figures and pictures for information extraction. This feature will enable the system to extract relevant information from visual data, such as images and graphs, which will further enhance its capabilities.
- Enhancement of data visualization capabilities to provide more comprehensive and user-friendly representations. This will enable researchers and medical professionals to better understand and interpret the extracted information.
- Integration of advanced ML algorithms to enhance the system’s predictive analytics and decision-making capabilities. This will enable the system to provide more accurate and reliable predictions and recommendations.
- Implementation of real-time collaboration features, facilitating seamless information sharing and collaborative analysis among users. This will enable researchers and medical professionals to work together more effectively and efficiently.
- Integration of fake news detection models to improve the quality and reliability of extracted information by filtering out misleading or unreliable content. This will enable the system to provide more accurate and reliable information.
- Exploration of the use of large language models (LLMs). These models have the potential to improve both the accuracy and contextual understanding of extracted information, making them ideal for handling diverse and complex data sources.
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
(A) Document Frequency (left three columns: Trial Type; right three columns: Age Group)

Trial Type Keywords | ID_List | Count | Age Group Keywords | ID_List | Count |
---|---|---|---|---|---|
clinical trial | [8830978, 7749639, et al.] | 2050 | 18 to 55 years | [7472384, 8313090, et al.] | 70 |
controlled trial | [7235585, 8382475, et al.] | 1107 | 18 to 59 years | [7472384, 8630786, et al.] | 39 |
randomized controlled trial | [8382475, 7460877, et al.] | 674 | 18 to 65 years | [8423936, 8839300, et al.] | 24 |
randomized clinical trial | [7235585, 7573513, et al.] | 373 | 12 to 17 years | [8604800, 9350282, et al.] | 22 |
randomized trial | [7573513, 7460877, et al.] | 352 | 12 to 15 years | [8461570, 8711308, et al.] | 21 |
placebo-controlled trial | [7885317, 8813065, et al.] | 248 | 18 to 64 years | [8084611, 8585490, et al.] | 20 |
controlled clinical trial | [7683586, 7527945, et al.] | 196 | 65 to 85 years | [7706592, 8695521, et al.] | 18 |
ongoing clinical trial | [7683586, 7906827, et al.] | 133 | 18 to 60 years | [8014753, 7362821, et al.] | 17 |
double-blind trial | [7527945, 8813065, et al.] | 125 | 18 to 80 years | [8872486, 8871718, et al.] | 17 |
open-label trial | [7445008, 7263255, et al.] | 121 | 18 to 55 years | [8395838, 7821985, et al.] | 13 |

(B) Term Frequency (left three columns: number of topics per publication; right four columns: number of words per topic for one publication)

ID | Number_Topic | Number_Keywords | ID | Topic | Keywords | Counts |
---|---|---|---|---|---|---|
7591699 | 7 | 87 | 7591699 | AGE | 10 to 15 year | 2 |
8704728 | 7 | 62 | 7591699 | GENDER | men | 2 |
7706592 | 7 | 61 | 7591699 | GENDER | women | 1 |
7824305 | 7 | 49 | 7591699 | PARTCIPANTS | 30,000 volunteer | 3 |
9062866 | 7 | 46 | 7591699 | QUANTITY_MG | 100 ug | 5 |
7990482 | 7 | 42 | 7591699 | TRIAL_PHASE | phase 1 | 15 |
7583697 | 7 | 41 | 7591699 | TRIAL_PHASE | phase 3 | 20 |
8482810 | 7 | 41 | 7591699 | TRIAL_TYPE | clinical trial | 35 |
9106357 | 7 | 41 | 7591699 | TRIAL_TYPE | human clinical trial | 3 |
9127699 | 7 | 36 | 7591699 | VACCINE_NAMES | azd1222 | 18 |
8776284 | 7 | 35 | 7591699 | VACCINE_NAMES | mrna-1273 | 3 |

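The two kinds of counts tabulated above can be computed along the following lines. This is a sketch under stated assumptions: the input shape (publication ID mapped to a list of topic and keyword pairs), the function names, and the reading of "Count" in part (A) as the number of publications containing a keyword are all assumptions, and the sample data are invented for illustration.

```python
from collections import Counter, defaultdict


def document_frequency(extracted, topic):
    """Rows of (keyword, sorted publication ids, number of publications)."""
    ids = defaultdict(set)
    for pub_id, pairs in extracted.items():
        for pair_topic, keyword in pairs:
            if pair_topic == topic:
                ids[keyword].add(pub_id)
    rows = [(kw, sorted(found), len(found)) for kw, found in ids.items()]
    # Most widely used keywords first; ties broken alphabetically.
    return sorted(rows, key=lambda row: (-row[2], row[0]))


def term_frequency(extracted, pub_id):
    """Counter of (topic, keyword) occurrences within one publication."""
    return Counter(extracted[pub_id])


# Invented sample input: two publications with their extracted keywords.
extracted = {
    "P1": [
        ("TRIAL_TYPE", "clinical trial"),
        ("TRIAL_TYPE", "clinical trial"),
        ("AGE", "18 to 55 years"),
    ],
    "P2": [
        ("TRIAL_TYPE", "clinical trial"),
        ("TRIAL_TYPE", "randomized trial"),
    ],
}

for row in document_frequency(extracted, "TRIAL_TYPE"):
    print(row)
print(term_frequency(extracted, "P1"))
```

Document frequency counts each publication once per keyword, however often the keyword occurs in it, whereas term frequency counts every occurrence inside a single publication.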
Feature | KnowVID-19 | CovidPubGraph | KG-COVID-19 |
---|---|---|---|
User-Centric Design | Tailored information retrieval for specific research needs. | Broad dataset integration without a focus on individual user queries. | Framework for producing customizable KGs for various COVID-19 applications. |
Adaptability | Easily customizable to various scientific topics. | Structured for COVID-19 publications with emphasis on dataset interoperability. | Flexible framework for integrating biomedical data. |
Integrated Tools | Leverages multiple Python libraries for robust processing. | Focuses on named entity recognition and linking. | Uses KGX for graph manipulation and emphasizes data summarization. |
Contextual Visualization | Visual representation of linkages between information. | Data linkage and interoperability without emphasis on visual context. | Supports hypothesis-based querying and visualization. |
Precision in Information Extraction | Offers refined search outputs for highly specific datasets or insights. | Broad overview and linkage of publications, not tailored for specific queries. | Supports complex queries for drug repurposing and disease understanding. |
Dynamic Adaptation | Capable of adapting to new information sources and formats. | Updated regularly but with less emphasis on dynamic content adaptation. | Regular updates incorporate new and updated data. |
Data Visualization | Utilizes Cytoscape for intuitive network graphs. | Provides a comprehensive RDF knowledge graph without a specific focus on visual presentation. | Less emphasis on direct visual representation. |
Data Sources | PubMed, PMC | CORD-19 | Various biomedical sources, less specific. |
Data Usage | Real-time data extraction and updating. | Regularly updated but less dynamic. | Supports regular updates, but less dynamic adaptation. |
Date Extracted From | PubMed and PMC, up-to-date as of 2024 | CORD-19, varies by update schedule. | Various sources, date varies by updates. |
Focused Data Access | Immediate access to highly curated and categorized data. | Broad overview requiring more manual effort for specific data extraction. | Extensive data but less focused categorization. |
Age Filter in the Network | Yes, allows categorizing by age range of trial participants. | No specific age range filter available. | No specific age range filter available. |
Trial Phases Categorization | Yes, allows categorizing by trial phases (e.g., Phase III). | No specific trial phase categorization. | No specific trial phase categorization. |
Quantity in Trials | Yes, provides details on quantities used in trials. | No detailed quantity information available. | No detailed quantity information available. |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Aziz, M.; Popa, I.; Zia, A.; Fischer, A.; Khan, S.A.; Hamedani, A.F.; Asif, A.R. KnowVID-19: A Knowledge-Based System to Extract Targeted COVID-19 Information from Online Medical Repositories. Biomolecules 2024, 14, 1411. https://doi.org/10.3390/biom14111411