LLM-Powered Natural Language Text Processing for Ontology Enrichment
Abstract
1. Introduction
- An ontological model can be easily expanded and supplemented, including integration with models from other domains. This allows it to evolve as necessary.
- An ontological model can be developed and used collaboratively by different organizations and expert groups. The use of namespaces allows for the division of development into parts corresponding to different knowledge domains.
- The construction of knowledge bases with the capability for logical inference.
- Semantic search.
- The publication of linked open datasets.
2. Literature Review
- Q1. Is it feasible to obtain machine-readable output from processing natural language texts using ChatGPT when the source texts are unadapted web pages?
- Q2. How consistent is the response from the OpenAI API when the query format and source text remain unchanged?
- Q3. Can the result of an OpenAI API query be processed automatically and the derived data uploaded into an ontological model?
3. Architecture of the Information System for the Experiment
- Making HTTP requests to predefined URLs in accordance with the subject domain of the ontology being developed.
- Preparing a query based on the content of the web page obtained in the previous step and the description of the intermediate data representation format.
- Sending the query to the GPT-3.5 model through the OpenAI API and receiving data in the intermediate format (Python data).
- Creating objects and defining their properties in the ontological model using the methods of the OWLready2 library.
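The four steps above can be sketched as a simple loop over the predefined URLs. This is only an illustrative skeleton, not the authors' program: the function name `run_pipeline` and its four callable parameters are hypothetical, and each step is passed in as a function so that the HTTP client, the OpenAI API call, and the ontology library can be swapped or stubbed.

```python
from typing import Any, Callable, Iterable


def run_pipeline(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    build_query: Callable[[str], str],
    ask_llm: Callable[[str], Any],
    load_into_ontology: Callable[[Any], None],
) -> int:
    """Run the four steps for every URL; return the number of pages processed.

    All four callables are hypothetical placeholders for the real steps.
    """
    processed = 0
    for url in urls:
        page_source = fetch(url)            # step 1: HTTP request
        query = build_query(page_source)    # step 2: prepare the query
        data = ask_llm(query)               # step 3: LLM returns intermediate data
        load_into_ontology(data)            # step 4: create objects in the ontology
        processed += 1
    return processed
```

Keeping the steps as separate callables makes it possible to test the control flow without network access, for example by passing `list.append` as the loading step.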
4. Description of the Experiment
4.1. Retrieving Data from the WWW via HTTP Protocol
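A minimal sketch of this step, using only the Python standard library, is shown here. The helper name `fetch_page` and the 30-second timeout are assumptions made for illustration; the page source is returned unmodified, since the experiment works with unadapted web pages.

```python
import urllib.request


def fetch_page(url: str, timeout: float = 30.0) -> str:
    """Download a page and return its source as text, markup left untouched."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```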
4.2. Preparing a Query for LLM ChatGPT 3.5
- Web-page code.
- A general description of the data to be extracted.
- A description of the data presentation format.
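Assembling the three parts listed above might look like the following sketch. The function name `build_query`, the section labels, and the example wording are hypothetical; the actual prompt text used in the experiment is not reproduced here.

```python
def build_query(page_source: str, task_description: str, format_description: str) -> str:
    """Combine the three query components into a single prompt string."""
    parts = [
        "Web-page code:\n" + page_source.strip(),
        "Data to extract:\n" + task_description.strip(),
        "Output format:\n" + format_description.strip(),
    ]
    # Blank lines keep the three components visually separate for the model.
    return "\n\n".join(parts)


# Illustrative usage with made-up inputs.
example_query = build_query(
    "<html><body>Talgar is a town in Almaty Region.</body></html>",
    "Extract countries, regions, settlements and geographical features.",
    "A single Python dict literal with lists of names.",
)
```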
4.3. Executing a Query to the OpenAI API and Receiving Data
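Because the reply is expected in an intermediate Python data format, it has to be turned into Python objects before further processing. The sketch below covers only that parsing step and omits the API call itself. It assumes, for illustration, that the reply is a single Python dict literal, possibly wrapped in a Markdown code fence; `parse_llm_reply` is a hypothetical helper. `ast.literal_eval` is used instead of `eval` so that only literals are accepted.

```python
import ast

FENCE = "`" * 3  # Markdown code-fence marker that models sometimes add


def parse_llm_reply(reply: str) -> dict:
    """Parse a reply that is expected to contain one Python dict literal."""
    text = reply.strip()
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]          # drop the opening fence line
        if lines and lines[-1].strip().startswith(FENCE):
            lines = lines[:-1]                 # drop the closing fence line
        text = "\n".join(lines)
    # literal_eval raises ValueError or SyntaxError on anything but literals.
    data = ast.literal_eval(text)
    if not isinstance(data, dict):
        raise ValueError("expected a dict literal in the model reply")
    return data
```

Rejecting malformed replies at this point keeps unparseable output from reaching the ontological model.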
4.4. Enrichment of the Ontological Model with the Obtained Data
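The mapping from intermediate data to ontology objects can be illustrated without any ontology library. The sketch below is a library-independent stand-in, not the OWLready2 code used in the experiment: it converts a hypothetical intermediate dict into subject-predicate-object triples. The dict keys, the class names, and the predicate names (`locatedInCountry`, `locatedInRegion`) are all assumptions.

```python
def to_triples(data: dict) -> set:
    """Convert hypothetical intermediate data into a set of triples."""
    triples = set()
    for name in data.get("countries", []):
        triples.add((name, "rdf:type", "Country"))
    for region in data.get("regions", []):
        triples.add((region["name"], "rdf:type", "Region"))
        triples.add((region["name"], "locatedInCountry", region["country"]))
    # Settlements and geographical features are both attached to a region.
    for key, cls in (("settlements", "Settlement"), ("features", "GeographicalFeature")):
        for item in data.get(key, []):
            triples.add((item["name"], "rdf:type", cls))
            triples.add((item["name"], "locatedInRegion", item["region"]))
    return triples


# Illustrative input; the structure is assumed, not taken from the paper.
example = {
    "countries": ["Kazakhstan"],
    "regions": [{"name": "Almaty Region", "country": "Kazakhstan"}],
    "settlements": [{"name": "Talgar", "region": "Almaty Region"}],
    "features": [{"name": "Lake Kaindy", "region": "Almaty Region"}],
}
example_triples = to_triples(example)
```

Collecting the triples in a set means that loading the same fact twice has no effect, which is convenient when several pages mention the same object.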
5. Results and Discussion
- The total number of data elements found and included in the result (countries; regions; settlements; geographical features; and the relationships region–country, settlement–region, and object–region).
- The number of irrelevant data elements included in the result, i.e., objects or facts that are not present in the source text and are regarded as “information noise” introduced by the neural network.
- The error rate in processing the page (the ratio of the number of irrelevant elements to the total number of elements).
- The time taken to process the page.
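The error rate defined above is a simple ratio, which can be sketched as follows. The function name `error_rate` is hypothetical; the sample values are those of experiments 1 and 5 in the results table (18 elements with 1 irrelevant, and 42 elements with 7 irrelevant).

```python
def error_rate(total_elements: int, irrelevant_elements: int) -> float:
    """Ratio of irrelevant elements to all elements found on a page."""
    if total_elements <= 0:
        raise ValueError("total_elements must be positive")
    if not 0 <= irrelevant_elements <= total_elements:
        raise ValueError("irrelevant_elements must be between 0 and total_elements")
    return irrelevant_elements / total_elements


# Values taken from the results table, expressed as percentages.
rate_exp1 = round(error_rate(18, 1) * 100, 2)   # 5.56
rate_exp5 = round(error_rate(42, 7) * 100, 2)   # 16.67
```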
- Q1. Is it feasible to obtain machine-readable output from processing natural language texts using ChatGPT when the source texts are unadapted web pages? Answer: Yes, this possibility is confirmed by the experiments conducted and described in this work.
- Q2. How consistent is the response from the OpenAI API when the query format and source text remain unchanged? Answer: The experiments showed that the results obtained with the same query can differ randomly from one another, and there is no reliable way to avoid this.
- Q3. Can the result of an OpenAI API query be processed automatically and the derived data uploaded into an ontological model? Answer: Yes, it is possible, and such automatic processing was implemented in the program code during the research.
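For Q2, one possible way to quantify how much repeated runs of the same query differ is the Jaccard similarity between the sets of extracted elements. This is an illustrative measure only; the text does not state that this metric was used, and the names `jaccard` and `mean_pairwise_jaccard` are hypothetical.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Share of elements common to both runs among all elements seen."""
    if not a and not b:
        return 1.0  # two empty results are treated as identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs: list) -> float:
    """Average Jaccard similarity over all pairs of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("at least two runs are required")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A value of 1.0 means every run returned the same elements; lower values indicate that runs of the unchanged query diverge.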
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest









| Experiment Number | Number of Elements Found | Number of Irrelevant Elements | Error Rate | Processing Time (s) |
|---|---|---|---|---|
| 1 | 18 | 1 | 5.56% | 19 | 
| 2 | 36 | 3 | 8.33% | 22 | 
| 3 | 36 | 5 | 13.89% | 21 | 
| 4 | 18 | 1 | 5.56% | 20 | 
| 5 | 42 | 7 | 16.67% | 23 | 
| 6 | 25 | 0 | 0.00% | 18 | 
| 7 | 16 | 0 | 0.00% | 17 | 
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Mukanova, A.; Milosz, M.; Dauletkaliyeva, A.; Nazyrova, A.; Yelibayeva, G.; Kuzin, D.; Kussepova, L. LLM-Powered Natural Language Text Processing for Ontology Enrichment. Appl. Sci. 2024, 14, 5860. https://doi.org/10.3390/app14135860