Article

Leveraging LLMs for Collaborative Ontology Engineering in Parkinson Disease Monitoring and Alerting

by Georgios Bouchouras 1,*, Dimitrios Doumanas 1, Andreas Soularidis 1, Konstantinos Kotis 1,* and George Vouros 2
1 Intelligent Systems Laboratory, Department of Cultural Technology and Communication, University of the Aegean, 81100 Mytilene, Greece
2 Artificial Intelligence Laboratory, Department of Digital Systems, University of Piraeus, 18534 Piraeus, Greece
* Authors to whom correspondence should be addressed.
AI 2026, 7(4), 139; https://doi.org/10.3390/ai7040139
Submission received: 3 February 2026 / Revised: 14 March 2026 / Accepted: 3 April 2026 / Published: 14 April 2026

Abstract

Ontology engineering plays a critical role in clinical decision support systems for Parkinson’s Disease (PD) monitoring and alerting. While Large Language Models (LLMs) have shown promise in knowledge modeling tasks, their effectiveness in autonomously constructing comprehensive ontologies for complex clinical domains remains unclear. This study investigates four ontology engineering methodologies for PD monitoring and alerting: One-shot (OS) prompting, Decomposed Sequential Prompting (DSP), X-HCOME, and SimX-HCOME+. Multiple LLMs were evaluated across these methodologies. Generated ontologies were assessed against a reference PD ontology using structural evaluation metrics focused on classes and object properties. Expert review was additionally conducted to analyze knowledge extensions beyond the gold standard. LLMs were able to autonomously generate syntactically valid and semantically meaningful ontologies using OS and DSP prompting; however, these ontologies exhibited limited conceptual coverage. Incorporating human expertise through X-HCOME significantly improved ontology completeness and evaluation metrics. Expert review further validated clinically relevant concepts absent from the reference ontology. SimX-HCOME+ demonstrated that iterative, supervised collaboration supports ontology refinement, although challenges persisted in natural language-to-rule formalization. The findings suggest that LLMs are more effective as collaborative assistants than as standalone ontology engineers in the PD domain. Structured human–LLM collaboration is associated with improved ontology coverage and facilitates the identification of potential knowledge extensions in clinical monitoring applications. While the present evaluation focuses primarily on structural ontology elements, the proposed methodologies provide useful insights for LLM-assisted ontology engineering in complex healthcare domains.

1. Introduction

The integration of Large Language Models (LLMs) with ontological frameworks has recently attracted increasing attention in the fields of Knowledge Representation (KR) and Artificial Intelligence (AI) [1]. Recent studies have explored the potential of LLMs to support tasks such as ontology construction, refinement, and alignment—activities traditionally performed and supervised by human experts with extensive domain and ontology engineering expertise [2]. By leveraging large-scale training data, LLMs enable broader and more cost-effective access to expert-level knowledge across multiple domains. Moreover, recent advances suggest that the effectiveness of LLMs in ontology-related tasks can be further enhanced within the context of Neurosymbolic AI, where statistical learning is combined with symbolic representations and reasoning [3,4].
Artificial intelligence plays a particularly important role in addressing complex healthcare challenges, such as the continuous monitoring of patients and the timely alerting of patients and clinicians in the context of Parkinson’s Disease (PD), the second most common neurodegenerative disease worldwide [5]. Despite extensive research efforts, the pathophysiology of PD remains only partially understood, and existing treatments offer limited effectiveness in halting disease progression [6]. To support improved understanding, monitoring, alerting, and treatment strategies, several ontologies have been developed in the PD domain. Among them, the Wear4PDmove ontology [7] was recently introduced to integrate heterogeneous sensor-based movement data and personal health record (PHR) information. This ontology serves as a semantic model that enables interaction among patients, clinicians, smart devices, and health applications by supporting the construction of Personal Health Knowledge Graphs (PHKGs). It facilitates the semantic integration of dynamic data streams from wearable devices and static or historical clinical data, and supports reasoning mechanisms for high-level event recognition in PD monitoring scenarios, such as identifying missed medication doses or patient falls [7]. Such ontologies play a critical role in structuring domain-specific knowledge, enabling interoperability, and supporting data-driven PD monitoring and treatment approaches.
Effective PD monitoring and alerting require flexible and adaptive KR methods capable of accommodating evolving health conditions. While LLMs have demonstrated strong capabilities in processing large volumes of data and generating useful insights, their direct application to structured knowledge modeling tasks remains challenging. Limitations related to consistency, domain specificity, and adherence to formal semantic constraints can restrict their effectiveness in healthcare monitoring and alerting scenarios. PD represents a particularly complex domain, characterized by subtle contextual variations, disease-specific vocabularies, and heterogeneous data sources. Capturing and formalizing such knowledge often requires domain-adapted models or fine-tuning procedures, which are resource-intensive and may be impractical for many healthcare professionals. In parallel, healthcare ontologies must comply with multiple standards and representation formats, further increasing the complexity of ontology development. A central technical challenge lies in integrating and reconciling heterogeneous information sources into coherent and interoperable ontologies. Addressing these challenges necessitates ontology engineering methodologies (OEMs) that streamline development, maintenance, and refinement processes while reducing human effort and preserving semantic quality.
Existing research in ontology engineering has largely emphasized collaboration among human participants, particularly domain experts working together throughout the ontology lifecycle. However, structured and real-time collaboration between humans and intelligent systems, such as LLMs, across varying degrees of participation within an OEM remains relatively underexplored. In particular, the extent of human involvement and the specific roles that LLMs can assume during different ontology engineering phases have not been systematically examined. In practice, ontology engineers often invest substantial time and effort in constructing an initial or “kick-off” ontology, while automated support for subsequent refinement and evolution remains limited. This gap motivates the investigation of collaborative approaches that balance human expertise and machine assistance to improve efficiency and sustainability in ontology engineering.
In this study, we explore different ways in which humans and Large Language Models can work together to build structured knowledge models known as ontologies. We examine various levels of cooperation, ranging from minimal to active human participation. This approach shows how development can move from traditional human-led methods to more automated processes, in which AI handles specific tasks under human guidance. In doing so, we aim to make ontology development both faster and more comprehensive.
To test these ideas, we conducted experiments in the complex domain of Parkinson’s Disease. We took the well-known development methodology HCOME [8] and extended it with AI-driven steps, creating a new approach called X-HCOME. We also developed a simulated version, termed SimX-HCOME+, in which humans and AI interact in a controlled environment. These methods help create ontologies that are thorough yet efficient to develop. The study investigates how these collaborative methods can support the monitoring of, and alerting for, Parkinson’s patients, while also discussing the lessons learned and the current limitations of the technology.
Building on our previous research [9], this study adds several important elements. First, we implement and test the SimX-HCOME+ method to assess how effectively humans and AI interact. We also add the ability to translate simple natural language instructions into formal logic rules using the Semantic Web Rule Language (SWRL). Finally, we compare how well the LLMs perform depending on the amount of assistance they receive from human experts. These additions aim to make the resulting ontologies more complete and practical for real-world medical use.
The remainder of the study is organized as follows. Section 2 reviews related work on the integration of LLMs into ontology engineering methodologies. Section 3 presents the proposed research methodology and associated hypotheses. Section 4 reports the experimental results, while Section 5 provides a comparative evaluation of LLM performance across different collaborative settings, emphasizing the degree of human involvement. Finally, Section 6 discusses the findings, outlines limitations, and concludes the study.

2. Related Work

Recent research has investigated the use of neural language models and, more recently, LLMs to support ontology-related tasks, including ontology learning, mapping, enrichment, and knowledge extraction. Early approaches primarily relied on transformer-based architectures, such as BERT, to extract structured knowledge from unstructured textual sources. For example, Oksanen et al. [10] proposed a method for deriving product ontologies from textual reviews using BERT models, requiring minimal manual annotation and achieving improved precision and recall compared to traditional ontology learning tools such as Text2Onto and COMET. Similarly, He et al. [11] introduced BERTMap, a system for ontology mapping that demonstrated strong performance in unsupervised and semi-supervised settings, outperforming existing ontology mapping systems in entity alignment tasks.
Beyond ontology extraction and mapping, several studies have explored the use of LLMs for knowledge extraction and enrichment. Ning et al. [12] proposed a prompt-based approach for extracting factual knowledge from pre-trained LLMs through subject–relation pairs, highlighting the importance of prompt design and parameter selection in improving extraction quality across domains. Lippolis et al. [13] investigated entity alignment between ArtGraph and Wikidata by combining traditional querying techniques with LLM-based reasoning, demonstrating that LLMs can effectively bridge knowledge gaps in complex and heterogeneous knowledge graphs.
More recent studies have examined the role of LLMs in semi-automatic ontology construction and refinement. Funk et al. [4] studied the capabilities of GPT-3.5 in generating concept hierarchies across multiple domains, showing that LLMs can produce meaningful and appropriately named concepts when integrated in a controlled manner. Their findings emphasize the need for structured and supervised LLM integration, particularly in organizational and enterprise environments. Along similar lines, Mateiu et al. [14] demonstrated the use of GPT-3 to translate natural language descriptions into ontology axioms, facilitating ontology development and lowering the entry barrier for ontology engineering. Biester et al. [15] explored the use of prompt ensembles to improve knowledge base construction, reporting notable improvements in precision, recall, and F1-score when applying LLMs such as ChatGPT and Google Bard.
LLMs have also been investigated in the broader context of knowledge validation, reasoning, and information extraction. Mountantonakis and Tzitzikas [16] proposed an approach for verifying factual information generated by ChatGPT using RDF knowledge graphs, underscoring the importance of validation mechanisms when integrating LLM outputs into structured knowledge systems. Pan et al. [17] examined frameworks that combine LLMs with knowledge graphs to enhance reasoning capabilities by leveraging the complementary strengths of statistical and symbolic AI. In the biomedical domain, Joachimiak et al. [18] employed LLM-based approaches to summarize gene sets, while Caufield et al. [19] introduced the SPIRES approach for structured information extraction without model fine-tuning.
Despite the breadth of these studies, most existing work focuses on the capabilities of LLMs either in isolation or in comparison with traditional automated methods, often emphasizing fully automated or semi-automated ontology engineering tasks. The structured integration of human expertise and LLM capabilities within ontology engineering methodologies (OEMs), particularly across varying degrees of human involvement, remains relatively underexplored. In practice, human experts play a critical role in defining scope, validating concepts, resolving ambiguities, and ensuring semantic coherence—tasks that are difficult to fully automate.
Addressing this gap, the present study focuses on collaborative ontology engineering that explicitly combines human expertise and LLM-based assistance within established OEMs. By examining different levels of human involvement and comparing LLM performance across multiple collaborative configurations, this study aims to provide insights into how human–LLM collaboration can improve efficiency and conceptual coverage in ontology engineering. Furthermore, recognizing that different LLMs exhibit distinct strengths and weaknesses, comparative evaluation across models can inform practitioners’ choices for real-world ontology development and entity resolution tasks [20].

3. Research Methodology

This section presents a series of experiments organized into four distinct phases, focusing on the development and assessment of ontologies for Parkinson’s Disease (PD) monitoring and alerting, with particular emphasis on ontological classes and object properties. The initial phase examines the capability of Large Language Models (LLMs) to autonomously generate an ontology with minimal human involvement, while subsequent phases progressively introduce structured human–LLM collaboration through hybrid ontology engineering methodologies (OEMs). The evaluated LLMs were ChatGPT-3.5 and ChatGPT-4 (OpenAI, San Francisco, CA, USA), Gemini/Bard (Google LLC, Mountain View, CA, USA), Claude (Anthropic, San Francisco, CA, USA), and Llama2 (Meta Platforms, Menlo Park, CA, USA).

3.1. Experiment Overview

Figure 1 illustrates the overall experimental workflow. In Experiment 1, multiple LLMs independently generate an ontology for PD monitoring and alerting using prompt-based approaches with minimal human intervention. Experiments 2 through 4 introduce increasing levels of human involvement by integrating domain experts and ontology engineers within collaborative OEMs. In all experiments, the generated ontologies are compared against a reference ontology. In this study, the Wear4PDmove ontology [7] is used as the gold standard and is referred to as such throughout the study. Generated ontologies were assessed against Wear4PDmove as a reference baseline rather than an exhaustive ground truth using structural evaluation metrics focused on classes and object properties, while data properties, complex axioms, and downstream reasoning behavior were left for future analysis. Expert review was additionally conducted to analyze clinically meaningful knowledge extensions beyond the reference ontology and to reduce circularity in gold-standard-based evaluation.

3.2. Experiment 1: Autonomous Ontology Generation by LLMs

Hypothesis 1.
When prompted with domain-specific requirements, LLMs can autonomously generate an ontology for PD monitoring and alerting with meaningful conceptual coverage.
In Experiment 1, LLMs are tasked with generating a PD monitoring and alerting ontology from scratch, with minimal human involvement. The goal is to assess their ability to extract and structure domain knowledge based solely on prompt-based input.
The experimental steps are as follows:
1. The LLM generates an ontology in Turtle (TTL) format, modeling core aspects of PD monitoring, alerting, patient health records, and healthcare team coordination.
2. The generated ontology is validated for syntactic correctness and logical consistency using ontology engineering tools such as OOPS! [21] (https://oops.linkeddata.es, accessed on 2 April 2026) and Protégé with the Pellet reasoner (https://protege.stanford.edu, accessed on 2 April 2026; Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, USA).
3. The ontology is evaluated against the gold standard using Precision, Recall, and F1-score metrics, focusing on classes and object properties (Table 1).
Evaluation is performed using both exact matching, where generated entities directly correspond to entities in the gold-standard ontology, and similarity matching, where entities are considered correct if they are semantically similar. Exact matching captures strict alignment accuracy, while similarity matching accounts for alternative but semantically appropriate modeling choices. This evaluation framework provides a comparative structural assessment of the generated ontologies rather than a full functional validation.
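The exact- and similarity-matching evaluation described above can be sketched as follows. This is a minimal illustration, not the study's actual evaluation code; the class names and the use of string similarity (via Python's `difflib`) as a stand-in for semantic similarity are illustrative assumptions.

```python
from difflib import SequenceMatcher

def evaluate(generated, gold, similarity_threshold=None):
    """Compare generated ontology entities (e.g., class names) against a
    gold-standard set and return (precision, recall, F1).

    With similarity_threshold=None, only exact (case-insensitive) matches
    count as true positives; otherwise a generated entity also counts if
    its string similarity to any gold entity meets the threshold."""
    gold_norm = {g.lower() for g in gold}

    def matches(entity):
        e = entity.lower()
        if e in gold_norm:
            return True
        if similarity_threshold is not None:
            return any(SequenceMatcher(None, e, g).ratio() >= similarity_threshold
                       for g in gold_norm)
        return False

    tp = sum(1 for e in generated if matches(e))
    precision = tp / len(generated) if generated else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: a generated class set against a small, hypothetical gold excerpt.
gold = {"Patient", "WearableSensor", "MovementData", "Alert"}
generated = {"Patient", "Wearable_Sensor", "Tremor"}
print(evaluate(generated, gold))                            # exact matching
print(evaluate(generated, gold, similarity_threshold=0.8))  # similarity matching
```

In this toy example, similarity matching credits "Wearable_Sensor" as a match for "WearableSensor", which exact matching would count as a false positive; this is precisely the distinction the two evaluation modes are meant to capture.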

3.3. Prompting Strategies

Two prompt-based strategies are employed in Experiment 1.
One-shot prompting (OS): In this setting, the LLM is provided with a single prompt that specifies the role, aim, and scope of the ontology without examples or iterative refinement. The objective is to assess the model’s ability to generate a meaningful ontology based on a standalone, domain-specific description.
An example one-shot prompt is shown below:
“Act as an Ontology Engineer. Generate an ontology for Parkinson’s Disease monitoring and alerting. The ontology should model movement data collected via wearable sensors, capture disease severity and activities of daily living, and support semantic annotation for interoperability. Reuse relevant ontologies related to neurodegenerative diseases. Provide the output in TTL format.”
Decomposed Sequential Prompting (DSP): This strategy decomposes the ontology generation task into two sequential prompts. The first prompt defines the role, aim, and scope of the ontology, while the second prompt focuses on modeling specific PD-related aspects and output formatting. This decomposition supports a structured progression of the task, allowing the LLM to incrementally build the ontology.
Prompt 1 (Conceptualization):
“Act as an Ontology Engineer. Your task is to conceptualize an ontology for Parkinson’s Disease (PD) monitoring and alerting. Define the core entities and relationships required to model movement data from wearable sensors, disease severity, and activities of daily living. Focus on ensuring the ontology supports semantic annotation and interoperability with existing neurodegenerative disease models. List the key classes and properties that should be included without providing code.”
Prompt 2 (Implementation):
“Based on the previously defined scope and classes, implement the formal ontology. Model the specific PD-related aspects in detail, ensuring that the relationships between sensor data, symptoms, and alerts are clinically accurate. Reuse relevant concepts from established medical ontologies where appropriate. Provide the final output in Turtle (TTL) format.”
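The two-stage DSP flow can be expressed as a small orchestration sketch. The client function `ask` is a hypothetical placeholder for whichever LLM chat API is used (it is not part of the study); the stub below only illustrates how the conceptualization reply is carried forward as context for the implementation prompt.

```python
def decomposed_sequential_prompting(ask, conceptualization_prompt, implementation_prompt):
    """Run the two DSP prompts in sequence. `ask(prompt, history)` is any
    chat-capable LLM client returning a text reply; the first reply is
    passed back as conversational history for the second prompt."""
    concept_reply = ask(conceptualization_prompt, history=[])
    history = [(conceptualization_prompt, concept_reply)]
    return ask(implementation_prompt, history=history)

# Usage with a stub client that echoes, purely to show the data flow.
def stub_ask(prompt, history):
    return f"reply[{len(history)} prior turns] to: {prompt[:20]}"

ttl = decomposed_sequential_prompting(
    stub_ask,
    "Act as an Ontology Engineer. Conceptualize an ontology for PD monitoring ...",
    "Based on the previously defined scope, implement the ontology in TTL ...")
print(ttl)
```

In a real run, the second reply would be the TTL serialization; the key design point is that decomposition lets the model commit to a conceptual scope before producing formal output.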

3.4. Experiment 2: Human–LLM Collaborative Ontology Engineering (X-HCOME)

Hypothesis 2.
The integration of human expertise with LLM capabilities enhances the conceptual coverage and quality of the developed ontology.
Experiment 2 introduces X-HCOME, an extension of the Human-Centered Ontology Engineering methodology (HCOME) [8], which incorporates LLM-based tasks into the ontology engineering lifecycle. The methodology alternates between human-driven and LLM-driven tasks, leveraging the complementary strengths of both.
The X-HCOME process consists of the following steps:
1. Human: Define ontology scope, requirements, and competency questions; provide domain-specific data and prompts.
2. LLM: Generate an initial domain ontology.
3. Human: Compare the generated ontology with the gold standard using manual inspection or ontology matching tools (e.g., LogMap [22]).
4. LLM: Perform machine-assisted comparison against the gold standard.
5. Human: Revise and refine the ontology by integrating validated concepts.
6. LLM: Repeat automated evaluation.
7. Human: Perform final consistency and validity checks using ontology engineering tools.

3.5. Experiment 3: Expert Review of False Positives

Hypothesis 3.
Through expert review, LLM-generated entities initially classified as false positives may reveal relevant domain knowledge absent from the gold-standard ontology.
In Experiment 3, domain experts analyze false positives produced in X-HCOME to determine whether these entities represent valid PD-related knowledge not previously included in the gold-standard ontology. Reclassification was based on explicit criteria: clinical relevance to PD monitoring or alerting, semantic compatibility with the ontology scope, non-redundancy with existing concepts, and potential usefulness for downstream representation or reasoning.

3.6. Experiment 4: Simulated Human–LLM Collaboration (SimX-HCOME+)

Hypothesis 4.
Simulated collaboration between human experts and LLMs enables effective ontology engineering by allowing LLMs to lead development tasks under structured human supervision.
Experiment 4 introduces SimX-HCOME+, a simulated extension of X-HCOME. This methodology models iterative collaboration among three roles: Knowledge Worker (KW), Domain Expert (DE), and Knowledge Engineer (KE). These roles are simulated through structured conversational interactions, during which the LLM generates and refines ontologies at each iteration. More specifically, each simulation cycle consisted of: (i) a KW phase, in which scope, goals, and use-case descriptions were stated; (ii) a DE phase, in which domain constraints, missing concepts, and conceptual corrections were introduced; and (iii) a KE phase, in which the ontology structure was reformulated into classes, relations, and formal representations. Human supervision was operationalized as approval, rejection, or refinement prompts at the end of each cycle. The simulation stopped when no substantial new concepts or corrections were introduced in the subsequent turn. Human supervision is applied to validate scope, resolve ambiguities, and approve refinements rather than directly authoring ontological content.
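The KW–DE–KE cycle and its stopping criterion can be sketched as a simple simulation loop. This skeleton tracks only concept-name sets, a deliberate simplification of the full conversational simulation; the role names follow the study, but `run_phase` and the concept pool are hypothetical stand-ins.

```python
def simulate_simx_hcome(run_phase, max_cycles=10):
    """Iterate KW -> DE -> KE phases until a full cycle contributes no new
    content (the stopping criterion described in the text).
    run_phase(role, ontology) returns an updated set of concept names."""
    ontology = set()
    for cycle in range(1, max_cycles + 1):
        before = set(ontology)
        for role in ("KnowledgeWorker", "DomainExpert", "KnowledgeEngineer"):
            ontology = run_phase(role, ontology)
        if ontology == before:  # no new concepts or corrections: stop
            return ontology, cycle
    return ontology, max_cycles

# Stub phases: each role contributes from a fixed, illustrative pool.
pool = {
    "KnowledgeWorker": {"Patient", "Alert"},
    "DomainExpert": {"Bradykinesia", "MedicationIntake"},
    "KnowledgeEngineer": {"hasObservation"},
}
def stub_phase(role, ontology):
    return ontology | pool[role]

final, cycles = simulate_simx_hcome(stub_phase)
print(sorted(final), cycles)
```

With these stubs, the first cycle adds all five concepts and the second cycle adds nothing, so the simulation terminates after two cycles; in the actual methodology, each phase is a structured conversational turn with human approval, rejection, or refinement at the end of the cycle.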
In addition, this experiment evaluates the ability of LLMs to transform rules expressed in natural language into Semantic Web Rule Language (SWRL). The evaluated rule concerns identifying a missing medication dose event based on observations of bradykinesia following medication intake. The SWRL evaluation considered syntactic validity and partial logical alignment with an expected target rule. However, the present study did not perform a full downstream assessment of rule utility, such as reasoner-triggered inference, rule firing correctness, or end-to-end alert generation performance.
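A partial-logical-alignment check of the kind described above can be approximated by comparing the atom sets of the generated and target SWRL rules. The atom vocabulary below (e.g., `MissedDoseEvent`, `occursAfter`) is hypothetical and does not reproduce the study's actual rule; the score is simply the fraction of expected atoms the model reproduced.

```python
def atom_overlap(generated_rule, target_rule):
    """Score partial logical alignment between two SWRL rules, each given
    as a set of atom strings (predicate plus variable arguments).
    Returns the fraction of target atoms present in the generated rule."""
    matched = generated_rule & target_rule
    return len(matched) / len(target_rule)

# Illustrative target rule for the missed-dose scenario (atom names are
# assumptions, not the study's actual vocabulary):
target = {
    "Patient(?p)", "MedicationIntake(?m)", "hasIntake(?p,?m)",
    "Observation(?o)", "Bradykinesia(?o)", "hasObservation(?p,?o)",
    "occursAfter(?o,?m)", "MissedDoseEvent(?e)",
}
# A syntactically valid but logically incomplete generated rule:
generated = {
    "Patient(?p)", "MedicationIntake(?m)", "hasIntake(?p,?m)",
    "Bradykinesia(?o)", "hasObservation(?p,?o)",
}
print(f"{atom_overlap(generated, target):.2f}")
```

Such a set-overlap score captures only structural alignment; it says nothing about whether the rule would actually fire correctly under a reasoner, which is exactly the downstream assessment the text notes was not performed.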

4. Results

4.1. Experiments 1 and 2 (OS, DSP, and X-HCOME)

All generated ontologies were examined for syntactic correctness and logical consistency. Ontologies produced by ChatGPT-3.5, ChatGPT-4, and Bard/Gemini were syntactically valid and logically consistent across all prompting strategies and collaborative methodologies. In contrast, ontologies generated by Llama2 (OS, DSP, and X-HCOME) exhibited syntactic errors and inconsistent class definitions, preventing reliable validation and reuse. A manual inspection of these outputs indicated recurring issues such as malformed OWL/RDF serialization, incomplete class declarations, and unstable reuse of ontology identifiers. Because the present study did not include a dedicated prompt-engineering ablation for repairing these outputs, Llama2 was retained in the comparison as evidence of model instability but was not considered reusable for the downstream ontology workflow.
Table 2 summarizes the comparative results for ontological classes across OS, DSP, and X-HCOME. Overall, one-shot and decomposed prompting strategies (OS and DSP) resulted in limited conceptual coverage, reflected in consistently low recall values across all evaluated LLMs. Decomposed prompting generally improved precision compared to one-shot prompting but often at the expense of identifying fewer classes, indicating more conservative ontology generation behavior.
In contrast, the X-HCOME methodology consistently increased the number of identified classes and improved recall across all evaluated LLMs. Among the examined models, Bard/Gemini under X-HCOME achieved the highest class coverage and recall, while ChatGPT-3.5 and ChatGPT-4 also demonstrated notable improvements compared to their OS and DSP counterparts. These results indicate that structured human–LLM collaboration supports broader conceptual exploration and more effective refinement in ontology engineering tasks.
With respect to object properties, performance remained uniformly low across all methodologies, with F1 scores ranging between 0% and 12%. This result indicates that LLMs were substantially more successful in generating taxonomic content (classes) than in modeling relational structure. A likely explanation is that object properties require explicit identification of domain–range constraints, relation directionality, and context-sensitive semantic dependencies, which are harder to infer from prompts than class labels. Thus, the main bottleneck in autonomous ontology construction was not concept naming, but relation modeling. Detailed results for object properties are provided in the available GitHub repository (https://github.com/GiorgosBouh/Ontologies_by_LLMs, accessed on 2 April 2026).

4.2. Experiment 3: Expert Review of X-HCOME Results

Experiment 3 examined the impact of expert review on X-HCOME-generated ontologies by analyzing false positives identified during automated evaluation. Results after expert reassessment are presented in Table 3.
For both ChatGPT-3.5 and ChatGPT-4, expert review substantially improved precision, recall, and F1 scores compared to OS and DSP approaches. This improvement reflects the ability of domain experts to validate and reinterpret LLM-generated entities that were initially classified as false positives due to strict alignment with the gold-standard ontology.
Notably, Bard/Gemini under X-HCOME achieved recall and F1 values exceeding 100% following expert review. This behavior arises from the reclassification of several LLM-generated entities that were initially treated as false positives but were subsequently validated by domain experts as clinically relevant concepts. As a result, the number of expert-validated true positives exceeds the number of corresponding entities in the gold-standard ontology, leading to negative false negatives and recall values above 100%. Importantly, these values should not be interpreted as superior metric performance, but rather as an indication that the generated ontology extends beyond the conceptual coverage of the reference model. Representative examples include classes such as surgical intervention, rigidity, and cognitive impairment, which enhance the ontology’s applicability for PD monitoring and alerting. This outcome highlights a known limitation of gold-standard-based evaluation approaches when applied to knowledge discovery and ontology extension tasks.
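The arithmetic behind recall values above 100% is straightforward and worth making explicit. The sketch below uses illustrative counts, not the study's actual figures: expert review moves entities from the false-positive to the true-positive column, while recall is still computed against the original gold-standard size.

```python
def expert_adjusted_metrics(tp, fp, gold_size, reclassified):
    """Recompute precision and recall after expert review validates
    `reclassified` of the false positives as clinically relevant.
    Recall is taken against the original gold-standard size, so it can
    exceed 1.0 when validated entities outnumber the gold concepts."""
    tp_adj = tp + reclassified
    fp_adj = fp - reclassified
    precision = tp_adj / (tp_adj + fp_adj)
    recall = tp_adj / gold_size
    return precision, recall

# Illustrative numbers (assumptions, not the study's counts): 18 strict
# matches plus 12 apparent false positives, 9 of which experts validate,
# against a 24-class gold standard.
p, r = expert_adjusted_metrics(tp=18, fp=12, gold_size=24, reclassified=9)
print(p, r)  # recall = 27/24 = 1.125, i.e., above 100%
```

As the text stresses, such a recall value signals that the generated ontology extends beyond the reference model's coverage, not that the model "outperformed" the gold standard.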
For object properties, expert review also led to substantial performance improvements, with F1 scores ranging from 6% to 84% across LLMs. Due to space limitations, detailed object property results are provided in the accompanying GitHub repository https://github.com/GiorgosBouh/Ontologies_by_LLMs (accessed on 2 April 2026).

4.3. Experiment 4: SimX-HCOME+

SimX-HCOME+ was evaluated based on ontology reusability, syntactic correctness, logical consistency, and compatibility with Protégé using the Pellet reasoner. Ontologies generated by ChatGPT-3.5, ChatGPT-4, and Claude were syntactically valid, logically consistent, and reusable. Gemini-generated ontologies, while editable in Protégé, exhibited syntactic errors that limited their consistency.
Quantitative evaluation results for ontological classes are presented in Table 4. The results reveal moderate performance across all evaluated LLMs, with Gemini achieving the highest F1 score, followed by ChatGPT-3.5. These findings indicate that simulated collaborative environments enable LLMs to incrementally refine ontologies, although performance remains below that observed in expert-reviewed X-HCOME.
Regarding natural language to SWRL rule transformation, all LLMs except Gemini generated syntactically valid SWRL expressions. However, logical completeness remained limited, with only a subset of the expected atoms correctly identified. Among the evaluated models, Claude demonstrated relatively better logical alignment, as shown in Table 5. Ultimately, while LLMs demonstrate potential in supporting rule formalization, reliable NL-to-SWRL transformation remains a significant challenge. The current evaluation provides a foundational structural assessment of the generated SWRL rules; however, it should not be construed as a definitive validation of their logical correctness or efficacy in downstream reasoning tasks.

5. Levels of Human Involvement Across Different Methodological Approaches in Ontology Engineering

The methodological approaches investigated in this study correspond to distinct levels of human–machine collaboration, forming a spectrum from human-centered to LLM-centered ontology engineering. To support comparative analysis, the degree of human involvement was mapped to a five-level scale, reflecting increasing LLM participation and decreasing direct human intervention (Table 6). This assignment, while heuristic, enables consistent interpretation of results across methodologies.
Figure 2 illustrates the relationship between the highest observed F1 scores and the degree of human involvement. Approaches incorporating higher levels of expert participation, particularly Expert Review X-HCOME, achieved the highest performance. SimX-HCOME+ demonstrated intermediate performance with moderate human involvement, while OS and DSP approaches exhibited lower F1 scores corresponding to minimal human engagement. This pattern indicates an empirical association between structured human involvement and ontology quality in the present experiments, but it should not be interpreted as a controlled causal ablation of individual collaborative interventions. Overall, the results indicate a positive association between structured human involvement and ontology quality in the PD domain.
The experimental results partially support the first hypothesis, which posited that LLMs can autonomously develop an ontology for PD monitoring and alerting when provided with domain-specific input, such as aim, scope, requirements, competency questions, and relevant data. While all evaluated LLMs demonstrated the ability to generate syntactically valid and semantically meaningful ontological structures, the resulting ontologies exhibited limited conceptual coverage when compared to the reference model. These findings indicate that, although LLMs can initiate ontology development, autonomous generation alone is insufficient to achieve the level of completeness required for complex clinical domains such as PD monitoring.
The second hypothesis, which stated that the integration of human expertise with LLM capabilities improves ontology quality and coverage, is strongly supported by the results. The X-HCOME methodology consistently led to increased conceptual coverage and improved evaluation metrics across all evaluated LLMs. This outcome highlights the value of structured human–LLM collaboration, in which human experts guide scope definition, validate concepts, and resolve ambiguities that are difficult for LLMs to address independently.
Results related to the third hypothesis further demonstrate the importance of expert involvement. Expert review of X-HCOME-generated ontologies substantially reduced false positives and, in several cases, validated LLM-generated concepts that were not present in the gold-standard ontology. This process not only improved conventional evaluation metrics but also revealed clinically relevant knowledge extensions, illustrating the potential of LLMs to contribute to ontology evolution when guided by domain expertise.
The fourth hypothesis, examined through the SimX-HCOME+ methodology, provides additional evidence for the effectiveness of collaborative ontology engineering. Simulated interaction among knowledge workers, domain experts, and knowledge engineers enabled iterative refinement of ontologies and yielded moderate improvements in conceptual quality compared to fully autonomous approaches. However, the transformation of natural language rules into SWRL representations remained a challenging task. Although most LLMs generated syntactically valid SWRL expressions, logical completeness was limited, indicating that formal rule generation remains an open research challenge.
In general, the findings suggest that increasing levels of structured human participation are associated with improvements in ontology quality, particularly in terms of conceptual coverage and clinical relevance. Rather than replacing human expertise, LLMs function most effectively as collaborative assistants within well-defined ontology engineering workflows.
Several limitations should be considered when interpreting the findings of this study. First, the evaluation focuses primarily on ontological classes and object properties, providing a structural comparison rather than a full functional assessment that would also consider data properties, axioms, and reasoning performance in downstream tasks. In addition, the persistently low object-property scores indicate a fundamental weakness in relation-level ontology modeling that was only partially mitigated by collaboration. Second, the evaluation depends on a single reference ontology, Wear4PDmove. While this model provides a robust baseline, it cannot be considered an exhaustive representation of the highly complex PD domain, and relying exclusively on it introduces a risk of methodological circularity: when an LLM generates novel, clinically valid domain extensions, a strict automated comparison penalizes these valuable additions as false positives simply because they fall outside the predefined reference boundaries. To mitigate this limitation and prevent valid knowledge from being discarded, expert reassessment was systematically integrated into the evaluation workflow. This expert validation recovered clinically relevant concepts that the automated baseline had rejected, confirming that many presumed errors were in fact legitimate evolutionary steps for the ontology. This demonstrates that gold-standard benchmarking, while necessary for initial structural comparisons, is insufficient for ontology evolution tasks, where the objective is to expand knowledge rather than merely replicate a static model.
Third, the collaborative methodologies rely on the participation of human experts, which, despite improving semantic validity and clinical relevance, can introduce subjective bias related to individual expertise and experience. Moreover, formal inter-annotator agreement was not measured in the current study, which may introduce subjectivity into the reassessment of false positives. Fourth, no formal significance testing was performed; differences across models and methodologies should therefore be interpreted descriptively rather than inferentially, especially given the small number of evaluated ontology outputs. Finally, the LLMs evaluated in this study do not correspond to the most recent model versions available at the time of writing. While newer models may exhibit improved standalone performance, the findings of this study aim to highlight methodological insights that are not tied to specific model generations and should be interpreted accordingly.
Future studies may explore the application of the proposed methodologies across additional healthcare domains and investigate their integration into real-world clinical decision support systems. Further research could also examine the development of domain-adapted language models or specialized LLM configurations tailored specifically for ontology engineering tasks, building upon the collaborative workflows introduced in this study.

6. Conclusions

This study explored how Large Language Models (LLMs) and human experts can collaborate to build knowledge systems for Parkinson’s Disease monitoring. While LLMs can readily draft an initial ontological framework autonomously, they perform considerably better when guided by structured, stepwise human input. The collaborative methodologies introduced here, X-HCOME and SimX-HCOME+, captured a wider range of relevant concepts and supported more efficient ontology construction than fully autonomous generation. At the same time, the study highlights clear limits of current LLMs: they still struggle to model complex relationships between concepts and to formalize strict logical rules, and they are often unfairly penalized when evaluated against rigid, static reference ontologies. Ultimately, the findings suggest that progress depends not only on more capable models, but also on better evaluation methods, clearer quality standards for ontologies, and more transparent workflows for human–AI collaboration.

Author Contributions

Conceptualization, G.B. and K.K.; methodology, G.B., D.D. and K.K.; software, G.B. and D.D.; validation, G.B., D.D. and A.S.; formal analysis, G.B., D.D. and A.S.; investigation, G.B., D.D. and A.S.; resources, K.K. and G.V.; data curation, G.B. and D.D.; writing—original draft preparation, G.B., D.D., A.S., K.K. and G.V.; writing—review and editing, G.B.; visualization, G.B. and K.K.; supervision, K.K. and G.V.; project administration, K.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study, including the prompts used for LLMs, the generated ontological artifacts (TTL files), and the detailed evaluation metrics, are openly available in the GitHub repository at https://github.com/GiorgosBouh/Ontologies_by_LLMs (accessed on 2 April 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
BERT: Bidirectional Encoder Representations from Transformers
DSP: Decomposed Sequential Prompting
FN: False Negatives
FP: False Positives
GNN: Graph Neural Network
HCOME: Human-Centered Ontology Engineering Methodology
IoT: Internet of Things
KG: Knowledge Graph
LLM: Large Language Model
LM-KBC: Language Model Knowledge Base Construction
NLP: Natural Language Processing
OEM: Ontology Engineering Methodology
OOPS!: Ontology Pitfall Scanner
OS: One-Shot Prompting
OWL: Web Ontology Language
PaLM: Pathways Language Model
PD: Parkinson’s Disease
PDON: Parkinson’s Disease Ontology
PHKGs: Personal Health Knowledge Graphs
PHR: Personal Health Record
RDF: Resource Description Framework
SPARQL: SPARQL Protocol and RDF Query Language
SPIRES: Structured Prompt Interrogation and Recursive Extraction of Semantics
TP: True Positives
TTL: Turtle Serialization Format

Figure 1. Flowchart illustrating the four-phase experimental process for ontology construction and evaluation using different ontology engineering methodologies (created with AI-Whimsical ChatGPT-4, https://openai.com/chatgpt (accessed on 1 March 2024), OpenAI, San Francisco, CA, USA).
Figure 2. Relationship between the highest observed F1 scores and the degree of human involvement across different ontology engineering methodologies in the PD domain. The x-axis represents the evaluated methodologies (DSP Gemini, OS ChatGPT-4, X-HCOME Bard/Gemini, Expert Review X-HCOME Bard/Gemini, and SimX-HCOME+ Gemini). The left y-axis shows the F1 score, while the right y-axis represents the level of human involvement on a scale from 1 (minimum) to 5 (maximum).
Table 1. Summary of metrics for class evaluation. This table presents the formulas for Precision, Recall, and the F1-score, along with their definitions.
Formulas | Definitions
Precision = TP / (TP + FP) | True Positives (TP): classes correctly classified as positive in alignment with the gold-standard ontology (human judgment or alignment tool).
Recall = TP / (TP + FN) | False Positives (FP): classes incorrectly classified as positive in alignment with the gold-standard ontology.
F1 = 2 · (Precision · Recall) / (Precision + Recall) | False Negatives (FN): classes incorrectly classified as negative despite being positive in the gold-standard ontology.
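As a minimal illustration of how these class-level metrics are computed, the following Python sketch compares a generated class set against a gold-standard class set. The function name and the example class names are illustrative, not taken from the evaluated ontologies.

```python
def class_metrics(generated, gold):
    """Compute Precision, Recall, and F1 for a generated set of
    ontology classes against a gold-standard class set."""
    generated, gold = set(generated), set(gold)
    tp = len(generated & gold)   # classes matching the gold standard
    fp = len(generated - gold)   # generated classes absent from it
    fn = len(gold - generated)   # gold classes the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Class names are matched here by exact string equality; in practice, alignment against the gold standard relied on human judgment or an alignment tool, as noted in Table 1.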
Table 2. Comparative evaluation of methodologies used for ontology creation against the gold-standard ontology.
Method | Number of Classes | True Positives | False Positives | False Negatives | Precision | Recall | F1 Score
Gold ontology | 41 | — | — | — | — | — | —
ChatGPT3.5 DSP | 3 | 2 | 1 | 39 | 67% | 5% | 9%
ChatGPT3.5 OS | 5 | 2 | 3 | 39 | 40% | 5% | 9%
ChatGPT3.5 X-HCOME | 25 | 10 | 15 | 31 | 40% | 24% | 30%
ChatGPT4 DSP | 6 | 4 | 2 | 37 | 67% | 10% | 17%
ChatGPT4 OS | 9 | 5 | 4 | 36 | 56% | 12% | 20%
ChatGPT4 X-HCOME | 33 | 10 | 23 | 31 | 30% | 24% | 27%
Bard/Gemini DSP | 8 | 5 | 3 | 36 | 63% | 12% | 20%
Bard/Gemini OS | 13 | 1 | 12 | 40 | 8% | 2% | 4%
Bard/Gemini X-HCOME | 50 | 19 | 31 | 22 | 38% | 46% | 42%
Llama2 DSP | 3 | 3 | 0 | 38 | 100% | 7% | 14%
Llama2 OS | 2 | 2 | 0 | 39 | 100% | 5% | 9%
Llama2 X-HCOME | 32 | 4 | 28 | 37 | 13% | 10% | 11%
Table 3. Comparative evaluation of ontology creation methods after expert review of false positives. To account for knowledge extension, metrics are reinterpreted as Extended Recall and Extended F-1. Values exceeding 100% indicate the model yielded more clinically valid concepts than were present in the baseline reference ontology (Wear4PDmove).
Method | Number of Classes | Valid Concepts | Strict FP | False Negatives | Precision | Extended Recall * | Extended F1 *
Gold ontology | 41 | — | — | — | — | — | —
ChatGPT3.5 DSP | 3 | 2 | 1 | 39 | 67% | 5% | 9%
ChatGPT3.5 OS | 5 | 2 | 3 | 39 | 40% | 5% | 9%
ChatGPT3.5 X-HCOME | 25 | 23 | 2 | 18 | 92% | 56% | 70%
ChatGPT4 DSP | 6 | 4 | 2 | 37 | 67% | 10% | 17%
ChatGPT4 OS | 9 | 5 | 4 | 36 | 56% | 12% | 20%
ChatGPT4 X-HCOME | 33 | 29 | 4 | 12 | 88% | 71% | 78%
Bard/Gemini DSP | 8 | 5 | 3 | 36 | 63% | 12% | 20%
Bard/Gemini OS | 13 | 1 | 12 | 40 | 8% | 2% | 4%
Bard/Gemini X-HCOME | 50 | 50 | 0 | 0 * | 100% | 122% * | 110% *
Llama2 DSP | 3 | 3 | 0 | 38 | 100% | 7% | 14%
Llama2 OS | 2 | 2 | 0 | 39 | 100% | 5% | 9%
Llama2 X-HCOME | 32 | 26 | 6 | 15 | 81% | 63% | 71%
* Extended Recall and resulting Extended F-1 scores exceeding 100% reflect the inclusion of expert-validated novel concepts that successfully extend the conceptual coverage beyond the original 41-class gold standard. Consequently, False Negatives are logically bounded at 0, replacing the mathematical anomaly of negative values that occurs in rigid standard calculations.
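The extended scoring described in this footnote can be sketched in Python as follows. The helper name `extended_metrics` is illustrative: expert-validated concepts (true positives plus expert-approved novel concepts) count toward recall, so Extended Recall may exceed 100%, and False Negatives are clamped at zero rather than allowed to go negative.

```python
def extended_metrics(valid_concepts, strict_fp, gold_size):
    """Extended evaluation after expert review of false positives.

    valid_concepts: gold-standard matches plus expert-validated
                    novel concepts; strict_fp: concepts rejected
                    by experts; gold_size: classes in the reference
                    ontology."""
    # Clamp FN at 0: validated concepts beyond the gold standard
    # would otherwise produce a negative false-negative count.
    fn = max(gold_size - valid_concepts, 0)
    total = valid_concepts + strict_fp
    precision = valid_concepts / total if total else 0.0
    ext_recall = valid_concepts / gold_size   # may exceed 1.0
    ext_f1 = (2 * precision * ext_recall / (precision + ext_recall)
              if precision + ext_recall else 0.0)
    return precision, ext_recall, ext_f1, fn
```

Applying this to the Bard/Gemini X-HCOME row (50 valid concepts, 0 strict false positives, 41 gold classes) reproduces the 100% precision, 122% Extended Recall, and 110% Extended F1 reported in Table 3.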
Table 4. Evaluation metrics on SimX-HCOME+ generated ontologies in the PD domain (classes).
Method | Number of Classes | True Positives | False Positives | False Negatives | Precision | Recall | F1 Score
Gold ontology | 41 | — | — | — | — | — | —
ChatGPT-4 | 17 | 9 | 8 | 32 | 52% | 21% | 31%
ChatGPT-3.5 | 21 | 14 | 7 | 27 | 66% | 34% | 45%
Gemini | 22 | 15 | 7 | 26 | 68% | 36% | 48%
Claude | 24 | 12 | 12 | 29 | 50% | 29% | 37%
Table 5. Evaluation metrics on SimX-HCOME+ generated ontologies in the PD domain (NL2SWRL).
MethodAtomsTP SCTP LCFP SCFP LCFN SCFN LCPrec SCPrec LCRec SCRec LCF1 LC
Gold ontology8
ChatGPT-413031310850%23%0%27%13%
ChatGPT-3.517131614755%17%12.5%3%11%
Gemini00000000%0%0%0%0%
Claude1205127830%41.6%0%28.4%20%
Table 6. Comparison of Methodological Approaches and Human Involvement Levels.
Methodological Approach | OS | DSP | SimX-HCOME+ | X-HCOME | Expert Review X-HCOME
Level of Human Involvement | 1 | 2 | 3 | 4 | 5