Article

LLMs in Action: Robust Metrics for Evaluating Automated Ontology Annotation Systems

Ali Noori 1, Pratik Devkota 2, Somya D. Mohanty 3 and Prashanti Manda 4,*
1 Informatics and Analytics, University of North Carolina, Greensboro, NC 27412, USA
2 Fractal Analytics, New York, NY 10006, USA
3 United Health Group, Minnetonka, MN 55343, USA
4 Department of Computer Science, University of Nebraska, Omaha, NE 68182, USA
* Author to whom correspondence should be addressed.
Information 2025, 16(3), 225; https://doi.org/10.3390/info16030225
Submission received: 6 January 2025 / Revised: 27 February 2025 / Accepted: 27 February 2025 / Published: 14 March 2025
(This article belongs to the Special Issue Biomedical Natural Language Processing and Text Mining)

Abstract

Ontologies are critical for organizing and interpreting complex domain-specific knowledge, with applications in data integration, functional prediction, and knowledge discovery. As the manual curation of ontology annotations becomes increasingly infeasible due to the exponential growth of biomedical and genomic data, natural language processing (NLP)-based systems have emerged as scalable alternatives. Evaluating these systems requires robust semantic similarity metrics that account for hierarchical and partially correct relationships often present in ontology annotations. This study explores the integration of graph-based and language-based embeddings to enhance the performance of semantic similarity metrics. Combining embeddings generated via Node2Vec and large language models (LLMs) with traditional semantic similarity metrics, we demonstrate that hybrid approaches effectively capture both structural and semantic relationships within ontologies. Our results show that combined similarity metrics outperform individual metrics, achieving high accuracy in distinguishing child–parent pairs from random pairs. This work underscores the importance of robust semantic similarity metrics for evaluating and optimizing NLP-based ontology annotation systems. Future research should explore the real-time integration of these metrics and advanced neural architectures to further enhance scalability and accuracy, advancing ontology-driven analyses in biomedical research and beyond.

1. Introduction

Ontologies play a critical role in biology, biomedicine, and other fields by providing a structured framework for organizing, standardizing, and sharing knowledge [1]. In biology and biomedicine, ontologies enable researchers to represent complex biological systems and relationships consistently, which facilitates interoperability among databases, tools, and studies [2]. For example, Gene Ontology (GO) provides a unified vocabulary for describing gene and protein functions across species, enabling seamless integration of diverse datasets [3]. Ontologies also improve data annotation and retrieval, making it easier for researchers to perform meta-analyses, identify patterns, and generate hypotheses. This consistency is especially important in biomedical research, where clear, standardized definitions of diseases, phenotypes, and molecular pathways are essential for precision medicine and translational research [4,5].
Gene Ontology (GO) and other biological ontologies are widely used in scientific analyses to standardize the annotation of biological data, enabling consistent interpretation across studies [3]. GO annotations describe genes and proteins in terms of their molecular functions, biological processes, and cellular components, facilitating cross-species comparisons and functional predictions [3,6]. These annotations are instrumental in large-scale genomic and transcriptomic studies, where they help identify enriched biological processes or pathways in datasets [7]. By leveraging these structured annotations, researchers can generate testable hypotheses, predict gene functions, and better understand complex biological systems.
Ontology annotations were historically generated manually through expert curation, where researchers reviewed primary literature and assigned GO terms to genes or proteins based on experimental evidence [8]. This labor-intensive process ensured high-quality annotations but was not scalable, given the rapid growth of biological data [8,9]. As new genomic and proteomic data are generated at an unprecedented rate, manual curation alone cannot keep pace, creating significant annotation gaps [10,11]. To address this challenge, natural language processing (NLP) approaches are increasingly employed to automate the extraction of information from scientific texts and assign GO annotations [12]. These methods leverage machine learning algorithms to identify relevant biological entities and relationships, enabling the semi-automated generation of annotations at a much larger scale. By combining NLP with manual curation for validation, researchers can enhance the throughput and coverage of GO annotations while maintaining quality standards, thereby accelerating the annotation of newly discovered genes and proteins [13,14].
Automated ontology annotation using NLP has become a critical area of research to address the scalability challenges posed by the manual curation of large biomedical datasets [15]. NLP techniques enable the extraction of biological entities, relationships, and functional descriptions directly from scientific literature and other textual data sources. A significant contribution to this field has been made by Manda et al., who developed ontology-driven methods for text mining to identify meaningful biological terms and map them to appropriate ontology classes [15]. Their work emphasized leveraging semantic similarity measures and machine learning algorithms to improve the precision of annotations. Similarly, Devkota et al. advanced automated annotation by integrating statistical NLP techniques with domain-specific knowledge, enhancing the ability to accurately extract and categorize biological concepts [16,17,18]. These efforts demonstrate how NLP-driven tools can support high-throughput annotation, bridging the gap between the volume of biomedical data and the resources required for manual annotation.
Recent advances in deep learning have further refined NLP-based ontology annotation methods, with models capable of understanding context and semantics at a granular level. Manda and her collaborators explored neural architectures to enhance the quality of automated annotations [15].
Advances in deep learning and pre-trained language models like BERT and GPT have further revolutionized automated ontology annotation [19,20]. These models can understand complex linguistic structures and capture the semantic context of biomedical terms with high precision [21]. By incorporating domain-specific embeddings and fine-tuning with curated datasets, NLP tools are now capable of generating high-quality annotations that rival manual efforts. Additionally, hybrid approaches that combine NLP with structured databases and semantic similarity measures enhance the accuracy and relevance of annotations. For example, an NLP model might prioritize certain ontology terms based on their similarity to existing annotations in a given dataset [22].
Traditional information retrieval metrics such as precision and recall are insufficient for evaluating ontology-based NLP systems because these systems often exhibit partial accuracy, reflecting the hierarchical and semantic nature of ontologies. Unlike simple keyword matching, ontology-based annotations involve complex relationships in which a predicted term may be partially correct by being semantically related to the ground truth. For instance, in Gene Ontology, predicting a parent or child term of the correct annotation still provides meaningful biological insights, but traditional metrics would treat it as entirely incorrect. This limitation necessitates the use of semantic similarity metrics specifically designed for ontologies, which account for the structured relationships between terms. These metrics not only capture the quality of partial matches but also help prioritize biologically relevant annotations, making them indispensable for assessing and improving ontology-driven NLP systems [23].
Semantic similarity metrics for Gene Ontology (GO) are widely used to evaluate the functional similarity of genes or proteins by considering the hierarchical structure and relationships among GO terms. Commonly used metrics include Resnik similarity, which calculates similarity based on the information content (IC) of the most informative common ancestor (MICA) of two terms, providing a robust measure of shared specificity [24]. Lin similarity extends this by normalizing the IC of the MICA relative to the combined IC of the compared terms, offering a measure of how closely related the terms are in their specific context [25]. Jiang–Conrath similarity further refines this by incorporating the IC difference between the terms and their MICA [26]. Additionally, Wang’s semantic similarity metric incorporates both the hierarchical structure and the edge weights between terms, emphasizing their semantic contributions within the GO graph [27]. These metrics are crucial for tasks such as clustering genes with similar functions, validating gene annotations, and measuring functional enrichment, making them integral tools in computational biology.
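For reference, the standard formulations of these information-content (IC)-based measures, with p(t) denoting the probability of observing term t (or any of its descendants) in an annotation corpus, are as follows:

```latex
\mathrm{IC}(t) = -\log p(t)
\mathrm{sim}_{\mathrm{Resnik}}(t_1, t_2) = \mathrm{IC}\!\left(\mathrm{MICA}(t_1, t_2)\right)
\mathrm{sim}_{\mathrm{Lin}}(t_1, t_2) = \frac{2\,\mathrm{IC}(\mathrm{MICA}(t_1, t_2))}{\mathrm{IC}(t_1) + \mathrm{IC}(t_2)}
\mathrm{dist}_{\mathrm{JC}}(t_1, t_2) = \mathrm{IC}(t_1) + \mathrm{IC}(t_2) - 2\,\mathrm{IC}(\mathrm{MICA}(t_1, t_2))
```

Note that the Jiang–Conrath measure is a distance, so smaller values indicate greater similarity.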
Traditional semantic similarity metrics, such as Resnik's, Lin's, Jiang–Conrath's, and Wang's methods, have been instrumental in assessing functional similarities between genes or proteins within the Gene Ontology (GO) framework. However, a study by Manda and Vision [28] highlights significant concerns regarding the statistical robustness of these metrics. The authors conducted a comprehensive analysis, revealing that many commonly used semantic similarity measures exhibit sensitivity to variations in annotation depth and ontology structure, leading to inconsistencies in similarity assessments. This variability can result in unreliable comparisons, particularly when dealing with genes or proteins annotated at different levels of specificity within the GO hierarchy. The study underscores the necessity of developing more statistically robust methods that can account for these variations, ensuring consistent and reliable similarity evaluations across diverse biological datasets.
The evaluation of new NLP systems for ontology annotation requires more robust semantic similarity methods to ensure reliability and accuracy in assessing the systems’ performance. Existing metrics, such as Resnik’s, Lin’s, and Wang’s methods, while widely used, often lack statistical robustness, as they are sensitive to annotation depth, ontology structure, and the variability of term specificity. As highlighted by Manda and Vision [28], these limitations can lead to inconsistent similarity assessments, particularly when comparing annotations across heterogeneous datasets or systems. For NLP systems generating partially accurate annotations, current metrics may fail to capture meaningful biological or semantic relationships between terms, thus undervaluing the system’s contributions. Robust similarity measures that incorporate statistical rigor and account for ontology-specific characteristics are essential for accurately quantifying the degree of correctness and relevance in annotations. Such metrics would not only provide a fair evaluation framework for NLP systems but also guide their iterative improvement, fostering advancements in automated ontology annotation.
Ontology node embeddings are vector representations of concepts within an ontology, capturing both their semantic meanings and structural relationships. By embedding ontology nodes into a continuous vector space, these representations facilitate the computation of similarity measures between concepts, enabling the assessment of varying degrees of similarity. In their study, Devkota et al. [17] demonstrated that incorporating ontology embeddings into deep learning architectures enhances the prediction of ontology concepts from the literature. Their approach involved projecting the ontology into a graph, implementing strategies to traverse the ontology graph, creating a corpus of sentences based on these traversals, and generating embeddings from that corpus. This methodology allowed the model to learn both textual patterns and the ontology’s structure, resulting in improved annotation performance. By leveraging ontology node embeddings, researchers can more effectively capture the nuanced relationships between concepts, leading to more accurate and semantically informed analyses.
Large language models (LLMs) can generate embeddings for ontology nodes by leveraging their ability to understand and represent semantic and contextual relationships within text. These models, such as GPT or BERT, are pre-trained on vast corpora and can encode linguistic nuances into dense, fixed-size vectors. For ontology applications, embeddings can be generated by inputting textual definitions, descriptions, or relational information associated with ontology nodes. For example, the textual definitions of Gene Ontology (GO) terms or their hierarchical relationships can be fed into an LLM to produce embeddings that capture both the semantic meaning of the term and its context within the ontology. Additionally, LLMs can encode complex interactions between nodes when provided with structured prompts describing relationships, such as “is-a” or “part-of”. These embeddings can then be used to measure similarity, cluster related concepts, or improve downstream tasks like ontology annotation or enrichment analysis. The advantage of using LLMs lies in their ability to generalize across domains, making them versatile tools for generating high-quality, semantically rich representations of ontology nodes.
Our hypothesis was that combining traditional similarity metrics with similarity derived from vector embeddings would yield superior performance in predicting relationships between nodes, because this hybrid approach leverages the strengths of both methods. Traditional metrics are explicitly designed to capture the structural relationships within ontologies, such as parent–child, ancestor–descendant, or sibling relationships, while graph-based embeddings encode these topological connections into a continuous vector space; both excel at preserving the integrity of the ontology's graph-like structure, making them particularly effective for tasks where the relational hierarchy is paramount. Language-based embeddings, on the other hand, encapsulate rich semantic information derived from the textual definitions, descriptions, and contextual usage of nodes, offering a nuanced understanding of their meaning.
By integrating these two types of similarity, the combined representation can capture both the structural nuances and the semantic depth of ontology nodes. This holistic representation is likely to improve the model’s ability to predict relationships between nodes, particularly in cases where structural proximity does not fully explain the semantic relationships or vice versa. For example, two nodes may be semantically similar but located far apart in the hierarchy due to domain-specific ontology design. In such cases, LLM embeddings can complement hierarchical embeddings by providing additional context. Conversely, for nodes that are structurally proximate but lack clear semantic connections, hierarchical embeddings can add relational context that LLM embeddings might overlook. The combination of graph-based and language-based embeddings offers a promising approach to capturing both structural and semantic information, enabling more accurate node similarity analysis.
The goal of this paper is to explore and enhance the evaluation and application of automated ontology annotation systems driven by NLP. Specifically, it aims to develop and validate robust semantic similarity metrics that can effectively capture both structural and semantic relationships within ontologies. By integrating traditional similarity metrics with language-based embeddings from LLMs, this work seeks to create hybrid approaches that improve the accuracy, scalability, and reliability of ontology annotations.

2. Materials and Methods

2.1. Creating Ontology Embeddings

For Node2Vec-based embeddings, we represented the Gene Ontology (GO) graph as a structured network, where nodes correspond to ontology terms, and edges represent hierarchical relationships such as "is-a" or "part-of" connections. To capture both local and global structural contexts, we employed biased random walks controlled by hyperparameters P (return parameter) and Q (in-out parameter). These hyperparameters were tuned systematically to balance the exploration–exploitation tradeoff: a low value of P encourages the walk to return to previously visited nodes, emphasizing local structure within the ontology, while a low value of Q biases the walk toward more distant nodes, capturing broader relationships between ontology terms.
We generated embedded vectors for the nodes in the Gene Ontology (GO) graph using the Node2Vec algorithm. Node2Vec creates ontology embeddings by representing the ontology as a graph, where each concept or entity is a node, and relationships (e.g., "is-a", "part-of") are edges; this graph structure captures the semantic and hierarchical connections within the ontology, which the algorithm uses to learn vector representations for nodes within a continuous vector space. To learn embeddings, Node2Vec performs random walks from each node to generate sequences of neighboring nodes, similar to sentences in natural language processing, where co-occurrence reflects semantic similarity. These random walks are controlled via hyperparameters such as walk length, which determines the exploration depth, and the number of walks, ensuring sufficient sampling of the graph's neighborhoods. Additionally, the return parameter (P) influences the likelihood of revisiting previously visited nodes (local context), while the in–out parameter (Q) balances the exploration of distant nodes (global context), enabling a nuanced traversal of the graph's structure.
To optimize the quality of the embeddings, we performed hyperparameter tuning to identify the best parameter settings. The key parameters and their ranges are as follows:
  • Walk length: the length of random walks was varied from 10 to 100 steps to capture different levels of contextual information for each node.
  • Number of walks: the number of walks per node was adjusted from 5 to 30 to ensure sufficient sampling of network neighborhoods.
  • Embedding dimensions: various dimensionalities were tested, including 64, 128, 256, 512, and 1024, to evaluate the impact of dimensionality on embedding quality.
  • P and Q: the return (P) and in–out (Q) hyperparameters, which control the probability of returning to the previous node and of exploring outward, respectively, were tuned within the range of 0.2 to 2.0. These parameters balance the weighting between local and global context in the embeddings.
For hyperparameter tuning, we randomly selected 20 nodes from the GO graph. For each embedding configuration, we computed the cosine similarity between each of these nodes and all of its subsumers up to the root, and averaged the results across the 20 nodes. This process was repeated for each parameter combination, and the configuration with the highest average similarity was selected for further experiments. P and Q were tested in the range of 0.2 to 2.0, with optimal values selected based on the similarity consistency between child–parent pairs across the ontology. Embedding dimensions were varied from 64 to 1024, with 256 yielding the best tradeoff between computational efficiency and semantic richness. The final embeddings were validated by measuring the cosine similarity between known functionally related terms in the ontology.
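As an illustration, the following minimal sketch shows how such embeddings could be generated and checked, assuming the obonet and node2vec Python packages (not necessarily the authors' implementation); parameter values mirror the tuned configuration reported below, with P shown as 0.2, an illustrative value from the stated tuning range:

```python
# Sketch of GO embedding generation with Node2Vec; assumes the obonet and
# node2vec packages (pip install obonet node2vec). Illustrative only.
import networkx as nx
import numpy as np
import obonet
from node2vec import Node2Vec

go = obonet.read_obo("go-basic.obo")   # GO graph; edges run child -> parent
graph = nx.Graph(go)                   # undirected view for random walks

# Biased random walks: P governs returning to the previous node, Q balances
# local (BFS-like) versus exploratory (DFS-like) moves.
n2v = Node2Vec(graph, dimensions=512, walk_length=70, num_walks=30,
               p=0.2, q=1.4, workers=10)
model = n2v.fit(window=10, min_count=1)

# Tuning check described above: average cosine similarity between a sampled
# node and all of its subsumers up to the root.
node = "GO:0006355"                    # hypothetical sampled term
subsumers = nx.descendants(go, node)   # ancestors, given child -> parent edges
sims = [model.wv.similarity(node, anc) for anc in subsumers]
print("mean similarity to subsumers:", np.mean(sims))
```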
For LLM-based embeddings, we leveraged PubMedGPT and MiniLM6 to encode textual definitions and descriptions of ontology terms into dense vector representations. We extracted embeddings by feeding the GO term definitions and hierarchical relationships as input prompts and processed them using pre-trained transformer-based models fine-tuned on biomedical literature. Embeddings were optimized by evaluating their ability to group related ontology terms in unsupervised clustering tasks, ensuring that similar concepts were consistently mapped to nearby vector spaces.
For MiniLM-L6 and PubMed, we used the default parameters and applied the Sentence Transformers library for sentence embedding; the embedded vector for each node was used to calculate cosine similarity (a short sketch follows the parameter list below). For Node2Vec, after tuning, we arrived at the following parameters:
  • Workers = 10
  • Dimensions = 512
  • Num_walks = 30
  • Walk_length = 70
  • P = 0
  • Q = 1.4
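A minimal sketch of the sentence-embedding step, assuming the sentence-transformers package and the public all-MiniLM-L6-v2 checkpoint as the MiniLM6 model (the exact PubMed-based model is not specified here, so only MiniLM is shown); the definition strings are illustrative:

```python
# Sketch of LLM-based node embeddings via Sentence Transformers; the GO
# definitions below are illustrative stand-ins for real ontology text.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed MiniLM6 checkpoint

parent_def = "The maintenance of an internal steady state of water within an organism."
child_def = "The maintenance of an internal steady state of water within a cell."

embeddings = model.encode([parent_def, child_def], normalize_embeddings=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"cosine similarity: {similarity:.3f}")
```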

2.2. Semantic Similarity Metrics

We employed four core similarity measures to assess relationships between ontology nodes:
  • Jaccard similarity: Jaccard semantic similarity evaluates the similarity between concepts within an ontology using the Jaccard Index, a measure traditionally applied to set similarity. In ontology-driven applications, the Jaccard Index can be adapted to capture semantic similarities based on shared features, relationships, or annotations (a minimal sketch follows this list).
  • Cosine similarity using hierarchical embedding: hierarchical embedding with Node2Vec combines the idea of generating node embeddings for a graph with hierarchical graph structures, enabling the model to capture relationships and features at multiple levels of the hierarchy.
  • Cosine similarity using LLM PubMed embeddings: LLM PubMed embeddings refer to vector representations of biomedical text derived from PubMed articles using large language models (LLMs). These embeddings capture the semantic meaning of biomedical terms, phrases, or documents, enabling advanced downstream applications in healthcare, life sciences, and biomedical research.
  • Cosine similarity using MiniLM6 embeddings: MiniLM6 embeddings are lightweight, high-performance vector representations derived from MiniLM v6, a distilled version of a transformer-based language model. MiniLM6 focuses on delivering powerful semantic embeddings with reduced computational overhead, making it efficient for large-scale and real-time applications.
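One common adaptation of the Jaccard Index to ontologies computes the index over the two terms' subsumer sets; since the exact feature sets are not pinned down above, the following sketch is illustrative:

```python
# Illustrative Jaccard similarity between two GO terms, computed over their
# ancestor (subsumer) sets; assumes a graph loaded with obonet, whose edges
# run child -> parent, so a term's ancestors are its graph descendants.
import networkx as nx
import obonet

def jaccard_similarity(graph: nx.MultiDiGraph, term_a: str, term_b: str) -> float:
    anc_a = nx.descendants(graph, term_a) | {term_a}
    anc_b = nx.descendants(graph, term_b) | {term_b}
    return len(anc_a & anc_b) / len(anc_a | anc_b)

go = obonet.read_obo("go-basic.obo")
print(jaccard_similarity(go, "GO:0006355", "GO:0010468"))  # hypothetical pair
```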
In addition to the four core metrics, we created the 11 hybrid combinations of these metrics listed in Table 1.
The selection and integration of hybrid similarity metrics were driven by a combination of theoretical principles and empirical validation. Traditional ontology-based similarity metrics (e.g., Jaccard and Resnik) focus on hierarchical relationships but often fail to capture semantic nuances found in textual descriptions. Conversely, LLM-based embeddings provide richer semantic context but may overlook structured relationships embedded in the ontology graph.
To address these limitations, we combined hierarchical embeddings, language-based embeddings, and traditional similarity measures (e.g., Jaccard similarity) to create a hybrid framework of metrics (Table 1) that leverages both structural and semantic knowledge.
Jaccard similarity captures shared annotations between terms, making it useful for functional comparison. Cosine similarity using hierarchical embeddings (Node2Vec) emphasizes relationships defined by the ontology’s graph structure. Cosine similarity using PubMed embeddings enriches term comparisons by incorporating biomedical literature-based context. Cosine similarity using MiniLM6 embeddings enhances generalizability by utilizing a compact transformer-based model trained on diverse textual data.
We tested the integration of these metrics, guided by statistical tests, by assessing the discriminative power of individual and combined metrics in distinguishing between child–parent node pairs and random node pairs. Hybrid combinations that maximize the separation between these groups were prioritized. Additionally, we validated the combinations through logistic regression models, where features derived from different similarity metrics were used to predict whether two nodes shared a hierarchical relationship.

2.3. Dataset for Evaluating Semantic Similarity

We used Gene Ontology to create two synthetic datasets to evaluate various semantic similarity methods. The first dataset, representing the signal, consists of 2000 parent–child node pairs selected from the GO graph, capturing meaningful semantic relationships. The second dataset, representing noise, contains 2000 randomly selected node pairs, which are unlikely to have strong semantic connections.
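A minimal sketch of constructing these two datasets, assuming the obonet package (the restriction to is-a edges is an assumption; edges in an obonet graph run from child to parent):

```python
# Sketch of building the signal (child-parent) and noise (random) datasets.
import random
import obonet

go = obonet.read_obo("go-basic.obo")

# Signal: 2000 true child-parent pairs (here restricted to is-a edges).
is_a_edges = [(c, p) for c, p, rel in go.edges(keys=True) if rel == "is_a"]
signal_pairs = random.sample(is_a_edges, 2000)

# Noise: 2000 random node pairs, unlikely to be semantically related.
nodes = list(go.nodes())
noise_pairs = [tuple(random.sample(nodes, 2)) for _ in range(2000)]
```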

2.4. Performance Evaluation

The semantic similarity metrics in this study were evaluated in two ways:
  • Ability to distinguish between real similarity and noise: Different semantic similarity metrics were applied to the signal and noise datasets to observe the metrics’ ability to discriminate between the two. The metric with the largest disparity in similarity scores between the signal and noise datasets can be said to effectively distinguish meaningful relationships from random associations.
  • Machine learning evaluation of semantic similarity metrics: we used scores from the four core metrics, as well as the 11 hybrid metrics above, as features in machine learning models to predict whether the similarity was from a random pair or a child–parent pair.
    The goal of these models was to evaluate the predictive ability of different similarity metrics (or combinations thereof) to distinguish between child–parent and random node pairs.
    The dataset was split into training (80%) and testing (20%) sets, and we trained a linear regression model to predict relationships between nodes. Additionally, we used logistic regression to classify node pairs and evaluate model accuracy (a minimal sketch of this setup follows). Our primary focus was to determine whether combining hierarchical and LLM embeddings led to better classification performance.
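The classification step could look like the following, where the input file, the feature column names, and the label column are hypothetical:

```python
# Sketch of the logistic regression evaluation; similarity_scores.csv, the
# feature columns, and the label column are hypothetical names.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("similarity_scores.csv")
features = ["jaccard", "hierarchy", "pubmed", "l6"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["is_child_parent"], test_size=0.2, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```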

3. Results

3.1. Ability to Distinguish Between Real Similarity and Noise

First, we present the results from evaluating the discriminative power of the four core metrics at differentiating similarities between child–parent node pairs and random pairs. Figure 1 highlights the ability of four similarity metrics—Jaccard, PubMed, L6, and Hierarchy—to distinguish between child–parent pairs (blue circles) and random pairs (red triangles). Across all metrics, the child–parent pairs consistently achieve higher similarity scores than random pairs, as expected. Among the metrics, Jaccard, followed by PubMed and L6, demonstrated a strong separation between the two groups, with a clear gap in mean scores. These metrics are particularly effective at distinguishing the pairs, as is evident from the minimal overlap between error bars for the two categories. Overall, the results suggest that, while all metrics can distinguish child–parent pairs from random pairs, Jaccard, PubMed, and L6 appear to perform this task more reliably than the Hierarchy metric.
Figure 2 evaluates the ability of combined similarity metrics—combinations of Jaccard, Hierarchy, PubMed, and L6—to distinguish between child–parent pairs (blue circles) and random pairs (red triangles). The separation between the two groups is evident, with minimal overlap of error bars in most combinations as compared to Figure 1, suggesting that the combined metrics improve differentiation.
Combinations that include Jaccard alongside the language-based metrics (e.g., Jaccard+PubMed+L6 and Jaccard+Hierarchy+L6) show the most substantial separation, with the random pairs maintaining consistently low scores and child–parent pairs achieving near-maximum similarity. This highlights the complementary strength of these metrics when combined.
In contrast, combinations that rely on Hierarchy without Jaccard (e.g., Hierarchy+L6) show slightly reduced effectiveness, as indicated by a greater overlap in the error bars between the two groups. While all metric combinations successfully distinguish the pairs to some extent, the inclusion of Jaccard, PubMed, and L6 together appears to maximize the ability to differentiate child–parent pairs from random pairs. Overall, the figure highlights that combining features, particularly those that integrate multiple perspectives (e.g., Jaccard similarity, Hierarchy, and PubMed context), is crucial for robustly distinguishing structured relationships.
We performed independent t-tests to compare similarity scores between child–parent and random node pairs. The null hypothesis stated that there was no significant difference in similarity scores between these two groups. Our analysis revealed that, for all metrics shown in Figure 1 and Figure 2, the null hypothesis could be rejected at a significance level of 0.05/14, adjusted for multiple testing using Bonferroni’s correction. This confirms that each metric effectively distinguishes child–parent pairs from randomly selected node pairs, reinforcing the robustness of our similarity measures.
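For concreteness, the per-metric test reduces to the following, shown here with synthetic stand-in scores:

```python
# Sketch of the significance test: independent t-test per metric with a
# Bonferroni-adjusted threshold; the score arrays are synthetic stand-ins.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
signal_scores = rng.normal(0.8, 0.1, 2000)  # child-parent similarities
noise_scores = rng.normal(0.3, 0.1, 2000)   # random-pair similarities

t_stat, p_value = ttest_ind(signal_scores, noise_scores)
alpha = 0.05 / 14  # Bonferroni correction, as reported
print(f"t = {t_stat:.1f}, p = {p_value:.3g}, reject H0: {p_value < alpha}")
```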

3.2. Machine Learning Evaluation of Semantic Similarity Metrics

The results in Table 2 demonstrate the accuracy of logistic regression models using various combinations of features, revealing insights into the effectiveness of these features in classification tasks. Among individual features, Jaccard achieves the highest accuracy at 0.97, while cosine similarity using Hierarchy embeddings has the lowest at 0.6832. The combination of features enhances performance, with "Jaccard" combined with "Hierarchy" achieving an accuracy of 0.97, and the combination of "Jaccard", "Hierarchy", and "L6" yielding the highest accuracy of 0.98. Notably, feature combinations involving "Jaccard" consistently outperform others, suggesting its strong predictive power. Combining Jaccard with other embedding-based similarities results in an increase in predictive power.
These results demonstrate that combining multiple types of embeddings yields the highest accuracy in predicting child–parent relationships, highlighting the benefit of incorporating both structural and semantic knowledge. The cosine similarity measures for hierarchical, PubMed, and MiniLM6 embeddings, along with Jaccard similarity, provided insights into node relationships. Combining hierarchical embeddings with LLM embeddings (such as PubMed or MiniLM6) improved the model's ability to distinguish between child–parent and random pairs. Specifically, the combination of hierarchical and LLM embeddings resulted in higher cosine similarity scores for child–parent pairs, indicating enhanced performance in capturing hierarchical relationships. The logistic regression model's accuracy was significantly higher when using combined embeddings, achieving approximately 98% accuracy in predicting child–parent pairs, underscoring the robustness of the combined approach.

4. Discussion

The integration of graph-based and language-based embeddings has been shown to significantly enhance the ability to differentiate between hierarchical and non-hierarchical relationships. Our findings are consistent with previous research, which highlights the importance of using multiple embedding types to capture complementary information within ontologies. The application of cosine similarity for evaluating semantic relationships, combined with Jaccard similarity for assessing structural overlap, has been demonstrated to be effective in a variety of domains.
Our findings indicate that the Node2Vec algorithm, with optimized parameters, can effectively embed ontology graphs for downstream similarity analysis. The integration of multiple similarity measures enabled us to gain a nuanced view of node relationships. High cosine similarity values for child–parent pairs reflect the embeddings' ability to capture hierarchical relationships effectively. Additionally, the combined use of hierarchical embeddings and LLM embeddings consistently outperformed individual embeddings, highlighting the potential of integrated embedding approaches to capture complex relationships more accurately.
The improved ability to differentiate between child–parent and random node pairs suggests that incorporating both hierarchical and semantic information enhances ontological analysis. This approach could be particularly impactful in biomedical informatics, where understanding relationships between entities such as genes or proteins is critical. The distribution of cosine similarity for PubMed embeddings and MiniLM6 embeddings further supports the effectiveness of combining these embeddings in representing semantic relationships. The effectiveness of different embedding combinations was also reflected in the feature importance analysis, which showed that the features derived from combined embeddings were most influential in predicting child–parent relationships. This indicates that the integration of structural and semantic information provides complementary insights that are crucial for understanding complex ontologies.
Future studies could explore more advanced models, such as neural networks, to achieve further performance gains. Additionally, incorporating other types of embeddings or domain-specific contextual information could further improve node similarity predictions. The application of transformer-based models to generate richer semantic embeddings could be a promising direction for enhancing the representation of ontology nodes.

5. Conclusions

Ontologies serve as foundational tools for organizing and interpreting complex domain-specific knowledge. Their accurate annotation is essential for enabling downstream applications such as data integration, functional prediction, and knowledge discovery. However, as the volume of biomedical and genomic data grows exponentially, manual curation has become increasingly impractical. In response, natural language processing (NLP)-based systems have emerged as scalable alternatives for ontology annotation. Yet, evaluating the quality of these systems requires metrics that go beyond traditional measures like precision and recall, as ontology annotations often involve hierarchical and partially correct relationships. This study highlights the critical role of robust semantic similarity metrics in improving the evaluation and application of automated ontology annotation systems, particularly those driven by NLP approaches.
We demonstrate the significant advantages of combining multiple embedding types and similarity metrics to create robust tools that effectively capture semantic similarity between ontology concepts. By integrating traditional metrics with language-based embeddings derived from large language models (LLMs), we show that hybrid approaches significantly enhance the performance of semantic similarity measures. Individual metrics, while effective in isolation, benefit greatly from integration, leading to improved robustness and accuracy in distinguishing meaningful relationships within ontologies.
Graph-based embeddings, such as those generated by Node2Vec, encode topological features and the hierarchical arrangement of ontology nodes, preserving critical information about structural relationships. Conversely, language-based embeddings excel at encapsulating semantic richness and contextual meaning derived from textual descriptions of ontology nodes. This duality in the nature of embeddings highlights their complementary strengths. The combination of these approaches leverages structural context while enriching the semantic depth of node representations, resulting in a holistic framework for understanding complex ontological relationships.
Our experimental results reveal that robust semantic similarity metrics can effectively distinguish between meaningful annotations (e.g., child–parent pairs) and noise (e.g., random pairs). This ability is critical for assessing the performance of NLP-based ontology annotation systems, as it demonstrates their capability to produce biologically relevant annotations. Furthermore, when these metrics are used as features in predictive models, they enhance classification performance, achieving high accuracy in distinguishing between true and random relationships. These findings suggest that robust semantic similarity metrics can also serve as valuable tools for debugging and optimizing annotation pipelines, helping researchers identify areas for improvement and refine their models.
Robust similarity metrics also have significant implications for improving the design of automated annotation systems. By providing granular feedback on the quality of predictions, these metrics can guide the iterative improvement of NLP models. Metrics emphasizing hierarchical relationships can prioritize annotations closer to the ground truth within the ontology’s structure. Similarly, metrics leveraging language-based embeddings can identify cases where semantic meaning is preserved despite structural deviations, encouraging the development of hybrid approaches that balance both perspectives. This iterative refinement process can lead to the creation of more accurate and reliable annotation systems, ultimately accelerating scientific discovery in data-intensive fields.
In conclusion, robust semantic similarity metrics are indispensable for evaluating NLP-based approaches for automated ontology annotation. By accounting for both structural and semantic dimensions, these metrics provide a nuanced framework for assessing annotation quality, guiding the improvement of annotation systems, and enabling the integration of diverse data sources. As automated ontology annotation continues to evolve, robust similarity metrics will play a pivotal role in ensuring scalability, accuracy, and reliability, supporting advancements in biomedical research, genomics, and beyond.
This work advances the field of ontology annotation by addressing critical challenges that have hindered accuracy in automated annotation systems. Traditional ontology annotation methods have relied heavily on manual curation, which, while highly accurate, is labor-intensive, time-consuming, and unable to keep pace with the exponential growth of biomedical data. Early computational approaches introduced statistical NLP techniques, but they often lacked the ability to capture the complex hierarchical and semantic relationships inherent in ontologies.
Despite these advancements, several key challenges remain unresolved. One of the most persistent issues is the difficulty of capturing both the hierarchical structure of ontologies and the contextual meaning of terms, leading to inconsistencies in annotation accuracy. Traditional semantic similarity metrics that are still widely used, such as those of Resnik, Lin, and Jiang-Conrath, struggle with sensitivity to annotation depth, often failing to account for the true semantic relationships between terms. To mitigate this, the proposed work integrates both graph-based embeddings, such as those generated by Node2Vec, and language-based embeddings derived from models like PubMedGPT and MiniLM6. This hybrid approach ensures that similarity assessments are both structurally grounded and contextually aware, significantly improving annotation robustness.
The practical implications of these advancements are substantial, particularly in biomedical informatics, genomics, and precision medicine. Enhanced ontology annotation will facilitate the integration of diverse biomedical datasets, improving meta-analyses, functional predictions, and cross-species comparisons. In drug discovery and disease research, the ability to accurately and efficiently annotate genomic and proteomic data will accelerate the identification of disease-related genes and potential therapeutic targets. In scientific literature curation, NLP-driven annotation tools will streamline the extraction and organization of biomedical knowledge, significantly reducing the burden on human curators.

6. Future Directions

Building upon our findings, future research should focus on the following key innovations to enhance ontology annotation systems:
  • Development of Next-Generation Semantic Similarity Metrics: Existing similarity measures exhibit sensitivity to annotation depth and hierarchical structures, leading to inconsistencies in similarity assessments. To address these limitations, future work should explore the following:
    • Context-aware semantic metrics: develop hybrid similarity measures that dynamically adjust based on term frequency, ontology structure, and contextual embeddings.
    • Graph neural networks (GNNs) for semantic similarity: utilize GNNs to encode hierarchical relationships while incorporating domain-specific text embeddings, ensuring robust cross-domain applicability.
    • Uncertainty-aware similarity metrics: introduce probabilistic methods that account for annotation uncertainty, particularly in incomplete or evolving ontologies.
  • Real-Time Integration of Automated Ontology Annotation Pipelines: To bridge the gap between NLP-based annotation systems and real-world biomedical applications, future research should focus on streaming annotation pipelines, federated learning for distributed annotation, and explainable AI (XAI) for ontology annotation. This requires the implementation of real-time text mining frameworks that continuously ingest scientific literature and dynamically update ontology annotations. Additionally, federated learning should be leveraged to train NLP models across multiple institutions while preserving data privacy, thereby enhancing annotation scalability. Finally, transparent annotation models with built-in interpretability modules must be developed to provide justifications for assigned ontology terms, fostering trust and reliability in automated annotation systems.
  • Leveraging Advanced Neural Architectures for Ontology Embeddings: Recent advances in deep learning models present promising opportunities for enhancing ontology-based text mining. One key direction is contrastive learning for ontology representations, where transformer-based models such as BioBERT and PubMedGPT are trained using contrastive loss to refine ontology embeddings by aligning contextual and structural information. Additionally, multi-modal ontology learning aims to improve annotation accuracy by integrating textual definitions, biological pathways, and experimental data, such as gene expression, into multimodal neural models. Another critical avenue is self-supervised pretraining on ontology graphs, which leverages techniques like masked node prediction and node contrastive learning to enhance representations of under-annotated ontology nodes, ensuring more robust and comprehensive knowledge extraction.
By advancing these areas, future research can ensure that automated ontology annotation systems remain scalable, interpretable, and aligned with real-world biomedical applications.

Author Contributions

Conceptualization, P.M. and S.D.M.; methodology, P.D.; software, P.D.; validation, P.D.; formal analysis, P.D. and A.N.; investigation, P.D.; resources, P.M.; data curation, P.M.; writing—original draft preparation, P.M. and A.N.; writing—review and editing, P.M.; visualization, P.D.; supervision, P.M. and S.D.M.; project administration, P.M.; funding acquisition, P.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by a CAREER award (#1942727) to Manda from the Division of Biological Infrastructure at the National Science Foundation, USA.

Data Availability Statement

The data and code presented in this study are available on request from the corresponding author.

Conflicts of Interest

Author Pratik Devkota was employed by the company Fractal Analytics. Author Somya D. Mohanty was employed by the company United Health Group. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Smith, B.; Williams, J.; Steffen, S.K. The ontology of the gene ontology. AMIA Annu. Symp. Proc. 2003, 2003, 609–613.
  2. Lesteven, S.; Derriere, S.; Dubois, P.; Genova, F.; Preite Martinez, A.; Hernandez, N.; Mothe, J.; Napoli, A.; Toussaint, Y. Ontologies for Astronomy. In Library and Information Services in Astronomy V; Astronomical Society of the Pacific: San Francisco, CA, USA, 2007; Volume 377, p. 193.
  3. Ashburner, M.; Ball, C.A.; Blake, J.A.; Botstein, D.; Butler, H.; Cherry, J.M.; Davis, A.P.; Dolinski, K.; Dwight, S.S.; Eppig, J.T.; et al. Gene ontology: Tool for the unification of biology. Nat. Genet. 2000, 25, 25–29.
  4. Both, V. From Functional Similarity Among Gene Products to Dependence Relations Among Gene Ontology Terms. In Proceedings of the Second Conference on Standards and Ontologies for Functional Genomics (SOFG 2), Philadelphia, PA, USA, 23–26 October 2004; p. 53.
  5. Bodenreider, O.; Smith, B.; Kumar, A.; Burgun, A. Investigating subsumption in SNOMED CT: An exploration into large description logic-based biomedical terminologies. Artif. Intell. Med. 2007, 39, 183–195.
  6. The Gene Ontology Consortium. The Gene Ontology resource: Enriching a GOld mine. Nucleic Acids Res. 2021, 49, D325–D334.
  7. Huang, D.W.; Sherman, B.T.; Lempicki, R.A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 2009, 4, 44–57.
  8. Dessimoz, C.; Škunca, N. The Gene Ontology Handbook; Humana Press: Totowa, NJ, USA, 2017.
  9. Robinson, P.N.; Köhler, S.; Bauer, S.; Seelow, D.; Horn, D.; Mundlos, S. The Human Phenotype Ontology: A tool for annotating and analyzing human hereditary disease. Am. J. Hum. Genet. 2008, 83, 610–615.
  10. Fang, L.; Cui, C.; Zhang, J.; Yang, R.; Liu, B. Predicting gene ontology annotations using domain-specific word embeddings. PLoS Comput. Biol. 2021, 17, e1009056.
  11. Radivojac, P.; Clark, W.T.; Oron, T.R.; Schnoes, A.M.; Wittkop, T.; Sokolov, A.; Graim, K.; Funk, C.; Verspoor, K.; Ben-Hur, A.; et al. A large-scale evaluation of computational protein function prediction. Nat. Methods 2013, 10, 221–227.
  12. Zhao, M.; Havrilla, J.M.; Fang, L.; Chen, Y.; Peng, J.; Liu, C.; Wu, C.; Sarmady, M.; Botas, P.; Isla, J.; et al. Phen2Gene: Rapid phenotype-driven gene prioritization for rare diseases. NAR Genom. Bioinform. 2020, 2, lqaa032.
  13. Groth, P.; Pavlova, N.; Kalev, I.; Tonov, S.; Georgiev, G.; Pohjalainen, E. Automated assignment of biomedical annotations using statistical and machine learning methods. Nat. Biotechnol. 2010, 28, 977–982.
  14. Baumgartner, W.A.; Cohen, K.B.; Fox, L.; Acquaah-Mensah, G.; Hunter, L. Manual curation is not sufficient for annotation of genomic data. Bioinformatics 2008, 24, 1846–1852.
  15. Manda, P.; SayedAhmed, S.; Mohanty, S.D. Automated Ontology-Based Annotation of Scientific Literature Using Deep Learning. In Proceedings of the SBD '20: International Workshop on Semantic Big Data, New York, NY, USA, 14–19 June 2020.
  16. Devkota, P.; Mohanty, S.D.; Manda, P. A Gated Recurrent Unit based architecture for recognizing ontology concepts from biological literature. BioData Min. 2022, 15, 22.
  17. Devkota, P.; Mohanty, S.; Manda, P. Ontology-Powered Boosting for Improved Recognition of Ontology Concepts from Biological Literature. In Proceedings of the 16th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2023), Lisbon, Portugal, 16–18 February 2023; Volume 3.
  18. Devkota, P.; Mohanty, S.; Manda, P. Knowledge of the Ancestors: Intelligent Ontology-aware Annotation of Biological Literature using Semantic Similarity. In Proceedings of the International Conference on Biomedical Ontology, Ann Arbor, MI, USA, 25–28 September 2022.
  19. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805.
  20. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901.
  21. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240.
  22. Zhou, N.; Jiang, Y.; Bergquist, T.R.; Lee, A.J.; Kacsoh, B.Z.; Crocker, A.W.; Gillis, J. The Gene Ontology resource: 20 years and still GOing strong. Nucleic Acids Res. 2019, 47, D330–D338.
  23. Pesquita, C.; Faria, D.; Falcao, A.O.; Lord, P.; Couto, F.M. Semantic similarity in biomedical ontologies. PLoS Comput. Biol. 2009, 5, e1000443.
  24. Resnik, P. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. J. Artif. Intell. Res. 1999, 11, 95–130.
  25. Lin, D. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning (ICML), Madison, WI, USA, 24–27 July 1998; pp. 296–304.
  26. Jiang, J.J.; Conrath, D.W. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the 10th Research on Computational Linguistics International Conference, Taipei, Taiwan, 19–22 August 1997; pp. 19–33.
  27. Wang, J.; Zhang, Z.; Li, J. A new method to measure the semantic similarity in gene ontology. BMC Bioinform. 2007, 8, 44.
  28. Manda, P.; Vision, T.J. On the Statistical Sensitivity of Semantic Similarity Metrics. In Proceedings of the 9th International Conference on Biological Ontology (ICBO 2018), Corvallis, OR, USA, 7–10 August 2018.
Figure 1. Comparison of discrimination between real similarity and noise among four core metrics.
Figure 2. Comparison of discrimination between real similarity and noise among combinations of metrics.
Table 1. Hybrid combinations of core metrics.
Jaccard & Hierarchy
Jaccard & Pubmed
Jaccard & L6
Hierarchy & Pubmed
Hierarchy & L6
Pubmed & L6
Jaccard, Hierarchy & Pubmed
Jaccard, Hierarchy & L6
Jaccard, Pubmed & L6
Hierarchy, Pubmed & L6
Jaccard, Hierarchy, Pubmed & L6
Table 2. Accuracy and other performance metrics of logistic regression model with different combinations of similarity metrics as features.
Predictive Metrics | Accuracy | Precision | Recall | F1 Score
Jaccard, Hierarchy, L6 | 0.98 | 0.98 | 0.97 | 0.97
Jaccard | 0.97 | 0.97 | 0.96 | 0.97
Jaccard, Hierarchy | 0.97 | 0.98 | 0.97 | 0.97
Jaccard, Pubmed | 0.97 | 0.97 | 0.97 | 0.97
Jaccard, L6 | 0.97 | 0.97 | 0.96 | 0.97
Jaccard, Hierarchy, Pubmed | 0.97 | 0.98 | 0.97 | 0.97
Jaccard, Hierarchy, Pubmed, L6 | 0.97 | 0.98 | 0.97 | 0.97
Jaccard, Pubmed, L6 | 0.97 | 0.97 | 0.97 | 0.97
Hierarchy, Pubmed, L6 | 0.91 | 0.90 | 0.90 | 0.90
Pubmed | 0.90 | 0.90 | 0.89 | 0.90
Hierarchy, Pubmed | 0.90 | 0.90 | 0.89 | 0.90
Hierarchy, L6 | 0.90 | 0.90 | 0.89 | 0.90
Pubmed, L6 | 0.90 | 0.90 | 0.90 | 0.90
L6 | 0.89 | 0.89 | 0.88 | 0.89
Hierarchy | 0.68 | 0.64 | 0.80 | 0.71

