Article

KGEval: Evaluating Scientific Knowledge Graphs with Large Language Models

Vladyslav Nechakhin, Jennifer D’Souza, Steffen Eger and Sören Auer
1 Leibniz Information Centre for Science and Technology, 30167 Hannover, Germany
2 Natural Language Learning & Generation (NLLG), University of Technology Nuremberg (UTN), 90461 Nuremberg, Germany
* Author to whom correspondence should be addressed.
Information 2026, 17(1), 35; https://doi.org/10.3390/info17010035
Submission received: 25 November 2025 / Revised: 27 December 2025 / Accepted: 31 December 2025 / Published: 3 January 2026
(This article belongs to the Special Issue Feature Papers in Information in 2024–2025)

Abstract

This paper explores the novel application of large language models (LLMs) as evaluators for structured scientific summaries—a task where traditional natural language evaluation metrics may not readily apply. Leveraging the Open Research Knowledge Graph (ORKG) as a repository of human-curated properties, we augment a gold-standard dataset by generating corresponding properties using three distinct LLMs—Llama, Mistral, and Qwen—under three contextual settings: context-lean (research problem only), context-rich (research problem with title and abstract), and context-dense (research problem with multiple similar papers). To assess the quality of these properties, we employ LLM evaluators (Deepseek, Mistral, and Qwen) to rate them on criteria, including similarity, relevance, factuality, informativeness, coherence, and specificity. This study addresses key research questions: How do LLM-as-a-judge rubrics transfer to the evaluation of structured summaries? How do LLM-generated properties compare to human-annotated ones? What are the performance differences among various LLMs? How does the amount of contextual input affect the generation quality? The resulting evaluation framework, KGEval, offers a customizable approach that can be extended to diverse knowledge graphs and application domains. Our experimental findings reveal distinct patterns in evaluator biases, contextual sensitivity, and inter-model performance, thereby highlighting both the promise and the challenges of integrating LLMs into structured science evaluation.

1. Introduction

Knowledge bases such as the Open Research Knowledge Graph (ORKG) [1] are crucial for making scientific findings FAIR (findable, accessible, interoperable, and reusable) [2]. By providing structured summaries of research contributions, these KGs enable efficient comparison and retrieval of scholarly work. However, populating such knowledge graphs is inherently costly and time-consuming, as it relies heavily on manual curation by domain experts.
Large language models (LLMs) have shown great promise in automating the construction of structured representations, potentially alleviating the burden of manual annotation [3]. Despite their potential, evaluating the quality of LLM-generated outputs in the scientific domain poses unique challenges. Traditional NLP metrics such as BLEU [4] and ROUGE [5] are primarily designed to assess surface-level text matching and do not capture deeper semantic meaning or the task-specific nuances required for structured science summarization. For example, while these metrics can measure word overlap between generated and reference texts, they fail to assess whether the generated properties accurately and comprehensively reflect the underlying research problem.
To address these challenges, we introduce KGEval, a novel framework that leverages LLMs both as generators and as evaluators (i.e., as LLM-as-a-judge [6]) of structured scientific summaries. KGEval is built with a modular architecture that enables the interchangeability of LLMs and allows for customization of prompts and evaluation criteria. This flexibility is essential for systematically investigating how various context types and evaluation strategies influence the quality of generated properties in a domain as complex as scientific research.
Specifically, our work investigates the following research questions:
  • RQ1. Transferability of Qualitative Rubrics: How effectively do LLM-as-a-judge evaluation rubrics capture the quality of structured summaries, given that these outputs lack the conventional sentence structures found in natural language?
  • RQ2. Comparison with Human Annotations: How do the properties generated by LLMs compare to human-annotated properties in terms of relevance, consistency, and other evaluation criteria?
  • RQ3. Inter-Model Performance: How do different LLMs (e.g., Qwen, Mistral, and Llama) perform in generating properties, and how does their performance compare?
  • RQ4. Impact of Context: How does the amount and type of contextual input (research problem only vs. research problem with one abstract vs. research problem with multiple abstracts) affect the quality of the generated properties?
In this paper, we focus primarily on the evaluation of structured outputs. By leveraging LLMs as both generators and evaluators, KGEval systematically assesses the quality of generated scientific properties while addressing challenges such as scientific domain complexity and the high cost of manual evaluation. Our contributions are threefold. First, we introduce a robust evaluation framework that repurposes LLMs as evaluators, overcoming the limitations of traditional metrics. Second, we provide a comparative analysis between LLM-generated properties and expert-annotated ORKG properties, highlighting both strengths and areas for improvement. Third, we examine how varying contextual inputs influence the performance of different LLMs, thereby offering insights into the scalability and adaptability of automated structured science summarization.
In the following sections, we detail the KGEval framework, describe our dataset and experimental setup, and present an in-depth analysis of our results.

2. Related Work

In the NLP community, developing evaluation metrics that reliably measure the quality of tasks such as translation or summarization has long been a fundamental concern. In recent decades, a plethora of paradigms have been suggested. These range from (1) lexical overlap metrics such as BLEU [4] and ROUGE [5], which are inherently limited, to (2) semantic similarity metrics like BERTScore [7] and MoverScore [8]; (3) text generation-based metrics such as BARTScore [9] and PRISM [10]; (4) natural language inference metrics like MENLI [11], which promise to increase robustness; and (5) prompting-based metrics such as GEMBA [12], which rely on LLMs and their prompts for judging the quality of outputs. While these metrics have been explored for ‘standard’ tasks like summarization and machine translation, their potential for evaluating structured science summaries remains largely underexplored. In this work, we fill this gap, focusing primarily on prompt-based metrics to evaluate structured science summaries.
Several recent studies have adopted the LLM-as-a-judge paradigm to assess generated content by correlating LLM evaluations with human judgments using pairwise preference evaluations [13,14,15,16,17]. Building on these methods, some approaches have incorporated rubric-based techniques—such as G-Eval [17] for summarization and GPTScore [18] for flexible prompt-based evaluation—to capture nuanced aspects of generated text. In addition, frameworks such as FLASK [19] and Prometheus [20] have advanced the state of the art by emphasizing fine-grained rubrics that assess robustness, correctness, efficiency, factuality, and readability. Together, these studies underscore the evolving landscape of LLM-based evaluation frameworks and motivate our extension of the paradigm to the domain of structured science summarization.
Our previous work [21] laid the groundwork by exploring the feasibility of using LLMs to recommend research properties for structured science summarization in the ORKG. That study employed methods such as semantic alignment and deviation assessments, fine-grained property-to-dimension mappings, embeddings-based evaluations, and human surveys to compare LLM-generated dimensions with expert-curated ORKG properties. Building upon these findings, our current work extends the LLM-as-a-judge paradigm through the KGEval framework, integrating both generation and evaluation in a unified system while systematically examining evaluator biases and context effects.
In summary, while a range of evaluation metrics and frameworks have been proposed in the literature, our work contributes by extending prompting-based evaluation to the domain of structured science summarization. By building on the advancements in LLM-based evaluation rubrics and our previous findings, we provide a robust, open-science framework that is adaptable to diverse scientific KGs.

3. The KGEval Framework

In this section, we present KGEval, a modular framework designed to both generate and evaluate structured scientific properties. The framework is built around two primary modules: an LLM generator and an LLM evaluator. This modular design enables KGEval to handle various input contexts and property sources flexibly, making it adaptable to a wide range of KGs and evaluation tasks.

3.1. Framework Overview and Workflow

KGEval is structured as a two-module system that operates in a sequential yet modular fashion. The first module, the LLM generator, accepts diverse forms of input context, such as research questions, abstracts, full papers, articles, or even multiple related papers. Regardless of the context—whether context-lean (research problem only), -rich (research problem with title and abstract), or -dense (research problem with multiple abstracts)—the generator utilizes customizable prompts (which remain consistent across scenarios, with only the input context varying) to produce structured representations, hereafter referred to as properties. Once these properties are generated, they are forwarded to the second module, the LLM evaluator.
The evaluator module is designed to assess the quality of the properties based on a comprehensive, unified prompt that incorporates multiple qualitative criteria. This module accepts properties generated by the LLM generator as well as those obtained from external sources, such as human-annotated entries from a KG (e.g., ORKG). The evaluator then outputs a quantitative score reflecting the quality of the input properties according to criteria such as similarity, relevance, factuality, informativeness, coherence, and specificity. Both modules leverage a shared LLM management system that supports various models (e.g., Deepseek, Llama, Mistral, and Qwen), which can be run locally or accessed via API. Figure 1 illustrates the KGEval pipeline, showing how context (research question, abstract, and papers) and prompts feed into the generator, how the generated properties and human-annotated properties (from a KG) are then evaluated against defined criteria, and how the evaluator produces a final score.
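To make this flow concrete, the following minimal sketch mirrors the generator–evaluator wiring described above. The function names, signatures, and placeholder return values are illustrative assumptions rather than the actual KGEval implementation.

```python
# Minimal sketch of the KGEval two-module flow (illustrative stubs only;
# in the real framework both functions wrap LLM chat-completion calls).
from typing import Dict, List

CRITERIA = ["relevance", "factuality", "informativeness", "coherence", "specificity"]

def generate_properties(model: str, context: Dict[str, str]) -> List[str]:
    # Module 1: the LLM generator returns a flat list of property strings
    # for the given context (lean, rich, or dense).
    return ["Etching rate", "Etchant composition"]  # placeholder output

def evaluate_properties(model: str, context: Dict[str, str],
                        properties: List[str]) -> Dict[str, int]:
    # Module 2: the LLM evaluator scores a property set on the direct
    # assessment criteria using a 1-5 Likert scale.
    return {criterion: 4 for criterion in CRITERIA}  # placeholder scores

context = {"research_problem": "Etching of silicon"}  # lean context
properties = generate_properties("llama", context)
scores = evaluate_properties("deepseek", context, properties)
print(properties, scores)
```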

3.2. Evaluation Scenarios and Criteria

KGEval supports two primary evaluation scenarios: direct assessment and pairwise ranking. In the direct assessment scenario, a single set of properties—either generated by the LLM or curated by humans—is evaluated against the qualitative criteria. This process yields individual scores that reflect the properties’ relevance, factuality, informativeness, coherence, and specificity. In contrast, the pairwise ranking scenario directly compares two sets of properties. For example, KGEval compares properties generated using a research problem alone (context-lean) against those generated with additional contextual information (rich or dense context), as well as comparing LLM-generated properties against human-annotated ones from the ORKG. These comparisons are conducted on a Likert scale (ranging from 1 to 5), enabling a fine-grained evaluation of how different contexts and sources influence the quality and consistency of the structured representations.
The evaluation process is streamlined by incorporating all the qualitative criteria into a single prompt used by the LLM evaluator. This approach not only saves on API calls but also ensures a standardized assessment method across different property sets. The six criteria—similarity, relevance, factuality, informativeness, coherence, and specificity—are operationalized within this prompt to provide a comprehensive evaluation of each property set. As a result, KGEval is capable of systematically comparing outputs across different contexts and sources, yielding insights into the strengths and limitations of both LLM-generated and human-curated scientific representations.
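As an illustration of the two scenarios, the records below show one way a direct assessment result (one score per criterion) and a pairwise ranking result (a single similarity score) could be represented; the field names are our own choices, not taken from the KGEval codebase.

```python
# Illustrative result records for the two evaluation scenarios.
direct_assessment = {
    "properties": ["Etching rate", "Substrate", "Miller index"],
    "scores": {  # 1-5 Likert rating per criterion
        "relevance": 4, "factuality": 5, "informativeness": 3,
        "coherence": 4, "specificity": 3,
    },
}

pairwise_ranking = {
    "set_1": ["Etching rate", "Substrate", "Miller index"],  # e.g. ORKG properties
    "set_2": ["Etchant composition", "Etch depth"],          # e.g. LLM-generated
    "similarity": 2,                                         # single 1-5 Likert score
}
```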
Overall, the KGEval framework provides a robust and adaptable method for evaluating structured scientific data. Its modular architecture, combined with a unified evaluation strategy, allows for flexible integration with a variety of LLMs and input contexts, ultimately facilitating a deeper understanding of how qualitative rubrics transfer to the evaluation of structured summaries.

4. Experimental Dataset and Setup

Our evaluation dataset is based on the gold-standard annotations extracted from the ORKG. We constructed this dataset by curating a selection of ORKG comparisons and later extended it with associated abstracts to provide additional contextual information. These comparisons were chosen from those created by experienced ORKG users with diverse research backgrounds. The selection criteria mandated that each comparison contain at least three properties and a minimum of five contributions. This ensured that the properties represented a rich and structured depiction of research problems rather than a sparse or superficial summary. Applying these criteria yielded a dataset of 103 ORKG comparisons, encompassing 1317 papers across 35 research fields and covering over 150 distinct research problems. The highly multidisciplinary dataset includes examples from domains such as Earth Sciences, Natural Language Processing, Medicinal Chemistry, Operations Research, Systems Engineering, Cultural History, and the Semantic Web.
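The selection step can be sketched as a simple filter, shown below with hypothetical record fields; only the two thresholds (at least three properties and at least five contributions) are taken from the description above.

```python
# Hedged sketch of the comparison selection criteria; the field names
# ("properties", "contributions") are illustrative, not the ORKG API schema.
def keep_comparison(comparison: dict) -> bool:
    return (len(comparison.get("properties", [])) >= 3
            and len(comparison.get("contributions", [])) >= 5)

comparisons = [
    {"id": "R1", "properties": ["p1", "p2", "p3"], "contributions": ["c1"] * 5},
    {"id": "R2", "properties": ["p1"], "contributions": ["c1"] * 2},
]
selected = [c for c in comparisons if keep_comparison(c)]  # keeps only "R1"
```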
The LLMs and their identifiers used in the experiments were as follows:
  • meta-llama-3.1-70b-instruct (referred to in text as “Llama”);
  • deepseek-r1-distill-llama-70b (DeepSeek);
  • mistral-large-instruct (Mistral);
  • qwen2.5-72b-instruct (Qwen).
Inference was performed through the Academic Cloud Chat AI API (https://academiccloud.de, accessed on 15 March 2025). The generation parameters were left at the API defaults for our runs, in particular temperature = 0.8 and top_p = 0.9. If reproducibility requires absolute stability of generated outputs, running with explicit sampling seeds and greedy decoding (temperature = 0), optionally together with a fixed max_tokens value, reduces randomness; these options were not used in the presented experiments.
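For illustration, a single generation call might look like the sketch below, assuming an OpenAI-compatible chat-completions interface to the service; the base URL, environment-variable name, and prompt placeholder are assumptions, while the model identifier and sampling parameters match those listed above.

```python
# Hedged sketch of one inference call with the reported sampling parameters.
# The endpoint URL and API-key variable name are illustrative placeholders.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://chat-ai.academiccloud.de/v1",  # placeholder endpoint
    api_key=os.environ["ACADEMIC_CLOUD_API_KEY"],    # hypothetical variable name
)

response = client.chat.completions.create(
    model="meta-llama-3.1-70b-instruct",  # one of the generator models listed above
    messages=[{"role": "user", "content": "<generation prompt goes here>"}],
    temperature=0.8,  # API default used in our runs
    top_p=0.9,        # API default used in our runs
    # temperature=0.0 and an explicit seed would reduce run-to-run variation
)
print(response.choices[0].message.content)
```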
Since evaluator judgments are obtained via LLM inference, runtime and monetary cost are dominated by the chosen model and by input context length (lean/rich/dense) and will differ between hosted-API and local deployments. As our experiments used a hosted API, we cannot report provider hardware footprints; all other framework computations (preprocessing, prompt assembly, and aggregation) are negligible in comparison. Users who require concrete latency, token usage, or cost estimates for their environment should measure per-call latency and token counts against their chosen deployment.

4.1. LLMs as Generators

For the task of property generation, we employ three different LLMs: Llama, Mistral, and Qwen. These models are tasked with generating structured representations (i.e., properties) from various forms of context. The generation module of KGEval accepts a range of input types, including research problems, titles, abstracts, and multiple related papers. The underlying prompt structure remains consistent; only the input context varies among the three scenarios: lean context, rich context, and dense context.
The prompts used for generation are designed with a structured format that includes specific tags such as <role>, <task>, <context-input>, and <output-response-format>. In the <role> tag, the LLM is assigned the role of a researcher whose objective is to analyze and identify common properties that characterize significant contributions across research studies. The <task> tag instructs the model to generate a list of properties that succinctly capture the salient aspects of the research problem. The <context-input> tag provides the necessary context—varying according to the scenario—while the <output-response-format> tag enforces a strict output structure (a list data structure) to ensure consistency. Detailed descriptions of these prompts, along with their variations for different contexts, are provided in Appendix A.
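A minimal reconstruction of this prompt skeleton is sketched below; the wording is approximate, and the exact prompts used in the experiments are the ones reproduced in Appendix A.

```python
# Approximate generation-prompt skeleton (illustrative wording, not verbatim).
GENERATION_PROMPT = """\
<role>
You are a researcher analyzing studies to identify the common properties that
characterize their significant contributions.
</role>
<task>
Generate a list of properties that succinctly capture the salient aspects of
the research problem described below.
</task>
<context-input>
{context}
</context-input>
<output-response-format>
Return only a list data structure of property names.
</output-response-format>
"""

lean_context = "Research problem: Etching of silicon"
prompt = GENERATION_PROMPT.format(context=lean_context)
```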

4.2. LLMs as Evaluators

For the evaluation of the generated properties, KGEval employs a set of LLMs, including Deepseek, Mistral, and Qwen. The evaluator module is designed to assess the quality of properties based on a unified evaluation prompt that integrates several qualitative criteria: relevance, factuality, informativeness, coherence, and specificity. The evaluation framework supports two scenarios: direct assessment and pairwise ranking. In direct assessment, a single set of properties—whether generated by an LLM or sourced from ORKG—is evaluated against the input context. In pairwise ranking, two sets of properties are compared directly to determine their similarity.
The evaluation prompts follow a structured format similar to the generation prompts. They include tags such as <role>, where the LLM is assigned the role of an evaluator, and <task>, which outlines the criteria and steps for evaluation. Additional tags, such as <input> (which provides the context and the properties to be evaluated) and <output_format> (which specifies the desired feedback format, including both qualitative feedback and quantitative scores) ensure that the evaluation is both standardized and comprehensive. These prompts, along with detailed guidelines for rating each criterion on a Likert scale, are also presented in Appendix A.
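Analogously, the unified direct assessment prompt can be sketched as follows, again with approximate wording; the tag structure matches the description above, while the exact formulations appear in Appendix A.

```python
# Approximate direct assessment prompt skeleton (illustrative wording).
EVALUATION_PROMPT = """\
<role>
You are an evaluator of structured scientific properties.
</role>
<task>
Rate the properties below on relevance, factuality, informativeness, coherence,
and specificity, each on a 1-5 Likert scale, and briefly justify each rating.
</task>
<input>
Context: {context}
Properties: {properties}
</input>
<output_format>
Return one line per criterion in the form "criterion: score - justification".
</output_format>
"""

prompt = EVALUATION_PROMPT.format(
    context="Etching of silicon",
    properties=["Etching rate", "Substrate", "Miller index"],
)
```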
In summary, our experimental setup leverages a multidisciplinary, gold-standard dataset derived from the ORKG and employs LLMs in dual roles—as generators and evaluators—within the KGEval framework. The use of structured yet customizable prompts in both modules enables a systematic investigation into the quality of structured scientific representations across different context scenarios and evaluation tasks.

4.3. Example Instance

To illustrate our experimental setup, consider an example using Llama as the generator in the rich context scenario. For this example, the research problem is “Etching of silicon,” and the paper is titled “Modified TMAH based etchant for improved etching characteristics on Si{1 0 0} wafer.” The ORKG properties for this paper, as manually curated by domain experts, are as follows: ‘Measured at temperature’, ‘Etching rate’, ‘Type of etching’, ‘Research problem’, ‘Substrate’, ‘Type of etching mixture’, ‘Miller index’.
In contrast, Llama generated the following properties in the rich scenario: ‘Etchant composition’, ‘Etching rate’, ‘Surface morphology’, ‘Undercutting characteristics’, ‘Etch depth’.
These properties were subsequently evaluated by three different LLM evaluators using our five defined criteria. The scores for the ORKG properties were as follows:
  • Deepseek: [4, 5, 4, 5, 4];
  • Mistral: [4, 5, 3, 4, 3];
  • Qwen: [3, 4, 3, 4, 3].
For the Llama-generated properties, the corresponding evaluation scores were as follows:
  • Deepseek: [5, 5, 4, 5, 4];
  • Mistral: [4, 5, 3, 4, 3];
  • Qwen: [3, 4, 3, 4, 3].
This example demonstrates the process of generating structured scientific properties using LLMs and evaluating them using multiple criteria with different evaluators. It thereby highlights both the strengths and challenges of aligning LLM-generated outputs with expert-curated annotations.
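Purely to illustrate how such per-evaluator score vectors can be summarized, the snippet below averages the ratings listed above; note that the figures in Table 1 are averaged over all evaluated properties, not per instance as in this toy aggregation.

```python
# Each vector holds the five criterion ratings in the order: relevance,
# factuality, informativeness, coherence, specificity (values from the example).
from statistics import mean

orkg_scores = {"Deepseek": [4, 5, 4, 5, 4], "Mistral": [4, 5, 3, 4, 3], "Qwen": [3, 4, 3, 4, 3]}
llama_scores = {"Deepseek": [5, 5, 4, 5, 4], "Mistral": [4, 5, 3, 4, 3], "Qwen": [3, 4, 3, 4, 3]}

for name, scores in (("ORKG", orkg_scores), ("Llama", llama_scores)):
    per_evaluator = {evaluator: mean(ratings) for evaluator, ratings in scores.items()}
    print(name, per_evaluator, round(mean(per_evaluator.values()), 2))
# ORKG:  Deepseek 4.4, Mistral 3.8, Qwen 3.4 -> overall 3.87
# Llama: Deepseek 4.6, Mistral 3.8, Qwen 3.4 -> overall 3.93
```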

5. Results

In this section, we report the outcomes of our experimental evaluation of generated and human-curated properties. We present the results from direct assessment and pairwise ranking experiments, quantify evaluator self-preference using Cohen’s d, and compare LLM evaluators to human judgments via Spearman rank correlations.

5.1. Direct Assessment

The direct assessment experiments yielded evaluation scores across five criteria (relevance, factuality, informativeness, coherence, and specificity) on a 1–5 Likert scale. For each context scenario, the evaluations were conducted by three different LLM evaluators (Deepseek, Mistral, and Qwen) on properties generated by Llama, Mistral, and Qwen, as well as on human-annotated properties (ORKG). In addition to numerical scores, the evaluation prompt elicits free-text justifications for each rating, so the underlying rationales are available and could be systematically coded to derive an error taxonomy. Note that ORKG entries were created by domain experts from the full paper text (title, abstract, and body), which corresponds most closely to our “rich” context; therefore, ORKG properties were evaluated only in the rich scenario so that comparisons to human curation are made under equivalent information conditions. All scores reported in Table 1 are averaged over all evaluated properties. Complete per-scenario results, with scores reported as mean ± standard deviation, are provided in Appendix B. To facilitate interpretation of the numerical results in Table 1, Figure 2 provides a heatmap visualization of criterion-averaged direct assessment scores across generators, evaluators, and context scenarios.
Across scenarios, the following stable patterns are evident from the aggregated scores (see Appendix B for full tables): (1) generated properties produced by modern LLMs routinely receive high average ratings on relevance, factuality, and coherence; (2) informativeness and specificity are consistently rated lower, indicating persistent difficulty in producing precise, domain-specific details and fine-grained distinctions; and (3) human-curated ORKG properties receive markedly lower mean ratings than LLM-generated properties in the rich scenario, from both the LLM evaluators and the human validators (see the Human Evaluation subsection below for details).

Self-Preference Bias

Since both Mistral and Qwen were used as generators and evaluators, we can assess self-preference bias by comparing how these models rate their own outputs versus outputs from other generators. We compute scenario-level, criterion-averaged Cohen’s d using the aggregated mean ± SD values in Appendix B. The resulting effect sizes are shown in Figure 3. In the lean scenario, Mistral shows near-zero bias (d = −0.037), while Qwen exhibits a moderately negative bias (d = −0.508), suggesting weak or even reversed self-preference when context is limited. In the rich scenario, both evaluators demonstrate strong positive bias (Mistral d = 0.888; Qwen d = 0.855), indicating a clear preference for their own outputs when more context is available. In the dense scenario, self-preference largely disappears (Mistral d = −0.078; Qwen d = 0.166), possibly due to increased uniformity of outputs or reduced recognizability of one’s own generation patterns. Overall, self-preference is highly context-dependent: weak or negative in the lean and dense contexts and strong in the rich context, with model-specific differences that vary by scenario rather than indicating a systematic dominance of one evaluator. Negative values suggest evaluators may undervalue their own outputs under certain conditions, reflecting a conservative evaluation tendency.
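For reference, a pooled-standard-deviation Cohen’s d of the kind used here can be computed as in the sketch below; the grouping into “own” versus “other” ratings follows the text, but the numeric inputs in the example are illustrative rather than the experimental values.

```python
# Cohen's d with pooled standard deviation: positive values mean the evaluator
# rates its own generator's outputs higher than outputs from other generators.
from math import sqrt

def cohens_d(mean_own, sd_own, n_own, mean_other, sd_other, n_other):
    pooled_sd = sqrt(((n_own - 1) * sd_own**2 + (n_other - 1) * sd_other**2)
                     / (n_own + n_other - 2))
    return (mean_own - mean_other) / pooled_sd

# Illustrative criterion-averaged inputs (not the paper's exact numbers):
d = cohens_d(mean_own=4.60, sd_own=0.62, n_own=500,
             mean_other=4.03, sd_other=0.66, n_other=1500)
print(round(d, 2))  # 0.88 for these made-up inputs
```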

5.2. Human Evaluation

To assess the extent to which LLM-based evaluators align with human judgments, we conducted a small-scale human validation experiment in the rich context scenario. Human annotators evaluated properties generated by Llama, Mistral, and Qwen, as well as human-authored ORKG properties. For each of the four sources, the first 25 instances were selected, and each instance was rated across the five evaluation criteria, resulting in a total of 500 human evaluation scores. As with the LLM evaluators, human annotators were blinded to the origin of the properties to avoid potential bias.
We then computed Spearman rank correlation coefficients between the human scores and the corresponding scores produced by each LLM evaluator (Deepseek, Mistral, and Qwen). For each evaluator and criterion, scores were first averaged per generator, resulting in one aggregated score for each of the four generators (ORKG, Llama, Mistral, and Qwen). Correlations were then calculated across these four generator-level scores, both separately for each criterion and averaged across criteria. The results are summarized in Table 2.
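The computation reduces to a rank correlation over four generator-level averages, as sketched below with placeholder numbers (the real inputs are the averaged human and LLM-evaluator scores described above).

```python
# Generator-level Spearman correlation between human and LLM-evaluator scores.
# The four values per list are placeholders, not the experimental averages.
from scipy.stats import spearmanr

generators = ["ORKG", "Llama", "Mistral", "Qwen"]
human_avg = [2.2, 3.8, 3.7, 4.0]       # one aggregated human score per generator
evaluator_avg = [3.0, 4.5, 4.6, 4.8]   # same generators, one LLM evaluator

rho, _ = spearmanr(human_avg, evaluator_avg)
print(rho)  # rank agreement over only four points, hence descriptive
```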
Overall, the LLM evaluators show strong rank-order agreement with human judgments in the rich scenario. Criterion-averaged Spearman correlations range from ρ = 0.836 (Qwen) to ρ = 0.910 (Mistral), with an overall average correlation of ρ = 0.878. Relevance shows perfect rank correlation with human judgments (ρ = 1.000) for all three evaluators; factuality is likewise perfect for Deepseek and Mistral but moderately lower for Qwen (ρ = 0.632). Informativeness shows consistently strong correlations across evaluators, whereas coherence and specificity exhibit moderate but stable agreement.
It is important to note that these correlations are computed over only four data points (one per generator) and should therefore be interpreted as descriptive rather than confirmatory. Nevertheless, the consistently high correlations across evaluators and criteria suggest that, in information-rich settings, LLM-based evaluators capture ranking preferences that are largely aligned with human judgments.
Finally, we observe that ORKG properties are rated substantially lower than LLM-generated properties by human annotators across all criteria. Possible explanations include the more conservative and template-driven nature of ORKG property descriptions, higher variance across human authors leading to stylistic and conceptual inconsistency, and potential schema alignment issues that reduce perceived relevance and specificity compared to LLM-generated content.

5.3. Pairwise Ranking

Complementing the direct assessment, the pairwise ranking experiments (Table 3) provide further insight into the structural alignment of properties across different contexts and between human-annotated and LLM-generated outputs. In these experiments, the similarity score represents the Similarity criterion, and the values are averaged over all properties. Pairwise similarity scores reveal that properties generated with richer contexts (rich and dense) are more similar to each other than those generated in the lean scenario. For instance, for Llama-generated properties, the similarity score between the rich and dense contexts was 3.38, higher than the scores between lean and rich (2.76) or lean and dense (2.76). A similar pattern is observed for Mistral and Qwen, where rich versus dense comparisons yield scores of 3.38 and 3.60, respectively, while lean versus rich/dense scores remain between 2.66 and 2.77.
The similarity scores are consistently lower when comparing human-annotated (ORKG) properties with LLM-generated properties. For example, the similarity between ORKG properties and Llama-generated properties in the lean context is as low as 1.96, indicating a substantial structural divergence. This trend holds across all models, suggesting that while richer contexts help stabilize and align LLM outputs with each other, they do not fully bridge the gap to human-curated representations.

5.4. Key Findings and Implications

Across all experiments, several consistent patterns emerge that clarify both the strengths and limitations of LLM-based property generation and evaluation. First, LLM-generated properties achieve consistently high mean scores for relevance, factuality, and coherence across lean, rich, and dense scenarios (Table 1; Appendix B). This indicates that contemporary LLMs are generally effective at producing structured summaries that capture the central themes of scientific texts, particularly when assessed at an aggregate level.
At the same time, informativeness and specificity remain consistently weaker dimensions. These criteria exhibit lower average scores and greater variability across evaluators and scenarios, suggesting that precise, domain-specific claims and fine-grained distinctions are more difficult for current models to capture reliably. This pattern persists even as contextual richness increases, highlighting an important limitation of automated property generation for knowledge graph construction.
Evaluator behavior further reveals that assessment outcomes are not purely model-agnostic. Using scenario-level, criterion-averaged Cohen’s d, we find that self-preference bias is strongly context-dependent. In the rich scenario, both Mistral and Qwen show large positive self-preference effects (Mistral d = 0.888; Qwen d = 0.855), indicating that evaluators tend to rate their own outputs substantially higher when more contextual information is available. In contrast, self-preference is weak or absent in the lean and dense scenarios (lean: Mistral d = −0.037, Qwen d = −0.508; dense: Mistral d = −0.078, Qwen d = 0.166), suggesting that minimal context or highly redundant information weakens evaluator familiarity effects. Notably, negative values imply that under some conditions, evaluators may rate other models’ outputs more favorably than their own.
Despite these biases, LLM evaluators exhibit strong alignment with human judgments in information-rich settings. In the rich scenario, Spearman rank correlations between LLM evaluators and human assessments are high across all criteria, with criterion-averaged correlations of ρ = 0.887 for Deepseek, ρ = 0.910 for Mistral, and ρ = 0.836 for Qwen (overall average ρ = 0.878). These results indicate that, at least at the level of generator ranking, LLM evaluators capture preference structures that are broadly consistent with human evaluators. However, because these correlations are computed over only four generators, they should be interpreted as descriptive rather than confirmatory.
The human validation experiment also reveals a systematic difference between LLM-generated and human-curated content. ORKG properties are rated substantially lower than LLM-generated properties across all criteria. Possible explanations include the more concise and template-driven nature of ORKG entries, higher stylistic and conceptual variance across human authors, and schema alignment mismatches between ORKG property formulations and the evaluation rubric, which may disadvantage human-authored content in this assessment framework.
Taken together, these findings suggest that KGEval-style LLM evaluation pipelines are a viable and scalable tool for assessing structured scientific summaries, particularly in rich-context settings where evaluator–human agreement is high. At the same time, the presence of context-dependent self-preference and persistent weaknesses in factual precision and specificity motivate a hybrid workflow. Automated LLM evaluation is well suited for screening large volumes of candidate properties and identifying high-level quality patterns, while targeted human review remains essential for validating factual correctness, resolving evaluator disagreement, and ensuring alignment with domain-specific knowledge standards.

6. Discussion and Future Work

The results demonstrate that LLMs can effectively generate and evaluate structured scientific representations, while also revealing important limitations that must be addressed in practical deployments. Within the KGEval framework, LLM-generated properties consistently score highly on relevance, factuality, and coherence but remain weaker on informativeness and specificity. Moreover, evaluator behavior is not model-agnostic: self-preference effects emerge in information-rich contexts, underscoring the need to interpret evaluation results in light of evaluator identity and context conditions.
KGEval also addresses key shortcomings of traditional NLP evaluation metrics such as BLEU or ROUGE, which are poorly suited to structured outputs and semantic adequacy. By leveraging LLMs as evaluators, KGEval enables task-specific, semantically informed assessment. This approach is empirically supported by the strong rank-order agreement observed between LLM evaluators and human judgments in rich contexts. At the same time, the results highlight clear boundaries: agreement is strongest at the level of relative ranking rather than absolute correctness, and factual precision remains a persistent challenge.
These observations have direct methodological implications for the design of evaluation pipelines. Rather than treating LLM evaluators as substitutes for human judgment, our results suggest they are most effective when embedded within hybrid workflows that exploit their scalability while accounting for their biases. In particular, evaluator self-preference and sensitivity to context indicate that evaluator choice and configuration should be treated as experimental factors rather than neutral instruments. This perspective reframes LLM evaluation from a purely automated alternative to human assessment into a controllable component of a broader curation process.
Future work will therefore focus on validating KGEval across external knowledge graphs with diverse schemas, mitigating evaluator bias through ensembling or evaluator separation, and improving informativeness and specificity via tighter evidence grounding and retrieval-aware evaluation prompts. By explicitly addressing these limitations, KGEval can serve as a robust component of hybrid workflows that combine the scalability of LLM-based evaluation with the reliability of targeted human oversight.

Cross-KG Validation (ClaimsKG)

Although our experiments focus on the ORKG, KGEval is designed to generalize beyond a single knowledge graph. Its prompt-driven architecture and rubric-based evaluation make it applicable to a wide range of research KGs, including SoftwareKG, Springer SciGraph, ClaimsKG, and others. While the same core evaluation criteria can be reused across domains, our findings suggest that prompt design, evaluator choice, and context configuration play a critical role in shaping outcomes. As a result, portability across KGs requires not only schema adaptation but also careful calibration of evaluation settings.
To further probe the generalizability of KGEval beyond the ORKG, we conducted an additional pilot study on ClaimsKG. We assembled a small dataset of 30 news articles and corresponding claims mentioning military conflict and adapted our direct assessment evaluation prompt (Appendix A.4) by replacing references to “properties” and “research papers” with “facts” and “articles,” respectively. This minimal prompt modification reflects the intended portability of KGEval across knowledge graphs with differing semantic units but comparable evaluation needs. We then applied the evaluator module using Deepseek as the LLM evaluator and performed a parallel human evaluation on the same dataset.
The resulting averaged scores show close agreement between LLM and human assessments. For the LLM evaluator, the mean scores were relevance = 5.0, factuality = 2.4, informativeness = 2.8, coherence = 5.0, and specificity = 5.0, while the corresponding human scores were relevance = 5.0, factuality = 2.4, informativeness = 2.5, coherence = 5.0, and specificity = 5.0. Since the dataset intentionally included both correct and false claims, factuality and informativeness exhibited substantial variance, whereas relevance, coherence, and specificity remained consistently high, as all claims were explicitly stated in their source articles. Spearman rank correlation computed over the criteria with non-zero variance (factuality and informativeness, averaged over 30 items) yields high agreement between LLM and human judgments (ρ = 0.93), indicating strong alignment in ranking behavior where discrimination is required.
While limited in scale, this experiment provides initial empirical support for the applicability of KGEval to a fact-checking-oriented knowledge graph with different structural assumptions than ORKG. At the same time, it reinforces observations from the main study: evaluation outcomes remain sensitive to prompt formulation, evaluator choice, and context configuration. Future work will extend this validation to larger and more diverse datasets across multiple knowledge graphs and evaluators, enabling a more systematic assessment of how evaluation criteria and prompt adaptations interact with domain-specific characteristics.

7. Conclusions

This paper introduced KGEval, a modular framework for generating and evaluating structured scientific properties using large language models. We studied the transferability of qualitative evaluation rubrics, inter-model differences among LLMs, and the impact of contextual richness on both generation quality and evaluation behavior.
Our results show that LLM-generated properties generally achieve high scores on relevance, factuality, and coherence, while informativeness and specificity remain persistent challenges. Evaluator behavior is context-sensitive: self-preference effects are pronounced in rich contexts but weak or absent in lean and dense settings, and Spearman correlations indicate strong rank-order agreement between LLM evaluators and human judgments in the rich scenario. These findings support the use of LLM-based evaluation for large-scale ranking, but not as a replacement for targeted human validation, particularly for factual accuracy.
This study is subject to limitations, including small sample sizes for some analyses, reliance on aggregated statistics, and schema-alignment effects that disadvantage human-authored ORKG entries under the evaluation rubric. Future work will focus on mitigating evaluator bias, improving factual precision, and validating KGEval across additional knowledge graphs with diverse schemas.
Overall, KGEval provides a practical step toward scalable evaluation of structured scientific content, with the greatest utility in hybrid human–LLM workflows that combine automated ranking with expert oversight.

Author Contributions

Conceptualization, J.D. and S.E.; methodology, V.N.; validation, V.N.; investigation, V.N. and J.D.; resources, V.N. and J.D.; data curation, V.N.; writing—original draft preparation, V.N. and J.D.; writing—review and editing, J.D., S.E. and S.A.; visualization, V.N.; supervision, J.D., S.E. and S.A.; project administration, J.D.; funding acquisition, J.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the German BMBF project SCINEXT (ID 01lS22070), the European Research Council for ScienceGRAPH (Grant Agreement (GA) ID: 819536), and German DFG for NFDI4DataScience (no. 460234259).

Data Availability Statement

The original contributions presented in this study are included in this article. Further inquiries can be directed to the corresponding author(s).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Prompts Used in KGEval

Below, we provide the complete code for the prompts utilized in our framework. These prompts are used both for generating research dimensions under various context conditions and for evaluating the generated properties.

Appendix A.1. Zero-Shot Prompt for Lean Context

(Prompt reproduced as a figure in the published article.)

Appendix A.2. Zero-Shot Prompt for Rich Context

(Prompt reproduced as a figure in the published article.)

Appendix A.3. Zero-Shot Prompt for Dense Context

(Prompt reproduced as a figure in the published article.)

Appendix A.4. Direct Assessment Evaluation Prompt

(Prompt reproduced as a figure in the published article.)

Appendix A.5. Pairwise Ranking Evaluation Prompt

(Prompt reproduced as a figure in the published article.)

Appendix B. Full Direct Assessment Results

This appendix reports the complete direct assessment results for the lean, rich, and dense context scenarios. Scores are shown as mean ± standard deviation and are averaged over all evaluated properties for each generator–evaluator pair. Evaluations were performed on a 1–5 Likert scale across five criteria: relevance, factuality, informativeness, coherence, and specificity. Confidence intervals (95%) were computed for all reported means based on the corresponding scenario-level sample sizes (N = 326 for lean, N = 2861 for rich, and N = 2230 for dense) but are omitted from the tables for readability.
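Assuming a standard normal approximation (the exact interval procedure is not specified beyond the sample sizes), the reported 95% intervals would take the form:

```latex
\mathrm{CI}_{95\%} = \bar{x} \pm 1.96 \cdot \frac{s}{\sqrt{N}}, \qquad N \in \{326,\ 2861,\ 2230\}
```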
Table A1. Direct assessment results for the lean context scenario (mean ± standard deviation).

Generator | Evaluator | Relevance | Factuality | Informativeness | Coherence | Specificity
Llama | Deepseek | 4.69 ± 0.65 | 4.92 ± 0.39 | 4.46 ± 0.64 | 4.88 ± 0.36 | 4.05 ± 0.83
Mistral | Deepseek | 4.63 ± 0.58 | 4.88 ± 0.39 | 4.10 ± 0.67 | 4.80 ± 0.44 | 3.63 ± 0.83
Qwen | Deepseek | 4.48 ± 0.61 | 4.86 ± 0.41 | 4.10 ± 0.71 | 4.78 ± 0.46 | 3.41 ± 0.84
Llama | Mistral | 4.86 ± 0.41 | 4.97 ± 0.17 | 4.82 ± 0.50 | 4.99 ± 0.11 | 4.64 ± 0.76
Mistral | Mistral | 4.74 ± 0.56 | 4.95 ± 0.25 | 4.52 ± 0.75 | 4.95 ± 0.22 | 4.24 ± 0.99
Qwen | Mistral | 4.55 ± 0.71 | 4.91 ± 0.31 | 4.35 ± 0.83 | 4.95 ± 0.24 | 3.96 ± 1.14
Llama | Qwen | 4.32 ± 0.62 | 4.96 ± 0.24 | 3.79 ± 0.83 | 4.77 ± 0.42 | 3.65 ± 0.96
Mistral | Qwen | 4.23 ± 0.61 | 4.93 ± 0.29 | 3.55 ± 0.70 | 4.68 ± 0.47 | 3.46 ± 0.88
Qwen | Qwen | 3.92 ± 0.67 | 4.83 ± 0.38 | 3.31 ± 0.64 | 4.53 ± 0.54 | 3.03 ± 0.86
Table A2. Direct assessment results for the rich context scenario (mean ± standard deviation).

Generator | Evaluator | Relevance | Factuality | Informativeness | Coherence | Specificity
ORKG | Deepseek | 3.59 ± 0.74 | 4.12 ± 0.89 | 2.97 ± 0.77 | 3.66 ± 0.97 | 2.81 ± 0.82
Llama | Deepseek | 4.60 ± 0.63 | 4.85 ± 0.45 | 4.09 ± 0.71 | 4.78 ± 0.48 | 3.95 ± 0.90
Mistral | Deepseek | 4.70 ± 0.58 | 4.90 ± 0.37 | 4.09 ± 0.64 | 4.80 ± 0.45 | 4.07 ± 0.84
Qwen | Deepseek | 4.75 ± 0.59 | 4.92 ± 0.35 | 4.23 ± 0.66 | 4.83 ± 0.41 | 4.18 ± 0.84
ORKG | Mistral | 2.83 ± 0.72 | 3.53 ± 0.85 | 2.14 ± 0.78 | 3.17 ± 0.95 | 2.12 ± 0.78
Llama | Mistral | 4.57 ± 0.65 | 4.89 ± 0.33 | 4.28 ± 0.94 | 4.87 ± 0.39 | 4.25 ± 0.98
Mistral | Mistral | 4.63 ± 0.63 | 4.93 ± 0.32 | 4.26 ± 0.90 | 4.89 ± 0.36 | 4.30 ± 0.96
Qwen | Mistral | 4.78 ± 0.49 | 4.97 ± 0.21 | 4.54 ± 0.74 | 4.95 ± 0.23 | 4.54 ± 0.82
ORKG | Qwen | 2.89 ± 0.57 | 4.04 ± 0.46 | 2.53 ± 0.58 | 3.31 ± 0.68 | 2.44 ± 0.60
Llama | Qwen | 4.17 ± 0.63 | 4.92 ± 0.32 | 3.55 ± 0.65 | 4.53 ± 0.52 | 3.51 ± 0.80
Mistral | Qwen | 4.26 ± 0.61 | 4.91 ± 0.29 | 3.52 ± 0.61 | 4.58 ± 0.51 | 3.59 ± 0.76
Qwen | Qwen | 4.37 ± 0.60 | 4.92 ± 0.29 | 3.70 ± 0.65 | 4.64 ± 0.50 | 3.72 ± 0.79
Table A3. Direct assessment results for the dense context scenario (mean ± standard deviation).

Generator | Evaluator | Relevance | Factuality | Informativeness | Coherence | Specificity
Llama | Deepseek | 4.67 ± 0.59 | 4.90 ± 0.38 | 4.19 ± 0.72 | 4.81 ± 0.45 | 4.02 ± 0.90
Mistral | Deepseek | 4.72 ± 0.52 | 4.92 ± 0.34 | 4.07 ± 0.63 | 4.79 ± 0.48 | 3.95 ± 0.85
Qwen | Deepseek | 4.73 ± 0.56 | 4.92 ± 0.33 | 4.20 ± 0.67 | 4.82 ± 0.42 | 4.06 ± 0.86
Llama | Mistral | 4.49 ± 0.73 | 4.86 ± 0.41 | 4.23 ± 0.97 | 4.84 ± 0.41 | 4.13 ± 1.06
Mistral | Mistral | 4.55 ± 0.72 | 4.90 ± 0.35 | 4.20 ± 0.96 | 4.85 ± 0.39 | 4.19 ± 1.04
Qwen | Mistral | 4.68 ± 0.61 | 4.92 ± 0.32 | 4.45 ± 0.84 | 4.91 ± 0.32 | 4.40 ± 0.93
Llama | Qwen | 3.96 ± 0.71 | 4.73 ± 0.45 | 3.44 ± 0.63 | 4.36 ± 0.56 | 3.23 ± 0.84
Mistral | Qwen | 4.03 ± 0.70 | 4.79 ± 0.41 | 3.37 ± 0.58 | 4.38 ± 0.56 | 3.24 ± 0.79
Qwen | Qwen | 4.13 ± 0.68 | 4.82 ± 0.39 | 3.53 ± 0.64 | 4.42 ± 0.56 | 3.38 ± 0.83

References

  1. Auer, S.; Oelen, A.; Haris, M.; Stocker, M.; D’Souza, J.; Farfar, K.E.; Vogt, L.; Prinz, M.; Wiens, V.; Jaradeh, M.Y. Improving access to scientific literature with knowledge graphs. Bibl. Forsch. Und Prax. 2020, 44, 516–529. [Google Scholar] [CrossRef]
  2. Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.W.; da Silva Santos, L.B.; Bourne, P.E.; et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 2016, 3, 160018. [Google Scholar] [CrossRef] [PubMed]
  3. Meyer, L.P.; Stadler, C.; Frey, J.; Radtke, N.; Junghanns, K.; Meissner, R.; Dziwis, G.; Bulert, K.; Martin, M. Llm-assisted knowledge graph engineering: Experiments with chatgpt. In Proceedings of the Working conference on Artificial Intelligence Development for a Resilient and Sustainable Tomorrow; Springer Fachmedien Wiesbaden: Wiesbaden, Germany, 2023; pp. 103–115. [Google Scholar]
  4. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar]
  5. Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81. [Google Scholar]
  6. Gu, J.; Jiang, X.; Shi, Z.; Tan, H.; Zhai, X.; Xu, C.; Li, W.; Shen, Y.; Ma, S.; Liu, H.; et al. A Survey on LLM-as-a-Judge. arXiv 2025, arXiv:2411.15594. [Google Scholar]
  7. Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. Bertscore: Evaluating text generation with bert. arXiv 2019, arXiv:1904.09675. [Google Scholar]
  8. Zhao, W.; Peyrard, M.; Liu, F.; Gao, Y.; Meyer, C.M.; Eger, S. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. arXiv 2019, arXiv:1909.02622. [Google Scholar] [CrossRef]
  9. Yuan, W.; Neubig, G.; Liu, P. Bartscore: Evaluating generated text as text generation. Adv. Neural Inf. Process. Syst. 2021, 34, 27263–27277. [Google Scholar]
  10. Thompson, B.; Post, M. Automatic machine translation evaluation in many languages via zero-shot paraphrasing. arXiv 2020, arXiv:2004.14564. [Google Scholar] [CrossRef]
  11. Chen, Y.; Eger, S. Menli: Robust evaluation metrics from natural language inference. Trans. Assoc. Comput. Linguist. 2023, 11, 804–825. [Google Scholar] [CrossRef]
  12. Kocmi, T.; Federmann, C. Large language models are state-of-the-art evaluators of translation quality. arXiv 2023, arXiv:2302.14520. [Google Scholar]
  13. Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Adv. Neural Inf. Process. Syst. 2023, 36, 46595–46623. [Google Scholar]
  14. Wang, J.; Liang, Y.; Meng, F.; Sun, Z.; Shi, H.; Li, Z.; Xu, J.; Qu, J.; Zhou, J. Is chatgpt a good nlg evaluator? A preliminary study. arXiv 2023, arXiv:2303.04048. [Google Scholar] [CrossRef]
  15. Chiang, C.H.; Lee, H.y. Can large language models be an alternative to human evaluations? arXiv 2023, arXiv:2305.01937. [Google Scholar] [CrossRef]
  16. Dubois, Y.; Li, C.X.; Taori, R.; Zhang, T.; Gulrajani, I.; Ba, J.; Guestrin, C.; Liang, P.S.; Hashimoto, T.B. Alpacafarm: A simulation framework for methods that learn from human feedback. Adv. Neural Inf. Process. Syst. 2023, 36, 30039–30069. [Google Scholar]
  17. Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-eval: NLG evaluation using gpt-4 with better human alignment. arXiv 2023, arXiv:2303.16634. [Google Scholar] [CrossRef]
  18. Fu, J.; Ng, S.K.; Jiang, Z.; Liu, P. Gptscore: Evaluate as you desire. arXiv 2023, arXiv:2302.04166. [Google Scholar] [CrossRef]
  19. Ye, S.; Kim, D.; Kim, S.; Hwang, H.; Kim, S.; Jo, Y.; Thorne, J.; Kim, J.; Seo, M. Flask: Fine-grained language model evaluation based on alignment skill sets. arXiv 2023, arXiv:2307.10928. [Google Scholar]
  20. Kim, S.; Shin, J.; Cho, Y.; Jang, J.; Longpre, S.; Lee, H.; Yun, S.; Shin, S.; Kim, S.; Thorne, J.; et al. Prometheus: Inducing fine-grained evaluation capability in language models. In Proceedings of the The Twelfth International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  21. Nechakhin, V.; D’Souza, J.; Eger, S. Evaluating large language models for structured science summarization in the open research knowledge graph. Information 2024, 15, 328. [Google Scholar] [CrossRef]
Figure 1. Overview of the KGEval workflow.
Figure 2. Criterion-averaged direct assessment scores across generators, evaluators, and context scenarios.
Figure 3. Scenario-level, criterion-averaged Cohen’s d quantifying self-preference bias for Mistral and Qwen evaluators across lean, rich, and dense contexts. Positive values indicate a preference for evaluating their own outputs more favorably.
Table 1. Direct assessment scores. Columns R, F, I, C, and S indicate the relevance, factuality, informativeness, coherence, and specificity scores, respectively.

Generator | Evaluator | Lean (R/F/I/C/S) | Rich (R/F/I/C/S) | Dense (R/F/I/C/S)
ORKG | Deepseek | - | 3.59/4.12/2.97/3.66/2.81 | -
Llama | Deepseek | 4.69/4.92/4.46/4.88/4.05 | 4.60/4.85/4.09/4.78/3.95 | 4.67/4.90/4.19/4.81/4.01
Mistral | Deepseek | 4.63/4.88/4.10/4.80/3.63 | 4.70/4.90/4.09/4.80/4.07 | 4.72/4.92/4.07/4.79/3.95
Qwen | Deepseek | 4.48/4.86/4.10/4.78/3.41 | 4.75/4.92/4.23/4.83/4.18 | 4.73/4.92/4.20/4.82/4.06
ORKG | Mistral | - | 2.83/3.53/2.14/3.17/2.12 | -
Llama | Mistral | 4.86/4.97/4.82/4.99/4.64 | 4.57/4.89/4.28/4.87/4.25 | 4.49/4.86/4.23/4.84/4.13
Mistral | Mistral | 4.74/4.95/4.52/4.95/4.24 | 4.63/4.93/4.26/4.88/4.30 | 4.55/4.90/4.20/4.85/4.19
Qwen | Mistral | 4.55/4.91/4.35/4.94/3.96 | 4.77/4.97/4.54/4.95/4.54 | 4.68/4.92/4.45/4.91/4.40
ORKG | Qwen | - | 2.89/4.04/2.52/3.31/2.44 | -
Llama | Qwen | 4.32/4.96/3.79/4.77/3.65 | 4.17/4.92/3.55/4.53/3.51 | 3.96/4.73/3.44/4.36/3.23
Mistral | Qwen | 4.23/4.93/3.55/4.68/3.46 | 4.26/4.91/3.52/4.58/3.58 | 4.03/4.79/3.37/4.38/3.24
Qwen | Qwen | 3.92/4.83/3.31/4.52/3.03 | 4.37/4.92/3.70/4.64/3.72 | 4.13/4.82/3.53/4.42/3.38
ORKG | Human | - | 2.60/2.30/2.20/2.20/1.80 | -
Llama | Human | - | 4.10/4.10/3.40/4.20/3.10 | -
Mistral | Human | - | 4.20/4.50/3.10/3.90/3.00 | -
Qwen | Human | - | 4.50/4.60/3.40/4.30/3.30 | -
Table 2. Spearman rank correlation (ρ) between LLM evaluators and human judgments in the rich context scenario, computed over generator-level averaged assessment scores.

Evaluator | Relevance | Factuality | Informativeness | Coherence | Specificity | Avg.
Deepseek | 1.000 | 1.000 | 0.833 | 0.800 | 0.800 | 0.887
Mistral | 1.000 | 1.000 | 0.949 | 0.800 | 0.800 | 0.910
Qwen | 1.000 | 0.632 | 0.949 | 0.800 | 0.800 | 0.836
Table 3. Pairwise ranking similarity scores.

Properties Set 1 | Properties Set 2 | Similarity
Llama (lean) | Llama (rich) | 2.76
Llama (rich) | Llama (dense) | 3.38
Llama (lean) | Llama (dense) | 2.76
Mistral (lean) | Mistral (rich) | 2.66
Mistral (rich) | Mistral (dense) | 3.38
Mistral (lean) | Mistral (dense) | 2.66
Qwen (lean) | Qwen (rich) | 2.71
Qwen (rich) | Qwen (dense) | 3.60
Qwen (lean) | Qwen (dense) | 2.77
ORKG | Llama (lean) | 1.96
ORKG | Llama (rich) | 2.05
ORKG | Llama (dense) | 2.04
ORKG | Mistral (lean) | 2.13
ORKG | Mistral (rich) | 2.06
ORKG | Mistral (dense) | 2.11
ORKG | Qwen (lean) | 2.30
ORKG | Qwen (rich) | 2.15
ORKG | Qwen (dense) | 2.21