Article

Contextual Augmentation via Retrieval for Multi-Granularity Relation Extraction in LLMs

by
Danjie Han
1,
Lingzhong Meng
2,
Xun Li
1,
Jia Li
3,
Cunhan Guo
4,5,
Yanghao Zhou
4,
Changsen Yuan
4 and
Yuxi Ma
2,*
1
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
2
Institute of Software Chinese Academy of Sciences, Beijing 100190, China
3
School of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450001, China
4
School of Computer Science and Engineering, Beijing Institute of Technology, Beijing 100081, China
5
Southeast Academy of Information Technology, Beijing Institute of Technology, Putian 351100, China
*
Author to whom correspondence should be addressed.
Symmetry 2025, 17(8), 1201; https://doi.org/10.3390/sym17081201
Submission received: 19 June 2025 / Revised: 20 July 2025 / Accepted: 24 July 2025 / Published: 28 July 2025
(This article belongs to the Section Computer)

Abstract

To address issues commonly observed during the inference phase of large language models—such as inconsistent labels, formatting errors, or semantic deviations—a series of targeted strategies has been proposed. First, a relation label refinement strategy based on semantic similarity and syntactic structure has been designed to calibrate the model’s outputs, thereby improving the accuracy and consistency of label prediction. Second, to meet the contextual modeling needs of different types of instance bags, a multi-level contextual augmentation strategy has been constructed. For multi-sentence instance bags, a graph-based retrieval enhancement mechanism is introduced, which integrates intra-bag entity co-occurrence networks with document-level sentence association graphs to strengthen the model’s understanding of cross-sentence semantic relations. For single-sentence instance bags, a semantic expansion strategy based on term frequency-inverse document frequency is employed to retrieve similar sentences. This enriches the training context under the premise of semantic consistency, alleviating the problem of insufficient contextual information. Notably, the proposed multi-granularity framework captures semantic symmetry between entities and relations across different levels of context, which is crucial for accurate and balanced relation understanding. The proposed methodology offers practical advancements for semantic analysis applications, particularly in knowledge graph development.

1. Introduction

With a series of groundbreaking advancements in the field of large language models (LLMs), models such as Qwen2.5 [1], Baichuan2-7B-Chat (https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat (accessed on 2 April 2025)), LLaMA3-8B-Instruct [2], and DeepSeek-V3 [3] have achieved significant progress in natural language processing. These models have demonstrated strong capabilities not only in traditional tasks such as text generation [4] and recommender systems [5] but also in the increasingly prominent domain of information extraction (IE). As a core subtask of IE, relation extraction (RE) plays a pivotal role in practical applications such as knowledge graph construction [6] and intelligent question answering systems [7].
The essence of RE lies in the accurate identification of semantic relationships—such as causality and affiliation—between entity pairs in input texts. This task imposes stringent requirements on a model’s semantic comprehension, demanding both a holistic understanding of the overall context and precise recognition of local details. Traditional approaches, such as BERT [8], have made notable advances in local feature modeling through carefully designed fine-tuning strategies. However, their performance remains highly dependent on the quality of labeled data and tends to fall short in handling semantically complex scenarios. In recent years, the emergence of LLMs has introduced a new paradigm for RE. Importantly, understanding such relations often involves recognizing patterns of symmetry and asymmetry in semantic structures—for example, the bidirectionality of affiliation versus the unidirectionality of causality—which further complicates the task for general-purpose models.
Our systematic evaluations on three authoritative datasets—GDS [9], SemEval 2010 [10], and KBP37 [11]—demonstrate (see Figure 1) that the relation extraction capabilities of mainstream large language models, including Qwen2.5-3B/7B-Instruct, Baichuan2-7B-Chat, LLaMA3-8B-Instruct, and DeepSeek-V3, still leave considerable room for improvement when no task-specific fine-tuning is performed. This performance gap can be attributed to two primary factors.
First, the generative architecture of LLMs is inherently optimized for global semantic modeling, whereas relation extraction tasks typically require attention to localized syntactic structures and contextual features. Consequently, LLMs tend to overlook critical details necessary for precise relation identification. Second, most existing RE approaches are confined to sentence-level processing, whereas the accurate recognition of many complex relations in real-world applications often depends on broader contextual understanding across multiple sentences. In long documents, for instance, entity relations often span multiple sentences, and the absence of such context severely hinders model performance.
To address these challenges, we propose a graph-based retrieval-augmented generation (GraphRAG) [12] framework specifically designed for multi-sentence instance bags. By constructing bag-level contextual representations, this method enhances the model’s ability to understand cross-sentence semantic relations and effectively leverages document-level context to resolve complex issues such as coreference resolution. In addition, for single-sentence instance bags, we introduce a lightweight context enhancement strategy based on term frequency-inverse document frequency (TF-IDF). This approach enriches training data diversity through controlled augmentation while preserving semantic consistency.
Moreover, a novel relation label refinement strategy is developed to calibrate model predictions via post-processing, thereby improving inference accuracy. This technique is particularly effective in correcting specific types of prediction errors, especially in cases involving ambiguous semantic relations. Finally, we adopt a parameter-efficient fine-tuning (PEFT) framework based on low-rank adaptation (LoRA) [13] to optimize the Qwen2.5-3B/7B-Instruct models. This hybrid approach retains the knowledge embedded in the pretrained models while enabling effective adaptation to task-specific requirements. The main contributions of this paper are summarized as follows:
  • A multi-dimensional relation label refinement strategy that integrates semantic and syntactic similarity has been designed to address the issue of non-standard output formats generated by LLMs.
  • A multi-granularity contextual augmentation framework has been constructed for both multi-instance and single-instance learning scenarios, leveraging the semantic understanding capabilities of LLMs and incorporating LoRA for efficient model fine-tuning.
  • Extensive experiments on datasets from different domains demonstrate that the proposed method consistently enhances the relation extraction performance of large language models.

2. Related Work

With the rapid advancement of large language models (LLMs) in the field of natural language processing, their application to relation extraction (RE) has become an emerging research focus. Current studies on LLM-based relation extraction primarily face two critical challenges.
First, regarding model adaptability, various innovative approaches have been proposed to improve the performance of LLMs in RE tasks. For instance, the ChatIE framework introduced by Wei et al. [14] employs a two-stage prompting strategy that significantly enhances performance in zero-shot scenarios. Similarly, Lou et al. [15] developed a universal structured model (USM) that achieves more effective joint modeling through unified token linkage operations. Despite these advances, LLMs still exhibit notable limitations in relation extraction. Empirical analyses by Han et al. [16] and Li et al. [17] have demonstrated that models such as ChatGPT (gpt-3.5-turbo-0301) consistently underperform compared to fine-tuned BERT-based counterparts when inferring semantic relations between entity pairs. This performance gap is largely attributed to two factors: insufficient representation of domain-specific knowledge and inherent deficiencies in fine-grained semantic reasoning.
Second, in terms of framework generalizability, several unified information extraction frameworks have emerged in recent years [18,19,20]. While these frameworks enable joint modeling across multiple tasks, their performance on dedicated RE tasks still lags behind that of specialized models. To address these challenges, researchers have explored several innovative directions. On the methodological front, Miao et al. [21], inspired by neuroscience, simulated hippocampal mechanisms of pattern separation and completion to generate counterfactual samples, thereby enhancing the model’s commonsense reasoning capabilities. Xue et al. [22] broke through the limitations of traditional relation classification paradigms by proposing a more extensible document-level relation extraction framework. From a technical optimization perspective, Efeoglu et al. [23] employed retrieval-augmented generation (RAG) techniques to mitigate hallucination problems in LLMs, while Chen et al. [24] introduced a context-aware prompt tuning framework that leverages joint contrastive learning to improve domain adaptability.
Nevertheless, empirical studies continue to highlight persistent limitations in current technologies. Evaluations on the GDS, SemEval, and KBP37 datasets reveal that mainstream LLMs—including Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, Baichuan2-7B-Chat, LLaMA3-8B-Instruct, and DeepSeek-V3—consistently perform poorly on sentence-level relation extraction tasks (see Figure 1). These findings align with the evaluation results reported by [16,17], underscoring key technical challenges that remain unresolved in this domain.

3. Methodology

3.1. Task Definition

RE aims to identify and extract semantic relationships between entities within a sentence. In this study, the task is defined following the multi-instance learning (MIL) framework proposed by [25]. Specifically, given a sentence $S = \{x_1, x_2, \ldots, x_n\}$ consisting of $n$ tokens, the primary objective of the RE task is to predict the relation $r$ between an entity pair $(e_1, e_2)$, where $r \in R$ and $R$ denotes a predefined set of relation labels. In this formulation, $e_1$ refers to the head entity (subject), and $e_2$ refers to the tail entity (object). We designed a framework named contextual augmentation via retrieval for multi-granularity relation extraction in LLMs (CARME). The overall architecture is illustrated in Figure 2.
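For concreteness, the following minimal Python sketch shows one way such instances and instance bags could be represented under this formulation; the class and field names are illustrative assumptions, not identifiers from the CARME implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class REInstance:
    """One sentence-level RE example: S, (e1, e2), and the gold relation r."""
    tokens: List[str]   # sentence S = {x_1, ..., x_n}
    head: str           # head entity e1 (subject)
    tail: str           # tail entity e2 (object)
    relation: str       # gold label r drawn from the predefined set R

@dataclass
class InstanceBag:
    """All sentences mentioning the same entity pair (the MIL setting)."""
    head: str
    tail: str
    instances: List[REInstance]

    @property
    def is_single(self) -> bool:
        # single-sentence bags are handled by the sentence-level (SIL) strategy
        return len(self.instances) == 1
```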

3.2. Relation Label Refinement Strategy (RLRS)

When applying LLMs such as Qwen2.5-3B/7B-Instruct, LLaMA3-8B-Instruct, and Baichuan-7B to relation extraction tasks, it was observed that the initial prediction results were significantly affected by noise (as illustrated in Table 1), rendering them unsuitable for direct evaluation.
As shown in Table 1, when LLMs are prompted to generate prediction results directly via prompt engineering, formatting inconsistencies frequently arise, even when the prompt explicitly and strictly specifies that the output must follow the relation types defined in the relation list. Such deviations from the expected format significantly undermine the accuracy and consistency of the relation extraction results, thereby negatively impacting subsequent analyses and downstream applications that rely on these outputs. The following errors are illustrative:
  • In one case, the gold relation label between an entity pair was /people/deceased_person/place_of_death, yet the model output was [John Glover]/place_of_birth, [George Feyer], ?, [Upper East Side], demonstrating a clear deviation from the expected format.
  • In another instance, where the correct label was /people/person/place_of_birth, the model predicted [George Perkins Merrill], place_of_birth, [Androscoggin County/Auburn city], which, while partially correct, still failed to align precisely with the required structure.
These findings indicate that without additional constraints or refinement mechanisms, direct inference from LLMs can yield outputs with varying degrees of structural inconsistency and semantic noise, thus necessitating further methodological interventions.
To address the aforementioned issue, a novel relation label refinement strategy based on multi-dimensional similarity matching is proposed in this study. Taking the GDS dataset as an example, the refined prediction results are illustrated in Figure 3.
Based on the initial predictions generated by the large language model, the candidate relations are filtered using a multi-dimensional similarity matching algorithm. Specifically, the composite similarity score $Y_i$ (incorporating both semantic similarity and character edit distance) between the current prediction and each predefined relation in set $R$ is computed. The relation with the highest composite similarity score is subsequently selected as the final predicted label. The detailed workflow is illustrated in Figure 4.
$$Y_i = \arg\max_{r \in R} \left[ \lambda \cdot \cos_{\mathrm{semantic}}(r, p_i) + (1 - \lambda) \cdot \mathrm{sim}_{\mathrm{edit}}(r, p_i) \right] \quad (1)$$
where $R$ denotes the set of predefined relation labels, $p_i$ represents the model’s original predicted relation, and $\lambda \in [0, 1]$ is a weighting coefficient that controls the balance between semantic similarity and edit distance-based similarity. The term $\mathrm{sim}_{\mathrm{edit}}(\cdot)$ refers to the normalized string similarity derived from the Levenshtein distance, which measures the degree of difference between two strings. The semantic similarity $\cos_{\mathrm{semantic}}(r, p)$ is computed using the cosine similarity between Sentence-BERT (https://github.com/UKPLab/sentence-transformers (accessed on 2 April 2025)) embeddings of the candidate relation $r$ and the model prediction $p$, as defined in Equation (2):
$$\cos_{\mathrm{semantic}}(r, p) = \frac{e_r^{\top} e_p}{\lVert e_r \rVert \cdot \lVert e_p \rVert}, \qquad e_r, e_p \in \mathbb{R}^d \quad (2)$$
where $e_r$ and $e_p$ are $d$-dimensional embedding vectors generated by Sentence-BERT, a BERT-based model specifically designed to produce semantically meaningful sentence-level embeddings. Sentence-BERT addresses the limitations of the original BERT model in sentence-similarity tasks by enabling direct and efficient computation of vector-based semantic comparisons such as cosine similarity.
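A minimal Python sketch of this refinement step is given below, using the sentence-transformers and python-Levenshtein libraries; the encoder checkpoint and the value of $\lambda$ are illustrative assumptions, as neither is specified above.

```python
import numpy as np
import Levenshtein  # pip install python-Levenshtein
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in Sentence-BERT checkpoint

def sim_edit(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]."""
    return 1.0 - Levenshtein.distance(a, b) / max(len(a), len(b), 1)

def refine_label(prediction: str, relations: list, lam: float = 0.7) -> str:
    """Map a free-form model output to the closest predefined relation (Equation (1))."""
    e_p = encoder.encode(prediction)   # embedding of the raw prediction
    e_r = encoder.encode(relations)    # one embedding per candidate label
    cos = e_r @ e_p / (np.linalg.norm(e_r, axis=1) * np.linalg.norm(e_p))
    scores = [lam * c + (1 - lam) * sim_edit(r, prediction)
              for r, c in zip(relations, cos)]
    return relations[int(np.argmax(scores))]
```

Applied to the malformed outputs in Table 1, such a function would map a string like [John Glover]/place_of_birth onto the closest label in the predefined GDS relation list instead of leaving it unevaluable.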

3.3. Multi-Granularity Context Enhancement Framework

In this subsection, a multi-granularity contextual enhancement module is proposed based on sentence-level relation extraction datasets. This module is designed to alleviate the semantic ambiguity and information insufficiency that lead to suboptimal recognition performance by incorporating richer contextual information. To address the different characteristics of the multi-instance learning (MIL) and single-instance learning (SIL) settings, distinct strategies are employed at the document-level and sentence-level granularities. Specifically, the GraphRAG method is adopted for MIL scenarios, while a TF-IDF-based strategy is used in SIL settings to construct and model contextual information.

3.3.1. Document-Level Contextual Enhancement Strategy (for MIL)

This strategy is designed for MIL scenarios, where each instance bag consists of multiple sentence-level instances. Traditional approaches often overlook the potential structured associations among sentences, leading to incomplete semantic expression or redundant information. To overcome this limitation, the idea of retrieval-augmented modeling is introduced, and a GraphRAG-based method is employed to integrate semantically relevant contextual information within each bag using a graph-based structure. This enables more robust relation modeling. The proposed method comprises the following steps:
  • Explicit Structural Modeling: In GraphRAG, all entities appearing in the sentences are treated as nodes in a graph, while edges are constructed based on the sentences containing each entity pair. This enables the structural relationships among sentences to be explicitly captured.
  • Knowledge Retrieval: All edges associated with the target entity pair are retrieved from the constructed knowledge graph. Sentence-BERT is then used to encode both the target sentence and the retrieved edges, and their semantic similarity is computed to rank the relevance of these candidate contextual fragments. Notably, all retrieved content is sourced from the internal corpus, without the introduction of additional external noise.
  • Re-ranking of Retrieved Results: To further refine the retrieved results, the BAAI/bge-reranker-base (https://huggingface.co/BAAI/bge-reranker-base (accessed on 2 April 2025)) model is employed to perform re-ranking. Finally, the top-k most relevant sentences are selected as the enhanced contextual information within the instance bag.
Specifically, given a sentence $S_i = \{x_1, x_2, \ldots, x_n\}$ with the head and tail entities denoted as $(e_1, e_2)$, all sentences $S_j$ ($j \neq i$) that contain the same entity pair are retrieved from the knowledge graph constructed by GraphRAG. The Sentence-BERT encoder is then applied to obtain vector representations of $S_i$ and each retrieved $S_j$. Cosine similarity is computed between these vectors as follows:
$$\cos(S_i, S_j) = \frac{S_i \cdot S_j}{\lVert S_i \rVert \cdot \lVert S_j \rVert} \quad (3)$$
where $S_i$ and $S_j$ denote the encoded vector representations of the original and retrieved sentences, respectively. The ranked list of retrieved sentences is denoted as $S_j^*$.
Next, a fine-grained re-ranking is conducted by applying a re-ranker model to $S_i$ and each $S_j^*$:
$$C_k = \mathrm{Rerank}(S_i, S_j^*) \quad (4)$$
The top-$k$ sentences are selected from the re-ranked list and concatenated with the original sentence $S_i$ to construct the enhanced contextual information within the instance bag, as shown in Equations (5) and (6).
$$C_k = \mathrm{topk}(C_k) \quad (5)$$
$$S_k = \mathrm{concat}(S_i, C_k) \quad (6)$$
The resulting sentence $S_k$ represents the document-level contextually enhanced instance.
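A simplified sketch of this retrieve, rank, and re-rank pipeline appears below. The GraphRAG-constructed graph is reduced to a dictionary mapping entity pairs to the sentences (edges) that contain them; the encoder checkpoint and the default $k$ are illustrative assumptions, while the re-ranker name follows the text above.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # stand-in Sentence-BERT encoder
reranker = CrossEncoder("BAAI/bge-reranker-base")   # re-ranker named in Section 3.3.1

def enhance_bag(target, entity_pair, graph_edges, k=3):
    """Retrieve same-pair sentences, rank, re-rank, and concatenate (Equations (3)-(6))."""
    # Knowledge retrieval: all edges stored for this entity pair, excluding S_i itself
    candidates = [s for s in graph_edges.get(entity_pair, []) if s != target]
    if not candidates:
        return target
    # Coarse ranking by cosine similarity of Sentence-BERT embeddings (Equation (3))
    sims = util.cos_sim(encoder.encode(target), encoder.encode(candidates))[0]
    ranked = [candidates[int(i)] for i in sims.argsort(descending=True)]
    # Fine-grained re-ranking with the cross-encoder (Equation (4))
    scores = reranker.predict([(target, s) for s in ranked])
    top_k = [s for _, s in sorted(zip(scores, ranked), reverse=True)][:k]
    # Concatenate the top-k context with the original sentence (Equations (5) and (6))
    return " ".join([target] + top_k)
```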

3.3.2. Sentence-Level Context Enhancement (for SIL)

A single sentence often fails to provide sufficient semantic support. To address the issue of missing contextual information in sentence-level relation extraction, a TF-IDF-based strategy for sentence-level context enhancement is proposed. Through the TF-IDF process, the target sentence is transformed into the format content1 [head entity] content2 [tail entity] content3, thereby constructing the sentence-level context-enhanced instance $S$.
Finally, the document-level contextual representation $S_k$ and the sentence-level contextual representation $S$ are concatenated to form the multi-granularity context-enhanced input $S_C$, which is subsequently fed into a unified relation classification model for training and prediction.
$$S_C = \mathrm{Concat}(S_k, S) \quad (7)$$
This multi-granularity context enhancement framework leverages the complementary strengths of graph-based structure modeling and sparse vector retrieval, significantly improving classification accuracy and generalization under complex relational scenarios.
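The sketch below illustrates the TF-IDF retrieval step and the final concatenation of Equation (7) with scikit-learn; the number of retrieved neighbors and the plain whitespace concatenation are simplifying assumptions, since the exact enhancement format is dataset-dependent.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_context(target: str, corpus: list, k: int = 2) -> str:
    """Append the k most TF-IDF-similar corpus sentences to the target sentence."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([target] + corpus)  # row 0 is the target
    sims = cosine_similarity(matrix[0], matrix[1:])[0]    # similarity to each corpus sentence
    neighbors = [corpus[i] for i in sims.argsort()[::-1][:k]]
    return " ".join([target] + neighbors)                 # sentence-level instance S

def build_input(s_k: str, s: str) -> str:
    """Multi-granularity input S_C = Concat(S_k, S), as in Equation (7)."""
    return " ".join([s_k, s])
```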

3.4. LoRA Tuning

Fine-tuning large-scale pre-trained language models is a key technique for adapting them to specific tasks or domains. However, with the rapid increase in model sizes, traditional full-parameter fine-tuning becomes infeasible under limited computational resources. According to the official documentation (https://github.com/deepseek-ai/DeepSeek-V3 (accessed on 2 April 2025)), full-parameter fine-tuning of the DeepSeek-V3 671B model requires more than 600 GB of disk space. In addition, the total GPU memory recommended is ≥1400 GB for FP16 precision or ≥700 GB for FP8 precision. Such computational and storage requirements are difficult for general users to meet. Therefore, DeepSeek-V3 is typically used for inference tasks rather than full fine-tuning. As a result, more efficient fine-tuning techniques, such as LoRA and QLoRA, have been proposed. These methods are designed to significantly reduce the number of trainable parameters while maintaining competitive model performance, making them particularly well-suited for resource-constrained scenarios or specialized domains like relation extraction.
In this study, Qwen2.5-7B-Instruct and Qwen2.5-3B-Instruct were selected as the base models for fine-tuning. According to the evaluation results shown in Figure 1 (excluding DeepSeek-V3), these two models demonstrated the most stable and competitive performance. Due to resource limitations in our laboratory, fine-tuning experiments on the DeepSeek-V3 model were not conducted.
To optimize training efficiency under limited resources, low-rank adaptation (LoRA) was employed. Specifically, only the $W_Q$ and $W_V$ matrices of the attention layers in each Transformer decoder block were fine-tuned, as illustrated in Figure 5. During LoRA training, the pretrained weights $W_0$ are frozen, and only the low-rank matrices $B$ and $A$ are optimized. After training, only the parameters of the low-rank matrices $B$ and $A$ need to be saved.
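With the peft library, this configuration can be expressed as in the sketch below; the rank and scaling values are illustrative assumptions, since the exact LoRA hyperparameters are not restated here. Restricting target_modules to q_proj and v_proj corresponds to fine-tuning only the $W_Q$ and $W_V$ matrices.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

lora_config = LoraConfig(
    r=8,                                  # illustrative low rank of matrices A and B
    lora_alpha=16,                        # illustrative scaling factor
    target_modules=["q_proj", "v_proj"],  # adapt only W_Q and W_V in each decoder block
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)  # W_0 stays frozen; only A and B train
model.print_trainable_parameters()          # trainable share is on the order of 0.03%
```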

4. Experiments

4.1. Datasets and Evaluation Metrics

Building upon and extending the pioneering work of [26], this study focuses on two critical domains—news and scientific literature—for comprehensive evaluation of RE models. Three representative benchmark datasets were systematically selected for assessment: GDS [9], SemEval-2010 Task 8 [10], and KBP37 [11]. The detailed data statistics are shown in Table 2.
The GDS [9] dataset is an RE corpus combining manual annotation with distant supervision (DS). It was constructed through expert fine-grained labeling and knowledge base-aligned web data expansion, with multi-stage filtering to ensure data quality.
SemEval-2010 Task 8 (SemEval) [10], originally developed for the 2010 International Workshop on Semantic Evaluation, has become a benchmark resource that has significantly advanced the field of RE. This carefully annotated corpus has inspired numerous cutting-edge methodological advances [19,20,23,24].
The KBP37 [11] dataset is an improved version of the MIML-RE dataset originally introduced by Angeli et al. [27]. It was constructed by integrating documents from the 2010 and 2013 KBP evaluations, with additional annotations derived from July 2013 Wikipedia data used as supplementary corpora.
Evaluation Metrics. Consistent with the evaluation protocol used in DILUIE, macro-average and micro-average F1-scores are adopted as the primary evaluation metrics across all three datasets in our experiments. The macro-average F1-score evaluates performance by calculating the F1-score independently for each relation class and then averaging them, which provides a balanced view regardless of class distribution. In contrast, the micro-average F1-score aggregates the contributions of all classes to compute the average performance, thus placing more emphasis on frequent relation types. By reporting both metrics, we aim to provide a comprehensive assessment of model performance across both common and rare relation types.
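The toy example below contrasts the two metrics with scikit-learn; the labels are invented purely for illustration.

```python
from sklearn.metrics import f1_score

# Invented toy predictions over three relation labels
y_true = ["place_of_birth", "place_of_death", "NA", "NA", "place_of_birth"]
y_pred = ["place_of_birth", "place_of_birth", "NA", "NA", "place_of_birth"]

# Micro-F1 pools all decisions, so frequent relation types dominate
micro = f1_score(y_true, y_pred, average="micro")
# Macro-F1 averages per-class F1-scores, weighting rare types equally
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")  # 0.800 vs 0.600
```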

4.2. Experimental Settings

To comprehensively evaluate the performance of language models with different scales on the relation extraction task, two representative variants from the Qwen2.5 series, namely Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct, were selected as the base models for experimentation. All experiments were conducted on a single NVIDIA A100 GPU server with 40 GB of memory.
The hyperparameters were kept consistent across all three datasets, and the training parameter settings were largely aligned with those reported in [28]. To ensure optimal performance on datasets of varying sizes, the number of training epochs and the checkpoint-saving intervals were adjusted accordingly.
In the direct inference stage, the design of the prompt templates played a crucial role, as it directly influenced the model’s understanding of the task and the quality of the inferred results. A general prompt template was carefully designed and is detailed in Table 3. In our inference experiments, two types of prompts were used for comparison. The primary distinction between them lies in the inclusion or exclusion of relational descriptions. Specifically, the first prompt only contained a basic task instruction along with the input data, aiming to guide the model in performing relation extraction purely based on the textual input. In contrast, the second prompt extended the first by incorporating additional relational description information. This setup was intended to investigate the impact of explicit relation descriptions on the model’s reasoning ability. By comparing the outputs generated under the two different prompt settings, insights were gained into the role of relation descriptions in enhancing extraction performance, thereby providing guidance for future prompt engineering efforts.
During the fine-tuning stage, the design of direct tuning templates plays a critical role in enhancing model performance. Taking the GDS dataset as a case study, a dedicated direct fine-tuning template was designed, as shown in Table 4. This template includes specific formatting requirements for the input data, explicit task instructions, and the expected format of the output. Such a standardized template design allows the model to better capture the relational patterns and features embedded within the GDS dataset during training. The model processes input data in accordance with the defined template and progressively updates its parameters to improve its performance on the relation extraction task. Moreover, the adoption of a unified fine-tuning template contributes to the reproducibility and comparability of experiments, enabling consistent evaluations across varying experimental settings and facilitating reliable comparative analyses.
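As a concrete illustration, the inference prompts of Table 3 can be assembled programmatically; the helper below is a hypothetical sketch based on the template text, not part of any released code.

```python
def build_prompt(sentence, head, tail, relations, descriptions=None):
    """Assemble Std.Prompt or Std.Prompt + Rel.Desc following Table 3."""
    if descriptions:  # Std.Prompt + Rel.Desc variant
        listing = "\n".join(f"{r}: {d}" for r, d in zip(relations, descriptions))
        header = f"Information about each relation category:\n{listing}"
    else:             # Std.Prompt variant
        header = "The listed categories:\n" + ",\n".join(relations)
    return (
        "As a relation extraction expert, please extract the relation "
        "based on the given sentence and entities.\n"
        f"{header}\n"
        f"Input: {sentence} ([{head}], ?, [{tail}])\n"
        "Output strictly according to the format of the given example, "
        "without any additional explanations.\n"
        "The relation must be from the listed categories."
    )
```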

4.3. Baseline Models

BERT [8], a Transformer-based pre-trained language model, utilizes bidirectional context encoding and has become a standard baseline for RE tasks.
LasUIE [18] is a structure-aware generation model that performs information extraction through a three-stage training process.
InstructUIE [20] is an end-to-end, general-purpose IE framework based on natural language instructions. It guides LLMs to adapt to diverse RE tasks.
USM [15] proposes an end-to-end general-purpose IE framework that achieves cross-task joint modeling through unified token linking operations, supporting structured IE under multi-task learning.
DILUIE [19] is a unified IE framework based on the EVA attention mechanism and incremental encoding technology. It employs a context-based learning approach to achieve efficient RE.
CRE-LLM [28] utilizes prompt engineering and PEFT-based adaptation of pre-trained LLMs, combining instruction fine-tuning for relationship reasoning with parameter-efficient optimization for effective end-to-end RE.

4.4. Experimental Results Analysis of Relation Label Refinement Strategy

In this section, a thorough analysis of the experimental results of large language models on the relation extraction task is presented. Using the GDS dataset as a case study, we conducted experiments based on standardized prompt templates, comparing model performance with and without relation label correction. The results are shown in Table 5.
Several key findings can be drawn from these results: Model performance without correction is highly correlated with model size. In the absence of relation label correction, a clear correlation between the number of model parameters and prediction accuracy was observed. Generally, models with fewer parameters exhibited poorer performance. This trend was particularly evident within the same model family. For instance, the Qwen2.5-7B-Instruct model achieved an accuracy 29.84 percentage points higher than its smaller counterpart, Qwen2.5-3B-Instruct, indicating that increasing parameter size within a reasonable range can substantially enhance relation extraction capabilities.
The Baichuan2-7B-Chat model underperformed compared to peers. Notably, the Baichuan2-7B-Chat model demonstrated weaker performance than other models of similar scale. Further analysis suggests that the predictions of this model may have been affected by significant noise fluctuations, resulting in instability during evaluation and ultimately impairing its overall accuracy.
Relation label refinement significantly improves model performance. After applying RLRS, all models exhibited consistent improvements in prediction accuracy. For example, the Qwen2.5-3B-Instruct model achieved an accuracy gain of approximately 40.72 percentage points. This substantial increase strongly supports the effectiveness of the proposed correction strategy, which helps rectify errors in model outputs, leading to more reliable results in practical applications.
DeepSeek-V3 model performance and comparative analysis. The DeepSeek-V3 model, with a total parameter count of 671 billion, achieved the highest accuracy across all experiments. Interestingly, this model showed little difference in performance between corrected and uncorrected outputs. This can be attributed to its large parameter scale, which endows it with a strong capacity to understand task-specific semantics, thereby diminishing the marginal benefit of correction.
However, the computational cost of such a large model must also be considered. Due to its size, DeepSeek-V3 requires significantly more inference time than smaller models. To illustrate this, we compared the prediction accuracy and inference latency of Qwen2.5-7B-Instruct and DeepSeek-V3 under the corrected condition, as shown in Table 6. Although Qwen2.5-7B-Instruct exhibited an accuracy 4.29 percentage points lower than DeepSeek-V3, its inference speed was several times faster, offering a compelling trade-off between performance and efficiency.
Based on the above experimental results, it can be concluded that, in practical application scenarios where time efficiency and computational resources are limited, smaller models can be effectively optimized through techniques such as relation label correction. Although the performance of smaller models may be relatively weak without correction, their predictive accuracy can be significantly improved through well-designed correction strategies. Moreover, their faster inference speed can better meet the real-world requirements for time-sensitive applications. Therefore, it is advisable to select models and optimization strategies according to specific needs and resource constraints in order to achieve efficient and accurate RE.
Building upon the relation label correction strategy, relation descriptions were further introduced and evaluated on the GDS, SemEval, and KBP37 datasets. The corresponding results are presented in Table 7. A thorough analysis of these results leads to the following findings:
Impact of Relation Descriptions on Model Performance. The incorporation of relation descriptions generally led to performance improvements across all three datasets. For instance, the F1-scores of the Qwen2.5-7B-Instruct model increased by 1.57, 11.45, and 1.39 percentage points on GDS, SemEval, and KBP37, respectively. Similarly, the Qwen2.5-3B-Instruct model exhibited F1-score improvements of 1.19 and 7.73 percentage points on GDS and SemEval, respectively. However, on KBP37, the F1-score of the Qwen2.5-3B-Instruct model dropped slightly from 17.80% to 16.09%, possibly due to the limited model capacity and the resulting susceptibility to noise during prediction. Comparable fluctuations were also observed with Baichuan2-7B-Chat and LLaMA3-8B-Instruct on KBP37, which may be attributed to their limited compatibility with structured relation prompts. These observations indicate that while model size is an important factor, the integration of relation descriptions can enhance relation extraction performance in most cases.
Comparison of Models with Similar Parameter Scales. Under the condition of incorporating relation descriptions, a comparative evaluation was conducted among models with similar parameter sizes, including Baichuan2-7B-Chat, LLaMA3-8B-Instruct, and Qwen2.5-7B-Instruct. The Qwen2.5-7B-Instruct model outperformed Baichuan2-7B-Chat by 18.57, 52.06, and 24.74 percentage points on the GDS, SemEval, and KBP37 datasets, respectively. Compared with LLaMA3-8B-Instruct, Qwen2.5-7B-Instruct achieved improvements of 4.09, 11.32, and 7.68 percentage points on the same datasets. These results demonstrate that the Qwen2.5-7B-Instruct model possesses superior capability in understanding and extracting relations.
Balancing Performance and Cost. Although DeepSeek-V3 achieved the best performance across all three datasets, its large parameter scale leads to significantly higher inference latency. Thus, relation label correction strategies and prompt engineering techniques are preferred for smaller models to compensate for their limited reasoning capacity. These findings suggest that architectural design choices may outweigh the benefits of sheer model size. In practice, it is essential to strike a balance between performance and computational cost by selecting suitable models and optimization strategies.

4.5. Overall Advantages and Dataset-Specific Analysis

While considerable performance gains were achieved through correction strategies and prompt optimizations, there remains room for further improvement. To this end, the CARME framework was proposed and applied to the GDS, SemEval, and KBP37 datasets. The results, shown in Table 8, consistently demonstrate that the CARME framework achieves performance gains over existing baselines. Specifically, the CRE-LLM results were re-implemented using the Qwen2.5-7B-Instruct model, while CARME3B and CARME7B refer to models fine-tuned on Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct, respectively. Based on the results reported in Table 8, further insights into the model improvements can be drawn.
Superior Overall Performance.
The CARME7B model achieved state-of-the-art performance, obtaining the highest macro-average F1-scores of 87.12%, 88.51%, and 71.49% on the GDS, SemEval, and KBP37 datasets, respectively. Similarly, the highest micro-average F1-scores of 85.91%, 87.45%, and 69.31% were attained on the same datasets. Notably, in comparison to the previous best-performing baseline, CRE-LLM, macro-average F1-scores improved by 5.59, 1.81, and 1.17 percentage points across the three datasets. Corresponding micro-average F1-score improvements were 5.46, 1.70, and 1.06 percentage points. These results clearly demonstrate that the proposed model outperformed existing methods in relation extraction tasks.
Dataset-Specific Improvements.
  • GDS Dataset: Both variants of the CARME framework significantly outperformed the previous best results reported by DILUIE. Specifically, CARME7B yielded gains of 2.95 and 1.22 percentage points in macro- and micro-average F1-scores, respectively, indicating the effectiveness of the CARME framework in capturing relation patterns specific to the GDS dataset.
  • SemEval Dataset: On top of the strong baseline performance of CRE-LLM, CARME7B further improved micro- and macro-average F1-scores by 1.70 and 1.81 percentage points, respectively. This demonstrates the CARME framework’s ability to consistently enhance model performance and improve extraction accuracy on the SemEval dataset.
  • KBP37 Dataset: On the challenging KBP37 dataset, CARME7B established a new performance benchmark with a macro-average F1-score of 71.49%, representing a gain of 1.17 percentage points over CRE-LLM. This highlights the robustness and effectiveness of the CARME framework in handling complex datasets.
Architectural Advantages. Despite the trainable parameters accounting for less than 0.0331% of the full model parameters (e.g., in Qwen2.5-7B-Instruct), a significant performance improvement was achieved after integrating both document-level and sentence-level contextual enhancement signals. The performance improvements achieved by the CARME framework can be attributed to the following key factors:
  • Multi-Granularity and Multi-Level Context Modeling: By integrating document-level features via GraphRAG and sentence-level signals via TF-IDF, the framework captures both local and global relation patterns, thereby providing richer contextual information for relation extraction.
  • Dynamic Relation Correction: The proposed label refinement strategy effectively addresses noise in the initial predictions of LLMs, thereby improving the overall accuracy of relation extraction.
  • Parameter-Efficient Adaptation: Leveraging the LoRA fine-tuning technique enables the optimization of relation extraction performance while preserving the knowledge encoded in pre-trained models, making it feasible to achieve high performance under limited computational resources.
Despite utilizing 57% fewer parameters than CARME7B, the CARME3B model delivered comparable performance, with a difference of at most 1.9 percentage points in F1-score. Both CARME variants demonstrated particularly strong performance on GDS for complex, long-range relations, with an average improvement of 6.3% over the baseline. These findings suggest that in practical applications, appropriate CARME variants can be selected based on resource availability and performance requirements to achieve a balance between efficiency and effectiveness.

4.6. Analysis of Differences in Experimental Results

Following the overall performance evaluation of the model, a more detailed analysis was conducted on the precision, recall, and F1-score across different relation categories within the GDS, SemEval, and KBP37 datasets. The results of this analysis are presented in Table 9, Table 10, and Table 11, respectively. By examining the model’s performance on individual relation types, we aim to obtain a more comprehensive understanding of its behavior and identify areas that may benefit from targeted optimization.
GDS Dataset. As shown in Table 9, the model exhibited considerable variation in precision across different relation types within the GDS dataset. For instance, the relation /people/deceased_person/place_of_death achieved a precision of 86.66%, a recall of 91.44%, and an F1-score of 88.98%, indicating strong performance in identifying this type of relation. In contrast, the /education/education/institution relation attained slightly lower values, with a precision of 83.79%, recall of 87.11%, and F1-score of 85.42%, though still within an acceptable performance range. However, for the unknown relation type NA, the model yielded a precision of only 74.10%, a recall of 66.45%, and an F1-score of 70.06%, which were substantially lower than those of the well-defined relation categories.
SemEval Dataset. A similar trend was observed in the SemEval dataset, as illustrated in Table 10. Among all relation types, Cause-Effect demonstrated the highest performance, with a precision of 95.96%, recall of 94.21%, and F1-score of 95.08%, suggesting that the model was particularly effective at recognizing this type of relation. In contrast, the Other category, which encompasses a broad range of ambiguous or less-defined relationships, exhibited significantly lower performance, with a precision of 71.66%, recall of 67.40%, and F1-score of 69.47%. These results highlight the model’s limited capacity to handle vague or heterogeneous relation types.
KBP37 Dataset. The performance distribution across relation categories in the KBP37 dataset, shown in Table 11, followed a comparable pattern. The title_of_person relation achieved strong results, with a precision of 95.20%, recall of 86.86%, and F1-score of 90.84%, indicating reliable recognition by the model. Conversely, for the NA relation, the model attained only 57.23% precision, 42.48% recall, and 48.77% F1-score, demonstrating significant performance degradation compared to the defined relations.
Based on the above analysis across all three datasets, it is evident that the CARME model consistently performs worse on unknown or ambiguous relation types, such as NA in GDS and KBP37, or Other in SemEval. The presence of such relation categories increases the complexity and uncertainty of the RE task. The model’s low accuracy on these types may adversely affect the overall extraction performance and limit its generalization and practical applicability.
Therefore, improving the model’s capability to identify unknown relation types remains a key direction for future work. Several potential strategies can be explored, such as enhancing the model architecture to better capture uncertainty and ambiguity, incorporating richer contextual information to facilitate deeper semantic understanding, or employing transfer learning techniques to leverage knowledge from related tasks. Through these efforts, it is expected that the model’s performance on unknown relations can be significantly improved, thereby boosting the overall effectiveness of RE systems in real-world applications.
As shown in Table 8, the multi-granularity and multi-level context enhancement module proposed in this study demonstrates significant effectiveness on distant supervision datasets. A preliminary analysis of the GDS dataset reveals (as shown in Table 12) that approximately 70% of the test instances benefit from the rich contextual information provided by GraphRAG, which contributes significantly to the observed performance improvements.
Ablation experiments conducted on the Qwen2.5-7B-Instruct model show that removing the two types of contexts constructed by GraphRAG and TF-IDF reduces performance on the GDS dataset by 4.72 and 0.83 percentage points, respectively. However, the improvements on the SemEval and KBP37 datasets are not as significant as those on GDS. This can primarily be attributed to two factors: first, the higher proportion of single-instance samples reduces the effectiveness of context modeling, as shown in Table 12; and second, the relatively smaller test set size may introduce greater variability in the evaluation results. These findings suggest that our method is particularly suitable for datasets like GDS, which are derived from distant supervision and contain rich multi-instance structures.

4.7. Case Study

To further evaluate the effectiveness of the proposed CARME model, a case study was conducted on two large language models, Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct. CARME and three baseline systems were assessed; the detailed results are presented in Table 13. Two representative samples were selected from the GDS corpus for in-depth analysis.
As shown in Table 13, direct inference with Qwen2.5-3B-Instruct frequently produced ill-formatted outputs, yielding non-standard relation labels. After the RLRS proposed in this study had been applied, the formatting errors were markedly reduced, and the predicted relations became more structured and consistent. In contrast, Qwen2.5-7B-Instruct exhibited a substantially lower incidence of such formatting issues, indicating greater output stability.
Although CARME fine-tuning was performed on Qwen2.5-3B-Instruct, certain instances still revealed erroneous relation predictions. These findings suggest that CARME retains limitations in precisely identifying and aligning the correct triple entities, and that the current prompt design could benefit from further refinement. Future work should therefore focus on more effective prompt-engineering techniques and enhanced context-construction mechanisms, so as to improve the generalization and accuracy of RE in complex scenarios.

5. Conclusions

This paper addresses three core challenges faced by LLMs in RE: non-standardized prediction labels, insufficient contextual modeling, and high training costs. To this end, we propose a relation extraction framework that integrates label refinement, multi-level contextual augmentation, and parameter-efficient fine-tuning. At the label level, a relation label refinement mechanism based on semantic and syntactic information is designed to standardize the model outputs, thereby improving the consistency and interpretability of predictions. For contextual modeling, a dual semantic enhancement strategy is constructed to accommodate both multi-sentence and single-sentence instance bags. Specifically, a graph-based retrieval approach using GraphRAG is employed for global semantics at the bag level, while a lightweight enhancement method based on TF-IDF is introduced for local sentence-level context, enabling joint modeling of global and local information. In terms of model optimization, the parameter-efficient fine-tuning method LoRA is adopted, allowing efficient adaptation under limited-resource conditions while preserving the knowledge acquired during pretraining. Experimental results demonstrate that the proposed method significantly improves RE performance across multiple domain-specific datasets. Future work will explore cross-document semantic modeling, adaptive context-aware mechanisms, and the application of self-supervised learning to complex relation extraction tasks, aiming to enhance the generalization capability and practical value of the proposed approach in real-world scenarios.

Author Contributions

Conceptualization: D.H. and C.Y.; Methodology: D.H.; Software: D.H.; Validation: D.H., J.L., C.Y., Y.Z. and Y.M.; Investigation: Y.M., J.L. and C.Y.; Data Curation: D.H. and Y.Z.; Writing—Original Draft Preparation: D.H.; Writing—Review and Editing: D.H., L.M., X.L. and C.G.; Visualization: D.H.; Supervision: L.M. All authors have read and agreed to the published version of the manuscript.

Funding

This study is supported by the CIPSC-SMP-Zhipu Large Model Cross-Disciplinary Fund; the Doctoral Research Fund of Zhengzhou University of Light Industry (No. 13501050093); and the “Double Innovation” Special Project-Yunnan Province Science and Technology Small and Medium-sized Enterprises (SMEs) Technology Innovation Fund Project (No. 202404AP110047).

Data Availability Statement

All original contributions of this study are incorporated in this article; any further inquiries may be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. Qwen2.5 technical report. arXiv 2024, arXiv:2412.15115. [Google Scholar] [CrossRef]
  2. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
  3. Liu, A.; Feng, B.; Xue, B.; Wang, B.; Wu, B.; Lu, C.; Zhao, C.; Deng, C.; Zhang, C.; Ruan, C.; et al. Deepseek-v3 technical report. arXiv 2024, arXiv:2412.19437. [Google Scholar]
  4. Li, J.; Tang, T.; Zhao, W.X.; Nie, J.Y.; Wen, J.R. Pre-trained language models for text generation: A survey. ACM Comput. Surv. 2024, 56, 1–39. [Google Scholar] [CrossRef]
  5. Wu, L.; Zheng, Z.; Qiu, Z.; Wang, H.; Gu, H.; Shen, T.; Qin, C.; Zhu, C.; Zhu, H.; Liu, Q.; et al. A survey on large language models for recommendation. World Wide Web 2024, 27, 60. [Google Scholar] [CrossRef]
  6. Bi, Z.; Chen, J.; Jiang, Y.; Xiong, F.; Guo, W.; Chen, H.; Zhang, N. Codekgc: Code language model for generative knowledge graph construction. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2024, 23, 1–16. [Google Scholar] [CrossRef]
  7. Yang, X.; Wang, Z.; Wang, Q.; Wei, K.; Zhang, K.; Shi, J. Large language models for automated q&a involving legal documents: A survey on algorithms, frameworks and applications. Int. J. Web Inf. Syst. 2024, 20, 413–435. [Google Scholar] [CrossRef]
  8. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  9. Jat, S.; Khandelwal, S.; Talukdar, P. Improving distantly supervised relation extraction using word and entity based attention. arXiv 2018, arXiv:1804.06987. [Google Scholar] [CrossRef]
  10. Hendrickx, I.; Kim, S.N.; Kozareva, Z.; Nakov, P.; Séaghdha, D.Ó.; Padó, S.; Pennacchiotti, M.; Romano, L.; Szpakowicz, S. SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations between Pairs of Nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, Uppsala, Sweden, 15–16 July 2010; pp. 33–38. [Google Scholar]
  11. Zhang, D.; Wang, D. Relation classification via recurrent neural network. arXiv 2015, arXiv:1508.01006. [Google Scholar] [CrossRef]
  12. Han, H.; Wang, Y.; Shomer, H.; Guo, K.; Ding, J.; Lei, Y.; Halappanavar, M.; Rossi, R.A.; Mukherjee, S.; Tang, X.; et al. Retrieval-augmented generation with graphs (graphrag). arXiv 2024, arXiv:2501.00309. [Google Scholar] [CrossRef]
  13. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, 25–29 April 2022. [Google Scholar]
  14. Wei, X.; Cui, X.; Cheng, N.; Wang, X.; Zhang, X.; Huang, S.; Xie, P.; Xu, J.; Chen, Y.; Zhang, M.; et al. Zero-shot information extraction via chatting with chatgpt. arXiv 2023, arXiv:2302.10205. [Google Scholar] [CrossRef]
  15. Lou, J.; Lu, Y.; Dai, D.; Jia, W.; Lin, H.; Han, X.; Sun, L.; Wu, H. Universal information extraction as unified semantic matching. In Proceedings of the AAAI conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 13318–13326. [Google Scholar]
  16. Han, R.; Peng, T.; Yang, C.; Wang, B.; Liu, L.; Wan, X. Is information extraction solved by chatgpt? An analysis of performance, evaluation criteria, robustness and errors. arXiv 2023, arXiv:2305.14450. [Google Scholar] [CrossRef]
  17. Li, B.; Fang, G.; Yang, Y.; Wang, Q.; Ye, W.; Zhao, W.; Zhang, S. Evaluating ChatGPT’s information extraction capabilities: An assessment of performance, explainability, calibration, and faithfulness. arXiv 2023, arXiv:2304.11633. [Google Scholar]
  18. Fei, H.; Wu, S.; Li, J.; Li, B.; Li, F.; Qin, L.; Zhang, M.; Zhang, M.; Chua, T.S. Lasuie: Unifying information extraction with latent adaptive structure-aware generative language model. Adv. Neural Inf. Process. Syst. 2022, 35, 15460–15475. [Google Scholar]
  19. Guo, Q.; Guo, Y.; Zhao, J. Diluie: Constructing diverse demonstrations of in-context learning with large language model for unified information extraction. Neural Comput. Appl. 2024, 36, 13491–13512. [Google Scholar] [CrossRef]
  20. Wang, X.; Zhou, W.; Zu, C.; Xia, H.; Chen, T.; Zhang, Y.; Zheng, R.; Ye, J.; Zhang, Q.; Gui, T.; et al. Instructuie: Multi-task instruction tuning for unified information extraction. arXiv 2023, arXiv:2304.08085. [Google Scholar]
  21. Miao, X.; Li, Y.; Zhou, S.; Qian, T. Episodic Memory Retrieval from LLMs: A Neuromorphic Mechanism to Generate Commonsense Counterfactuals for Relation Extraction. In Proceedings of the Findings of the Association for Computational Linguistics ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 2489–2511. [Google Scholar]
  22. Xue, L.; Zhang, D.; Dong, Y.; Tang, J. AutoRE: Document-Level Relation Extraction with Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 11–16 August 2024; pp. 211–220. [Google Scholar]
  23. Efeoglu, S.; Paschke, A. Retrieval-augmented generation-based relation extraction. arXiv 2024, arXiv:2404.13397. [Google Scholar]
  24. Chen, Z.; Li, Z.; Zeng, Y.; Zhang, C.; Ma, H. GAP: A novel Generative context-Aware Prompt-tuning method for relation extraction. Expert Syst. Appl. 2024, 248, 123478. [Google Scholar] [CrossRef]
  25. Surdeanu, M.; Tibshirani, J.; Nallapati, R.; Manning, C.D. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jeju, Republic of Korea, 12–14 July 2012; pp. 455–465. [Google Scholar]
  26. Efeoglu, S.; Paschke, A. Relation extraction with fine-tuned large language models in retrieval augmented generation frameworks. arXiv 2024, arXiv:2406.14745. [Google Scholar]
  27. Angeli, G.; Tibshirani, J.; Wu, J.; Manning, C.D. Combining distant and partial supervision for relation extraction. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1556–1567. [Google Scholar]
  28. Shi, Z.; Luo, H. CRE-LLM: A domain-specific Chinese relation extraction framework with fine-tuned large language model. arXiv 2024, arXiv:2404.18085. [Google Scholar]
Figure 1. The results obtained by directly inferring on the GDS, SemEval, and KBP37 test sets without fine-tuning, followed by the correction of relation labels.
Figure 2. The overview framework of CARME. Here, ① and ② represent the results obtained using the multi-granularity context enhancement framework, respectively.
Figure 3. Zero-shot performance analysis of RLRS on the GDS dataset.
Figure 4. Framework diagram for the relation label refinement strategy.
Figure 5. LoRA computation flowchart based on Qwen2.5-3B/7B-Instruct.
Table 1. Relation prediction results in GDS (before relation label refinement).
Ground Truth Relation | Relation Prediction Results (Before RLRS)
/people/deceased_person/place_of_death | [John Glover]/place_of_birth [George Feyer], ?, [Upper East Side]
/people/person/place_of_birth | [George Perkins Merrill], place_of_birth, [Androscoggin County/Auburn city]
NA | /people/person/place_of_birth; /people/person/education./education/education/institution
/people/person/education./education/education/institution | [Ben Okri], born 15 March 1959, Minna, Nigeria, Nigerian novelist, short-story writer, and poet who used magic realism to convey the social and political chaos in the country [University of Essex] their birth. ([Ben Okri], [University of Essex])
/people/person/education./education/education/institution | [Henry Segerstrom], managing_partner, [Stanford Graduate School of Business]
/people/deceased_person/place_of_death | /people/person/place_of_birth: NA /people/person/place_of_death: NA /people/person/place_of_birth: /people/person/place_of_birth: /people/person/education./education/degree: NA /people/person/education./education/institution: NA
Table 2. Statistics for GDS, SemEval, and KBP37 datasets.
Datasets | Training Instances | Training Entity Pairs | Test Instances | Test Entity Pairs | Relations
GDS | 13,161 | 7580 | 5663 | 3247 | 5
SemEval | 6507 | 6299 | 2717 | 2679 | 10
KBP37 | 15,917 | 12,533 | 3405 | 3288 | 18
Table 3. Prompt templates used in direct inference: standard prompt (Std.Prompt) and prompt with relational descriptions (Std.Prompt + Rel.Desc).
Prompt
Std.Prompt: As a relation extraction expert, please extract the relation based on the given sentence and entities.
The listed categories:
relation1,
relation2,
Example
Input: [Jack Smooth], real name Ron Wells, born in [London] 1970. ([Jack Smooth], ?, [London])
Output: /people/person/place_of_birth
Output strictly according to the format of the given example,
without any additional explanations.
The relation must be from the listed categories.
Std.Prompt + Rel.DescAs a relation extraction expert, please extract the relation based on the given sentence and entities.
Information about each relation category:
relation1: description1
relation2: description2
Example
Input: [Jack Smooth], real name Ron Wells, born in [London] 1970. ([Jack Smooth], ?, [London])
Output: /people/person/place_of_birth
Output strictly according to the format of the given example,
without any additional explanations.
The relation must be from the listed categories.
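The two templates in Table 3 differ only in whether each candidate relation is accompanied by a textual description. The sketch below shows one way such prompts could be assembled programmatically; the function and variable names are our own illustrative choices, not taken from the paper's code.

```python
# Sketch of assembling the two prompt variants from Table 3; names and
# structure are illustrative assumptions, not the authors' implementation.
EXAMPLE = (
    "Input: [Jack Smooth], real name Ron Wells, born in [London] 1970. "
    "([Jack Smooth], ?, [London])\n"
    "Output: /people/person/place_of_birth"
)

RULES = (
    "Output strictly according to the format of the given example, "
    "without any additional explanations.\n"
    "The relation must be from the listed categories."
)

def build_prompt(labels: dict[str, str], with_desc: bool = False) -> str:
    """labels maps each relation name to its textual description."""
    head = ("As a relation extraction expert, please extract the relation "
            "based on the given sentence and entities.")
    if with_desc:  # Std.Prompt + Rel.Desc
        body = "Information about each relation category:\n" + "\n".join(
            f"{name}: {desc}" for name, desc in labels.items())
    else:          # Std.Prompt: list only the category names
        body = "The listed categories:\n" + ",\n".join(labels)
    return "\n".join([head, body, "Example", EXAMPLE, RULES])

print(build_prompt({"/people/person/place_of_birth": "birthplace of a person"}))
```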
Table 4. Fine-tuning template for GDS.
| Attribute | Content |
| --- | --- |
| Instruction | Please extract the relationship based on the given sentence and entities. |
| Input | Sir [John Myres] Linton Myres (3 July 1869 in [Preston]—6 March 1954 in Oxford) was a British archaeologist. He conducted excavations in Cyprus in 1904. He became the first Wykeham Professor of Ancient History at the University of Oxford in 1910, having been Gladstone Professor of Greek and Lecturer in Ancient Geography, University of Liverpool from 1907… ([John Myres], ?, [Preston]) |
| Output | ([John Myres], /people/person/place_of_birth, [Preston]) |
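An instruction/input/output triple like the one in Table 4 is typically serialized as one JSONL record per training instance for supervised fine-tuning. The sketch below follows the common instruction-tuning convention; the field names and file name are assumptions, and the paper's exact data format may differ.

```python
# Sketch: serializing Table 4's attributes into a JSONL record for
# supervised fine-tuning. Field names follow the common instruction-tuning
# convention; the input text is truncated here with "..." for brevity.
import json

record = {
    "instruction": ("Please extract the relationship based on the given "
                    "sentence and entities."),
    "input": ("Sir [John Myres] Linton Myres (3 July 1869 in [Preston]...) "
              "([John Myres], ?, [Preston])"),
    "output": "([John Myres], /people/person/place_of_birth, [Preston])",
}

# Append the record as one line of a JSONL training file (hypothetical path).
with open("gds_train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```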
Table 5. (%) Accuracy using standard prompts with and without RLRS.
| Model/Dataset | GDS (w/o RLRS) | GDS (RLRS) | SemEval (w/o RLRS) | SemEval (RLRS) | KBP37 (w/o RLRS) | KBP37 (RLRS) |
| --- | --- | --- | --- | --- | --- | --- |
| Baichuan2-7B-Chat | 22.20 | 47.77 | 5.23 | 23.11 | 2.56 | 12.69 |
| Qwen2.5-3B-Instruct | 12.73 | 53.45 | 27.71 | 33.82 | 7.11 | 16.68 |
| LLaMA3-8B-Instruct | 46.51 | 64.06 | 31.28 | 33.49 | 18.47 | 26.43 |
| Qwen2.5-7B-Instruct | 42.57 | 65.30 | 47.22 | 47.81 | 27.22 | 28.99 |
| DeepSeek-V3 | 69.59 | 69.59 | 59.77 | 59.77 | 42.67 | 42.79 |
Table 6. Direct inference efficiency comparison on the GDS dataset.
| Model | Accuracy (%) | Throughput (inst/s) |
| --- | --- | --- |
| Qwen2.5-7B-Instruct | 65.30 | 0.23 |
| DeepSeek-V3 | 69.59 | 4.62 |
Note: Throughput measured in instances processed per second (inst/s).
Table 7. Direct inference performance comparison across LLMs on GDS, SemEval, and KBP37 datasets. “Std.Prompt” indicates standard prompt usage while “Rel.Desc” denotes prompts augmented with relation descriptions.
| Model/Dataset | GDS (Std.Prompt) | SemEval (Std.Prompt) | KBP37 (Std.Prompt) | GDS (Std.Prompt + Rel.Desc) | SemEval (Std.Prompt + Rel.Desc) | KBP37 (Std.Prompt + Rel.Desc) |
| --- | --- | --- | --- | --- | --- | --- |
| Baichuan2-7B-Chat | 45.29 | 20.22 | 13.55 | 47.47 | 5.69 | 5.75 |
| Qwen2.5-3B-Instruct | 53.69 | 31.26 | 17.80 | 54.88 | 38.99 | 16.09 |
| LLaMA3-8B-Instruct | 59.29 | 32.55 | 25.91 | 61.95 | 46.43 | 22.81 |
| Qwen2.5-7B-Instruct | 64.47 | 46.30 | 29.10 | 66.04 | 57.75 | 30.49 |
| DeepSeek-V3 | 70.83 | 60.19 | 48.47 | 71.11 | 65.89 | 48.86 |
Table 8. (%) Micro-/Macro-F1 score comparison of our method with other baselines on the GDS, SemEval, and KBP37 datasets.
| Model/Dataset | GDS (Micro-F1) | SemEval (Micro-F1) | KBP37 (Micro-F1) | GDS (Macro-F1) | SemEval (Macro-F1) | KBP37 (Macro-F1) |
| --- | --- | --- | --- | --- | --- | --- |
| BERT [8] | 78.63 | 84.03 | 67.49 | 79.52 | 84.87 | 69.50 |
| LasUIE [18] | 81.26 | 74.57 | 36.10 | 79.54 | 74.99 | 36.19 |
| USM [15] | 80.37 | 74.38 | 36.08 | 80.49 | 74.97 | 36.96 |
| InstructUIE [20] | 81.98 | 73.23 | 36.14 | 81.31 | 73.49 | 37.28 |
| DILUIE [19] | 84.69 | 75.47 | 35.47 | 84.17 | 75.68 | 36.09 |
| CRE-LLM [28] | 80.45 | 85.75 | 68.25 | 81.53 | 86.70 | 70.32 |
| CARME3B (Ours) | 85.77 | 85.46 | 69.16 | 86.82 | 86.61 | 70.89 |
| CARME7B (Ours) | 85.91 | 87.45 | 69.31 | 87.12 | 88.51 | 71.49 |
The experimental results of CRE-LLM were obtained through our re-implementation using the Qwen2.5-7B-Instruct model. CARME3B and CARME7B denote our fine-tuned models based on Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct, respectively.
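For reference, Micro-F1 pools all instance-level decisions before computing the score, whereas Macro-F1 averages per-class F1 so that small classes count equally. The snippet below shows the standard computation with scikit-learn on toy label sequences, not our actual predictions.

```python
# Sketch of computing the Micro-/Macro-F1 scores reported in Table 8 with
# scikit-learn; y_true/y_pred are toy sequences for illustration only.
from sklearn.metrics import f1_score

y_true = ["place_of_birth", "place_of_death", "NA", "NA", "institution"]
y_pred = ["place_of_birth", "NA",             "NA", "NA", "institution"]

micro = f1_score(y_true, y_pred, average="micro")  # pools all decisions
macro = f1_score(y_true, y_pred, average="macro")  # unweighted class mean
print(f"Micro-F1: {micro:.4f}, Macro-F1: {macro:.4f}")
```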
Table 9. (%) Precision, recall, and F1-score of different relation categories on GDS.
| Relation | Precision | Recall | Macro-F1 | Support |
| --- | --- | --- | --- | --- |
| /people/deceased_person/place_of_death | 86.66 | 91.44 | 88.98 | 1016 |
| /people/person/education./education/education/degree | 96.30 | 99.11 | 97.68 | 894 |
| /people/person/education./education/education/institution | 83.79 | 87.11 | 85.42 | 1365 |
| /people/person/place_of_birth | 92.66 | 93.02 | 92.84 | 1032 |
| NA | 74.10 | 66.45 | 70.06 | 1356 |
Table 10. (%) Precision, recall, and F1-score of different relation categories on SemEval.
| Relation | Precision | Recall | Macro-F1 | Support |
| --- | --- | --- | --- | --- |
| Cause-Effect | 95.96 | 94.21 | 95.08 | 328 |
| Component-Whole | 88.40 | 90.38 | 89.38 | 312 |
| Content-Container | 90.82 | 92.71 | 91.75 | 192 |
| Entity-Destination | 91.75 | 95.21 | 93.45 | 292 |
| Entity-Origin | 90.48 | 88.37 | 89.41 | 258 |
| Instrument-Agency | 94.81 | 82.05 | 87.97 | 156 |
| Member-Collection | 84.06 | 90.56 | 87.19 | 233 |
| Message-Topic | 89.89 | 95.40 | 92.57 | 261 |
| Product-Producer | 88.09 | 89.61 | 88.84 | 231 |
| Other | 71.66 | 67.40 | 69.47 | 454 |
Table 11. (%) Precision, recall, and F1-score of different relation categories on KBP37.
| Relation | Precision | Recall | Macro-F1 | Support |
| --- | --- | --- | --- | --- |
| alternate names | 75.16 | 69.01 | 71.95 | 171 |
| cities of residence | 71.07 | 80.92 | 75.68 | 173 |
| city of headquarters | 77.19 | 86.56 | 81.61 | 305 |
| countries of residence | 56.63 | 65.79 | 60.87 | 266 |
| country of birth | 57.81 | 41.57 | 48.37 | 89 |
| country of headquarters | 87.67 | 87.28 | 87.47 | 228 |
| employee of | 61.90 | 70.07 | 65.73 | 568 |
| founded | 85.71 | 89.72 | 87.67 | 107 |
| founded by | 68.12 | 58.75 | 63.09 | 80 |
| members | 63.45 | 57.50 | 60.33 | 160 |
| origin | 85.71 | 73.85 | 79.34 | 65 |
| spouse | 88.14 | 91.23 | 89.66 | 57 |
| state or province of headquarters | 76.69 | 80.95 | 78.76 | 126 |
| state or provinces of residence | 64.39 | 68.00 | 66.15 | 125 |
| subsidiaries | 59.31 | 62.69 | 60.96 | 193 |
| title of person | 95.20 | 86.86 | 90.84 | 137 |
| top members employees | 74.17 | 65.44 | 69.53 | 136 |
| NA | 57.23 | 42.48 | 48.77 | 419 |
Table 12. Proportion of the GDS, SemEval, and KBP37 training sets used in constructing GraphRAG.
| | GDS | SemEval | KBP37 |
| --- | --- | --- | --- |
| Instances in GraphRAG | 8978 | 437 | 6147 |
| Total instances (training set) | 13,161 | 6507 | 15,917 |
| Proportion | 68.2% | 6.7% | 38.6% |
Table 13. Case study on the GDS dataset. ✗ denotes an incorrect prediction, whereas ✓ denotes a correct prediction. 3B/7B denote the Qwen2.5-3B/7B-Instruct models, respectively.

Examples:
Example 1: [David Phinney] graduated from the [University of California] at Berkeley with a dual major in English and Political Science. He began their journalism career as a music and film critic and by writing frequent humorous sketches about local personalities and events in the San Francisco Bay Area.
Triplet 1: ([David Phinney], NA, [University of California])
Example 2: this project, two BRT stations [Will Ransom] be installed on Ransom Avenue near Crescent [St]reet. The first phase of construction near the Medical Center has begun and runs through June.
Triplet 2: ([Will Ransom], NA, [St])

Std.Prompt (3B):
Example 1: ([David Phinney], /people/person/education) (✗)
Example 2: (this project, two BRT stations [Will Ransom] be installed on Ransom…) (✗)

Std.Prompt + Desc. (3B):
Example 1: ([David Phinney], /people/person/education, [University of California]) (✗)
Example 2: (this project, two BRT stations [Will Ransom], be installed on Ransom…) (✗)

RLRS (3B):
Example 1: ([David Phinney], /people/person/education./education/education/institution, [University of California]) (✗)
Example 2: ([Will Ransom], NA, [St]) (✓)

CARME3B:
Example 1: ([David Phinney], /people/person/education./education/education/institution, [University of California]) (✗)
Example 2: ([Will Ransom], NA, [St]) (✓)

Std.Prompt + Desc. (7B):
Example 1: ([David Phinney], /people/person/education./education/education/institution, [University of California]) (✗)
Example 2: ([Will Ransom], NA, [St]) (✓)

RLRS (7B):
Example 1: ([David Phinney], /people/person/education./education/education/institution, [University of California]) (✗)
Example 2: ([Will Ransom], NA, [St]) (✓)

CARME7B:
Example 1: ([David Phinney], NA, [University of California]) (✓)
Example 2: ([Will Ransom], NA, [St]) (✓)