4.1. Datasets and Evaluation Metrics
Building upon and extending the pioneering work of [26], this study focuses on two critical domains, news and scientific literature, for comprehensive evaluation of RE models. Three representative benchmark datasets were systematically selected for assessment: GDS [9], SemEval-2010 Task 8 [10], and KBP37 [11]. The detailed data statistics are shown in Table 2.
The GDS [9] dataset is an RE corpus that combines manual annotation with distant supervision (DS); it was constructed through expert fine-grained labeling and knowledge base-aligned web data expansion, with multi-stage filtering to ensure data quality.
SemEval-2010 Task 8 (SemEval) [10], originally developed for the 2010 International Workshop on Semantic Evaluation, has become a benchmark resource that has significantly advanced the field of RE. This carefully annotated corpus has inspired numerous cutting-edge methodological advances [19,20,23,24].
The KBP37 [11] dataset is an improved version of the MIML-RE dataset originally introduced by Gabor Angeli et al. [27]. It was constructed by integrating documents from the 2010 and 2013 KBP evaluations, with additional annotations derived from July 2013 Wikipedia data used as a supplementary corpus.
Evaluation Metrics. Consistent with the evaluation protocol used in DILUIE, macro-average and micro-average F1-scores are adopted as the primary evaluation metrics across all three datasets in our experiments. The macro-average F1-score evaluates performance by calculating the F1-score independently for each relation class and then averaging them, which provides a balanced view regardless of class distribution. In contrast, the micro-average F1-score aggregates the contributions of all classes to compute the average performance, thus placing more emphasis on frequent relation types. By reporting both metrics, we aim to provide a comprehensive assessment of model performance across both common and rare relation types.
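To make the distinction between the two metrics concrete, the following sketch computes both averages with scikit-learn; the label names are illustrative placeholders rather than the full relation inventories of the three datasets.

```python
from sklearn.metrics import f1_score

# Illustrative gold and predicted relation labels; the real evaluation uses
# the full relation inventories of GDS, SemEval, and KBP37.
gold = ["Cause-Effect", "Cause-Effect", "Component-Whole", "Other", "Other", "Other"]
pred = ["Cause-Effect", "Other",        "Component-Whole", "Other", "Other", "Other"]

# Macro-average: F1 is computed per relation class and then averaged,
# so rare classes weigh as much as frequent ones.
macro_f1 = f1_score(gold, pred, average="macro")

# Micro-average: true/false positives and negatives are pooled across all
# classes before computing F1, so frequent classes dominate.
micro_f1 = f1_score(gold, pred, average="micro")

print(f"macro-F1 = {macro_f1:.4f}, micro-F1 = {micro_f1:.4f}")
```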
4.2. Experimental Settings
To comprehensively evaluate the performance of language models with different scales on the relation extraction task, two representative variants from the Qwen2.5 series, namely Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct, were selected as the base models for experimentation. All experiments were conducted on a single NVIDIA A100 GPU server with 40 GB of memory.
The hyperparameters were kept consistent across all three datasets, and the detailed settings used during training are provided; they were largely aligned with those reported in [28]. To ensure optimal performance on datasets of varying sizes, the number of training epochs and the checkpoint-saving intervals were adjusted accordingly.
In the direct inference stage, the design of the prompt templates played a crucial role, as it directly influenced the model’s understanding of the task and the quality of the inferred results. A general prompt template was carefully designed and is detailed in Table 3. In our inference experiments, two types of prompts were used for comparison; the primary distinction between them lies in the inclusion or exclusion of relation descriptions. Specifically, the first prompt contained only a basic task instruction along with the input data, guiding the model to perform relation extraction purely from the textual input. The second prompt extended the first by incorporating additional relation description information, in order to investigate the impact of explicit relation descriptions on the model’s reasoning ability. By comparing the outputs generated under the two prompt settings, insights were gained into the role of relation descriptions in enhancing extraction performance, thereby providing guidance for future prompt engineering efforts.
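As a rough illustration of the two settings, the sketch below assembles both prompt variants; the actual instruction wording and relation descriptions are those given in Table 3, so the strings and the small label subset used here are simplified placeholders.

```python
# Simplified sketch of the two inference prompt variants; the authoritative
# template is the one given in Table 3 of the paper.
RELATION_LABELS = [  # illustrative subset of a relation inventory
    "/people/deceased_person/place_of_death",
    "/education/education/institution",
    "NA",
]

RELATION_DESCRIPTIONS = {  # hypothetical, shortened descriptions
    "/people/deceased_person/place_of_death": "the location where the head entity died",
    "/education/education/institution": "the institution where the head entity studied",
    "NA": "no relation from the label set holds between the two entities",
}

def build_prompt(sentence: str, head: str, tail: str, with_descriptions: bool) -> str:
    """Build an inference prompt with or without relation descriptions."""
    lines = [
        "Task: choose the relation that holds between the two marked entities.",
        f"Candidate relations: {', '.join(RELATION_LABELS)}",
    ]
    if with_descriptions:  # the only difference between the two prompt types
        lines.append("Relation descriptions:")
        lines += [f"- {r}: {d}" for r, d in RELATION_DESCRIPTIONS.items()]
    lines += [f"Sentence: {sentence}", f"Head entity: {head}", f"Tail entity: {tail}",
              "Answer with exactly one relation label."]
    return "\n".join(lines)

prompt_plain = build_prompt("Mark Twain died in Redding.", "Mark Twain", "Redding",
                            with_descriptions=False)
prompt_rich = build_prompt("Mark Twain died in Redding.", "Mark Twain", "Redding",
                           with_descriptions=True)
```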
During the fine-tuning stage, the design of direct tuning templates plays a critical role in enhancing model performance. Taking the GDS dataset as a case study, a dedicated direct fine-tuning template was designed, as shown in Table 4. This template includes specific formatting requirements for the input data, explicit task instructions, and the expected format of the output. Such a standardized template design allows the model to better capture the relational patterns and features embedded within the GDS dataset during training. The model processes input data in accordance with the defined template and progressively updates its parameters to improve its performance on the relation extraction task. Moreover, the adoption of a unified fine-tuning template contributes to the reproducibility and comparability of experiments, enabling consistent evaluations across varying experimental settings and facilitating reliable comparative analyses.
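As a rough sketch of what such a template-driven training record can look like, the example below packs one GDS instance into an instruction/input/output triple; the field names and wording are illustrative assumptions, since the authoritative template is the one given in Table 4.

```python
import json

def to_training_record(sentence: str, head: str, tail: str, relation: str) -> dict:
    """Format one GDS instance as an instruction-tuning record (illustrative layout)."""
    return {
        "instruction": ("Given a sentence and two marked entities, output the relation "
                        "between them using exactly one label from the GDS relation set."),
        "input": f"Sentence: {sentence}\nHead entity: {head}\nTail entity: {tail}",
        "output": relation,  # the gold relation label is the supervision target
    }

record = to_training_record(
    sentence="Mark Twain died in Redding, Connecticut.",
    head="Mark Twain",
    tail="Redding",
    relation="/people/deceased_person/place_of_death",
)
print(json.dumps(record, indent=2))
```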
4.4. Experimental Results Analysis of Relation Label Refinement Strategy
In this section, a thorough analysis of the experimental results of large language models on the relation extraction task is presented. Using the GDS dataset as a case study, we conducted experiments based on standardized prompt templates, comparing model performance with and without relation label correction. The results are shown in Table 5.
Several key findings can be drawn from these results:
Model performance without correction is highly correlated with model size. In the absence of relation label correction, a clear correlation between the number of model parameters and prediction accuracy was observed. Generally, models with fewer parameters exhibited poorer performance. This trend was particularly evident within the same model family. For instance, the Qwen2.5-7B-Instruct model achieved 29.84% higher accuracy than its smaller counterpart, Qwen2.5-3B-Instruct, indicating that increasing parameter size within a reasonable range can substantially enhance relation extraction capabilities.
The Baichuan2-7B-Chat model underperformed compared to peers. Notably, the Baichuan2-7B-Chat model demonstrated weaker performance than other models of similar scale. Further analysis suggests that the predictions of this model may have been affected by significant noise fluctuations, resulting in instability during evaluation and ultimately impairing its overall accuracy.
Relation label refinement significantly improves model performance. After applying RLRS, all models exhibited consistent improvements in prediction accuracy. For example, the Qwen2.5-3B-Instruct model achieved an accuracy gain of approximately 40.72%. This substantial increase strongly supports the effectiveness of the proposed correction strategy, which helps rectify errors in model outputs, leading to more reliable results in practical applications.
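The exact refinement procedure is defined earlier in the paper; purely as an illustration of the general idea of mapping a free-form model output back onto the closed relation set, one could use a nearest-label match such as the following sketch (the normalization and similarity choices here are assumptions, not the paper's algorithm).

```python
import difflib

GDS_RELATIONS = [  # illustrative subset of the valid label set
    "/people/deceased_person/place_of_death",
    "/education/education/institution",
    "NA",
]

def refine_label(raw_output: str, label_set=GDS_RELATIONS) -> str:
    """Map a possibly noisy model output onto the closest valid relation label."""
    candidate = raw_output.strip()
    if candidate in label_set:  # already a valid label, keep it as-is
        return candidate
    # Otherwise fall back to fuzzy string matching against the closed label set.
    matches = difflib.get_close_matches(candidate, label_set, n=1, cutoff=0.0)
    return matches[0] if matches else "NA"

# A verbose model answer is mapped back to the nearest label in the set.
print(refine_label("The relation is /people/deceased_person/place_of_death."))
```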
DeepSeek-V3 model performance and comparative analysis. The DeepSeek-V3 model, with a total parameter count of 671 billion, achieved the highest accuracy across all experiments. Interestingly, this model showed little difference in performance between corrected and uncorrected outputs. This can be attributed to its large parameter scale, which endows it with a strong capacity to understand task-specific semantics, thereby diminishing the marginal benefit of correction.
However, the computational cost of such a large model must also be considered. Due to its size, DeepSeek-V3 requires significantly more inference time than smaller models. To illustrate this, we compared the prediction accuracy and inference latency of Qwen2.5-7B-Instruct and DeepSeek-V3 under the corrected condition, as shown in Table 6. Although Qwen2.5-7B-Instruct exhibited 4.29% lower accuracy than DeepSeek-V3, its inference was several times faster, offering a compelling trade-off between performance and efficiency.
Based on the above experimental results, it can be concluded that, in practical application scenarios where time efficiency and computational resources are limited, smaller models can be effectively optimized through techniques such as relation label correction. Although the performance of smaller models may be relatively weak without correction, their predictive accuracy can be significantly improved through well-designed correction strategies. Moreover, their faster inference speed can better meet the real-world requirements for time-sensitive applications. Therefore, it is advisable to select models and optimization strategies according to specific needs and resource constraints in order to achieve efficient and accurate RE.
Building upon the relation label correction strategy, relation descriptions were further introduced and evaluated on the GDS, SemEval, and KBP37 datasets. The corresponding results are presented in Table 7. A thorough analysis of these results leads to the following findings:
Impact of Relation Descriptions on Model Performance. The incorporation of relation descriptions generally led to performance improvements across all three datasets. For instance, the F1-scores of the Qwen2.5-7B-Instruct model increased by 1.57%, 11.45%, and 1.39% on GDS, SemEval, and KBP37, respectively. Similarly, the Qwen2.5-3B-Instruct model exhibited F1-score improvements of 1.19% and 7.73% on GDS and SemEval, respectively. However, on KBP37, the F1-score of the Qwen2.5-3B-Instruct model dropped slightly from 17.80% to 16.09%, possibly due to the limited model capacity and the resulting susceptibility to noise during prediction. Comparable fluctuations were also observed with Baichuan2-7B-Chat and LLaMA3-8B-Instruct on KBP37, which may be attributed to their limited compatibility with structured relation prompts. These observations indicate that while model size is an important factor, the integration of relation descriptions can enhance relation extraction performance in most cases.
Comparison of Models with Similar Parameter Scales. Under the condition of incorporating relation descriptions, a comparative evaluation was conducted among models with similar parameter sizes, including Baichuan2-7B-Chat, LLaMA3-8B-Instruct, and Qwen2.5-7B-Instruct. The Qwen2.5-7B-Instruct model outperformed Baichuan2-7B-Chat by 18.57%, 52.06%, and 24.74% on the GDS, SemEval, and KBP37 datasets, respectively. Compared with LLaMA3-8B-Instruct, Qwen2.5-7B-Instruct achieved improvements of 4.09%, 11.32%, and 7.68% on the same datasets. These results demonstrate that the Qwen2.5-7B-Instruct model possesses superior capability in understanding and extracting relations.
Balancing Performance and Cost. Although DeepSeek-V3 achieved the best performance across all three datasets, its large parameter scale leads to significantly higher inference latency. Thus, for smaller models, relation label correction strategies and prompt engineering techniques are the preferred means of compensating for limited reasoning capacity. These findings suggest that architectural design choices may outweigh the benefits of sheer model size. In practice, it is essential to strike a balance between performance and computational cost by selecting suitable models and optimization strategies.
4.5. Overall Advantages and Dataset-Specific Analysis
While considerable performance gains were achieved through correction strategies and prompt optimizations, there remains room for further improvement. To this end, the CARME framework was proposed and applied to the GDS, SemEval, and KBP37 datasets. The results, shown in Table 8, consistently demonstrate that the CARME framework achieves performance gains over existing baselines. Specifically, the CRE-LLM results were re-implemented using the Qwen2.5-7B-Instruct model, while CARME3B and CARME7B refer to models fine-tuned from Qwen2.5-3B-Instruct and Qwen2.5-7B-Instruct, respectively. Based on the results reported in Table 8, further insights into the model improvements can be drawn.
Superior Overall Performance.
The CARME7B model achieved state-of-the-art performance, obtaining the highest macro-average F1-scores of 87.12%, 88.51%, and 71.49% on the GDS, SemEval, and KBP37 datasets, respectively. Similarly, the highest micro-average F1-scores of 85.91%, 87.45%, and 69.31% were attained on the same datasets. Notably, in comparison to the previous best-performing baseline, CRE-LLM, macro-average F1-scores improved by 5.59%, 1.81%, and 1.17% across the three datasets. Corresponding micro-average F1-score improvements were 5.46%, 1.70%, and 1.06%. These results clearly demonstrate that the proposed model outperformed existing methods in relation extraction tasks.
Dataset-Specific Improvements.
GDS Dataset: Both variants of the CARME framework significantly outperformed the previous best results reported by DILUIE. Specifically, CARME7B yielded an increase of 2.95 percentage points in both macro- and micro-average F1-scores, indicating the effectiveness of the CARME framework in capturing relation patterns specific to the GDS dataset.
SemEval Dataset: On top of the strong baseline performance of CRE-LLM, CARME7B further improved micro- and macro-average F1-scores by 1.70% and 1.81%, respectively. This demonstrates the CARME framework’s ability to consistently enhance model performance and improve extraction accuracy on the SemEval dataset.
KBP37 Dataset: On the challenging KBP37 dataset, CARME7B established a new performance benchmark with a macro-average F1-score of 71.49%, representing a 1.17% gain over CRE-LLM. This highlights the robustness and effectiveness of the CARME framework in handling complex datasets.
Architectural Advantages. Despite the trainable parameters accounting for less than 0.0331% of the full model parameters (e.g., in Qwen2.5-7B-Instruct), a significant performance improvement was achieved after integrating both document-level and sentence-level contextual enhancement signals. The performance improvements achieved by the CARME framework can be attributed to the following key factors:
Multi-Granularity and Multi-Level Context Modeling: By integrating document-level features via GraphRAG and sentence-level signals via TF-IDF, the framework captures both local and global relation patterns, thereby providing richer contextual information for relation extraction.
Dynamic Relation Correction: The proposed label refinement strategy effectively addresses noise in the initial predictions of LLMs, thereby improving the overall accuracy of relation extraction.
Parameter-Efficient Adaptation: Leveraging the LoRA fine-tuning technique enables the optimization of relation extraction performance while preserving the knowledge encoded in pre-trained models, making it feasible to achieve high performance under limited computational resources.
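To make the parameter-efficiency point concrete, the following sketch shows how a LoRA adapter can be attached to the 7B backbone with the Hugging Face peft library and how the trainable-parameter fraction can be checked; the rank, target modules, and other values are illustrative assumptions, not the exact configuration used in the experiments.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the backbone; its original weights stay frozen under LoRA fine-tuning.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Illustrative LoRA configuration; the actual rank, alpha, and target modules
# used in the paper's experiments may differ.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)

# Report how small the trainable fraction is relative to the full model.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.4f}%)")
```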
Despite utilizing 57% fewer parameters compared to CARME7B, the CARME3B model delivered comparable performance, with only a 1.9% difference in F1-score. Both CARME variants demonstrated particularly strong performance on GDS for complex, long-range relations, with an average improvement of 6.3% over the baseline. These findings suggest that in practical applications, appropriate CARME variants can be selected based on resource availability and performance requirements to achieve a balance between efficiency and effectiveness.
4.6. Analysis of Differences in Experimental Results
Following the overall performance evaluation of the model, a more detailed analysis was conducted on the precision, recall, and F1-score across different relation categories within the GDS, SemEval, and KBP37 datasets. The results of this analysis are presented in Table 9, Table 10, and Table 11, respectively. By examining the model’s performance on individual relation types, we aim to obtain a more comprehensive understanding of its behavior and identify areas that may benefit from targeted optimization.
GDS Dataset. As shown in Table 9, the model exhibited considerable variation in precision across different relation types within the GDS dataset. For instance, the relation /people/deceased_person/place_of_death achieved a precision of 86.66%, a recall of 91.44%, and an F1-score of 88.98%, indicating strong performance in identifying this type of relation. In contrast, the /education/education/institution relation attained slightly lower values, with a precision of 83.79%, recall of 87.11%, and F1-score of 85.42%, though still within an acceptable performance range. However, for the unknown relation type NA, the model yielded a precision of only 74.10%, a recall of 66.45%, and an F1-score of 70.06%, which were substantially lower than those of the well-defined relation categories.
SemEval Dataset. A similar trend was observed in the SemEval dataset, as illustrated in Table 10. Among all relation types, Cause-Effect demonstrated the highest performance, with a precision of 95.96%, recall of 94.21%, and F1-score of 95.08%, suggesting that the model was particularly effective at recognizing this type of relation. In contrast, the Other category, which encompasses a broad range of ambiguous or less-defined relationships, exhibited significantly lower performance, with a precision of 71.66%, recall of 67.40%, and F1-score of 69.47%. These results highlight the model’s limited capacity to handle vague or heterogeneous relation types.
KBP37 Dataset. The performance distribution across relation categories in the KBP37 dataset, shown in Table 11, followed a comparable pattern. The title_of_person relation achieved strong results, with a precision of 95.20%, recall of 86.86%, and F1-score of 90.84%, indicating reliable recognition by the model. Conversely, for the NA relation, the model attained only 57.23% precision, 42.48% recall, and 48.77% F1-score, demonstrating significant performance degradation compared to the defined relations.
Based on the above analysis across all three datasets, it is evident that the CARME model consistently performs worse on unknown or ambiguous relation types, such as NA in GDS and KBP37, or Other in SemEval. The presence of such relation categories increases the complexity and uncertainty of the RE task. The model’s low accuracy on these types may adversely affect the overall extraction performance and limit its generalization and practical applicability.
Therefore, improving the model’s capability to identify unknown relation types remains a key direction for future work. Several potential strategies can be explored, such as enhancing the model architecture to better capture uncertainty and ambiguity, incorporating richer contextual information to facilitate deeper semantic understanding, or employing transfer learning techniques to leverage knowledge from related tasks. Through these efforts, it is expected that the model’s performance on unknown relations can be significantly improved, thereby boosting the overall effectiveness of RE systems in real-world applications.
As shown in Table 8, the multi-granularity and multi-level context enhancement module proposed in this study demonstrates significant effectiveness on distant supervision datasets. A preliminary analysis of the GDS dataset reveals (as shown in Table 12) that approximately 70% of the test instances benefit from the rich contextual information provided by GraphRAG, which contributes significantly to the observed performance improvements.
Ablation experiments conducted on the Qwen2.5-7B-Instruct model show that removing the two types of context constructed by GraphRAG and TF-IDF decreases performance on the GDS dataset by 4.72% and 0.83%, respectively. However, the improvements on the SemEval and KBP37 datasets are not as significant as those on GDS. This can primarily be attributed to two factors: first, the higher proportion of single-instance samples reduces the effectiveness of context modeling, as shown in the table above; and second, the relatively smaller test set size may introduce greater variability in the evaluation results. These findings suggest that our method is particularly suitable for datasets like GDS, which are derived from distant supervision and contain rich multi-instance structures.
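Purely as an illustration of the sentence-level signal examined in this ablation, the sketch below retrieves the most similar support sentences for a query sentence using TF-IDF and cosine similarity; the assumption that support sentences come from other instances mentioning the same entity pair, and the helper names used here, are hypothetical simplifications rather than the exact retrieval procedure of the framework.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_context(query: str, support_sentences: list[str], top_k: int = 2) -> list[str]:
    """Return the top-k support sentences most similar to the query under TF-IDF."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform([query] + support_sentences)
    # Cosine similarity between the query (row 0) and every support sentence.
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
    ranked = scores.argsort()[::-1][:top_k]
    return [support_sentences[i] for i in ranked]

# Hypothetical multi-instance bag for one entity pair, as found in
# distant-supervision data such as GDS.
bag = [
    "Mark Twain spent his final years in Redding, Connecticut.",
    "Redding is a small town in Fairfield County.",
    "Twain died at his Redding home in 1910.",
]
print(retrieve_context("Mark Twain died in Redding.", bag))
```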