1. Introduction
Entity relation extraction (RE) is a fundamental task in Natural Language Processing (NLP) that aims to identify entities (such as people, organizations, and locations) and the relationships between them in unstructured text. This task is crucial for a wide range of NLP applications: intelligent question-answering (Q&A) systems, where understanding the relationships between entities is key to providing accurate answers; knowledge graph construction, which organizes information in a structured form; and information retrieval, where it helps systems understand and retrieve relevant documents [1]. The ability to accurately identify entities and their relationships is critical to improving the quality and relevance of search results and system responses.
The growing volume of unstructured text data in the era of big data has amplified the challenges faced by RE systems. Traditional techniques, such as rule-based approaches and feature-engineered machine learning models, often struggle to generalize to large-scale and heterogeneous text data. These conventional methods also encounter difficulties when processing diverse and complex linguistic phenomena, including varied sentence structures, ambiguous expressions, and implicit relationships between entities. As text data continues to increase in scale and complexity, the limitations of these traditional methods become more apparent, highlighting the need for more robust and adaptable solutions in RE.
In recent years, deep learning techniques have brought significant advancements to the field of RE. The development of pre-trained language models such as RoBERTa, BERT, and ERNIE has substantially improved the ability to capture rich contextual representations of text [2]. These models, trained on large-scale corpora, are capable of modeling nuanced semantic dependencies, making them highly effective for entity recognition and relation extraction. In addition, convolutional neural networks (CNNs) and bidirectional long short-term memory (BiLSTM) networks have been widely adopted to capture local and long-range dependencies in text. More recently, attention mechanisms and graph neural networks (GNNs) have been introduced to further enhance representation learning [3]. Attention mechanisms allow models to selectively focus on informative tokens, while GNNs explicitly model interactions among entities, improving extraction performance in complex relational contexts [4].
Despite these advances, several challenges remain. First, modeling complex contextual information and long-distance dependencies is still difficult, as relations may span multiple clauses or exhibit implicit semantic patterns. Second, existing models often lack robustness when salient relational cues are sparsely distributed or entangled with irrelevant contextual information, leading to suboptimal information utilization. Finally, many approaches show limited generalization beyond benchmark datasets, and their performance often degrades in real-world scenarios characterized by noisy, unstructured, or domain-specific text.
To address these challenges, we propose a novel multi-mechanism fusion model for entity–relation extraction. The proposed model integrates multiple complementary representation learning components to enhance robustness and adaptability across heterogeneous sentence structures. Rather than explicitly optimizing for hardware-level efficiency, our goal is to enable adaptive utilization of model capacity, allowing different inputs to activate different modeling components according to their structural and semantic complexity.
The main contributions of this paper are summarized as follows:
Context-aware lexical encoding with whole-word masking. We employ a pre-trained Chinese RoBERTa-wwm-ext encoder with a whole-word masking strategy, which preserves semantic integrity in multi-character Chinese text and improves contextual representation quality.
Selective feature modeling via attention mechanisms. Multi-head attention and gated attention mechanisms are jointly incorporated to enhance salient semantic features and suppress less informative context, improving robustness in entity–relation extraction.
Input-adaptive dynamic network architecture. We introduce a dynamic framework that adaptively controls the execution of feature modeling modules based on input characteristics, enabling flexible and robust utilization of model capacity across simple and complex sentence structures.
3. Proposed Methods
Figure 1 illustrates the overall architecture of the proposed model. The model adopts Chinese-RoBERTa-wwm-ext as the underlying semantic encoder and introduces a dynamic adaptive framework between the encoding stage and the task-specific decoding stage. This framework dynamically controls the activation of multiple feature modeling modules, including two BiLSTM layers, a multi-head attention module, and a gated attention module, thereby enabling adaptive information flow and flexible network paths.
The proposed model consists of the following key components, corresponding to the elements shown in Figure 1:
Chinese PLM Encoder (RoBERTa-wwm-ext): This module encodes the input sentence into contextualized token representations. The whole-word masking strategy is particularly suitable for Chinese multi-character words.
Dynamic Adaptive Framework: As shown in Figure 1, this framework encompasses BiLSTM layer 1, multi-head attention, gated attention, and BiLSTM layer 2. A dynamic gating mechanism assigns an activation score $g_i$ to each module, and modules with $g_i > \tau$, where $\tau$ is a predefined threshold, are activated during forward propagation. This design allows the model to adaptively select computational paths based on input complexity, rather than executing all modules for every sentence.
BiLSTM Layer 1 and BiLSTM Layer 2: The first BiLSTM layer captures bidirectional sequential dependencies from encoder outputs. After attention-based feature enhancement, the second BiLSTM layer further refines local contextual information. These two layers are explicitly distinguished in Figure 1 to avoid confusion.
Linear Layer: A linear layer is applied after BiLSTM processing to align feature dimensions and to project attention-enhanced representations into a unified latent space suitable for downstream entity and relation extraction.
CRF Layer: The CRF layer operates on the sequence-level feature representations to enhance global consistency. It serves as a feature refinement and regularization component rather than an independent decoding stream, and does not replace the CasRel-based span prediction mechanism.
Entity Extractor (Subject Tagger): This module corresponds to the subject tagger shown in Figure 1. It consists of binary classifiers that predict the start and end positions of subject entities.
Joint Object–Relationship Extraction Module: Conditioned on detected subjects, relation-specific object taggers predict object spans independently for each relation type, enabling joint extraction of relational triples.
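To make the interaction between these components concrete, the following is a minimal PyTorch-style sketch of the forward pass. Module names, hidden sizes, and the relation count are illustrative assumptions rather than the authors' implementation, and the dynamic gating of Section 3.3 and the CRF layer are omitted for brevity.

```python
# Minimal sketch of the forward pass described above (illustrative, not the authors' code).
import torch
import torch.nn as nn

class JointExtractorSketch(nn.Module):
    def __init__(self, hidden=768, num_relations=18):          # relation count is an assumed value
        super().__init__()
        # Intermediate feature modeling modules: BiLSTM-1, multi-head attention, gated attention.
        self.bilstm1 = nn.LSTM(hidden, hidden // 2, bidirectional=True, batch_first=True)
        self.mha = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.gate_attn = nn.Sequential(nn.Linear(hidden, hidden), nn.Sigmoid())
        # Always-active components.
        self.bilstm2 = nn.LSTM(hidden, hidden // 2, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(hidden, hidden)
        # Subject tagger: start/end binary classifiers per token.
        self.subj_heads = nn.Linear(hidden, 2)
        # Relation-specific object taggers: start/end per relation per token.
        self.obj_heads = nn.Linear(hidden, 2 * num_relations)

    def forward(self, enc_out):                 # enc_out: (B, S, d) from RoBERTa-wwm-ext
        h1, _ = self.bilstm1(enc_out)           # bidirectional sequential features
        h2, _ = self.mha(h1, h1, h1)            # global token interactions
        a = self.gate_attn(h2)                  # feature-wise importance weights (attention map A)
        h3, _ = self.bilstm2(h1 * a)            # modulate BiLSTM-1 output, refine local context
        h = self.proj(h3)                       # align feature dimensions
        return self.subj_heads(h), self.obj_heads(h)

x = torch.randn(2, 16, 768)                     # stand-in for encoder output
subj, obj = JointExtractorSketch()(x)
print(subj.shape, obj.shape)                    # torch.Size([2, 16, 2]) torch.Size([2, 16, 36])
```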
A major challenge in natural language processing is joint entity and relation extraction, which aims to simultaneously identify entities and their semantic relations in the form of subject–relation–object triples (s, r, o) from unstructured text. In this work, we adopt an enhanced CasRel-based framework [4] to model this task through a conditional probabilistic factorization.
Following the CasRel paradigm, relational triple extraction is formulated as follows. Given an input sentence $x$, the goal is to extract the set of relational triples

$$T = \{(s, r, o)\},$$

where $s$ and $o$ denote subject and object entity spans, respectively, and $r \in R$ denotes a predefined relation type.

The joint probability of all triples in a sentence is factorized as

$$p(T \mid x) = \prod_{s \in S} p(s \mid x) \prod_{(r, o) \in T_s} p_r(o \mid s, x),$$

where $S$ denotes the set of detected subject entities, and $T_s$ represents the subset of triples whose subject is $s$.
Under this formulation, subject extraction is performed first by predicting the start and end positions of subject spans conditioned on the input sentence. For each detected subject $s$, object entities are then predicted independently for each relation type $r \in R$. For relation types that do not form a valid triple with subject $s$ in the current sentence, a null object is implicitly assumed.
This factorization decouples subject detection from relation-conditioned object prediction, enabling the model to naturally handle overlapping triples and multiple relations per entity pair. Importantly, all relation types are evaluated independently rather than through mutual exclusion, which is a key advantage of the CasRel framework. During inference, the model first decodes subject spans from the refined token representations, and then predicts relation-conditioned object spans for each detected subject. The final relational triples are constructed by combining subject spans, relation types, and corresponding object spans.
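To illustrate this inference procedure, the sketch below decodes triples from hypothetical subject and relation-conditioned object span probabilities. The 0.5 threshold, tensor shapes, relation names, and the nearest-end pairing heuristic are assumptions made only for this example.

```python
# Illustrative CasRel-style decoding from span probabilities (shapes and thresholds assumed).
import torch

def decode_spans(start_p, end_p, thresh=0.5):
    """Pair each start position above threshold with the nearest following end position."""
    starts = (start_p > thresh).nonzero(as_tuple=True)[0].tolist()
    ends = (end_p > thresh).nonzero(as_tuple=True)[0].tolist()
    spans = []
    for s in starts:
        following = [e for e in ends if e >= s]
        if following:
            spans.append((s, following[0]))
    return spans

def decode_triples(subj_start, subj_end, object_probs, relations, thresh=0.5):
    """object_probs maps a subject span to a (R, 2, S) tensor of object start/end probabilities."""
    triples = []
    for subj in decode_spans(subj_start, subj_end, thresh):
        rel_probs = object_probs[subj]
        for r_idx, rel in enumerate(relations):
            for obj in decode_spans(rel_probs[r_idx, 0], rel_probs[r_idx, 1], thresh):
                triples.append((subj, rel, obj))        # relations with no object yield no triple
    return triples

# Toy example: sequence length 6, two hypothetical relation types.
subj_start = torch.tensor([0.9, 0., 0., 0., 0., 0.])
subj_end   = torch.tensor([0., 0.8, 0., 0., 0., 0.])
obj_probs  = {(0, 1): torch.zeros(2, 2, 6)}
obj_probs[(0, 1)][1, 0, 3] = 0.95                       # relation 1: object starts at position 3
obj_probs[(0, 1)][1, 1, 4] = 0.90                       # relation 1: object ends at position 4
print(decode_triples(subj_start, subj_end, obj_probs, ["founded_by", "located_in"]))
# [((0, 1), 'located_in', (3, 4))]
```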
3.2. Attention-Enhanced BiLSTM Module
The Attention-Enhanced BiLSTM module follows a sequential and dynamically controlled pipeline. It first applies a BiLSTM layer to the contextual embeddings to capture bidirectional dependencies. The output is projected back to dimension d and subsequently processed by a multi-head self-attention layer to model global token interactions. A gated attention mechanism is then applied to generate feature-wise importance weights, which are used to modulate the original representations through element-wise multiplication. Finally, the fused features are passed to downstream extraction modules.
By capturing bidirectional dependencies between words and producing high-quality contextual representations, the BiLSTM provides a strong basis for the relation extraction task. The first BiLSTM layer takes as input the word embeddings produced by the pre-trained language model Chinese-RoBERTa-wwm-ext, processes each word in the sentence step by step, and constructs the corresponding contextual representation. The output is given by Equation (6):

$$h_t = \big[\overrightarrow{\mathrm{LSTM}}(e_t, \overrightarrow{h}_{t-1});\ \overleftarrow{\mathrm{LSTM}}(e_t, \overleftarrow{h}_{t+1})\big], \quad (6)$$

where $e_t$ is the embedding of the $t$th token and $h_t$ concatenates the forward and backward hidden states.
To improve the accuracy of entity-relationship extraction, the model integrates gated and multi-head attention mechanisms, allowing it to dynamically adjust attention weights based on contextual information. Not every word in a sentence contributes equally to identifying entities and relationships, so an adaptive mechanism is necessary to highlight the most informative terms.
The gated attention layer is introduced to refine the focus of the model by assigning different levels of importance to each word in the sequence. This mechanism first computes attention scores based on contextual representations and then applies gating weights to regulate focus intensity. By dynamically modulating these weights, the model can suppress irrelevant information while emphasizing crucial words that contribute to relationship detection.
The final gated attention weights are computed as

$$\alpha_t = \sigma\big(W_g h_t + b_g\big) \odot e_t, \quad (7)$$

where $W_g$ and $b_g$ are the weight matrix and bias that determine the degree of focus on each hidden state, and $e_t$ is the computed attention score. By integrating these components, the model improves its ability to capture key relationship indicators, leading to more precise and robust entity-relationship extraction.

Finally, the context vector is calculated using Equation (8), which creates a global representation of the input sequence by combining the weighted representations of every word:

$$c = \sum_{t=1}^{n} \alpha_t h_t, \quad (8)$$

where $n$ is the length of the input sequence.
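A minimal sketch of how Equations (7) and (8) could be realized is given below. The scoring layer that produces the attention scores $e_t$ and the hidden size are assumptions, since the text does not fix them.

```python
# Sketch of the gated attention weights (Eq. 7) and context vector (Eq. 8); the score
# layer and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class GatedAttentionSketch(nn.Module):
    def __init__(self, d=768):
        super().__init__()
        self.score = nn.Linear(d, 1)      # produces the attention score e_t per token
        self.gate = nn.Linear(d, 1)       # W_g h_t + b_g

    def forward(self, h):                 # h: (B, S, d) hidden states
        e = torch.softmax(self.score(h), dim=1)      # attention scores over the sequence
        g = torch.sigmoid(self.gate(h))              # gating weights in (0, 1)
        alpha = g * e                                # Eq. (7): gated attention weights
        context = (alpha * h).sum(dim=1)             # Eq. (8): weighted sum over the n tokens
        return alpha, context

alpha, c = GatedAttentionSketch()(torch.randn(2, 10, 768))
print(alpha.shape, c.shape)                          # (2, 10, 1) (2, 768)
```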
The context vector produced by the gated attention layer is then passed as input to the multi-head attention layer. Here, the inputs are linearly projected using Equations (9)–(11) to produce the query, key, and value matrices $Q$, $K$, and $V$:

$$Q = X W^{Q}, \qquad K = X W^{K}, \qquad V = X W^{V}, \quad (9\text{–}11)$$

where $X$ denotes the input representation and $W^{Q}$, $W^{K}$, and $W^{V}$ are trainable weight matrices that project the input vectors into different subspaces.
The attention weights are then calculated by Equations (12) and (13):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \quad (12)$$

$$\mathrm{head}_i = \mathrm{Attention}\big(Q W_i^{Q},\ K W_i^{K},\ V W_i^{V}\big), \quad (13)$$

where $d_k$ denotes the dimension of the matrix $K$; $\mathrm{head}_i$ denotes the output of the $i$th attention head; and $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ denote the weight matrices of the $i$th attention head.
Finally, Equation (14) concatenates the outputs of all attention heads, and a linear transformation yields the final output:

$$M = \mathrm{Concat}\big(\mathrm{head}_1, \ldots, \mathrm{head}_h\big) W^{O}, \quad (14)$$

where $W^{O}$ is the weight matrix of the linear transformation and $\mathrm{Concat}$ denotes the concatenation of the outputs of all attention heads.
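For illustration, the following sketch implements Equations (9)–(14) directly with explicit projection matrices; the head count and dimensions are assumed values, and an equivalent result could be obtained with torch.nn.MultiheadAttention.

```python
# Explicit multi-head attention following Eqs. (9)-(14); head count and dimensions are assumptions.
import math
import torch
import torch.nn as nn

class MultiHeadAttentionSketch(nn.Module):
    def __init__(self, d=768, heads=8):
        super().__init__()
        assert d % heads == 0
        self.h, self.dk = heads, d // heads
        self.wq, self.wk, self.wv = (nn.Linear(d, d) for _ in range(3))   # Eqs. (9)-(11)
        self.wo = nn.Linear(d, d)                                         # W^O in Eq. (14)

    def forward(self, x):                              # x: (B, S, d)
        B, S, _ = x.shape
        def split(t):                                  # reshape into (B, heads, S, d_k)
            return t.view(B, S, self.h, self.dk).transpose(1, 2)
        Q, K, V = split(self.wq(x)), split(self.wk(x)), split(self.wv(x))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.dk)             # Eq. (12): scaled scores
        heads = torch.softmax(scores, dim=-1) @ V                         # Eq. (13): all heads at once
        concat = heads.transpose(1, 2).reshape(B, S, -1)                  # Concat(head_1, ..., head_h)
        return self.wo(concat)                                            # Eq. (14)

out = MultiHeadAttentionSketch()(torch.randn(2, 10, 768))
print(out.shape)                                                          # torch.Size([2, 10, 768])
```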
The outputs of Equations (6) and (14) are fed into the second BiLSTM layer to extract local information, and the result is passed to the subject extraction module via Equation (15):

$$\tilde{H} = \mathrm{BiLSTM}_2\big(H \odot A\big), \quad (15)$$

where $\odot$ denotes element-wise multiplication, $H$ is the output of the first BiLSTM layer in Equation (6), and $A$ is the attention-enhanced feature map produced by the multi-head and gated attention modules.
The CRF layer takes the high-quality feature representations produced by the preceding modules and optimizes them globally, ensuring the accuracy and consistency of the output sequences via Equations (16)–(19):

$$\mathrm{score}(x, y) = \sum_{i=1}^{n}\big(A_{y_{i-1}, y_i} + P_{i, y_i}\big), \quad (16)$$

where $\mathrm{score}(x, y)$ is the total score of the input sequence $x$ and the label sequence $y$; $n$ is the length (i.e., word count) of the input sequence; $A_{y_{i-1}, y_i}$ is the element of the label transition matrix, representing the score of transitioning from label $y_{i-1}$ to label $y_i$; and $P_{i, y_i}$ is the observed feature score of the $i$th word for label $y_i$, computed from the output feature representation of the Attention-Enhanced BiLSTM module.

$$Z(x) = \sum_{y' \in Y} \exp\big(\mathrm{score}(x, y')\big), \quad (17)$$

where $Z(x)$ denotes the normalization factor, i.e., the total score of all possible label sequences; $Y$ is the set of all possible label sequences; and $\exp$ is the exponential function, which transforms the results of the scoring function into positive values.

$$P(y \mid x) = \frac{\exp\big(\mathrm{score}(x, y)\big)}{Z(x)}, \quad (18)$$

$$\mathcal{L}_{\mathrm{CRF}} = -\log P(y \mid x), \quad (19)$$

where $P(y \mid x)$ denotes the conditional probability of the label sequence $y$ given the input sequence $x$. The CRF layer is employed solely to enhance sequence-level feature coherence and does not constitute an independent decoding stream for subjects or objects. All relational triples are decoded exclusively through the CasRel-based span prediction mechanism described in Section 3.4 and Section 3.5.
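To make Equations (16)–(19) concrete, the toy computation below evaluates the path score, the normalization factor, and the negative log-likelihood by brute-force enumeration of label sequences; the emission and transition values are random stand-ins, and a practical CRF implementation would use the forward algorithm instead of enumeration.

```python
# Brute-force illustration of the CRF score (16), normalization (17), probability (18),
# and loss (19) on a toy example; real implementations use dynamic programming.
import itertools
import torch

emissions = torch.randn(4, 3)        # stand-in P: 4 tokens, 3 labels
transitions = torch.randn(3, 3)      # stand-in A: label transition scores
gold = [0, 2, 2, 1]                  # a toy gold label sequence y

def score(y):                        # Eq. (16): sum of transition and emission scores
    s = emissions[0, y[0]]
    for i in range(1, len(y)):
        s = s + transitions[y[i - 1], y[i]] + emissions[i, y[i]]
    return s

all_paths = list(itertools.product(range(3), repeat=4))                   # the set Y
Z = torch.logsumexp(torch.stack([score(y) for y in all_paths]), dim=0)    # log of Eq. (17)
log_p = score(gold) - Z                                                   # log of Eq. (18)
loss = -log_p                                                             # Eq. (19)
print(float(loss))
```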
The final loss function is obtained by combining the CasRel-based triple extraction loss with the CRF objective in Equation (19).
3.3. Dynamic Adaptive Framework
In this section, we present the proposed dynamic adaptive framework, which introduces sample-dependent module activation into the joint entity–relation extraction model. The goal of this design is not to alter the overall CasRel-style decoding paradigm, but to adaptively allocate modeling capacity according to the complexity of each input sentence, thereby improving robustness and representation adequacy.
Sentences in real-world information extraction datasets exhibit significant heterogeneity in length, syntactic structure, and relational complexity. Applying an identical sequence of feature modeling modules to all sentences may lead to underfitting for complex cases or unnecessary computation for simple cases. To address this issue, we introduce a dynamic adaptive mechanism that selectively activates a subset of feature modeling modules for each input sample: the core encoding and decoding components are always executed, whereas several intermediate modules are optionally executed based on learned activation scores. This design preserves a stable decoding structure while enabling flexible representation learning.
The proposed dynamic adaptive framework operates at both the module level and the sample level. For each input sentence, the model predicts a set of scalar activation scores that control whether the following modules are executed: the first BiLSTM layer, the multi-head attention module, and the gated attention module. The second BiLSTM layer and the CRF layer are always active and are not subject to dynamic gating. The activation mechanism does not operate at the token level or attention-head level; instead, each activation score corresponds to an entire module and is shared across all tokens within a sentence.
Let the encoder output be denoted as $H \in \mathbb{R}^{B \times S \times d}$, where $B$ is the batch size, $S$ is the sequence length, and $d$ is the hidden dimension. To obtain a sentence-level representation, the encoder output is flattened along the sequence and feature dimensions, yielding $h_{\mathrm{flat}} \in \mathbb{R}^{B \times (S \cdot d)}$.
A lightweight selector network is applied to predict the activation scores:

$$g = \sigma\big(W_s h_{\mathrm{flat}} + b_s\big),$$

where $W_s \in \mathbb{R}^{K \times (S \cdot d)}$, $b_s \in \mathbb{R}^{K}$, $K$ is the number of gated modules, and $\sigma(\cdot)$ is the sigmoid function, so that $g \in (0, 1)^{K}$. Each scalar $g_i$ represents the activation score of module $i$ for a given input sample.
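A minimal sketch of the selector network is shown below; the batch size, sequence length, and number of gated modules are assumed values for illustration.

```python
# Sketch of the module selector: flatten encoder output and predict K activation scores.
# The fixed sequence length S is an assumption needed for the flattened linear layer.
import torch
import torch.nn as nn

B, S, d, K = 2, 128, 768, 3                        # batch, seq len, hidden dim, gated modules
selector = nn.Sequential(nn.Flatten(start_dim=1),  # (B, S, d) -> (B, S*d)
                         nn.Linear(S * d, K),      # W_s h_flat + b_s
                         nn.Sigmoid())             # g in (0, 1)^K
H = torch.randn(B, S, d)                           # stand-in encoder output
g = selector(H)                                    # per-sample activation scores
print(g.shape, g.min().item() >= 0, g.max().item() <= 1)   # torch.Size([2, 3]) True True
```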
Let $H_0$ denote the encoder output. For each gated module $i$, the intermediate representation is updated according to

$$H_i = g_i \cdot F_i(H_{i-1}) + (1 - g_i) \cdot H_{i-1},$$

where $F_i$ denotes the transformation implemented by the corresponding module. In practice, a module is considered active if its average activation score over the mini-batch exceeds a predefined threshold $\tau$, whose value is fixed in our implementation; inactive modules are bypassed, and $H_{i-1}$ is passed through unchanged.
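The conditional execution described above can be sketched as follows; the threshold value, the linear stand-ins for the gated modules, and the soft gated update are assumptions consistent with the description rather than the exact implementation.

```python
# Sketch of the gated module execution: soft gating keeps gradients flowing through the
# activation scores, while modules below the batch-mean threshold are skipped entirely.
import torch
import torch.nn as nn

tau = 0.5                                            # assumed threshold value
modules = nn.ModuleList([nn.Linear(16, 16) for _ in range(3)])   # stand-ins for F_1..F_K
H = torch.randn(4, 10, 16)                           # current representation H_0
g = torch.rand(4, 3)                                 # activation scores from the selector

for i, F_i in enumerate(modules):
    if g[:, i].mean() > tau:                         # module i is active for this mini-batch
        gi = g[:, i].view(-1, 1, 1)                  # broadcast per-sample score over tokens/features
        H = gi * F_i(H) + (1 - gi) * H               # gated update H_i
    # otherwise the module is bypassed and H is passed through unchanged
print(H.shape)                                       # torch.Size([4, 10, 16])
```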
This conditional execution mechanism enables the model to bypass certain feature modeling modules for simpler inputs while retaining them for more complex sentences.
Gradients from the extraction loss propagate through the activation scores via standard backpropagation. No explicit sparsity or efficiency-oriented regularization is imposed; instead, the sigmoid-based activation naturally encourages selective module utilization.
The primary objective of the dynamic adaptive framework is to enhance robustness and adaptive utilization of model capacity rather than to guarantee explicit inference speedup. Potential computational savings are reflected indirectly through reduced module activation and may be further realized with optimized conditional execution implementations. We provide a concrete toy example in Appendix A to illustrate the complete decoding process, including the encoder output, the predicted subject span, the relation-conditioned object span, and the final triple construction.
Author Contributions
Conceptualization, B.Z.; Methodology, X.J.; Software, B.Z.; Validation, X.H. and B.Z.; Formal analysis, B.Z.; Investigation, X.H.; Resources, X.J.; Data curation, X.J.; Writing—original draft, B.Z.; Writing—review & editing, X.J. and X.H.; Supervision, X.J.; Project administration, X.J. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to data privacy protection and restrictions on usage rights.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Zengeya, T.; Fonou-Dombeu, J.V. A review of state of the art deep learning models for ontology construction. IEEE Access 2024, 12, 82354–82383.
- Hadi, M.U.; Qureshi, R.; Shah, A.; Irfan, M.; Zafar, A.; Shaikh, M.B.; Akhtar, N.; Wu, J.; Mirjalili, S. Large language models: A comprehensive survey of its applications, challenges, limitations, and future prospects. Authorea Prepr. 2023, 1, 1–26.
- Khemani, B.; Patil, S.; Kotecha, K.; Tanwar, S. A review of graph neural networks: Concepts, architectures, techniques, challenges, datasets, applications, and future directions. J. Big Data 2024, 11, 18.
- Wei, Z.; Su, J.; Wang, Y.; Tian, Y.; Chang, Y. A Novel Cascade Binary Tagging Framework for Relational Triple Extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1476–1488.
- Tuo, M.; Yang, W. Review of entity relation extraction. J. Intell. Fuzzy Syst. 2023, 44, 7391–7405.
- Zhou, G.; Zhang, M.; Ji, D.; Zhu, Q. Tree kernel-based relation extraction with context-sensitive structured parse tree information. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, 28–30 June 2007; pp. 728–736.
- Zeng, D.; Liu, K.; Lai, S.; Zhou, G.; Zhao, J. Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin, Ireland, 23–29 August 2014; pp. 2335–2344.
- Zhang, S.; Zheng, D.; Hu, X.; Yang, M. Bidirectional long short-term memory networks for relation classification. In Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation, Shanghai, China, 30 October–1 November 2015; pp. 73–78.
- Zhou, P.; Shi, W.; Tian, J.; Qi, Z.; Li, B.; Hao, H.; Xu, B. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, 7–12 August 2016; pp. 207–212.
- Lu, G.; Liu, Y.; Wang, J.; Wu, H. CNN-BiLSTM-Attention: A multi-label neural classifier for short texts with a small set of labels. Inf. Process. Manag. 2023, 60, 103320.
- Nayak, T.; Majumder, N.; Goyal, P.; Poria, S. Deep neural approaches to relation triplets extraction: A comprehensive survey. Cogn. Comput. 2021, 13, 1215–1232.
- Devlin, J. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Cui, Y.; Che, W.; Liu, T.; Qin, B.; Yang, Z. Pre-training with whole word masking for Chinese BERT. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3504–3514.
- Vashishth, S.; Sanyal, S.; Nitin, V.; Talukdar, P. Composition-based multi-relational graph convolutional networks. arXiv 2019, arXiv:1911.03082.
- Zhang, C.; Song, D.; Huang, C.; Swami, A.; Chawla, N.V. Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 793–803.
- Wang, X.; Zhang, Y.; Ren, X.; Zhang, Y.; Zitnik, M.; Shang, J.; Langlotz, C.; Han, J. Cross-type Biomedical Named Entity Recognition with Deep Multi-Task Learning. Bioinformatics 2019, 35, 1745–1752.
- Miwa, M.; Bansal, M. End-to-end relation extraction using LSTMs on sequences and tree structures. arXiv 2016, arXiv:1601.00770.
- Asif, N.A.; Sarker, Y.; Chakrabortty, R.K.; Ryan, M.J.; Ahamed, M.H.; Saha, D.K.; Badal, F.R.; Das, S.K.; Ali, M.F.; Moyeen, S.I.; et al. Graph neural network: A comprehensive review on non-Euclidean space. IEEE Access 2021, 9, 60588–60606.
- Xu, W.; Chen, K.; Zhao, T. Document-level relation extraction with reconstruction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 14167–14175.
- Li, H.; Wei, L.; Wang, Z. A Review and Outlook of the Latest Results on Document-level Information Extraction. Appl. Comput. Eng. 2024, 96, 120–129.
- Sun, Q.; Zhang, K.; Huang, K.; Xu, T.; Li, X.; Liu, Y. Document-level relation extraction with two-stage dynamic graph attention networks. Knowl.-Based Syst. 2023, 267, 110428.
- Graves, A. Adaptive Computation Time for Recurrent Neural Networks. arXiv 2016, arXiv:1603.08983.
- Wang, X.; Yu, F.; Dou, Z.Y.; Darrell, T.; Gonzalez, J.E. SkipNet: Learning Dynamic Routing in Convolutional Networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
- Li, X.; Luo, X.; Dong, C.; Yang, D.; Luan, B.; He, Z. TDEER: An efficient translating decoding schema for joint extraction of entities and relations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP), Punta Cana, Dominican Republic, 7–11 November 2021; pp. 8055–8064.
- Gao, W.; Zheng, X.; Zhao, S. Named entity recognition method of Chinese EMR based on BERT-BiLSTM-CRF. J. Phys. Conf. Ser. 2021, 1848, 012083.
- Fu, T.J.; Li, P.H.; Ma, W.Y. GraphRel: Modeling text as relational graphs for joint entity and relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), Florence, Italy, 28 July–2 August 2019; pp. 1409–1418.
- Cui, Y.; Che, W.; Liu, T.; Qin, B.; Wang, S.; Hu, G. Revisiting Pre-Trained Models for Chinese Natural Language Processing. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020.
- Sun, Y.; Wang, S.; Li, Y.; Feng, S.; Chen, X.; Zhang, H.; Tian, X.; Zhu, D.; Tian, H.; Wu, H. ERNIE: Enhanced Representation through Knowledge Integration. arXiv 2019, arXiv:1904.09223.