1. Introduction
Since Google introduced the concept of the knowledge graph in May 2012 [
1], technologies related to knowledge graphs have remained a prominent research focus. Today, knowledge graphs are not only used as data storage tools but also serve as the knowledge backbone for applications such as intelligent search [
2], recommendation systems [
3], and question-answering systems [
4], playing a crucial role in improving the efficiency and quality of public knowledge acquisition [
5]. A knowledge graph aims to describe concepts, entities, events, and their relationships in the real world. Essentially, it is a semantic network graph where entities and attributes serve as nodes, while semantic relationships between them form the edges [
6]. Named entity recognition (NER) and relation extraction (RE) are the two most critical tasks in the construction of knowledge graphs.
Specifically, named entity recognition aims to extract entities of specific types from unstructured text, while relation extraction seeks to determine the semantic relationships between these entities. Early research introduced rule-based approaches, followed by feature engineering and machine learning methods. More recently, deep learning has been widely applied to both tasks. Deep neural network models can automatically learn sentence features without requiring complex feature engineering, leading to significant advancements in NER and RE. However, existing models have yet to fully resolve the challenges of these tasks, primarily due to their excessive reliance on contextual associations in unstructured text. When dealing with incomplete or ambiguous descriptions, their performance tends to degrade. Additionally, these models struggle with polysemous words, often misclassifying them as different entities or failing to recognize the correct relationships between them. This issue is particularly evident in social media, where texts are typically short and often contain slang [
7]. As a result, models find it difficult to learn robust features from such limited textual data. Fortunately, with advancements in internet technology and information storage, data on the web has shifted from traditional text-only formats to multimodal representations. For example, social media content often includes images alongside text. In NER and RE tasks, leveraging textual content as the primary source while incorporating images as supplementary information can effectively address the limitations of short text. This necessity has led to the emergence of Multimodal Named Entity Recognition (MNER) and Multimodal Relation Extraction (MRE) tasks.
The core of MNER and MRE tasks lies in learning effective visual features and integrating them into text representations. Previous studies have demonstrated the effectiveness of incorporating visual modalities into knowledge graph construction tasks [
8]. For processing visual information, early research directly used entire images as global feature vectors to enhance text representations [
9] or focused on encoding visual objects [
10] by establishing explicit alignment relationships between objects and textual entities. While these approaches have made some progress, they fail to fully exploit fine-grained semantic correspondences between semantic units in sentence-image pairs. Subsequent researchers recognized that the ability of image encoders to represent rich image information is crucial [
11]. Radford et al. [
12] proposed the contrastive language-image pretraining model, CLIP. Although its direct application to MNER and MRE tasks was found to perform poorly, it provided valuable insights for later researchers. Following this, some researchers explored pretraining models better suited for MNER and MRE tasks, such as VisualBERT [
13], ViLBERT [
14], CLIP-ViL [
15], METER [
16], and X-VLM [
17]. With advancements in image encoders, researchers have adopted a transformer-based framework, segmenting an entire image into multiple regions and enabling their interaction with text sequences [
18]. For example, Zhang et al. [
19] aligned entities and objects within visual and textual graphs constructed from potential relationships between objects and words.
Despite their notable successes, these approaches still face two major challenges. The first challenge is modal noise. Vempala and Preoţiuc-Pietro [
20] found that approximately 33.8% of tweets lack textual content corresponding to images. This indicates that not all images provide complementary textual information; some are entirely unrelated to the text, introducing misleading noise that hampers prediction accuracy rather than contributing useful information. While RpBERT [
21] trains a classifier before the main task to determine the relevance between images and tweets, this method heavily relies on large-scale labeled image-text datasets and only considers the overall image, ignoring key objects within it. The second challenge is modal discrepancy. The distributional differences between textual and visual features significantly limit a model’s ability to effectively capture cross-modal semantic correlations. For example, in the text “Rocky is ready for snow season” (corresponding to
Figure 1), identifying the semantic relevance between the textual entity “Rocky” and the visual entity (a dog in the image) is particularly challenging.
Based on these two challenges, this paper integrates the strengths of existing MNER and MRE approaches and proposes a novel end-to-end model. The proposed method consists of five key components: visual feature extraction, visual feature adaptation, textual feature extraction, multimodal feature fusion, and task output. The main contributions of this work are as follows:
- (1)
Enhanced Visual Feature Representation: By incorporating group-level information during image encoding, the model selectively combines global and local visual features as an improved visual prefix for multimodal fusion. This approach not only reduces modal noise but also enhances data utilization.
- (2)
Feature Fusion within CorefBERT: The proposed model integrates feature fusion within CorefBERT, which effectively captures contextual information, understands inter-sentence relationships, and handles cross-sentence coreference and multi-referential chains. This not only mitigates modal discrepancies but also improves textual feature extraction.
- (3)
Comprehensive Evaluation and Ablation Studies: The proposed method is evaluated against baseline models on public benchmark datasets for MNER and MRE tasks, demonstrating its superior performance. Additionally, ablation studies validate the effectiveness of each module within the model.
2. Related Works
The following sections introduce the related work on the MNER and MRE tasks.
MNER was first explored in 2018 by Moon, Carvalho, et al. [
22]. Their proposed method encoded text using RNNs and the entire image using CNNs, implicitly interacting between the two modalities. This development paved the way for MNER’s potential applications across various social media platforms. Subsequently, researchers explored different strategies to integrate image features into text representations. Zhang et al. [
23] further investigated how to incorporate whole-image features into textual representations. Zheng et al. [
24] designed a gated bilinear attention network with an adversarial strategy to better extract fine-grained objects from images. Additionally, some studies introduced graph-based methods to enhance modality alignment. Zhao et al. [
25] and Yuan et al. [
26] proposed a heterogeneous graph network and an edge-enhanced graph neural network, respectively, to align objects and entities through structured alignment mechanisms. Zhang et al. [
27] introduced a multimodal fusion model based on a syntactic dependency text graph and a fully connected visual graph to explore fine-grained semantic alignment across modalities. Furthermore, some approaches with novel perspectives have been proposed. For instance, Xu et al. [
28] introduced a general data partitioning strategy, using reinforcement learning to train a data discriminator that categorizes data into unimodal and multimodal groups for separate recognition tasks. Lu et al. [
29] developed a multimodal interaction transformer with interaction position labels for unified representation and used transformers for both intra-modal and cross-modal connections. Wang et al. [
30] introduced scene graphs as structured representations of visual content to enhance semantic interaction. Xu et al. [
31] proposed a matching and alignment framework to improve the consistency of multimodal representations in MNER. Furthermore, Jia et al. [
32,
33] leveraged external knowledge to introduce MRC-MNER and MNER-QG, facilitating cross-modal interactive reasoning.
In 2021, Zheng et al. [
34] were the first to introduce the MRE task and constructed a social media-related dataset. Since then, researchers have proposed various methods to optimize the relation extraction process. Chen et al. [
35] proposed a hierarchical visual prefix fusion network that enhances text representation using image features. Kang et al. [
36] introduced a novel translation-supervised prototype network for multimodal social relation extraction to capture triplet features. Huang et al. [
37] designed an internal module to learn single-instance representations and an external module to focus on diverse sample relationships. Zhao et al. [
38] introduced a two-stage visual fusion network that applies multimodal fusion to improve relation extraction. Liu et al. [
39] proposed a multi-granularity cross-modal transformer to model complex interactions between text, global images, and local visual objects. Lastly, Hu et al. [
40] supplemented existing information by retrieving external textual and visual evidence and integrating it to infer relationships between entities across different modalities.
Noting that the aforementioned studies have overlooked the modeling of group-level features and the optimization of fusion strategies, this paper focuses on these two aspects and proposes a novel approach to further enhance the performance of multimodal entity relation extraction. To further highlight the contributions of this study,
Table 1 presents a comparative analysis of representative works on MNER and MRE tasks, focusing on aspects such as task coverage, core models, main contributions, and limitations. Since the data types, datasets, and evaluation metrics are consistent across the studies, they are not elaborated here. The comparison reveals that previous studies predominantly adopt ResNet or BERT as their backbone models, whereas our work leverages VIT and Coref-BERT, which are better aligned with the characteristics of the tasks. Furthermore, our approach introduces a novel perspective by modeling group-level semantics and optimizing their integration strategy—addressing a key gap that has often been overlooked in prior research.
3. Materials and Methods
The proposed model for MNER and MRE tasks is illustrated in
Figure 2. It consists of five main components: (1) Text Encoder: Responsible for processing input text. This corresponds to the “Text Encoder” module in the figure. The “Text” in the figure represents the textual data we need to process. (2) Image Feature Extraction: Extracts image information and converts it into feature vectors. This corresponds to the “VIT” (Vision Transformer) module in the figure. The “Images” in the figure represent the image data we need to process. (3) Adaptive Module: Extracts group-level information and progressively refines visual features. This corresponds to the “Adaptive” module in the figure. The specific structure is discussed in the following sections. (4) Multimodal Feature Fusion: Effectively integrates visual features as a prefix into textual representations. This corresponds to the “Fusion Encoder” module in the figure. On the right side of the figure, a self-attention mechanism module is displayed. We apply a set of linear transformations to project the visual features into the same embedding space as the textual representations, obtaining the prefixes $\phi_k$ and $\phi_v$, which correspond to the peach-colored and purple sections. (5) Final Task Prediction: Utilizes the fused multimodal information for MNER and MRE tasks. This architecture ensures effective cross-modal interaction, enhancing the model’s ability to capture entity relationships in social media data.
3.1. Task Definition
MNER: This is a sequence labeling task. Given a short text $T = \{t_1, t_2, \ldots, t_n\}$ and its corresponding image $I$ as input, the task aims to identify entities within the sentence and assign each token $t_i$ in the sentence a predefined label type $y_i \in Y$, following the BIO tagging scheme.
MRE: This is a multi-class classification task. Given a short text $T$ and its corresponding image $I$ as input, the task aims to extract the relationship $r$ between two entities $e_1$ and $e_2$ within the text, where $E$ ($e_1, e_2 \in E$) represents all explicitly annotated entities in the text, and $r \in R$ denotes a predefined relationship type.
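To make the two output formats concrete, the following minimal Python sketch shows what a BIO-labeled MNER sample and an MRE sample look like (the sentence, labels, and relation name are illustrative placeholders, not items from the datasets):

```python
# Hypothetical example illustrating the two task formats (not taken from the datasets).
# MNER: one BIO label per token; "B-"/"I-" mark the beginning/inside of an entity span.
tokens = ["Rocky", "is", "ready", "for", "snow", "season"]
bio_labels = ["B-PER", "O", "O", "O", "O", "O"]   # "Rocky" labeled as a person-type entity

# MRE: a single relation label for an annotated (head, tail) entity pair.
mre_sample = {
    "text": "Rocky is ready for snow season",
    "head": "Rocky",
    "tail": "snow season",
    "relation": "/per/misc/relation_placeholder",   # placeholder, not an actual MNRE label
}
```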
3.2. Text Encoder
This paper employs the BERT variant CorefBERT [
41] as the text encoder. CorefBERT removes the Next Sentence Prediction (NSP) task from BERT and is trained solely on the Mention Reference Prediction (MRP) task. This allows it to focus more on learning contextual information and enhances its language representation capabilities. Additionally, CorefBERT incorporates enhanced designs in input embeddings and task-specific processing, particularly for entity and coreference information, making it well-suited for MNER and MRE tasks. Each input sentence is prepended with a [CLS] token and appended with a [SEP] token, represented as:
$$S = \{[\mathrm{CLS}], t_1, t_2, \ldots, t_n, [\mathrm{SEP}]\}$$
where $[\mathrm{CLS}]$ and $[\mathrm{SEP}]$ are the inserted [CLS] and [SEP] tokens, and $t_1$ to $t_n$ represent the text sequence. Inspired by [42], the first four layers of CorefBERT are initialized as the text encoder, processing $S$ to obtain its final representation. The remaining eight layers of CorefBERT are dedicated to multimodal feature fusion, which is discussed in detail in Section 3.5.
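As a rough sketch of this layer split (assuming a BERT-compatible coref-bert-base checkpoint loadable through the Hugging Face Transformers API; the checkpoint path is a placeholder, and this is not the authors' implementation):

```python
from transformers import BertModel, BertTokenizerFast

# Assumption: the CorefBERT weights are BERT-compatible and stored under this
# hypothetical path; substitute the actual checkpoint used in Section 4.2.
tokenizer = BertTokenizerFast.from_pretrained("path/to/coref-bert-base")
corefbert = BertModel.from_pretrained("path/to/coref-bert-base")

# CorefBERT has 12 transformer layers: the first 4 serve as the pure text encoder,
# while the remaining 8 are reserved for multimodal feature fusion (Section 3.5).
text_layers = corefbert.encoder.layer[:4]
fusion_layers = corefbert.encoder.layer[4:]

# Tokenization automatically prepends [CLS] and appends [SEP], matching S above.
enc = tokenizer("Rocky is ready for snow season", return_tensors="pt")
hidden = corefbert.embeddings(input_ids=enc["input_ids"])
ext_mask = corefbert.get_extended_attention_mask(enc["attention_mask"], enc["input_ids"].shape)

# Run only the first four layers to obtain the text-only representation.
for layer in text_layers:
    hidden = layer(hidden, attention_mask=ext_mask)[0]
print(hidden.shape)  # (1, sequence_length, 768)
```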
3.3. Image Feature Extraction
The images associated with the textual information consist of global images, which contain local image regions that provide relevant information about target entities. The global image serves as the background for local images, helping to recognize abstract expressions and acting as a weak learning signal.
Since the multi-scale visual features obtained from ResNet are relatively coarse and require pooling operations to align with transformer-based feature fusion, this paper adopts Vision Transformer (VIT) as the visual encoder. VIT stacks multiple layers of attention to learn high-dimensional features, allowing it to directly model relationships between any two regions within an image. This enables richer global semantic information extraction, making it advantageous for global feature representation.
However, due to its design, VIT initially struggles to focus on local features. To address this, this paper utilizes a Faster R-CNN model [
43] to extract the top $m$ most salient local visual objects $O = \{o_1, o_2, \ldots, o_m\}$. Next, both the global image $I$ and the local images $O$ are standardized to 224 × 224 pixels before being fed into VIT. These images are divided into 49 patches, and VIT processes them through 12 transformer blocks. Each block outputs a vector representation with a different level of semantic intensity, producing a hierarchical visual feature list $V = \{v_1, v_2, \ldots, v_{12}\}$, as illustrated in Figure 2. In this hierarchy, lower-indexed vectors (i.e., smaller $k$ in $v_k$) capture high-resolution, spatially rich, but semantically weaker visual features. Conversely, higher-indexed vectors (larger $k$ in $v_k$) capture low-resolution but semantically stronger visual features. The specific representation of each layer is given as $v_k = \{v_k^{cls}, v_k^1, \ldots, v_k^{49}\}$, where $v_k^{cls}$ is the [CLS] token that represents the separator of all the image information of each layer.
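The hierarchical list $V$ can be extracted from the 12 VIT blocks roughly as follows (a minimal sketch using the clip-vit-base-patch32 checkpoint mentioned in Section 4.2; the image path is a placeholder, and the Faster R-CNN object crops are assumed to be prepared separately):

```python
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

# ViT-B/32: 224 x 224 input with 32 x 32 patches -> 7 x 7 = 49 patches (+1 [CLS] token).
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
vit = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")   # global image or one cropped object
pixels = processor(images=image, return_tensors="pt")["pixel_values"]

with torch.no_grad():
    out = vit(pixel_values=pixels, output_hidden_states=True)

# hidden_states[0] is the patch-embedding output; hidden_states[1..12] are the outputs
# of the 12 transformer blocks, i.e., the hierarchical list V = {v_1, ..., v_12}.
V = out.hidden_states[1:]
print(len(V), V[0].shape)  # 12, (1, 50, 768): [CLS] token + 49 patch tokens per layer
```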
3.4. Adaptive Module
The adaptive module aims to obtain hierarchically matched visual features, which serve as input for multimodal feature fusion, enabling the model to understand multimodal data more effectively and accurately. This module consists of two key components: Group-Level Feature Extraction and Dynamic Gating Mechanism. As illustrated in
Figure 3, these components work together to refine and adapt the extracted visual features, ensuring they align properly with the text representations before fusion.
This paper maps the hierarchical visual feature list $V$ to the group-level representation $G$ as follows:
$$G = \mathrm{Linear}\big(\mathrm{Reshape}\big(\mathrm{Concat}(v_1, v_2, \ldots, v_{12})\big)\big)$$
where $\mathrm{Concat}(\cdot)$ denotes the concatenation operation. The reshaping operation $\mathrm{Reshape}(\cdot)$ transforms the concatenated hierarchical features into a tensor. Specifically, the 12-layer visual features are divided into 4 groups, where each group contains a mix of 3 layers of feature vectors, corresponding to a row in the reshaped tensor. Next, a $\mathrm{Linear}$ layer is applied to reduce the dimensionality of the tensor, obtaining the final visual feature representation $G$. $G$ is then split into a group-level visual feature list $G = \{g_1, g_2, g_3, g_4\}$, where $g_j$ represents the visual feature of the $j$-th group.
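A minimal sketch of this grouping step is given below (dimensions follow the description above; the assignment of three consecutive layers to each group and the exact form of the reduction layer are our assumptions):

```python
import torch
import torch.nn as nn

d = 768                                           # VIT hidden size
V = [torch.randn(1, 50, d) for _ in range(12)]    # placeholder hierarchical features v_1..v_12

# Concatenate the 12 layer features, then reshape so that each of the 4 groups
# holds a mix of 3 layers (here: consecutive layers, an assumed assignment).
stacked = torch.stack(V, dim=1)                   # (batch, 12, 50, d)
grouped = stacked.view(stacked.size(0), 4, 3, 50, d)
grouped = grouped.permute(0, 1, 3, 2, 4).flatten(3)   # (batch, 4, 50, 3*d)

# A linear layer reduces the dimensionality, yielding G = {g_1, g_2, g_3, g_4}.
reduce = nn.Linear(3 * d, d)
G = reduce(grouped)                               # (batch, 4, 50, d)
g1, g2, g3, g4 = G.unbind(dim=1)                  # group-level visual feature list
print(g1.shape)                                   # (1, 50, 768)
```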
Next, a dynamic gating mechanism is employed to map both hierarchical and group-level visual features onto the multimodal feature fusion module. The mapping weights are dynamically adjusted based on the visual features of the image. The dynamic gating module can be viewed as a path decision process, where the goal is to predict a normalized vector that represents the contribution of each visual feature block.
In the dynamic gating mechanism, $p_{i,l}$ denotes the path probability from the $i$-th visual block to the $l$-th transformer layer. First, the gating signal $z_l$ is generated as follows:
$$z_l = \sigma\Big(W_l \cdot \frac{1}{N}\sum_{i=1}^{N}\mathrm{GAP}(u_i)\Big)$$
where $\sigma(\cdot)$ is the activation function, $W_l$ is the trainable weight matrix for the $l$-th layer, $\mathrm{GAP}(\cdot)$ represents the global average pooling function, $N$ denotes the total number of visual feature blocks, which is 16 in this paper, and $u_i$ is the $i$-th visual feature block, specifically one of the global and $m$ local features extracted from each transformer layer and each group.
The probability vector $p_l = (p_{1,l}, p_{2,l}, \ldots, p_{N,l})$ for the $l$-th transformer layer is then computed as follows:
$$p_l = \mathrm{Softmax}(z_l)$$
Based on the probability vector $p_l$, the final weighted visual feature $\hat{v}_l$ is obtained through a weighted sum of the visual features:
$$\hat{v}_l = \sum_{i=1}^{N} p_{i,l}\, u_i$$
In summary, the final visual representation $X_l$ for the $l$-th transformer layer is constructed by combining both global and local features as follows:
$$X_l = \big[\hat{v}_l^{I};\ \hat{v}_l^{o_1};\ \ldots;\ \hat{v}_l^{o_m}\big]$$
where $\hat{v}_l^{I}$ and $\hat{v}_l^{o_j}$ denote the components of $\hat{v}_l$ corresponding to the global image and the $j$-th local object, respectively.
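The gating step can be sketched as follows for a single image (a simplified illustration of the equations above; the tanh activation, the pooling order, and the block layout of $u_i$ are our assumptions rather than details confirmed by the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, N, num_layers = 768, 16, 12   # hidden size, visual blocks (12 layers + 4 groups), fusion layers

u = torch.randn(N, 50, d)        # placeholder visual feature blocks u_1..u_16 for one image
W = nn.ModuleList([nn.Linear(d, N) for _ in range(num_layers)])   # one gating matrix W_l per layer

def gated_feature(layer_idx: int) -> torch.Tensor:
    pooled = u.mean(dim=1).mean(dim=0)       # global average pooling over tokens and blocks -> (d,)
    z = torch.tanh(W[layer_idx](pooled))     # gating signal z_l (activation assumed to be tanh)
    p = F.softmax(z, dim=-1)                 # path probabilities p_{i,l}, one weight per block
    return torch.einsum("i,itd->td", p, u)   # weighted sum over the N blocks -> (tokens, d)

v_hat_3 = gated_feature(3)
print(v_hat_3.shape)                         # torch.Size([50, 768])
```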
3.5. Multi-Modal Feature Fusion
The visual features obtained in
Section 3.4 are incorporated as visual prefixes into each attention layer of CorefBERT’s text sequence, as shown on the right side of
Figure 2.
For the $l$-th transformer layer, the visual gated feature $X_l$ and the output of the previous transformer layer serve as inputs. Here, the output of the $(l-1)$-th transformer layer is denoted as $H^{l-1}$, while the first transformer layer takes the text sequence representation $S$ as $H^{0}$.
Given an input sequence $S$, the context representation $H^{l-1}$ is projected into query ($Q$), key ($K$), and value ($V$) vectors as follows:
$$Q_l = H^{l-1}W_q^{l},\quad K_l = H^{l-1}W_k^{l},\quad V_l = H^{l-1}W_v^{l}$$
where $W_q^{l}$, $W_k^{l}$, and $W_v^{l}$ are the attention mapping parameters, which share the hidden dimension of $H^{l-1}$. $Q_l$, $K_l$, and $V_l$ represent the query, key, and value vectors of the $l$-th layer, respectively.
Regarding the visual feature prefix, for the $l$-th transformer layer, a set of linear transformations $\{W_{\phi_k}^{l}, W_{\phi_v}^{l}\}$ is applied to map it into the same embedding space as the text representations. The visual prompts $\phi_k^{l}$ and $\phi_v^{l}$ are defined as follows:
$$\phi_k^{l} = X_l W_{\phi_k}^{l},\quad \phi_v^{l} = X_l W_{\phi_v}^{l}$$
where the length of the visual sequence is determined by the number of detected visual objects $m$.
Based on the visual prefix, the attention computation is defined as:
$$\mathrm{Attn}(Q_l, K_l, V_l) = \mathrm{Softmax}\!\left(\frac{Q_l\,[\phi_k^{l}; K_l]^{\top}}{\sqrt{d}}\right)[\phi_v^{l}; V_l]$$
where $[\cdot\,;\cdot]$ denotes concatenation along the sequence dimension and $d$ is the hidden dimension.
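The prefix-style attention can be sketched as follows (single head, batch size one; the scaling factor and the concatenation order are reconstructed from the description above and are not the authors' exact formulation):

```python
import torch
import torch.nn.functional as F

d = 768
H_prev = torch.randn(1, 40, d)   # output of the (l-1)-th layer: a text sequence of length 40
X_l = torch.randn(1, 4, d)       # gated visual feature: global image + m = 3 local objects

# Text-side projections (per-layer trainable parameters W_q, W_k, W_v).
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
Q, K, V = H_prev @ W_q, H_prev @ W_k, H_prev @ W_v

# Visual prompts: project the visual prefix into the same key/value space.
W_phi_k, W_phi_v = torch.randn(d, d), torch.randn(d, d)
phi_k, phi_v = X_l @ W_phi_k, X_l @ W_phi_v

# Prefix attention: keys/values are the concatenation of visual prompts and text keys/values.
K_cat = torch.cat([phi_k, K], dim=1)         # (1, 4 + 40, d)
V_cat = torch.cat([phi_v, V], dim=1)
attn = F.softmax(Q @ K_cat.transpose(1, 2) / d ** 0.5, dim=-1)
H_l = attn @ V_cat                           # fused representation passed to the l-th layer
print(H_l.shape)                             # (1, 40, 768)
```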
3.6. Classifier
In the multi-modal feature fusion module, we obtain the final hidden layer vector $H$ from the CorefBERT model, where $H$ is produced by the attention operation based on the visual prefix defined above. This paper uses different classifier layers for the MNER and MRE tasks.
For MNER, we input the hidden representation $H$ into a CRF model to predict the probability of each token belonging to different label categories, resulting in the final prediction sequence $y = \{y_1, y_2, \ldots, y_n\}$. The probability of a given label sequence $y$ is computed as follows:
$$P(y \mid H) = \frac{\prod_{i=1}^{n}\psi_i(y_{i-1}, y_i, H)}{\sum_{y' \in Y^{n}}\prod_{i=1}^{n}\psi_i(y'_{i-1}, y'_i, H)}$$
where $n$ represents the length of the input text sequence, $\psi_i(\cdot)$ is a potential function used to compute the path score, $y_i$ denotes the label value at position $i$ in the observation sequence, and $Y$ represents the predefined label set following the BIO tagging scheme.
The loss function is computed based on the probability of the predicted label sequence. The model parameters are optimized by minimizing the negative log-likelihood loss using maximum likelihood estimation:
$$\mathcal{L}_{\mathrm{NER}} = -\sum_{j=1}^{M}\log P\big(y^{(j)} \mid H^{(j)}\big)$$
where $M$ is the number of samples, $y^{(j)}$ represents the correct label sequence for the $j$-th sample, and $H^{(j)}$ is the final hidden layer vector obtained from the multi-modal feature fusion module for the $j$-th sample.
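In practice, the CRF layer and its negative log-likelihood loss can be implemented, for instance, with the pytorch-crf package (a sketch under the notation above, not the authors' exact code; the tag count assumes BIO tags for the four Twitter-2017 entity types plus "O"):

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # pip install pytorch-crf

num_labels, d = 9, 768                 # 4 entity types x {B, I} + "O"
emission_proj = nn.Linear(d, num_labels)
crf = CRF(num_labels, batch_first=True)

H = torch.randn(2, 40, d)              # final fused hidden states for 2 samples
tags = torch.randint(num_labels, (2, 40))
mask = torch.ones(2, 40, dtype=torch.bool)

emissions = emission_proj(H)
loss = -crf(emissions, tags, mask=mask, reduction="mean")   # negative log-likelihood L_NER
pred_sequences = crf.decode(emissions, mask=mask)           # Viterbi decoding -> label IDs
```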
For MRE, we predict the relationship $r$ between the head entity $e_h$ and the tail entity $e_t$ in sequence $S$. The final hidden layer representations of these two entities are denoted as $h_{e_h}$ and $h_{e_t}$, respectively. The probability distribution over the $|R|$ relationship classes is computed as follows:
$$P\big(r \mid e_h, e_t\big) = \mathrm{Softmax}\big(W\,[h_{e_h}; h_{e_t}] + b\big)$$
where $W$ and $b$ are trainable fine-tuning parameters.
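The relation classifier can be sketched as follows (the concatenation of the two entity representations follows the formulation above; the class count is the 23 MNRE relation categories):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, num_relations = 768, 23             # hidden size; MNRE defines 23 relation categories

W = nn.Linear(2 * d, num_relations)    # trainable fine-tuning parameters W and b

h_head = torch.randn(1, d)             # final hidden state of the head entity e_h
h_tail = torch.randn(1, d)             # final hidden state of the tail entity e_t

logits = W(torch.cat([h_head, h_tail], dim=-1))
probs = F.softmax(logits, dim=-1)                    # P(r | e_h, e_t)
loss = F.cross_entropy(logits, torch.tensor([5]))    # label index 5 is a placeholder
```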
4. Results and Discussions
4.1. Dataset
This paper conducts MNER experiments on the Twitter-2017 dataset [
9], which includes four entity types: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). The number of samples in the training, development, and test sets, as well as the distribution of each entity type, is shown in
Table 2.
For the MRE experiment, we use the MNRE dataset [
34], which contains 23 relationship categories. The detailed distribution is shown in
Figure 4. The dataset is divided into 12,247 samples for training, 1624 for development, and 1614 for testing. Additional statistical information is provided in
Table 3.
Both datasets were collected from the Twitter social media platform, with each text segment corresponding to an image. However, it is important to note that not every image contains valid entity or relationship cues. The preprocessing steps for the Twitter-2017 and MNRE datasets in this paper consist of two main aspects. For text data: special characters were removed, abbreviations were expanded, and standard tokenization tools were used for word segmentation. For image data: all images were resized to a resolution of 224 × 224 and underwent normalization.
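A typical implementation of this image preprocessing looks like the snippet below (the file name is a placeholder, and the normalization statistics are an assumption, here the standard CLIP values, since the paper does not list them):

```python
from PIL import Image
from torchvision import transforms

# Resize to 224 x 224 and normalize; the mean/std values below are the standard
# CLIP statistics and are an assumption, not taken from the paper.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.4815, 0.4578, 0.4082],
                         std=[0.2686, 0.2613, 0.2758]),
])

img = preprocess(Image.open("tweet_image.jpg").convert("RGB"))  # tensor of shape (3, 224, 224)
```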
4.2. Experimental Setup and Evaluation Index
The proposed method was implemented using PyTorch 1.8.1 on an NVIDIA RTX 4090 GPU, with clip-vit-base-patch32 and coref-bert-base as encoders, where the hidden representation size was set to 768. The AdamW optimizer was employed for training. During the first 10% of gradient updates, the learning rate was gradually increased to its peak value using a linear warm-up strategy, followed by a linear decay for the remainder of the training process, with a decay rate set to 0.01. The batch size was set to 16, and the number of image objects was fixed at 3. For the MNER task, experiments were conducted with different learning rates in the range of [1 × 10⁻⁵, 3 × 10⁻⁵], with a maximum input sentence length of 128 and a total of 30 training epochs. For the MRE task, a fixed learning rate of 1 × 10⁻⁵ was used, with a maximum input sentence length of 80 and a total of 20 training epochs. The evaluation metrics employed in this study included Precision (P), Recall (R), and F1-score to ensure consistency with the evaluation criteria used in baseline methods.
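The optimizer and learning-rate schedule described above can be set up, for example, with the helper below (the model variable and the step count are placeholders; interpreting the 0.01 decay value as AdamW weight decay is our reading of the description):

```python
import torch
from transformers import get_linear_schedule_with_warmup

model = torch.nn.Linear(768, 768)      # stand-in for the full MNER/MRE network
total_steps = 30 * 500                 # e.g., 30 epochs x ~500 update steps (illustrative)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),   # warm up over the first 10% of updates
    num_training_steps=total_steps,            # then decay linearly to zero
)
```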
4.3. Experimental Baseline and Results
To ensure a comprehensive comparison, this study selects previous state-of-the-art methods as baselines, categorized into three groups. The first category includes models that consider only textual input, such as BERT-CRF, CNN-BiLSTM-CRF [
44], PCNN [
45], and MTB [
10]. Specifically, CNN-BiLSTM-CRF utilizes BiLSTM and CNN for word- and character-level representations in named entity recognition, PCNN applies convolutional networks with piecewise pooling for relation extraction, and MTB is a BERT-based pre-trained model designed for relation extraction. The second category consists of large language models, including ChatGPT3.5 and GPT4. The third category includes multimodal input models, such as AdapCoAtt, UMT [
18], UMGF [
27], VisualBERT [
13], MEGA [
34], HVPNeT [
35], MAF [
31], ITA [
46], MRC-MNER [
32], MNER-QG [
33], and MKGformer [
8]. Among them, AdapCoAtt designs an adaptive co-attention network to explore visual information, UMT introduces a multimodal interaction module to obtain image-aware word representations, and UMGF constructs multimodal graphs for semantic alignment. VisualBERT leverages pre-trained multimodal models for text-image encoding and fusion, while MEGA develops a dual graph network for semantic consistency. HVPNeT mitigates the noise of irrelevant visual objects by exploring hierarchical visual features as insertable visual prefixes. MAF employs contrastive learning to achieve consistent representation, and ITA aligns image features to the text representation space so that the image modality primarily aids in disambiguation. MRC-MNER and MNER-QG incorporate machine reading comprehension queries to retrieve relevant visual information in the linguistic context. Finally, MKGformer utilizes a relevance-aware fusion module to reduce noisy information.
Table 4 presents all experimental results of this study compared with the baselines.
In
Table 4, it is evident that the overall performance of multimodal methods significantly outperforms text-based approaches. This further validates the necessity of incorporating image information into social media data. Social media texts are often short and contain a lot of omitted information, making it difficult for text-based features alone to fully express entities and their relationships. Images, however, provide additional contextual information, aiding the model in more accurately identifying entity types and inferring relationship types. Furthermore, from the trend in the F1 score improvement, it is clear that image information contributes more significantly to the relation extraction task, while the improvement in named entity recognition is more limited. This suggests that images primarily play a role in relation modeling, while text remains the primary source of information for entity recognition.
Additionally, although large-scale pretrained language models (PLMs) have shown excellent performance in natural language processing tasks, the experimental results indicate that directly using these large models yields only moderate results. This may be because PLMs are primarily trained for general text understanding and are not optimized for NER and RE tasks, especially lacking specific designs for cross-modal alignment. This further underscores the value of purpose-built multimodal models. Our approach better integrates image information through a hierarchical visual prefix, ensuring more consistent representations across modalities, thus improving the overall task performance.
Our method achieves the best performance on most metrics across both datasets, although it did not reach the highest accuracy. This may be because the query mechanism based on machine reading comprehension (used by MRC-MNER and MNER-QG) allows for more precise localization of visual regions, thereby improving both cross-modal and within-modal relationship modeling. Nevertheless, in terms of the F1 score, our method shows a 0.3% improvement on the MNER task and a 2.2% improvement on the MRE task, further validating the effectiveness of our approach. We attribute these results to factors such as group-level features, improvements in the fusion method, and the role of CorefBERT in text modeling; a detailed analysis is presented in the ablation study. Future improvements will focus on further optimizing the multimodal fusion mechanism.
4.4. Ablation Experiment
To validate the effectiveness of each module, three ablation experiments were conducted on the MNRE dataset, as illustrated in
Figure 5.
w/o Group: We removed the group-level information module and observed a significant decline in model performance. This indicates that the module plays a crucial role in enhancing image feature extraction efficiency and optimizing the utilization of visual information. Theoretically, the group-level information module helps capture higher-level semantic features from images, making visual representations more context-aware and semantically enriched. Its absence leads to a decreased ability to distinguish positive samples, ultimately affecting the recall rate. This phenomenon suggests that simple, single-source visual features may be insufficient for precise relation extraction. In contrast, the aggregation mechanism of group-level information can effectively reduce noise and improve the model’s discriminative capability.
w/o Coref: When we replaced Coref-BERT-Base with BERT-Base, the F1 score dropped by 0.47%. This result suggests that Coref-BERT exhibits greater robustness and adaptability in extracting textual features from social media data. The primary advantage of Coref-BERT lies in its ability to effectively model coreference resolution, thereby enhancing cross-sentence information capture. Social media text is often short and contains numerous coreference phenomena, such as pronouns (he/she/it) or omitted entity references. BERT-Base may struggle to establish global context across sentences when processing such data, whereas Coref-BERT, by strengthening coreference resolution capabilities, enables the model to more accurately understand relational structures within the text.
w/o Group + Coref: When both the group-level information module and CorefBERT were removed simultaneously, the precision (P) dropped by 2.11%, recall (R) decreased by 2.32%, and the F1 score declined by 2.21%. Such a severe performance degradation highlights the critical role of these two modules in the model. Moreover, their ability to collaborate effectively ensures optimal task performance.
4.5. Low Resource Scenario
This section further evaluates the model’s performance in low-resource environments by randomly sampling between 5% and 50% of the original training data to create low-resource training sets.
Figure 6 presents the performance comparison of the proposed method with other baselines on the MNER task, while
Figure 7 illustrates the comparison on the MRE task.
First, it is evident that in low-resource scenarios, multimodal models consistently outperform text-only models, confirming that incorporating image information is beneficial for both named entity recognition and relation extraction tasks. Second, in the MRE task, the proposed method achieves a significant improvement over the HVPNeT model. This substantial performance boost is attributed to the approach of integrating visual information as a prefix into the transformer, which effectively mitigates modality differences. Finally, the results demonstrate that the proposed method outperforms other baselines across different low-resource settings. This further validates its effectiveness in leveraging multimodal data while reducing modality noise, ensuring more robust performance even with limited training samples.
4.6. Cross-Task Scenario
Table 5 presents a comparison of the generalization ability of the proposed method with other models in cross-task scenarios. The first section, Twitter2017-MNRE, refers to models trained on Twitter-2017 that are subsequently trained and tested on MNRE. The second section, MNRE-Twitter2017, indicates models trained on MNRE that are then further trained and tested on Twitter-2017.
From the second column, it can be observed that the proposed method achieves the highest F1 scores, with minimal fluctuation after cross-task adaptation, and even exhibits slight improvements. This is mainly because both datasets originate from the same source, and the MNER task not only enhances the model’s ability to represent entities but also leverages multimodal data to better locate entities. This, in turn, improves contextual understanding and reasoning, benefiting the subsequent MRE task.
In contrast, the third column shows a decline in performance after cross-task adaptation, though the results remain competitive. This decline occurs because MRE training is more focused on global relation modeling, which weakens the model’s ability to finely recognize entity boundaries and categories, leading to a performance drop when transferred to MNER.
Despite these variations, the proposed method consistently achieves strong F1 scores with minimal fluctuations in both transfer settings, demonstrating its robustness and effectiveness in cross-task adaptation.
4.7. Training Cost
To analyze computational complexity and extraction speed, we measured the classic HVPNeT model and the proposed method. The FLOPs of HVPNeT are 33.4 GFLOPs, and its number of parameters is 138 M. The FLOPs of the proposed model are 60.5 GFLOPs, and its number of parameters is 203 M. Compared to HVPNeT, our model has increased computational complexity and a higher parameter count, mainly due to the choice of visual encoder. HVPNeT employs ResNet, a convolutional neural network with 4.1 GFLOPs. In contrast, our model utilizes VIT, which relies on a multi-head self-attention mechanism and requires approximately 17 GFLOPs, resulting in significantly higher computational overhead. Additionally, ResNet has 25.6 M parameters, whereas VIT has 88 M, which increases the computational load for each forward and backward pass, thereby extending the training time.
Table 6 presents the training time for both models on the Twitter-2017 and MNRE datasets. Experimental results show that the average training time of our method is approximately 2.45 times that of HVPNeT. Although the computational cost is higher, the model achieves improvements in accuracy and expressiveness, allowing it to better handle complex linguistic structures and diverse relationship types, thereby offering greater practical value and application potential.
4.8. Case Study
To provide a more intuitive comparison of the experimental results from different methods, this section selects several case studies from the MNER and MRE tasks for testing, as shown in
Table 7. The methods compared include the pure text method BERT-CRF, the latest high-performance multimodal method MEGA, and the proposed method.
In the MNER case, all three methods successfully recognized Entity 2, but only the proposed method was able to identify Entity 1, which can be attributed to the effectiveness of the group-level feature approach in utilizing image data.
In the first MRE case, it is evident that images can complement the textual information. The results show that the pure text method predicted the wrong relationship category, while both multimodal methods predicted it correctly, highlighting the necessity of adding image information for this task.
In the second MRE case, it is notable that the textual and image correlation is minimal. The pure text method correctly predicted the relationship semantically, but the MEGA method processed the image information as noise, interfering with the correct prediction of the relationship. However, the proposed method correctly predicted the relationship, further demonstrating its improvement in addressing modal noise and modality gaps.
5. Conclusions
To address the two major challenges of modality noise and modality gaps in existing methods for MNER and MRE tasks, this paper proposes a new model. The model efficiently performs image semantic representation and integrates both image and text information, thereby compensating for the limitations of traditional methods in terms of missing information and improving the accuracy of entity relation extraction results. The model is based on the image encoder VIT from the CLIP model, which is trained on large-scale image-text pairs, and introduces group-level information to fully leverage image data and enhance the richness of spatial features. Through an adaptive module, the image information is incorporated as a prefix into each self-attention layer of the CorefBERT model, effectively reducing modality noise and lowering the difficulty of cross-modal information fusion. The effectiveness of the proposed method is validated through comparisons with baseline models on standard datasets and through ablation experiments. A remaining limitation of the method is its relatively high training cost.
This method improves the accuracy of multimodal named entity recognition and multimodal relation extraction, thereby facilitating the automatic construction and optimization of knowledge graphs. In practical applications, the social media knowledge graph built based on this method can provide more precise semantic understanding and knowledge support for tasks such as public opinion analysis, public safety monitoring, and social recommendations. For example, in public opinion analysis, this method can more accurately identify key entities and their relationships in trending events, helping governments and businesses stay informed about public sentiment in real time and make timely decisions. In public safety monitoring, this method can detect abnormal behavior patterns on social platforms, assisting in tasks such as fraud warning and online violence management. In social recommendations, the method can uncover potential interest groups based on multimodal relationships between users and optimize information delivery strategies. Thus, this method holds significant potential value for practical applications.
For future work, we plan to explore and refine the model using datasets from other domains, with a focus on enhancing its adaptability to complex tasks. Additionally, we aim to extend this approach to knowledge graph construction and related areas, investigating the potential of cross-modal learning for building more comprehensive knowledge representations. This will contribute to the advancement of artificial intelligence in a broader range of applications. Furthermore, given the computational complexity of the model, we will explore strategies to optimize resource efficiency, such as model pruning, quantization, and knowledge distillation, to reduce training and inference costs. These optimizations will improve the model’s efficiency and feasibility for real-world deployment.