MPCFN: A Multilevel Predictive Cross-Fusion Network for Multimodal Named Entity Recognition in Social Media

Qinjun Qiu; Bo Tan; Yukuan Zhou; Wenjing Chen; Miao Tian; Liufeng Tao

doi:10.3390/app152211855

,

and

¹

School of Computer Science, China University of Geosciences, Wuhan 430074, China

²

Yunnan Key Laboratory of Intelligent-Monitoring and Spatiotemporal Big Data Governance of Natural Resources, Kunming 650051, China

³

Yunnan Institute of Geology and Mineral Surveying and Mapping Co., Ltd., Kunming 650051, China

⁴

The Land and Resources Information Center, Department of Natural Resources of Yunnan Province, Kunming 650224, China

Appl. Sci.2025, 15(22), 11855;https://doi.org/10.3390/app152211855

Version Notes

Order Reprints

Abstract

The goal of the Multimodal Named Entity Recognition (MNER) job is to identify and classify named entities by combining various data modalities (such as text and images) and assigning them to specified categories. The growing prevalence of multimodal social media posts has spurred heightened interest in MNER, particularly due to its pivotal role in applications ranging from intention comprehension to personalized user recommendations. In the MNER task, the inconsistency between image information and text information and the difficulty of fully utilizing the image information to complement the text information are the two main difficulties currently faced. In order to solve these problems, this study proposes a Multilevel Predictive Cross-Fusion Network (MPCFN) approach for Multimodal Named Entity Recognition. First, textual features are extracted using BERT and visual features are extracted using ResNet, then irrelevant information in the image is filtered using the Correlation Prediction Gate. Second, the hierarchy of visual features received by each Transformer block is controlled by the Dynamic Gate and aligned between image and textual features using the Cross-Fusion Module to align the image and text features. Finally, the hidden layer representation is fed into the CRF layer optimized for decoding using Flooding. Through experiments on TWITTER-2015, TWITTER-2017, and WuKong datasets, our method achieves F1 scores of 76.74%, 87.61%, and 82.35%, outperforming the existing mainstream state-of-the-art models and proving the effectiveness and superiority of our method.

Keywords:

multimodality; multilevel predictive cross-fusion network; named entity recognition; social media

1. Introduction

Named Entity Recognition (NER) is a crucial information extraction method that seeks to recognize and categorize textual entities with particular meanings, such as the names of individuals, locations, businesses, dates, and times, among others [,]. By recognizing named entities in unstructured free-form text, NER plays an important role in several domains, including information retrieval, relational extraction, question and answer systems, machine translation, and text categorization []. Currently, mainstream text-based NER algorithms produce considerable results when processing well-formed text. They usually use CNN, LSTM, or Transformer as encoders to learn the contextual representation of input words and decode them with softmax or CRF [,].

However, using traditional NER methods for named entity recognition of text on social media like Twitter and Facebook is often less effective due to the fact that Twitter tweets usually contain a large number of spelling mistakes, abbreviations, slang, emoticons, and other non-standard languages [,]. These noises make it difficult to accurately recognize and classify named entities by traditional NER methods, which usually rely on well-formed text. In addition, the character limitation of Twitter tweets results in very limited contextual information per tweet, and much of the textual content can only be understood in combination with visual context [].

To address the challenges faced by traditional NER methods in complex textual environments, MNER has been proposed. The objective of MNER is to recognize and classify entities in post by utilizing associated images. Existing MNER methods are broadly categorized into two types: one class of methods inputs both the text and the entire image, encoding the implicit representation of each word together [,]; the other class aligns text vectors with visual object features to achieve more comprehensive semantic representations of words [,,,,,]. All these approaches demonstrate that augmenting linguistic representations with visual information enables MNER to achieve outperforms in comparison with traditional NER methods in NER tasks. Figure 1 provides two exemplars of MNER. Specifically, Figure 1a,b represent examples of MNER in English and Chinese, respectively. The objective of MNER is to accurately discern the correct entity types based on the provided image and text. For instance, in Figure 1a, it is expected to recognize that NFL, Patrick Willis, and Silicon Valley belong to the category of organization names (ORG type), person names (PER type), and place names (LOC type), respectively.

Figure 1. Two examples for MNER. (a,b) represent examples of MNER in English and Chinese, respectively. The text in different colors represents different entity categories.

Although prior methodologies have proved the efficacy of MNER, two major problems remain. First, there may be discrepancies between the text and accompanying images, with erroneous image information potentially disrupting entity recognition [,]. Second, employing global picture representations ignores the fine semantic alignment between images and text. Additionally, relying solely on object-level features can be problematic, as some images may contain misleading objects or lack explicit objects altogether.

To tackle these issues, we introduce a new MNER method that leverages a hierarchical visual prefix fusion network. Our approach begins with the extraction of unimodal features from text and images using BERT and ResNet, respectively. Subsequently, we employ a Correlation Prediction Gate to screen out irrelevant information in the image, applying adaptive weights to visual features. These weights are dynamically adjusted according to the relationship between the image and the text. Furthermore, a Dynamic Gate mechanism controls the hierarchical processing of visual features within each Transformer block, enabling each block to learn different levels of text and visual information. Finally, the hidden layer representations are fed into a CRF layer, which is optimized using the Flooding method for decoding. We evaluate our method on two public English datasets and one Chinese dataset, and the experimental results demonstrate the effectiveness and advancements of our proposed approach.

The contributions of this paper are summarized as follows:

(1): We present a Multilevel Predictive Cross-Fusion Network model to enhance MNER’s performance on social media. This model predicts the correlation between text and images to mitigate irrelevant image interference and employs cross-fusion instead of visual prefix fusion for integrating multimodal information.
(2): To address overfitting associated with model complexity, this study employs a technique known as Flooding to optimize the Conditional Random Field (CRF) layer. This approach effectively mitigates overfitting by modifying the CRF layer’s loss function, thereby enhancing the model’s generalization capability. By incorporating Flooding technology, we further improve the model’s robustness in managing complex sequence annotation tasks.
(3): Through extensive experiments and analysis, we show that our model performs competitively compared to existing state-of-the-art models.

In Section 2, we provide a comprehensive review of prior research on NER and MNER. We first present a detailed overview of traditional NER methods and their limitations, followed by an analysis of the advantages of multimodal approaches and the findings of previous studies. In Section 3, we provide a detailed description of our proposed MNER method, focusing on the implementation details of correlation prediction between image–text pairs and the multimodal information fusion strategy. Section 4 describes the datasets and evaluation criteria employed in this work, as well as providing a thorough analysis of the experimental results. Finally, Section 5 summarizes the key findings of the research and suggests potential directions for future work.

2. Related Work

NER plays a vital role in natural language processing, with applications in areas such as knowledge graphs, social media analysis, and financial market forecasting. Early research on NER primarily focused on rule-based approaches, where domain experts manually crafted rules and employed pattern-matching techniques to identify entities within posts. For instance, Kim et al. (2000) [] introduced the Brill rule inference method for processing speech input. However, rule-based NER methods exhibit several limitations, including difficulties in handling complex contexts, high maintenance costs, limited generalizability, and challenges in addressing semantic complexity.

Traditional machine learning methods for NER include the Maximum Entropy Model (MEM) [], Hidden Markov Model (HMM) [], Support Vector Machine (SVM) [], and Conditional Random Fields (CRFs) []. These methods rely on manually designed features to extract entities from text. For example, Borthwick et al. (1998) [] used the MEM method for entity recognition; McCallum et al. (2003) [] proposed the application of a CRF model in NER. However, these approaches have high demands for feature selection, necessitating the extraction of various task-specific features from the text, and they struggle with handling long-distance dependencies effectively.

To address the limitations of traditional methods, deep learning techniques have been employed to enhance NER accuracy []. Researchers have utilized neural networks such as Bidirectional Long Short-Term Memory (BiLSTM) [] and Convolutional Neural Networks (CNN) [] to encode words into vectors, subsequently using decoders like CRF [], LSTM [], and Transformers [] to generate labels. The introduction of pre-trained language models like BERT [] and GPT [,] has significantly enhanced NER performance. These models capture comprehensive language features through pre-training on extensive corpora and can be fine-tuned for specific NER tasks. Nevertheless, they still face challenges in comprehending multimodal data, such as informal and ambiguous text found in social media posts.

Previous Research on Multimodal NER

The MNER model has been proposed to improve text-based NER models while also leveraging multimodal data. The MNER task seeks to identify and classify named entities within text that is paired with images. This approach emphasizes aligning textual elements with corresponding regions in images and integrating both textual and visual contexts to improve entity recognition performance.

Some MNER approaches directly utilize global images []. For instance, Lu et al. (2018) [] and Moon et al. (2018) [] use modal attention networks to combine textual and visual information in the BiLSTM layer. However, it is important to note that not all visual regions in an image contribute equally to enhancing the accuracy of model predictions. To solve this problem, some researchers make use of convolutional architectures (such as ResNet) to evenly segment the entire image into multiple visual regions, and then use multimodal interaction modules with attention mechanisms to select visual regions related to the text. For example, Lu et al. (2018) [] used a pre-trained ResNet model to extract visual areas, which were subsequently merged into word embeddings using a visual attention algorithm. Zhang et al. (2018) [] employed an adaptive co-attention network to merge text embeddings with visual region representations. Yu et al. (2020) [] developed a multimodal interaction module that captures both image-perceived word representations and word-perceived visual representations while also using a pure text entity span detection module as an auxiliary component to reduce visual bias. Chen et al. (2021) [] obtained the final multimodal representation through attention-guided visual layers using an external knowledge database. Due to the large proportion of unrelated text–image pairs in the dataset, Sun et al. (2021) [] incorporated text–image relationship classification as a sub-task to evaluate the value of image features and used an improved BERT encoder to collect modality fusion information. Additionally, tackling the challenge of precise semantic matching between objects in images and words in post, Liu et al. (2024) [] carried out cross-modal semantic interactions between text and vision at different visual granularities and enhanced word representations in post via semantic enrichment.

To more effectively mitigate the influence of irrelevant parts of an image, some researchers employ object detection techniques, such as Mask R-CNN, to extract visual objects from the entire image and treat these objects as local visual information interacting with text. Mask R-CNN was used by Wu et al. (2020) [] to detect objects. The top k items were embedded into vectors and integrated with text features using dense common attention modules. To address the issue of potentially missing relevant image information caused by relying solely on local images, Chen et al. (2022) [] incorporated both local and global images obtained through object detection into their model, employing a hierarchical network to fuse text and image information. The interactions between text and image representations are difficult to model; thus, they are frequently trained independently within each modality and out of alignment inside a single space. Wang et al. (2022) [] and Wang et al. (2022) [] addressed this issue by utilizing external databases to convert images into text, thereby enhancing the effectiveness of text-based pre-trained models. Due to multigranular cross-modal representation learning and cross-modal semantic interaction between text and vision at different visual granularities, Liu et al. (2024) [] obtained effective multimodal guided representations for each word, experimentally demonstrating that the results on TWITTER-2015 and TWITTER-2017 outperform the current performance.

In summary, although existing research on Multimodal Named Entity Recognition (MNER) has made significant progress in multimodal fusion, two critical issues still need to be addressed urgently:

(1): Unbalanced feature alignment granularity: Most models (e.g., RpBERT [], HVPNeT []) either rely solely on global image features or overemphasize local object features. They fail to achieve hierarchical collaboration and dynamic adaptation of “global–local” visual features, which limits the semantic matching accuracy between textual entities and visual regions.
(2): Insufficient visual noise-filtering capability: Existing methods (e.g., MGCMT []) mostly use attention mechanisms in the late stage of feature fusion to select effective visual information, lacking early targeted noise-filtering designs. As a result, irrelevant visual noise tends to propagate continuously during model training, increasing the risk of entity misrecognition.

To address these issues, this paper proposes a Multilevel Predictive Cross-Fusion Network (MPCFN). Through an innovative architecture of “early noise filtering based on a Correlation Prediction Gate—dynamic gating for hierarchical visual feature allocation—cross-fusion for bidirectional modal alignment”, this network specifically addresses the aforementioned research gaps.

3. Methodology

We start this section by describing the MNER problem and provide an overview of our suggested approach. Then, we give a thorough explanation of the precise implementation procedure of our system, utilizing the picture and sequence shown in Figure 2 as an example.

Figure 2. Overall framework of our proposed methodology: Stage 1: Extract text features by using BERT and visual features by using ResNet. Stage 2: Utilize a Correlation Prediction Gate to predict the relationships between text and image, and apply adaptive weights to the visual features based on these predictions. Stage 3: Predict a normalized vector

G

to control the level of visual features each block receives via the Dynamic Gate. Stage 4: Aligning image and text features through cross-fusion. Stage 5: A CRF layer optimized with Flooding receives the representation from the hidden layer for decoding.

Task Definition: Given an input pair consisting of a text sentence

T

and an associated image

V

, the goal of MNER is to extract a set of entities from

T

and classify each extracted entity into one of the pre-defined types. Similar to most existing MNER approaches, we frame this task as a sequence-labeling problem. Let

T = (t_{1}, t_{2}, \dots, t_{n})

denote the sequence of input tokens, and

y = (y_{1}, y_{2}, \dots, y_{n})

represent the corresponding sequence of labels.

3.1. Overall Framework

The overall architecture of our model is shown in Figure 2. For clarity, the model is divided into 5 main components: (1) text and visual feature extraction, (2) correlation prediction, (3) the Dynamic Gate, (4) cross-fusion, (5) CRF and Flooding layer.

Firstly, we employ the BERT model for comprehensive feature extraction from the text data. Utilizing the visual grounding toolkit, we extract local object information from the raw images. Subsequently, we encode these global images and local objects using the ResNet architecture to derive rich visual features. Additionally, we have designed and implemented an adaptive matrix that dynamically learns during the model training process, filtering visual features based on the relevance between the image and text pairs. In addition, we maintain a dynamically updated normalized vector

G

, which assigns filtered visual and textual features to each Transformer layer with varying weights. This approach facilitates hierarchical integration and fusion of visual and textual information. During the decoding stage, the CRF layer is employed to further refine the output. To mitigate overfitting due to high model complexity, we have incorporated a technique called Flooding, enhancing the model’s generalization and stability. Through these meticulously designed methods, our model achieves exceptional performance in the joint representation learning of images and text.

3.2. Stage 1: Text and Visual Feature Extraction

In this work, each text sequence

X = \{x_{0}, x_{1}, \dots, x_{n}\}

is fed into a pre-trained BERT [] to obtain the sequence representations

T = \{T_{0}, T_{1}, \dots, T_{n}\}

, where

T_{i} \in R^{d}

is the extracted word representation for

x_{i}

, and

x_{0}

and

x_{n}

denote the two inserted special tokens “[CLS]” and “[SEP]” that denote the start and end position of a sentence.

T_{i} = B E R T (x_{i}; θ^{bert}) \in R^{d}

(1)

The BERT parameter is represented by

θ^{b e r t}

. In particular,

T_{i}

is derived by summing these sub-tokens if the tokenizer divides

x_{i}

into numerous sub-tokens.

We use the visual grounding tools to extract the regions of a picture that contain entities given the image []. Then, we transform the initial image and the divided image to

224 \times 224

pixels as the hierarchical image

O = \{o_{1}, o_{2}, \dots, o_{n}\}

. Visual representations are extracted through the pre-trained ResNet.

v_{i} = R e s N e t (o_{i}; θ^{res}) \in R^{d}

(2)

where

θ^{res}

is the ResNet parameter and

o_{i}

is a 224 × 224 pixels image for the

i

-th hierarchical image.

3.3. Stage 2: Correlation Prediction

Visual cues that are not directly related to the text can introduce uncertainty or even negatively impact multimodal model learning. Therefore, prior to incorporating visual features, we filter them based on their relevance to the corresponding text. Given a hierarchical image feature

v_{i}

, we integrate it with the text feature

T

and input the combined feature into BERT to derive the joint feature []. The Correlation Prediction Gate is defined as a softmax function, which inputs the joint features into the gate to obtain a relationship matrix that controls the visual feature weights.

G_{m a t r i x} = s o f t m a x ([v_{i}, T])

(3)

where

v_{i}

is the visual features,

T

is the text features, [⋅,⋅] indicates the concatenation operation, and

G_{m a t r i x}

indicates the relational matrix.

Then, we apply weights to visual features

v_{i}

using the relational matrix

G_{m a t r i x}

to derive visual representations that are pertinent to the textual content. The visual representations correlated with the text, denoted as

V^{'}

, are computed through the following method:

V^{'} = \{v_{i} G_{m a t r i x}| i = 0, \dots, n} = {v_{i}^{'} | i = 0, \dots, n}

(4)

where

v_{i}

indicates initial visual features.

3.4. Stage 3: Dynamic Gate

Path decision-making can be viewed as this module’s function, and the Dynamic Gate’s prediction of normalized vector G governs the hierarchy of visual features that each Transformer block extracts. First, we create the gate signals’ logits,

α^{(l)}

[]:

α^{(l)} = f (W_{l} (\frac{1}{c} \sum_{i = 1}^{c} P (V_{i})))

(5)

where

f (\cdot)

represents the activate function Leaky_ReLU, and

P

represents the global average pooling layer. We use the MLP layer

W_{l}

to reduce the feature dimension by

c

and implement a soft gate by generating continuous values as path probabilities. The calculation formula for

G

is as follows:

G = \{s o f t m a x (α^{(l)})| l = 0, \dots, 12} = {G_{l} | l = 0, \dots, 12}

(6)

where

G_{l}

indicates the probability vector of

l

-th Transformer block.

In order to assign appropriate visual features to each Transformer layer, we use the Dynamic Gate to link the visual features

{V_{i}}^{'}

of each image in

V^{'}

to all Transformer layers, and apply adaptive weights

G_{i}

during the linking process. The visual feature

V_{G a t e}

selected by the Dynamic Gate is calculated as follows:

V_{G a t e} = \{[{v_{0}^{'}, v_{1}^{'}, \dots, v}_{n}^{'}] G_{i}| G_{i} \in G}

(7)

where

v_{i}

is the visual features, and [⋅, ] indicates the concatenation operation.

3.5. Stage 4: Cross-Fusion

While Transformer layers are capable of capturing which contextual words are more relevant to the prediction of input words, they fail to incorporate relevant visual context, resulting in underutilization of image information. To address this, we employ a cross-fusion approach that ensures equal consideration of text and image information from the initial fusion stage. This approach allows each word to learn not only a text-based representation informed by visual context but also a visual representation enhanced by textual information, thereby improving the model’s integrated understanding of both text and visual data. Consequently, the model can more accurately predict words closely associated with each visual segment, thereby enhancing the effectiveness of multimodal information fusion.

Image-Aware Word Representation: To enhance word representation through associated images, we utilize a multihead cross-modal attention mechanism []. This mechanism treats

V_{G a t e} \in R^{d \times 49}

as queries, and

T \in R^{d \times (d \times 49)}

as keys and values:

{C A}_{i} (V_{G a t e}, T) = s o f t m a x (\frac{{[W_{q_{i}} V_{G a t e}]}^{⊤} [W_{k_{i}} T]}{\sqrt{d / m}}) {[W_{v_{i}} T]}^{⊤}

(8)

C A (V_{G a t e}, T) = W^{'} {[{C A}_{1} (V_{G a t e}, T), \dots, {C A}_{12} (V_{G a t e}, T)]}^{⊤}

(9)

where

{W_{q_{i}}, W_{k_{i}}, W_{v_{i}}}

and

W^{'}

indicate the weight matrices for the query, key, value, and multihead attention, respectively.

{C A}_{i}

denotes the

i

-th head of cross-modal attention. We then add three further sub-layers on top of that:

\begin{matrix} \hat{T C} = L N (V_{G a t e} + C A (V_{G a t e}, T)) \end{matrix}

(10)

T C = L N (\hat{T C} + F F N (\hat{T C}))

(11)

where FFN refers to the feed-forward network, LN denotes layer normalization [,], and

T C

is the output representations of the cross-modal Transformer layer.

Word-Aware Visual Representation: To capture visual representations aligned with each word, we also employ the CMT layer, which uses

T

as queries and

V_{G a t e}

as keys and values, creating a symmetric variant of the previously described CMT layer. This approach yields a word-specific visual representation, denoted as

V C

, effectively aligning each word with its closely related visual blocks.

3.6. Stage 5: CRF and Flooding Layer

CRF Layer: To combine textual and visual representations, we concatenate the textual representations

T C

and the visual representations

V C

to form the final hidden representation

H = (t_{0}, t_{1}, \dots, v_{0}, v_{1}, \dots)

, where

t_{i}

denotes the

i_th

textual representation and

v_{i}

denotes the

i_th

visual representation []. We input

H

into a standard Conditional Random Field layer that defines the probability of a sequence of labels y given an input sentence

S

and its associated image

V

:

P (y∣ S, V) = \frac{\exp (score (H, y))}{\sum_{y^{'}} \exp (score (H, y^{'}))}

(12)

s c o r e (H, y) = \sum_{i = 0}^{n} T_{y_{i}, y_{i + 1}} + \sum_{i = 1}^{n} E_{h_{i}, y_{i}}

(13)

where

T_{y_{i}, y_{i + 1}}

is denoted as the transition score from label

y_{i}

to label

y_{i + 1}

, and

E_{h_{i}, y_{i}}

is denoted as the emission score for the i-th word corresponding to label

y_{i}

.

Flooding Layer: Over-parameterized deep networks can cause the training loss to continue to approach zero, making the model overconfident and degrading test performance. To address this problem, we use a solution called Flooding [], where a baseline

b

is set, and, when the training loss reaches the value of the baseline, the gradient rises, preventing further reduction of the training loss. If the original learning objective is

L

, the proposed modified learning objective

L^{'}

is

L^{'} (θ) = | L (θ) - b | + b

(14)

where

b > 0

denotes the flood level set by the user, and

θ

represents the model parameter.

4. Experiments

4.1. Dataset

In our experiments, we use three datasets to evaluate the performance: two English benchmark MNER datasets TWITTER-2015 (Zhang et al., 2018 []) and TWITTER-2017 (Lu et al., 2018 []) and one Chinese dataset WuKong (Gu et al., 2022 []).

TWITTER-2015 and TWITTER-2017: TWITTER-2015 and TWITTER-2017 contain 8257 and 4819 tweet data, respectively, and the dataset sequence annotation method is the BIO tagging scheme. The statistical details of the dataset are provided in Table 1.

Table 1. The basic statistics of our three datasets.

WuKong: This Chinese dataset is constructed from data sourced from the Chinese Internet, featuring 55,423 image–text pairs labeled using the BMES tagging scheme. It includes four entity types: Person, Location, Organization, and Geo-Political Entity. Statistical details of the dataset are provided in Table 1.

4.2. Evaluation Metric

We use precision (P), recall (R), and F1 score (F1) as measures, and we apply exact match assessment, where a named entity is considered correctly identified only if its borders and type match the ground truth []. Based on the numbers of true positives (

T P

), false positives (

F P

), and false negatives (

F N

), precision, recall, and F1 score are computed. The detailed calculation is as follows:

Precision = \frac{T P}{T P + F P}

(15)

Recall = \frac{T P}{T P + F N}

(16)

Precision assesses how well the NER system identifies only the correct entities, while recall assesses the system’s capability to identify all entities within the corpus.

F1-score = 2 \times \frac{Precision \times Recall}{Precision + Recall}

(17)

The harmonic mean of recall and precision is the F1 score. Given that the dataset includes multiple entity types, we use the micro-averaged F-score to ensure all entities are treated equally.

4.3. Baselines

We performed a thorough comparison of our model against various baseline models to highlight its superiority. The models compared are categorized into two types: text-based approaches and multimodal approaches.

Text-based approaches:

We utilize BERT (Devlin et al., 2018 []) as a language model for MNER with a softmax decoder, as it has been pre-trained on a substantial number of unlabeled text data.
An extension of BERT, BERT-CRF uses a normal CRF layer as the decoder rather than a softmax layer.

Multimodal approaches:

UMT (Yu et al., 2020 []): In order to acquire both Image-Aware Word Representations and visual representations inspired by the words, UMT (Yu et al., 2020) uses a multimodal interaction module.
RpBERT (Sun et al., 2021 []) adopts a method of text–image relation propagation to select visual clues.
UMGF (Zhang et al., 2021 []) presents a unified multimodal graph fusion approach for MNER.
HVPNeT (Chen et al., 2022 []) introduces a dynamic gated aggregation method to obtain hierarchical multiscaled visual features, which are used as visual prefixes for fusion.
GMNER (Yu et al., 2023 []) introduces a hierarchical indexing framework called H-Index, which generates entity–type–region triples hierarchically using a sequence-to-sequence model.
MGCMT (Liu et al., 2024 []) improves word representation through semantic enhancement and cross-modal interaction at various levels, achieving effective multimodal guidance for each word.

4.4. Experiment Configuration

For the two English datasets, TWITTER-2015 and TWITTER-2017, the hyperparameters are the same for us. The batch size, the input text sequence’s maximum length, the dropout rate, the learning rate, and epochs are, respectively, set to 8, 147, 0.1, 3 × 10^–5, and 50. During the training process, the weight decay parameters for the text, image, and CRF are set to 1 × 10^–2, 5 × 10^–2, 1 × 10^–2, respectively. The text representations

T

are encoded using the BERT–base–uncased model pre-trained by Devlin et al. (2018) []. The visual representations

V

are encoded with ResNet-101. When applying the Flooding method to the TWITTER-15 and TWITTER-17 datasets, we designated the parameter

b

as 0.3 and 0.5, correspondingly.

For the WuKong dataset, we adjusted the learning rate, the number of epochs, and

b

to 1 × 10^–5, 10, and 0.6, respectively, and used BERT–base–Chinese as the text encoder. Other parameters were consistent with TWITTER-2015 and TWITTER-2017. When testing other baseline models on the WuKong dataset, we utilized the default parameters of the original models, merely substituting the English pre-trained models with their Chinese counterparts. This modification was implemented to assess the performance variance of the models in different linguistic contexts.

5. Result

5.1. Ablation Study

To examine the contributions of key components in our model, we conducted a component study. This study presents a comparative analysis of the full model and its variants with different combinations of the Correlation Prediction (CP), Cross-Fusion (CF), and Flooding Modules integrated (i.e., analyzing the model performance when each component is included or excluded). As shown in Table 2, we observed the following:

Table 2. Experimental results of the ablation study.

(1): When the model is not integrated with all components, its performance is lower compared to the full model that includes all components. However, even in such partial-component integration scenarios, the model’s F1 score still exceeds that of the text-only baseline model, as well as that of most existing MNER methods. This finding fully demonstrates that each component in our model plays an indispensable key role in enhancing the final entity recognition results.
(2): For the English datasets TWITTER-15 and TWITTER-17, the model performance significantly drops when the Flooding module is not integrated, indicating that integrating the Flooding module plays a crucial role in model training. In contrast, for the Chinese dataset WuKong, the performance decline (when the Flooding module is excluded) is less pronounced. This is because the WuKong dataset is over three times larger than the TWITTER datasets, making the data more diverse and reducing the likelihood of overfitting. Consequently, the contribution of integrating the Flooding module is less significant in this case.
(3): The model experiences a not insignificant performance degradation when neither the Correlation Prediction Module nor the Cross-Fusion Module is integrated. This result suggests that integrating the Correlation Prediction Module (for filtering extraneous images from the data) and integrating the Cross-Fusion Module (for image fusion) are both beneficial for the NER task.

5.2. Analysis of Results Compared to Existing Models

Table 3 presents the experimental results of the MPCFN and all baselines on three datasets. From the experimental results, we can observe the following:

Table 3. Performance results of different MNER approaches.

(1): Comparing SOTA multimodal methods with text-based unimodal approaches reveals that multimodal methods generally perform better, suggesting that incorporating additional visual information is often beneficial for NER tasks.
(2): Comparing current multimodal models with text-based unimodal methods shows that multimodal approaches generally perform better on English datasets like TWITTER-2015 and TWITTER-2017, with a maximum F1 score improvement of about 3.5%. However, on the Chinese dataset WuKong, the maximum performance improvement is approximately 1% (for instance, when comparing RpBERT with BERT), and, in some cases, the performance of certain multimodal models (such as UMGF and MGCMT) is not superior to text-based methods. This highlights the need for further enhancement of multimodal models.
(3): When comparing text-based unimodal methods with existing multimodal approaches, the proposed Multilevel Predictive Cross-Fusion Network demonstrates substantial performance improvements across three distinct datasets. Specifically, our model achieves improvements of 4.93%, 4.17%, and 3.44% over text-based unimodal methods on these datasets. Furthermore, relative to current state-of-the-art methods, our model yields performance gains of 1.42%, 0.74%, and 2.12%, respectively.

The observed improvement in model performance can be attributed to several factors. First, unlike models such as RpBERT that rely solely on global image features, our approach incorporates additional fine-grained image features for interaction with text, enhancing the alignment of text and image features. Second, whereas models like HVPNeT introduce a Visual Gate to regulate visual features only in the later stages of training, our model employs correlation prediction to filter visual object features at all levels early in the training process. This early-stage screening effectively reduces irrelevant information and minimizes the risk of inadvertently discarding useful visual features by applying a layered filtering approach. These strategies collectively contribute to the exceptional performance of the Multilevel Predictive Cross-Fusion Network in multimodal learning tasks.

5.3. Parameter Sensitivity Analysis

The choice of b in Flooding layer. During training, the parameter

b

is crucial for setting the lower bound of the loss function. If

b

is too high, the model may struggle to achieve convergence. Conversely, if

b

is too low, it may impair the model’s ability to mitigate overconfident predictions, thereby impacting performance on the test set. In order to balance this contradiction, we conducted a series of experiments and parameterized three different datasets with the experimental results. Specifically, we set

b

to 0.3, 0.5, and 0.6, respectively. As shown in Table 4, these parameter settings are based on a careful analysis of multiple rounds of experiments.

Table 4. The test for different values of parameter

b

in our model.

Unimodal learning. This study is dedicated to exploring the specific contribution of different data modalities to the performance of MNER models. By evaluating the unimodal learning effects on text, global images, and local objects, we find that they all have a significant impact on model performance. Table 5 shows that removing either the global image or local objects causes a notable drop in model performance, with similar performance declines observed for both cases. This finding further validates that, in visual scenes, the relative position information of local objects and the global image information complement each other and together play a key role in improving the model performance.

Table 5. The test for different modes in the unimodal learning.

5.4. Case Study

Figure 3 presents a qualitative comparison of our proposed MPCFN model against two baseline methods, UMT and MGCMT, in the MNER task. The case studies illustrate how the MPCFN effectively addresses common error types and demonstrates superior recognition capabilities by leveraging multimodal information.

Figure 3. The case comparisons among UMT, MGCMT, and the MPCFN (ours).

Key advantages of the MPCFN include the following:

Accurate Boundary Recognition. In the first case, the MPCFN correctly identifies the full organization name “JKF Ag” as an ORG, while the baseline models only recognize the partial entity “JKF”. This demonstrates the MPCFN’s precision in determining entity boundaries.

Robustness to Complex Entities. For the phrase “Saudi Oil Minister”, the MPCFN accurately recognizes “Saudi” as an LOC, whereas the baselines misclassify the entire phrase as an ORG. This shows the MPCFN’s ability to disambiguate and classify compound entities.

Accurate Entity Identification with Visual Context. In the case of “Sandy from kfc”, the accompanying image contains a prominent, clear KFC logo in the background, which features the iconic portrait of Colonel Sanders. The MPCFN is the only model that correctly identifies “kfc” as an organization (ORG), by effectively utilizing this decisive visual cue. In contrast, MGCMT misclassifies it as a location (LOC), and UMT fails to recognize it as an entity (O). In the tweet “RT @BiebsVogue: justin is probably looking at his twitter like this”, the MPCFN successfully identifies “justin” as a PER (person) by effectively utilizing the accompanying image. In contrast, MGCMT fails to recognize this entity due to insufficient visual–text alignment. This highlights the MPCFN’s superior ability to integrate visual cues for accurate entity recognition.

Accurate entity and non-entity judgment. The last example, featuring the song title “Drag Me Down”, provides a critical test of the model’s ability to leverage contextual cues. While the baseline models (UMT, MGCMT) treated it as a non-entity, our MPCFN model correctly classified it as MISC. This case underscores the MPCFN’s advantage in interpreting nuanced and non-standard entity mentions.

These examples validate that the MPCFN achieves more precise and robust performance by effectively fusing visual and textual information, leading to fewer errors in entity boundary recognition, type classification, and identification compared to existing approaches.

5.5. Attention Analysis Between Textual Entities and Visual Objects

To further reveal the multimodal fusion mechanism of the Cross-Fusion Module, this section analyzes the cross-modal attention matrices generated during the module’s core operation: Image-Aware Word Representation (IAWR). Notably, the matrix is derived from swapping the query (Q) matrices in the multihead cross-modal attention (Equations (8) and (9)), directly reflecting how the model establishes semantic associations between textual entities and visual regions.

As shown in Figure 4, this heatmap visually demonstrates the core function of the cross-fusion module: for the personal name entity “Richard Howarth” in the text (corresponding to a word segment in the text sequence), regions R37 and R46–R47 in the heatmap appear dark blue (attention weight > 0.08), precisely aligned with the facial region of the person in the image; the R43-R47 region corresponding to “Simon Woodings” also shows a high-weight response, matching the visual region of the person. This strong “entity word–visual region” correlation proves that the Cross-Fusion Module achieves fine-grained alignment of text and image features through a query matrix interchange mechanism.

Figure 4. Heatmap of the query-swapped attention matrix (Image-Aware Word Representation phase in cross-fusion). The left panel shows the 7 × 7 segmented input image, with regions labeled R11–R77 (row × column corresponding to grid coordinates). The middle heatmap visualizes the attention matrix (text tokens × visual regions), and only the portion with non-zero attention intensity is displayed—regions R51 and beyond in the image are not included in the intercepted heatmap. The right panel is the weight intensity legend, indicating the correspondence between color depth and attention weight value. The bottom text presents the original social media post content, where entities like “Richard Howarth” and “Simon Woodings” are marked in red with their categories (B-PER/I-PER denotes the beginning/middle part of a person name entity), and the table maps text tokens to their x-axis indices.

For non-entity words in the text, such as the preposition “with”, the conjunction “and”, and even the general noun “morning”, the model significantly reduces the attention weight (lighter colors) given to the person regions in the image. This demonstrates that the model is able to effectively distinguish key entity information from background information in the text. Instead of paying attention uniformly to the entire image, it focuses attention resources on the most relevant visual areas, suppressing the interference of irrelevant information.

5.6. Generalization Study

Model generalization capability describes how well a model adapts when dealing with data that are different from the training data; models with high generalization capability are able to learn general patterns and laws from the training data and apply this knowledge to new data. We only trained on TWITTER2015 and TWITTER-2017 and validated on each other’s test sets, avoiding the WuKong dataset because it is a Chinese dataset with a significant gap compared to the other two English datasets. The results are displayed in Table 6, where TWITTER-2017 → TWITTER-2015 indicates that models trained on the TWITTER-2017 dataset are used to evaluate the TWITTER-2015 dataset and vice versa. From the results in Table 6, it is obvious that our model outperforms other models in terms of generalization ability. We hypothesize that this advantage may be attributed to the crucial role of the Flood layer, which effectively mitigates the overfitting phenomenon by limiting the continuous decrease in the training loss, thus enhancing the model’s adaptability to new data.

Table 6. Performance comparison of generalization ability.

5.7. Low-Resource Experiment

In this study, we aimed to evaluate the model’s performance in resource-constrained environments. To accomplish this, we implemented a strategy of randomly sampling between 10% and 50% of the training samples from three distinct datasets, thereby creating a series of training sets under low-resource conditions. On this basis, we conducted a systematic performance comparison of the proposed MPCFN model against the existing UMT and HVPNeT models. The results of this comparison are detailed in Figure 5.

Figure 5. Performance comparison in low-resource setting.

The experimental results demonstrate that the MPCFN significantly outperforms both UMT and HVPNeT in the English datasets TWITTER-2015 and TWITTER-2017, regardless of the low-resource configurations. In addition, on the Chinese dataset WuKong, the MPCFN similarly demonstrates superior performance over the other two models under all conditions except the 10% sample configuration. These results consistently validate the efficiency and robustness of the MPCFN in the face of data scarcity, demonstrating its significant potential and broad applicability for named entity recognition in low-resource environments.

6. Conclusions and Future Work

To cope with the problem where traditional text-based NER methods find it difficult to deal with texts with ambiguous meanings and the challenges of existing MNER methods in terms of data alignment and visual information interference, in this paper we propose an innovative MPCFN for the MNER task. We introduce a relevance prediction module to filter non-essential details in the image and align the features of the image and the text through the Cross-Fusion Module. To enhance model training further, we combine the CRF layer with the Flooding layer to optimize the generalization ability of the network. Through extensive experiments on two benchmark English datasets and one Chinese dataset, our experimental results confirm the excellent performance of the MPCFN on multimodal data in different languages.

Although the MPCFN model proposed in this paper has been fully validated on three benchmark datasets (TWITTER-2015, TWITTER-2017, and WuKong), a detailed observation of the experimental process and in-depth analysis of the results reveal two limitations of the model: (1) Limited coverage of data scenarios: The model is only designed and optimized for the basic social media scenario of “single text–single image”, and has not been adapted to more complex multimodal input forms such as “multiple images associated with a single text” and “text + short video frame sequences”. (2) Weak cross-domain transfer performance: The effectiveness of the model has only been verified on social media datasets, and no adaptive optimization or sufficient testing has been conducted for non-social media multimodal data such as news, e-commerce, or medical data.

This work opens up several avenues for future research: on the one hand, testing and optimization on more linguistic datasets to enhance the performance of the MPCFN in multilingual environments and to address the differences and specific challenges between different languages; on the other hand, how to apply the MPCFN model to multimodal tasks in other domains, including multimodal sentiment analysis and multimodal machine translation, etc., to validate its generalization and scalability in different tasks.

The code, data, and best-performing models are available at https://github.com/kuyuanchou/MPCFN.

Author Contributions

Q.Q.: conceptualization, methodology, supervision, project administration, writing—original draft, writing—review and editing. Y.Z.: software, investigation, formal analysis, validation, writing—original draft. B.T.: investigation, formal analysis, writing—original draft. Y.Z. and M.T.: methodology, formal analysis, resources, visualization. W.C.: conceptualization, supervision, project administration, funding acquisition. L.T.: supervision, writing—review and editing, project administration. All authors have read and agreed to the published version of the manuscript.

Funding

This study was financially supported by the National Key R&D Program of China (no. 2022YFB3904200), and the Open Fund Program of Yunnan Key Laboratory of Intelligent Monitoring and Spatiotemporal Big Data Governance of Natural Resources (no. 202449CE340023).

Data Availability Statement

The authors confirm that the data supporting the findings of this study are available within the article.

Acknowledgments

This study was financially supported by the National Key R&D Program of China (no. 2022YFB3904200) and the Open Fund Program of Yunnan Key Laboratory of Intelligent Monitoring and Spatiotemporal Big Data Governance of Natural Resources (no. 202449CE340023).

Conflicts of Interest

We declare that we do not have any commercial or associative interests that represent a conflict of interest in connection with the work submitted.

References

Yu, J.; Bohnet, B.; Poesio, M. Named Entity Recognition as Dependency Parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 6470–6476. [Google Scholar]
Nasar, Z.; Jaffry, S.W.; Malik, M.K. Named entity recognition and relation extraction: State-of-the-art. ACM Comput. Surv. 2021, 54, 1–39. [Google Scholar] [CrossRef]
Kim, J.H.; Woodland, P.C. A rule-based named entity recognition system for speech input. In Proceedings of the Sixth International Conference on Spoken Language Processing, Beijing, China, 16–20 October 2000. [Google Scholar]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; long and short papers. Volume 1, pp. 4171–4186. [Google Scholar]
Jie, Z.; Lu, W. Dependency-Guided LSTM-CRF for Named Entity Recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3862–3872. [Google Scholar]
Wu, Z.; Zheng, C.; Cai, Y.; Chen, J.; Leung, H.; Li, Q. Multimodal representation with embedded visual guiding objects for named entity recognition in social media posts. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1038–1046. [Google Scholar]
Zheng, C.; Wu, Z.; Wang, T.; Cai, Y.; Li, Q. Object-aware multimodal named entity recognition in social media posts with adversarial learning. IEEE Trans. Multimed. 2020, 23, 2520–2532. [Google Scholar] [CrossRef]
Yu, J.; Jiang, J.; Yang, L.; Xia, R. Improving multimodal named entity recognition via entity span detection with unified multimodal transformer. In Proceedings of the Association for Computational Linguistics, Online, 5–10 July 2020. [Google Scholar]
Lu, D.; Neves, L.; Carvalho, V.; Zhang, N.; Ji, H. Visual attention model for name tagging in multimodal social media. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Long Papers. Volume 1, pp. 1990–1999. [Google Scholar]
Moon, S.; Neves, L.; Carvalho, V. Multimodal Named Entity Recognition for Short Social Media Posts. In Proceedings of the NAACL-HLT, New Orleans, LA, USA, 1–6 June 2018; pp. 852–860. [Google Scholar]
Chen, L.; Kong, H.; Wang, H.; Yang, W.K.; Lou, J.; Xu, F.L. HVP-Net: A hybrid voxel-and point-wise network for place recognition. IEEE Trans. Intell. Veh. 2023, 9, 395–406. [Google Scholar] [CrossRef]
Wang, X.; Gui, M.; Jiang, Y.; Jia, Z.; Bach, N.; Wang, T.; Huang, Z.; Huang, F.; Tu, K. ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 3176–3189. [Google Scholar]
Jia, M.; Shen, X.; Shen, L.; Pang, J.; Liao, L.; Song, Y.; Chen, M.; He, X. Query prior matters: A MRC framework for multimodal named entity recognition. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 3549–3558. [Google Scholar]
Lu, J.; Zhang, D.; Zhang, P. Flat Multi-modal Interaction Transformer for Named Entity Recognition. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 2055–2064. [Google Scholar]
Asgari-Chenaghlu, M.; Feizi-Derakhshi, M.R.; Farzinvash, L.; Balafar, M.A.; Motamed, C. CWI: A multimodal deep learning approach for named entity recognition from social media using character, word and image features. Neural Comput. Appl. 2022, 34, 1905–1922. [Google Scholar] [CrossRef]
Wang, X.; Ye, J.; Li, Z.; Tian, J.; Jiang, Y.; Yan, J.; Zhang, J.; Xiao, Y. CAT-MNER: Multimodal named entity recognition with knowledge-refined cross-modal attention. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18 July–22 July 2022; IEEE: New York, NY, USA, 2022; pp. 1–6. [Google Scholar]
Xu, B.; Huang, S.; Sha, C.; Wang, H. MAF: A general matching and alignment framework for multimodal named entity recognition. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, Tempe, AZ, USA, 21–25 February 2022; pp. 1215–1223. [Google Scholar]
Kapur, J.N. Maximum-Entropy Models in Science and Engineering; John Wiley & Sons: Hoboken, NJ, USA, 1989. [Google Scholar]
Eddy, S.R. Hidden markov models. Curr. Opin. Struct. Biol. 1996, 6, 361–365. [Google Scholar] [CrossRef] [PubMed]
Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28. [Google Scholar] [CrossRef]
Lafferty, J.; McCallum, A.; Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the ICML, Williamstown, MA, USA, 28 June–1 July 2001; Volume 1, p. 3. [Google Scholar]
Borthwick, A.; Sterling, J.; Agichtein, E.; Grishman, R. NYU: Description of the MENE named entity system as used in MUC-7. In Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, VA, USA, 29 April–1 May 1998. [Google Scholar]
Mccallum, A.; Li, W. Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Edmonton, AB, Canada, May 31–June 1 2003; pp. 188–191. [Google Scholar]
Li, P.; Zhou, G.; Guo, Y.; Zhang, S.; Jiang, Y.; Tang, Y. EPIC: An epidemiological investigation of COVID-19 dataset for Chinese named entity recognition. Inf. Process. Manag. 2024, 61, 103541. [Google Scholar] [CrossRef]
Chiu, J.P.C.; Nichols, E. Named entity recognition with bidirectional LSTM-CNNs. Trans. Assoc. Comput. Linguist. 2016, 4, 357–370. [Google Scholar] [CrossRef]
Zhao, Z.; Yang, Z.; Luo, L.; Wang, L.; Zhang, Y.; Lin, H.; Wang, J. Disease named entity recognition from biomedical literature using a novel convolutional neural network. BMC Med. Genom. 2017, 10, 75–83. [Google Scholar] [CrossRef] [PubMed]
Zheng, S.; Wang, F.; Bao, H.; Hao, Y.; Zhou, P.; Xu, B. Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Long Papers. Association for Computational Linguistics: Stroudsburg, PA, USA; Volume 1. [Google Scholar]
Gui, T.; Ye, J.; Zhang, Q.; Zhou, Y.; Gong, Y.; Huang, X. Leveraging document-level label consistency for named entity recognition. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, Montreal, QC, Canada, 19–27 August 2021; pp. 3976–3982. [Google Scholar]
Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. Comput. Sci. 2018, in press. [Google Scholar]
Shi, S.; Hu, K.; Xie, J.; Guo, Y.; Wu, H. Robust scientific text classification using prompt tuning based on data augmentation with L2 regularization. Inf. Process. Manag. 2024, 61, 103531. [Google Scholar] [CrossRef]
Liu, Y.; Huang, S.; Li, R.; Yan, N.; Du, Z. USAF: Multimodal Chinese named entity recognition using synthesized acoustic features. Inf. Process. Manag. 2023, 60, 103290. [Google Scholar] [CrossRef]
Zhang, Q.; Fu, J.; Liu, X.; Huang, X. Adaptive co-attention network for named entity recognition in tweets. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
Chen, D.; Li, Z.; Gu, B.; Chen, Z. Multimodal named entity recognition with image attributes and image knowledge. In Database Systems for Advanced Applications, Proceedings of the 26th International Conference, DASFAA 2021, Taipei, Taiwan, 11–14 April 2021; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 186–201. [Google Scholar]
Sun, L.; Wang, J.; Zhang, K.; Su, Y.; Weng, F. RpBERT: A text-image relation propagation-based BERT model for multimodal NER. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 19–21 May 2021; Volume 35, pp. 13860–13868. [Google Scholar]
Liu, P.; Wang, G.; Li, H.; Liu, J.; Ren, Y.; Zhu, H.; Sun, L. Multi-granularity cross-modal representation learning for named entity recognition on social media. Inf. Process. Manag. 2024, 61, 103546. [Google Scholar] [CrossRef]
Wang, X.; Cai, J.; Jiang, Y.; Xie, P.; Tu, K.; Lu, W. Named Entity and Relation Extraction with Multi-Modal Retrieval. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 5925–5936. [Google Scholar]
Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5999–6009. [Google Scholar]
Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016. [Google Scholar]
Ishida, T.; Yamane, I.; Sakai, T.; Niu, G.; Sugiyama, M. Do We Need Zero Training Loss After Achieving Zero Training Error? In Proceedings of the International Conference on Machine Learning, PMLR, Online, 13–18 July 2020; pp. 4604–4614. [Google Scholar]
Gu, J.; Meng, X.; Lu, G.; Hou, L.; Minzhe, N.; Liang, X.; Yao, L.; Huang, R.; Zhang, W.; Jiang, X.; et al. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. Adv. Neural Inf. Process. Syst. 2022, 35, 26418–26431. [Google Scholar]
Li, J.; Sun, A.; Han, J.; Li, C. A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. 2020, 34, 50–70. [Google Scholar] [CrossRef]
Zhang, D.; Wei, S.; Li, S.; Wu, H.; Zhu, Q.; Zhou, G. Multi-modal graph fusion for named entity recognition with targeted visual guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 19–21 May 2021; Volume 35, pp. 14347–14355. [Google Scholar]
Yu, J.; Li, Z.; Wang, J.; Xia, R. Grounded multimodal named entity recognition on social media. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 1, pp. 9141–9154. [Google Scholar]

Figure 1. Two examples for MNER. (a,b) represent examples of MNER in English and Chinese, respectively. The text in different colors represents different entity categories.

Figure 2. Overall framework of our proposed methodology: Stage 1: Extract text features by using BERT and visual features by using ResNet. Stage 2: Utilize a Correlation Prediction Gate to predict the relationships between text and image, and apply adaptive weights to the visual features based on these predictions. Stage 3: Predict a normalized vector

G

to control the level of visual features each block receives via the Dynamic Gate. Stage 4: Aligning image and text features through cross-fusion. Stage 5: A CRF layer optimized with Flooding receives the representation from the hidden layer for decoding.

Figure 3. The case comparisons among UMT, MGCMT, and the MPCFN (ours).

Figure 4. Heatmap of the query-swapped attention matrix (Image-Aware Word Representation phase in cross-fusion). The left panel shows the 7 × 7 segmented input image, with regions labeled R11–R77 (row × column corresponding to grid coordinates). The middle heatmap visualizes the attention matrix (text tokens × visual regions), and only the portion with non-zero attention intensity is displayed—regions R51 and beyond in the image are not included in the intercepted heatmap. The right panel is the weight intensity legend, indicating the correspondence between color depth and attention weight value. The bottom text presents the original social media post content, where entities like “Richard Howarth” and “Simon Woodings” are marked in red with their categories (B-PER/I-PER denotes the beginning/middle part of a person name entity), and the table maps text tokens to their x-axis indices.

Figure 5. Performance comparison in low-resource setting.

Table 1. The basic statistics of our three datasets.

Entity Type	TWITTER-2015			TWITTER-2017			WuKong
Entity Type	Train	Dev	Test	Train	Dev	Test	Train	Dev	Test
PER	2217	552	1816	2943	626	621	7780	2144	2088
LOC	2091	522	1697	731	173	178	1381	400	366
ORG	928	247	839	1674	375	395	6381	1677	1684
MISC/GPE	940	225	726	701	150	157	6865	1765	1854
No. of entities	6176	1546	5078	6049	1324	1351	22407	5986	5992
No. of tweets	4000	1000	3257	3373	723	723	36024	9699	9700

Table 2. Experimental results of the ablation study.

Methods	TWITTER-2015			TWITTER-2017			WuKong
Methods	P	R	F1	P	R	F1	P	R	F1
Baseline model	73.76	75.52	74.63	86.96	87.05	86.50	77.00	82.88	79.83
+CP	74.05	75.94	74.98	87.32	87.19	87.26	80.86	83.13	81.98
+Flooding	74.87	74.47	74.67	88.09	86.53	87.30	79.36	83.36	81.31
+CF	74.43	75.44	74.94	84.82	86.45	85.63	82.31	81.04	81.67
+CP + Flooding	76.08	75.67	75.88	88.38	86.68	87.52	80.41	83.66	82.01
+CP + CF	73.37	76.46	74.88	85.83	86.97	86.40	83.45	80.96	82.19
+Flooding + CF	76.64	75.98	76.31	88.74	85.71	87.20	80.34	83.29	81.79
MPCFN (ours)	76.84	76.65	76.74	88.34	86.90	87.61	81.82	82.89	82.35

Table 3. Performance results of different MNER approaches.

Modality	Methods	TWITTER-2015			TWITTER-2017			WuKong
Modality	Methods	P	R	F1	P	R	F1	P	R	F1
Text	BERT	68.30	74.61	71.32	82.19	83.72	82.95	76.82	81.12	78.91
Text	BERT-CRF	69.22	74.59	71.81	83.32	83.57	83.44	78.37	79.39	78.88
Text + Image	UMT	71.67	75.23	73.41	85.28	85.34	85.31	79.06	80.46	79.75
	RpBERT	73.29	75.23	74.25	85.86	86.75	86.30	78.44	82.09	80.23
	UMGF	74.49	75.21	74.85	86.54	84.50	85.51	78.22	74.62	76.38
	HVPNeT	73.87	76.82	75.32	85.84	87.93	86.87	79.98	80.12	80.05
	GMNER	61.65	62.03	61.27	65.62	66.54	64.72	-	-	-
	MGCMT	73.57	75.59	74.57	86.03	86.16	86.09	75.28	80.56	77.83
	MPCFN (ours)	76.84	76.65	76.74	88.34	86.90	87.61	81.82	82.89	82.35

Table 4. The test for different values of parameter

b

in our model.

Table 4. The test for different values of parameter

b

in our model.

Parameter b	TWITTER-2015			TWITTER-2017			WuKong
Parameter b	P	R	F1	P	R	F1	P	R	F1
0.2	76.07	75.58	75.82	87.09	85.86	86.47	82.01	80.59	81.30
0.3	76.84	76.65	76.74	87.90	87.12	87.51	81.19	83.39	82.28
0.4	76.88	76.05	76.46	86.93	85.64	86.28	80.33	83.28	81.78
0.5	77.32	75.60	76.45	88.34	86.90	87.61	81.08	83.29	82.17
0.6	76.39	75.50	75.94	87.89	85.94	86.90	81.82	82.89	82.35
0.7	75.80	76.49	76.15	89.10	85.27	87.14	81.12	83.24	82.17
0.8	75.51	75.25	75.38	87.42	85.42	86.41	80.79	83.18	81.97

Table 5. The test for different modes in the unimodal learning.

Methods	TWITTER-2015			TWITTER-2017			WuKong
Methods	P	R	F1	P	R	F1	P	R	F1
MPCFN	76.84	76.65	76.74	88.34	86.90	87.61	81.82	82.89	82.35
w/o uni-text	73.76	75.52	74.63	86.96	87.05	86.50	77.00	82.88	79.83
w/o uni-obj	75.05	75.88	75.47	87.59	86.75	87.17	81.22	83.36	82.28
w/o uni-img	76.81	75.21	76.01	88.46	86.23	87.33	80.52	83.11	81.79

Table 6. Performance comparison of generalization ability.

Methods	TWITTER-2017 → TWITTER-2015			TWITTER-2015 → TWITTER-2017
Methods	P	R	F1	P	R	F1
UMT	64.67	63.59	64.13	67.80	55.23	60.87
MGCMT	67.67	63.28	65.40	66.41	58.11	61.98
MPCFN (ours)	73.71	62.84	67.84	65.79	59.22	62.33

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

MPCFN: A Multilevel Predictive Cross-Fusion Network for Multimodal Named Entity Recognition in Social Media

Abstract

1. Introduction

2. Related Work

Previous Research on Multimodal NER

3. Methodology

3.1. Overall Framework

3.2. Stage 1: Text and Visual Feature Extraction

3.3. Stage 2: Correlation Prediction

3.4. Stage 3: Dynamic Gate

3.5. Stage 4: Cross-Fusion

3.6. Stage 5: CRF and Flooding Layer

4. Experiments

4.1. Dataset

4.2. Evaluation Metric

4.3. Baselines

4.4. Experiment Configuration

5. Result

5.1. Ablation Study

5.2. Analysis of Results Compared to Existing Models

5.3. Parameter Sensitivity Analysis

5.4. Case Study

5.5. Attention Analysis Between Textual Entities and Visual Objects

5.6. Generalization Study

5.7. Low-Resource Experiment

6. Conclusions and Future Work

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

References

Article Metrics

Citations

Article Access Statistics