Applied Sciences
  • Article
  • Open Access

7 November 2025

MPCFN: A Multilevel Predictive Cross-Fusion Network for Multimodal Named Entity Recognition in Social Media

1 School of Computer Science, China University of Geosciences, Wuhan 430074, China
2 Yunnan Key Laboratory of Intelligent-Monitoring and Spatiotemporal Big Data Governance of Natural Resources, Kunming 650051, China
3 Yunnan Institute of Geology and Mineral Surveying and Mapping Co., Ltd., Kunming 650051, China
4 The Land and Resources Information Center, Department of Natural Resources of Yunnan Province, Kunming 650224, China

Abstract

The goal of the Multimodal Named Entity Recognition (MNER) task is to identify named entities by combining multiple data modalities (such as text and images) and to assign them to predefined categories. The growing prevalence of multimodal social media posts has spurred heightened interest in MNER, particularly due to its pivotal role in applications ranging from intention comprehension to personalized user recommendation. The task currently faces two main difficulties: the information in an image may be inconsistent with the accompanying text, and it is difficult to fully exploit the image information to complement the text. To solve these problems, this study proposes a Multilevel Predictive Cross-Fusion Network (MPCFN) for Multimodal Named Entity Recognition. First, textual features are extracted with BERT and visual features with ResNet, and irrelevant information in the image is filtered by a Correlation Prediction Gate. Second, a Dynamic Gate controls the level of visual features received by each Transformer block, and a Cross-Fusion Module aligns the image and text features. Finally, the hidden-layer representation is fed into a CRF layer, optimized with Flooding, for decoding. In experiments on the TWITTER-2015, TWITTER-2017, and WuKong datasets, our method achieves F1 scores of 76.74%, 87.61%, and 82.35%, respectively, outperforming existing mainstream state-of-the-art models and demonstrating the effectiveness and superiority of our method.

1. Introduction

Named Entity Recognition (NER) is a crucial information extraction task that seeks to recognize and categorize textual entities with particular meanings, such as the names of individuals, locations, businesses, dates, and times, among others [,]. By recognizing named entities in unstructured free-form text, NER plays an important role in several domains, including information retrieval, relation extraction, question answering, machine translation, and text categorization []. Current mainstream text-based NER algorithms achieve strong results when processing well-formed text. They usually use a CNN, LSTM, or Transformer as the encoder to learn contextual representations of the input words and decode them with softmax or a CRF [,].
However, applying traditional NER methods to text from social media platforms such as Twitter and Facebook is often less effective, because tweets usually contain a large number of spelling mistakes, abbreviations, slang terms, emoticons, and other non-standard language [,]. This noise makes it difficult for traditional NER methods, which usually rely on well-formed text, to accurately recognize and classify named entities. In addition, the character limit on tweets leaves very little contextual information per tweet, and much of the textual content can only be understood in combination with the visual context [].
To address the challenges faced by traditional NER methods in complex textual environments, MNER has been proposed. The objective of MNER is to recognize and classify entities in posts by utilizing their associated images. Existing MNER methods broadly fall into two categories: one class of methods inputs both the text and the entire image and encodes the implicit representation of each word jointly [,]; the other class aligns text vectors with visual object features to achieve more comprehensive semantic representations of words [,,,,,]. All these approaches demonstrate that augmenting linguistic representations with visual information enables MNER methods to outperform traditional NER methods. Figure 1 provides two examples of MNER; Figure 1a,b show examples in English and Chinese, respectively. The objective of MNER is to accurately discern the correct entity types based on the provided image and text. For instance, in Figure 1a, the model is expected to recognize that NFL, Patrick Willis, and Silicon Valley belong to the categories of organization names (ORG type), person names (PER type), and place names (LOC type), respectively.
Figure 1. Two examples for MNER. (a,b) represent examples of MNER in English and Chinese, respectively. The text in different colors represents different entity categories.
Although prior methodologies have demonstrated the efficacy of MNER, two major problems remain. First, there may be discrepancies between the text and the accompanying images, and erroneous image information can disrupt entity recognition [,]. Second, employing global image representations ignores the fine-grained semantic alignment between images and text. Additionally, relying solely on object-level features can be problematic, as some images may contain misleading objects or lack explicit objects altogether.
To tackle these issues, we introduce a new MNER method built on a multilevel predictive cross-fusion network. Our approach begins by extracting unimodal features from text and images using BERT and ResNet, respectively. Subsequently, we employ a Correlation Prediction Gate to screen out irrelevant information in the image, applying adaptive weights to the visual features. These weights are dynamically adjusted according to the relationship between the image and the text. Furthermore, a Dynamic Gate mechanism controls the level of visual features processed within each Transformer block, enabling each block to learn different levels of textual and visual information. Finally, the hidden-layer representations are fed into a CRF layer, which is optimized with the Flooding method, for decoding. We evaluate our method on two public English datasets and one Chinese dataset, and the experimental results demonstrate the effectiveness and advantages of our proposed approach.
The contributions of this paper are summarized as follows:
(1)
We present a Multilevel Predictive Cross-Fusion Network model to enhance MNER’s performance on social media. This model predicts the correlation between text and images to mitigate irrelevant image interference and employs cross-fusion instead of visual prefix fusion for integrating multimodal information.
(2)
To address overfitting associated with model complexity, this study employs a technique known as Flooding to optimize the Conditional Random Field (CRF) layer. This approach effectively mitigates overfitting by modifying the CRF layer’s loss function, thereby enhancing the model’s generalization capability. By incorporating Flooding technology, we further improve the model’s robustness in managing complex sequence annotation tasks.
(3)
Through extensive experiments and analysis, we show that our model performs competitively compared to existing state-of-the-art models.
In Section 2, we provide a comprehensive review of prior research on NER and MNER. We first present a detailed overview of traditional NER methods and their limitations, followed by an analysis of the advantages of multimodal approaches and the findings of previous studies. In Section 3, we provide a detailed description of our proposed MNER method, focusing on the implementation details of correlation prediction between image–text pairs and the multimodal information fusion strategy. Section 4 describes the datasets and evaluation criteria employed in this work, as well as providing a thorough analysis of the experimental results. Finally, Section 5 summarizes the key findings of the research and suggests potential directions for future work.

3. Methodology

We start this section by defining the MNER problem and providing an overview of our proposed approach. Then, we explain the implementation of our system in detail, using the image and sentence shown in Figure 2 as an example.
Figure 2. Overall framework of our proposed methodology: Stage 1: Extract text features with BERT and visual features with ResNet. Stage 2: Use a Correlation Prediction Gate to predict the relationship between text and image, and apply adaptive weights to the visual features based on these predictions. Stage 3: Predict a normalized vector G via the Dynamic Gate to control the level of visual features each block receives. Stage 4: Align image and text features through cross-fusion. Stage 5: Feed the hidden-layer representation into a CRF layer optimized with Flooding for decoding.
Task Definition: Given an input pair consisting of a text sentence $T$ and an associated image $V$, the goal of MNER is to extract a set of entities from $T$ and classify each extracted entity into one of the pre-defined types. Similar to most existing MNER approaches, we frame this task as a sequence-labeling problem. Let $T = \{t_1, t_2, \ldots, t_n\}$ denote the sequence of input tokens, and $y = \{y_1, y_2, \ldots, y_n\}$ the corresponding sequence of labels.

3.1. Overall Framework

The overall architecture of our model is shown in Figure 2. For clarity, the model is divided into 5 main components: (1) text and visual feature extraction, (2) correlation prediction, (3) the Dynamic Gate, (4) cross-fusion, (5) CRF and Flooding layer.
Firstly, we employ the BERT model for comprehensive feature extraction from the text data. Utilizing the visual grounding toolkit, we extract local object information from the raw images. Subsequently, we encode these global images and local objects using the ResNet architecture to derive rich visual features. Additionally, we have designed and implemented an adaptive matrix that dynamically learns during the model training process, filtering visual features based on the relevance between the image and text pairs. In addition, we maintain a dynamically updated normalized vector G , which assigns filtered visual and textual features to each Transformer layer with varying weights. This approach facilitates hierarchical integration and fusion of visual and textual information. During the decoding stage, the CRF layer is employed to further refine the output. To mitigate overfitting due to high model complexity, we have incorporated a technique called Flooding, enhancing the model’s generalization and stability. Through these meticulously designed methods, our model achieves exceptional performance in the joint representation learning of images and text.

3.2. Stage 1: Text and Visual Feature Extraction

In this work, each text sequence $X = \{x_0, x_1, \ldots, x_n\}$ is fed into a pre-trained BERT [] to obtain the sequence representations $T = \{T_0, T_1, \ldots, T_n\}$, where $T_i \in \mathbb{R}^d$ is the extracted word representation for $x_i$, and $x_0$ and $x_n$ denote the two inserted special tokens "[CLS]" and "[SEP]" that mark the start and end of the sentence.
$T_i = \mathrm{BERT}(x_i; \theta_{\text{bert}}) \in \mathbb{R}^d$
The BERT parameters are represented by $\theta_{\text{bert}}$. In particular, if the tokenizer splits $x_i$ into multiple sub-tokens, $T_i$ is obtained by summing the representations of these sub-tokens.
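For concreteness, the following is a minimal sketch of this text-encoding step in a HuggingFace/PyTorch style, assuming the bert-base-uncased checkpoint used later in Section 4.4; the function and variable names are illustrative and not taken from the released implementation.

```python
# A sketch of Stage 1 text encoding (illustrative names, HuggingFace/PyTorch style).
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_sentence(words):
    """Return one d-dimensional vector T_i per input word x_i.

    If the tokenizer splits a word into several WordPiece sub-tokens,
    their hidden states are summed into a single word vector, as in the text.
    """
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**enc).last_hidden_state.squeeze(0)  # (num_subtokens, d)
    word_ids = enc.word_ids(0)          # sub-token -> word index ([CLS]/[SEP] map to None)
    word_vecs = []
    for w in range(len(words)):
        idx = [i for i, wid in enumerate(word_ids) if wid == w]
        word_vecs.append(hidden[idx].sum(dim=0))           # sum sub-token representations
    return torch.stack(word_vecs)                          # (n, d), d = 768 for bert-base

T = encode_sentence(["Patrick", "Willis", "visits", "Silicon", "Valley"])
```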
We use a visual grounding toolkit to extract the regions of an image that contain entities []. Then, we resize the original image and the cropped regions to 224 × 224 pixels to form the hierarchical images $O = \{o_1, o_2, \ldots, o_n\}$. Visual representations are extracted with the pre-trained ResNet.
$v_i = \mathrm{ResNet}(o_i; \theta_{\text{res}}) \in \mathbb{R}^d$
where $\theta_{\text{res}}$ denotes the ResNet parameters and $o_i$ is the $i$-th hierarchical image of 224 × 224 pixels.
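Analogously, the visual side can be sketched as follows, assuming a torchvision ResNet-101 backbone with its classification head removed; the crop file names are hypothetical placeholders for the outputs of the visual grounding toolkit.

```python
# A sketch of Stage 1 visual encoding (torchvision ResNet-101; crop paths are hypothetical).
import torch
from PIL import Image
from torchvision import models, transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # every hierarchical image o_i is resized to 224 x 224
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # keep conv stages, drop pooling + classifier

def encode_images(paths):
    """v_i = ResNet(o_i; theta_res): a 7 x 7 grid of region features per image or crop."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    with torch.no_grad():
        feats = backbone(batch)                 # (k, 2048, 7, 7)
    return feats.flatten(2).transpose(1, 2)     # (k, 49, 2048): 49 visual regions per image

V = encode_images(["full_image.jpg", "object_crop_0.jpg"])  # hypothetical file names
```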

3.3. Stage 2: Correlation Prediction

Visual cues that are not directly related to the text can introduce uncertainty or even negatively impact multimodal model learning. Therefore, before incorporating the visual features, we filter them based on their relevance to the corresponding text. Given a hierarchical image feature $v_i$, we concatenate it with the text feature $T$ and feed the combined feature into BERT to derive a joint feature []. The Correlation Prediction Gate is defined as a softmax function; feeding the joint features into the gate yields a relation matrix that controls the visual feature weights.
$G_{\text{matrix}} = \mathrm{softmax}([v_i, T])$
where $v_i$ denotes the visual features, $T$ the text features, $[\cdot,\cdot]$ the concatenation operation, and $G_{\text{matrix}}$ the relation matrix.
Then, we weight the visual features $v_i$ with the relation matrix $G_{\text{matrix}}$ to obtain visual representations that are pertinent to the textual content. The text-correlated visual representations, denoted as $V$, are computed as follows:
$V = \{\, v_i\, G_{\text{matrix}} \mid i = 0, \ldots, n \,\} = \{\, \bar{v}_i \mid i = 0, \ldots, n \,\}$
where $v_i$ denotes the initial visual features and $\bar{v}_i$ the correlation-weighted visual features.
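The sketch below illustrates the weight-then-filter idea of the Correlation Prediction Gate; it is a simplified stand-in, with a linear scoring layer replacing the BERT-based joint encoding described above, and all dimensions and names chosen for illustration only.

```python
# A simplified stand-in for the Correlation Prediction Gate: score each visual unit
# against a pooled text representation, softmax the scores into G_matrix, and
# down-weight text-irrelevant visual features.
import torch
import torch.nn as nn

class CorrelationPredictionGate(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # stands in for the BERT-based joint encoding

    def forward(self, visual, text):
        """visual: (k, d) hierarchical image features v_i; text: (n, d) word features T."""
        text_ctx = text.mean(dim=0, keepdim=True).expand(visual.size(0), -1)  # pooled text, one copy per v_i
        joint = torch.cat([visual, text_ctx], dim=-1)                         # [v_i, T]
        g_matrix = torch.softmax(self.score(joint).squeeze(-1), dim=0)        # relation matrix G_matrix
        return visual * g_matrix.unsqueeze(-1)                                # correlation-weighted features

gate = CorrelationPredictionGate(dim=768)
weighted_visual = gate(torch.randn(5, 768), torch.randn(12, 768))
```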

3.4. Stage 3: Dynamic Gate

This module can be viewed as performing path decision-making: the normalized vector $G$ predicted by the Dynamic Gate governs the level of visual features that each Transformer block receives. First, we compute the gate-signal logits $\alpha^{(l)}$ []:
$\alpha^{(l)} = f\!\left( W_l \left( \frac{1}{c} \sum_{i=1}^{c} P(V_i) \right) \right)$
where $f$ is the Leaky ReLU activation function and $P$ denotes the global average pooling layer. The MLP layer $W_l$ reduces the feature dimension by a factor of $c$, and a soft gate is implemented by generating continuous values as path probabilities. $G$ is calculated as follows:
$G = \{\, \mathrm{softmax}(\alpha^{(l)}) \mid l = 0, \ldots, 12 \,\} = \{\, G_l \mid l = 0, \ldots, 12 \,\}$
where $G_l$ denotes the probability vector of the $l$-th Transformer block.
To assign appropriate visual features to each Transformer layer, we use the Dynamic Gate to connect the visual features $V_i$ of each image in $V$ to all Transformer layers, applying the adaptive weights $G_i$ during the connection. The visual features $V_{\text{Gate}}$ selected by the Dynamic Gate are calculated as follows:
$V_{\text{Gate}} = \{\, [v_0, v_1, \ldots, v_n]\, G_i \mid G_i \in G \,\}$
where $v_i$ denotes the visual features and $[\cdot,\cdot]$ denotes the concatenation operation.
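A minimal sketch of the Dynamic Gate follows, assuming a dimension-reduction factor and a 13-way softmax over the gate logits (one per Transformer block); the module and parameter names are illustrative rather than those of the released code.

```python
# A sketch of the Dynamic Gate: global average pooling P, a per-layer MLP W_l with
# Leaky ReLU, and a softmax over the 13 gate logits alpha^(l), one per Transformer block.
import torch
import torch.nn as nn

class DynamicGate(nn.Module):
    def __init__(self, dim=768, reduction=16, num_blocks=13):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)                     # P: global average pooling
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim // reduction),     # W_l reduces the feature dimension
                          nn.LeakyReLU(),
                          nn.Linear(dim // reduction, 1))
            for _ in range(num_blocks))

    def forward(self, visual):
        """visual: (k, regions, d) filtered visual features V; returns one gated copy per block."""
        pooled = self.pool(visual.transpose(1, 2)).squeeze(-1).mean(dim=0)    # average-pooled summary
        logits = torch.stack([mlp(pooled).squeeze(-1) for mlp in self.mlps])  # alpha^(l), l = 0..12
        g = torch.softmax(logits, dim=0)                                      # G = {G_l}
        return [visual * g[l] for l in range(len(self.mlps))]                 # gated visual features per block

gate = DynamicGate()
per_block_visual = gate(torch.randn(3, 49, 768))
```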

3.5. Stage 4: Cross-Fusion

While Transformer layers can capture which contextual words are most relevant to the prediction of each input word, they do not incorporate the relevant visual context, leaving image information underutilized. To address this, we employ a cross-fusion approach that gives text and image information equal consideration from the initial fusion stage. This allows each word to learn not only a text-based representation informed by visual context but also a visual representation enhanced by textual information, thereby improving the model's integrated understanding of both modalities. Consequently, the model can more accurately identify the words most closely associated with each visual segment, enhancing the effectiveness of multimodal information fusion.
Image-Aware Word Representation: To enhance word representations with the associated images, we utilize a multihead cross-modal attention mechanism []. This mechanism treats $V_{\text{Gate}} \in \mathbb{R}^{d \times 49}$ as queries and $T \in \mathbb{R}^{d \times d \times 49}$ as keys and values:
$\mathrm{CA}_i(V_{\text{Gate}}, T) = \mathrm{softmax}\!\left( \frac{[W_{q_i} V_{\text{Gate}}]^{\top} [W_{k_i} T]}{\sqrt{d/m}} \right) [W_{v_i} T]^{\top}$
$\mathrm{CA}(V_{\text{Gate}}, T) = W\, [\mathrm{CA}_1(V_{\text{Gate}}, T), \ldots, \mathrm{CA}_{12}(V_{\text{Gate}}, T)]$
where $W_{q_i}$, $W_{k_i}$, $W_{v_i}$, and $W$ denote the weight matrices for the query, key, value, and multihead attention, respectively, and $\mathrm{CA}_i$ denotes the $i$-th head of cross-modal attention. We then add three further sub-layers on top:
$\hat{T}^{C} = \mathrm{LN}(V_{\text{Gate}} + \mathrm{CA}(V_{\text{Gate}}, T))$
$T^{C} = \mathrm{LN}(\hat{T}^{C} + \mathrm{FFN}(\hat{T}^{C}))$
where FFN refers to the feed-forward network, LN denotes layer normalization [,], and $T^{C}$ is the output representation of the cross-modal Transformer (CMT) layer.
Word-Aware Visual Representation: To capture visual representations aligned with each word, we also employ a CMT layer that uses $T$ as queries and $V_{\text{Gate}}$ as keys and values, forming a symmetric variant of the CMT layer described above. This yields a word-aware visual representation, denoted as $V^{C}$, which aligns each word with its most closely related visual blocks.
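The following sketch condenses one cross-fusion direction (the image-aware word representation) into a single layer built on PyTorch's nn.MultiheadAttention; this is an assumption about the implementation rather than the exact layer used, and the symmetric word-aware direction is obtained simply by swapping the two inputs.

```python
# A condensed sketch of one cross-fusion direction (image-aware word representation):
# gated visual features as queries, text as keys/values, then the residual LN/FFN sub-layers.
import torch
import torch.nn as nn

class CrossModalLayer(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # multihead cross-modal attention
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, query, key_value):
        ca, _ = self.attn(query, key_value, key_value)   # CA(query, key_value)
        h = self.ln1(query + ca)                         # first residual + layer norm
        return self.ln2(h + self.ffn(h))                 # FFN sub-layer + layer norm

layer = CrossModalLayer()
v_gate = torch.randn(1, 49, 768)    # gated visual regions V_Gate
text = torch.randn(1, 20, 768)      # word representations T
t_c = layer(v_gate, text)           # image-aware word representation T^C
v_c = layer(text, v_gate)           # word-aware visual representation V^C (symmetric use)
```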

3.6. Stage 5: CRF and Flooding Layer

CRF Layer: To combine the textual and visual representations, we concatenate the textual representations $T^{C}$ and the visual representations $V^{C}$ to form the final hidden representation $H = \{t_0, t_1, \ldots, v_0, v_1, \ldots\}$, where $t_i$ denotes the $i$-th textual representation and $v_i$ the $i$-th visual representation []. We feed $H$ into a standard Conditional Random Field layer that defines the probability of a label sequence $y$ given an input sentence $S$ and its associated image $V$:
$P(y \mid S, V) = \dfrac{\exp(\mathrm{score}(H, y))}{\sum_{y'} \exp(\mathrm{score}(H, y'))}$
$\mathrm{score}(H, y) = \sum_{i=0}^{n} T_{y_i, y_{i+1}} + \sum_{i=1}^{n} E_{h_i, y_i}$
where $T_{y_i, y_{i+1}}$ denotes the transition score from label $y_i$ to label $y_{i+1}$, and $E_{h_i, y_i}$ denotes the emission score of the $i$-th word for label $y_i$.
Flooding Layer: In over-parameterized deep networks, the training loss can keep approaching zero, making the model overconfident and degrading test performance. To address this problem, we adopt Flooding [], which sets a baseline $b$; once the training loss reaches the baseline, the gradient ascends, preventing the training loss from decreasing further. If the original learning objective is $L$, the modified learning objective $\tilde{L}$ is
$\tilde{L}(\theta) = |L(\theta) - b| + b$
where $b > 0$ denotes the flood level set by the user and $\theta$ denotes the model parameters.
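A minimal sketch of the Flooding objective wrapped around a CRF loss is shown below; it assumes the third-party torchcrf package for the CRF layer, and the label-set and dimension sizes are hypothetical.

```python
# A sketch of the Flooding objective wrapped around a CRF loss (torchcrf is an assumption).
import torch
from torchcrf import CRF

num_labels, hidden_dim = 9, 768                     # hypothetical label-set and feature sizes
crf = CRF(num_labels, batch_first=True)
emission_head = torch.nn.Linear(hidden_dim, num_labels)

def flooded_crf_loss(hidden, labels, mask, b=0.3):
    """L~(theta) = |L(theta) - b| + b: the training loss cannot sink below the flood level b."""
    emissions = emission_head(hidden)                           # (batch, seq, num_labels)
    nll = -crf(emissions, labels, mask=mask, reduction="mean")  # original objective L (negative log-likelihood)
    return (nll - b).abs() + b                                  # Flooding-modified objective

hidden = torch.randn(2, 10, hidden_dim)             # stand-in for the hidden representation H
labels = torch.randint(0, num_labels, (2, 10))
mask = torch.ones(2, 10, dtype=torch.bool)
loss = flooded_crf_loss(hidden, labels, mask)
loss.backward()
```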

4. Experiments

4.1. Dataset

In our experiments, we use three datasets to evaluate the performance: two English benchmark MNER datasets TWITTER-2015 (Zhang et al., 2018 []) and TWITTER-2017 (Lu et al., 2018 []) and one Chinese dataset WuKong (Gu et al., 2022 []).
TWITTER-2015 and TWITTER-2017: TWITTER-2015 and TWITTER-2017 contain 8257 and 4819 tweets, respectively, annotated with the BIO tagging scheme. The statistical details of the datasets are provided in Table 1.
Table 1. The basic statistics of our three datasets.
WuKong: This Chinese dataset is constructed from data sourced from the Chinese Internet, featuring 55,423 image–text pairs labeled using the BMES tagging scheme. It includes four entity types: Person, Location, Organization, and Geo-Political Entity. Statistical details of the dataset are provided in Table 1.

4.2. Evaluation Metric

We use precision (P), recall (R), and F1 score (F1) as evaluation metrics and apply exact-match evaluation, in which a named entity is considered correctly identified only if both its boundaries and its type match the ground truth []. Precision, recall, and F1 are computed from the numbers of true positives ($TP$), false positives ($FP$), and false negatives ($FN$) as follows:
$\text{Precision} = \dfrac{TP}{TP + FP}$
$\text{Recall} = \dfrac{TP}{TP + FN}$
Precision assesses how well the NER system identifies only correct entities, while recall assesses its ability to identify all entities within the corpus.
$\text{F1} = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
The F1 score is the harmonic mean of precision and recall. Given that the datasets include multiple entity types, we report the micro-averaged F1 score so that all entities are treated equally.
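As a small illustration of exact-match, entity-level evaluation, the snippet below computes the micro-averaged scores from corpus-level sets of (start, end, type) tuples; the example entities are hypothetical.

```python
# Entity-level, micro-averaged metrics under exact-match evaluation: an entity counts
# as correct only if both its span and its type match the ground truth.
def micro_prf(gold_entities, pred_entities):
    """Each argument is a set of (start, end, type) tuples aggregated over the corpus."""
    tp = len(gold_entities & pred_entities)
    fp = len(pred_entities - gold_entities)
    fn = len(gold_entities - pred_entities)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {(0, 1, "ORG"), (3, 4, "PER"), (7, 8, "LOC")}
pred = {(0, 1, "ORG"), (3, 4, "LOC")}      # one exact match, one type error, one missed entity
print(micro_prf(gold, pred))               # (0.5, 0.3333..., 0.4)
```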

4.3. Baselines

We performed a thorough comparison of our model against various baseline models to highlight its superiority. The models compared are categorized into two types: text-based approaches and multimodal approaches.
Text-based approaches:
  • We utilize BERT (Devlin et al., 2018 []) with a softmax decoder as a text-only baseline for NER, as it has been pre-trained on a substantial amount of unlabeled text data.
  • BERT-CRF extends BERT by using a standard CRF layer as the decoder instead of a softmax layer.
Multimodal approaches:
  • UMT (Yu et al., 2020 []): UMT uses a multimodal interaction module to obtain both image-aware word representations and word-aware visual representations.
  • RpBERT (Sun et al., 2021 []) adopts a method of text–image relation propagation to select visual clues.
  • UMGF (Zhang et al., 2021 []) presents a unified multimodal graph fusion approach for MNER.
  • HVPNeT (Chen et al., 2022 []) introduces a dynamic gated aggregation method to obtain hierarchical multiscaled visual features, which are used as visual prefixes for fusion.
  • GMNER (Yu et al., 2023 []) introduces a hierarchical indexing framework called H-Index, which generates entity–type–region triples hierarchically using a sequence-to-sequence model.
  • MGCMT (Liu et al., 2024 []) improves word representation through semantic enhancement and cross-modal interaction at various levels, achieving effective multimodal guidance for each word.

4.4. Experiment Configuration

For the two English datasets, TWITTER-2015 and TWITTER-2017, we use the same hyperparameters. The batch size, maximum input text length, dropout rate, learning rate, and number of epochs are set to 8, 147, 0.1, $3 \times 10^{-5}$, and 50, respectively. During training, the weight decay for the text, image, and CRF parameters is set to $1 \times 10^{-2}$, $5 \times 10^{-2}$, and $1 \times 10^{-2}$, respectively. The text representations T are encoded with the BERT-base-uncased model pre-trained by Devlin et al. (2018) [], and the visual representations V are encoded with ResNet-101. When applying the Flooding method to the TWITTER-2015 and TWITTER-2017 datasets, we set the parameter b to 0.3 and 0.5, respectively.
For the WuKong dataset, we set the learning rate, number of epochs, and b to $1 \times 10^{-5}$, 10, and 0.6, respectively, and used BERT-base-Chinese as the text encoder. The other parameters were kept consistent with those for TWITTER-2015 and TWITTER-2017. When testing the other baseline models on the WuKong dataset, we used the default parameters of the original models, merely substituting the English pre-trained models with their Chinese counterparts, in order to assess how the models perform in a different linguistic context.
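For reference, the hyperparameter settings above can be collected into a single configuration structure; the dictionary below simply restates the values given in the text, and its layout and key names are illustrative.

```python
# The hyperparameter settings above, restated as a configuration dictionary for reference.
CONFIGS = {
    "twitter2015": {"text_encoder": "bert-base-uncased", "visual_encoder": "resnet-101",
                    "batch_size": 8, "max_len": 147, "dropout": 0.1, "lr": 3e-5,
                    "epochs": 50, "flood_b": 0.3,
                    "weight_decay": {"text": 1e-2, "image": 5e-2, "crf": 1e-2}},
    "twitter2017": {"text_encoder": "bert-base-uncased", "visual_encoder": "resnet-101",
                    "batch_size": 8, "max_len": 147, "dropout": 0.1, "lr": 3e-5,
                    "epochs": 50, "flood_b": 0.5,
                    "weight_decay": {"text": 1e-2, "image": 5e-2, "crf": 1e-2}},
    "wukong":      {"text_encoder": "bert-base-chinese", "visual_encoder": "resnet-101",
                    "batch_size": 8, "max_len": 147, "dropout": 0.1, "lr": 1e-5,
                    "epochs": 10, "flood_b": 0.6,
                    "weight_decay": {"text": 1e-2, "image": 5e-2, "crf": 1e-2}},
}
```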

5. Results

5.1. Ablation Study

To examine the contributions of the key components of our model, we conducted an ablation study. The study compares the full model with variants that integrate different combinations of the Correlation Prediction (CP), Cross-Fusion (CF), and Flooding modules (i.e., analyzing model performance when each component is included or excluded). As shown in Table 2, we observe the following:
Table 2. Experimental results of the ablation study.
(1)
When any of the components is removed, performance is lower than that of the full model. However, even in these partial-component configurations, the model's F1 score still exceeds that of the text-only baseline as well as that of most existing MNER methods. This demonstrates that each component of our model plays an indispensable role in improving the final entity recognition results.
(2)
For the English datasets TWITTER-15 and TWITTER-17, performance drops significantly when the Flooding module is removed, indicating that the module plays a crucial role in model training. In contrast, for the Chinese dataset WuKong, the performance decline when the Flooding module is excluded is less pronounced. This is because the WuKong dataset is more than three times larger than the TWITTER datasets, making the data more diverse and reducing the likelihood of overfitting; consequently, the Flooding module contributes less in this case.
(3)
The model also suffers a noticeable performance drop when the Correlation Prediction Module or the Cross-Fusion Module is removed. This result suggests that filtering extraneous images with the Correlation Prediction Module and fusing image information with the Cross-Fusion Module are both beneficial for the NER task.

5.2. Analysis of Results Compared to Existing Models

Table 3 presents the experimental results of the MPCFN and all baselines on three datasets. From the experimental results, we can observe the following:
Table 3. Performance results of different MNER approaches.
(1)
Comparing SOTA multimodal methods with text-based unimodal approaches reveals that multimodal methods generally perform better, suggesting that incorporating additional visual information is often beneficial for NER tasks.
(2)
Comparing current multimodal models with text-based unimodal methods shows that multimodal approaches generally perform better on the English datasets TWITTER-2015 and TWITTER-2017, with a maximum F1 improvement of about 3.5%. However, on the Chinese dataset WuKong, the maximum improvement is approximately 1% (for instance, RpBERT over BERT), and in some cases multimodal models such as UMGF and MGCMT do not outperform text-based methods. This highlights the need for further improvement of multimodal models.
(3)
Compared with text-based unimodal methods, the proposed Multilevel Predictive Cross-Fusion Network achieves substantial improvements of 4.93%, 4.17%, and 3.44% on the three datasets. Furthermore, relative to the current state-of-the-art multimodal methods, our model yields performance gains of 1.42%, 0.74%, and 2.12%, respectively.
The observed improvement in model performance can be attributed to several factors. First, unlike models such as RpBERT that rely solely on global image features, our approach incorporates additional fine-grained image features for interaction with text, enhancing the alignment of text and image features. Second, whereas models like HVPNeT introduce a Visual Gate to regulate visual features only in the later stages of training, our model employs correlation prediction to filter visual object features at all levels early in the training process. This early-stage screening effectively reduces irrelevant information and minimizes the risk of inadvertently discarding useful visual features by applying a layered filtering approach. These strategies collectively contribute to the exceptional performance of the Multilevel Predictive Cross-Fusion Network in multimodal learning tasks.

5.3. Parameter Sensitivity Analysis

The choice of b in the Flooding layer. During training, the parameter b is crucial because it sets the lower bound of the loss function. If b is too high, the model may struggle to converge; if b is too low, it may fail to mitigate overconfident predictions, hurting performance on the test set. To balance this trade-off, we conducted a series of experiments on the three datasets and selected b based on the results, setting it to 0.3, 0.5, and 0.6 for TWITTER-2015, TWITTER-2017, and WuKong, respectively. As shown in Table 4, these settings are based on a careful analysis of multiple rounds of experiments.
Table 4. The test for different values of parameter b   in our model.
Unimodal learning. This study is dedicated to exploring the specific contribution of different data modalities to the performance of MNER models. By evaluating the unimodal learning effects on text, global images, and local objects, we find that they all have a significant impact on model performance. Table 5 shows that removing either the global image or local objects causes a notable drop in model performance, with similar performance declines observed for both cases. This finding further validates that, in visual scenes, the relative position information of local objects and the global image information complement each other and together play a key role in improving the model performance.
Table 5. The test for different modes in the unimodal learning.

5.4. Case Study

Figure 3 presents a qualitative comparison of our proposed MPCFN model against two baseline methods, UMT and MGCMT, in the MNER task. The case studies illustrate how the MPCFN effectively addresses common error types and demonstrates superior recognition capabilities by leveraging multimodal information.
Figure 3. The case comparisons among UMT, MGCMT, and the MPCFN (ours).
Key advantages of the MPCFN include the following:
Accurate Boundary Recognition. In the first case, the MPCFN correctly identifies the full organization name “JKF Ag” as an ORG, while the baseline models only recognize the partial entity “JKF”. This demonstrates the MPCFN’s precision in determining entity boundaries.
Robustness to Complex Entities. For the phrase “Saudi Oil Minister”, the MPCFN accurately recognizes “Saudi” as an LOC, whereas the baselines misclassify the entire phrase as an ORG. This shows the MPCFN’s ability to disambiguate and classify compound entities.
Accurate Entity Identification with Visual Context. In the case of “Sandy from kfc”, the accompanying image contains a prominent, clear KFC logo in the background, which features the iconic portrait of Colonel Sanders. The MPCFN is the only model that correctly identifies “kfc” as an organization (ORG), by effectively utilizing this decisive visual cue. In contrast, MGCMT misclassifies it as a location (LOC), and UMT fails to recognize it as an entity (O). In the tweet “RT @BiebsVogue: justin is probably looking at his twitter like this”, the MPCFN successfully identifies “justin” as a PER (person) by effectively utilizing the accompanying image. In contrast, MGCMT fails to recognize this entity due to insufficient visual–text alignment. This highlights the MPCFN’s superior ability to integrate visual cues for accurate entity recognition.
Accurate Entity and Non-Entity Judgment. The last example, featuring the song title “Drag Me Down”, provides a critical test of the model’s ability to leverage contextual cues. While the baseline models (UMT, MGCMT) treated it as a non-entity, our MPCFN model correctly classified it as MISC. This case underscores the MPCFN’s advantage in interpreting nuanced and non-standard entity mentions.
These examples validate that the MPCFN achieves more precise and robust performance by effectively fusing visual and textual information, leading to fewer errors in entity boundary recognition, type classification, and identification compared to existing approaches.

5.5. Attention Analysis Between Textual Entities and Visual Objects

To further reveal the multimodal fusion mechanism of the Cross-Fusion Module, this section analyzes the cross-modal attention matrices generated during the module’s core operation: Image-Aware Word Representation (IAWR). Notably, the matrix is derived from swapping the query (Q) matrices in the multihead cross-modal attention (Equations (8) and (9)), directly reflecting how the model establishes semantic associations between textual entities and visual regions.
As shown in Figure 4, the heatmap visually demonstrates the core function of the Cross-Fusion Module: for the person name entity “Richard Howarth” in the text (corresponding to a word segment in the text sequence), regions R37 and R46–R47 in the heatmap appear dark blue (attention weight > 0.08), precisely aligned with the facial region of the person in the image; the R43–R47 region corresponding to “Simon Woodings” also shows a high-weight response, matching the visual region of that person. This strong “entity word–visual region” correlation shows that the Cross-Fusion Module achieves fine-grained alignment of text and image features through the query-matrix interchange mechanism.
Figure 4. Heatmap of the query-swapped attention matrix (Image-Aware Word Representation phase in cross-fusion). The left panel shows the 7 × 7 segmented input image, with regions labeled R11–R77 (row × column corresponding to grid coordinates). The middle heatmap visualizes the attention matrix (text tokens × visual regions), and only the portion with non-zero attention intensity is displayed—regions R51 and beyond in the image are not included in the intercepted heatmap. The right panel is the weight intensity legend, indicating the correspondence between color depth and attention weight value. The bottom text presents the original social media post content, where entities like “Richard Howarth” and “Simon Woodings” are marked in red with their categories (B-PER/I-PER denotes the beginning/middle part of a person name entity), and the table maps text tokens to their x-axis indices.
For non-entity words in the text, such as the preposition “with”, the conjunction “and”, and even the general noun “morning”, the model significantly reduces the attention weight (lighter colors) given to the person regions in the image. This demonstrates that the model is able to effectively distinguish key entity information from background information in the text. Instead of paying attention uniformly to the entire image, it focuses attention resources on the most relevant visual areas, suppressing the interference of irrelevant information.

5.6. Generalization Study

Model generalization capability describes how well a model adapts to data that differ from the training data; models with high generalization capability learn general patterns from the training data and apply this knowledge to new data. We trained only on TWITTER-2015 and TWITTER-2017 and evaluated each model on the other dataset's test set, excluding the WuKong dataset because it is in Chinese and differs substantially from the two English datasets. The results are displayed in Table 6, where TWITTER-2017 → TWITTER-2015 indicates that models trained on the TWITTER-2017 dataset are evaluated on the TWITTER-2015 dataset, and vice versa. From the results in Table 6, it is clear that our model outperforms the other models in generalization ability. We attribute this advantage to the Flooding layer, which mitigates overfitting by limiting the continuous decrease in the training loss, thus enhancing the model's adaptability to new data.
Table 6. Performance comparison of generalization ability.

5.7. Low-Resource Experiment

In this study, we aimed to evaluate the model’s performance in resource-constrained environments. To accomplish this, we implemented a strategy of randomly sampling between 10% and 50% of the training samples from three distinct datasets, thereby creating a series of training sets under low-resource conditions. On this basis, we conducted a systematic performance comparison of the proposed MPCFN model against the existing UMT and HVPNeT models. The results of this comparison are detailed in Figure 5.
Figure 5. Performance comparison in low-resource setting.
The experimental results demonstrate that the MPCFN significantly outperforms both UMT and HVPNeT on the English datasets TWITTER-2015 and TWITTER-2017 across all low-resource configurations. On the Chinese dataset WuKong, the MPCFN likewise outperforms the other two models under all conditions except the 10% sample configuration. These results consistently validate the efficiency and robustness of the MPCFN in the face of data scarcity, demonstrating its significant potential and broad applicability for named entity recognition in low-resource environments.

6. Conclusions and Future Work

Traditional text-based NER methods struggle with texts whose meanings are ambiguous, and existing MNER methods face challenges in data alignment and visual information interference. To cope with these problems, in this paper we propose the MPCFN, an innovative model for the MNER task. We introduce a correlation prediction module to filter non-essential details in the image and align image and text features through the Cross-Fusion Module. To further enhance model training, we combine the CRF layer with the Flooding layer to improve the generalization ability of the network. Extensive experiments on two benchmark English datasets and one Chinese dataset confirm the excellent performance of the MPCFN on multimodal data in different languages.
Although the MPCFN model proposed in this paper has been fully validated on three benchmark datasets (TWITTER-2015, TWITTER-2017, and WuKong), a detailed observation of the experimental process and in-depth analysis of the results reveal two limitations of the model: (1) Limited coverage of data scenarios: The model is only designed and optimized for the basic social media scenario of “single text–single image”, and has not been adapted to more complex multimodal input forms such as “multiple images associated with a single text” and “text + short video frame sequences”. (2) Weak cross-domain transfer performance: The effectiveness of the model has only been verified on social media datasets, and no adaptive optimization or sufficient testing has been conducted for non-social media multimodal data such as news, e-commerce, or medical data.
This work opens up several avenues for future research: on the one hand, testing and optimizing the MPCFN on datasets in more languages to enhance its performance in multilingual environments and address the differences and specific challenges between languages; on the other hand, applying the MPCFN to multimodal tasks in other domains, such as multimodal sentiment analysis and multimodal machine translation, to validate its generalization and scalability across different tasks.
The code, data, and best-performing models are available at https://github.com/kuyuanchou/MPCFN.

Author Contributions

Q.Q.: conceptualization, methodology, supervision, project administration, writing—original draft, writing—review and editing. Y.Z.: software, investigation, formal analysis, validation, writing—original draft. B.T.: investigation, formal analysis, writing—original draft. Y.Z. and M.T.: methodology, formal analysis, resources, visualization. W.C.: conceptualization, supervision, project administration, funding acquisition. L.T.: supervision, writing—review and editing, project administration. All authors have read and agreed to the published version of the manuscript.

Funding

This study was financially supported by the National Key R&D Program of China (no. 2022YFB3904200), and the Open Fund Program of Yunnan Key Laboratory of Intelligent Monitoring and Spatiotemporal Big Data Governance of Natural Resources (no. 202449CE340023).

Data Availability Statement

The authors confirm that the data supporting the findings of this study are available within the article.

Acknowledgments

This study was financially supported by the National Key R&D Program of China (no. 2022YFB3904200) and the Open Fund Program of Yunnan Key Laboratory of Intelligent Monitoring and Spatiotemporal Big Data Governance of Natural Resources (no. 202449CE340023).

Conflicts of Interest

We declare that we do not have any commercial or associative interests that represent a conflict of interest in connection with the work submitted.

References

  1. Yu, J.; Bohnet, B.; Poesio, M. Named Entity Recognition as Dependency Parsing. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 6470–6476. [Google Scholar]
  2. Nasar, Z.; Jaffry, S.W.; Malik, M.K. Named entity recognition and relation extraction: State-of-the-art. ACM Comput. Surv. 2021, 54, 1–39. [Google Scholar] [CrossRef]
  3. Kim, J.H.; Woodland, P.C. A rule-based named entity recognition system for speech input. In Proceedings of the Sixth International Conference on Spoken Language Processing, Beijing, China, 16–20 October 2000. [Google Scholar]
  4. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; long and short papers. Volume 1, pp. 4171–4186. [Google Scholar]
  5. Jie, Z.; Lu, W. Dependency-Guided LSTM-CRF for Named Entity Recognition. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 3862–3872. [Google Scholar]
  6. Wu, Z.; Zheng, C.; Cai, Y.; Chen, J.; Leung, H.; Li, Q. Multimodal representation with embedded visual guiding objects for named entity recognition in social media posts. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1038–1046. [Google Scholar]
  7. Zheng, C.; Wu, Z.; Wang, T.; Cai, Y.; Li, Q. Object-aware multimodal named entity recognition in social media posts with adversarial learning. IEEE Trans. Multimed. 2020, 23, 2520–2532. [Google Scholar] [CrossRef]
  8. Yu, J.; Jiang, J.; Yang, L.; Xia, R. Improving multimodal named entity recognition via entity span detection with unified multimodal transformer. In Proceedings of the Association for Computational Linguistics, Online, 5–10 July 2020. [Google Scholar]
  9. Lu, D.; Neves, L.; Carvalho, V.; Zhang, N.; Ji, H. Visual attention model for name tagging in multimodal social media. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Long Papers. Volume 1, pp. 1990–1999. [Google Scholar]
  10. Moon, S.; Neves, L.; Carvalho, V. Multimodal Named Entity Recognition for Short Social Media Posts. In Proceedings of the NAACL-HLT, New Orleans, LA, USA, 1–6 June 2018; pp. 852–860. [Google Scholar]
  11. Chen, L.; Kong, H.; Wang, H.; Yang, W.K.; Lou, J.; Xu, F.L. HVP-Net: A hybrid voxel-and point-wise network for place recognition. IEEE Trans. Intell. Veh. 2023, 9, 395–406. [Google Scholar] [CrossRef]
  12. Wang, X.; Gui, M.; Jiang, Y.; Jia, Z.; Bach, N.; Wang, T.; Huang, Z.; Huang, F.; Tu, K. ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 3176–3189. [Google Scholar]
  13. Jia, M.; Shen, X.; Shen, L.; Pang, J.; Liao, L.; Song, Y.; Chen, M.; He, X. Query prior matters: A MRC framework for multimodal named entity recognition. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 3549–3558. [Google Scholar]
  14. Lu, J.; Zhang, D.; Zhang, P. Flat Multi-modal Interaction Transformer for Named Entity Recognition. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 2055–2064. [Google Scholar]
  15. Asgari-Chenaghlu, M.; Feizi-Derakhshi, M.R.; Farzinvash, L.; Balafar, M.A.; Motamed, C. CWI: A multimodal deep learning approach for named entity recognition from social media using character, word and image features. Neural Comput. Appl. 2022, 34, 1905–1922. [Google Scholar] [CrossRef]
  16. Wang, X.; Ye, J.; Li, Z.; Tian, J.; Jiang, Y.; Yan, J.; Zhang, J.; Xiao, Y. CAT-MNER: Multimodal named entity recognition with knowledge-refined cross-modal attention. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18 July–22 July 2022; IEEE: New York, NY, USA, 2022; pp. 1–6. [Google Scholar]
  17. Xu, B.; Huang, S.; Sha, C.; Wang, H. MAF: A general matching and alignment framework for multimodal named entity recognition. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, Tempe, AZ, USA, 21–25 February 2022; pp. 1215–1223. [Google Scholar]
  18. Kapur, J.N. Maximum-Entropy Models in Science and Engineering; John Wiley & Sons: Hoboken, NJ, USA, 1989. [Google Scholar]
  19. Eddy, S.R. Hidden markov models. Curr. Opin. Struct. Biol. 1996, 6, 361–365. [Google Scholar] [CrossRef] [PubMed]
  20. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28. [Google Scholar] [CrossRef]
  21. Lafferty, J.; McCallum, A.; Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the ICML, Williamstown, MA, USA, 28 June–1 July 2001; Volume 1, p. 3. [Google Scholar]
  22. Borthwick, A.; Sterling, J.; Agichtein, E.; Grishman, R. NYU: Description of the MENE named entity system as used in MUC-7. In Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax, VA, USA, 29 April–1 May 1998. [Google Scholar]
  23. Mccallum, A.; Li, W. Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Edmonton, AB, Canada, May 31–June 1 2003; pp. 188–191. [Google Scholar]
  24. Li, P.; Zhou, G.; Guo, Y.; Zhang, S.; Jiang, Y.; Tang, Y. EPIC: An epidemiological investigation of COVID-19 dataset for Chinese named entity recognition. Inf. Process. Manag. 2024, 61, 103541. [Google Scholar] [CrossRef]
  25. Chiu, J.P.C.; Nichols, E. Named entity recognition with bidirectional LSTM-CNNs. Trans. Assoc. Comput. Linguist. 2016, 4, 357–370. [Google Scholar] [CrossRef]
  26. Zhao, Z.; Yang, Z.; Luo, L.; Wang, L.; Zhang, Y.; Lin, H.; Wang, J. Disease named entity recognition from biomedical literature using a novel convolutional neural network. BMC Med. Genom. 2017, 10, 75–83. [Google Scholar] [CrossRef] [PubMed]
  27. Zheng, S.; Wang, F.; Bao, H.; Hao, Y.; Zhou, P.; Xu, B. Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Long Papers. Association for Computational Linguistics: Stroudsburg, PA, USA; Volume 1. [Google Scholar]
  28. Gui, T.; Ye, J.; Zhang, Q.; Zhou, Y.; Gong, Y.; Huang, X. Leveraging document-level label consistency for named entity recognition. In Proceedings of the Twenty-Ninth International Conference on International Joint Conferences on Artificial Intelligence, Montreal, QC, Canada, 19–27 August 2021; pp. 3976–3982. [Google Scholar]
  29. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. Comput. Sci. 2018, in press. [Google Scholar]
  30. Shi, S.; Hu, K.; Xie, J.; Guo, Y.; Wu, H. Robust scientific text classification using prompt tuning based on data augmentation with L2 regularization. Inf. Process. Manag. 2024, 61, 103531. [Google Scholar] [CrossRef]
  31. Liu, Y.; Huang, S.; Li, R.; Yan, N.; Du, Z. USAF: Multimodal Chinese named entity recognition using synthesized acoustic features. Inf. Process. Manag. 2023, 60, 103290. [Google Scholar] [CrossRef]
  32. Zhang, Q.; Fu, J.; Liu, X.; Huang, X. Adaptive co-attention network for named entity recognition in tweets. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  33. Chen, D.; Li, Z.; Gu, B.; Chen, Z. Multimodal named entity recognition with image attributes and image knowledge. In Database Systems for Advanced Applications, Proceedings of the 26th International Conference, DASFAA 2021, Taipei, Taiwan, 11–14 April 2021; Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 186–201. [Google Scholar]
  34. Sun, L.; Wang, J.; Zhang, K.; Su, Y.; Weng, F. RpBERT: A text-image relation propagation-based BERT model for multimodal NER. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 19–21 May 2021; Volume 35, pp. 13860–13868. [Google Scholar]
  35. Liu, P.; Wang, G.; Li, H.; Liu, J.; Ren, Y.; Zhu, H.; Sun, L. Multi-granularity cross-modal representation learning for named entity recognition on social media. Inf. Process. Manag. 2024, 61, 103546. [Google Scholar] [CrossRef]
  36. Wang, X.; Cai, J.; Jiang, Y.; Xie, P.; Tu, K.; Lu, W. Named Entity and Relation Extraction with Multi-Modal Retrieval. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 5925–5936. [Google Scholar]
  37. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
  38. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5999–6009. [Google Scholar]
  39. Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016. [Google Scholar]
  40. Ishida, T.; Yamane, I.; Sakai, T.; Niu, G.; Sugiyama, M. Do We Need Zero Training Loss After Achieving Zero Training Error? In Proceedings of the International Conference on Machine Learning, PMLR, Online, 13–18 July 2020; pp. 4604–4614. [Google Scholar]
  41. Gu, J.; Meng, X.; Lu, G.; Hou, L.; Minzhe, N.; Liang, X.; Yao, L.; Huang, R.; Zhang, W.; Jiang, X.; et al. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. Adv. Neural Inf. Process. Syst. 2022, 35, 26418–26431. [Google Scholar]
  42. Li, J.; Sun, A.; Han, J.; Li, C. A survey on deep learning for named entity recognition. IEEE Trans. Knowl. Data Eng. 2020, 34, 50–70. [Google Scholar] [CrossRef]
  43. Zhang, D.; Wei, S.; Li, S.; Wu, H.; Zhu, Q.; Zhou, G. Multi-modal graph fusion for named entity recognition with targeted visual guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 19–21 May 2021; Volume 35, pp. 14347–14355. [Google Scholar]
  44. Yu, J.; Li, Z.; Wang, J.; Xia, R. Grounded multimodal named entity recognition on social media. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Volume 1, pp. 9141–9154. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
