Article

CASF: Correlation-Alignment and Significance-Aware Fusion for Multimodal Named Entity Recognition

1 School of Computer Science and Technology, Guangxi University of Science and Technology, Liuzhou 545006, China
2 Guangxi Key Laboratory of Intelligent Computing and Distributed Information Processing, Liuzhou 545006, China
3 Cybersecurity Monitoring Center for Guangxi Education System, Liuzhou 545006, China
4 Liuzhou Key Laboratory of Big Data Intelligent Processing and Security, Liuzhou 545006, China
* Author to whom correspondence should be addressed.
Algorithms 2025, 18(8), 511; https://doi.org/10.3390/a18080511
Submission received: 4 July 2025 / Revised: 5 August 2025 / Accepted: 12 August 2025 / Published: 14 August 2025
(This article belongs to the Section Algorithms for Multidisciplinary Applications)

Abstract

With the increasing content richness of social media platforms, Multimodal Named Entity Recognition (MNER) faces the dual challenges of heterogeneous feature fusion and accurate entity recognition. To address the key problems of inconsistent distribution of textual and visual information, insufficient feature alignment, and noise-contaminated fusion, this paper proposes CASF-MNER, a multimodal named entity recognition model based on a dual-stream Transformer. The model designs cross-modal cross-attention over visual and textual features and builds a bidirectional interaction mechanism between single-layer features, forming higher-order semantic correlation modeling and realizing cross-relevance alignment of modal features. It constructs a dynamic saliency-perception mechanism for multimodal features based on multi-scale pooling, together with an entropy-weighting strategy over the global feature distribution, to adaptively suppress noise and redundancy and enhance the expression of key features. It further establishes a deep semantic fusion method based on a hybrid isomorphic model, designs a progressive cross-modal interaction structure, and combines it with contrastive learning to achieve global fusion in the deep semantic space and optimization of representational consistency. Experimental results show that CASF-MNER achieves excellent performance on both the Twitter-2015 and Twitter-2017 public datasets, verifying the effectiveness and advancement of the proposed method.

1. Introduction

In recent years, the development of artificial intelligence and machine learning has greatly promoted their in-depth application in many fields such as healthcare, drug discovery, disease diagnosis, and education [1,2]. At the same time, the rapid popularization of social media platforms has led to the emergence of diverse and heterogeneous information such as text, images, and time series, greatly expanding the dimensions of information expression. Multimodal Named Entity Recognition (MNER) was proposed in this context, with the core objective of integrating sentence-level text with associated images to extract and classify entities from multi-source data, thereby enhancing recognition accuracy and robustness in complex social media environments. However, social media data typically exhibit characteristics such as fragmentation, high noise levels, and sparse contextual information, making it challenging for single-modal approaches to cope effectively [3]. As a result, recent research has increasingly focused on the effective integration of cross-modal information, leveraging the fusion of visual and textual features to improve named entity recognition performance in dynamic and complex scenarios.
Nevertheless, the MNER task still faces several key challenges. First, textual and visual modalities are highly heterogeneous in semantic abstraction level, granularity, and expression, which makes feature alignment and deep information fusion significantly difficult. As shown in Figure 1, in sports events, for example, the organizational entities in “Basketball to Face Texas Tech” are difficult to describe accurately from the text alone, while group gatherings and venue logos in the image provide important complementary clues. In daily scenes involving other entities such as “Handsome Rob”, the visual content is often rich, diverse, and even blurred, while the text is weakly directional with fuzzy boundaries [4,5], which makes fine-grained alignment and recognition more challenging. In addition, pseudo-entities (e.g., background characters, non-target objects) and textual abbreviation and omission are common in social media [6], which tend to exacerbate misalignment and noise interference in cross-modal information matching, especially for atypical categories and complex environments.
To address these challenges, recent studies have offered solutions from different directions. One line of work focuses on enhancing modal alignment, typically by introducing multilevel visual feature extractors and semantically enhanced text encoders (e.g., CLIP, ViLT) to facilitate multimodal information interaction in highly complex scenes [7,8]. Another line focuses on fine-grained region partitioning, saliency mechanisms, or multi-layer dynamic interactions to achieve more accurate segment-level alignment and feature enhancement, improving the recognition of difficult cases and local entities [9,10]. In addition, redundant visual noise is suppressed through multi-layer gating, confidence weighting, or adaptive fusion mechanisms to improve robustness under weak alignment and modal differences [11,12]. Further, some methods employ external knowledge graphs and multimodal reasoning to enhance the representation and generalization of complex semantic relationships among entities [13]. These methods alleviate the difficulties of heterogeneous multimodal alignment, noise suppression, and information interaction to some extent, but they remain insufficient for entity recognition and modal-consistency modeling in fine-grained, high-variance scenarios, which challenges further improvement of MNER models.
To this end, this paper proposes CASF-MNER, a noise-reduction model that fuses correlation alignment with saliency enhancement. The approach addresses the bottlenecks of existing work in weak feature alignment, semantic inconsistency, and redundancy in fusion. The contributions of this paper are as follows:
1.
Correlation-Alignment Mechanisms. After the visual and textual streams each extract features through self-attention, a cross-modal cross-attention over visual and textual features is designed to construct a two-way interaction mechanism between single-layer features. Higher-order semantic alignment of visual and textual features is gradually refined, realizing dynamic linkage and deep feature complementarity between heterogeneous data at different semantic levels.
2.
Significance Feature Module. An entropy-weighting strategy over multi-scale pooling and global feature-distribution information is constructed for adaptive saliency modeling and feature enhancement of visual and textual sequences. By combining the statistics of maximum pooling, mean pooling, and soft pooling, a fine-grained feature weight distribution is generated, achieving the dual goals of key-information enhancement and redundancy suppression.
3.
Deep Semantic Fusion Method. Based on a hybrid isomorphic model, a progressive cross-modal interaction architecture is established to realize deep fusion and complementary enhancement of visual-textual representations through bidirectional mapping with multi-layer parameter sharing. Each fusion layer combines saliency feature mapping and hierarchical contrastive loss to fully exploit global and local multi-granularity information while maintaining modality specificity, realizing efficient integration of heterogeneous multimodal features.

2. Related Work

2.1. Multi-Modal Named Entity Recognition

In recent years, multimodal named entity recognition has garnered significant attention, largely driven by the proliferation of text-image data across social media platforms [14,15,16]. Integrating visual context into textual entity recognition brings benefits, but also introduces major challenges, such as semantic gaps between modalities, feature redundancy during fusion, and noisy real-world visual data. To tackle these issues, recent research has focused on more refined fusion mechanisms. For example, Jiang et al. [14] proposed a hybrid attention-based network that imposes semantic constraints at both the word and modality levels, enabling selective fusion and effective noise attenuation. Similarly, Zeng et al. [15] leveraged large language models and adaptive contrastive learning strategies to address mismatches between image and text as well as labeled data scarcity. Xu et al. [16] further explored denoising frameworks, combining topic-guided prompts and curriculum noise reduction, showing that structured sample introduction mitigates the impact of irrelevant image information on recognition performance. Despite these advances, fully solving the problem of contextual adaptation and visual coverage in MNER remains an open and challenging research frontier.
Progress in the visual modeling branch of MNER has been equally notable. Techniques based on region detection and visual grounding have enhanced the coverage and alignment of visual entities [17,18,19]. In particular, the use of multiple images per instance [18] and specifically designed multi-image datasets have greatly enriched the context for entity classification. Grounded MNER extends this by requiring entity position localization in addition to classification, driving the development of multi-level alignment strategies that link detected objects with textual entities and leverage generative pre-training to reinforce semantic consistency [19]. Nevertheless, mainstream approaches still tend to extract static global features or rely on pre-defined category schemes for visual input [20], limiting dynamic adaptation to complex semantic scenarios. Furthermore, generative and discriminative visual selection strategies are being explored to filter noisy or irrelevant content, addressing issues of weak cross-modal alignment and information redundancy [21].
Overall, mainstream MNER methods face two persistent challenges. First, inconsistencies in multimodal data representations often result in joint models introducing inefficient or even interfering signals, leading to suboptimal fusion of heterogeneous information. Second, even with high-quality visual inputs, it remains difficult to achieve deep cross-modal alignment and to suppress noisy signals, exposing entity recognition to misclassification risk.
To address these challenges, we present an end-to-end CASF-MNER model, which holistically optimizes the synergy between vision and text representations from both multi-level and multi-granularity perspectives. Through progressive cross-modal interactions and hierarchical constraint strategies, our model enhances representation consistency and discrimination ability, mitigating problems associated with weak alignment and redundant information fusion in multimodal settings.

2.2. Multimodal Fusion Based on Dual-Stream Transformer

Early MNER methods mostly spliced text features (e.g., from BiLSTM or BERT) and image features (e.g., from ResNet) in a simple two-stream pipeline, but such designs struggle to cope with the heterogeneity of modal representations, resulting in limited information fusion. With the wide application of the Transformer model, multimodal research has gradually moved to the more complex dual-stream Transformer architecture, in which a Transformer independently encodes each modality and applies a global attention mechanism to it.
Yu et al. [22] proposed a specialized module for multimodal dynamic alignment in a dual-stream Transformer with multilayer interaction, which captures dynamic semantic associations between vision and text to improve the overall representation capability. However, such methods are still generally built on static visual branches on top of encoders such as ResNet, where anisotropy between representations remains prominent, and they do not account for the loss of robustness caused by differences in modal saliency.
To address the above issues, this paper constructs a dual-stream Transformer in the CASF-MNER model from the perspective of model heterogeneity. It naturally resolves the heterogeneity barriers between multimodal representations at the architectural level and further integrates the saliency feature module to enhance the dynamic adaptability and robustness of multimodal feature integration.

2.3. Cross-Modal Contrastive Learning

In the field of NLP, contrastive learning mechanisms have gradually been introduced to enhance semantic aggregation and discrimination [23,24,25], but most traditional schemes serve single-modal representations and struggle to fully exploit cross-modal synergy and weakly aligned representations among heterogeneous multi-source signals. To meet the demands of multimodal tasks, a variety of cross-modal contrastive learning methods have been proposed in recent years, which attempt to reduce differences in multimodal feature distributions through global and local paradigms and to enhance feature-space discrimination and robustness [24]. The application of contrastive learning in computer vision and natural language processing has been expanding, and it has been shown to improve the consistency of model representations under data transformations, perturbations, and local augmentations [26,27].
Building on this, this paper incorporates cross-modal contrastive objectives into the CASF model and introduces consistency guidance signals to facilitate deeper mining of latent text-vision connections, effectively mitigating weak modal alignment and representation bias.

3. CASF-MNER Model

Similar to existing MNER studies [10,23,28], this paper formalizes the task as a sequence labeling problem. The CASF model is shown in Figure 2 and is mainly composed of five stacked modules: first, the feature extraction layer that vectorizes text and images; second, the dual-stream Transformer architecture, which consists of three stacked components, namely the information encoding layer, the significance feature module, and the deep semantic fusion module; and finally, contrastive learning is combined with a Conditional Random Field framework to capture label dependencies in the entity recognition task. In addition, this paper denotes the number of text encoder layers by $L_T$, the number of visual encoder layers by $L_I$, the number of layers of the correlation-alignment mechanisms by $L_C$, and the number of layers of the deep semantic fusion method by $L_D$, with $L_{\mathrm{BERT}} = L_T + L_C + L_D$ and $L_{\mathrm{CLIP}} = L_I + L_C + L_D$, ensuring that the model structure is balanced and parameter-efficient.
Let $T = \{w_1, w_2, \ldots, w_N\}$ denote the input word sequence, where $w_i \in T$ denotes the $i$-th word in the sentence. $Y = \{y_1, y_2, \ldots, y_N\}$ are the corresponding sequence labels, where $y_i \in \mathcal{Y}$ and $\mathcal{Y}$ is the set of predefined labels under the standard BIO scheme. Meanwhile, this paper uses $I = \{I_1, I_2, \ldots, I_M\}$ to denote the input set of visual objects, which provide complementary information to the text and jointly support accurate entity recognition and classification. For the reader's convenience, all main notations are collectively summarized in Table 1.

3.1. Visual and Textual Feature Extraction

In this paper's feature extraction framework, a dual-stream feature encoding scheme is proposed to fuse visual and textual information and capture multimodal contextual representations.
For visual feature extraction, in order to supplement entity information that may be missing from the sentence, this paper departs from traditional methods: it not only extracts global features from the original image but also proposes a visual feature extractor combined with Visual Representation Enhancement (VRE) to construct a multi-source visual representation. Within this framework, Faster-RCNN and visual grounding techniques are introduced to extract local visual object information. In the visual grounding process, the Stanford grammar parser is used to identify all noun phrases in the input sentence, and the visual grounding tool detects the bounding object corresponding to each noun phrase. However, relying only on noun phrases can hardly cover all potential visual information in an image. Therefore, this paper further introduces four predefined entity categories (PER, LOC, ORG, MISC) to discover more relevant visual objects.
This paper uses CLIP as the visual feature extractor to obtain contextually relevant visual representations. A uniform patch size is used to divide the three visual inputs into patch blocks, respectively. In addition, similar to the BERT embedding technique, a special marker [CLS] is inserted at the beginning of the visual patch sequence, and the segmented image patches are then processed through CLIP's visual embedding module, finally yielding the visual features $E^I = \{e_1^I, e_2^I, \ldots, e_M^I\}$ represented as an image sequence, where $M$ is the length of the transformed image sequence.
For text feature extraction, this paper uses BERT as the text feature extractor. BERT is chosen as the base feature extraction model because it can generate differentiated representations for the same word in different contexts. BERT embeddings consist of three parts: token embeddings, segment embeddings, and position embeddings. Each word in a sentence is first transformed into a context vector $e_i$, and a special marker [CLS] is inserted at the start of the sentence and a marker [SEP] at the end to complete the preprocessing. The text sequence is finally represented as $E^T = \{e_1^T, e_2^T, \ldots, e_N^T\}$, where $N$ is the length of the converted text sequence.

3.2. Multimodal Dual Stream Transformer

To address the heterogeneity between textual and visual representations, we design a multimodal Transformer architecture that adopts a unified encoding strategy through a dual-stream framework. As depicted in Figure 2, our model features two parallel Transformer branches, tailored respectively for processing text and image inputs. Importantly, these two streams are coupled at the initial encoding stages by sharing lower-layer Transformer blocks and their parameters. This shared-weight approach projects both modalities into a harmonized latent space early on, effectively bridging semantic and structural discrepancies that typically exist between text and visual features. In contrast to traditional multimodal systems that rely on separate or heterogeneous encoding modules for each modality, our method is built on a homogeneous backbone shared across modalities at the foundational level. Such an architectural design not only facilitates tighter coupling and more thorough feature interaction across modalities but also streamlines parameter usage. Thanks to this setup, the model maintains modality-specific nuances in upper layers while ensuring that both streams undergo aligned feature transformations and gradual fusion from the outset. The architecture includes the following core components:

3.2.1. Information Encoding Layer

This layer adopts a dual-stream Transformer structure, which achieves cross-modal feature alignment through a parameter sharing mechanism. Its core design concept is to eliminate modal representation differences through a unified computing unit and construct a transferable feature space. The attention mechanism, as the core component of Transformer, is mathematically equivalent to establishing a dynamic correlation matrix between input elements. This study extends it to a multi-head form, whose architecture is shown in Figure 3:
The image encoder utilizes the CLIP-ViT-B/32 architecture, which has been trained on a large-scale dataset containing 400 million image-text pairs, enabling it to effectively extract features such as basic edges, textures, and spatial patterns. The encoder architecture includes 12 Transformer layers, each with 768 hidden units and 12 self-attention heads. GELU is used as the activation function throughout, and dropout is applied at a rate of 0.1 to support regularization. Each input image is split into 32 × 32 pixel patches, with each patch converted into an embedding and combined with positional information prior to being fed into the encoder. For processing textual data, we adopt BERT-base-uncased, which is optimized for modeling core lexical meanings and syntactic relationships. This text encoder similarly contains 12 Transformer blocks, each leveraging a 768-dimensional hidden space, 12-head self-attention, and a 3072-unit feed-forward module using GELU activation. The dropout rate in every layer is also set to 0.1 to reduce overfitting risk. Both visual and textual encoders are fine-tuned together in our setup, which encourages effective alignment and mutual adaptation of multimodal features for the MNER task.
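As a concrete illustration, the two encoders described above could be instantiated with the HuggingFace Transformers library roughly as follows; the checkpoint names match the architectures named in the text (CLIP-ViT-B/32 and BERT-base-uncased), but the loading code itself is a sketch and not the authors' released implementation.

```python
# Sketch (not the authors' released code): loading the two encoders described
# above with HuggingFace Transformers. Checkpoint names match the text; the
# surrounding training code is omitted.
import numpy as np
from transformers import (
    BertModel,
    BertTokenizerFast,
    CLIPImageProcessor,
    CLIPVisionModel,
)

# Text stream: BERT-base-uncased (12 layers, 768 hidden units, 12 heads).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")

# Visual stream: CLIP-ViT-B/32 (12 layers, 768 hidden units, 32x32 patches).
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
visual_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

# Forward pass for one text-image pair (random pixels stand in for a real image).
tokens = tokenizer(
    "Thanks Andrew for the great Tesla road trip presentation",
    return_tensors="pt", padding=True, truncation=True,
)
text_feats = text_encoder(**tokens).last_hidden_state        # shape (1, N, 768)

dummy_image = (np.random.rand(224, 224, 3) * 255).astype("uint8")
pixels = image_processor(images=dummy_image, return_tensors="pt")
visual_feats = visual_encoder(**pixels).last_hidden_state    # shape (1, M, 768)

# Both encoders are fine-tuned jointly with the rest of the model (Section 4.2).
```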

3.2.2. Correlation Alignment Mechanisms

Correlation Alignment Mechanisms (CAM) are devised to precisely map and align intricate relationships between the visual and textual modalities. Unlike conventional approaches that either treat cross-modal interaction as a one-off process or lack deep, iterative coupling, CAM introduces a progressive, multi-layer cross-attention framework. This design allows the semantic information of the two modalities to refine each other at every stage, thereby achieving more comprehensive feature alignment; the mechanism of this module is similar to the fusion structure shown in Figure 5.
In CAM, the initially encoded visual and textual high-level semantic features are first initialized with each other to provide a basis for subsequent multi-layer alignment. Each computational layer interacts recursively by:
$I^{l}, T^{l} = \mathrm{CAM\text{-}Encode}(I^{l-1}, T^{l-1}), \quad l = 1, \ldots, L_C$
where $I^{l}$ and $T^{l}$ are the hidden states of the encoder at layer $l$.
Here, features from one modality are not only integrated with, but also act as dynamic context for, the other modality using outputs from the previous layer as queries, keys, and values. This bidirectional interaction results in continuous enhancement and nuanced alignment between modalities. Additionally, to promote further distinctiveness and robustness in the fused representation, we incorporate a combination of visual and textual contrastive loss at every correlation layer. Within each mini-batch, the similarity between paired (positive) and unpaired (negative) samples is calculated, explicitly encouraging closer alignment for positive pairs and separation for unmatched pairs. Overall, our approach moves beyond traditional paradigms by applying recursive, progressive correlation at multiple depths and integrating contrastive learning throughout each alignment stage. This enables CAM to achieve highly adaptive and context-sensitive cross-modal representation, even in the presence of ambiguous or noisy input.
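To make the recursion of Eq. (1) concrete, the following PyTorch sketch shows one plausible CAM layer in which each modality uses its previous hidden state as the query against the other modality's keys and values; the residual and layer-norm placement, head count, and stacking depth are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CAMLayer(nn.Module):
    """One correlation-alignment layer: bidirectional cross-attention in which
    each modality queries the other, following Eq. (1). Residual/LayerNorm
    placement and head count are plausible choices, not the exact design."""

    def __init__(self, dim: int = 768, num_heads: int = 12, dropout: float = 0.1):
        super().__init__()
        self.img_queries_txt = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.txt_queries_img = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.norm_i = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, I_prev: torch.Tensor, T_prev: torch.Tensor):
        # Outputs of the previous layer serve as queries, keys, and values.
        I_upd, _ = self.img_queries_txt(query=I_prev, key=T_prev, value=T_prev)
        T_upd, _ = self.txt_queries_img(query=T_prev, key=I_prev, value=I_prev)
        # Residual connections keep modality-specific information.
        return self.norm_i(I_prev + I_upd), self.norm_t(T_prev + T_upd)

# Stacking L_C layers realizes I^l, T^l = CAM-Encode(I^{l-1}, T^{l-1}).
cam = nn.ModuleList(CAMLayer() for _ in range(3))   # L_C = 3 is illustrative
I_l, T_l = torch.rand(2, 50, 768), torch.rand(2, 32, 768)
for layer in cam:
    I_l, T_l = layer(I_l, T_l)
```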

3.2.3. Significance Feature Module

The Significance Feature Module (SFM) is the second element designed in this paper. It aims to achieve global optimization and redundancy suppression of sequence features by fusing multi-scale statistical features with a dynamic recalibration mechanism. The core idea is to map the input tensor $X \in \mathbb{R}^{L \times H \times D}$ to the optimized feature representation $U \in \mathbb{R}^{L \times H \times D}$, mathematically characterized as:
$U = \mathrm{SFM}(X)$
where $X_r = \mathrm{reshape}(X) \in \mathbb{R}^{L \times (H \cdot D)}$ is the dimensionally compressed 2D feature, $\alpha \in \mathbb{R}^{L}$ is the dynamically generated significance scoring vector, and $f_1$, $f_2$ are fully connected layers. The architecture is shown in Figure 4:
The computational flow of the module starts with global statistical modeling of the sequence features: first, multi-scale statistics over the sequence dimension are extracted by maximum pooling, mean pooling, and soft pooling, which capture the strong response extremes, the balanced distribution characteristics, and the probability-weighted features, respectively. Next, the three types of statistics are fused through the nonlinear mapping of the fully connected layer $f_1$ to generate the initial significance score:
$\alpha = \sigma\big(f_1(X_{\mathrm{Max}}) + f_1(X_{\mathrm{Avg}}) \times f_1(X_{\mathrm{Soft}})\big)$
This score serves as the information-entropy weight of each feature unit in the global context. Finally, the scores are used to nonlinearly modulate and residually reweight the raw features:
$\mathrm{SFM}(X) = \mathrm{reshape}\big(f_2(\alpha) \times X_r + X_r\big)$
Unlike conventional attention or pooling mechanisms that largely depend on a single statistical descriptor or static weighting, SFM introduces an innovative fusion of multi-scale statistics and entropy-aware dynamic recalibration within a unified module. Specifically, SFM distinguishes itself by simultaneously leveraging maximum and average pooling to gather both extreme and holistic contextual cues, and further integrates soft pooling to capture subtle correlations through probability-based weighting, an aspect often overlooked in standard designs. The significance score $\alpha$, functioning as an adaptive information-entropy proxy, allows SFM to selectively amplify salient features and suppress redundant or noisy regions in a data-driven manner. This targeted reweighting advances beyond static or hand-crafted attention schemes by providing a context-sensitive, sample-specific modulation of the feature map. Thanks to its streamlined and efficient structure, SFM stands out for its ease of integration with mainstream encoders like Transformers, providing a plug-and-play mechanism that robustly enhances global feature quality without introducing significant computational cost.
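A minimal PyTorch sketch of SFM along the lines of Eqs. (2)-(4) is given below. The paper does not fully specify the pooling axis or the widths of $f_1$ and $f_2$, so this version pools each statistic over the flattened feature dimension (one scalar per token) and uses minimal shared linear layers; it should be read as an interpretation, not the exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SFM(nn.Module):
    """Significance Feature Module sketched from Eqs. (2)-(4). Pooling axis and
    the widths of f1/f2 are assumptions: statistics are pooled over the flattened
    feature dimension (one scalar per token) and f1, f2 are minimal linear layers."""

    def __init__(self):
        super().__init__()
        self.f1 = nn.Linear(1, 1)   # maps each pooled statistic to a score component
        self.f2 = nn.Linear(1, 1)   # maps the significance score to a feature gate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, l, h, d = x.shape                        # X in R^{L x H x D} (plus batch)
        x_r = x.reshape(b, l, h * d)                # X_r: dimensionally compressed feature

        # Multi-scale statistics per token: strong responses, balanced distribution,
        # and probability-weighted (soft) pooling.
        x_max = x_r.max(dim=-1, keepdim=True).values
        x_avg = x_r.mean(dim=-1, keepdim=True)
        x_soft = (F.softmax(x_r, dim=-1) * x_r).sum(dim=-1, keepdim=True)

        # Eq. (3): alpha = sigma(f1(X_Max) + f1(X_Avg) * f1(X_Soft)).
        alpha = torch.sigmoid(self.f1(x_max) + self.f1(x_avg) * self.f1(x_soft))

        # Eq. (4): nonlinear modulation plus residual, then reshape back.
        u = self.f2(alpha) * x_r + x_r
        return u.reshape(b, l, h, d)

# Example: recalibrate a batch of 32-token sequences with 12 heads of 64 dims.
sfm = SFM()
recalibrated = sfm(torch.rand(2, 32, 12, 64))       # same shape as the input
```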

3.2.4. Deep Semantic Fusion Method

The Deep Semantic Fusion (DSF) module, as the third core contribution of this paper, addresses the persistent challenges of semantic misalignment and suboptimal fusion in multimodal feature integration. Unlike conventional fusion approaches that often employ shallow concatenation or fixed attention paradigms, DSF utilizes an iterative, cross-modal interaction strategy coupled with an explicit saliency modeling mechanism. This design uniquely enables the multi-level coupling of visual and textual semantics, while explicitly preserving the distinctive and discriminative characteristics of each modality—a key departure from widespread methods that tend to blur modality-specific information during fusion. The overall framework is depicted in Figure 5:
The module implements the following calculation process:
$I^{l} = \mathrm{DSF}\big(Q^{l-1}, \Phi(I^{l-1}, T^{l-1})\big)$
$T^{l} = \mathrm{DSF}\big(Q_T^{l-1}, \mathrm{SFM}(\Phi(T^{l-1}, I^{l-1}))\big)$
where $l \in \{L_C, \ldots, L_D\}$ denotes the fusion level, $\Phi$ is the feature interaction function, and $Q^{l-1}$, $Q_T^{l-1}$ represent the query matrices of the two modalities, respectively.
A fundamental distinction of DSF lies in its joint application of SFM-driven adaptive weighting and recurrent cross-modal feature refinement—enabling not only the mitigation of distribution mismatch and redundancy, but also the dynamic recalibration of semantic relevance across modalities and fusion depths. Furthermore, DSF incorporates contrastive objectives at each hierarchical stage, providing fine-grained supervision that enforces consistent alignment and maximizes the discriminative power of the shared representation space throughout the entire stacking process.
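The sketch below illustrates one possible DSF layer following Eqs. (5)-(6), reusing the SFM class from the previous sketch. Realizing the interaction function $\Phi$ as cross-attention and combining the query stream with the interaction features through a residual connection are assumptions about details the text leaves open.

```python
import torch
import torch.nn as nn

class DSFLayer(nn.Module):
    """One deep-semantic-fusion layer following Eqs. (5)-(6). Phi is realized as
    cross-attention, SFM (from the sketch in Section 3.2.3) recalibrates the
    text-side interaction, and the outer fusion is a residual combination; all
    of these are plausible readings, not the confirmed design."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.phi_i = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.phi_t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.sfm = SFM()                     # Significance Feature Module sketched earlier
        self.num_heads = num_heads
        self.norm_i = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, I_prev: torch.Tensor, T_prev: torch.Tensor):
        b, n, dim = T_prev.shape
        # Phi(I^{l-1}, T^{l-1}): visual stream enriched with textual context.
        phi_it, _ = self.phi_i(query=I_prev, key=T_prev, value=T_prev)
        # Phi(T^{l-1}, I^{l-1}): textual stream enriched with visual context,
        # then recalibrated by SFM as in Eq. (6).
        phi_ti, _ = self.phi_t(query=T_prev, key=I_prev, value=I_prev)
        phi_ti = self.sfm(phi_ti.reshape(b, n, self.num_heads, dim // self.num_heads))
        phi_ti = phi_ti.reshape(b, n, dim)
        # Query streams fused with the interaction features via residuals.
        return self.norm_i(I_prev + phi_it), self.norm_t(T_prev + phi_ti)
```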

3.3. Visual-Textual Contrast Learning

In order to bridge the cross-modal semantic gap and enhance the quality of multimodal alignment, this study constructs a bidirectional symmetric contrastive learning (CL) framework that jointly optimizes representations via visual-to-text and text-to-visual contrastive losses. Given the batch data $\mathcal{B} = \{(I_i, T_i)\}_{i=1}^{N}$, where $(I_i, T_i)$ is a positive sample pair and the remaining cross-modal combinations are negative samples, the objective functions are defined as follows:
$\mathcal{L}_{I2T} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\big(\mathrm{sim}(I_i, T_i)/\tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(I_j, T_i)/\tau\big)}$
$\mathcal{L}_{T2I} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\big(\mathrm{sim}(T_i, I_i)/\tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(T_j, I_i)/\tau\big)}$
The cross-modal similarity is calculated as follows:
$\mathrm{sim}(I, T) = \frac{\alpha(I)^{\top}\beta(T)}{\|\alpha(I)\| \cdot \|\beta(T)\|}$
where $\alpha(I)$ and $\beta(T)$ are the embedding functions of the two modalities, $\tau > 0$ controls the sharpening of the feature distribution, $N$ is the batch size, and $\mathrm{sim}(\cdot)$ is the similarity function.
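The bidirectional objective of Eqs. (7)-(9) corresponds to a symmetric InfoNCE loss, which can be sketched in PyTorch as follows; the temperature value and the use of pooled [CLS] embeddings are illustrative choices.

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                                   tau: float = 0.07):
    """Symmetric InfoNCE objective corresponding to Eqs. (7)-(9); tau = 0.07
    and pooled 768-d embeddings are illustrative choices."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / tau                       # logits[i, j] = sim(I_i, T_j) / tau

    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits.t(), targets)    # for each text, its paired image is positive
    loss_t2i = F.cross_entropy(logits, targets)        # for each image, its paired text is positive
    return loss_i2t, loss_t2i

# Usage: pooled [CLS] representations from the two streams, batch size N = 16.
l_i2t, l_t2i = bidirectional_contrastive_loss(torch.rand(16, 768), torch.rand(16, 768))
```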

3.4. Conditional Random Field Decoder

In the multimodal named entity recognition task, entity labels depend not only on the feature representation at the current position but also exhibit strong sequential dependencies with neighboring labels. To better capture this structural information, this paper employs a conditional random field (CRF) as the sequence decoder to model the label sequence over the fused joint modal feature $H = f(I, T)$. Specifically, by defining the transition score function $\phi_i(y_i, y_{i+1}; H)$, the CRF jointly models contextual features and label transition relationships, improving the prediction of named-entity boundaries and internal consistency.
Let $y = (y_1, \ldots, y_M)$ be a label sequence of length $M$. The conditional probability of the label sequence under the CRF layer is:
$p(y \mid H; \theta_{\mathrm{CRF}}) = \frac{\prod_{i=1}^{M-1} \phi_i(y_i, y_{i+1}; H)}{\sum_{y' \in \mathcal{Y}} \prod_{i=1}^{M-1} \phi_i(y'_i, y'_{i+1}; H)}$
where $\phi_i(y_i, y_{i+1}; H)$ is the potential function, $\mathcal{Y}$ denotes the set of all possible label sequences, and $\theta_{\mathrm{CRF}}$ is the set of parameters defining the potential function and the transition scores from label $y_i$ to $y_{i+1}$. The conditional probability is maximized by training with a negative log-likelihood loss:
$\mathcal{L}_{\mathrm{MNER}} = -\frac{1}{|\mathcal{D}|}\sum_{i=1}^{N} \log p\big(\hat{y}^{(i)} \mid H^{(i)}; \theta_{\mathrm{CRF}}\big)$
The overall training loss function is as follows:
$\mathcal{L} = \mathcal{L}_{\mathrm{MNER}} + \lambda_{I2T}\,\mathcal{L}_{I2T} + \lambda_{T2I}\,\mathcal{L}_{T2I}$
where $\lambda_{I2T}$ and $\lambda_{T2I}$ are the trade-off parameters for the two contrastive losses.
By introducing the CRF layer, the model can fully utilize the coupling between multimodal features and the label dependencies between contexts to achieve more accurate and smooth label sequence prediction.
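A compact sketch of the CRF decoding stage is shown below; it uses the third-party pytorch-crf package and an illustrative BIO tag set of nine labels, neither of which is confirmed by the paper, and folds the contrastive terms into the overall loss of Eq. (12).

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # pip install pytorch-crf; use of this package is an assumption

class CRFDecoder(nn.Module):
    """Emission projection plus CRF layer over the fused representation H,
    sketching Eqs. (10)-(11); nine BIO tags (B/I for PER, LOC, ORG, MISC plus O)
    are illustrative."""

    def __init__(self, hidden_dim: int = 768, num_tags: int = 9):
        super().__init__()
        self.emission = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, H: torch.Tensor, tags: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # Negative log-likelihood of the gold label sequence, i.e. L_MNER.
        return -self.crf(self.emission(H), tags, mask=mask, reduction="mean")

    def decode(self, H: torch.Tensor, mask: torch.Tensor):
        # Viterbi decoding of the most probable label sequence.
        return self.crf.decode(self.emission(H), mask=mask)

# Overall objective of Eq. (12): CRF loss plus the two weighted contrastive terms.
decoder = CRFDecoder()
H = torch.rand(2, 20, 768)
tags = torch.zeros(2, 20, dtype=torch.long)
mask = torch.ones(2, 20, dtype=torch.bool)
total_loss = decoder.loss(H, tags, mask)  # + lambda_I2T * l_i2t + lambda_T2I * l_t2i
```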

4. Experiment

4.1. Datasets

This study is based on two multimodal named entity recognition benchmark datasets, Twitter-2015 [29] and Twitter-2017 [30], which contain user posts from the Twitter platform in 2014-2015 and 2016-2017, respectively. It should be emphasized that each tweet consists of a text-image pair, in which the text content may not be semantically related to the corresponding image, and the named entities in the text range from zero to multiple occurrences. The datasets use a four-class entity annotation scheme: person (PER), organization (ORG), location (LOC), and other categories (MISC). Adopting the standard datasets preprocessed in the literature [24], this paper systematically counts the frequency distribution of each entity type and the total number of multimodal tweets in the training, validation, and test sets, as shown in Table 2.

4.2. Experimental Settings

The experiments in this paper were conducted on an NVIDIA RTX 4090 GPU, based on PyTorch 1.12.1 with CUDA 11.6. For both the Twitter-2015 and Twitter-2017 datasets, the training parameters of the model were carefully adjusted to suit the characteristics of each dataset. To present these experimental parameters more intuitively, they are organized into a table, as shown in Table 3.
The experimental parameter settings in Table 3 are mainly based on typical experiences from existing literature in related fields, combined with the characteristics of this research task and the data set used, and have been finely tuned multiple times on the validation set. Specifically, the base learning rate and weight decay are selected using a grid search method to screen for the optimal combination within a reasonable range, so as to improve model performance while ensuring a smooth and convergent training process. For the conditional random field (CRF) layer, given its unique structure in sequence labeling tasks and sensitivity to convergence rates, we set a relatively high initial learning rate to accelerate parameter updates and improve overall training efficiency. Parameters such as mini-batch size, maximum sequence length, number of training epochs, and gradient clipping are selected according to common deep learning standards to balance model convergence speed, training adequacy, and the need to suppress gradient explosion.

4.3. Evaluation Metrics

In this study, we follow the evaluation framework established in the literature [20] to assess the effectiveness of the proposed CASF-MNER model, constructing a three-part evaluation system comprising Precision, Recall, and the $F_1$ score. Specifically, Precision measures the reliability of the model's predictions and is computed as the ratio of correctly recognized entities to all predicted entities:
$\mathrm{Precision} = \frac{TP}{TP + FP}$
The Recall metric, on the other hand, focuses on assessing the model’s ability to cover real entities and is mathematically defined as:
$\mathrm{Recall} = \frac{TP}{TP + FN}$
To balance the trade-off between Precision and Recall comprehensively, this paper takes the $F_1$ score as the core evaluation metric. This metric combines the two through the harmonic mean, and its expression is:
$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
This metric quantifies the overall performance of the model in the interval [0,1] and reaches its maximum when Precision and Recall are balanced, effectively avoiding the bias that may arise from evaluating a single metric.
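For reference, the entity-level metrics above can be computed with the seqeval package over BIO-tagged sequences, as in the short example below; the authors' exact evaluation script is not specified, so this is only a common equivalent.

```python
# Entity-level Precision / Recall / F1 over BIO tag sequences, computed here
# with the seqeval package (a common choice; not necessarily the authors' script).
from seqeval.metrics import f1_score, precision_score, recall_score

y_true = [["B-PER", "I-PER", "O", "B-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC", "O"]]

print(precision_score(y_true, y_pred))   # TP / (TP + FP) over whole entities -> 0.5
print(recall_score(y_true, y_pred))      # TP / (TP + FN)                     -> 0.5
print(f1_score(y_true, y_pred))          # harmonic mean of the two           -> 0.5
```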

4.4. Model Training and Performance Evaluation

4.4.1. Model Training

During model training, we monitored the performance on both the Twitter-2015 and Twitter-2017 datasets in detail, focusing on the trends of overall Precision, Recall, and F1 scores. The relevant results are shown in Figure 6.
Overall, as the number of training epochs increases, the model's metrics on both datasets improve significantly. It is worth noting that after epoch 10, some metrics such as Precision and F1 show a brief decline. This is mainly because, as the model gradually adapts to the details of the training set, it tends to overfit some noisy or hard-to-discriminate samples, which leads to temporary fluctuations in the metrics. Subsequently, the model performance stabilizes and gradually converges to a higher level at a later stage.
Comparing the two datasets, the final metrics on Twitter-2017 are better than those on Twitter-2015, with F1 reaching about 0.87, indicating that the model is also able to generalize well in more challenging data environments. In terms of details, Recall on Twitter-2017 remains above 0.87 from epoch 10 onward with minimal fluctuation, showing the model's strong ability to recall relevant instances. Precision and F1, in turn, also improve steadily from epoch 10, indicating that the ability to discard invalid information and categorize accurately is enhanced at the same time.
For the Twitter-2015 dataset, the three metrics oscillate slightly as training progresses, with the F1 score fluctuating especially around epoch 20, but they generally maintain a steady upward trend. In the end, Precision, Recall, and F1 are each around 0.73, which is a stable performance. This indicates that the model can still gain some performance improvement from long-term training when the data are relatively old or slightly noisy.
In summary, the evolution of the model's performance across epochs fully reflects its effective learning ability and good convergence characteristics. Higher F1 scores and smaller metric fluctuations demonstrate the reliability of the proposed method on the multimodal named entity recognition task.

4.4.2. Model Performance Evaluation

In order to visualize the model performance, we select three representative samples for the case study, as shown in Figure 7.
Case (a) demonstrates the value of incorporating visual semantics in entity type determination. In the example “Thanks Andrew for the great Tesla road trip presentation,” the BERT-CRF model—relying solely on textual information—mislabels two entities, and the MGCMT model, despite improvements, still does not correctly identify the entity “Andrew.” Our approach leverages visual grounding by modeling the association between the speaker and the Tesla car, aided by saliency-guided attention, enabling correct identification of all entities in this particular case.
Case (b) explores the model’s ability to handle entities from technical or professional domains. In the example “Coiner: Practical Applications of [GIS MISC] in Crisis Mapping #TAMUCC #COMM4335 #ESRI #NYC,” BERT-CRF struggles to identify “GIS” as a MISC entity, while MGCMT fails to correctly classify “ESRI” as an organization. Our framework leverages deep multimodal fusion to more accurately distinguish entity categories by leveraging connections between textual terms and visual cues (such as GIS interfaces). However, despite these advantages in certain technical contexts, our model’s performance on detecting MISC categories on the Twitter2015 dataset still leaves room for improvement.
Case (c) presents a complex example involving rich media content. For “Jennifer Lawrence on the cover of Harper’s Bazaar Magazine Bulgaria (June 2016),” both BERT-CRF and MGCMT fail to comprehensively and correctly identify all relevant entities, particularly those belonging to the PERSON category. In this scenario, our model is able to utilize both textual and visual features to recognize people, organizations, and geographic references, demonstrating its potential for nuanced, cross-modal entity extraction.
Notably, while our hybrid model consistently achieves higher accuracy than the baseline on major entity types (e.g., PERSON, ORG, and LOC), we also observe slightly lower performance than the unimodal BERT-CRF on MISC categories on the Twitter2015 dataset. This result may be due to the inherent ambiguity and contextual dependencies of MISC entities, which are more difficult to capture through multimodal associations and may require more advanced knowledge integration or additional training data. Furthermore, some misclassifications still occur when the visual context is weak or misleading, or when the entity is underrepresented in both modalities.

4.5. Experimental Results and Analysis

In order to fully validate the effectiveness of the model in this paper on the task of multimodal named entity recognition, we conducted comparative experiments between the CASF-MNER model and existing benchmark models on two public datasets, TWITTER-2015 and TWITTER-2017.
Analysis of text-based named entity recognition (NER) methods. As shown in Table 4 and Table 5, among text-based single-modal methods, methods based on pre-trained language models generally outperform traditional methods. On the TWITTER-2015 dataset, the F 1 score of BERT-CRF was 71.81%, significantly higher than the 64.42% of BiLSTM-CRF, representing an improvement of 7.39 percentage points. HBiLSTM-CRF outperformed BiLSTM-CRF by 4.75%, achieving a score of 69.17%, which demonstrates the effectiveness of hierarchical architectures in sequence labeling tasks. On the TWITTER-2017 dataset, BERT-CRF achieved an F 1 score of 83.44%, which is 7.13% higher than BiLSTM-CRF, indicating that pre-trained models can better capture semantic information in social media text.
Analysis of multimodal named entity recognition (NER) methods. The data in Table 4 and Table 5 show that models incorporating visual information generally outperform text-only models. On the TWITTER-2015 dataset, UMGF achieved an F 1 score of 74.85%, which is 3.04% higher than the best text-based model, BERT-CRF; GDN-CMCF and MGCMT achieved 73.05% and 74.18%, respectively. On the TWITTER-2017 dataset, the advantage of graph fusion is even more pronounced, with UMGF, GDN-CMCF, and MGCMT achieving F 1 scores of 85.51%, 85.71%, and 85.89%, respectively, significantly higher than single-modal methods. This indicates that visual information plays an important supplementary role in handling ambiguity and incomplete expressions in social media text.
Comparative analysis with other MNER methods. A comprehensive analysis of Table 4 and Table 5 shows that the CASF-MNER model proposed in this paper achieves excellent results on both datasets. On TWITTER-2015, the F 1 score reaches 74.16%, which is close to the current best UMGF (74.85%) but outperforms the other comparison methods in terms of Recall; on TWITTER-2017, the F 1 score is 86.81%, which outperforms all the comparison methods, and improves by 0.92 percentage points over the next best MGCMT (85.89%). Notably, the model in this paper performs particularly well on the difficult-to-recognize ORG and MISC categories, reaching 85.22% and 70.38%, respectively, on TWITTER-2017, which is significantly higher than the other comparison methods.
Further analysis of the methods. As shown in Figure 8, the F 1 scores for the four entity categories demonstrate the differences between multimodal methods and text-based methods, particularly in the ORG and MISC categories on the TWITTER-2017 dataset, where the CASF-MNER model achieved significant improvements. However, it is worth noting that in the MISC category of the TWITTER-2015 dataset, the F 1 scores of multimodal methods such as CASF-MNER were lower than those of the BERT-CRF text-based method, indicating a negative transfer effect caused by multimodal information. This phenomenon likely results from the limited accuracy of data collection and annotation in the early stages of social media environments. Additionally, models may sometimes misclassify certain visually similar domain entities (such as local businesses or specific cultural landmarks) as more general or higher-frequency categories, potentially due to data distribution biases during CLIP pre-training. Among these four categories, the MISC category has the most complex instance distribution, resulting in generally lower image quality and text-image synergy. The model struggles to fully leverage multimodal synergy during joint representation, and may even be misled by erroneous or irrelevant visual information, leading to negative impacts on the MISC category and ultimately causing the F 1 score to decline. While multimodal information can significantly improve the overall performance of named entity recognition in most scenarios, its performance in specific categories is constrained by the quality of the data itself and the attributes of the category.
It should be noted that although this study achieved significant performance improvements on the two mainstream multimodal social media datasets TWITTER-2015 and TWITTER-2017, the current experiments were limited to these datasets due to computational power, time, and the availability of publicly labeled datasets. The lack of more comprehensive validation on a wider range of datasets has, to some extent, limited the universality and comprehensiveness of the experimental results. Therefore, the experimental results primarily reflect the model’s effectiveness in these two typical scenarios, and its performance in more complex and diverse social media environments remains to be further evaluated.

4.6. Ablation Study

In order to further explore the role of different modules of the model in this paper for entity recognition, we conducted ablation experiments, the results of which are shown in the following table:
As shown in Table 6, when the visual representation enhancement is removed, the F1 scores of the model on the Twitter-2015 and Twitter-2017 datasets decrease by 0.84% and 1.01%, respectively. This indicates that high-quality visual representations play an important role in entity recognition. The enhanced features obtained through fine-grained object detection and visual grounding provide the model with more precise visual semantic information, effectively improving entity understanding and localization.
When CAM is removed, the F1 scores of the model on Twitter-2015 and Twitter-2017 drop by 0.43% and 0.17%, respectively. This result suggests that modal alignment plays a positive role in facilitating fine-grained multimodal semantic fusion, which helps to improve downstream task performance.
After removing SFM and feeding the visual modality information directly into the subsequent module, the model's F1 scores decrease by 0.79% and 0.58% on the two datasets, respectively. The results show that the SFM module helps to improve the semantic consistency of features and cross-modal co-expression, which benefits model performance.
Removing DSF results in decreases of 2.11% and 1.21% in F1 scores on Twitter-2015 and Twitter-2017, respectively, the largest performance drop among all modules. This indicates that the DSF module effectively realizes the semantic mapping and integration of heterogeneous modal information through deep feature fusion, fully exploiting the complementary relationship between textual and visual features, and is crucial to the overall performance.
The F1 scores of the model on the two datasets decrease by 0.59% and 0.33%, respectively, when training without the contrastive loss. This result validates the effectiveness of the contrastive learning mechanism in promoting latent-space alignment of heterogeneous modal representations and mitigating inter-modal distributional differences, thus enhancing the model's ability to capture cross-modal semantic associations.
To show the independent contribution of each functional module to the overall model performance more intuitively, this paper further visualizes the results of the ablation experiments in Figure 9. Comparing the performance curves of the complete model with those obtained after removing each sub-module clearly shows that removing different modules has significant and distinct impacts on entity recognition Precision, Recall, and F1 score.
In Figure 9a, DSF removal leads to the largest F1 loss with a positive P/R deviation, confirming its key role in integrating BERT representations and enhancing Recall. VRE contributes the next highest loss (0.93%), mainly optimizing the fusion of CLIP visual features with textual information. The significant negative P/R deviation of SFM on TWITTER-2017 reflects its particular contribution to Precision. According to the average F1-loss bar chart in Figure 9b, the DSF (Deep Semantic Fusion) module stands out with the highest F1 loss after ablation, firmly establishing its pivotal contribution to overall model performance. Specifically, removing DSF results in a 1.66% loss in F1 score, noticeably higher than the losses observed when disabling VRE (0.93%), SFM (0.69%), CL (0.46%), or CAM (0.30%). This pattern indicates that, while the other modules each play significant roles in boosting the model's effectiveness, DSF is irreplaceable for achieving deep cross-modal fusion in entity recognition. Furthermore, Figure 9a shows that, once DSF is ablated, the corresponding sample point exhibits both a larger F1 reduction and a positive Precision-Recall shift, suggesting that DSF is particularly beneficial for retrieving challenging entities and enables more comprehensive, nuanced cross-modal semantic integration.
In contrast, CAM (Cross-modal Alignment) and CL (Contrastive Learning) contribute relatively less to the F 1 improvement, but their influence remains non-negligible. The radar plot in Figure 9c highlights that CAM excels at “modality alignment,” effectively suppressing cross-modal noise and strengthening the interrelation of input representations. CL, while exhibiting lower scores in “contextual understanding,” demonstrates strong alignment capability; however, its contribution to global or intricate contextual reasoning is limited. Together, these two modules enhance the quality and consistency of underlying feature representations, providing a robust foundation for the semantic fusion operations carried out by DSF.
Delving deeper into the relationship between SFM and DSF, Figure 9c shows that the two modules emphasize different functional dimensions. SFM plays a prominent role in “semantic consistency” and “co-occurrence expression” by selectively extracting and amplifying salient information at the feature input stage, whereas DSF is more adept in “semantic fusion” and “contextual reasoning” at a higher abstraction level. Their functions are complementary: SFM refines the initial signal, filtering out irrelevant noise and delivering high-quality features to DSF, which then performs deeper integration and information interaction across the global semantic space. Empirical results demonstrate that removing either module leads to performance drops, and neither can fully substitute for the other.
Taken as a whole, Figure 9 illustrates the division of labor and mutual reinforcement among the various modules in the multimodal named entity recognition workflow. DSF is central to deep semantic integration, while VRE and SFM focus on optimizing modal features, and CL and CAM facilitate latent space alignment and noise mitigation. Rather than contributing in an additive fashion, these modules interact synergistically—each providing necessary support for others—to achieve both fine-grained and holistic improvements in model performance.

4.7. Analysis of Statistical Significance Between CASF and MGCMT

To further validate the performance advantage of the proposed CASF-MNER model over representative multimodal baselines, we performed rigorous statistical significance testing on the F 1 scores. Specifically, we independently trained and evaluated both the CASF and MGCMT models on the same data split using 10 random seeds to account for training variability and potential randomness inherent in deep learning model optimization. For each model, the F 1 score was recorded from 10 runs, and we computed the mean and standard deviation as indicators of central tendency and stability.
The experimental results, summarized in Table 7, show that CASF-MNER achieved an average F 1 score of 86.80 with a standard deviation of 0.518, while MGCMT obtained an average F 1 score of 85.88 and a higher standard deviation of 0.694. To objectively assess whether the observed difference in average F 1 scores is statistically meaningful rather than resulting from random fluctuations, we conducted an independent samples t-test. The test yielded a t-value of 3.50 and a corresponding p-value of 0.00282. According to conventional statistical standards (typically p < 0.05), this p-value confirms that the difference is statistically significant, i.e., there is strong evidence that the CASF model consistently outperforms MGCMT rather than this being due to chance.
This statistical analysis not only underscores the reliability of the observed performance improvement, but also reflects the model’s stability across repeated trials. Collectively, these findings further corroborate the practical effectiveness and robustness of the CASF-MNER architecture for the multimodal named entity recognition task.
Figure 10 shows the box plots of the F 1 scores for the two models, allowing for a visual comparison of their distribution and variability. Overall, both the experiments and statistical analysis confirm that the CASF model exhibits higher and more stable performance in terms of F 1 score.
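The reported significance test can be reproduced along the following lines with SciPy; the per-seed F1 lists below are placeholders chosen only to match the reported means, not the actual experimental results.

```python
# Reproducing the reported independent-samples t-test with SciPy. The per-seed
# F1 lists are hypothetical placeholders (chosen only to match the reported
# means of 86.80 and 85.88), not the actual experimental results.
from scipy import stats

casf_f1  = [86.1, 86.5, 86.8, 87.0, 86.4, 87.3, 86.9, 87.5, 86.2, 87.3]
mgcmt_f1 = [85.2, 85.9, 86.3, 85.1, 86.7, 85.6, 86.5, 84.9, 86.2, 86.4]

t_stat, p_value = stats.ttest_ind(casf_f1, mgcmt_f1)
print(f"t = {t_stat:.2f}, p = {p_value:.5f}")   # the paper reports t = 3.50, p = 0.00282
```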

5. Conclusions and Future Work

This paper proposes a dual-stream Transformer model, CASF-MNER, which extracts features from visual and textual modalities separately. By integrating correlation alignment, saliency features, and deep semantic fusion mechanisms, the model enhances the collaborative and integrative effects of multimodal features, thereby highlighting the innovation and effectiveness of this work. Specifically, this paper constructs a correlation alignment module and a saliency feature module, combined with cross-modal contrastive learning, to effectively improve the fine-grained alignment and discriminative power between visual and textual representations. Additionally, the deep semantic fusion module designed in this paper enables more profound and high-dimensional information interaction and fusion capabilities across different modalities, further alleviating cross-modal noise interference and semantic divergence issues. The novel architecture of this model provides a new research perspective for multimodal named entity recognition and opens up broader possibilities for deep feature interaction and information sharing. However, due to the limitations of the scale and timeliness of existing datasets, this paper still has some shortcomings. The experimental data used did not fully cover the latest social media trends and diverse multimodal expressions, leading to room for improvement in the model’s generalization ability and robustness in complex scenarios. Looking ahead, we will further explore the scalability of CASF-MNER within mainstream Transformer frameworks, delve deeper into the fusion mechanisms between visual and linguistic expressions at multiple granularity levels, and plan to incorporate the latest social media datasets that include emerging online slang, colloquialisms, and richer multimodal expressions. This will systematically enhance the model’s performance in real-world complex contexts and continue to drive advancements in research related to multimodal understanding and entity recognition.

Author Contributions

Conceptualization, H.L. and Y.T.; methodology, H.L.; validation, Y.T. and H.W.; writing—original draft preparation, H.L.; writing—review and editing, Y.T.; visualization, H.W. and Z.W.; project administration, Q.L.; funding acquisition, H.W. and Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Guangxi Key Research and Development Program (grant number GuiKe AB24010309); the Natural Science Foundation of Guangxi Zhuang Autonomous Region (grant number 2024GXNSFAA010242); and the Guangxi Education Department Program (grant number 2025KY0343).

Data Availability Statement

Data were obtained from third party and are available at https://github.com/jefferyYu/UMT/ (accessed on 20 February 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BERT: Bidirectional Encoder Representations from Transformers
CLIP: Contrastive Language–Image Pretraining
CAM: Correlation-Alignment Mechanisms
SFM: Significance Feature Module
DSF: Deep Semantic Fusion
VRE: Visual Representation Enhancement
NLP: Natural Language Processing
NER: Named Entity Recognition
CRF: Conditional Random Field
CL: Contrastive Learning
ResNet: Residual Neural Network
MNER: Multimodal Named Entity Recognition
Faster-RCNN: Faster Region-based Convolutional Neural Network
ViLT: Vision-and-Language Transformer
MHSA: Multi-Headed Self-Attention
FFN: Feed-forward Network
CNN: Convolutional Neural Network
LSTM: Long Short-Term Memory
UMGF: Unified Multimodal Graph Fusion
GDN-CMCF: Gated Disentangled Network with Cross-Modality Consensus Fusion
MGCMT: Multi-Granularity Cross-Modal Transformer
CASF: Correlation-Alignment and Significance-aware Fusion

References

1. Joseph, G.; Bhatti, N.; Mittal, R.; Bhatti, A. Current application and future prospects of artificial intelligence in healthcare and medical education: A review of literature. Cureus 2025, 17, e77313.
2. Souza, A.S.d.; Amorim, V.M.d.F.; Soares, E.P.; de Souza, R.F.; Guzzo, C.R. Antagonistic Trends Between Binding Affinity and Drug-Likeness in SARS-CoV-2 Mpro Inhibitors Revealed by Machine Learning. Viruses 2025, 17, 935.
3. Liu, W.; Ren, A.; Wang, C.; Peng, Y.; Xie, S.; Li, W. MVPN: Multi-granularity visual prompt-guided fusion network for multimodal named entity recognition. Multimed. Tools Appl. 2024, 83, 71639–71663.
4. Alfaqeeh, M. TriMod Fusion for Multimodal Named Entity Recognition in Social Media. In Proceedings of the 2024 34th International Conference on Collaborative Advances in Software and COmputiNg (CASCON), Toronto, ON, Canada, 11–13 November 2024; IEEE: New York, NY, USA, 2024; pp. 1–9.
5. Chen, F.; Feng, Y. Chain-of-thought prompt distillation for multimodal named entity recognition and multimodal relation extraction. arXiv 2023, arXiv:2306.14122.
6. He, L.; Wang, Q.; Liu, J.; Duan, J.; Wang, H. Visual Clue Guidance and Consistency Matching Framework for Multimodal Named Entity Recognition. Appl. Sci. 2024, 14, 2333.
7. Liu, H.; Wang, Y.; Liu, D. A Multimodal Named Entity Recognition Approach Based on Multi-Perspective Contrastive Learning. In Proceedings of the 2024 7th International Conference on Machine Learning and Natural Language Processing (MLNLP), Chengdu, China, 18–20 October 2024; IEEE: New York, NY, USA, 2024; pp. 1–8.
8. Feng, J.; Wang, G.; Zheng, C.; Cai, Y.; Fu, Z.; Wang, Y.; Wei, X.Y.; Li, Q. Towards bridged vision and language: Learning cross-modal knowledge representation for relation extraction. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 561–575.
9. Li, E.; Li, T.; Luo, H.; Chu, J.; Duan, L.; Lv, F. Adaptive Multi-Scale Language Reinforcement for Multimodal Named Entity Recognition. IEEE Trans. Multimed. 2025.
10. Li, J.; Li, H.; Pan, Z.; Sun, D.; Wang, J.; Zhang, W.; Pan, G. Prompting ChatGPT in MNER: Enhanced multimodal named entity recognition with auxiliary refined knowledge. arXiv 2023, arXiv:2305.12212.
11. Wang, X.; Ye, J.; Li, Z.; Tian, J.; Jiang, Y.; Yan, M.; Zhang, J.; Xiao, Y. CAT-MNER: Multimodal named entity recognition with knowledge-refined cross-modal attention. In Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; IEEE: New York, NY, USA, 2022; pp. 1–6.
12. Cai, C.; Wang, Q.; Qin, B.; Xu, R. A multi-task framework based on decomposition for multimodal named entity recognition. Neurocomputing 2024, 604, 128388.
13. Zeng, Q.; Yuan, M.; Wan, J.; Wang, K.; Shi, N.; Che, Q.; Liu, B. ICKA: An instruction construction and knowledge alignment framework for multimodal named entity recognition. Expert Syst. Appl. 2024, 255, 124867.
14. Jiang, C.; Wang, Y.; Xiong, B. Dual similarity enhanced hybrid orthogonal fusion for multimodal named entity recognition. Pattern Recognit. 2025, 169, 111940.
15. Zeng, Q.; Yuan, M.; Su, Y.; Mi, J.; Che, Q.; Wan, J. Improving Multimodal Named Entity Recognition via Text-image Relevance Prediction with Large Language Models. Neurocomputing 2025, 651, 130982.
16. Xu, M.; Peng, K.; Liu, J.; Zhang, Q.; Song, L.; Li, Y. Multimodal Named Entity Recognition based on topic prompt and multi-curriculum denoising. Inf. Fusion 2025, 124, 103405.
17. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
18. Huang, S.; Xu, B.; Li, C.; Ye, J.; Lin, X. MNER-MI: A multi-image dataset for multimodal named entity recognition in social media. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy, 20–25 May 2024; pp. 11452–11462.
19. Bao, X.; Tian, M.; Wang, L.; Zha, Z.; Qin, B. Contrastive pre-training with multi-level alignment for grounded multimodal named entity recognition. In Proceedings of the 2024 International Conference on Multimedia Retrieval, Phuket, Thailand, 10–14 June 2024; pp. 795–803.
20. Cui, S.; Cao, J.; Cong, X.; Sheng, J.; Li, Q.; Liu, T.; Shi, J. Enhancing multimodal entity and relation extraction with variational information bottleneck. IEEE/ACM Trans. Audio Speech Lang. Process. 2024, 32, 1274–1285.
21. Yang, L. SAMNER: Image Screening and Cross-Modal Alignment Networks for Multimodal Named Entity Recognition. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; IEEE: New York, NY, USA, 2024; pp. 1–8.
22. Yu, J.; Jiang, J.; Yang, L.; Xia, R. Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020.
23. Guo, A.; Zhao, X.; Tan, Z.; Xiao, W. MGICL: Multi-grained interaction contrastive learning for multimodal named entity recognition. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023; pp. 639–648.
24. Xiao, L.; Mao, R.; Zhang, X.; He, L.; Cambria, E. Vanessa: Visual connotation and aesthetic attributes understanding network for multimodal aspect-based sentiment analysis. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 12–16 November 2024; pp. 11486–11500.
25. Zhang, Q.; Li, Z.; Kong, J.; Zuo, M. Multimodal Entity Recognition and Relation Extraction via Dynamic Visual-Textual Enhanced Fusion for Social Media Applications in Consumer Electronics. IEEE Trans. Consum. Electron. 2024.
26. Bao, X.; Tian, M.; Zha, Z.; Qin, B. MPMRC-MNER: A unified MRC framework for multimodal named entity recognition based multimodal prompt. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023; pp. 47–56.
27. Wei, P.; Ouyang, H.; Hu, Q.; Zeng, B.; Feng, G.; Wen, Q. VEC-MNER: Hybrid Transformer with Visual-Enhanced Cross-Modal Multi-level Interaction for Multimodal NER. In Proceedings of the 2024 International Conference on Multimedia Retrieval, Phuket, Thailand, 10–14 June 2024; pp. 469–477.
28. Yan, T.; Zhao, S.; Ma, W.; Song, S.; Wang, C.; Rao, Z.; Chen, S.; Luo, Z.; Liu, X. FRCL-MNER: A Finer Grained Rank-Based Contrastive Learning Framework for Multimodal NER. IEEE Trans. Neural Netw. Learn. Syst. 2025.
29. Zhang, Q.; Fu, J.; Liu, X.; Huang, X. Adaptive co-attention network for named entity recognition in tweets. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
30. Lu, D.; Neves, L.; Carvalho, V.; Zhang, N.; Ji, H. Visual attention model for name tagging in multimodal social media. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 1990–1999.
31. Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv 2015, arXiv:1508.01991.
32. Ma, X.; Hovy, E. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. arXiv 2016, arXiv:1603.01354.
33. Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural architectures for named entity recognition. arXiv 2016, arXiv:1603.01360.
34. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
35. Zhang, D.; Wei, S.; Li, S.; Wu, H.; Zhu, Q.; Zhou, G. Multi-modal graph fusion for named entity recognition with targeted visual guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 14347–14355.
36. Huang, G.; He, Q.; Dai, Z.; Zhong, G.; Yuan, X.; Pun, C.M. GDN-CMCF: A gated disentangled network with cross-modality consensus fusion for multimodal named entity recognition. IEEE Trans. Comput. Soc. Syst. 2023, 11, 3944–3954.
37. Liu, P.; Wang, G.; Li, H.; Liu, J.; Ren, Y.; Zhu, H.; Sun, L. Multi-granularity cross-modal representation learning for named entity recognition on social media. Inf. Process. Manag. 2024, 61, 103546.
Figure 1. Sports events and other entity recognition.
Figure 2. Overall architecture of the CASF-MNER model.
Figure 3. Overall architecture of the coding layer.
Figure 4. Overall architecture of the SFM module.
Figure 5. Overall architecture of the DSF method.
Figure 6. Relationship between evaluation metrics and epoch.
Figure 7. Performance evaluation case. Note: marks in the figure distinguish correct from incorrect predictions.
Figure 8. Comparison of F1 scores for various categories.
Figure 9. Module contribution and characterization.
Figure 10. F1 score distribution: CASF vs. MGCMT.
Table 1. Major Notations Used in This Paper.
X: Input feature tensor
I, T: Visual and textual features
H: Joint multimodal feature representation (H = f(I, T))
DSF(·): Deep Semantic Fusion module function
l: Layer index in the fusion network
Q_{l-1}, Q_{l-1}^T: Query matrices from the previous layer (visual/textual)
Φ(·): Feature interaction function
sim(I, T): Similarity between image and text embeddings
α(·), β(·): Projection functions for visual and textual features
τ: Temperature parameter in the contrastive loss
L_{I2T}, L_{T2I}: Image-to-text and text-to-image contrastive losses
y = (y_1, …, y_M): Predicted label sequence of length M
Y: Set of all possible label sequences
φ_i(y_i, y_{i+1}; H): Transition/potential function in the CRF
θ_{CRF}: Parameter set of the CRF layer
p(y | H; θ_{CRF}): Conditional probability of label sequence y given feature representation H and CRF parameters θ_{CRF}
TP, FP, FN: True positives, false positives, false negatives (evaluation)
Precision, Recall, F1: Evaluation metrics
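For readers who prefer explicit formulas, the contrastive losses and the CRF likelihood listed above admit the following standard forms; these are given only as a plausible reading of the notation (a symmetric InfoNCE-style objective and a linear-chain CRF over a batch of N image-text pairs), not necessarily the paper's exact equations.

\mathcal{L}_{I2T} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\big(\mathrm{sim}(\alpha(I_i), \beta(T_i))/\tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(\alpha(I_i), \beta(T_j))/\tau\big)}

\mathcal{L}_{T2I} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\big(\mathrm{sim}(\beta(T_i), \alpha(I_i))/\tau\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{sim}(\beta(T_j), \alpha(I_i))/\tau\big)}

p(y \mid H; \theta_{CRF}) = \frac{\exp\big(\sum_{i=1}^{M-1} \phi_i(y_i, y_{i+1}; H)\big)}{\sum_{y' \in \mathcal{Y}} \exp\big(\sum_{i=1}^{M-1} \phi_i(y'_i, y'_{i+1}; H)\big)}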
Table 2. Statistical data for the two datasets.
Entity      Twitter-2015 (Train / Dev / Test)    Twitter-2017 (Train / Dev / Test)
PER         2217 / 552 / 1816                    2943 / 626 / 621
LOC         2091 / 522 / 1697                    731 / 173 / 178
ORG         928 / 247 / 839                      1674 / 375 / 395
MISC        940 / 225 / 726                      701 / 150 / 157
Total       6176 / 1546 / 5078                   6049 / 1324 / 1351
#Tweets     4000 / 1000 / 3257                   3373 / 723 / 723
Table 3. Experimental parameters for both datasets.
Parameter               Twitter-2015    Twitter-2017
Mini-batch size         32              32
Epochs                  40              40
Learning rate           3 × 10^-5       3 × 10^-5
CRF learning rate       4 × 10^-2       3 × 10^-2
Weight decay            2 × 10^-3       1 × 10^-2
Max sequence length     40              40
Gradient clipping       1.0             2.0
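For convenience, the settings in Table 3 can be collected into a configuration object. The sketch below assumes the exponents in the table are negative powers of ten (the usual convention for these hyperparameters, since the sign was lost in extraction) and uses key names of our own choosing rather than the authors' actual configuration schema.

# Illustrative training configuration mirroring Table 3; key names are assumptions.
CONFIG = {
    "twitter2015": {
        "batch_size": 32,
        "epochs": 40,
        "learning_rate": 3e-5,
        "crf_learning_rate": 4e-2,
        "weight_decay": 2e-3,
        "max_seq_length": 40,
        "gradient_clip_norm": 1.0,
    },
    "twitter2017": {
        "batch_size": 32,
        "epochs": 40,
        "learning_rate": 3e-5,
        "crf_learning_rate": 3e-2,
        "weight_decay": 1e-2,
        "max_seq_length": 40,
        "gradient_clip_norm": 2.0,
    },
}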
Table 4. Performance comparison on the Twitter-2015 dataset.
Modality     Models                        Single-Type F1 (PER / LOC / ORG / MISC)    Overall (Pre. / Rec. / F1)
Text-Only    BiLSTM-CRF [31]               76.77 / 72.56 / 41.33 / 26.80              68.14 / 61.09 / 64.42
             CNN-BiLSTM-CRF [32]           80.86 / 75.39 / 47.77 / 32.61              66.24 / 68.09 / 67.15
             HBiLSTM-CRF [33]              82.34 / 76.83 / 51.59 / 32.52              70.32 / 68.05 / 69.17
             BERT [34]                     84.72 / 79.91 / 58.26 / 38.81              68.30 / 74.61 / 71.32
             BERT-CRF                      84.74 / 80.51 / 60.27 / 37.29              69.22 / 74.59 / 71.81
Text-Image   GVATT-HBiLSTM-CRF [30]        82.66 / 77.21 / 55.06 / 35.25              73.96 / 67.90 / 70.80
             AdaCAN-CNN-BiLSTM-CRF [29]    81.98 / 78.95 / 53.07 / 34.02              72.75 / 68.74 / 70.69
             GVATT-BERT-CRF [22]           84.43 / 80.87 / 59.02 / 38.14              69.15 / 74.46 / 71.70
             AdaCAN-BERT-CRF [22]          85.28 / 80.64 / 59.39 / 38.88              69.87 / 74.59 / 72.15
             UMGF [35]                     84.26 / 83.17 / 62.45 / 42.42              74.49 / 75.21 / 74.85
             GDN-CMCF [36]                 85.59 / 81.25 / 60.05 / 40.00              71.71 / 74.44 / 73.05
             MGCMT [37]                    84.93 / 83.76 / 61.59 / 37.60              73.81 / 74.59 / 74.18
             CASF (ours)                   86.24 / 81.00 / 61.77 / 41.15              73.12 / 75.23 / 74.16
Note: Bold numbers indicate best performance.
Table 5. Performance comparison on the Twitter-2017 dataset.
Modality     Models                        Single-Type F1 (PER / LOC / ORG / MISC)    Overall (Pre. / Rec. / F1)
Text-Only    BiLSTM-CRF [31]               85.12 / 72.68 / 72.50 / 52.56              79.52 / 73.43 / 76.31
             CNN-BiLSTM-CRF [32]           87.99 / 77.44 / 74.02 / 60.82              80.00 / 78.77 / 79.37
             HBiLSTM-CRF [33]              87.91 / 78.57 / 76.67 / 59.32              82.69 / 78.16 / 80.37
             BERT [34]                     90.88 / 84.00 / 79.25 / 61.63              82.19 / 83.72 / 82.95
             BERT-CRF                      90.25 / 83.05 / 81.13 / 62.21              83.32 / 83.57 / 83.44
Text-Image   GVATT-HBiLSTM-CRF [30]        89.34 / 78.53 / 79.12 / 62.21              83.41 / 80.38 / 81.63
             AdaCAN-CNN-BiLSTM-CRF [29]    89.34 / 78.53 / 79.12 / 62.21              83.41 / 80.38 / 81.63
             GVATT-BERT-CRF [22]           90.94 / 83.52 / 81.91 / 62.75              83.64 / 84.38 / 84.01
             AdaCAN-BERT-CRF [22]          90.20 / 82.97 / 82.67 / 64.83              85.13 / 83.20 / 84.10
             UMGF [35]                     91.92 / 85.22 / 83.13 / 69.83              86.54 / 84.50 / 85.51
             GDN-CMCF [36]                 91.38 / 84.92 / 84.13 / 67.57              85.49 / 85.94 / 85.71
             MGCMT [37]                    91.74 / 86.32 / 84.07 / 65.75              85.87 / 85.92 / 85.89
             CASF (ours)                   92.63 / 83.14 / 85.22 / 70.38              87.16 / 86.45 / 86.81
Note: Bold numbers indicate best performance.
Table 6. Statistics of ablation study results.
Models       Twitter-2015 (P / R / F1)    Twitter-2017 (P / R / F1)
CASF-MNER    73.12 / 75.23 / 74.16        87.16 / 86.45 / 86.81
w/o VRE      72.16 / 74.52 / 73.32        85.82 / 85.79 / 85.80
w/o CAM      72.03 / 75.52 / 73.73        87.79 / 85.52 / 86.64
w/o SFM      73.32 / 73.43 / 73.37        85.84 / 86.63 / 86.23
w/o DSF      70.09 / 74.12 / 72.05        85.03 / 86.17 / 85.60
w/o CL       72.33 / 74.85 / 73.57        87.24 / 85.73 / 86.48
Table 7. Statistical Significance Test Results for CASF and MGCMT.
Group     Avg. F1    SD       t-Value    p-Value
CASF      86.80      0.518
MGCMT     85.88      0.694
t-test                        3.50       0.00282
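The comparison in Table 7 corresponds to a two-sample t-test over repeated runs. The snippet below shows how such a test can be reproduced from summary statistics with SciPy; the number of runs per group is not stated in this excerpt, so n_runs is an assumption and the exact t and p values it prints depend on that choice.

# Two-sample t-test from the published means and standard deviations in Table 7.
from scipy.stats import ttest_ind_from_stats

n_runs = 10  # assumed number of repeated runs per model (not reported here)
t_stat, p_value = ttest_ind_from_stats(
    mean1=86.80, std1=0.518, nobs1=n_runs,  # CASF: average overall F1 and SD
    mean2=85.88, std2=0.694, nobs2=n_runs,  # MGCMT: average overall F1 and SD
    equal_var=True,
)
print(f"t = {t_stat:.2f}, p = {p_value:.5f}")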
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
