1. Introduction
Named-entity recognition (NER) is a fundamental task in the field of Natural Language Processing (NLP), which focuses on identifying specific entities in sentences and classifying them into pre-defined categories such as person names, place names, and organization names [1]. In recent years, with the rapid development of multimodal learning, multimodal named-entity recognition (MNER) has emerged. MNER introduces images as additional context for the text to improve the performance of text-only NER [2]. Early studies directly explored the entire image using global visual cues. These methods either encode the original image into a feature vector to enhance semantics [3] or roughly segment the image into multiple grids for cross-modal interaction [4]. For more fine-grained visual cues, subsequent studies began to extract salient objects from images and encode them as local visual prompts to guide task prediction [5]. Although these methods have achieved remarkable success, in real social media scenarios user-generated tweets often contain text and multiple images; considering only text and a single image cannot meet the requirements of real MNER scenarios.
The core of the multi-image MNER task lies in integrating multiple sources of visual information to enhance text semantics. Two key issues need to be fully addressed. The first is modality noise. Each modality contains information that is irrelevant to the task, which not only fails to contribute to the final prediction but may even mislead it. For example, in Figure 1, regarding the text modality, the model tends to predict “Joey” and “Niko” as the person type (PER) because some words (such as “rest” and “wrestling”) endow them with human characteristics. However, considering the text and images together, the “Cat” object (the area in the box) in the image can clearly guide the correct prediction. Nevertheless, the image modality introduces noise at two levels: first, at the global level, not all image regions are informative for the recognition of target entities; second, at the local level, the salient regions in the image express more complex visual semantics (“Two cats lie on the bed”) rather than the simple term “Cat”. This information may interfere with the model’s assignment of attention weights to image regions, thereby hindering the final task prediction. Therefore, it is crucial to extract key information and reduce redundant noise.
Even if redundant noise is eliminated, a second issue arises: the mismatched feature representations of input text and images, as well as the intrinsic discrepancies across these two modalities. Specifically, since text and visual representations are derived from different encoders, they occupy different feature spaces and distributions. These modality differences make it difficult for text and images to “understand” each other and capture cross-modal semantic correlations. In Figure 1, ideally, the text entities “Joey” and “Niko” should have stronger semantic correlations with the “Cat” in the box than with other regions. However, the differences between text and visual representations prevent the establishment of such an alignment and hinder the exploration of predictive visual cues. Although previous studies converted visual objects into text descriptions to achieve consistent semantic expression, this method is highly dependent on text descriptions generated by external tools. Additionally, the inherent defects of those tools readily induce deviations in semantic transmission, which impairs the performance of cross-modal alignment.
Existing methods have made certain efforts to solve these problems, but such attempts remain fragmented and inadequate for the core challenges of multi-image MNER. For instance, some methods try to extend single-image fusion frameworks to multi-image scenarios by simply stitching multiple images into a single input; this approach fails to account for the distinct contribution of each image and ignores the internal noise within the images. Potential redundancy or conflict between the visual information of different images is likewise overlooked.
This paper proposes a new named-entity recognition model that integrates symmetric multimodal fusion with contrastive learning (SMCL). Specifically, the model adopts an architecture of symmetric-encoder collaborative cross-modal fusion. To address modality noise, a modality refinement encoder (RE) is designed, which maps text and images to their respective exclusive semantic spaces and learns modality-specific features. Additionally, orthogonal constraints are introduced to optimize the feature space, systematically filter redundant information, and extract high-purity features. To tackle modality differences, a modality alignment encoder (AE) is constructed, which maps the features of each modality to a unified semantic space. Based on a multimodal contrastive learning mechanism, it achieves the aggregation of similar categories and the separation of different categories, accurately captures the latent semantic associations between modalities, and effectively bridges the semantic gap. On this basis, the refined and aligned representations are input into the symmetric multimodal fusion module (SFM). This module generates a hyper-feature representation containing rich cross-modal semantics through dual enhancement and deep cross-fusion, providing more predictive feature support for the named-entity recognition task.
It is worth noting that the symmetric encoder proposed in this study specifically refers to the equivalent treatment of the two modalities (text and image) in the feature-processing pipeline. Both text and image modalities undergo symmetric processing encompassing modality-specific space feature refinement and shared-space feature alignment. A unified framework enables bidirectional collaboration between intra-modal optimization and cross-modal fusion, distinguishing it from traditional unidirectional dependency or asymmetric processing paradigms. In summary, the contributions of this paper are as follows:
For complex multi-image scenarios, a new multimodal named-entity recognition model, SMCL, is designed. This model combines symmetric multimodal fusion and contrastive learning, enabling full exploration of the inherent characteristics of each modality and the cross-modal semantic associations.
A new multimodal fusion module, SFM, is designed for symmetric multimodal fusion. This module conducts an independent enhancement and optimization of each modality, effectively strengthening the capacity of feature representation. On this basis, the module achieves deep multimodal fusion and yields a hyper-feature representation that integrates full-modal information.
Extensive experiments on the MNER-MI and MNER-MI-Plus datasets demonstrate that the proposed model, SMCL, achieves state-of-the-art performance on the multi-image MNER task. Furthermore, ablation experiments and case studies fully validate the effectiveness of SMCL in addressing the issues of modality noise and modality differences, thereby providing a robust solution for MNER in multi-image scenarios.
2. Related Works
2.1. Multimodal Named-Entity Recognition
Multimodal named-entity recognition (MNER) extends traditional text-centric named-entity recognition (NER) by incorporating additional modality information, primarily visual content [3] and auditory content [6], and significantly improves the performance of entity recognition. In the early stages, Zhang et al. [7], Lu et al. [8], and Arshad et al. [9] made pioneering attempts. Specifically, they employed recurrent neural networks (RNNs) and convolutional neural networks (CNNs) as modality-specific encoders, where RNNs were dedicated to processing textual data and CNNs to modeling visual information. An implicit cross-modal interaction mechanism was further devised to capture inter-modal semantic correlations, thus pioneering the first round of exploratory research on the MNER task. Later, Yu et al. [10] and Wang et al. [11] used more advanced text encoders to obtain better text representations, while Wu et al. [12] proposed using objects in images as image representations.
Regarding the interaction between text and images, existing approaches primarily depend on the attention mechanism. For instance, Zhang et al. [13] adopted a graph-based approach to facilitate text–image interaction. Chen et al. [14] and Xu et al. [15] leveraged image representations as prompts to enable interactions between such representations and each layer of the text encoder. It is worth noting that with the advancement of large language models (LLMs), Li et al. [16] proposed a two-stage framework named PGIM, which integrates ChatGPT as an implicit knowledge base into MNER. Specifically, this framework generates refined auxiliary knowledge through multimodal example awareness and prompt guidance, circumvents cross-modal alignment challenges by adopting a text–text paradigm, and consequently enhances performance.
This paper proposes a novel MNER model involving multiple images. Specifically, the core information of each modality is extracted first, followed by a systematic exploration of inter-modal commonalities and intra-modal unique characteristics. On this basis, a representation covering the full multimodal view is constructed, which effectively filters out modality noise and bridges the modality gap. This model not only provides a novel idea for multimodal fusion but also ultimately enhances the overall performance of NER.
2.2. Multimodal Contrastive Learning
In the field of natural language processing (NLP), multimodal contrastive learning focuses on sample construction and semantic representation enhancement, whereas the cross-modal fusion field takes the establishment of a universal semantic space as its core objective. Bao et al. [17] proposed a novel contrastive pre-training method with tailored multi-level alignment, which incorporates a text–image alignment module based on contrastive learning to optimize the consistency of cross-modal representations.
Against the backdrop of the MNER task's requirements, existing research still faces critical limitations. Most contrastive learning-based MNER methods are tailored to single-image scenarios, and even the handful of multi-image studies fail to integrate contrastive learning deeply with the sequence-labeling nature of NER. As a result, they cannot meet entity-level fine-grained semantic alignment requirements or effectively capture the correlations between textual entities and visual cues from multiple images. Furthermore, cross-modal contrastive learning methods predominantly adopt simplistic bimodal interaction architectures and lack systematic filtering mechanisms for modality noise in multi-image scenarios. They therefore cannot handle redundant image backgrounds and cross-image information conflicts, which induces pronounced modality interference in MNER tasks and compromises model robustness.
To address the above limitations, this study proposes a Symmetric Fusion and Contrastive Learning Integration Model (SMCL) and constructs a cross-domain learning framework suitable for MNER tasks in multi-image scenarios. With a symmetric encoder architecture as the core, the model deeply integrates multimodal contrastive learning with the sequence labeling characteristics of NER. Specifically, the Refinement Encoder (RE) introduces an orthogonal constraint mechanism to effectively filter modal noise from multiple images, thus resolving the interference caused by redundant information. By leveraging the Alignment Encoder (AE) and coordinating the Central Moment Discrepancy (CMD) similarity loss and discrepancy loss functions, the model achieves the precise separation and alignment of modal-shared features and modal-specific features, breaking through the limitations of traditional one-way guided alignment methods.
2.3. Multimodal Named-Entity Recognition in Multi-Image Scenarios
With the explosive growth of user-generated content on social media platforms, posts that integrate textual content and multiple images have become increasingly prevalent. According to research by Zhang et al. [7], over 42% of tweets include more than one image. In this case, relying only on text or a single image for named-entity recognition can no longer meet practical needs. To address this limitation, Huang et al. [18] proposed a temporal prompt model, which treats multiple images as frames in a video, utilizes temporal information to establish relationships between different images, and achieves text–image interaction by taking multiple images as prompts.
However, this method suffers from two critical limitations regarding symmetry and robustness: first, it neglects the redundant noise inherent in individual modalities, imposing an extra burden on the subsequent fusion process; second, it relies on unidirectional guidance from high-quality image prompts to facilitate text generation or understanding, which leads to significant performance degradation when text–image modal mismatch occurs.
In contrast, this paper emphasizes the symmetry and robustness of multimodal processing: the proposed model symmetrically extracts and refines the distinctive features of both text and image modalities, thereby achieving noise suppression to eliminate modal interference. Furthermore, leveraging multimodal contrastive learning, the model symmetrically maps text and image representations into a shared semantic space. By exploring the intrinsic commonalities between the two symmetrically represented modalities, it ultimately achieves accurate and robust cross-modal alignment.
3. Methods
This section begins with the task formulation of multimodal named-entity recognition (MNER) in multi-image scenarios, clearly defining the core problem to be addressed. Subsequently, a novel MNER model is proposed, which integrates Symmetric Multimodal fusion with Contrastive Learning (SMCL). With symmetry as its foundational design principle, this model adopts an architecture of symmetric-encoder collaborative cross-modal fusion, consisting of a multi-image feature extraction module, a text feature extraction module, and a symmetric multimodal fusion module. Built on this symmetric architecture, the model alleviates the additional fusion burden caused by modality noise and cross-modal differences. Together, these efforts contribute to achieving superior MNER performance.
3.1. Problem Formulation
The input consists of a text sequence $T$ and its corresponding set of images $\{I_1, I_2, \dots, I_m\}$, where $m$ is the upper limit of image quantity used in this paper. The goal of the task is to extract named entities from the text $T$ and assign each of them to a pre-defined type. Consistent with previous MNER studies, this task is regarded as a sequence labeling problem. Let $X = \{x_1, x_2, \dots, x_n\}$ be the input word sequence, and $y = \{y_1, y_2, \dots, y_n\}$ be the corresponding label sequence, where $y_i \in Y$ and $Y$ is the pre-defined label set, using the BIO2 labeling scheme [19].
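As a concrete illustration of the BIO2 scheme, the sketch below converts token-level entity spans into label sequences. The tokens, span indices, and the MISC type for the cat names are hypothetical examples, not annotations from the datasets used in this paper; in BIO2, every entity begins with a B- tag even when it does not follow another entity.

```python
def bio2_tags(tokens, entities):
    """Map token spans to BIO2 labels.

    `entities` is a list of (start, end, type) token spans, end exclusive.
    In BIO2 every entity starts with B-, even directly after an O tag.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"          # entity head
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # entity continuation
    return tags

tokens = ["Joey", "and", "Niko", "rest", "after", "wrestling"]
# hypothetical gold spans: both names refer to cats, hence MISC rather than PER
spans = [(0, 1, "MISC"), (2, 3, "MISC")]
print(bio2_tags(tokens, spans))
# ['B-MISC', 'O', 'B-MISC', 'O', 'O', 'O']
```

A multi-token entity such as "New York City" would receive `B-LOC I-LOC I-LOC` under the same function.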
3.2. Model Overview
The overall architecture of the model is shown in Figure 2, and its workflow is described as follows.
First, feature extraction is performed. For multi-image features, each image is sequentially input into the Vision Transformer (ViT) [20] to extract feature representations. Then, a learnable positional embedding vector is added to each image feature to mark its temporal order. Finally, the feature representations of all images are fed into the Transformer encoder [21] to generate the overall multi-image representation. For text features, the text sequence is input into the embedding layer of BERT [22] to obtain the initial representation of each token. Subsequently, the image feature is used as a guiding signal through the Image Guidance Module (IGM) to generate a text representation guided by image semantics.
Next, the modality optimization phase implements symmetric processing. The refined multi-image representation and semantic-augmented text representation are symmetrically fed into their dedicated Modality Refinement Encoders (RE) and a shared Modality Alignment Encoder (AE). The RE modules focus on modality noise filtering, while the AE module achieves precise cross-modal semantic alignment. Then the optimized features from each modality are input into the Symmetric Multimodal Fusion Module (SFM). Leveraging self-attention reinforcement and a symmetry modality interaction strategy, this module constructs a hyper-feature representation (Hyper) that encapsulates high-level cross-modal semantic information, maximizing the synergy between visual and textual semantics.
Finally, the Hyper is fed into the Conditional Random Field (CRF) layer [23], which captures the sequential dependencies between tokens to accomplish the final NER prediction with accurate entity boundary and type classification.
3.3. Multi-Image Feature Extraction
The ViT model is used as the image encoder to extract feature representations of the $m$ input images, and the Transformer encoder is used to capture the mutual relationships between them. The specific steps are as follows.
3.3.1. Multi-Image Representation
Each image is resized to 224 × 224 pixels and then input into ViT. ViT divides each image into 14 × 14 = 196 non-overlapping 16 × 16 pixel patches and generates a representation for each image patch through linear embedding, obtaining $S = \{s_1, s_2, \dots, s_{196}\}$, $s_i \in \mathbb{R}^{d}$, where $d$ denotes the output dimension of ViT-base. Subsequently, a learnable special token [CLS] with the same dimension as the image patches is added at the beginning of the sequence $S$ to form the sequence $S' = \{s_0, s_1, \dots, s_{196}\}$, where $s_1$ to $s_{196}$ are the original sequence and $s_0$ is the special token [CLS]. Next, the activation of the [CLS] token in the last layer of ViT is taken as the corresponding image representation, resulting in the representation set of the $m$ images $V = \{v_1, v_2, \dots, v_m\}$, $v_i \in \mathbb{R}^{d_v}$, where $d_v$ is defined as the dimension of the image representation. When the count of input images falls below $m$, zero vectors are used for padding to ensure the consistency of the input.
3.3.2. Temporal Positional Encoding
Since the positional information of images affects the feature extraction results (for example, the first image usually contains more detailed information, and the text may contain words indicating the position of the image), a learnable temporal positional encoding is set to indicate the position and temporal information of multiple images.
Let the positional embedding vector be $C = \{c_1, c_2, \dots, c_m\}$, where $c_1$ to $c_m$ are the positional embeddings corresponding to each image. The positional embedding vector $C$ is added to the image representation $V$: $\tilde{V} = V + C$, where $\tilde{V}$ is a representation of the multiple images containing position and temporal information.
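The zero-padding and positional-embedding steps can be sketched as follows. This is a minimal numpy illustration, assuming an image limit of $m = 5$ and the ViT-base dimension of 768; the random vectors stand in for real ViT [CLS] features and learned embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 768   # assumed image upper limit and ViT-base hidden size

def pad_and_position(image_feats, pos_embed):
    """Zero-pad the per-image [CLS] features up to m images, then add
    the learnable temporal positional embedding element-wise (V~ = V + C)."""
    n = image_feats.shape[0]
    padded = np.zeros((m, d), dtype=image_feats.dtype)
    padded[:n] = image_feats
    return padded + pos_embed

feats = rng.standard_normal((3, d))        # a tweet with only 3 images
pos = rng.standard_normal((m, d)) * 0.02   # stands in for the learned C
v_tilde = pad_and_position(feats, pos)
print(v_tilde.shape)   # (5, 768)
```

Note that the padded slots still receive their positional embeddings, so the encoder can distinguish empty slots by position.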
3.3.3. Multi-Image Relationship Modeling
The Transformer is a deep learning architecture based on the attention mechanism, whose core advantage lies in its ability to efficiently capture long-range dependencies in sequential data. After adding the positional embedding vector, the Transformer encoder is used to capture the interaction information between multiple images. $\tilde{V}$ is input into the Transformer encoder, and the self-attention mechanism of the encoder is used to realize global semantic modeling and feature association across images, obtaining the multi-image representation $G$.
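The core of this cross-image modeling is self-attention over the $m$ image vectors. The sketch below shows a single-head variant with hypothetical dimensions; a real Transformer encoder adds multi-head projections, residual connections, layer normalization, and a feed-forward sublayer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention across the m image
    vectors: every image attends to every other image."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(1)
m, d = 5, 64                                 # toy sizes for illustration
x = rng.standard_normal((m, d))              # stands in for V~
w = [rng.standard_normal((d, d)) * 0.05 for _ in range(3)]
g = self_attention(x, *w)                    # stands in for G
print(g.shape)   # (5, 64)
```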
3.4. Text Feature Extraction
BERT is used as the text encoder to extract the feature representation of the text input, and the text is guided by image information to obtain a text representation containing image–text interaction information. The full procedure is specified as follows.
3.4.1. Text Representation
Prepend the [CLS] token and append the [SEP] token to the text input $T$, obtaining $T' = \{t_0, t_1, \dots, t_n, t_{n+1}\}$, where $t_1$ to $t_n$ are the original input text, $t_0$ is the [CLS] token, and $t_{n+1}$ is the [SEP] token. Next, input it into the embedding layer of BERT to obtain the text representation of the 0-th layer Transformer $H^{0}$, where $d_t$ is defined as the dimension of the text representation.
3.4.2. Image Modality Guidance
The Image Guidance Module (IGM) uses image representations as prior knowledge to guide text representations, establishing a correlation mapping between visual and textual features. This guidance mechanism is not a simple feature concatenation but a cross-modal interaction through the attention mechanism during encoding, resulting in text representations with better visual consistency and semantic accuracy. Specific implementation details are shown in Figure 3.
This module projects the image representation $G$ into each layer of the BERT text encoder for subsequent interaction with text:

$G^{l} = G W^{l}, \quad l = 1, \dots, L,$

where $L$ is defined as the number of Transformer layers in the text encoder, $G^{l}$ is defined as the image-guided projection corresponding to the $l$-th layer Transformer, $W^{l}$ is the weight matrix corresponding to the $l$-th layer Transformer, $d_t$ is defined as the dimension of the text representation, and $d_v$ is defined as the dimension of the image representation. Subsequently, the text representation $H^{l-1}$ from the $(l-1)$-th layer and $G^{l}$ are fed into the $l$-th layer Transformer of BERT, yielding the $l$-th layer representation $H^{l}$. The specific implementation is as follows:
First, project $H^{l-1}$ into the query $Q^{l}$, key $K^{l}$, and value $V^{l}$ of the $l$-th layer:

$Q^{l} = H^{l-1} W_{Q}^{l}, \quad K^{l} = H^{l-1} W_{K}^{l}, \quad V^{l} = H^{l-1} W_{V}^{l},$

where $W_{Q}^{l}$, $W_{K}^{l}$, $W_{V}^{l}$ are weight matrices. Next, project $G^{l}$ into additional keys $K_{G}^{l}$ and values $V_{G}^{l}$ to interact with the $(l-1)$-th layer text representation:

$K_{G}^{l} = G^{l} W_{K_G}^{l}, \quad V_{G}^{l} = G^{l} W_{V_G}^{l},$

where $W_{K_G}^{l}$, $W_{V_G}^{l}$ are weight matrices; $K_{G}^{l}$ is derived from image projection features and $K^{l}$ is derived from textual features. Specifically, since background regions irrelevant to textual entities may exist in images, the corresponding $K_{G}^{l}$ will generate low-correlation attention weights, which are defined as noise-related weights. To accurately filter valid weights, a dynamic threshold $\tau^{l}$ based on the statistical features of the attention weights in the current layer is introduced:

$\tau^{l} = \sigma(\bar{\alpha}^{l}),$

where $\bar{\alpha}^{l}$ denotes the mean value of attention weights at the $l$-th layer and $\sigma$ represents the Sigmoid function.

This threshold is adaptively adjusted: a low value of $\bar{\alpha}^{l}$ in the presence of substantial image noise leads to a corresponding reduction in $\tau^{l}$, retaining more potentially valid weights, whereas minimal image noise induces an increase in $\tau^{l}$, rigorously filtering out low-correlation weights. Meanwhile, attention weights below $\tau^{l}$ are set to 0, with only those above the threshold retained for subsequent calculations. The filtered weights are then re-normalized to ensure that valid visual information can precisely guide the learning of textual representations, while irrelevant noise is suppressed. Following the processing of $L$ Transformer layers, the text representation $H^{L}$ can be obtained.
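The threshold-then-renormalize step can be sketched in isolation. This is a simplified stand-in: here the threshold is simply the mean attention weight of the toy matrix, whereas the module described above derives the threshold adaptively via the Sigmoid of the layer statistics; the guard for fully zeroed rows is also an implementation assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def filter_and_renormalize(attn, tau):
    """Zero attention weights below tau, then re-normalize each row so
    the surviving weights again sum to 1."""
    kept = np.where(attn >= tau, attn, 0.0)
    row_sum = kept.sum(axis=-1, keepdims=True)
    row_sum = np.where(row_sum == 0.0, 1.0, row_sum)  # guard: all-zero row
    return kept / row_sum

attn = softmax(np.array([[2.0, 0.1, 0.1, 1.5],
                         [0.2, 3.0, 0.2, 0.2]]))
tau = attn.mean()     # simplified stand-in for the adaptive threshold
filtered = filter_and_renormalize(attn, tau)
print(filtered.round(3))
```

Low-correlation weights (here the 0.1-scored keys) are suppressed, and the remaining mass is redistributed over the retained positions.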
3.5. Symmetric Multimodal Fusion
To address the issues associated with modality noise and modality gaps, the extracted representation of each modality is projected into a modal-shared subspace and a modal-specific subspace. Through the symmetric multimodal fusion module (SFM), a hyper-feature representation (Hyper) containing rich cross-modal semantics is constructed to perform the subsequent MNER task. The specific process is as follows.
3.5.1. Modality Refinement and Alignment
The representation vectors of the two modalities are projected into two different representation vectors. The first one is the modal-shared representation $h_o^{c}$, which is the aligned representation learned by each modality in the same semantic subspace. The second one is the modal-specific representation $h_o^{p}$, which is obtained by capturing the unique features of each modality:

$h_o^{c} = \mathrm{AE}(u_o; \theta^{c}), \quad h_o^{p} = \mathrm{RE}(u_o; \theta_o^{p}),$

where $o \in \{t, v\}$ represents the two modalities, $u_o$ represents the vector representation corresponding to each modality, and AE and RE respectively represent the Modality Alignment Encoder and the Modality Refinement Encoder. AE shares the parameter $\theta^{c}$ between the two modalities, while RE assigns separate parameters $\theta_t^{p}$ and $\theta_v^{p}$ to each modality.
Notably, noise in the textual modality mainly stems from redundant modifiers and meaningless high-frequency words, whose feature representations are concentrated in the low-dimensional embedding space; in contrast, noise in the visual modality is mostly composed of background regions and local details of non-target entities, and the features of such noise exhibit the property of local isolation. Accordingly, if the architectures of AE and RE are designed as sophisticated structures, such as Transformer layers, such designs would strengthen the invalid correlations between noise patches and target patches and excessively capture the local correlations of noise, which not only leads to the dilution of entity features but also increases the noise-filtering burden of subsequent fusion modules. In contrast, a fully connected layer can directly map low-dimensional noise features to invalid dimensions via linear projection and suppress the feature responses of isolated noise patches directly through weight learning.
Therefore, to prevent redundant information from concentrating in the representation vectors, both AE and RE consist of a fully connected layer and a ReLU activation function. Thus, Equations (5) and (6) can also be expressed as follows:

$h_o^{c} = \mathrm{ReLU}(\mathrm{FC}(u_o; \theta^{c})), \quad h_o^{p} = \mathrm{ReLU}(\mathrm{FC}(u_o; \theta_o^{p})),$

where FC represents the fully connected layer. AE and RE are designed to bridge cross-modal semantic gaps and filter modality noise, respectively. During the RE training process, an orthogonality constraint is introduced, which systematically filters out irrelevant noise by limiting the redundancy between modality-specific representations and shared representations, as well as among different modality-specific representations.
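The parameter-sharing pattern of the two encoders can be made concrete with a small sketch: one FC + ReLU parameter set ($\theta^{c}$) is reused for both modalities in AE, while RE keeps a separate set per modality. All dimensions and random inputs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_h = 768, 256   # assumed input and subspace dimensions

def encoder(theta):
    """One fully connected layer followed by ReLU: the shared shape of
    both the Alignment Encoder (AE) and the Refinement Encoder (RE)."""
    w, b = theta
    return lambda u: np.maximum(u @ w + b, 0.0)

theta_shared = (rng.standard_normal((d_in, d_h)) * 0.02, np.zeros(d_h))
theta_text   = (rng.standard_normal((d_in, d_h)) * 0.02, np.zeros(d_h))
theta_image  = (rng.standard_normal((d_in, d_h)) * 0.02, np.zeros(d_h))

ae = encoder(theta_shared)              # one parameter set, both modalities
re_t, re_v = encoder(theta_text), encoder(theta_image)  # per-modality sets

u_t = rng.standard_normal(d_in)   # stands in for the guided text feature
u_v = rng.standard_normal(d_in)   # stands in for the multi-image feature
h_t_c, h_v_c = ae(u_t), ae(u_v)        # shared (aligned) representations
h_t_p, h_v_p = re_t(u_t), re_v(u_v)    # modality-specific representations
print(h_t_c.shape, h_v_p.shape)
```

The orthogonality constraint of Section 3.7.2 is what keeps the shared and specific outputs from encoding the same information.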
3.5.2. Symmetric Multimodal Fusion Module
The symmetric multimodal fusion module (SFM) aims to independently enhance and optimize each modality using a multi-layer self-attention mechanism, and to achieve deep fusion through a symmetric cross-modal interaction mechanism. Its workflow is shown in Figure 4.
First, the shared representation of each modality is input into a Transformer layer, and the intra-modal feature representation is strengthened by means of the self-attention mechanism:

$\bar{h}_t^{c} = \mathrm{Trans}(h_t^{c}; \theta^{s}), \quad \bar{h}_v^{c} = \mathrm{Trans}(h_v^{c}; \theta^{s}),$

where $\bar{h}_t^{c}$ and $\bar{h}_v^{c}$ are the representations enhanced by the self-attention mechanism, $\mathrm{Trans}$ represents the Transformer layer, and $\theta^{s}$ is the learnable parameter. The shared representation and the specific representation of modality $o$ are then concatenated, the concatenated representations are fed into a fully connected layer, and the joint representation of modality $o$ is derived through the Sigmoid activation function:

$J_o = \sigma(\mathrm{FC}(\bar{h}_o^{c} \oplus h_o^{p}; \theta^{j})),$

where $J_o$ is the joint representation of modality $o$, $\theta^{j}$ is the learnable parameter, and ⊕ represents the concatenation operation. Then, the image representation is used as the query, and the text representation as the key and value:

$Q = J_v W_{Q}, \quad K = J_t W_{K}, \quad V = J_t W_{V},$

where $W_{Q}$, $W_{K}$, $W_{V}$ are learnable attention matrices. The similarity matrix $A$ between the text representation and the image representation is calculated as:

$A = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right).$

Finally, the output of the cross-modal attention and the feed-forward layer is used to obtain the hyper-feature representation:

$\mathrm{Hyper} = \mathrm{FFN}(A V),$

where Hyper represents the hyper-feature that aggregates the similarity representations of the two modalities, and FFN is the feed-forward layer of the Transformer.
Hyper integrates the specific features extracted by RE, the shared features learned by AE, and the correlation features generated by symmetric cross-modal interaction. It can simultaneously take into account modal specificity and cross-modal consistency, providing comprehensive and accurate feature support for the CRF layer, and thereby improving the accuracy of entity boundary localization and category classification.
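The gating-and-fusion flow above can be sketched end to end in a few lines. This is a minimal numpy illustration with assumed toy dimensions; the Transformer self-attention refinement and the feed-forward layer are omitted, and the random matrices stand in for the learned parameters $\theta^{j}$ and the attention matrices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
n_t, n_v, d = 6, 5, 64   # token count, image-slot count, hidden size (assumed)

def joint_representation(shared, specific, w):
    """Concatenate shared and specific features, then FC + Sigmoid (J_o)."""
    return sigmoid(np.concatenate([shared, specific], axis=-1) @ w)

w_j = rng.standard_normal((2 * d, d)) * 0.05
j_t = joint_representation(rng.standard_normal((n_t, d)),
                           rng.standard_normal((n_t, d)), w_j)
j_v = joint_representation(rng.standard_normal((n_v, d)),
                           rng.standard_normal((n_v, d)), w_j)

# image as query, text as key and value; the FFN step is omitted here
wq, wk, wv = (rng.standard_normal((d, d)) * 0.05 for _ in range(3))
q, k, v = j_v @ wq, j_t @ wk, j_t @ wv
hyper = softmax(q @ k.T / np.sqrt(d)) @ v
print(hyper.shape)   # (5, 64)
```

The Sigmoid keeps every joint feature in (0, 1), so it acts as a soft gate over the concatenated shared and specific channels before cross-modal attention mixes the two modalities.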
3.6. CRF Decoder
A Conditional Random Field (CRF) decoder is used to perform the NER task. Input the hyper-feature representation Hyper, which aggregates the similarity features of each modality, into the standard CRF layer.
The CRF decoder takes as input the token-level hyper-representations $\mathrm{Hyper} = \{h_0, h_1, \dots, h_{n+1}\}$, where $h_i$ denotes the representation of the $i$-th token, $h_1$ to $h_n$ correspond to the representations of the original text tokens, and $h_0$ and $h_{n+1}$ denote the representations of the special tokens. It calculates the emission probability of the corresponding label for each token via the emission score $E$ ($E \in \mathbb{R}^{n \times k}$, where $k$ is the number of labels). Combined with the transition score $T$ (which captures the dependencies between labels), the decoder performs sequence-level entity prediction using the log-likelihood functions in Equations (16) and (17), and finally outputs a label sequence $\hat{y}$ that has the same length as the text token sequence:

$s(X, y) = \sum_{i=1}^{n} E_{i, y_i} + \sum_{i=1}^{n} T_{y_{i-1}, y_i}, \qquad p(y \mid X) = \frac{\exp\big(s(X, y)\big)}{\sum_{y' \in Y^{n}} \exp\big(s(X, y')\big)},$

where $E_{i, y_i}$ is the emission score of the $i$-th token, $T_{y_{i-1}, y_i}$ is the transition score from label $y_{i-1}$ to label $y_i$, and $Y$ represents the pre-defined label set using the BIO2 labeling scheme.
3.7. Learning
The overall learning of the model is performed by minimizing the total loss:

$\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \alpha \, \mathcal{L}_{\mathrm{sim}} + \beta \, \mathcal{L}_{\mathrm{diff}},$

where $\alpha$ and $\beta$ are interaction weights, which determine the contribution of each regularization component to the total loss $\mathcal{L}$. Each of these loss components is responsible for achieving the desired subspace characteristics. The specific implementation is as follows.
3.7.1. Similarity Loss
The minimization of the similarity loss serves to narrow the gap between different modalities, thereby promoting the alignment of the shared feature representations. The Central Moment Discrepancy (CMD) [24] is selected to achieve this goal. CMD is an advanced distance metric that measures the difference between the distributions of two representations by matching their moment differences of each order. Simply put, as the two distributions grow increasingly similar, the CMD distance correspondingly decreases.

First, define CMD. Let $X$ and $Y$ be bounded random samples defined on the interval $[a, b]^{N}$ with respective probability distributions $p$ and $q$. The central moment discrepancy regularizer $\mathrm{CMD}_K$ is defined as follows:

$\mathrm{CMD}_K(X, Y) = \frac{1}{|b - a|} \big\| \mathbb{E}(X) - \mathbb{E}(Y) \big\|_2 + \sum_{k=2}^{K} \frac{1}{|b - a|^{k}} \big\| C_k(X) - C_k(Y) \big\|_2,$

where $\mathbb{E}(X)$ is the empirical expectation vector of sample $X$, and $C_k(X) = \mathbb{E}\big((X - \mathbb{E}(X))^{k}\big)$ is the vector composed of the $k$-th order sample central moments of all coordinate components of $X$. For the proposed named-entity recognition model, the CMD loss between the shared representations is calculated as follows:

$\mathcal{L}_{\mathrm{sim}} = \mathrm{CMD}_K(h_t^{c}, h_v^{c}).$
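The CMD regularizer above translates directly into code. The sketch below is a straightforward numpy implementation of the definition (moments up to order $K$, samples as matrix rows); the uniform toy samples and $K = 5$ are illustrative assumptions.

```python
import numpy as np

def cmd(x, y, k=5, a=0.0, b=1.0):
    """Central Moment Discrepancy between two sample matrices whose rows
    are samples bounded in [a, b], with central moments up to order k."""
    span = np.abs(b - a)
    mx, my = x.mean(axis=0), y.mean(axis=0)
    d = np.linalg.norm(mx - my) / span               # first-moment term
    cx, cy = x - mx, y - my
    for order in range(2, k + 1):                    # higher central moments
        d += np.linalg.norm((cx ** order).mean(axis=0)
                            - (cy ** order).mean(axis=0)) / span ** order
    return d

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=(64, 16))
same = cmd(x, rng.uniform(0, 1, size=(64, 16)))      # same distribution
shifted = cmd(x, rng.uniform(0.3, 1, size=(64, 16))) # shifted distribution
print(same < shifted)   # closer distributions yield a smaller CMD
```

In the model, minimizing this quantity between $h_t^{c}$ and $h_v^{c}$ pulls the two shared representations toward a common distribution.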
3.7.2. Difference Loss
The difference loss ensures that the modal-shared representation and the modal-specific representation capture different information. Non-redundancy is achieved by imposing a soft orthogonality constraint between them [25]. Soft orthogonality constraints ensure non-redundancy between the shared representations and the specific representations, as well as between the specific representations of different modalities, which is directly related to the noise filtering of RE.

In a training batch, two matrices $H_o^{c}$ and $H_o^{p}$ are defined, whose rows are the hidden vectors $h_o^{c}$ and $h_o^{p}$ of modality $o$ for each input. Then, the orthogonality constraint for this pair of modality vectors is calculated as follows:

$\big\| {H_o^{c}}^{\top} H_o^{p} \big\|_F^{2},$

where $\| \cdot \|_F^{2}$ is the square of the Frobenius norm. In addition to the constraint between the shared representation and the specific representation, an orthogonality constraint is also added between the specific representations. Therefore, the overall difference loss is calculated as follows:

$\mathcal{L}_{\mathrm{diff}} = \big\| {H_t^{c}}^{\top} H_t^{p} \big\|_F^{2} + \big\| {H_v^{c}}^{\top} H_v^{p} \big\|_F^{2} + \big\| {H_t^{p}}^{\top} H_v^{p} \big\|_F^{2}.$

The discrepancy loss penalizes correlations between the shared and specific representations of each modality, as well as between the specific representations of the two modalities, forcing the RE to eliminate redundant noise and preserve the specific features conducive to entity recognition. This improves the noise-filtering performance of the RE, delivering high-purity modal features to the downstream fusion module.
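One pairwise term of the difference loss can be sketched as the squared Frobenius norm of the cross-correlation matrix; the toy batch and dimensions are illustrative assumptions, and the full loss sums this term over the three pairs listed above.

```python
import numpy as np

def diff_loss(a, b):
    """Squared Frobenius norm of A^T B: zero exactly when every column
    of A is orthogonal to every column of B."""
    return np.linalg.norm(a.T @ b, ord="fro") ** 2

rng = np.random.default_rng(5)
h_shared = rng.standard_normal((32, 8))    # batch of shared vectors
h_specific = rng.standard_normal((32, 8))  # batch of specific vectors

# random representations are correlated, so one pairwise term is positive
print(diff_loss(h_shared, h_specific) > 0.0)

# column-orthogonal matrices give exactly zero loss
e1 = np.eye(4)[:, :2]
e2 = np.eye(4)[:, 2:]
print(diff_loss(e1, e2))   # 0.0
```

Driving these terms toward zero during training is what decorrelates the subspaces produced by AE and RE.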
3.7.3. Task Loss
The task loss is used to estimate the prediction quality during training. The negative log-likelihood is selected as the corresponding loss function, whose formulation is presented as follows:

$\mathcal{L}_{\mathrm{task}} = -\frac{1}{N} \sum_{i=1}^{N} \log p\big(y^{(i)} \mid X^{(i)}\big),$

where the sum runs over the training sample batch and $N$ is the batch size.
4. Experiments
This section comprehensively evaluates the proposed named-entity recognition model (SMCL) through a series of experiments. Following recent studies, Precision (P), Recall (R), and F1-score (F1) are used as evaluation metrics. The experimental results show that the proposed model outperforms various single-modal and multi-modal methods on both datasets.
4.1. Experimental Setup
4.1.1. Datasets
MNER-MI and MNER-MI-Plus are two MNER datasets containing multiple images. As shown in
Table 1, the MNER-MI dataset has a total of 8576 tweets and 11,862 named entities, divided into a training set, a validation set, and a test set, containing 6856, 860, and 860 tweets, respectively. Each tweet contains an average of about three images. The MNER-MI-Plus dataset has a total of 13,395 tweets and 20,586 named entities, also divided into a training set, a validation set, and a test set, containing 10,229, 1583, and 1583 tweets, respectively, with an average of about two images per tweet.
4.1.2. Model Setup
Except for BiLSTM-based methods, all methods use BERT-base (
https://huggingface.co/bert-base-uncased (accessed on 29 November 2025)) as the text encoder and ViT-base-patch16 (
https://huggingface.co/google/vit-base-patch16-224 (accessed on 29 November 2025)) as the image encoder. For single-image multimodal models, open-source codes from GitHub (
https://github.com) were reused after dataset-format adaptation. For the multi-image models, UMT-MI extends UMT with multi-image concatenation and the same temporal positional encoding used in SMCL, while TPM-MI adopts its original implementation without structural modifications.
4.1.3. Parameter Setup
AdamW [
26] is used as the optimizer. A grid search is performed on the validation set over the learning rate and over batch sizes in the range [8, 32]. Mini-batch backpropagation is used for training, and the model with the best performance on the validation set is selected for evaluation on the test set. All models were trained for up to 50 epochs with early stopping (patience = 5), using the validation-optimal weights for testing, and the best hyperparameter configuration was selected via a single-run search based on validation F1-score.
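The search procedure described above (a grid over learning rate and batch size, early stopping with patience 5, and selection by validation F1) can be sketched as follows; `train_one_config` is a hypothetical hook standing in for one epoch of training plus validation:

```python
import itertools

def grid_search(train_one_config, lrs, batch_sizes, patience=5, max_epochs=50):
    """Return (best validation F1, (lr, bs)) over the hyperparameter grid.

    train_one_config(lr, bs, epoch) is assumed to train one epoch and
    return the resulting validation F1-score.
    """
    best_f1, best_cfg = -1.0, None
    for lr, bs in itertools.product(lrs, batch_sizes):
        run_best, wait = -1.0, 0
        for epoch in range(max_epochs):
            f1 = train_one_config(lr, bs, epoch)
            if f1 > run_best:
                run_best, wait = f1, 0
            else:
                wait += 1
                if wait >= patience:  # early stopping on stalled validation F1
                    break
        if run_best > best_f1:
            best_f1, best_cfg = run_best, (lr, bs)
    return best_f1, best_cfg
```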
4.2. Baseline
4.2.1. Named-Entity Recognition Models
For the single-text modality, several well-established models widely applied to the named-entity recognition (NER) task are investigated. First, BiLSTM-CRF [27] combines a Bidirectional Long Short-Term Memory (BiLSTM) network with a Conditional Random Field (CRF) layer to capture bidirectional dependencies in text and perform sequence labeling. Next, HBiLSTM-CRF [28] uses a hierarchical bidirectional LSTM structure to better model the internal structure of words, enriching character-level word representations. Finally, the pre-trained model BERT is investigated: a powerful Transformer-based text encoder that captures long-range dependencies and learns rich language knowledge through pre-training. BERT-CRF combines BERT's encoding ability with a CRF decoder, further improving performance on sequence-labeling tasks.
4.2.2. Multimodal Named-Entity Recognition Models
With respect to multimodal experiments that integrate text and image modalities, representative MNER models are selected as baselines. UMT [
10] designs a multi-modal interaction module to construct bidirectional associations between text and images; OCSGA [
12] adopts an object detector to extract objects in images and uses the text labels of these objects as image representations; UMGF [
13] proposes a method relying on a graph-based model to build connections between text and images; MAF [
4] puts forward a universal matching and alignment framework, which is designed to align text and image representations while mitigating the interference caused by image noise; ITA [
29] extracts objects, captions, and text from images as image representations; both HVPNeT [
14] and VisualPT-MoE [
15] leverage image representations as prompts to interact with all layers of the text encoder; PGIM [
16] leverages ChatGPT as an implicit knowledge base, generates auxiliary refined knowledge, and fuses this with original text to enhance performance.
However, these MNER methods only use a single image, while a tweet is usually accompanied by multiple images in practical applications. Studying only single-image MNER is far from sufficient. To address this, Huang et al. [
18] proposed TPM-MI, which accepts multiple images and uses image information as prompts to interact with the text. However, this method ignores the existence of modality noise and relies on high-quality image prompts to guide text generation or understanding; when the images do not match the text, performance degrades.
Therefore, this paper proposes a new MNER model that captures the unique features of each modality and refines the modality representations. By combining symmetric multimodal fusion with contrastive learning, each modality is projected into a shared subspace, their commonalities are learned, and consistent cross-modal representations are regularized to bridge the modality gap. With the help of the symmetric multimodal fusion module and the CRF layer, the MNER task achieves better performance.
4.3. Experimental Results
The results of the experiments are illustrated in
Table 2, which can be analyzed from three dimensions. First, by comparing the performance of single-modal named-entity recognition models, it can be seen that BERT-based models perform significantly better than BiLSTM-based models on both datasets. This result intuitively confirms the advantages of pre-trained language models.
Then, a horizontal comparison between single-image multimodal named-entity recognition models and single-modal models shows that almost all multimodal models show significant performance improvements. This indicates that the visual information contained in images can provide effective assistance in text-entity recognition, and also verifies the positive effect of multimodal fusion on task optimization.
In the comparison of multi-image multimodal named-entity recognition models, methods using multi-image input consistently outperform those using a single image. Taking the UMT-MI model as an example, this model integrates multi-image information by stitching multiple images into a single image on top of the original UMT model, and it performs significantly better than the original UMT. This further indicates that introducing multi-image information provides the model with richer visual cues, mitigates the perspective limitations and information loss of a single image, and thereby improves performance on the MNER task.
It is worth noting that, compared with the current best method TPM-MI, the proposed model SMCL achieves superior results. For example, on the MNER-MI dataset, its F1-score is 2.44% higher than that of TPM-MI. This is because TPM-MI has two limitations.
First, it lacks a systematic cross-modal semantic alignment mechanism, making it difficult to achieve an in-depth association between text and image features. Second, it does not effectively filter the noise information within the modality, such as redundant modifications in text and irrelevant backgrounds in images, resulting in limited model learning efficiency.
In contrast, SMCL achieves the refinement and alignment of each modality through an architecture of symmetric-encoder collaborative cross-modal fusion. On the one hand, it filters noise information through orthogonal constraints and retains the core features of each modality; on the other hand, it deeply explores the semantic associations between modalities through multimodal contrastive learning, achieving accurate alignment and cross-modal fusion. After the modality representations are refined and aligned, the Symmetric Multimodal Fusion Module (SFM) plays a key role in cross-modal fusion: through its symmetric cross-modal interaction mechanism, the specific and shared representations are effectively integrated. The combined effect of these three mechanisms ultimately yields a significant improvement in performance.
4.4. Ablation Experiments and Analysis
To verify the effectiveness of each component of the proposed MNER model SMCL,
Table 3 presents the ablation results after removing each component on the MNER-MI and MNER-MI-Plus datasets. Among them, “w/o SFM” indicates that the SFM module is removed; “w/o AE” and “w/o RE” respectively indicate that the learning of modal-shared representation and modal-specific representation is removed; “w/o AE+RE” denotes that this symmetric encoder branch is removed; and “w/o IGM” removes the multi-image guidance on text representations.
Replacing the SFM with a straightforward concatenation mechanism paired with a Transformer layer leads to a marked decline in the overall performance of the model. Specifically, experiments conducted on the MNER-MI dataset reveal a 2.85% reduction in F1-score relative to the complete model configuration. This finding first highlights the indispensable value of cross-modal complementary information for the NER task: textual ambiguities are frequently resolved by leveraging visual clues in images, and latent entity information in visual data requires textual context to achieve accurate positioning. Beyond this, the result further provides direct empirical evidence for the SFM’s efficacy in effectively integrating information from both modalities.
To further intuitively demonstrate the semantic alignment ability of the SFM, the cross-modal attention weights are visualized in
Figure 5. The darker color blocks in the heatmap represent higher attention weights, clearly showing that the entity token has a strong correlation with the global image feature and the local feature block. This visualization indicates that the SFM can accurately capture the potential semantic association between text entities and image content, which constitutes the key to its superiority compared to simple feature concatenation.
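The weights visualized in Figure 5 are standard scaled dot-product cross-attention scores between text tokens and image features. A sketch of how such a map is produced (the projection matrices and shapes are illustrative assumptions):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_map(text_h, image_h, w_q, w_k):
    """Attention weights of shape (n_tokens, n_regions): each row is the
    distribution of one text token's attention over the image features."""
    q = text_h @ w_q                         # project text tokens to queries
    k = image_h @ w_k                        # project image features to keys
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    return softmax(scores, axis=-1)
```

Each row sums to one, so a dark cell in the heatmap marks the image region that dominates a token's attention.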
Further analysis shows that after the symmetric mechanism of modal-shared representation learning and modal-specific feature learning is removed, the model loses the ability to finely process different modalities. On the one hand, the inherent semantic differences between modalities (such as the abstract nature of text and the concrete nature of images) cannot be effectively modeled, leading to deviations in cross-modal alignment. On the other hand, redundant information within the modality (such as irrelevant modifiers in text and background interference elements in images) cannot be filtered, resulting in the model receiving a large amount of noise input. The combined effect of these two aspects results in a reduction in the accuracy of entity boundary positioning and category judgment, which is ultimately reflected in a notable decline in the F1-score.
Moreover, replacing the “AE+RE” symmetric dual-branch structure with a single projection layer gives rise to two critical limitations. Specifically, the absence of a dedicated learning pathway for modal-specific representations weakens the model’s ability to filter redundant intra-modal noise, while the lack of a dedicated modeling mechanism for shared representations hinders its ability to bridge semantic gaps between modalities. As a result, the model fails to accomplish the two core goals of modal noise filtering and cross-modal feature alignment simultaneously, with its F1-score decreasing by 5.94% on the MNER-MI dataset and 5.01% on MNER-MI-Plus relative to the full SMCL model. These results fully demonstrate the necessity and irreplaceability of the AE+RE symmetric dual-branch structure in the proposed model.
In addition, the ablation results of the IGM confirm the positive guiding role of image representation in understanding text. When the image contains visual features of entities, encoded image representation can provide concrete references for resolving the ambiguity of text entities, helping the model accurately identify named entities. This one-way guidance mechanism complements the two-way fusion of SFM, jointly improving its ability to recognize multimodal entities in complex scenarios.
4.5. Hyperparameter Sensitivity Analysis
To verify the robustness of the proposed model SMCL to key hyperparameters and to identify the optimal parameter ranges, this section conducts a sensitivity analysis on the core hyperparameters of the multi-image MNER task. The experiments adopt the single-variable method: only one hyperparameter is varied at a time, while all other parameters are kept at the baseline configuration. Experiments are independently run five times on the MNER-MI and MNER-MI-Plus datasets, and the averages of P, R, and F1-score are reported. The F1-score is used as the core evaluation index to measure sensitivity.
4.5.1. Hyperparameters
Combining the model mechanism and the conventional values in the field, three categories of key hyperparameters and their gradient ranges are determined, as detailed in
Table 4. All parameter gradients cover the “low-baseline-high” interval to ensure a comprehensive verification of the impact of parameter changes on performance.
In light of relevant studies in multimodal learning, the weights of the orthogonal-constraint loss and the contrastive-learning loss are generally configured within the range 0.1–0.5, balancing the contributions of the main task loss and the regularization losses. Considering the architecture of the proposed model, the similarity loss drives cross-modal alignment, so an excessively large weight must be avoided to prevent suppression of modal-specific feature learning; the discrepancy loss imposes the orthogonal constraints, and an overly large weight must likewise be avoided to mitigate the loss of feature information. The weights are therefore initialized within this range, and the optimal values are determined through the single-variable sensitivity analysis, which ensures that the model attains a trade-off among cross-modal alignment performance, modality-noise filtering capability, and overall task performance.
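Under this weighting scheme, the overall training objective combines the three losses. The default weights below are assumptions drawn from the optimal ranges reported in the sensitivity analysis of Section 4.5.2 (0.2 for the similarity term, 0.1 for the difference term), not values quoted verbatim from the paper:

```python
def total_loss(task_loss, sim_loss, diff_loss, lam_sim=0.2, lam_diff=0.1):
    """Overall objective: task loss plus weighted regularizers.
    Small lam_sim / lam_diff keep the regularizers from dominating the task loss."""
    return task_loss + lam_sim * sim_loss + lam_diff * diff_loss
```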
4.5.2. Results of Hyperparameter Sensitivity Experiments
The CMD distance achieves distribution alignment by matching the first K central moments of the cross-modal features, so the order K directly affects alignment accuracy. As shown in
Figure 6a, on the MNER-MI dataset the F1-score peaks at 79.76% at the optimal K; with a smaller K, F1 decreases by 1.30% (to 78.46%) due to insufficient moment matching, while larger values of K cause F1 to drop to 79.25% and 78.09%, respectively, due to over-matching of high-order noise features. A similar pattern is observed on the MNER-MI-Plus dataset, where the optimal K yields the best F1 (85.14%) and a suboptimal setting leads to a 1.68% drop (to 83.46%). This indicates that K is a moderately sensitive parameter: only a moderate order balances alignment sufficiency and noise resistance, while values that are too high or too low weaken the ability to capture cross-modal semantic correlations.
The loss function weights control the balance of multi-objective optimization, and the experimental results are shown in
Figure 6b,c.
For the similarity loss weight, a moderate value achieves the optimal balance between cross-modal alignment and the task loss, yielding the highest F1. When the weight is too large, the alignment loss dominates and suppresses the learning of modality-specific features, with F1 on the MNER-MI dataset dropping to 78.52%; when it is too small, insufficient alignment enlarges the cross-modal semantic gap, with F1 decreasing by 1.24% (MNER-MI: 78.52%). The similarity weight is therefore a moderately sensitive parameter, and to keep it effective in driving cross-modal semantic fusion, its value should be maintained between 0.2 and 0.3.
For the difference loss weight, a moderate value gives the best redundancy filtering and the highest F1. When the weight is too large, excessive feature separation causes semantic information loss, and F1 decreases by 0.89% (MNER-MI: 78.87%); when it is too small, residual redundant information interferes with entity judgment, and F1 drops to 79.03%. The difference weight is thus a low-sensitivity parameter, and to safeguard the purity of the modal features, its value should lie between 0.1 and 0.2.
Optimizer parameters, particularly the learning rate, exert a critical influence on the model's convergence behavior and fitting capacity. As illustrated in
Figure 6d, the sensitivity analysis reveals distinct performance patterns: at the selected learning rate, the model converges stably and the F1-score peaks, indicating an optimal balance between training efficiency and generalization; when the rate is increased, the optimization process enters an oscillatory state, causing the F1-score on the MNER-MI dataset to plummet to 78.32% due to unstable parameter updates; conversely, when the rate is decreased, convergence becomes excessively slow, leading to underfitting and an F1-score of merely 78.01% on MNER-MI. These observations collectively demonstrate that the learning rate is a highly sensitive hyperparameter whose value must be strictly constrained near its optimum to ensure both convergence stability and performance optimality.
4.6. Case Study
To more clearly demonstrate the efficiency of the proposed named-entity model, SMCL, in completing the MNER task in multi-image scenarios, a case study is conducted, and the recognition results are carefully compared with those of the existing method, TPM-MI. The specific cases and results are shown in
Figure 7.
In the first case, the text contains the entity “Disney Channel”. The TPM-MI model only recognizes “Disney” as the organization type (ORG) and ignores the key information carried by “Channel”. The core reason for this misjudgment is that the image corresponding to this case does not contain any explicit visual cues that can be directly associated with “Disney Channel”, making it difficult for the model to complete accurate recognition with only limited image information. In contrast, the SMCL model not only captures the semantic binding relationship between “Disney” and “Channel” in the text by deeply exploring intra-modal information and cross-modal interaction information, but also combines the potential association features between modalities, and finally successfully recognizes “Disney Channel” as miscellaneous (MISC), demonstrating its ability to parse complex entity structures.
A similar phenomenon appears in the second case. When the visual prompts provided by the images are ambiguous and image quality is low, methods that rely solely on images as auxiliary prompts make recognition errors due to insufficient information. This further indicates that mechanisms over-reliant on image prompts are significantly one-sided, and their performance stability degrades with fluctuations in image quality.
The third case more intuitively reflects the model’s ability to handle noise information. This case contains two images, where the second image only serves as a reference for “innocent person” and is irrelevant to the core entity “Richard”, which is a typical example of modal noise. The TPM-MI model fails to recognize this level of information difference and incorrectly associates the elements in the noise image with the target entity, leading to deviations in the recognition result. In contrast, the SMCL model effectively filters the noise information through the modal-specific representation refinement mechanism, eliminates the interference of irrelevant images, and at the same time captures the in-depth association between “Richard” and the text context, relying on cross-modal semantic alignment, and finally accurately recognizes it as the person type (PER).
A comprehensive comparison across the three cases shows that the proposed model SMCL not only filters redundant information, effectively addressing the problem of modal noise and improving recognition accuracy, but also handles scenarios with low-quality image prompts by strengthening semantic alignment between modalities. This fully verifies its advancement and robustness in performing the MNER task in complex multi-image scenarios.
4.7. Computational Cost Analysis
To evaluate the practicality of the proposed SMCL model, a comprehensive analysis of its computational cost is conducted, covering parameter overhead, training efficiency, and inference speed. The model is compared with representative baseline models (BERT-CRF, UMT-MI, TPM-MI) and its ablation variant (SMCL w/o SFM) using three key metrics: parameter count (Params), training time per epoch, and inference speed (samples per second). As shown in
Table 5, SMCL has 133.5 M parameters, 12.5% more than TPM-MI (118.7 M). The additional parameters mainly come from the symmetric branch encoders (AE+RE) and the Symmetric Multimodal Fusion Module (SFM). Specifically, RE introduces independent learnable parameters for the text and image modalities to filter modal noise, while SFM incorporates multi-layer Transformer and cross-modal attention mechanisms to achieve deep fusion. In terms of training efficiency, SMCL takes 2.3 h per epoch, 9.5% longer than TPM-MI (2.1 h), due to the additional computational cost of contrastive-learning alignment (AE) and orthogonal-constraint optimization (RE). For inference speed, SMCL processes 190.2 samples per second, an 11.9% decrease compared to TPM-MI (215.8 samples/s), which is attributed to the cross-modal interaction steps in SFM.
However, the computational cost increment is justified by significant performance gains. SMCL outperforms TPM-MI by 2.44% and 1.72% in F1-score on MNER-MI and MNER-MI-Plus datasets, respectively. Moreover, ablation experiments show that removing SFM reduces parameters by 8.9% and shortens training time by 13.0%, but leads to a 6.1% drop in F1-score (MNER-MI), confirming that SFM is indispensable for cross-modal feature synergy. In practical applications, SMCL’s inference speed (190 samples/s) meets the real-time requirements of short text-processing scenarios, and its cost can be further reduced via model compression or lightweight backbone replacement in future work.
Overall, SMCL achieves a favorable balance between computational cost and performance, providing a robust and practical solution for multi-image MNER tasks without excessive overhead.
4.8. Error Analysis
To verify the superiority of SMCL in handling complex multi-image MNER tasks, we randomly selected 1000 error samples from the test sets of MNER-MI and MNER-MI-Plus, and compared the error distribution with the state-of-the-art baseline TPM-MI. Errors are categorized into three types based on their manifestations and causes.
4.8.1. Definition of Error Types
Boundary Errors: Misidentification of the start or end positions of named entities.
Type Confusion: Incorrect classification of entity types.
Modality Conflict Errors: Prediction errors caused by inconsistent semantic information between text and image modalities.
4.8.2. Quantitative Error Distribution
Table 6 presents the quantitative distribution and reduction rates of the three error types for both models. SMCL achieves significant reductions in all error categories, with an overall error count reduction of 21.8% on MNER-MI and 24.0% on MNER-MI-Plus.
4.8.3. Qualitative Analysis of Typical Cases
Boundary Error Correction: For the text “Disney Channel”, TPM-MI only recognizes “Disney” due to insufficient visual cues. SMCL captures the semantic binding between “Disney” and “Channel” through the token-level fusion mechanism of the SFM module, achieving complete entity recognition.
Type Confusion Correction: For the entity “Richard” in the text, TPM-MI mislabels it as MISC due to interference from noise images. SMCL filters modal noise via the RE module and aligns cross-modal semantics through the AE module, correctly classifying “Richard” as PER.
Modality Conflict Correction: For “Joey” and “Niko” with conflicting text–image cues (text implies PER, images imply MISC), TPM-MI over-relies on text information. SMCL balances bimodal contributions through symmetric fusion, correctly predicting MISC.
4.8.4. Error-Reduction Mechanism
SMCL’s error reduction benefits from the synergistic effect of its core components. First, the SFM module enhances token-level semantic dependencies, effectively reducing boundary errors. Then, contrastive learning in the AE module strengthens the discriminability of entity type features, alleviating type confusion. Finally, orthogonal constraints in the RE module filter conflicting noise, mitigating modality conflict errors.
This verifies SMCL’s robustness and superiority in handling complex multi-image MNER scenarios with ambiguous boundaries, confusing types, or conflicting modal information.
5. Conclusions and Future Work
To effectively solve the problems of modality noise and modality differences in the MNER task in multi-image scenarios, this paper integrates symmetric multimodal fusion with contrastive learning and proposes a new MNER model. To address the problem of modality noise, each modality is projected into its own specific subspace to learn modal-specific representation, which is optimized through orthogonal constraints. To reduce the modality difference, each modality is mapped to a shared subspace, and by optimizing the distribution of feature representations, the feature representations of the same category are made more compact and those of different categories are made more separated. Specifically, the symmetric multimodal fusion module is designed to independently enhance and optimize each modality with a multi-layer self-attention mechanism, thereby achieving in-depth fusion through symmetric cross-modal feature interaction. The experimental results demonstrate that the performance of this model is significantly superior to that of existing approaches.
It should be noted that the model also has limitations. When processing multi-image input, it adopts an equal-weight fusion strategy and fails to distinguish the differences in the contribution of different images to text understanding. In practice, different images vary in their importance for the comprehension of posts, and some images may contain key entity visual cues, while others may only provide auxiliary context or irrelevant information. Equal-weight fusion would dilute the contribution of effective information. Future work will introduce an image–text relevance scoring mechanism to assign dynamic weights to different images, thereby achieving more accurate multi-image feature fusion and further enhancing the model’s adaptability in complex scenarios.