1. Introduction
Named-entity recognition (NER) is a fundamental task in the field of Natural Language Processing (NLP), which focuses on identifying specific entities in sentences and classifying them into pre-defined categories such as person names, place names, and organization names [1]. In recent years, with the rapid development of multimodal learning, multimodal named-entity recognition (MNER) has emerged. MNER introduces images as additional context for the text to improve the performance of text-only NER [2]. Early studies directly explored the entire image using global visual cues. These methods either encode the original image into a feature vector to enhance semantics [3] or roughly segment the image into multiple grids for cross-modal interaction [4]. For more fine-grained visual cues, subsequent studies began to extract salient objects from images and encode them as local visual prompts to guide task prediction [5]. Although these methods have achieved remarkable success, in real social media scenarios user-generated tweets often contain text and multiple images; considering only text and a single image cannot meet the requirements of real MNER scenarios.
The core of the multi-image MNER task lies in integrating multiple sources of visual information to enhance text semantics. Two key issues need to be fully addressed. The first is modality noise. Each modality contains information that is irrelevant to the task, which not only fails to contribute to the final prediction but may even mislead it. For example, in Figure 1, regarding the text modality, the model tends to predict “Joey” and “Niko” as the person type (PER) because some words (such as “rest” and “wrestling”) endow them with human characteristics. However, considering the text and images together, the “Cat” object (the area in the box) in the image can clearly guide the correct prediction. Nevertheless, the image modality introduces noise at two levels: first, at the global level, not all image regions are informative for the recognition of target entities; second, at the local level, the salient regions in the image express more complex visual semantics (“Two cats lie on the bed”) rather than the simple term “Cat”. This information may interfere with the model’s assignment of attention weights to image regions, thereby hindering the final task prediction. Therefore, it is crucial to extract key information and reduce redundant noise.
Even if redundant noise is eliminated, a second issue arises: the mismatched feature representations of input text and images, as well as the intrinsic discrepancies across these two modalities. Specifically, since text and visual representations are derived from different encoders, they occupy different feature spaces and distributions. These modality differences make it difficult for text and images to “understand” each other and capture cross-modal semantic correlations. In Figure 1, ideally, the text entities “Joey” and “Niko” should have stronger semantic correlations with the “Cat” in the box than with other regions. However, the differences between text and visual representations prevent the establishment of such an alignment and hinder the exploration of predictive visual cues. Although previous studies converted visual objects into text descriptions to achieve consistent semantic expression, this method is highly dependent on text descriptions generated by external tools. Additionally, the inherent defects of those tools readily induce deviations in semantic transmission, which impairs the performance of cross-modal alignment.
Existing methods have made certain efforts to solve these problems, but such attempts remain fragmented and inadequate for the core challenges of multi-image MNER. For instance, some methods try to extend single-image fusion frameworks to multi-image scenarios by simply stitching multiple images into a single input; this approach fails to account for the distinct contribution of each image and ignores the internal noise within the images. Potential redundancy or conflict between the visual information of different images is likewise overlooked.
This paper proposes a new named-entity recognition model that integrates symmetric multimodal fusion with contrastive learning (SMCL). Specifically, the model adopts an architecture of symmetric-encoder collaborative cross-modal fusion. To address modality noise, a modality refinement encoder (RE) is designed, which maps text and images to their respective exclusive semantic spaces and learns modality-specific features. Additionally, orthogonal constraints are introduced to optimize the feature space, systematically filter redundant information, and extract high-purity features. To tackle modality differences, a modality alignment encoder (AE) is constructed, which maps the features of each modality to a unified semantic space. Based on a multimodal contrastive learning mechanism, it achieves the aggregation of similar categories and the separation of different categories, accurately captures the latent semantic associations between modalities, and effectively bridges the semantic gap. On this basis, the refined and aligned representations are input into the symmetric multimodal fusion module (SFM). This module generates a hyper-feature representation containing rich cross-modal semantics through dual enhancement and deep cross-fusion, providing more predictive feature support for the named-entity recognition task.
It is worth noting that the symmetric encoder proposed in this study specifically refers to the equivalent treatment of the two modalities (text and image) in the feature-processing pipeline. Both text and image modalities undergo symmetric processing encompassing modality-specific space feature refinement and shared-space feature alignment. A unified framework enables bidirectional collaboration between intra-modal optimization and cross-modal fusion, distinguishing it from traditional unidirectional dependency or asymmetric processing paradigms. In summary, the contributions of this paper are as follows:
For complex multi-image scenarios, a new multimodal named-entity recognition model, SMCL, is designed. This model combines symmetric multimodal fusion and contrastive learning, enabling full exploration of the inherent characteristics of each modality and the cross-modal semantic associations.
A new multimodal fusion module, SFM, is designed for symmetric multimodal fusion. This module conducts an independent enhancement and optimization of each modality, effectively strengthening the capacity of feature representation. On this basis, the module achieves deep multimodal fusion and yields a hyper-feature representation that integrates full-modal information.
Extensive experiments on the MNER-MI and MNER-MI-Plus datasets demonstrate that the proposed model, SMCL, achieves state-of-the-art performance on the multi-image MNER task. Furthermore, ablation experiments and case studies fully validate the effectiveness of SMCL in addressing the issues of modality noise and modality differences, thereby providing a robust solution for MNER in multi-image scenarios.
2. Related Works
2.1. Multimodal Named-Entity Recognition
Multimodal named-entity recognition (MNER) extends traditional text-centric named-entity recognition (NER) by incorporating additional modality information, primarily visual content [3] and auditory content [6], and significantly improves the performance of entity recognition. In the early stages, Zhang et al. [7], Lu et al. [8], and Arshad et al. [9] made pioneering attempts. Specifically, they employed recurrent neural networks (RNNs) and convolutional neural networks (CNNs) as modality-specific encoders, where RNNs were dedicated to processing textual data and CNNs to modeling visual information. An implicit cross-modal interaction mechanism was further devised to capture inter-modal semantic correlations, thus pioneering the first round of exploratory research on the MNER task. Later, Yu et al. [10] and Wang et al. [11] used more advanced text encoders to obtain better text representations, while Wu et al. [12] proposed using objects in images as image representations.
Regarding the interaction between text and images, existing approaches primarily depend on the attention mechanism. For instance, Zhang et al. [13] adopted a graph-based approach to facilitate text–image interaction. Chen et al. [14] and Xu et al. [15] leveraged image representations as prompts to enable interactions between such representations and each layer of the text encoder. It is worth noting that with the advancement of large language models (LLMs), Li et al. [16] proposed a two-stage framework named PGIM, which integrates ChatGPT as an implicit knowledge base into MNER. Specifically, this framework generates refined auxiliary knowledge through multimodal example awareness and prompt guidance, circumvents cross-modal alignment challenges by adopting a text–text paradigm, and consequently enhances performance.
This paper proposes a novel MNER model involving multiple images. Specifically, the core information of each modality is extracted first, followed by a systematic exploration of inter-modal commonalities and intra-modal unique characteristics. On this basis, a representation covering the full multimodal view is constructed, which effectively filters out modality noise and bridges the modality gap. This model not only provides a novel idea for multimodal fusion but also ultimately enhances the overall performance of NER.
2.2. Multimodal Contrastive Learning
In the field of natural language processing (NLP), multimodal contrastive learning focuses on sample construction and semantic representation enhancement, whereas the cross-modal fusion field takes the establishment of a universal semantic space as its core objective. Bao et al. [17] proposed a novel contrastive pre-training method with tailored multi-level alignment, which incorporates a text–image alignment module based on contrastive learning to optimize the consistency of cross-modal representations.
Against the backdrop of the MNER task's requirements, existing research still faces critical limitations. Most contrastive learning-based MNER methods are tailored to single-image scenarios, and even the handful of multi-image studies fail to integrate contrastive learning deeply with the sequence-labeling nature of NER. As a result, they cannot meet entity-level fine-grained semantic alignment requirements or effectively capture the correlations between textual entities and visual cues from multiple images. Furthermore, cross-modal contrastive learning methods predominantly adopt simplistic bimodal interaction architectures and lack systematic filtering mechanisms for modality noise in multi-image scenarios. They therefore cannot handle redundant image backgrounds and cross-image information conflicts, which induces pronounced modality interference in MNER tasks and compromises model robustness.
To address the above limitations, this study proposes a Symmetric Fusion and Contrastive Learning Integration Model (SMCL) and constructs a cross-domain learning framework suitable for MNER tasks in multi-image scenarios. With a symmetric encoder architecture as the core, the model deeply integrates multimodal contrastive learning with the sequence labeling characteristics of NER. Specifically, the Refinement Encoder (RE) introduces an orthogonal constraint mechanism to effectively filter modal noise from multiple images, thus resolving the interference caused by redundant information. By leveraging the Alignment Encoder (AE) and coordinating the Central Moment Discrepancy (CMD) similarity loss and discrepancy loss functions, the model achieves the precise separation and alignment of modal-shared features and modal-specific features, breaking through the limitations of traditional one-way guided alignment methods.
2.3. Multimodal Named-Entity Recognition in Multi-Image Scenarios
With the explosive growth of user-generated content on social media platforms, posts that integrate textual content and multiple images have become increasingly prevalent. According to research by Zhang et al. [7], over 42% of tweets include more than one image. In this case, relying only on text or a single image for named-entity recognition can no longer meet practical needs. To address this limitation, Huang et al. [18] proposed a temporal prompt model, which treats multiple images as frames in a video, utilizes temporal information to establish relationships between different images, and achieves text–image interaction by taking multiple images as prompts.
However, this method suffers from two critical limitations regarding symmetry and robustness: first, it neglects the redundant noise inherent in individual modalities, imposing an extra burden on the subsequent fusion process; second, it relies on unidirectional guidance from high-quality image prompts to facilitate text generation or understanding, which leads to significant performance degradation when text–image modal mismatch occurs.
In contrast, this paper emphasizes the symmetry and robustness of multimodal processing: the proposed model symmetrically extracts and refines the distinctive features of both text and image modalities, thereby achieving noise suppression to eliminate modal interference. Furthermore, leveraging multimodal contrastive learning, the model symmetrically maps text and image representations into a shared semantic space. By exploring the intrinsic commonalities between the two symmetrically represented modalities, it ultimately achieves accurate and robust cross-modal alignment.
3. Methods
This section begins with the task formulation of multimodal named-entity recognition (MNER) in multi-image scenarios, clearly defining the core problem to be addressed. Subsequently, a novel MNER model is proposed, which integrates Symmetric Multimodal fusion with Contrastive Learning (SMCL). With symmetry as its foundational design principle, this model adopts an architecture of symmetric-encoder collaborative cross-modal fusion, consisting of a multi-image feature extraction module, a text feature extraction module, and a symmetric multimodal fusion module. Built on this symmetric architecture, the model alleviates the additional fusion burden caused by modality noise and cross-modal differences. Together, these efforts contribute to achieving superior MNER performance.
3.1. Problem Formulation
The input consists of a text sequence $T$ and its corresponding set of images $\{I_1, I_2, \dots, I_m\}$, where $m$ is the upper limit of image quantity used in this paper. The goal of the task is to extract named entities from the text $T$ and assign each of them to a pre-defined type. Consistent with previous MNER studies, this task is regarded as a sequence labeling problem. Let $X = \{x_1, x_2, \dots, x_n\}$ be the input word sequence, and $y = \{y_1, y_2, \dots, y_n\}$ be the corresponding label sequence, where $y_i \in Y$ and $Y$ is the pre-defined label set, using the BIO2 labeling scheme [19].
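As a concrete illustration of the BIO2 scheme, the sketch below converts token-level entity spans into label sequences. The tokens, span indices, and the MISC type for the cat names are hypothetical examples, not annotations from the datasets used in this paper; in BIO2, every entity begins with a B- tag even when it does not follow another entity.

```python
def bio2_tags(tokens, entities):
    """Map token spans to BIO2 labels.

    `entities` is a list of (start, end, type) token spans, end exclusive.
    In BIO2 every entity starts with B-, even directly after an O tag.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"          # entity head
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # entity continuation
    return tags

tokens = ["Joey", "and", "Niko", "rest", "after", "wrestling"]
# hypothetical gold spans: both names refer to cats, hence MISC rather than PER
spans = [(0, 1, "MISC"), (2, 3, "MISC")]
print(bio2_tags(tokens, spans))
# ['B-MISC', 'O', 'B-MISC', 'O', 'O', 'O']
```

A multi-token entity such as "New York City" would receive `B-LOC I-LOC I-LOC` under the same function.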
3.2. Model Overview
The overall architecture of the model is shown in Figure 2, and its workflow is described as follows.
First, feature extraction is performed. For multi-image features, each image is sequentially input into the Vision Transformer (ViT) [20] to extract feature representations. Then, a learnable positional embedding vector is added to each image feature to mark its temporal order. Finally, the feature representations of all images are fed into the Transformer encoder [21] to generate the overall multi-image representation. For text features, the text sequence is input into the embedding layer of BERT [22] to obtain the initial representation of each token. Subsequently, the image feature is used as a guiding signal through the Image Guidance Module (IGM) to generate a text representation guided by image semantics.
Next, the modality optimization phase implements symmetric processing. The refined multi-image representation and semantic-augmented text representation are symmetrically fed into their dedicated Modality Refinement Encoders (RE) and a shared Modality Alignment Encoder (AE). The RE modules focus on modality noise filtering, while the AE module achieves precise cross-modal semantic alignment. Then the optimized features from each modality are input into the Symmetric Multimodal Fusion Module (SFM). Leveraging self-attention reinforcement and a symmetry modality interaction strategy, this module constructs a hyper-feature representation (Hyper) that encapsulates high-level cross-modal semantic information, maximizing the synergy between visual and textual semantics.
Finally, the Hyper is fed into the Conditional Random Field (CRF) layer [23], which captures the sequential dependencies between tokens to accomplish the final NER prediction with accurate entity boundary and type classification.
3.3. Multi-Image Feature Extraction
The ViT model is used as the image encoder to extract feature representations of the $m$ input images, and the Transformer encoder is used to capture the mutual relationships between them. The specific steps are as follows.
3.3.1. Multi-Image Representation
Each image is resized to 224 × 224 pixels and then input into ViT. ViT divides each image into 14 × 14 = 196 non-overlapping 16 × 16 pixel patches and generates a representation for each image patch through linear embedding, obtaining $S = \{s_1, s_2, \dots, s_{196}\}$, $s_i \in \mathbb{R}^{d}$, where $d$ denotes the output dimension of ViT-base. Subsequently, a learnable special token [CLS] with the same dimension as the image patches is added at the beginning of the sequence $S$ to form the sequence $S' = \{s_0, s_1, \dots, s_{196}\}$, where $s_1$ to $s_{196}$ are the original sequence and $s_0$ is the special token [CLS]. Next, the activation of the [CLS] token in the last layer of ViT is taken as the corresponding image representation, resulting in the representation set of the $m$ images $V = \{v_1, v_2, \dots, v_m\}$, $v_i \in \mathbb{R}^{d_v}$, where $d_v$ is defined as the dimension of the image representation. When the count of input images falls below $m$, zero vectors are used for padding to ensure the consistency of the input.
3.3.2. Temporal Positional Encoding
Since the positional information of images affects the feature extraction results (for example, the first image usually contains more detailed information, and the text may contain words indicating the position of the image), a learnable temporal positional encoding is set to indicate the position and temporal information of multiple images.
Let the positional embedding vector be $C = \{c_1, c_2, \dots, c_m\}$, where $c_1$ to $c_m$ are the positional embeddings corresponding to each image. The positional embedding vector $C$ is added to the image representation $V$: $\tilde{V} = V + C$, where $\tilde{V}$ is a representation of the multiple images containing position and temporal information.
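The zero-padding and positional-embedding steps can be sketched as follows. This is a minimal numpy illustration, assuming an image limit of $m = 5$ and the ViT-base dimension of 768; the random vectors stand in for real ViT [CLS] features and learned embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 768   # assumed image upper limit and ViT-base hidden size

def pad_and_position(image_feats, pos_embed):
    """Zero-pad the per-image [CLS] features up to m images, then add
    the learnable temporal positional embedding element-wise (V~ = V + C)."""
    n = image_feats.shape[0]
    padded = np.zeros((m, d), dtype=image_feats.dtype)
    padded[:n] = image_feats
    return padded + pos_embed

feats = rng.standard_normal((3, d))        # a tweet with only 3 images
pos = rng.standard_normal((m, d)) * 0.02   # stands in for the learned C
v_tilde = pad_and_position(feats, pos)
print(v_tilde.shape)   # (5, 768)
```

Note that the padded slots still receive their positional embeddings, so the encoder can distinguish empty slots by position.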
3.3.3. Multi-Image Relationship Modeling
The Transformer is a deep learning architecture based on the attention mechanism, whose core advantage lies in its ability to efficiently capture long-range dependencies in sequential data. After adding the positional embedding vector, the Transformer encoder is used to capture the interaction information between multiple images. $\tilde{V}$ is input into the Transformer encoder, and the self-attention mechanism of the encoder is used to realize global semantic modeling and feature association across images, obtaining the multi-image representation $G$.
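The core of this cross-image modeling is self-attention over the $m$ image vectors. The sketch below shows a single-head variant with hypothetical dimensions; a real Transformer encoder adds multi-head projections, residual connections, layer normalization, and a feed-forward sublayer.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention across the m image
    vectors: every image attends to every other image."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(1)
m, d = 5, 64                                 # toy sizes for illustration
x = rng.standard_normal((m, d))              # stands in for V~
w = [rng.standard_normal((d, d)) * 0.05 for _ in range(3)]
g = self_attention(x, *w)                    # stands in for G
print(g.shape)   # (5, 64)
```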
3.4. Text Feature Extraction
BERT is used as the text encoder to extract the feature representation of the text input, and the text is guided by image information to obtain a text representation containing image–text interaction information. The full procedure is specified as follows.
3.4.1. Text Representation
Prepend the [CLS] token and append the [SEP] token to the text input $T$, obtaining $T' = \{t_0, t_1, \dots, t_n, t_{n+1}\}$, where $t_1$ to $t_n$ are the original input text, $t_0$ is the [CLS] token, and $t_{n+1}$ is the [SEP] token. Next, input it into the embedding layer of BERT to obtain the text representation of the 0-th layer Transformer $H^{0}$, where $d_t$ is defined as the dimension of the text representation.
3.4.2. Image Modality Guidance
The Image Guidance Module (IGM) uses image representations as prior knowledge to guide text representations, establishing a correlation mapping between visual and textual features. This guidance mechanism is not a simple feature concatenation but a cross-modal interaction through the attention mechanism during encoding, resulting in text representations with better visual consistency and semantic accuracy. Specific implementation details are shown in Figure 3.
This module projects the image representation $G$ into each layer of the BERT text encoder for subsequent interaction with text:

$G^{l} = G W^{l}, \quad l = 1, \dots, L,$

where $L$ is defined as the number of Transformer layers in the text encoder, $G^{l}$ is defined as the image-guided projection corresponding to the $l$-th layer Transformer, $W^{l}$ is the weight matrix corresponding to the $l$-th layer Transformer, $d_t$ is defined as the dimension of the text representation, and $d_v$ is defined as the dimension of the image representation. Subsequently, the text representation $H^{l-1}$ from the $(l-1)$-th layer and $G^{l}$ are fed into the $l$-th layer Transformer of BERT, yielding the $l$-th layer representation $H^{l}$. The specific implementation is as follows:
First, project $H^{l-1}$ into the query $Q^{l}$, key $K^{l}$, and value $V^{l}$ of the $l$-th layer:

$Q^{l} = H^{l-1} W_{Q}^{l}, \quad K^{l} = H^{l-1} W_{K}^{l}, \quad V^{l} = H^{l-1} W_{V}^{l},$

where $W_{Q}^{l}$, $W_{K}^{l}$, $W_{V}^{l}$ are weight matrices. Next, project $G^{l}$ into additional keys $K_{G}^{l}$ and values $V_{G}^{l}$ to interact with the $(l-1)$-th layer text representation:

$K_{G}^{l} = G^{l} W_{K_G}^{l}, \quad V_{G}^{l} = G^{l} W_{V_G}^{l},$

where $W_{K_G}^{l}$, $W_{V_G}^{l}$ are weight matrices; $K_{G}^{l}$ is derived from image projection features and $K^{l}$ is derived from textual features. Specifically, since background regions irrelevant to textual entities may exist in images, the corresponding $K_{G}^{l}$ will generate low-correlation attention weights, which are defined as noise-related weights. To accurately filter valid weights, a dynamic threshold $\tau^{l}$ based on the statistical features of the attention weights in the current layer is introduced:

$\tau^{l} = \sigma(\bar{\alpha}^{l}),$

where $\bar{\alpha}^{l}$ denotes the mean value of attention weights at the $l$-th layer and $\sigma$ represents the Sigmoid function.

This threshold is adaptively adjusted: a low value of $\bar{\alpha}^{l}$ in the presence of substantial image noise leads to a corresponding reduction in $\tau^{l}$, retaining more potentially valid weights, whereas minimal image noise induces an increase in $\tau^{l}$, rigorously filtering out low-correlation weights. Meanwhile, attention weights below $\tau^{l}$ are set to 0, with only those above the threshold retained for subsequent calculations. The filtered weights are then re-normalized to ensure that valid visual information can precisely guide the learning of textual representations, while irrelevant noise is suppressed. Following the processing of $L$ Transformer layers, the text representation $H^{L}$ can be obtained.
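The threshold-then-renormalize step can be sketched in isolation. This is a simplified stand-in: here the threshold is simply the mean attention weight of the toy matrix, whereas the module described above derives the threshold adaptively via the Sigmoid of the layer statistics; the guard for fully zeroed rows is also an implementation assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def filter_and_renormalize(attn, tau):
    """Zero attention weights below tau, then re-normalize each row so
    the surviving weights again sum to 1."""
    kept = np.where(attn >= tau, attn, 0.0)
    row_sum = kept.sum(axis=-1, keepdims=True)
    row_sum = np.where(row_sum == 0.0, 1.0, row_sum)  # guard: all-zero row
    return kept / row_sum

attn = softmax(np.array([[2.0, 0.1, 0.1, 1.5],
                         [0.2, 3.0, 0.2, 0.2]]))
tau = attn.mean()     # simplified stand-in for the adaptive threshold
filtered = filter_and_renormalize(attn, tau)
print(filtered.round(3))
```

Low-correlation weights (here the 0.1-scored keys) are suppressed, and the remaining mass is redistributed over the retained positions.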
3.5. Symmetric Multimodal Fusion
To address the issues associated with modality noise and modality gaps, the extracted representation of each modality is projected into a modal-shared subspace and a modal-specific subspace. Through the symmetric multimodal fusion module (SFM), a hyper-feature representation (Hyper) containing rich cross-modal semantics is constructed to perform the subsequent MNER task. The specific process is as follows.
3.5.1. Modality Refinement and Alignment
The representation vectors of the two modalities are projected into two different representation vectors. The first one is the modal-shared representation $h_o^{c}$, which is the aligned representation learned by each modality in the same semantic subspace. The second one is the modal-specific representation $h_o^{p}$, which is obtained by capturing the unique features of each modality:

$h_o^{c} = \mathrm{AE}(u_o; \theta^{c}), \quad h_o^{p} = \mathrm{RE}(u_o; \theta_o^{p}),$

where $o \in \{t, v\}$ represents the two modalities, $u_o$ represents the vector representation corresponding to each modality, and AE and RE respectively represent the Modality Alignment Encoder and the Modality Refinement Encoder. AE shares the parameter $\theta^{c}$ between the two modalities, while RE assigns separate parameters $\theta_t^{p}$ and $\theta_v^{p}$ to each modality.
Notably, noise in the textual modality mainly stems from redundant modifiers and meaningless high-frequency words, whose feature representations are concentrated in the low-dimensional embedding space; in contrast, noise in the visual modality is mostly composed of background regions and local details of non-target entities, and the features of such noise exhibit the property of local isolation. Accordingly, if the architectures of AE and RE are designed as sophisticated structures, such as Transformer layers, such designs would strengthen the invalid correlations between noise patches and target patches and excessively capture the local correlations of noise, which not only leads to the dilution of entity features but also increases the noise-filtering burden of subsequent fusion modules. In contrast, a fully connected layer can directly map low-dimensional noise features to invalid dimensions via linear projection and suppress the feature responses of isolated noise patches directly through weight learning.
Therefore, to prevent redundant information from concentrating in the representation vectors, both AE and RE consist of a fully connected layer and a ReLU activation function. Thus, Equations (5) and (6) can also be expressed as follows:

$h_o^{c} = \mathrm{ReLU}(\mathrm{FC}(u_o; \theta^{c})), \quad h_o^{p} = \mathrm{ReLU}(\mathrm{FC}(u_o; \theta_o^{p})),$

where FC represents the fully connected layer. AE and RE are designed to bridge cross-modal semantic gaps and filter modality noise, respectively. During the RE training process, an orthogonality constraint is introduced, which systematically filters out irrelevant noise by limiting the redundancy between modality-specific representations and shared representations, as well as among different modality-specific representations.
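The parameter-sharing pattern of the two encoders can be made concrete with a small sketch: one FC + ReLU parameter set ($\theta^{c}$) is reused for both modalities in AE, while RE keeps a separate set per modality. All dimensions and random inputs are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, d_h = 768, 256   # assumed input and subspace dimensions

def encoder(theta):
    """One fully connected layer followed by ReLU: the shared shape of
    both the Alignment Encoder (AE) and the Refinement Encoder (RE)."""
    w, b = theta
    return lambda u: np.maximum(u @ w + b, 0.0)

theta_shared = (rng.standard_normal((d_in, d_h)) * 0.02, np.zeros(d_h))
theta_text   = (rng.standard_normal((d_in, d_h)) * 0.02, np.zeros(d_h))
theta_image  = (rng.standard_normal((d_in, d_h)) * 0.02, np.zeros(d_h))

ae = encoder(theta_shared)              # one parameter set, both modalities
re_t, re_v = encoder(theta_text), encoder(theta_image)  # per-modality sets

u_t = rng.standard_normal(d_in)   # stands in for the guided text feature
u_v = rng.standard_normal(d_in)   # stands in for the multi-image feature
h_t_c, h_v_c = ae(u_t), ae(u_v)        # shared (aligned) representations
h_t_p, h_v_p = re_t(u_t), re_v(u_v)    # modality-specific representations
print(h_t_c.shape, h_v_p.shape)
```

The orthogonality constraint of Section 3.7.2 is what keeps the shared and specific outputs from encoding the same information.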
3.5.2. Symmetric Multimodal Fusion Module
The symmetric multimodal fusion module (SFM) aims to independently enhance and optimize each modality using a multi-layer self-attention mechanism, and to achieve deep fusion through a symmetric cross-modal interaction mechanism. Its workflow is shown in Figure 4.
First, the shared representation of each modality is input into a Transformer layer, and the intra-modal feature representation is strengthened by means of the self-attention mechanism:

$\bar{h}_t^{c} = \mathrm{Trans}(h_t^{c}; \theta^{s}), \quad \bar{h}_v^{c} = \mathrm{Trans}(h_v^{c}; \theta^{s}),$

where $\bar{h}_t^{c}$ and $\bar{h}_v^{c}$ are the representations enhanced by the self-attention mechanism, $\mathrm{Trans}$ represents the Transformer layer, and $\theta^{s}$ is the learnable parameter. The shared representation and the specific representation of modality $o$ are then concatenated, the concatenated representations are fed into a fully connected layer, and the joint representation of modality $o$ is derived through the Sigmoid activation function:

$J_o = \sigma(\mathrm{FC}(\bar{h}_o^{c} \oplus h_o^{p}; \theta^{j})),$

where $J_o$ is the joint representation of modality $o$, $\theta^{j}$ is the learnable parameter, and ⊕ represents the concatenation operation. Then, the image representation is used as the query, and the text representation as the key and value:

$Q = J_v W_{Q}, \quad K = J_t W_{K}, \quad V = J_t W_{V},$

where $W_{Q}$, $W_{K}$, $W_{V}$ are learnable attention matrices. The similarity matrix $A$ between the text representation and the image representation is calculated as:

$A = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right).$

Finally, the output of the cross-modal attention and the feed-forward layer is used to obtain the hyper-feature representation:

$\mathrm{Hyper} = \mathrm{FFN}(A V),$

where Hyper represents the hyper-feature that aggregates the similarity representations of the two modalities, and FFN is the feed-forward layer of the Transformer.
Hyper integrates the specific features extracted by RE, the shared features learned by AE, and the correlation features generated by symmetric cross-modal interaction. It can simultaneously take into account modal specificity and cross-modal consistency, providing comprehensive and accurate feature support for the CRF layer, and thereby improving the accuracy of entity boundary localization and category classification.
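The gating-and-fusion flow above can be sketched end to end in a few lines. This is a minimal numpy illustration with assumed toy dimensions; the Transformer self-attention refinement and the feed-forward layer are omitted, and the random matrices stand in for the learned parameters $\theta^{j}$ and the attention matrices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(3)
n_t, n_v, d = 6, 5, 64   # token count, image-slot count, hidden size (assumed)

def joint_representation(shared, specific, w):
    """Concatenate shared and specific features, then FC + Sigmoid (J_o)."""
    return sigmoid(np.concatenate([shared, specific], axis=-1) @ w)

w_j = rng.standard_normal((2 * d, d)) * 0.05
j_t = joint_representation(rng.standard_normal((n_t, d)),
                           rng.standard_normal((n_t, d)), w_j)
j_v = joint_representation(rng.standard_normal((n_v, d)),
                           rng.standard_normal((n_v, d)), w_j)

# image as query, text as key and value; the FFN step is omitted here
wq, wk, wv = (rng.standard_normal((d, d)) * 0.05 for _ in range(3))
q, k, v = j_v @ wq, j_t @ wk, j_t @ wv
hyper = softmax(q @ k.T / np.sqrt(d)) @ v
print(hyper.shape)   # (5, 64)
```

The Sigmoid keeps every joint feature in (0, 1), so it acts as a soft gate over the concatenated shared and specific channels before cross-modal attention mixes the two modalities.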
3.6. CRF Decoder
A Conditional Random Field (CRF) decoder is used to perform the NER task. Input the hyper-feature representation Hyper, which aggregates the similarity features of each modality, into the standard CRF layer.
The CRF decoder takes as input the token-level hyper-representations $\mathrm{Hyper} = \{h_0, h_1, \dots, h_{n+1}\}$, where $h_i$ denotes the representation of the $i$-th token, $h_1$ to $h_n$ correspond to the representations of the original text tokens, and $h_0$ and $h_{n+1}$ denote the representations of the special tokens. It calculates the emission probability of the corresponding label for each token via the emission score $E$ ($E \in \mathbb{R}^{n \times k}$, where $k$ is the number of labels). Combined with the transition score $T$ (which captures the dependencies between labels), the decoder performs sequence-level entity prediction using the log-likelihood functions in Equations (16) and (17), and finally outputs a label sequence $\hat{y}$ that has the same length as the text token sequence:

$s(X, y) = \sum_{i=1}^{n} E_{i, y_i} + \sum_{i=1}^{n} T_{y_{i-1}, y_i}, \qquad p(y \mid X) = \frac{\exp\big(s(X, y)\big)}{\sum_{y' \in Y^{n}} \exp\big(s(X, y')\big)},$

where $E_{i, y_i}$ is the emission score of the $i$-th token, $T_{y_{i-1}, y_i}$ is the transition score from label $y_{i-1}$ to label $y_i$, and $Y$ represents the pre-defined label set using the BIO2 labeling scheme.
3.7. Learning
The overall learning of the model is performed by minimizing the total loss:

$\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \alpha \, \mathcal{L}_{\mathrm{sim}} + \beta \, \mathcal{L}_{\mathrm{diff}},$

where $\alpha$ and $\beta$ are interaction weights, which determine the contribution of each regularization component to the total loss $\mathcal{L}$. Each of these loss components is responsible for achieving the desired subspace characteristics. The specific implementation is as follows.
3.7.1. Similarity Loss
The minimization of the similarity loss serves to narrow the gap between different modalities, thereby promoting the alignment of the shared feature representations. The Central Moment Discrepancy (CMD) [24] is selected to achieve this goal. CMD is an advanced distance metric that measures the difference between the distributions of two representations by matching their moment differences of each order. Simply put, as the two distributions grow increasingly similar, the CMD distance correspondingly decreases.

First, define CMD. Let $X$ and $Y$ be bounded random samples defined on the interval $[a, b]^{N}$ with respective probability distributions $p$ and $q$. The central moment discrepancy regularizer $\mathrm{CMD}_K$ is defined as follows:

$\mathrm{CMD}_K(X, Y) = \frac{1}{|b - a|} \big\| \mathbb{E}(X) - \mathbb{E}(Y) \big\|_2 + \sum_{k=2}^{K} \frac{1}{|b - a|^{k}} \big\| C_k(X) - C_k(Y) \big\|_2,$

where $\mathbb{E}(X)$ is the empirical expectation vector of sample $X$, and $C_k(X) = \mathbb{E}\big((X - \mathbb{E}(X))^{k}\big)$ is the vector composed of the $k$-th order sample central moments of all coordinate components of $X$. For the proposed named-entity recognition model, the CMD loss between the shared representations is calculated as follows:

$\mathcal{L}_{\mathrm{sim}} = \mathrm{CMD}_K(h_t^{c}, h_v^{c}).$
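The CMD regularizer above translates directly into code. The sketch below is a straightforward numpy implementation of the definition (moments up to order $K$, samples as matrix rows); the uniform toy samples and $K = 5$ are illustrative assumptions.

```python
import numpy as np

def cmd(x, y, k=5, a=0.0, b=1.0):
    """Central Moment Discrepancy between two sample matrices whose rows
    are samples bounded in [a, b], with central moments up to order k."""
    span = np.abs(b - a)
    mx, my = x.mean(axis=0), y.mean(axis=0)
    d = np.linalg.norm(mx - my) / span               # first-moment term
    cx, cy = x - mx, y - my
    for order in range(2, k + 1):                    # higher central moments
        d += np.linalg.norm((cx ** order).mean(axis=0)
                            - (cy ** order).mean(axis=0)) / span ** order
    return d

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, size=(64, 16))
same = cmd(x, rng.uniform(0, 1, size=(64, 16)))      # same distribution
shifted = cmd(x, rng.uniform(0.3, 1, size=(64, 16))) # shifted distribution
print(same < shifted)   # closer distributions yield a smaller CMD
```

In the model, minimizing this quantity between $h_t^{c}$ and $h_v^{c}$ pulls the two shared representations toward a common distribution.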
3.7.2. Difference Loss
The difference loss ensures that the modal-shared representation and the modal-specific representation capture different information. Non-redundancy is achieved by imposing a soft orthogonality constraint between them [25]. Soft orthogonality constraints ensure non-redundancy between the shared representations and the specific representations, as well as between the specific representations of different modalities, which is directly related to the noise filtering of RE.

In a training batch, two matrices $H_o^{c}$ and $H_o^{p}$ are defined, whose rows are the hidden vectors $h_o^{c}$ and $h_o^{p}$ of modality $o$ for each input. Then, the orthogonality constraint for this pair of modality vectors is calculated as follows:

$\big\| {H_o^{c}}^{\top} H_o^{p} \big\|_F^{2},$

where $\| \cdot \|_F^{2}$ is the square of the Frobenius norm. In addition to the constraint between the shared representation and the specific representation, an orthogonality constraint is also added between the specific representations. Therefore, the overall difference loss is calculated as follows:

$\mathcal{L}_{\mathrm{diff}} = \big\| {H_t^{c}}^{\top} H_t^{p} \big\|_F^{2} + \big\| {H_v^{c}}^{\top} H_v^{p} \big\|_F^{2} + \big\| {H_t^{p}}^{\top} H_v^{p} \big\|_F^{2}.$

The discrepancy loss penalizes correlations between the shared and specific representations of each modality, as well as between the specific representations of the two modalities, forcing the RE to eliminate redundant noise and preserve the specific features conducive to entity recognition. This improves the noise-filtering performance of the RE, delivering high-purity modal features to the downstream fusion module.
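One pairwise term of the difference loss can be sketched as the squared Frobenius norm of the cross-correlation matrix; the toy batch and dimensions are illustrative assumptions, and the full loss sums this term over the three pairs listed above.

```python
import numpy as np

def diff_loss(a, b):
    """Squared Frobenius norm of A^T B: zero exactly when every column
    of A is orthogonal to every column of B."""
    return np.linalg.norm(a.T @ b, ord="fro") ** 2

rng = np.random.default_rng(5)
h_shared = rng.standard_normal((32, 8))    # batch of shared vectors
h_specific = rng.standard_normal((32, 8))  # batch of specific vectors

# random representations are correlated, so one pairwise term is positive
print(diff_loss(h_shared, h_specific) > 0.0)

# column-orthogonal matrices give exactly zero loss
e1 = np.eye(4)[:, :2]
e2 = np.eye(4)[:, 2:]
print(diff_loss(e1, e2))   # 0.0
```

Driving these terms toward zero during training is what decorrelates the subspaces produced by AE and RE.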
3.7.3. Task Loss
The task loss is used to estimate the prediction quality during training. The negative log-likelihood is selected as the corresponding loss function, whose formulation is presented as follows:

$\mathcal{L}_{\mathrm{task}} = -\frac{1}{N} \sum_{i=1}^{N} \log p\big(y^{(i)} \mid X^{(i)}\big),$

where the sum runs over the training sample batch and $N$ is the batch size.
4. Experiments
This section comprehensively evaluates the proposed named-entity recognition model (SMCL) through a series of experiments. Following recent studies, Precision (P), Recall (R), and F1-score (F1) are used as evaluation metrics. The experimental results show that the proposed model outperforms various single-modal and multi-modal methods on both datasets.
4.1. Experimental Setup
4.1.1. Datasets
MNER-MI and MNER-MI-Plus are two MNER datasets containing multiple images. As shown in
Table 1, the MNER-MI dataset has a total of 8576 tweets and 11,862 named entities, divided into a training set, a validation set, and a test set, containing 6856, 860, and 860 tweets, respectively. Each tweet contains an average of about three images. The MNER-MI-Plus dataset has a total of 13,395 tweets and 20,586 named entities, also divided into a training set, a validation set, and a test set, containing 10,229, 1583, and 1583 tweets, respectively, with an average of about two images per tweet.
4.1.2. Model Setup
Except for BiLSTM-based methods, all methods use BERT-base (
https://huggingface.co/bert-base-uncased (accessed on 29 November 2025)) as the text encoder and ViT-base-patch16 (
https://huggingface.co/google/vit-base-patch16-224 (accessed on 29 November 2025)) as the image encoder. For single-image multimodal models, open-source codes from GitHub (
https://github.com) were reused after dataset-format adaptation. For the multi-image models, UMT-MI extends UMT with multi-image concatenation and the same temporal positional encoding used in SMCL, while TPM-MI adopts its original implementation without structural modifications.
4.1.3. Parameter Setup
AdamW [
26] is used as the optimizer. A grid search is performed on the validation set over the learning rate and over batch sizes in the range [8, 32]. Mini-batch backpropagation is used for training, and the model with the best performance on the validation set is selected for evaluation on the test set. All models were trained for up to 50 epochs with early stopping (patience = 5), using the validation-optimal weights for testing, and the best hyperparameter configuration was selected via a single-run search based on validation F1-score.
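The search procedure described above (a grid over learning rate and batch size, early stopping with patience 5, and selection by validation F1) can be sketched as follows; `train_one_config` is a hypothetical hook standing in for one epoch of training plus validation:

```python
import itertools

def grid_search(train_one_config, lrs, batch_sizes, patience=5, max_epochs=50):
    """Return (best validation F1, (lr, bs)) over the hyperparameter grid.

    train_one_config(lr, bs, epoch) is assumed to train one epoch and
    return the resulting validation F1-score.
    """
    best_f1, best_cfg = -1.0, None
    for lr, bs in itertools.product(lrs, batch_sizes):
        run_best, wait = -1.0, 0
        for epoch in range(max_epochs):
            f1 = train_one_config(lr, bs, epoch)
            if f1 > run_best:
                run_best, wait = f1, 0
            else:
                wait += 1
                if wait >= patience:  # early stopping on stalled validation F1
                    break
        if run_best > best_f1:
            best_f1, best_cfg = run_best, (lr, bs)
    return best_f1, best_cfg
```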
4.2. Baseline
4.2.1. Named-Entity Recognition Models
For the single-text modality, several well-established models widely applied to the named-entity recognition (NER) task are investigated. First, BiLSTM-CRF [27] combines a Bidirectional Long Short-Term Memory (BiLSTM) network with a Conditional Random Field (CRF) layer to capture bidirectional dependencies in text and perform sequence labeling. Next, HBiLSTM-CRF [28] uses a hierarchical bidirectional LSTM structure to better model the internal structure of words, enriching character-level word representations. Finally, the pre-trained model BERT is investigated: a powerful Transformer-based text encoder that captures long-range dependencies and learns rich language knowledge through pre-training. BERT-CRF combines BERT's encoding ability with a CRF decoder, further improving performance on sequence-labeling tasks.
4.2.2. Multimodal Named-Entity Recognition Models
With respect to multimodal experiments that integrate text and image modalities, representative MNER models are selected as baselines. UMT [
10] designs a multi-modal interaction module to construct bidirectional associations between text and images; OCSGA [
12] adopts an object detector to extract objects in images and uses the text labels of these objects as image representations; UMGF [
13] proposes a method relying on a graph-based model to build connections between text and images; MAF [
4] puts forward a universal matching and alignment framework, which is designed to align text and image representations while mitigating the interference caused by image noise; ITA [
29] extracts objects, captions, and text from images as image representations; both HVPNeT [
14] and VisualPT-MoE [
15] leverage image representations as prompts to interact with all layers of the text encoder; PGIM [
16] leverages ChatGPT as an implicit knowledge base, generates auxiliary refined knowledge, and fuses this with original text to enhance performance.
However, these MNER methods only use a single image, while a tweet is usually accompanied by multiple images in practical applications. Studying only single-image MNER is far from sufficient. To address this, Huang et al. [
18] proposed TPM-MI, which accepts multiple images and uses image information as prompts to interact with the text. However, this method ignores the existence of modality noise and relies on high-quality image prompts to guide text generation or understanding; when the images do not match the text, performance degrades.
Therefore, this paper proposes a new MNER model that captures the unique features of each modality and refines the modality representations. By combining symmetric multimodal fusion with contrastive learning, each modality is projected into a shared subspace, their commonalities are learned, and consistent cross-modal representations are regularized to bridge the modality gap. With the help of the symmetric multimodal fusion module and the CRF layer, the MNER task achieves better performance.
4.3. Experimental Results
The results of the experiments are illustrated in
Table 2, which can be analyzed from three dimensions. First, by comparing the performance of single-modal named-entity recognition models, it can be seen that BERT-based models perform significantly better than BiLSTM-based models on both datasets. This result intuitively confirms the advantages of pre-trained language models.
Then, a horizontal comparison between single-image multimodal named-entity recognition models and single-modal models shows that almost all multimodal models show significant performance improvements. This indicates that the visual information contained in images can provide effective assistance in text-entity recognition, and also verifies the positive effect of multimodal fusion on task optimization.
In the comparison of multi-image multimodal named-entity recognition models, methods using multi-image input consistently outperform those using a single image. Taking the UMT-MI model as an example, this model integrates multi-image information by stitching multiple images into a single image on top of the original UMT model, and it performs significantly better than the original UMT. This further indicates that introducing multi-image information provides the model with richer visual cues, mitigates the perspective limitations and information loss of a single image, and thereby improves performance on the MNER task.
It is worth noting that, compared with the current best method TPM-MI, the proposed model SMCL achieves superior results. For example, on the MNER-MI dataset, its F1-score is 2.44% higher than that of TPM-MI. This is because TPM-MI has two limitations.
First, it lacks a systematic cross-modal semantic alignment mechanism, making it difficult to achieve an in-depth association between text and image features. Second, it does not effectively filter the noise information within the modality, such as redundant modifications in text and irrelevant backgrounds in images, resulting in limited model learning efficiency.
In contrast, SMCL achieves the refinement and alignment of each modality through an architecture of symmetric-encoder collaborative cross-modal fusion. On the one hand, it filters noise information through orthogonal constraints and retains the core features of each modality; on the other hand, it deeply explores the semantic associations between modalities through multimodal contrastive learning, achieving accurate alignment and cross-modal fusion. After the modality representations are refined and aligned, the Symmetric Multimodal Fusion Module (SFM) plays a key role in cross-modal fusion: through its symmetric cross-modal interaction mechanism, the specific and shared representations are effectively integrated. The combined effect of these three mechanisms ultimately yields a significant improvement in performance.
4.4. Ablation Experiments and Analysis
To verify the effectiveness of each component of the proposed MNER model SMCL,
Table 3 presents the ablation results after removing each component on the MNER-MI and MNER-MI-Plus datasets. Among them, “w/o SFM” indicates that the SFM module is removed; “w/o AE” and “w/o RE” respectively indicate that the learning of modal-shared representation and modal-specific representation is removed; “w/o AE+RE” denotes that this symmetric encoder branch is removed; and “w/o IGM” removes the multi-image guidance on text representations.
Replacing the SFM with a straightforward concatenation mechanism paired with a Transformer layer leads to a marked decline in the overall performance of the model. Specifically, experiments conducted on the MNER-MI dataset reveal a 2.85% reduction in F1-score relative to the complete model configuration. This finding first highlights the indispensable value of cross-modal complementary information for the NER task: textual ambiguities are frequently resolved by leveraging visual clues in images, and latent entity information in visual data requires textual context to achieve accurate positioning. Beyond this, the result further provides direct empirical evidence for the SFM’s efficacy in effectively integrating information from both modalities.
To further intuitively demonstrate the semantic alignment ability of the SFM, the cross-modal attention weights are visualized in
Figure 5. The darker color blocks in the heatmap represent higher attention weights, clearly showing that the entity token has a strong correlation with the global image feature and the local feature block. This visualization indicates that the SFM can accurately capture the potential semantic association between text entities and image content, which constitutes the key to its superiority compared to simple feature concatenation.
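The weights visualized in Figure 5 are standard scaled dot-product cross-attention scores between text tokens and image features. A sketch of how such a map is produced (the projection matrices and shapes are illustrative assumptions):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_map(text_h, image_h, w_q, w_k):
    """Attention weights of shape (n_tokens, n_regions): each row is the
    distribution of one text token's attention over the image features."""
    q = text_h @ w_q                         # project text tokens to queries
    k = image_h @ w_k                        # project image features to keys
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    return softmax(scores, axis=-1)
```

Each row sums to one, so a dark cell in the heatmap marks the image region that dominates a token's attention.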
Further analysis shows that after the symmetric mechanism of modal-shared representation learning and modal-specific feature learning is removed, the model loses the ability to finely process different modalities. On the one hand, the inherent semantic differences between modalities (such as the abstract nature of text and the concrete nature of images) cannot be effectively modeled, leading to deviations in cross-modal alignment. On the other hand, redundant information within the modality (such as irrelevant modifiers in text and background interference elements in images) cannot be filtered, resulting in the model receiving a large amount of noise input. The combined effect of these two aspects results in a reduction in the accuracy of entity boundary positioning and category judgment, which is ultimately reflected in a notable decline in the F1-score.
Moreover, replacing the “AE+RE” symmetric dual-branch structure with a single projection layer gives rise to two critical limitations. Specifically, the absence of a dedicated learning pathway for modal-specific representations weakens the model’s ability to filter redundant intra-modal noise, while the lack of a dedicated modeling mechanism for shared representations hinders its ability to bridge semantic gaps between modalities. As a result, the model fails to accomplish the two core goals of modal noise filtering and cross-modal feature alignment simultaneously, with its F1-score decreasing by 5.94% on the MNER-MI dataset and 5.01% on MNER-MI-Plus relative to the full SMCL model. These results fully demonstrate the necessity and irreplaceability of the AE+RE symmetric dual-branch structure in the proposed model.
In addition, the ablation results of the IGM confirm the positive guiding role of image representation in understanding text. When the image contains visual features of entities, encoded image representation can provide concrete references for resolving the ambiguity of text entities, helping the model accurately identify named entities. This one-way guidance mechanism complements the two-way fusion of SFM, jointly improving its ability to recognize multimodal entities in complex scenarios.
4.5. Hyperparameter Sensitivity Analysis
To verify the robustness of the proposed model SMCL to key hyperparameters and to identify the optimal parameter ranges, this section conducts a sensitivity analysis on the core hyperparameters of the multi-image MNER task. The experiments adopt the single-variable method: only one hyperparameter is varied at a time, while all other parameters are kept at the baseline configuration. Experiments are independently run five times on the MNER-MI and MNER-MI-Plus datasets, and the averages of P, R, and F1-score are reported. The F1-score is used as the core evaluation index to measure sensitivity.
4.5.1. Hyperparameters
Combining the model mechanism and the conventional values in the field, three categories of key hyperparameters and their gradient ranges are determined, as detailed in
Table 4. All parameter gradients cover the “low-baseline-high” interval to ensure a comprehensive verification of the impact of parameter changes on performance.
In light of relevant studies in multimodal learning, the weights of the orthogonal-constraint loss and the contrastive-learning loss are generally configured within the range 0.1–0.5, balancing the contributions of the main task loss and the regularization losses. Considering the architecture of the proposed model, the similarity loss drives cross-modal alignment, so an excessively large weight must be avoided to prevent suppression of modal-specific feature learning; the discrepancy loss imposes the orthogonal constraints, and an overly large weight must likewise be avoided to mitigate the loss of feature information. The weights are therefore initialized within this range, and the optimal values are determined through the single-variable sensitivity analysis, which ensures that the model attains a trade-off among cross-modal alignment performance, modality-noise filtering capability, and overall task performance.
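Under this weighting scheme, the overall training objective combines the three losses. The default weights below are assumptions drawn from the optimal ranges reported in the sensitivity analysis of Section 4.5.2 (0.2 for the similarity term, 0.1 for the difference term), not values quoted verbatim from the paper:

```python
def total_loss(task_loss, sim_loss, diff_loss, lam_sim=0.2, lam_diff=0.1):
    """Overall objective: task loss plus weighted regularizers.
    Small lam_sim / lam_diff keep the regularizers from dominating the task loss."""
    return task_loss + lam_sim * sim_loss + lam_diff * diff_loss
```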
4.5.2. Results of Hyperparameter Sensitivity Experiments
The CMD distance achieves distribution alignment by matching the first K central moments of the cross-modal features, so the order K directly affects alignment accuracy. As shown in
Figure 6a, on the MNER-MI dataset the F1-score peaks at 79.76% at the optimal K; with a smaller K, F1 decreases by 1.30% (to 78.46%) due to insufficient moment matching, while larger values of K cause F1 to drop to 79.25% and 78.09%, respectively, due to over-matching of high-order noise features. A similar pattern is observed on the MNER-MI-Plus dataset, where the optimal K yields the best F1 (85.14%) and a suboptimal setting leads to a 1.68% drop (to 83.46%). This indicates that K is a moderately sensitive parameter: only a moderate order balances alignment sufficiency and noise resistance, while values that are too high or too low weaken the ability to capture cross-modal semantic correlations.
The loss function weights control the balance of multi-objective optimization, and the experimental results are shown in
Figure 6b,c.
For the similarity loss weight, a moderate value achieves the optimal balance between cross-modal alignment and the task loss, yielding the highest F1. When the weight is too large, the alignment loss dominates and suppresses the learning of modality-specific features, with F1 on the MNER-MI dataset dropping to 78.52%; when it is too small, insufficient alignment enlarges the cross-modal semantic gap, with F1 decreasing by 1.24% (MNER-MI: 78.52%). The similarity weight is therefore a moderately sensitive parameter, and to keep it effective in driving cross-modal semantic fusion, its value should be maintained between 0.2 and 0.3.
For the difference loss weight, a moderate value gives the best redundancy filtering and the highest F1. When the weight is too large, excessive feature separation causes semantic information loss, and F1 decreases by 0.89% (MNER-MI: 78.87%); when it is too small, residual redundant information interferes with entity judgment, and F1 drops to 79.03%. The difference weight is thus a low-sensitivity parameter, and to safeguard the purity of the modal features, its value should lie between 0.1 and 0.2.
Optimizer parameters, particularly the learning rate, exert a critical influence on the model's convergence behavior and fitting capacity. As illustrated in
Figure 6d, the sensitivity analysis reveals distinct performance patterns: at the selected learning rate, the model converges stably and the F1-score peaks, indicating an optimal balance between training efficiency and generalization; when the rate is increased, the optimization process enters an oscillatory state, causing the F1-score on the MNER-MI dataset to plummet to 78.32% due to unstable parameter updates; conversely, when the rate is decreased, convergence becomes excessively slow, leading to underfitting and an F1-score of merely 78.01% on MNER-MI. These observations collectively demonstrate that the learning rate is a highly sensitive hyperparameter whose value must be strictly constrained near its optimum to ensure both convergence stability and performance optimality.
4.6. Case Study
To more clearly demonstrate the efficiency of the proposed named-entity model, SMCL, in completing the MNER task in multi-image scenarios, a case study is conducted, and the recognition results are carefully compared with those of the existing method, TPM-MI. The specific cases and results are shown in
Figure 7.
In the first case, the text contains the entity “Disney Channel”. The TPM-MI model only recognizes “Disney” as the organization type (ORG) and ignores the key information carried by “Channel”. The core reason for this misjudgment is that the image corresponding to this case does not contain any explicit visual cues that can be directly associated with “Disney Channel”, making it difficult for the model to complete accurate recognition with only limited image information. In contrast, the SMCL model not only captures the semantic binding relationship between “Disney” and “Channel” in the text by deeply exploring intra-modal information and cross-modal interaction information, but also combines the potential association features between modalities, and finally successfully recognizes “Disney Channel” as miscellaneous (MISC), demonstrating its ability to parse complex entity structures.
A similar phenomenon appears in the second case. When the visual prompts provided by the images are ambiguous and image quality is low, methods that rely solely on images as auxiliary prompts make recognition errors due to insufficient information. This further indicates that mechanisms over-reliant on image prompts are significantly one-sided, and their performance stability degrades with fluctuations in image quality.
The third case more intuitively reflects the model’s ability to handle noise information. This case contains two images, where the second image only serves as a reference for “innocent person” and is irrelevant to the core entity “Richard”, which is a typical example of modal noise. The TPM-MI model fails to recognize this level of information difference and incorrectly associates the elements in the noise image with the target entity, leading to deviations in the recognition result. In contrast, the SMCL model effectively filters the noise information through the modal-specific representation refinement mechanism, eliminates the interference of irrelevant images, and at the same time captures the in-depth association between “Richard” and the text context, relying on cross-modal semantic alignment, and finally accurately recognizes it as the person type (PER).
A comprehensive comparison across the three cases shows that the proposed model SMCL not only filters redundant information, effectively addressing the problem of modal noise and improving recognition accuracy, but also handles scenarios with low-quality image prompts by strengthening semantic alignment between modalities. This fully verifies its advancement and robustness in performing the MNER task in complex multi-image scenarios.
4.7. Computational Cost Analysis
To evaluate the practicality of the proposed SMCL model, a comprehensive analysis of its computational cost is conducted, covering parameter overhead, training efficiency, and inference speed. The model is compared with representative baseline models (BERT-CRF, UMT-MI, TPM-MI) and its ablation variant (SMCL w/o SFM) using three key metrics: parameter count (Params), training time per epoch, and inference speed (samples per second). As shown in
Table 5, SMCL has 133.5 M parameters, 12.5% more than TPM-MI (118.7 M). The additional parameters mainly come from the symmetric branch encoders (AE+RE) and the Symmetric Multimodal Fusion Module (SFM). Specifically, RE introduces independent learnable parameters for the text and image modalities to filter modal noise, while SFM incorporates multi-layer Transformer and cross-modal attention mechanisms to achieve deep fusion. In terms of training efficiency, SMCL takes 2.3 h per epoch, 9.5% longer than TPM-MI (2.1 h), due to the additional computational cost of contrastive-learning alignment (AE) and orthogonal-constraint optimization (RE). For inference speed, SMCL processes 190.2 samples per second, an 11.9% decrease compared to TPM-MI (215.8 samples/s), which is attributed to the cross-modal interaction steps in SFM.
However, the computational cost increment is justified by significant performance gains. SMCL outperforms TPM-MI by 2.44% and 1.72% in F1-score on MNER-MI and MNER-MI-Plus datasets, respectively. Moreover, ablation experiments show that removing SFM reduces parameters by 8.9% and shortens training time by 13.0%, but leads to a 6.1% drop in F1-score (MNER-MI), confirming that SFM is indispensable for cross-modal feature synergy. In practical applications, SMCL’s inference speed (190 samples/s) meets the real-time requirements of short text-processing scenarios, and its cost can be further reduced via model compression or lightweight backbone replacement in future work.
Overall, SMCL achieves a favorable balance between computational cost and performance, providing a robust and practical solution for multi-image MNER tasks without excessive overhead.
4.8. Error Analysis
To verify the superiority of SMCL in handling complex multi-image MNER tasks, we randomly selected 1000 error samples from the test sets of MNER-MI and MNER-MI-Plus, and compared the error distribution with the state-of-the-art baseline TPM-MI. Errors are categorized into three types based on their manifestations and causes.
4.8.1. Definition of Error Types
Boundary Errors: Misidentification of the start or end positions of named entities.
Type Confusion: Incorrect classification of entity types.
Modality Conflict Errors: Prediction errors caused by inconsistent semantic information between text and image modalities.
4.8.2. Quantitative Error Distribution
Table 6 presents the quantitative distribution and reduction rates of the three error types for both models. SMCL achieves significant reductions in all error categories, with an overall error count reduction of 21.8% on MNER-MI and 24.0% on MNER-MI-Plus.
4.8.3. Qualitative Analysis of Typical Cases
Boundary Error Correction: For the text “Disney Channel”, TPM-MI only recognizes “Disney” due to insufficient visual cues. SMCL captures the semantic binding between “Disney” and “Channel” through the token-level fusion mechanism of the SFM module, achieving complete entity recognition.
Type Confusion Correction: For the entity “Richard” in the text, TPM-MI mislabels it as MISC due to interference from noise images. SMCL filters modal noise via the RE module and aligns cross-modal semantics through the AE module, correctly classifying “Richard” as PER.
Modality Conflict Correction: For “Joey” and “Niko” with conflicting text–image cues (text implies PER, images imply MISC), TPM-MI over-relies on text information. SMCL balances bimodal contributions through symmetric fusion, correctly predicting MISC.
4.8.4. Error-Reduction Mechanism
SMCL’s error reduction benefits from the synergistic effect of its core components. First, the SFM module enhances token-level semantic dependencies, effectively reducing boundary errors. Then, contrastive learning in the AE module strengthens the discriminability of entity type features, alleviating type confusion. Finally, orthogonal constraints in the RE module filter conflicting noise, mitigating modality conflict errors.
This verifies SMCL’s robustness and superiority in handling complex multi-image MNER scenarios with ambiguous boundaries, confusing types, or conflicting modal information.
5. Conclusions and Future Work
To effectively solve the problems of modality noise and modality differences in the MNER task in multi-image scenarios, this paper integrates symmetric multimodal fusion with contrastive learning and proposes a new MNER model. To address the problem of modality noise, each modality is projected into its own specific subspace to learn modal-specific representation, which is optimized through orthogonal constraints. To reduce the modality difference, each modality is mapped to a shared subspace, and by optimizing the distribution of feature representations, the feature representations of the same category are made more compact and those of different categories are made more separated. Specifically, the symmetric multimodal fusion module is designed to independently enhance and optimize each modality with a multi-layer self-attention mechanism, thereby achieving in-depth fusion through symmetric cross-modal feature interaction. The experimental results demonstrate that the performance of this model is significantly superior to that of existing approaches.
It should be noted that the model also has limitations. When processing multi-image input, it adopts an equal-weight fusion strategy and fails to distinguish the differences in the contribution of different images to text understanding. In practice, different images vary in their importance for the comprehension of posts, and some images may contain key entity visual cues, while others may only provide auxiliary context or irrelevant information. Equal-weight fusion would dilute the contribution of effective information. Future work will introduce an image–text relevance scoring mechanism to assign dynamic weights to different images, thereby achieving more accurate multi-image feature fusion and further enhancing the model’s adaptability in complex scenarios.