Visual Clue Guidance and Consistency Matching Framework for Multimodal Named Entity Recognition

The goal of multimodal named entity recognition (MNER) is to detect entity spans in given image–text pairs and classify them into corresponding entity types. Despite the success of existing works that leverage cross-modal attention mechanisms to integrate textual and visual representations, we observe three key issues. Firstly, models are prone to misguidance when fusing unrelated text and images. Secondly, most existing visual features are not enhanced or filtered. Finally, due to the independent encoding strategies employed for text and images, a noticeable semantic gap exists between them. To address these challenges, we propose a framework called visual clue guidance and consistency matching (GMF). To tackle the first issue, we introduce a visual clue guidance (VCG) module that hierarchically extracts visual information at multiple scales. This information is used as an injectable visual clue guidance sequence that steers text representations toward error-insensitive prediction decisions. To tackle the second issue, we incorporate a cross-scale attention (CSA) module that mitigates interference across scales and enhances the image’s capability to capture details. To address the third issue of semantic disparity between text and images, we employ a consistency matching (CM) module based on the idea of multimodal contrastive learning, facilitating the collaborative learning of multimodal data. To validate the proposed framework, we conducted comprehensive experimental studies, including extensive comparative experiments, ablation studies, and case studies, on two widely used benchmark datasets; the results demonstrate the efficacy of the framework.


Introduction
Multimodal named entity recognition (MNER) has become an important research direction in named entity recognition (NER), and it can improve text-based NER by using images as additional inputs [1]. It assumes that image information can help to recognize ambiguous named entities when textual information is insufficient. Consider the text example "Rocky is ready for snow season": determining the named entity category from the textual content alone is clearly difficult, because "Rocky" could refer to a person, an animal, or some other entity type. However, when we combine this text with the corresponding image information (as shown in Figure 1), we can easily determine its type as MISC. In this paper, we investigate MNER in social media posts.
Existing work has achieved good performance compared with text-based NER methods [1][2][3][4][5][6][7][8][9][10][11][12][13][14]. Refs. [2][3][4] are pioneering works in multimodal named entity recognition. For the first time, image information was added to assist entity recognition in text, which improved the accuracy of entity recognition. However, although these methods generate word-related visual representations, they are insensitive to the visual context and ignore the bias that it introduces. To address these issues, ref. [5] proposes an end-to-end model for learning joint representations of text and images using a multidimensional self-attention technique that simultaneously captures text and improves the accuracy of the visual context. The authors of [4] present a model based on visual attention that provides deeper visual understanding in model decision-making. Ref. [6] uses object labels as embeddings to bridge the visual and verbal modalities and introduces an intensive co-attention mechanism for fine-grained interactions, so that named entities are recognized more accurately using the visual context. Ref. [7] was the first to design a multimodal interaction module using a Transformer to obtain word representations and visual representations of an image; its authors also use plain-text entity span detection as an auxiliary module to mitigate visual bias, considering both textual and visual information to improve recognition accuracy. Ref. [1] proposes a unified multimodal graph fusion (UMGF) method that represents the input sentence and image as a unified multimodal graph, stacks multiple graph-based multimodal fusion layers, learns node representations by iteratively performing semantic interactions, and uses the graph structure to make fine-grained semantic associations between text and image units. However, all five approaches [1][4][5][6][7] are limited in that they do not fully utilize the knowledge behind image and text pairs. Therefore, ref. [8] proposes a pretrained multimodal model based on Relational Inference and Visual Attention (RIVA), which employs a teacher-student semi-supervised paradigm to utilize a large unlabeled multimodal corpus and labeled datasets for the classification of text-image relations. Ref. [9] proposes an MNER neural model that acquires image attributes and image knowledge and designs a multiple attention mechanism to integrate this information. Ref. [10] proposes the MoRe framework, which injects knowledge-aware information into the multimodal NER task using multimodal retrieval, an approach rarely seen in previous research. However, these three approaches [8][9][10] cannot avoid the complexity of using external tools and datasets. To address this problem, ref. [11] proposes a multilevel semantic alignment approach that captures coarse-to-fine-grained interactions between images and language and directly uses learned visual features to capture the relationship between images and text more comprehensively, avoiding the complexity of external tools and datasets. Other works [12][13][14] pursue distinct ideas. Ref. [12] exploits the external matching relations between different (text, image) pairs within the dataset and designs an R-GCN model over these relations, mitigating image noise in the MNER task while modeling both cross-modal and intra-modal relationships. Ref. [13] proposes the SD-NER model, which models the minimum distance matrix between entities and is easily transferable to other tasks. Ref. [14] proposes a multimodal Chinese named entity recognition (USAF) model using acoustic features, which unifies textual and acoustic features through a unique positional embedding and fuses the two kinds of features using a multi-head attention mechanism.
Although existing methods [1][2][3][4][5][6][7][8][9][10][11][12][13][14] are a great improvement, they have three main shortcomings. Firstly, existing methods assume that each piece of text and its corresponding image are matched and that the image can help recognize named entities in the text, but not all texts are matched with their corresponding images. As shown in Figure 2, the model of ref. [7] incorrectly predicts "Aquamarine" as ORG due to the influence of the house, indicating that existing models have difficulty filtering the noise introduced by mismatched text-image pairs. According to [15], about 34.1% of the text content in Twitter-2015 has an imperfect match with images. Secondly, existing work usually relies only on image encoders such as the Residual Network (ResNet) [16] or Mask-RCNN [17] to extract visual features and feeds them, without enhancement or filtering, directly into cross-modal interaction mechanisms along with the text. Finally, most existing approaches fail to construct a consistent representation to bridge the semantic gap between the two modalities. As shown in Figure 3, ideally, the words "Danielle" and "Melissa" in the text should show a high degree of similarity with the region associated with the "rabbits" in the image and low similarity with other regions in the image. However, due to the inconsistent representation between the text and the image, the computed similarity between "Danielle" and "Melissa" in the text and the "rabbits" in the image may be lower than their similarity with other regions in the image.
To address these shortcomings, we propose a visual clue guidance and consistency matching framework (GMF). Firstly, we design a visual clue guidance (VCG) module, which collects image features through the ResNet50 image encoder. On this basis, we propose a cross-scale attention (CSA) module, which combines low-level, high-resolution image information with high-level, strongly semantic image information to enhance visual features. Further, feature fusion is performed by a structured mapping function F(θ), and, finally, the above vectors are processed using a gating mechanism for path decision-making. Secondly, building on Contrastive Language-Image Pretraining (CLIP) [18] and the idea of multimodal contrastive learning, we design the consistency matching (CM) module, which defines a contrastive loss that pulls matched image-text pairs closer together and pushes mismatched pairs apart, enhancing the learning of the multimodal latent semantic space and making the representations of the two modalities more consistent. In addition, the attention mechanism of the pretrained BERT model is used to obtain text-aware image representations. Finally, the MNER task is performed by a Conditional Random Field (CRF) decoder.
The main contributions of this paper can be summarized as follows. Firstly, we propose a visual clue guidance and consistency matching framework for the MNER task, which reduces the effect of mismatched text-image pairs through the visual clue guidance (VCG) module and enriches the information of the visual modality through the cross-scale attention (CSA) module. In addition, a consistency matching (CM) module is used to make the representations of images and text more consistent. Secondly, the three modules we propose (VCG, CSA, and CM) do not require additional data annotation and can be extended to other multimodal tasks. Finally, experiments on two publicly available datasets, Twitter-2015 and Twitter-2017, show that our framework outperforms strong existing models, achieving F1 scores of 75.81% and 87.11%, respectively. We also conducted ablation studies, case studies, and further analysis to show that the VCG, CSA, and CM modules each play an important role in our framework.

Related Work
In this section, we review and summarize the works most relevant to our study. Starting from [2][3][4], multimodal named entity recognition has become an important research direction in named entity recognition (NER), which significantly extends traditional text-based NER by using images as additional inputs [1]. The key challenge is to combine the text representation with the image representation. Reference [2] proposes an LSTM-CNN architecture that combines text and image information through a generic modal attention module. The authors of [3] propose an adaptive co-attention network to dynamically control the combination of the text representation and the image representation. The authors of [4] propose an attention-based model that extracts image features from the regions of the image most related to the text and uses a gate to combine text features and image features. The methods of refs. [2][3][4] thus integrate text and image information and improve the accuracy of entity recognition. However, they ignore the bias brought about by the visual context and are insensitive to visual context information. To solve these problems, ref. [5] proposes an end-to-end model that learns joint representations of text and images using multidimensional self-attention techniques. Ref. [6] uses object labels as embeddings to bridge the visual and the verbal, introducing an intensive co-attention mechanism for fine-grained interactions. The authors of [7] pioneered the use of Transformers in multimodal tasks; they proposed a multimodal interaction module that acquires word representations and visual representations of images and uses plain-text entity span detection as an auxiliary module to mitigate visual bias. Ref. [1] proposes a unified multimodal graph fusion (UMGF) approach, which represents input sentences and images as a unified multimodal graph, stacks multiple graph-based multimodal fusion layers, and learns node representations by iteratively performing semantic interactions. Although refs. [1][5][6][7] mitigate to some extent the bias caused by the visual context and the insensitivity to visual contextual information, they suffer from two shortcomings. First, these methods assume that each text and its accompanying image are matched and that the image can be used to help named entity recognition. Second, they cannot construct a consistent representation to bridge the semantic gap between the two modalities. Other works pursue distinct ideas. Ref. [12] designs an R-GCN model over the external matching relations between different (text, image) pairs within the dataset, which mitigates image noise in the MNER task and models both cross-modal and intra-modal relationships. The authors of [10] use the input text and images to retrieve relevant knowledge from a knowledge corpus and inject it into the MNER and Multimodal Relation Extraction (MRE) tasks, and they propose a Mixture of Experts (MoE) for MNER and MRE that combines the predictions of text and image models to make a final decision, an approach rarely seen in previous research. However, refs. [10][12] also do not effectively solve the two problems mentioned above.
Therefore, we propose a visual clue guidance and consistency matching framework (GMF) that can effectively reduce the impact of mismatched text-image pairs, enrich the information of visual modalities, and make the representation between images and text more consistent.

Methodology
In this section, we describe in detail how GMF is applied to multimodal named entity recognition (MNER). Before presenting the proposed approach, we give an overview of the MNER task.

Task Definition
The multimodal named entity recognition (MNER) task is to extract and categorize specific entities from a given sentence S along with its corresponding image I. The central challenge of this task is to accurately assign each entity to one of the predefined categories. Following existing research, we formulate MNER as a sequence-labeling task. Specifically, $S = (s_1, s_2, \ldots, s_n)$ denotes the sentence, where each $s_i$ represents the i-th word. We use $Y = (y_1, y_2, \ldots, y_n)$ to denote the entity labels, where $y_i$ is the label of word $s_i$ in S. These labels adhere to the BIO2 annotation scheme [19].
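To make the sequence-labeling view concrete, the following minimal sketch shows how a BIO2 label sequence maps to typed entity spans. It is an illustration only: the helper name `bio2_to_spans` is hypothetical, and the handling of a stray `I-` label (it is ignored) is an assumption rather than part of the task definition.

```python
from typing import List, Tuple


def bio2_to_spans(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIO2 labels into (start, end_exclusive, entity_type) spans."""
    spans = []
    start, etype = None, None
    for i, label in enumerate(labels):
        # An "I-X" label continues the current entity only if the types match.
        inside = label.startswith("I-") and etype == label[2:]
        if not inside:
            if etype is not None:
                spans.append((start, i, etype))  # close the open entity
                start, etype = None, None
            if label.startswith("B-"):
                start, etype = i, label[2:]      # open a new entity
            # Assumption: a stray "I-X" with no open "X" entity is ignored.
    if etype is not None:
        spans.append((start, len(labels), etype))
    return spans


# The example sentence from the Introduction, labeled as in the paper (MISC).
sentence = ["Rocky", "is", "ready", "for", "snow", "season"]
labels = ["B-MISC", "O", "O", "O", "O", "O"]
spans = bio2_to_spans(labels)
```

Here `spans` contains a single MISC entity covering the token "Rocky"; an entity counts as correct in the evaluation only when both its boundaries and its type match the gold span.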

Framework
Our visual clue guidance and consistency matching framework (GMF) is illustrated in Figure 4 and is composed of five key components: (1) input representations, (2) a visual clue guidance module, (3) a cross-scale attention module, (4) a consistency matching module, and (5) a CRF decoder. We first describe how the input data are transformed and then explain the function and role of each module. Finally, we describe the training strategy for the MNER task in detail.

Text Feature Extraction
To capture the deep semantic information in the text, we use BERT as the text encoder. Before passing the sentence to BERT, we add the special tokens [CLS] and [SEP] at its beginning and end, respectively. The extended sentence is thus denoted as $S' = (s_0, s_1, s_2, \ldots, s_n, s_{n+1})$, where $s_0$ represents the [CLS] token and $s_{n+1}$ represents the [SEP] token. After passing $S'$ to BERT, we obtain the encoded output sequence $C = (c_0, c_1, c_2, \ldots, c_n, c_{n+1})$. We then process $c_0$ with a fully connected layer with a Tanh activation function to obtain the overall text representation $C_g$.

Image Feature Extraction
Given an image, we followed the approach outlined by [1,20]

Visual Clue Guidance Module
The visual clue guidance module serves two primary purposes. Firstly, it captures multiple visually relevant objects associated with the text in a mutually aligned text-image pair, thereby enhancing the semantic knowledge available for information extraction. Secondly, global image features often encompass abstract conceptual information, which provides the model with only a weak learning signal. In light of this, we aggregate various visual clues to strengthen the recognition of multimodal named entities. This strategy not only incorporates local region features as crucial clues but also introduces the global image as auxiliary information. Accordingly, we use the image feature extraction strategy described in Section 3.3.2. In computer vision research, fusing features from different blocks of pretrained models [21][22][23] has been shown to be an effective way to enhance model performance. To fully leverage these findings, we focus on exploring the potential of feature pyramids in multimodal scenarios. For this purpose, we propose embedding layered image features in each Transformer layer. Encoding the image with the above strategy first generates a collection of pyramid feature maps at different scales, $\{F_1, F_2, \ldots, F_b\}$, as shown in Figure 5.

Cross-Scale Attention Module
The detailed structure of the cross-scale attention (CSA) module is depicted in Figure 6. Given an input feature map $F \in \mathbb{R}^{C \times H \times W}$, CSA infers a 3D attention map $M(F) \in \mathbb{R}^{C \times H \times W}$. The refined feature map $F'$ is computed as follows:
$F' = F + F \otimes M(F)$,
where $\otimes$ denotes element-wise multiplication. We employed a residual learning scheme combined with attention mechanisms to facilitate gradient flow. To design an efficient yet powerful module, we first computed channel attention $M_c(F) \in \mathbb{R}^{C}$ and spatial attention $M_s(F) \in \mathbb{R}^{H \times W}$ separately in two branches. The attention map $M(F)$ was then calculated as follows:
$M(F) = \sigma(M_c(F) + M_s(F))$,
where $\sigma$ is the sigmoid function. The outputs from both branches were resized to $\mathbb{R}^{C \times H \times W}$ before being added.
Channel attention branch.
We leverage relationships between channels, since each channel contains specific feature responses. To aggregate the feature map in each channel, we apply global average pooling to the feature map $F$, generating a channel vector $F_c \in \mathbb{R}^{C \times 1 \times 1}$. This vector softly encodes global information across all channels. To estimate cross-channel attention from the channel vector $F_c$, we employ a multilayer perceptron (MLP) with a single hidden layer. To limit the parameter cost, the hidden activation size is set to $\mathbb{R}^{C/r \times 1 \times 1}$, where r is the reduction ratio. Following the MLP, we add a batch normalization (BN) layer [24] to match the scale of the output to that of the spatial branch. In summary, the channel attention computation is as follows:
$M_c(F) = \mathrm{BN}(\mathrm{MLP}(\mathrm{AvgPool}(F)))$.

Spatial attention branch.
A spatial attention map $M_s(F) \in \mathbb{R}^{H \times W}$ is generated to emphasize or suppress features at different spatial positions. It is crucial to use contextual information to determine which spatial locations should be attended to, and using contextual information effectively requires a large receptive field. We employ dilated convolution [25] to efficiently expand the receptive field, as dilated convolution is known to construct spatial mappings more effectively than standard convolution. Our spatial branch adopts the "bottleneck structure" recommended by ResNet [16], which reduces the parameter count and computational cost. Specifically, the feature map $F \in \mathbb{R}^{C \times H \times W}$ is projected to the reduced dimension $\mathbb{R}^{C/r \times H \times W}$ through a 1 × 1 convolution, integrating and compressing the feature map in the channel dimension. For simplicity, we use the same reduction ratio r as in the channel branch. After the reduction, two 3 × 3 dilated convolutions are applied to leverage contextual information. Finally, the features are reduced again, to the spatial attention map in $\mathbb{R}^{1 \times H \times W}$, through a 1 × 1 convolution. To adjust the scale, a batch normalization layer is applied at the end of the spatial branch. In summary, the spatial attention computation is as follows:
$M_s(F) = \mathrm{BN}(f_3^{1 \times 1}(f_2^{3 \times 3}(f_1^{3 \times 3}(f_0^{1 \times 1}(F)))))$,
where f denotes a convolution operation, BN denotes a batch normalization operation, and the superscript denotes the size of the convolution filter. There are two 1 × 1 convolutions for channel reduction, and the intermediate 3 × 3 dilated convolutions aggregate contextual information with a larger receptive field.
In the mapping from the pyramid feature maps to the visual features, m denotes the m-th block of the main model and b denotes the number of blocks in the visual backbone model (in this case, ResNet has four blocks). $\mathrm{Conv}_{1 \times 1}$ denotes the 1 × 1 convolution operation used to adjust the number of channels, and the Pool operation ensures that all features are mapped to the same spatial dimension.
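The way the two branches combine can be illustrated with a small, dependency-free sketch. It assumes the residual form $F' = F + F \otimes M(F)$ with $M(F) = \sigma(M_c(F) + M_s(F))$ suggested by the description above, and it takes the branch outputs as given numbers: the MLP, the dilated convolutions, and batch normalization are deliberately not implemented, and the function name `refine` is hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def refine(feature, channel_att, spatial_att):
    """Apply F' = F + F * sigmoid(M_c + M_s) element-wise.

    feature:     C x H x W nested lists
    channel_att: length-C list (one value per channel, broadcast over H x W)
    spatial_att: H x W nested lists (broadcast over channels)
    """
    refined = []
    for c, plane in enumerate(feature):
        out_plane = []
        for h, row in enumerate(plane):
            out_row = []
            for w, value in enumerate(row):
                # Broadcasting: channel value + spatial value, then sigmoid.
                m = sigmoid(channel_att[c] + spatial_att[h][w])
                out_row.append(value + value * m)  # residual connection
            out_plane.append(out_row)
        refined.append(out_plane)
    return refined


# Toy input with C=2, H=1, W=2; the attention values are made up.
feature = [[[1.0, 2.0]], [[-1.0, 4.0]]]
channel_att = [0.0, 2.0]
spatial_att = [[0.0, -2.0]]
refined = refine(feature, channel_att, spatial_att)
```

Because the sigmoid output lies in (0, 1), each refined value lies between the original value and twice that value, so the residual path keeps the input features intact while the attention map only scales them up selectively.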

Dynamic Gating
Because suitable feature representations for objects of different sizes appear at different scales, deciding which ResNet50 block to map to each layer of the Transformer is not straightforward. To address this challenge, we introduce a dense routing structure that enables the hierarchical multiscale visual features to interact with each layer of the Transformer. The motivation behind dynamic gating is to predict a normalization vector that describes the importance of the features from each visual block for a specific layer of the Transformer and is used for path selection. In the dynamic gating module, $g_i^{(l)} \in [0, 1]$ represents the path probability from the i-th block in ResNet50 to the l-th layer of the Transformer. The gate signal is given by $g^{(l)} = G^{(l)}(V) \in \mathbb{R}^{d}$, where $G^{(l)}(\cdot)$ is the gating function for the l-th layer of the Transformer and d is the number of blocks in ResNet50.
We now turn to the logit of the gating signal, in which $\sigma(\cdot)$ is the Leaky_ReLU activation function and P is the global average pooling layer. First, we apply average pooling to the features $V_i$ obtained from the i-th block (with a shape of $(d_i, h_i, w_i)$). Next, we accumulate these block features to obtain an average vector. Through the MLP layer $W_p$, we adjust the dimension of this vector to d, which yields a soft gating mechanism with continuous values as path probabilities. Finally, we obtain the probability vector $g^{(l)}$ for the l-th layer of the Transformer from these logits.
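The gating steps described above (pool each block, average across blocks, project to one logit per block, normalize) can be sketched as follows. This is a sketch under stated assumptions, not the paper's exact formulation: the MLP $W_p$ is reduced to a single linear layer, the final normalization is assumed to be a softmax, all blocks are assumed to share the same channel count after the 1 × 1 convolution, and the names `gate`, `weight`, and `bias` are hypothetical.

```python
import math
from typing import List


def leaky_relu(x: float, slope: float = 0.01) -> float:
    return x if x > 0 else slope * x


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def global_avg_pool(fmap) -> List[float]:
    """C x H x W nested lists -> length-C vector of per-channel means."""
    return [sum(sum(row) for row in plane) / (len(plane) * len(plane[0]))
            for plane in fmap]


def gate(block_features, weight, bias) -> List[float]:
    """Return one path probability per visual block for a single layer."""
    pooled = [global_avg_pool(f) for f in block_features]
    # Average the pooled vectors over blocks (channel-wise).
    mean_vec = [sum(col) / len(pooled) for col in zip(*pooled)]
    # Linear projection to d logits (one per block), then Leaky ReLU.
    logits = [leaky_relu(sum(w * x for w, x in zip(row, mean_vec)) + b)
              for row, b in zip(weight, bias)]
    return softmax(logits)  # assumption: softmax normalization


# Two toy blocks with C=2 channels and different spatial sizes.
block_features = [
    [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 0.0]]],  # 2 x 2 x 2
    [[[2.0]], [[4.0]]],                                      # 2 x 1 x 1
]
weight = [[0.1, 0.0], [0.0, 0.1]]  # made-up parameters for this layer
bias = [0.0, 0.0]
g = gate(block_features, weight, bias)
```

In the framework, a separate set of gating parameters would be learned for each Transformer layer, so that different layers can weight the ResNet50 blocks differently.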

Hierarchical Feature Fusion
Based on the dynamic gating signal $g^{(l)}$, we form the fused hierarchical visual features $\tilde{V}^{(l)}_{gated}$ corresponding to the l-th layer of the Transformer. To represent the visual features $\tilde{V}^{(l)}_{gated}$ for the l-th layer of the Transformer more precisely, we then apply a concatenation operation. This structure enhances the hierarchical representation of the text modality through the attention mechanism based on visual clue guidance.

Visual Clue Guidance
We treat the hierarchical multiscale visual features as visual clue guidance and insert these visual clue guidance sequences into the text sequence in the self-attention layers of BERT. Specifically, for a given input sequence $S = (s_1, s_2, \ldots, s_n)$, the context representation $C \in \mathbb{R}^{n \times d}$ first undergoes linear projections to obtain the Q, K, and V vectors. For the integrated hierarchical visual features $\tilde{V}^{(l)}_{gated}$, we use a linear transformation $W_{\phi}^{l} \in \mathbb{R}^{d \times 2 \times d}$ (for the l-th layer) to project them into the embedding space of the text representation. This projection defines the visual cues $\phi_k^l, \phi_v^l \in \mathbb{R}^{hw(m+1) \times d}$, where $hw(m + 1)$ represents the length of the visual sequence and m denotes the number of visual objects detected by the object detection algorithm. The visual clue attention is then computed over the text and the visual cues jointly. We take the hierarchical multiscale visual features as visual cue prompts for each fusion layer and update, layer by layer through the multimodal attention mechanism, the final hidden representation $H = (h_0, h_1, \ldots, h_{n+1})$ after text and image fusion.
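One common way to insert a visual sequence into a self-attention layer is to prepend the projected visual keys and values to the text keys and values, so that every text query attends over both. The sketch below illustrates that pattern for a single head; it assumes scaled dot-product attention of the form softmax(Q[φ_k; K]ᵀ/√d)[φ_v; V], which is an assumption about the lost formula rather than a confirmed reproduction, and the function name `prefix_attention` is hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def prefix_attention(Q: Matrix, K: Matrix, V: Matrix,
                     prefix_k: Matrix, prefix_v: Matrix) -> Matrix:
    """Single-head attention where visual clue keys/values are prepended."""
    keys = prefix_k + K      # visual clue keys first, then text keys
    values = prefix_v + V    # same ordering for the values
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# One text token and one visual clue, with d = 2 (toy values).
Q = [[0.0, 0.0]]
K = [[1.0, 0.0]]
V = [[2.0, 0.0]]
phi_k = [[0.0, 1.0]]
phi_v = [[0.0, 4.0]]
hidden = prefix_attention(Q, K, V, phi_k, phi_v)
```

The output keeps one row per text token, so the length of the text sequence is unchanged while each hidden state becomes a mixture of text and visual values.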

Consistency Matching Module
To address the inconsistency that arises when the two modalities are integrated, we designed a cross-modal consistency matching module to ensure the consistency of text and image representations. The module is inspired by contrastive learning [26][27][28]. It takes the text representation $C_g$ and the global representation of the image $P_g$ as inputs, and the overall process can be summarized in the following three steps. First, from the input pairs $(C_g, P_g)$ in a batch of size N, we generate positive and negative samples. Here, $C_g^m$ represents the text representation for the m-th pair in the batch, and $P_g^n$ is the image representation for the n-th pair. We define positive samples as the text and image representations from the same data pair, $\{(C_g^m, P_g^n)\}_{m=n}$, while negative samples are taken from different data pairs, $\{(C_g^m, P_g^n)\}_{m \neq n}$. Although there might be a small number of mismatched positive samples, the literature [3] suggests that their impact is negligible. Next, for each $(C_g^m, P_g^n)$ example, we process the two representations using two independent multilayer perceptrons (MLPs) to obtain the text representation $C_n \in \mathbb{R}^{d}$ and the image representation $P_n \in \mathbb{R}^{d}$. This MLP processing helps the encoder obtain better representations, consistent with the findings of prior studies. Finally, we adjust the similarity of positive and negative samples by minimizing two contrastive loss functions, one from image to text and one from text to image. For the i-th positive sample, the image-to-text loss is computed from the cosine similarity between $P_n^i$ and $C_n^i$, scaled by the temperature coefficient τ. Similarly, for the i-th positive sample, the text-to-image loss is computed with the roles of the two modalities exchanged. Finally, integrating the losses of all positive samples, we define the total loss as a combination of the two directions weighted by a hyperparameter $\lambda \in [0, 1]$. Minimizing this loss function makes the representations of the two modalities more consistent.
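The three steps above can be sketched with a symmetric in-batch contrastive loss. This is a minimal sketch, assuming the standard InfoNCE form (negative log-softmax of the matched pair's cosine similarity over the batch) and assuming that λ weights the image-to-text direction and 1 − λ the text-to-image direction; neither detail is confirmed by the text above. The function names are hypothetical, the default temperature 0.21 is the value reported for Twitter-2015, and `lam=0.5` is a placeholder.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchors: List[Vector], candidates: List[Vector],
             tau: float) -> List[float]:
    """Per-sample loss: -log softmax of the matched pair within the batch."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / tau for c in candidates]
        m = max(logits)  # log-sum-exp with max subtraction for stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return losses


def consistency_loss(text: List[Vector], image: List[Vector],
                     tau: float = 0.21, lam: float = 0.5) -> float:
    image_to_text = info_nce(image, text, tau)
    text_to_image = info_nce(text, image, tau)
    per_sample = [lam * a + (1.0 - lam) * b
                  for a, b in zip(image_to_text, text_to_image)]
    return sum(per_sample) / len(per_sample)


# Toy batch of N = 2 pairs after the projection MLPs (made-up vectors).
text = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
swapped_images = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = consistency_loss(text, aligned_images)
loss_swapped = consistency_loss(text, swapped_images)
```

The aligned batch yields a much lower loss than the swapped one, which is the behavior the module relies on: matched pairs are pulled together and mismatched pairs in the same batch are pushed apart.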

CRF Decoder
After incorporating the contextual information from the image, we employ a Conditional Random Field (CRF) as the decoder for the MNER task. The CRF decoder takes the final hidden representation H described in Section 3.5 and, given the original text S mentioned in Section 3.1 and the corresponding image I, predicts the conditional probability of the label sequence y. This probability is defined in terms of $E_{h_i, y_i}$ and $T_{y_i, y_{i+1}}$, which denote the emission score for the i-th label $y_i$ and the transition score from $y_i$ to $y_{i+1}$, respectively. $w^{y_i}_{MNER}$ represents the weight parameter for $y_i$, and the normalization factor $Z(H)$ sums the emission and transition scores over all possible sequences y. To train this module, we use the log-likelihood loss over the training data batch $D_{mner} = \{S^j, I^j, y^j\}_{j=1}^{N}$.
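The role of the emission scores, the transition scores, and the normalization factor can be illustrated with a small linear-chain CRF sketch. It assumes the standard formulation in which a sequence's score is the sum of its emission and transition scores and $Z(H)$ is obtained with the forward algorithm; start and end transitions are omitted, and all numbers are made up for illustration.

```python
import math
from itertools import product
from typing import List


def logsumexp(xs: List[float]) -> float:
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def sequence_score(emissions, transitions, labels) -> float:
    """Sum of emission scores E[i][y_i] and transition scores T[y_i][y_i+1]."""
    score = sum(emissions[i][y] for i, y in enumerate(labels))
    score += sum(transitions[a][b] for a, b in zip(labels, labels[1:]))
    return score


def log_partition(emissions, transitions) -> float:
    """Forward algorithm: log Z, the log-sum of exp(score) over all sequences."""
    alpha = list(emissions[0])
    for emit in emissions[1:]:
        alpha = [
            emit[j] + logsumexp([alpha[i] + transitions[i][j]
                                 for i in range(len(alpha))])
            for j in range(len(emit))
        ]
    return logsumexp(alpha)


def neg_log_likelihood(emissions, transitions, labels) -> float:
    return (log_partition(emissions, transitions)
            - sequence_score(emissions, transitions, labels))


# Three tokens, two labels (toy scores; in the model, emissions come from H).
emissions = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.5]]
transitions = [[0.1, -0.4], [-0.2, 0.3]]
gold = [0, 1, 1]
nll = neg_log_likelihood(emissions, transitions, gold)

# Sanity check: probabilities of all 2^3 label sequences sum to one.
total_prob = sum(
    math.exp(-neg_log_likelihood(emissions, transitions, list(seq)))
    for seq in product(range(2), repeat=len(emissions))
)
```

The forward algorithm computes the normalizer in time linear in the sentence length, whereas the brute-force enumeration used for the sanity check grows exponentially and is only feasible for toy inputs.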

Model Training
In summary, our framework encompasses both a supervised learning task (MNER) and an auxiliary self-supervised learning task (consistency matching). To train and optimize them jointly, we combine the two losses into a comprehensive loss function, in which $L_{CM\_Loss}$ represents the loss of the consistency matching module (see Section 3.5), $L_{mner}$ corresponds to the loss of the MNER task (see Section 3.6), and α is a hyperparameter that needs to be tuned.

Experiments

Dataset
For evaluation, we used two widely adopted datasets, Twitter-2015 and Twitter-2017, which were provided by [3,4], respectively. Each data sample consists of a {sentence, image} pair. Because the image modality is missing from some samples as a result of the collection process, we took measures to ensure data integrity and consistency: samples lacking image information were uniformly assigned a predefined blank image. Details regarding the data distribution and the number of entities of each class can be found in Table 1.

Parameter Settings
In the experimental phase, we used an NVIDIA RTX 3090 GPU and PyTorch version 1.13.1 for all tasks and experiments. To ensure the accuracy and efficiency of the framework, we set specific parameters and configuration strategies for each module. For text encoding, we chose bert-base-uncased as the text encoder. For image processing, we used ResNet50 as the image encoder. Because the two Twitter datasets differ, we used the following hyperparameter settings: the temperature coefficient is 0.21 for the Twitter-2015 dataset and 0.174 for the Twitter-2017 dataset. In addition, our uniform hyperparameter setting is 0.7. Regarding the training strategy, for the Twitter-2015 dataset, we set the training batch size to 16, the learning rate to 5 × 10^−5, and the random seed to 1234. For the Twitter-2017 dataset, we set the training batch size to 32 and kept the learning rate at 5 × 10^−5 and the random seed at 1234. To choose the most appropriate number of epochs, we refer to Figure 7 and set the epochs for Twitter-2015 and Twitter-2017 to 30 and 50, respectively. These parameters and configuration choices were made so that our model achieves its best performance and accuracy on each dataset.

Baselines
To comprehensively evaluate the effectiveness of our proposed visual clue guidance and consistency matching framework (GMF), we selected representative text-based and multimodal NER models for comparison.
Text-based NER methods include BiLSTM-CRF [29], a classical NER model employing a bidirectional LSTM structure and a CRF layer; HBiLSTM-CRF [30], a variant of CNN-BiLSTM-CRF that replaces the CNN layer with an LSTM for character-level word representations; BERT [19], a multilayer bidirectional Transformer encoder followed by a softmax decoder; BERT-CRF, which is similar to BERT but uses a CRF decoder for label prediction; and T-NER [3,31], an NER system designed specifically for tweets that uses a range of effective features, including dictionaries, context, and positive word features.
Multimodal NER methods include GVATT-HBiLSTM-CRF [4], which combines HBiLSTM-CRF with an attention mechanism to obtain representations that merge text and visual information, and UMT-BERT-CRF [7], a leading multimodal NER model comprising a multimodal interaction module and an auxiliary module for pure-text entity span detection. Additionally, we include other state-of-the-art multimodal models for comparison: MAF [32], which calculates similarity scores between text and images and uses the score to determine the proportion of visual information to retain; UMGF [1], which proposes a unified multimodal graph fusion method to capture fine-grained semantic correspondences between semantic units of different modalities; BFCL [33], which leverages a Transformer-based bottleneck fusion mechanism to reduce noise propagation from the visual modality; and DebiasCL [34], which uses hard sample mining and a debiased contrastive loss to alleviate biases in both the quantity and the entity types, enabling global learning for aligning the text and image feature spaces. Lastly, GMF, the model proposed in this study, aims to enhance MNER performance by integrating visual clue guidance and consistency matching.

Effectiveness
We use precision (P), recall (R), and the F1 score (F1) as metrics, evaluated on two benchmark MNER datasets. An entity is judged to be correctly predicted only if both its boundaries and its category are correctly predicted. Precision is the proportion of correct entities among the predicted entities, and recall is the proportion of correctly predicted entities among all entities in the sample. A high precision indicates that a high percentage of the predicted entities are correct, but not necessarily that the model covers many of the entities in the data. A high recall indicates that the model recovers most of the entities in the data, but not necessarily that a high proportion of its predictions are correct. The F1 score is the harmonic mean of precision and recall, so it accounts for both the accuracy of the predictions and the number of correct entities covered.
Table 2 displays the performance of four text-based models and eight multimodal models on Twitter-2015, while Table 3 presents their performance on Twitter-2017. Below is a detailed analysis. Firstly, comparing all text-based NER methods, it is evident from the data that methods based on BERT outperform the others, indicating the potential advantage of optimizing pretrained models through transfer learning rather than starting from scratch. Additionally, approaches combining BERT with a CRF surpass pure BERT strategies, suggesting the crucial role of the CRF in capturing label constraints and thereby achieving more accurate label predictions.
Next, we compare the MNER methods with their competitive text-based counterparts, for example GVATT-HBiLSTM-CRF against HBiLSTM-CRF. The majority of multimodal strategies notably outperform their corresponding text-only models, further indicating the contribution of image information to named entity recognition in text. Furthermore, compared to the latest MNER method, DebiasCL, our GMF method incorporates both the proposed visual clue guidance module and the consistency matching module, which enhances the image information aligned with the text. Finally, on Twitter-2015 and Twitter-2017, our model shows overall F1 improvements of 0.53% and 0.27%, respectively. This suggests that the introduced modules can assist the model in better integrating text and image representations.
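The entity-level metrics described in this section can be illustrated with a minimal sketch. It assumes BIO-style tag sequences and half-open token spans; the function names `extract_entities` and `entity_prf1` are hypothetical and are not taken from the GMF implementation. An entity counts as correct only when its boundaries and its type both match the gold annotation.

```python
from typing import List, Set, Tuple

Entity = Tuple[int, int, str]  # (start, end, type); end is exclusive


def extract_entities(tags: List[str]) -> Set[Entity]:
    """Collect (start, end, type) spans from a BIO tag sequence (assumed scheme)."""
    entities: Set[Entity] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues_span = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues_span:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def entity_prf1(gold_tags: List[List[str]], pred_tags: List[List[str]]):
    """Exact-match precision, recall, and F1 over a list of sentences."""
    n_gold = n_pred = n_correct = 0
    for gold_seq, pred_seq in zip(gold_tags, pred_tags):
        gold, pred = extract_entities(gold_seq), extract_entities(pred_seq)
        n_gold += len(gold)
        n_pred += len(pred)
        n_correct += len(gold & pred)  # boundaries and type must both match
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0  # harmonic mean
    return precision, recall, f1


# Toy data: the second entity of the first sentence has the right span but the wrong type.
gold = [["B-PER", "I-PER", "O", "B-LOC"], ["B-MISC", "O"]]
pred = [["B-PER", "I-PER", "O", "B-ORG"], ["B-MISC", "O"]]
p, r, f1 = entity_prf1(gold, pred)
```

In this toy example, two of the three predicted entities match the gold annotation exactly, so precision, recall, and F1 are all 2/3. The span with the correct boundary but the wrong type earns no credit.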

Ablation Study
In order to evaluate the contribution of the visual clue guidance module (VCG), the cross-scale attention module (CSA), and the consistency matching module (CM) to the overall performance of the GMF model, we conducted detailed ablation experiments. The data in Table 4 show the following: (1) Overall, the visual clue guidance module, cross-scale attention module, and consistency matching module all contribute significantly to the performance of the GMF model. (2) In the absence of the visual clue guidance module (labeled "w/o VCG"), the F1 score for the Twitter-2015 dataset decreased by 0.74%, while that for the Twitter-2017 dataset decreased by 0.66%. This emphasizes the critical role of the visual clue guidance module in integrating and enhancing visual features. (3) In the absence of the cross-scale attention module (labeled "w/o CSA"), the F1 score for the Twitter-2015 dataset decreased by 0.58%, while that for the Twitter-2017 dataset decreased by 0.5%. Both decreases are smaller than those observed for w/o VCG and w/o CM, suggesting that CSA plays a supporting role in GMF, assisting the other modules. (4) When the consistency matching module is removed (labeled "w/o CM"), the F1 score for the Twitter-2015 dataset decreased by 0.87%, while that for the Twitter-2017 dataset decreased by 0.78%. This indicates that the consistency matching module effectively aligns the text with the relevant image regions. (5) With the removal of the visual clue guidance module, the cross-scale attention module, and the consistency matching module (labeled "w/o VCG + CSA + CM"), the F1 score for the Twitter-2015 dataset decreased by 1.18%, while that for the Twitter-2017 dataset decreased by 0.99%. This shows that the three modules together alleviate the problems caused by text-image mismatch, enrich the information of the visual modality, and ensure a high degree of consistency between the textual and visual representations.

Case Study
Figure 8 presents a case study comparing our proposed GMF method with DebiasCL. In the first scenario, DebiasCL and GMF w/o CM misclassify the PER entity "Isis" as an ORG entity, whereas GMF and GMF w/o VCG identify it accurately. Similarly, in the second scenario, DebiasCL and GMF w/o CM fail to recognize the MISC entity "Oscars", while GMF and GMF w/o VCG do so correctly. These examples underscore the efficacy of the CM module in associating entities in the text with relevant image regions, thereby enhancing the accuracy of entity recognition. In the third scenario, DebiasCL and GMF w/o VCG incorrectly classify the MISC entity "Mufasa" as a PER entity. This misclassification may be attributed to the metaphorical emotions depicted in the image and the significant noise introduced by the characters, which pose challenges for the MNER model. Conversely, our full GMF model and GMF w/o CM both accurately predict the MISC entity "Mufasa". This instance illustrates the effectiveness of the VCG module in filtering image noise and leveraging deep textual semantics for accurate entity prediction. Moving forward, we intend to investigate the visual clue guidance module further and explore more advanced strategies for improved entity recognition accuracy. Additionally, we aim to apply our method to other multimodal tasks such as multimodal relationship extraction and multimodal entity linking.

Further Analysis
In order to better understand the importance of the three main contributions (the VCG, CSA, and CM modules) in our proposed GMF approach, we performed additional analysis on both test sets. In Figure 9, we show the number of entities that BERT-CRF predicts incorrectly but each multimodal approach predicts correctly, and the number of entities that BERT-CRF predicts correctly but each multimodal approach predicts incorrectly. First, we can see in Figure 9a,b that our GMF method corrects more entities than the three multimodal baselines. In addition, we can see in Figure 9c,d that the GMF method introduces fewer errors than the three multimodal baselines. We believe the reason is that the VCG module significantly reduces the bias caused by the visual context, the auxiliary CSA module reduces inter-scale noise and strengthens the ability to capture details, and the CM module narrows the semantic gap between text and images.
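The two counts plotted in Figure 9 can be illustrated with a small sketch. It assumes that each entity is represented as a (sentence id, start, end, type) tuple and that "correct" means an exact match with the gold annotation; the function name `compare_with_baseline` and the toy data are hypothetical, and the exact counting protocol used for the figure may differ.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, int, str]  # (sentence_id, start, end, type)


def compare_with_baseline(gold: Set[Entity],
                          baseline: Set[Entity],
                          multimodal: Set[Entity]) -> Tuple[int, int]:
    """Return (corrected, broken) entity counts relative to a text-only baseline."""
    baseline_correct = gold & baseline
    multimodal_correct = gold & multimodal
    # Gold entities the baseline got wrong but the multimodal model got right.
    corrected = multimodal_correct - baseline_correct
    # Gold entities the baseline got right but the multimodal model got wrong.
    broken = baseline_correct - multimodal_correct
    return len(corrected), len(broken)


# Toy data (not from the paper): three gold entities in three sentences.
gold = {(0, 0, 1, "MISC"), (1, 0, 1, "PER"), (2, 3, 4, "LOC")}
bert_crf = {(0, 0, 1, "PER"), (1, 0, 1, "PER"), (2, 3, 4, "LOC")}
multimodal = {(0, 0, 1, "MISC"), (1, 0, 1, "PER"), (2, 3, 4, "ORG")}

n_corrected, n_broken = compare_with_baseline(gold, bert_crf, multimodal)
```

In the toy data, the multimodal predictions fix one entity that the baseline mistyped and break one entity that the baseline had right. A larger first count and a smaller second count correspond to the pattern reported for GMF in Figure 9.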

Conclusions
In this paper, we propose a visual clue guidance and consistency matching framework (GMF), which improves the state-of-the-art performance of multimodal named entity recognition for social media posts. Specifically, the VCG module is used to combine low-level, high-resolution image information with high-level, strongly semantic image information. On this basis, the CSA module is used to enrich the information of the visual modality. In addition, the CM module is used to minimize the contrastive loss and align the entities in the text with the most relevant objects in the image. We conducted a number of experiments, ablation studies, case studies, and further analyses to show that the VCG module can help the model filter out most of the image information that is irrelevant to the text and reduce the effect of image mismatch on the text. The CSA module mitigates cross-scale interference and enhances the ability of images to capture details by augmenting visual features. The CM module can help the model establish a connection between the named entities in the text and the regions in the image where the corresponding objects are located, and it reduces interactions with other regions in the image.
However, the method proposed in this paper still has some shortcomings. For example, the model has not been stress tested in different environments. In the future, we plan to enlarge the dataset of social media posts to better reflect the diversity and uncertainty of the data and to stress test the model under different data distributions, noise levels, and so on. In addition, we would like to apply the methods in this paper to other multimodal tasks, such as multimodal relationship extraction and multimodal entity linking.

Figure 1. An example of multimodal tweets. In this tweet, "Rocky" is the name of the dog.
to first employ a visual localization toolkit to locate the top salient local objects in the image. On this basis, we adjusted these local visual objects and the global image to a consistent size of 224 × 224 pixels, serving as the global image $I$ and visual objects $O = (o_0, o_1, o_2, \ldots, o_n)$. For image processing, we chose ResNet50 to obtain the regional and global representations of the image. The regional representation $D = (d_1, d_2, \ldots, d_{48}, d_{49})$ comes from the last convolutional layer of ResNet, with dimensions 2048 × 7 × 7, where 7 × 7 = 49 denotes the number of regions in the image and 2048 is the dimension of each region's representation, with each region's size being 32 × 32 pixels. We utilized a 7 × 7 average pooling layer to obtain the global representation $V_g \in \mathbb{R}^{2048}$, representing the entire image.
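The shapes involved in the regional and global representations can be made concrete with a short NumPy sketch. It does not run ResNet50; a random array of shape 2048 × 7 × 7 stands in for the output of the last convolutional layer, and the function name `region_and_global_features` is hypothetical. The row-major ordering of the 49 regions is likewise an assumption.

```python
import numpy as np


def region_and_global_features(feature_map: np.ndarray):
    """Split a (2048, 7, 7) feature map into 49 region vectors and one global vector."""
    channels, height, width = feature_map.shape
    # Each of the height * width spatial cells becomes one region vector d_k.
    regions = feature_map.reshape(channels, height * width).T  # (49, 2048)
    # A 7 x 7 average pooling over the spatial grid yields the global vector V_g.
    global_vec = feature_map.mean(axis=(1, 2))  # (2048,)
    return regions, global_vec


# Stand-in for the ResNet50 output on one 224 x 224 image (random values, illustrative only).
rng = np.random.default_rng(0)
fmap = rng.standard_normal((2048, 7, 7)).astype(np.float32)

D, V_g = region_and_global_features(fmap)
# Each 7 x 7 cell covers 224 / 7 = 32 pixels per side in the input image.
cell_size = 224 // 7
```

Under these assumptions the global vector is simply the mean of the 49 region vectors, which is why a single average pooling layer suffices to summarize the entire image.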

Figure 6. Structure diagram of the cross-scale attention module.

which allows a model with stronger information perception for the image $F'$ to be obtained. Next, we adopt a structured mapping function $F(\theta)$ for feature fusion, including $\mathrm{Conv}_{1 \times 1}$, Pool, and Concat, represented as follows:

Figure 7. GMF test results on the development set and loss on the training set.

Figure 9. (a,b) both show the number of entities incorrectly predicted by BERT-CRF but corrected by each multimodal method (shown on the y-axis). (c,d) both show the number of entities correctly predicted by BERT-CRF but incorrectly predicted by each multimodal method (shown on the y-axis).

Table 1. Summary statistics of the two MNER datasets.

Table 4. Ablation studies of the GMF model.