1. Introduction
With the rapid development of social media platforms, Twitter, WeChat, and Weibo have gradually become important media for the public to express opinions and disseminate information. Users generate vast amounts of textual content on these platforms around various social events, and this content contains rich entity information. Effectively identifying and extracting named entities from such data helps uncover the core subjects of public concern and focal points of public opinion, thereby providing crucial data support for social opinion analysis and related decision-making. Named Entity Recognition (NER) [1], a foundational task in Natural Language Processing (NLP), aims to identify and classify entities with specific semantic categories, such as person names, location names, and organization names, in unstructured text [2]. Meanwhile, multimodal data encompassing images, text, videos, and other types has gradually become the mainstream medium for information dissemination. In this context, NER based solely on the text modality cannot integrate cross-type information from multimodal data, which makes entity mining in such data challenging [3]. As illustrated in Figure 1, multimodal tweets often contain ambiguous or incomplete textual information, making entity recognition highly challenging without visual context. In Figure 1a, the text mentions “Konrad Hilton”, which can be correctly identified as a person (PER). However, the accompanying image of a hotel building with the “Hilton” logo provides additional contextual evidence that reinforces the association between the person and the organization. This visual cue helps the model better understand the semantic relationship between entities beyond the textual description alone. In Figure 1b, the text “I love Alibaba” contains the entity “Alibaba”, which could be ambiguous in isolation. Without visual context, a model might erroneously classify “Alibaba” as a person (PER), mistaking it for a nickname or an individual’s name. However, with the support of the visual content showing the Alibaba logo and building, the entity can be accurately classified as an organization (ORG). These examples demonstrate that visual information plays a crucial role in resolving semantic ambiguity and enriching contextual understanding in multimodal named entity recognition, especially in social media scenarios where text is often short and noisy.
Multimodal Named Entity Recognition (MNER) aims to integrate rich scene, object, and semantic cues from images to address information loss caused by text sparsity. It also uses these cues to disambiguate polysemous or ambiguous expressions in text, thereby identifying more comprehensive and precise entity types and attributes [4].
Early studies mainly encoded the entire image into a global feature vector to enhance text representations with visual cues [5]. For instance, Moon et al. [6] utilized a joint modeling framework with an LSTM-CNN-based attention mechanism to integrate global information; Zhang et al. [7] employed an attention mechanism to guide the model in extracting entity-relevant visual cues for aligning image-text features; and Yu et al. [8] proposed a unified multimodal transformer incorporating entity span detection to improve MNER performance. Current research focuses on fine-grained image-text interaction mechanisms, extracting multi-scale visual features, and optimizing semantic alignment. For example, Bao et al. [9] designed a multi-level alignment contrastive pre-training framework to enhance entity recognition and precise localization; Xu et al. [10] proposed an adaptive mixing-based image enhancement strategy to refine image-text matching; and Wang et al. [11] introduced a dual-enhancement hierarchical alignment framework that explicitly models cross-modal hierarchical associations through global and local dual-path contrastive learning.
However, these methods still suffer from several limitations. Most approaches rely on fixed fusion strategies and lack adaptive regulation mechanisms guided by the semantic discrimination requirements of multimodal data, which may introduce irrelevant visual cues and weaken text-dominated semantic representations. In addition, the exploration of multi-granularity visual semantics remains insufficient, limiting their ability to compensate for information loss in textual entity recognition. Furthermore, existing methods primarily perform semantic fusion and similarity alignment at the feature representation level, without explicitly modeling intra-modal relationships or enforcing consistency in cross-modal correspondences. This often leads to semantic inconsistencies in complex scenarios, ultimately degrading entity recognition performance.
To address the above challenges, this paper proposes a Visual–Semantic Guided Interaction Network (VSGN) that explicitly integrates semantic enhancement and structure-aware alignment. To alleviate noise interference and insufficient semantic representation, the model first introduces a visual–semantic fusion mechanism guided by generated visual descriptions. By using descriptions as auxiliary semantic cues, the model enriches textual representations with complementary visual semantics. Meanwhile, a channel-wise inhibitory routing strategy is incorporated to selectively suppress redundant or noisy visual signals, leading to more reliable and discriminative cross-modal feature integration. To address the limitation of coarse cross-modal alignment, the model further incorporates a visual–semantic guided graph structure learning mechanism. Instead of relying on implicit attention, it explicitly models relational dependencies between textual and visual elements, enabling fine-grained alignment at the structural level and capturing complex cross-modal interactions. In addition, a distribution-level alignment constraint is introduced to enforce global semantic consistency between modalities, which complements the local structural alignment and provides more stable supervision for cross-modal learning.
The main contributions of this paper are summarized as follows:
- We propose a Visual–Semantic Guided Interaction Network (VSGN) for multimodal named entity recognition, which unifies semantic enhancement and structural alignment to effectively address cross-modal inconsistency, semantic ambiguity, and noise in social media data.
- We design a unified visual–semantic interaction mechanism that integrates channel-wise inhibitory routing and visual–semantic enhanced graph structure learning. Redundant or noisy visual signals are suppressed at the channel level to improve feature discriminability, while cross-modal structural dependencies are modeled through graph-based learning. Guided by generated visual descriptions, this mechanism bridges the semantic gap between modalities and enhances fine-grained interaction.
- Extensive experiments on benchmark datasets demonstrate the superiority of the proposed approach. Ablation studies and case analyses further confirm the effectiveness of each component in improving robustness and semantic consistency.
2. Related Work
In this section, we review prior work on multimodal named entity recognition and on the application of graph neural networks to this task.
2.1. Multimodal Named Entity Recognition
Recent advancements in MNER have evolved from simple feature concatenation to increasingly sophisticated cross-modal interaction and alignment mechanisms [
12]. Early studies primarily focused on enhancing textual representations by incorporating global or region-level visual features. For instance, Yu et al. [
13] utilized hierarchical index generation to achieve fine-grained alignment between image regions and textual entities, laying the foundation for integrating visual cues into entity recognition.
Subsequent research shifted towards more effective cross-modal interaction strategies, aiming to bridge the semantic gap between modalities. Wang et al. [
14] transformed visual information into context tokens to enable seamless interaction within textual encoding, while Jiang et al. [
15] introduced dual-similarity guidance to suppress redundant features and enhance discriminative representations. To further address implicit alignment issues, Wei et al. [
16] designed an association-aware layer with contrastive learning to strengthen cross-modal consistency, and Mu et al. [
17] proposed a multi-granularity framework to alleviate visual noise and capture complementary information across different visual levels.
More recent studies have explored richer semantic modeling and more robust alignment mechanisms. Chen et al. [
18] incorporated high-level visual attributes and external knowledge to mitigate modality bias, leveraging knowledge graph retrieval to enhance semantic understanding beyond raw visual signals. Guo et al. [
19] proposed the MGICL framework, which performs cross-modal contrastive learning across multiple granularities and introduces a visual gating mechanism to dynamically filter irrelevant visual information, thereby reducing noise and narrowing feature space discrepancies. Zheng et al. [
20] developed the AGBAN model, which employs fine-grained visual object features and bilinear attention to explicitly capture entity-object correspondences, and further adopts adversarial learning to map multimodal features into a shared invariant space, reducing distribution gaps.
Despite these advances, most existing methods still rely on sequence-based or feature-level interaction paradigms, which lack an explicit mechanism to model structured relationships within and across modalities. In particular, they struggle to capture complex dependencies among textual tokens and visual regions, as well as the non-local and many-to-many correspondences between them. This limitation motivates the introduction of graph-based approaches, which provide a natural and flexible framework for modeling structured interactions and enable unified representation of intra-modal and cross-modal relationships.
2.2. Graph Neural Networks for MNER
To overcome the limitations of sequence-based modeling, recent studies have introduced graph structures into MNER to explicitly model structured dependencies across modalities. By representing textual tokens and visual regions as nodes, graph-based methods can capture non-local interactions and complex relational structures that are difficult to model using sequential architectures.
Graph Neural Networks (GNNs) have been widely adopted to aggregate node features based on topological connections, enabling the joint modeling of local and global contextual semantics [
21]. More importantly, graph structures provide a unified framework for encoding intra-modal relationships (e.g., dependencies between words or visual regions) and inter-modal interactions (e.g., alignments between entity spans and visual cues). For example, Zhang et al. [
22] constructed a unified multimodal graph with stacked fusion layers to jointly model intra- and inter-modal interactions, while Zhao et al. [
23] further enhanced graph representations by incorporating external matching signals across text-image pairs.
However, existing graph-based approaches still face two key limitations. First, most methods rely on relatively shallow aggregation mechanisms, which are insufficient to capture deep semantic associations between nodes, thereby limiting contextual understanding. Second, cross-modal structural alignment is often coarse-grained, failing to establish precise correspondences between fine-grained visual cues and textual entity spans.
Therefore, there remains a need for a more effective approach that can deeply model inter-node semantic relationships and achieve fine-grained cross-modal structural alignment. Motivated by these challenges, we propose a Visual–Semantic Guided Graph Interaction Network for Multimodal Named Entity Recognition, aiming to enhance semantic representation and establish precise semantic alignment between visual cues and textual entities through structured interaction learning.
4. Experiments
This section presents the overall experimental setup and evaluation of the proposed VSGN model. We introduce the datasets and implementation details, followed by comparisons with baseline methods. In addition, ablation studies and further analyses are conducted to validate the effectiveness and robustness of the model.
4.1. Dataset
This study conducts experiments on two widely used benchmark datasets for Multimodal Named Entity Recognition (MNER), namely Twitter-2015 [27] and Twitter-2017 [7]. Both datasets are collected from the social media platform Twitter and consist of tweets paired with corresponding images, forming multimodal samples that integrate textual and visual information. Each tweet is manually annotated with four predefined named entity categories: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). These annotations enable the evaluation of models in identifying entities from multimodal contexts where textual information alone may be incomplete or ambiguous. The detailed statistical information of the datasets is summarized in Table 1.
4.2. Parameter Settings
For a fair comparison, the same hyperparameter settings are applied to both the Twitter-2015 and Twitter-2017 datasets. All experiments are implemented using the PyTorch 2.4.0 deep learning framework on a server equipped with an NVIDIA RTX 4090 GPU. For textual representation, the pre-trained bert-base-uncased model is employed as the text encoder, while visual features are extracted using the CLIP-ViT-B/32 model as the image encoder. The hidden dimensions of both textual and visual representations are set to 768. During training, the proposed model is optimized using the AdamW optimizer for 40 epochs with a batch size of 8. The initial learning rate is set to 3 × 10^-5, with a warm-up ratio of 0.01 applied at the beginning of training. To alleviate overfitting, a dropout rate of 0.5 is adopted. In addition, the maximum sentence length is set to 80 tokens, and the hyperparameter K is set to 5.
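As a minimal sketch, the settings above can be collected into a configuration dictionary, and the warm-up ratio can be turned into a concrete number of warm-up steps. Two things here are assumptions for illustration and are not stated in this section: the training-set size (`num_train_samples = 4000`) and the shape of the schedule after warm-up (linear decay to zero). The function names are hypothetical.

```python
import math

# Hyperparameters as reported in Section 4.2.
CONFIG = {
    "text_encoder": "bert-base-uncased",
    "image_encoder": "CLIP-ViT-B/32",
    "hidden_dim": 768,
    "optimizer": "AdamW",
    "epochs": 40,
    "batch_size": 8,
    "learning_rate": 3e-5,
    "warmup_ratio": 0.01,
    "dropout": 0.5,
    "max_seq_len": 80,
    "top_k": 5,
}


def total_steps(num_train_samples, config=CONFIG):
    """Optimizer steps over the whole run (one step per mini-batch)."""
    steps_per_epoch = math.ceil(num_train_samples / config["batch_size"])
    return steps_per_epoch * config["epochs"]


def learning_rate_at(step, num_train_samples, config=CONFIG):
    """Linear warm-up to the peak rate, then linear decay to zero.

    The decay shape is an assumption; the text only gives the warm-up ratio.
    """
    total = total_steps(num_train_samples, config)
    warmup = max(1, int(total * config["warmup_ratio"]))
    peak = config["learning_rate"]
    if step < warmup:
        return peak * (step + 1) / warmup
    return peak * max(0.0, (total - step) / (total - warmup))


# Hypothetical training-set size, used only to make the example concrete.
num_train_samples = 4000
print("total steps:", total_steps(num_train_samples))
print("lr at step 0:", learning_rate_at(0, num_train_samples))
print("lr at end of warm-up:", learning_rate_at(199, num_train_samples))
```

With a warm-up ratio this small, only about 1% of all optimizer steps are spent ramping up, so the model trains at or near the peak rate for almost the entire run.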
4.3. Compared Baselines
To comprehensively evaluate the performance of multimodal named entity recognition (MNER) models, we adopt standard metrics widely used in sequence labeling tasks: Precision (P), Recall (R), and F1-score (F1). These are computed at the entity level—a predicted entity is considered correct only if both its span and type match the ground truth.
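The entity-level matching rule described above can be sketched as follows. The code assumes BIO-style tags (for example `B-PER`, `I-PER`, `O`), which is a common convention for these datasets but is not stated in this section; stray `I-` tags without a preceding `B-` are simply ignored, which is a simplifying choice. The function names are hypothetical.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    entities = set()
    start, ent_type = None, None
    # A trailing "O" acts as a sentinel that closes any open span.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and ent_type == tag[2:]
        if not inside:
            if ent_type is not None:
                entities.add((start, i - 1, ent_type))
                start, ent_type = None, None
            if tag.startswith("B-"):
                start, ent_type = i, tag[2:]
    return entities


def entity_prf(gold_sequences, pred_sequences):
    """Micro-averaged precision, recall, and F1 over exact span+type matches."""
    tp = n_pred = n_gold = 0
    for gold, pred in zip(gold_sequences, pred_sequences):
        gold_spans = extract_entities(gold)
        pred_spans = extract_entities(pred)
        tp += len(gold_spans & pred_spans)
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = [["B-PER", "I-PER", "O", "B-ORG", "I-ORG"], ["O", "O", "B-ORG"]]
pred = [["B-PER", "I-PER", "O", "B-LOC", "I-LOC"], ["O", "O", "B-ORG"]]
precision, recall, f1 = entity_prf(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

In this toy example the second entity of the first sentence has the right span but the wrong type (LOC instead of ORG), so it counts as both a false positive and a false negative, and two of three entities are matched on each side.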
To evaluate the effectiveness of the proposed model in addressing challenges such as cross-modal fusion, semantic consistency, and image noise suppression, several representative unimodal and multimodal models were selected for comparative experiments in this paper. The unimodal baselines include traditional sequence labeling models such as CNN + BiLSTM + CRF [
28], which enhances character-level representations through convolutional operations to improve the adaptability of social media text; BiLSTM + CRF [
29], which employs bidirectional recurrent neural networks to capture contextual dependencies and uses a CRF layer for structured prediction; and BERT-CRF [
30], which introduces a CRF decoding layer on top of the BERT encoder to further enhance sequence labeling performance. The multimodal baselines include several representative MNER approaches. UMGF [
22] proposes an object-guided multimodal graph fusion framework for named entity recognition. UMT [
8] introduces a multimodal interaction module along with an auxiliary entity span detection task to improve cross-modal representation learning. HVPNet [
31] presents a hierarchical visual prefix fusion network that enhances the robustness of multimodal entity and relation extraction. MNER-QG [
32] proposes an end-to-end framework for jointly learning multimodal named entity recognition and query localization. DebiasCL [
33] employs a debiased contrastive learning strategy to mitigate visual bias in multimodal entity recognition. MGCMT [
34] integrates multi-granularity visual cues with enhanced textual representations to improve recognition accuracy. MAF [
35] alleviates the image–text mismatch problem through a matching and alignment mechanism that improves cross-modal consistency. AMLR [
36] proposes an adaptive multi-scale linguistic enhancement mechanism to facilitate entity-level cross-modal interaction. Vec-MNER [
16] further enhances multimodal entity recognition by combining visual enhancement with cross-modal interaction.
To ensure the fairness of the experiment, the results of the baseline models mentioned above are taken directly from their original papers. We follow the standard experimental setup commonly adopted in MNER research, using exactly the same dataset split and evaluation scheme. This approach ensures that the performance gains of our VSGN model are evaluated under consistent conditions and compared against the known best-performing methods.
4.4. Effectiveness
Among the unimodal methods (Table 2 and Table 3), the BERT-CRF framework demonstrates strong performance on sequence labeling tasks. Nevertheless, the proposed VSGN model still achieves superior results on both datasets. Specifically, on the Twitter-2015 dataset, VSGN obtains an overall F1-score of 76.72%, surpassing BERT-CRF by 5.63 percentage points. On the Twitter-2017 dataset, VSGN achieves an F1-score of 87.86%, outperforming BERT-CRF by 4.42 percentage points. These results indicate that incorporating visual information can effectively compensate for the limitations of purely textual representations and improve entity recognition performance.
Compared with existing multimodal methods, the proposed VSGN model achieves competitive or superior performance across both datasets. Relative to the representative multimodal model MGCMT, VSGN improves the overall F1-score by 3.47 percentage points on Twitter-2015 and 2.17 percentage points on Twitter-2017. Furthermore, compared with the recent AMLR model, VSGN still achieves higher performance, with improvements of 1.41 and 0.93 percentage points on Twitter-2015 and Twitter-2017, respectively. These results demonstrate that the proposed model effectively enhances multimodal feature interaction and improves the robustness of multimodal entity recognition.
Further analysis at the entity level reveals the advantages of VSGN. The model achieves strong performance across most entity categories, particularly for PER and ORG entities. On the Twitter-2017 dataset, VSGN achieves F1-scores of 94.01% for PER and 85.20% for ORG, while the MISC category reaches 74.83%, which are among the best results compared with existing methods. These improvements can be attributed to the collaborative interaction between the graph-structure alignment stream and the channel-wise inhibitory routing (CIR) fusion stream, which enables the model to better capture fine-grained correspondences between visual objects and textual entities while suppressing irrelevant visual noise.
Overall, the experimental results demonstrate that VSGN consistently outperforms a wide range of existing methods on both the Twitter-2015 and Twitter-2017 datasets. This confirms the effectiveness of the proposed model in improving multimodal feature fusion, enhancing cross-modal semantic consistency, and reducing the interference caused by visual noise in multimodal named entity recognition.
4.5. Ablation Study
To evaluate the contribution of each component in the proposed VSGN model, we conducted ablation experiments on the Twitter-2015 and Twitter-2017 datasets. Specifically, we considered four ablated variants: (1) w/o VSG, which removes the visual–semantic guided graph structure learning module; (2) w/o CIR, which replaces the CIR mechanism with simple fusion strategies such as direct concatenation or weighted summation; (3) w/o Description, which removes the generated visual descriptions; and (4) w/o VSG & CIR, which removes both the VSG and CIR modules. The performance is reported in terms of precision (P), recall (R), and F1 score.
As shown in Table 4, the complete VSGN model achieves the best performance on both social media datasets, with F1 scores of 76.72% on Twitter-2015 and 87.86% on Twitter-2017, demonstrating the overall effectiveness of the proposed framework.
When removing the VSG module, the performance drops to 75.76% F1 on Twitter-2015 and 87.03% F1 on Twitter-2017, indicating its important role in modeling cross-modal structural relationships between text tokens and visual regions. By constructing a heterogeneous graph based on visual semantics and employing a GCN with adaptive edge weighting, the model captures local topological dependencies and facilitates structural semantic alignment across modalities.
To further demonstrate the effectiveness of the proposed module, we visualize the attention distribution of the VSG module, as shown in
Figure 4. The figure presents the attention map produced by the Visual–Semantic Guided graph structure learning (VSG) module. The attention responses exhibit a structured pattern, where darker colors indicate stronger relevance. Notably, the responses along the diagonal region are relatively darker, suggesting strong and consistent semantic associations between visual features and the corresponding textual tokens. These results indicate that the VSG module effectively captures semantic relationships across modalities and facilitates structured cross-modal interaction.
When the CIR mechanism is replaced with simple fusion strategies such as direct concatenation or weighted summation, the F1 scores decrease to 75.47% and 86.98%, respectively, slightly below the w/o VSG variant. This demonstrates that naive fusion methods are insufficient for handling noisy or redundant visual information. In contrast, the CIR mechanism introduces a channel-wise inhibitory routing strategy that selectively highlights informative visual channels while suppressing irrelevant ones, enabling more refined feature interaction and improving the quality of multimodal representations.
Removing the generated visual descriptions leads to a more noticeable performance decline, with F1 scores dropping to 74.54% on Twitter-2015 and 86.64% on Twitter-2017. This suggests that visual descriptions provide complementary semantic cues that enhance textual understanding, especially in short and noisy social media contexts.
When both the VSG and CIR modules are removed, the model experiences the most significant degradation, achieving F1 scores of 73.71% and 84.59% on the two datasets, respectively. This result highlights the complementarity of the two modules, where VSG focuses on cross-modal structural alignment at the graph level, while CIR improves feature fusion at the channel level.
Overall, the ablation results demonstrate that each component of VSGN contributes positively to the final performance, and their joint optimization leads to substantial improvements in multimodal named entity recognition.
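The F1 drops implied by the ablation figures quoted above can be tabulated with a short script. The numbers are taken directly from the text; only the helper name `f1_drops` is hypothetical.

```python
# F1 scores (%) reported in the ablation study: (Twitter-2015, Twitter-2017).
ABLATION_F1 = {
    "VSGN (full)": (76.72, 87.86),
    "w/o VSG": (75.76, 87.03),
    "w/o CIR": (75.47, 86.98),
    "w/o Description": (74.54, 86.64),
    "w/o VSG & CIR": (73.71, 84.59),
}


def f1_drops(table, reference="VSGN (full)"):
    """Absolute F1 drop (percentage points) of each variant vs. the full model."""
    ref = table[reference]
    return {
        name: tuple(round(r - s, 2) for r, s in zip(ref, scores))
        for name, scores in table.items()
        if name != reference
    }


drops = f1_drops(ABLATION_F1)
for name, (d15, d17) in drops.items():
    print(f"{name:<18} Twitter-2015: -{d15:.2f}  Twitter-2017: -{d17:.2f}")
```

On both datasets, the drop from removing VSG and CIR together (3.01 and 3.27 points) exceeds the sum of the two individual drops (2.21 and 1.71 points), which is consistent with the complementarity argument made above.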
4.6. Case Study
4.6.1. Successful Cases
To further investigate the effectiveness and robustness of the proposed VSGN model in complex multimodal scenarios, we present three representative cases from the Twitter dataset, as illustrated in
Figure 5. These examples are carefully selected to reflect different types of challenges, including insufficient textual context, semantic ambiguity, and noisy social media expressions.
The first case involves the entity “Gigi Buffon” in a sports-related post. Due to the lack of explicit contextual cues in the text, baseline models exhibit significant misclassification errors. Specifically, UMGF and AMLR incorrectly classify “Gigi Buffon” as MISC, while HVPNet predicts it as O, revealing a failure to capture meaningful semantic signals. Similarly, the entity “Leicester City” is consistently misclassified as LOC by all baseline models, suggesting that these methods tend to rely heavily on surface-level lexical patterns rather than deeper semantic understanding. In contrast, our model correctly identifies “Gigi Buffon” as a person (PER) and “Leicester City” as an organization (ORG). This improvement can be attributed to the proposed visual–semantic fusion mechanism, which integrates generated visual descriptions with fine-grained visual features. Such a design provides richer contextual evidence, enabling the model to associate the textual mention with relevant visual semantics, thereby improving entity type discrimination.
The second case focuses on the entity “Phoenix” in the context of the movie Harry Potter and the Order of the Phoenix. This example highlights the challenge of semantic ambiguity, where the same entity can correspond to different types depending on context. In this case, UMGF and AMLR classify “Phoenix” as O, failing to recognize it as a meaningful entity, while HVPNet incorrectly predicts it as PER due to its bias toward dominant entity types in textual patterns. Although HVPNet and AMLR correctly identify “Harry Potter” as PER, they fail to properly model the semantic relationship between “Phoenix” and the film’s context. In contrast, our model successfully classifies “Phoenix” as MISC while maintaining correct recognition of “Harry Potter”. This demonstrates that the VSGN model can effectively leverage cross-modal contextual cues, particularly visual information from the movie scene, to disambiguate entity semantics. The visual–semantic guided interaction allows the model to align textual context with relevant visual concepts, thereby resolving ambiguity that cannot be addressed by text-only or weakly aligned multimodal methods.
The third case involves the entity “GPISDECHS”, which appears in a sports-related social media post and represents an abbreviated organization name. This scenario is particularly challenging due to noisy text, informal expressions, and the absence of explicit semantic clues. As shown in
Figure 5, UMGF and HVPNet fail to correctly identify the entity type, predicting it as MISC or O, respectively. Although AMLR correctly classifies “Trevino” as PER, it still misclassifies “GPISDECHS”, indicating its limited ability to handle rare or non-standard entity mentions. In contrast, our model accurately identifies “GPISDECHS” as an organization (ORG) and “Trevino” as a person (PER). This superior performance stems from the proposed visual–semantic guided graph learning mechanism, which explicitly models cross-modal structural relationships. By constructing interactions between textual tokens and visual regions, the model can propagate complementary information across modalities, enabling more robust entity representation even under noisy conditions.
Overall, these case studies demonstrate that the proposed VSGN model consistently outperforms existing approaches in challenging multimodal settings. By effectively integrating visual–semantic cues and modeling cross-modal interactions, the model not only mitigates issues arising from insufficient textual context and semantic ambiguity but also enhances robustness against noisy and informal data. This leads to more accurate and reliable entity recognition, highlighting the practical value of the proposed approach.
4.6.2. Error Analysis
To further analyze the limitations of the proposed model, we randomly sample representative error cases from the test set and categorize them into three types, as illustrated in
Figure 6.
The first category is bias brought by the annotation. As shown in
Figure 6a, the entity “Frank Erwin Center” is labeled as “ORG” in the dataset, while the model predicts it as “LOC”. However, such entities (e.g., stadiums or event venues) can reasonably be interpreted as either organizations or locations depending on the annotation standard. Therefore, this type of error is not solely caused by model deficiency, but also reflects the inherent ambiguity and inconsistency in the annotation scheme.
The second category is irregular social media structure. As illustrated in
Figure 6b, the text contains informal expressions such as “RT” and user mentions, along with account-like tokens such as “MensFitnessWire”. Although it is annotated as an organization, the model predicts it as “O”, suggesting that the model tends to treat such tokens as noisy or non-entity elements due to their lack of clear semantic structure. This indicates that non-standard linguistic patterns in social media pose challenges for accurate entity recognition.
The third category is lack of background knowledge. In
Figure 6c, the entity “One Piece” is labeled as “MISC”, but the model fails to recognize it and predicts “O”. This is mainly because the textual and visual context provides limited clues, and correctly identifying such entities often requires external knowledge (e.g., recognizing it as a well-known anime). Without sufficient background knowledge, the model struggles to capture the underlying semantics.
Overall, these errors arise from different sources, including annotation bias, structural noise in social media text, and insufficient background knowledge, highlighting the remaining challenges in multimodal named entity recognition.
4.7. t-SNE Visualization of Entity Feature Distributions
To demonstrate the effectiveness of the multimodal feature representations learned by the VSGN model and its classification capability, t-SNE was employed to perform dimensionality reduction and visualization analysis on the entity representations of the Twitter-2017 test set and of the prediction results, as shown in Figure 7.
The high-dimensional features output by the model are mapped to a two-dimensional space, where points in different colors represent different entity types. Comparing the ground truth label distribution in
Figure 7a with the predicted label distribution in
Figure 7b, it can be observed that they exhibit a high degree of consistency in both topological structure and distribution density. In the feature space, entities of the same type form tight clusters, while distinct boundaries are formed between different entity types. It is worth noting that the “O” label, representing non-entity categories, exhibits a relatively dispersed distribution pattern.
However, this is not a limitation of the model but rather a reasonable representation consistent with linguistic characteristics. Since the “O” label encompasses background words of various parts of speech, such as verbs, prepositions, and common nouns, its semantic space inherently possesses high diversity and contextual dependency. This dispersed distribution demonstrates that the model not only retains rich background contextual information but also successfully decouples and isolates named entities with clear semantic orientations from the complex background context.
The experimental results further validate that the VSGN model, while capturing fine-grained contextual semantics, maintains precise discriminative capability for core entity categories.
4.8. Sensitivity Analysis of the K Parameter
In the visual semantic-enhanced graph structure learning phase (see Section 3.3.1), the model adopts a fixed-size neighbor sampling strategy, retaining the Top-K neighbors for each node to construct its information aggregation range. Since the setting of the sampling number K directly determines the scale and quality of candidate neighbors in the information aggregation process, the choice of this hyperparameter has a significant impact on model performance. To analyze the influence of K on model performance, experiments were conducted by training the model under different K values while keeping the number of training epochs consistent. The experimental results are shown in Figure 8.
When K is small, the model can only capture limited neighborhood structural information, making it difficult to fully capture potential semantic associations between nodes, resulting in relatively low overall performance. As K gradually increases, the model can cover more potential semantic neighbors, allowing for more comprehensive integration of neighborhood information, and model performance improves, reaching its optimum at K = 5. However, when K continues to increase, an excessive number of neighbor nodes introduces some weakly correlated or even noisy information, and these irrelevant or redundant neighborhood features interfere with the core semantic aggregation process, thereby diminishing the model’s representation capability and recognition performance to some extent.
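The fixed-size Top-K neighbor selection discussed above can be illustrated with a small sketch. It assumes that neighbors are ranked by cosine similarity between node features; the actual scoring function is defined in Section 3.3.1 and may differ. The function names and the toy features are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_neighbors(features, k):
    """For each node, keep the indices of its k most similar other nodes."""
    neighbors = []
    for i, fi in enumerate(features):
        scored = [(cosine(fi, fj), j) for j, fj in enumerate(features) if j != i]
        # Highest similarity first; ties are broken by the smaller index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbors.append([j for _, j in scored[:k]])
    return neighbors


# Toy node features: nodes 0 and 1 are close, as are nodes 2 and 3.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
for k in (1, 2, 3):
    print(k, top_k_neighbors(features, k))
```

With `k = 1` each node keeps only its closest partner. Raising `k` to 3 forces every node to also aggregate from nodes in the other, weakly related pair, which mirrors the noise effect described above for large K.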
Although the sensitivity trend with respect to K is consistent across the two datasets, their performance baselines differ significantly. This gap is primarily attributed to differences in data quality, and mainly stems from the following factors:
- (1)
The Twitter-2015 dataset has lower quality, with a large amount of missing image information, which is crucial for the multimodal named entity recognition task.
- (2)
The Twitter-2015 dataset also contains many highly irrelevant samples, which act as noise and increase the difficulty of model learning. In contrast, the Twitter-2017 dataset is cleaner and more complete, enabling the model to learn the underlying patterns in the data more effectively.
4.9. Low-Resource Experiment
We further conduct experiments under resource-constrained settings. Specifically, we randomly sample 10% to 50% of the original training data to construct subsets with limited resources.
Figure 9 presents the performance comparison of VSGN and several baseline models on the Twitter-2015 and Twitter-2017 datasets.
Overall, VSGN consistently outperforms all baselines across different data proportions. Notably, the performance gain is more pronounced in low-resource scenarios, demonstrating the effectiveness of the proposed model in utilizing limited training data.
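The subset construction described above can be sketched as follows. This is only an illustration under stated assumptions: the text specifies random sampling of 10% to 50% of the training data, but the 10-point step between proportions, the fixed seed, the independent (non-nested) draws, and the helper names `sample_subset` and `build_low_resource_subsets` are hypothetical choices made for this example.

```python
import random


def sample_subset(train_data, fraction, seed=0):
    """Draw a random subset containing `fraction` of the training examples.

    Sampling is without replacement. The seed is fixed only so that the
    sketch is reproducible; the paper does not state how seeds were handled.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    size = max(1, round(len(train_data) * fraction))
    return rng.sample(train_data, size)


def build_low_resource_subsets(train_data, fractions=(0.1, 0.2, 0.3, 0.4, 0.5), seed=0):
    """Build one subset per proportion (the 10-point step is an assumption)."""
    return {f: sample_subset(train_data, f, seed=seed) for f in fractions}


# Toy stand-in for a training set of 1000 examples.
toy_train = list(range(1000))
toy_subsets = build_low_resource_subsets(toy_train)
```

Each model would then be trained on every subset and evaluated on the unchanged test split, so that differences across proportions reflect only the amount of training data.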
4.10. Computational Cost Analysis
To evaluate the practical applicability of the proposed VSGN model, we conducted a comprehensive analysis of its computational efficiency, including parameter scale, training time, inference speed, and memory consumption.
As shown in Table 5, VSGN comprises 142.7 million parameters, which is slightly more than AMLR (136.4 million) and considerably more than BERT-CRF (110.5 million). This increase is primarily attributed to the introduction of the visually–semantically guided graph structure learning module and the channel-level inhibition routing (CIR) mechanism, both of which add parameters and computation to model fine-grained cross-modal interactions.
In terms of training efficiency, VSGN requires 1.3 h per training epoch, which is slightly longer than AMLR (1.2 h). This computational overhead stems primarily from dynamic graph construction and routing operations, which introduce additional computations during forward and backward propagation. In terms of inference speed, VSGN achieves 36 samples per second, a decrease compared to AMLR (42 samples/second). This decrease is mainly attributed to graph-based interactions and dynamic routing processes, which increase inference complexity.
In terms of memory consumption, VSGN occupies 13.6 GB of GPU memory, which is higher than BERT-CRF (4.2 GB) but comparable to AMLR (12.4 GB). This overhead primarily stems from the learning of dynamic graph structures and the intermediate representations involved in cross-modal feature interactions. However, compared to some complex multimodal models, this memory consumption remains within an acceptable range, indicating that the proposed method achieves a reasonable trade-off between performance gains and resource overhead.
Despite the increased computational cost, the model’s performance improves significantly. Ablation results further demonstrate that removing the VSG module (w/o VSG) reduces the parameter count to 132.3 million and increases inference speed to 47 samples/second, but leads to a significant performance drop (see Section 4.5). This indicates that although the VSG module incurs higher computational costs, it plays a crucial role in enhancing cross-modal representation learning.
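To put the reported figures on a common scale, the short sketch below converts them into relative changes. It uses only numbers quoted in this section; metrics that the text does not report for a given model are left out rather than guessed, and the names `REPORTED`, `relative_change`, and `compare` are introduced here purely for illustration.

```python
# Figures quoted in the text of this section; unreported metrics are omitted.
REPORTED = {
    "VSGN": {"params_m": 142.7, "epoch_h": 1.3, "samples_per_s": 36.0, "gpu_gb": 13.6},
    "AMLR": {"params_m": 136.4, "epoch_h": 1.2, "samples_per_s": 42.0, "gpu_gb": 12.4},
    "w/o VSG": {"params_m": 132.3, "samples_per_s": 47.0},
    "BERT-CRF": {"params_m": 110.5, "gpu_gb": 4.2},
}


def relative_change(new, base):
    """Percentage change of `new` with respect to `base`."""
    return (new - base) / base * 100.0


def compare(model, baseline, table=REPORTED):
    """Relative change for every metric reported for both models."""
    shared = table[model].keys() & table[baseline].keys()
    return {
        metric: relative_change(table[model][metric], table[baseline][metric])
        for metric in sorted(shared)
    }


if __name__ == "__main__":
    for baseline in ("AMLR", "w/o VSG", "BERT-CRF"):
        for metric, pct in compare("VSGN", baseline).items():
            print(f"VSGN vs {baseline:<8} {metric:<14} {pct:+.1f}%")
```

On these numbers, VSGN has about 4.6% more parameters than AMLR, about 8.3% longer epochs, about 14.3% lower throughput, and about 9.7% more GPU memory. Relative to the w/o VSG variant, it has about 7.9% more parameters and about 23.4% lower throughput.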
Overall, VSGN achieves a good balance between computational cost and performance. Although it introduces some overhead compared to baseline models, the significant performance gains validate the effectiveness and necessity of the proposed design, making it a practical solution for multimodal named entity recognition tasks.
5. Conclusions and Outlook
This paper proposes a Visual–Semantic Guided Interaction Network for multimodal named entity recognition, aiming to address the limitations of existing methods in handling noisy social media data, insufficient semantic modeling, and coarse cross-modal alignment. The proposed framework introduces an adaptive visual–semantic fusion mechanism to enable symmetric cross-modal interaction, allowing textual representations to be enhanced with complementary visual cues. In particular, a channel-wise inhibitory routing module is incorporated to selectively suppress redundant or noisy cross-modal signals, thereby improving the discriminability of fused features. By jointly modeling fine-grained semantic relationships and adaptive cross-modal interactions, the proposed method achieves more effective and robust multimodal representation learning. Furthermore, a visual–semantic guided graph structure learning module is developed to explicitly model structured relationships between textual tokens and visual regions, enabling fine-grained cross-modal alignment through bidirectional interaction and improving semantic consistency. Extensive experiments on benchmark datasets demonstrate the effectiveness and robustness of the proposed approach. Compared with existing methods, VSGN achieves superior performance in complex scenarios involving ambiguous expressions and noisy contexts, highlighting the importance of integrating structured modeling and cross-modal semantic guidance.
Despite its effectiveness, the proposed framework introduces additional computational overhead due to graph structure learning and cross-modal interaction. In particular, the construction of heterogeneous graphs and dynamic edge modeling increases model complexity, which may affect efficiency in large-scale or real-time applications. Moreover, as the current evaluation is mainly conducted on social media datasets, its applicability to broader domains remains to be further explored.
In future work, we will focus on developing more efficient graph learning strategies to improve scalability while enhancing generalization across diverse scenarios. Additionally, we plan to incorporate large-scale pre-trained multimodal models and extend the framework to more complex tasks, such as event extraction and multimodal reasoning.