VSGN: Visual–Semantic Guided Interaction Network for Multimodal Named Entity Recognition

Yao, Jianjun; Zhou, Zhikun; Li, Ruisheng; Zhang, Jiaming; Qi, Zhiwei

doi:10.3390/sym18050769


by Jianjun Yao ¹,*, Zhikun Zhou ¹, Ruisheng Li ¹, Jiaming Zhang ¹ and Zhiwei Qi ²

¹ School of Artificial Intelligence, Gansu University of Political Science and Law, No. 6 Anning West Road, Lanzhou 730070, China

² School of Information Science and Engineering, Yunnan University, No. 2 Cuihu North Road, Kunming 650500, China

* Author to whom correspondence should be addressed.

Symmetry 2026, 18(5), 769; https://doi.org/10.3390/sym18050769

Submission received: 23 March 2026 / Revised: 19 April 2026 / Accepted: 23 April 2026 / Published: 29 April 2026

(This article belongs to the Section Computer)


Abstract

Multimodal Named Entity Recognition (MNER) aims to integrate textual and visual information to identify entities with specific semantic categories. However, existing methods often suffer from insufficient intra-modal semantic modeling, coarse cross-modal alignment, and vulnerability to noisy or ambiguous expressions in social media. To address these challenges, we propose a Visual–Semantic Guided Interaction Network (VSGN), which improves multimodal representation learning from both semantic and structural perspectives. Specifically, we first design an adaptive visual–semantic fusion module that incorporates visual descriptions as semantic guidance, enabling more informative cross-modal interactions. To further enhance feature quality, we introduce a deviation-aware channel-wise inhibitory routing (CIR) mechanism, which jointly models channel importance and distributional deviation to suppress noisy or redundant visual signals. In addition, we propose a visual–semantic guided graph structure learning module (VSG), which explicitly captures structural dependencies across modalities. By enforcing distribution-level alignment between textual and visual graph representations, the model achieves structure-aware cross-modal interaction and reduces modality inconsistency. Extensive experiments on the Twitter-2015 and Twitter-2017 datasets demonstrate the effectiveness of the proposed method, achieving F1 scores of 76.72% and 87.86%, respectively. The results show that jointly modeling semantic enhancement and structural alignment leads to more robust and discriminative multimodal representations.

Keywords:

multimodal named-entity recognition; graph neural network; cross-modal fusion; social media

1. Introduction

With the rapid development of social media platforms, Twitter, WeChat, and Weibo have gradually become important mediums for the public to express opinions and disseminate information. Users generate vast amounts of textual content on these platforms around various social events, which contains rich entity information. Effectively identifying and extracting named entities from such data helps uncover the core subjects of public concern and focal points of public opinion, thereby providing crucial data support for social opinion analysis and related decision-making. Named Entity Recognition (NER) [1], a foundational task in Natural Language Processing (NLP), aims to identify and classify entities with specific semantic categories—such as person names, location names, and organization names—from unstructured text [2]. With the rapid development of the mobile Internet and social media, multimodal data encompassing images, text, videos, and other types has gradually become the mainstream medium for information dissemination. In this context, NER based solely on a single text modality is unable to address the need for integrating cross-type information from multimodal data, making it challenging to fulfill entity mining tasks within such data [3]. As illustrated in Figure 1, multimodal tweets often contain ambiguous or incomplete textual information, making entity recognition highly challenging without visual context. In Figure 1a, the text mentions “Konrad Hilton”, which can be correctly identified as a person (PER). However, the accompanying image of a hotel building with the “Hilton” logo provides additional contextual evidence that reinforces the association between the person and the organization. This visual cue helps the model better understand the semantic relationship between entities beyond the textual description alone. In Figure 1b, the text “I love Alibaba” contains the entity “Alibaba”, which could be ambiguous in isolation. 
Without visual context, a model might erroneously classify “Alibaba” as a person (PER), mistaking it for a nickname or an individual’s name. With the support of the visual content showing the Alibaba logo and building, however, the entity can be accurately classified as an organization (ORG). These examples demonstrate that visual information plays a crucial role in resolving semantic ambiguity and enriching contextual understanding in multimodal named entity recognition, especially in social media scenarios where text is often short and noisy.

Multimodal Named Entity Recognition (MNER) aims to integrate rich scene, object, and semantic cues from images to address information loss caused by text sparsity. Simultaneously, it accurately disambiguates polysemous or ambiguous expressions in text, thereby identifying more comprehensive and precise entity types and attributes [4].

Early studies mainly encoded the entire image into a global feature vector to enhance text representations with visual cues [5]. For instance, Moon et al. [6] utilized a joint modeling framework with an LSTM-CNN-based attention mechanism to integrate global information; Zhang et al. [7] employed an attention mechanism to guide the model in extracting entity-relevant visual cues for aligning image-text features; and Yu et al. [8] proposed a unified multimodal transformer incorporating entity span detection to improve MNER performance. Current research focuses on fine-grained image-text interaction mechanisms, extracting multi-scale visual features, and optimizing semantic alignment. For example, Bao et al. [9] designed a multi-level alignment contrastive pre-training framework to enhance entity recognition and precise localization; Xu et al. [10] proposed an adaptive mixing-based image enhancement strategy to refine image-text matching; while Wang et al. [11] introduced a dual-enhancement hierarchical alignment framework that explicitly models cross-modal hierarchical associations through global and local dual-path contrastive learning.

However, these methods still suffer from several limitations. Most approaches rely on fixed fusion strategies and lack adaptive regulation mechanisms guided by the semantic discrimination requirements of multimodal data, which may introduce irrelevant visual cues and weaken text-dominated semantic representations. In addition, the exploration of multi-granularity visual semantics remains insufficient, limiting their ability to compensate for information loss in textual entity recognition. Furthermore, existing methods primarily perform semantic fusion and similarity alignment at the feature representation level, without explicitly modeling intra-modal relationships or enforcing consistency in cross-modal correspondences. This often leads to semantic inconsistencies in complex scenarios, ultimately degrading entity recognition performance.

To address the above challenges, this paper proposes a Visual–Semantic Guided Interaction Network (VSGN) that explicitly integrates semantic enhancement and structure-aware alignment. To alleviate noise interference and insufficient semantic representation, the model first introduces a visual–semantic fusion mechanism guided by generated visual descriptions. By using descriptions as auxiliary semantic cues, the model enriches textual representations with complementary visual semantics. Meanwhile, a channel-wise inhibitory routing strategy is incorporated to selectively suppress redundant or noisy visual signals, leading to more reliable and discriminative cross-modal feature integration. To address the limitation of coarse cross-modal alignment, the model further incorporates a visual–semantic guided graph structure learning mechanism. Instead of relying on implicit attention, it explicitly models relational dependencies between textual and visual elements, enabling fine-grained alignment at the structural level and capturing complex cross-modal interactions. In addition, a distribution-level alignment constraint is introduced to enforce global semantic consistency between modalities, which complements the local structural alignment and provides more stable supervision for cross-modal learning.

The main contributions of this paper are summarized as follows:

(1) We propose a Visual–Semantic Guided Interaction Network (VSGN) for multimodal named entity recognition, which unifies semantic enhancement and structural alignment to effectively address cross-modal inconsistency, semantic ambiguity, and noise in social media data.
(2) We design a unified visual–semantic interaction mechanism that integrates channel-wise inhibitory routing and visual–semantic enhanced graph structure learning. Redundant or noisy visual signals are suppressed at the channel level to improve feature discriminability, while cross-modal structural dependencies are modeled through graph-based learning. Guided by generated visual descriptions, this mechanism bridges the semantic gap between modalities and enhances fine-grained interaction.
(3) Extensive experiments on benchmark datasets demonstrate the superiority of the proposed approach. Ablation studies and case analyses further confirm the effectiveness of each component in improving robustness and semantic consistency.

2. Related Work

In this section, we review prior work on multimodal named entity recognition and on the application of graph neural networks to this task.

2.1. Multimodal Named Entity Recognition

Recent advancements in MNER have evolved from simple feature concatenation to increasingly sophisticated cross-modal interaction and alignment mechanisms [12]. Early studies primarily focused on enhancing textual representations by incorporating global or region-level visual features. For instance, Yu et al. [13] utilized hierarchical index generation to achieve fine-grained alignment between image regions and textual entities, laying the foundation for integrating visual cues into entity recognition.

Subsequent research shifted towards more effective cross-modal interaction strategies, aiming to bridge the semantic gap between modalities. Wang et al. [14] transformed visual information into context tokens to enable seamless interaction within textual encoding, while Jiang et al. [15] introduced dual-similarity guidance to suppress redundant features and enhance discriminative representations. To further address implicit alignment issues, Wei et al. [16] designed an association-aware layer with contrastive learning to strengthen cross-modal consistency, and Mu et al. [17] proposed a multi-granularity framework to alleviate visual noise and capture complementary information across different visual levels.

More recent studies have explored richer semantic modeling and more robust alignment mechanisms. Chen et al. [18] incorporated high-level visual attributes and external knowledge to mitigate modality bias, leveraging knowledge graph retrieval to enhance semantic understanding beyond raw visual signals. Guo et al. [19] proposed the MGICL framework, which performs cross-modal contrastive learning across multiple granularities and introduces a visual gating mechanism to dynamically filter irrelevant visual information, thereby reducing noise and narrowing feature space discrepancies. Zheng et al. [20] developed the AGBAN model, which employs fine-grained visual object features and bilinear attention to explicitly capture entity-object correspondences, and further adopts adversarial learning to map multimodal features into a shared invariant space, reducing distribution gaps.

Despite these advances, most existing methods still rely on sequence-based or feature-level interaction paradigms, which lack an explicit mechanism to model structured relationships within and across modalities. In particular, they struggle to capture complex dependencies among textual tokens and visual regions, as well as the non-local and many-to-many correspondences between them. This limitation motivates the introduction of graph-based approaches, which provide a natural and flexible framework for modeling structured interactions and enable unified representation of intra-modal and cross-modal relationships.

2.2. Graph Neural Networks for MNER

To overcome the limitations of sequence-based modeling, recent studies have introduced graph structures into MNER to explicitly model structured dependencies across modalities. By representing textual tokens and visual regions as nodes, graph-based methods can capture non-local interactions and complex relational structures that are difficult to model using sequential architectures.

Graph Neural Networks (GNNs) have been widely adopted to aggregate node features based on topological connections, enabling the joint modeling of local and global contextual semantics [21]. More importantly, graph structures provide a unified framework for encoding intra-modal relationships (e.g., dependencies between words or visual regions) and inter-modal interactions (e.g., alignments between entity spans and visual cues). For example, Zhang et al. [22] constructed a unified multimodal graph with stacked fusion layers to jointly model intra- and inter-modal interactions, while Zhao et al. [23] further enhanced graph representations by incorporating external matching signals across text-image pairs.

However, existing graph-based approaches still face two key limitations. First, most methods rely on relatively shallow aggregation mechanisms, which are insufficient to capture deep semantic associations between nodes, thereby limiting contextual understanding. Second, cross-modal structural alignment is often coarse-grained, failing to establish precise correspondences between fine-grained visual cues and textual entity spans.

Therefore, there remains a need for a more effective approach that can deeply model inter-node semantic relationships and achieve fine-grained cross-modal structural alignment. Motivated by these challenges, we propose the Visual–Semantic Guided Interaction Network (VSGN) for multimodal named entity recognition, aiming to enhance semantic representation and establish precise semantic alignment between visual cues and textual entities through structured interaction learning.

3. Methodology

In this section, we first formally define the multimodal named entity recognition problem and then introduce the technical details of the proposed VSGN approach. The overall framework of VSGN is illustrated in Figure 2.

3.1. Task Definition

Given a text T and its associated image V as input, the task of MNER is to identify all named entities in the text and classify each entity into predefined semantic categories. In this paper, the MNER task is modeled as a sequence labeling problem. Let X = (s_1, s_2, s_3, …, s_n) represent a text consisting of n words, with the corresponding label sequence denoted as L = (y_1, y_2, y_3, …, y_n). Each label y_i in the sequence is drawn from a predefined label set constructed using the BIO2 annotation scheme [24].
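To make the sequence labeling formulation concrete, the following minimal sketch decodes a BIO2 label sequence into entity spans. The function name `bio_to_spans` and the toy sentence are illustrative assumptions and are not taken from the paper's implementation.

```python
def bio_to_spans(tokens, labels):
    """Collect (entity_text, entity_type) pairs from a BIO2 label sequence."""
    spans, current, current_type = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always opens a new entity; close any open one first.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], label[2:]
        elif label.startswith("I-") and current and label[2:] == current_type:
            # An I- tag continues the open entity of the same type.
            current.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes the open entity.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


tokens = ["I", "love", "Alibaba"]
labels = ["O", "O", "B-ORG"]
print(bio_to_spans(tokens, labels))  # [('Alibaba', 'ORG')]
```

In this sketch an I- tag without a matching open entity is treated as O, which is one common convention among several.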

3.2. Multimodal Feature Encoding Layer

To mitigate the insufficiency of semantic expression in single visual features, this paper introduces multi-granular visual semantic information for joint modeling, including global visual features, regional object features, and generated visual descriptions. These are designed to supplement the semantic information required for textual entity recognition from different levels.

3.2.1. Text Feature Extraction

Given the input token sequence T = (t₁, t₂, t₃, …t_n), we employ a pre-trained BERT model as the backbone encoder to extract contextualized textual representations. Specifically, each token is first mapped into a continuous embedding space through token embedding, positional embedding, and segment embedding, and then fed into a stack of Transformer layers to capture contextual dependencies within the sequence. The hidden states output from the final Transformer layer are taken as the semantic representations of the input text, which can be formulated as

H_{t} = \mathrm{BERT}(T) = (h_{1}, h_{2}, \dots, h_{L}) \in ℝ^{L \times d}

(1)

where L denotes the sequence length and d represents the hidden dimension. Each vector h_i is the semantic representation of the i-th token in its context.

3.2.2. Visual Feature Extraction

To better complement textual semantics with visual information, we extract visual representations from three levels: global visual features, regional object features, and semantic descriptions. Such a multi-level design enables the model to capture both holistic scene information and fine-grained object details, while also introducing high-level semantic cues to bridge the gap between visual perception and textual understanding.

First, the ViT-B/32 image encoder from the CLIP [25] model is employed to extract global visual features. Specifically, the input image is first resized to a uniform size of 224 × 224 pixels and then divided into a sequence of fixed-size patches of 32 × 32 pixels. Consequently, each image is partitioned into a 7 × 7 grid, resulting in 49 patches, each treated as a visual token. These tokens are then processed by a stack of Transformer layers to model long-range dependencies across different regions, thereby capturing the overall semantic structure of the image. The resulting visual representations can be formulated as:

H_{i} = \mathrm{ViT}^{(i)}(V), \quad i = 1, 2, \dots, 12

(2)

Here, H_i ∈ ℝ^{m×d} represents the set of local visual features of the image, where m is the number of visual units and d is the feature dimension. These global visual features provide a coarse-grained understanding of the image content and serve as the foundation for subsequent multimodal interaction. In particular, they are further integrated with regional object features and semantic descriptions to construct richer visual representations, which will be leveraged in the following modules for adaptive fusion and graph-based cross-modal alignment.
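The patch arithmetic stated above for the ViT-B/32 encoder (a 224 × 224 input split into 32 × 32 patches) can be checked with a short sketch. The helper name `patch_grid` is an illustrative assumption.

```python
def patch_grid(image_size=224, patch_size=32):
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side, side * side


side, num_patches = patch_grid()
print(side, num_patches)  # 7 49
```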

Second, to explicitly model entity-related objects in the image, the object detection model Fast R-CNN is employed to extract several high-confidence regions from the image. Compared with global visual features, these region proposals focus on salient objects that are more likely to correspond to textual entities, thereby providing fine-grained visual evidence for entity recognition. The detected regions can be represented as

R = Fast R-CNN (V)

(3)

V_{R} = \mathrm{ViT}(R)

(4)

Finally, to compensate for the limitations of visual features in high-level semantic expression, the BLIP (Bootstrapping Language–Image Pre-training) model [26] is introduced to generate natural language descriptions of the image. Unlike raw visual features, these descriptions provide explicit semantic interpretations of the image content, effectively bridging the modality gap between vision and language. The generated descriptions are then encoded using a pre-trained BERT model to obtain their contextualized semantic representations:

C = \mathrm{BERT}(\mathrm{BLIP}(V))

(5)

Here, C = {C_0, C_1, …, C_19}, and the maximum description length is 20 tokens. These semantic description features serve as high-level complementary signals, enriching textual representations and providing additional guidance for both feature fusion and graph structure learning. By integrating global visual features, region-level representations, and semantic descriptions, the model constructs a multi-granularity visual feature space, which lays a solid foundation for subsequent adaptive fusion and fine-grained cross-modal alignment.

3.3. Multimodal Graph Structure Learning

3.3.1. Visual–Semantic Enhanced Graph Structure Learning

To address the limitations of fixed or heuristic graph construction in existing MNER methods, particularly under noisy and heterogeneous multimodal conditions, we propose a visual–semantic guided graph structure learning mechanism. The core idea is to construct adaptive graph structures that are jointly informed by semantic similarity and learnable relational patterns. Specifically, the graph construction process consists of three steps: (1) semantic similarity estimation, (2) learnable edge weight generation via a node association learner, and (3) sparsification for noise reduction. This design enables the model to dynamically capture meaningful intra- and cross-modal relationships while suppressing noisy connections.

The model constructs graph representations for both textual and visual modalities to explicitly capture fine-grained semantic and spatial relationships. For the text modality, we construct a token-level graph, where each node corresponds to a token from the input sequence. The sequence is formed by concatenating the original text and the generated image description, allowing the model to incorporate complementary visual semantics. The node features are obtained from the contextualized representations produced by a pre-trained language model. For the image modality, we construct a patch-level graph, where each node corresponds to a visual patch extracted by the Vision Transformer (ViT). Specifically, the patch embeddings from the last hidden layer of the ViT encoder are used as node features, representing local visual regions, as described in Section 3.2.2 (Equation (2)). For each text graph and image graph, the node feature matrix is denoted as H ∈ ℝ^{n×d}, where n represents the number of nodes and d is the feature dimension.

To avoid structural bias caused by using a fixed adjacency matrix, this study designs a Node Association Learner f_θ to estimate the edge weight score between any pair of nodes. Its workflow is shown in Figure 3.

First, we compute the semantic consistency association score for cross-modal node features as follows:

S_{i j} = κ (H_{i}, H_{j}) = \frac{{(W H_{i})}^{⊤} (W H_{j})}{{‖ W H_{i} ‖}_{2} {‖ W H_{j} ‖}_{2}}

(6)

where κ(⋅, ⋅) denotes a cosine kernel that captures semantic affinity in the latent space.
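As a minimal sketch of the cosine kernel in Equation (6), the following code projects two node features with a shared matrix W and computes their cosine similarity. The helper names, the identity projection, and the small `eps` stabilizer are assumptions made for illustration.

```python
import math


def project(W, h):
    """Apply the projection W (a list of rows) to a feature vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def cosine_kernel(W, h_i, h_j, eps=1e-12):
    """Cosine similarity between the projected features W h_i and W h_j."""
    u, v = project(W, h_i), project(W, h_j)
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norms + eps)  # eps (an assumption) guards against zero vectors


W = [[1.0, 0.0], [0.0, 1.0]]  # toy projection; learnable in a real model
print(cosine_kernel(W, [1.0, 2.0], [2.0, 4.0]))  # parallel features, close to 1
print(cosine_kernel(W, [1.0, 0.0], [0.0, 1.0]))  # orthogonal features, 0
```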

Then, instead of explicitly parameterizing the edge weights with scalar variables, we model the edge generation process as a learnable function mapping f_θ. This function takes the latent features of a node pair as input and outputs the correlation between them. Specifically, we employ a Multilayer Perceptron (MLP) to implement f_θ, aiming to capture complex nonlinear dependencies. The calculation is as follows:

\begin{aligned} f_{θ}(H_{i}, H_{j}) &= σ(\mathrm{MLP}_{θ}([H_{i} \oplus H_{j}])) \\ &= σ({W^{(2)}}^{⊤} \cdot \mathrm{ReLU}(W^{(1)} [H_{i} \oplus H_{j}] + b^{(1)}) + b^{(2)}) \end{aligned}

(7)

Here, ⊕ denotes feature concatenation; MLP_θ(⋅) is a feed-forward network parameterized by θ = {W^(l), b^(l)} that automatically learns feature interactions; and σ is the Sigmoid function.

While the learnable edge generator provides flexibility, relying solely on it may lead to unstable training or overfitting, especially under limited data. Therefore, we incorporate the similarity score as a prior to regularize the edge learning process. This hybrid design balances prior knowledge and data-driven learning, improving both stability and generalization. To further enhance local contextual modeling and suppress long-range noise, a masking mechanism is introduced in the text graph to constrain node connectivity. The final edge weight consists of two components: the trainable edge weight learned via f_θ, and an auxiliary term based on the prior similarity S_{i,j}. Accordingly, the edge weights of the graph are formulated as follows:

E_{i, j} = f_{θ} ([H_{i} \oplus H_{j}]) + a S_{i, j}

(8)

where a is a coefficient that weights the similarity prior.

To mitigate the noise inherent in social network data, we impose sparsity constraints on the fully connected graph. As shown in Equation (9), the final adjacency matrix A is constructed by retaining only the top-k most significant neighbors for each node, effectively suppressing long-range noise and ensuring a sparse, robust graph structure.

A_{i, j} = T (E_{i, j}; k) = \begin{cases} E_{i, j}, & \text{if } j \in \mathrm{TopK}(E_{i}) \\ 0, & \text{otherwise} \end{cases}

(9)
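The combination of learned and prior edge scores in Equation (8) and the row-wise top-k sparsification in Equation (9) can be sketched as follows. The learned scores are passed in as plain numbers rather than produced by an MLP, and the function names, the value of the coefficient, and the toy matrices are illustrative assumptions.

```python
def edge_weights(learned, prior, a=0.5):
    """Equation (8): learned edge score plus a weighted similarity prior.

    The coefficient value a=0.5 is an assumed placeholder.
    """
    return [[f + a * s for f, s in zip(f_row, s_row)]
            for f_row, s_row in zip(learned, prior)]


def topk_sparsify(E, k):
    """Equation (9): keep the k largest weights in each row, zero the rest."""
    A = []
    for row in E:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        keep = set(order[:k])
        A.append([w if j in keep else 0.0 for j, w in enumerate(row)])
    return A


E = [[0.9, 0.1, 0.5],
     [0.2, 0.8, 0.3],
     [0.4, 0.6, 0.7]]
for row in topk_sparsify(E, k=2):
    print(row)
```

Note that row-wise top-k selection does not guarantee a symmetric adjacency matrix. Whether the resulting graph is symmetrized afterwards is not stated in the text.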

The final adjacency matrix A is obtained as an approximate solution to the following optimization problem:

\min_{A} \sum_{i, j} {‖ A_{i, j} - E_{i, j} ‖}^{2} + λ Ω (A)

(10)

Here, E represents the initial fully connected similarity matrix computed from the raw data, serving as the reference for the data fidelity term; A is the final sparse adjacency matrix to be solved for, which must satisfy the structural constraints while approximating E; Ω(A) is a sparsity regularization term; and the hyperparameter λ is a balancing factor that controls the trade-off between fidelity and sparsity.

Due to the inherent heterogeneity between image and text modalities, their graph structures may exhibit inconsistent relational patterns, even when describing the same entities, which hinders effective cross-modal reasoning. To address this issue, we introduce a cross-modal semantic alignment mechanism that explicitly enforces structural consistency between modalities. Specifically, the feature matrices of the text graph H_t and the image graph H_i are projected into a shared semantic space to construct a cross-modal alignment matrix M, enabling comparable representation across modalities:

{\hat{H}}_{t} = H_{t} W_{t}, {\hat{H}}_{i} = H_{i} W_{i}

(11)

where Ĥ_t and Ĥ_i denote the aligned features in the shared space. The cross-modal alignment matrix M is then constructed by calculating the semantic consistency scores between these projected features:

M = Softmax (\frac{{\hat{H}}_{t} {\hat{H}}_{i}^{T}}{\sqrt{d}})

(12)

Based on the aligned features, the corresponding adjacency structures are normalized into probability distributions P and Q. We then minimize the bidirectional KL divergence between P and Q as a structural consistency constraint, thereby reducing cross-modal discrepancies and promoting coherent relational modeling.

L_{KL} = \sum_{i} P_{i} \log \frac{P_{i}}{Q_{i}} + \sum_{i} Q_{i} \log \frac{Q_{i}}{P_{i}}

(13)
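The bidirectional KL constraint in Equation (13) can be illustrated with a small sketch. How the adjacency structures are turned into distributions is not specified in the text, so the sum normalization with additive smoothing used here is an assumption, as are the function names and toy weights.

```python
import math


def normalize(weights, eps=1e-8):
    """Turn non-negative edge weights into a probability distribution.

    The additive smoothing eps (an assumption) keeps every entry positive
    so that the logarithms below are always defined.
    """
    shifted = [w + eps for w in weights]
    total = sum(shifted)
    return [w / total for w in shifted]


def kl(p, q):
    """KL divergence KL(p || q) for strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def bidirectional_kl(p, q):
    """Symmetric structural consistency loss: KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)


P = normalize([0.9, 0.0, 0.5])  # e.g. one row of the text-graph adjacency
Q = normalize([0.8, 0.1, 0.4])  # e.g. the matching row of the image graph
print(bidirectional_kl(P, Q))
```

The loss is zero when the two distributions coincide and grows as the relational patterns of the two graphs diverge.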

3.3.2. Cross-Modal Interaction of Graph Propagation and Attention Synergy

To enable each node to perceive the structural information of its neighborhood, GCNs are applied to both the text and image graphs. Based on the filtered adjacency relation set ε, the corresponding sparse adjacency matrix A is constructed. After L-layer propagation of the GCN, the representation of each node aggregates information from its L-hop neighbors. The implementation is as follows:

H^{(l + 1)} = GCN (H^{(l)}, A) = σ (D^{- \frac{1}{2}} A D^{- \frac{1}{2}} H^{(l)} W^{(l)})

(14)

where H^(l) is the input feature matrix of the l-th layer, W^(l) is the trainable weight matrix of the l-th layer, D is the diagonal degree matrix of A, and σ is a nonlinear activation function.
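A single propagation step of Equation (14) can be sketched in plain Python as follows. ReLU is assumed for the activation σ, and the helper names and toy graph are illustrative. No self-loops are added because Equation (14) uses A directly.

```python
import math


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def gcn_layer(H, A, W):
    """One GCN layer: ReLU(D^{-1/2} A D^{-1/2} H W), with ReLU assumed for sigma."""
    n = len(A)
    degree = [sum(row) for row in A]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    A_norm = [[inv_sqrt[i] * A[i][j] * inv_sqrt[j] for j in range(n)]
              for i in range(n)]
    Z = matmul(matmul(A_norm, H), W)
    return [[max(0.0, z) for z in row] for row in Z]


A = [[1.0, 1.0], [1.0, 1.0]]  # toy 2-node graph, fully connected
H = [[1.0, 0.0], [0.0, 1.0]]  # one-hot node features
W = [[1.0, 0.0], [0.0, 1.0]]  # identity weights for readability
print(gcn_layer(H, A, W))     # each node becomes the average of both nodes
```

Stacking such layers lets each node aggregate information from progressively larger neighborhoods, as described above.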

After obtaining node representations that incorporate structural context, it is necessary to further model the cross-modal semantic relationships between text nodes and image region nodes. To this end, a cross-modal attention (CMA) mechanism is adopted to achieve node-level interaction between the two modality graphs. Using the nodes H_t of the text graph as queries and the nodes H_i of the image graph as keys and values, the image region information most semantically relevant to each text node is aggregated via cross-modal attention. The implementation is as follows:

H_{g} = C M A (H_{t}, H_{i}, H_{i}) = Softmax (\frac{(H_{t} W^{Q}) {(H_{i} W^{K})}^{T}}{\sqrt{d_{k}}}) (H_{i} W^{V})

(15)

where W^Q, W^K, and W^V are learnable projection matrices for queries, keys, and values, respectively, and d_k is the key dimension. The semantic relevance between text nodes and image regions is scored with scaled dot-product attention. Applying the softmax function to these scores yields the attention weights, which are then used to aggregate the image features (V) into the text-aligned representation. This mechanism allows each text node to selectively focus on the most relevant visual regions.
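The cross-modal attention of Equation (15) can be sketched as single-head scaled dot-product attention with text nodes as queries and image nodes as keys and values. The function names, identity projections, and toy features are assumptions for illustration only.

```python
import math


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(H_t, H_i, W_q, W_k, W_v):
    """Text nodes (queries) attend over image nodes (keys and values)."""
    Q, K, V = matmul(H_t, W_q), matmul(H_i, W_k), matmul(H_i, W_v)
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of value vectors, one output row per text node.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights


I2 = [[1.0, 0.0], [0.0, 1.0]]
H_t = [[10.0, 0.0]]            # one text node
H_i = [[1.0, 0.0], [0.0, 1.0]]  # two image patches
out, attn = cross_modal_attention(H_t, H_i, I2, I2, I2)
print(attn[0])  # most weight falls on the first patch
```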

3.4. Channel-Wise Inhibitory Routing Fusion Module

To achieve effective collaborative modeling between visual information and textual semantics, we propose an adaptive cross-modal fusion module. Through a dual-branch cross-modal interaction mechanism, it collaboratively integrates global visual features, local object features, and image description semantics, thereby enhancing the ability of visual information to complement textual entity semantics. In the dual-branch cross-modal interaction, the contributions of different modality relationships to entity semantic modeling are inherently imbalanced. In weak alignment scenarios, the semantics derived from image descriptions often provides more direct and reliable cues than noisy visual features. To address this issue, we propose a deviation-aware channel-wise inhibitory routing (CIR) mechanism, which explicitly models the distributional consistency of feature responses and suppresses unreliable channels. Unlike conventional attention mechanisms that rely solely on importance estimation, the proposed CIR introduces a deviation-aware constraint to penalize channels that deviate from their canonical activation patterns, thereby improving robustness against noisy or misaligned multimodal features.

First, the model performs gated fusion on the global visual features H_i and the regional object features V_R to integrate the overall semantics of the image with explicit object information, thereby obtaining an enhanced visual representation:

\mathrm{gate} = σ(W_{θ} [H_{i}; V_{R}])

(16)

H_{i}' = \mathrm{gate} \cdot H_{i} + (1 - \mathrm{gate}) \cdot V_{R}

(17)

where gate is a learnable gating weight, and H_i′ denotes the fused visual representation. On this basis, the semantic features C of the image description are introduced, and a dual-branch cross-modal interaction structure is constructed. It models the semantic correlations between text and vision, and between vision and image description semantics, respectively:

H_{IT} = \mathrm{CMA}(H_{t}, H_{i}', H_{i}'), \quad H_{IC} = \mathrm{CMA}(C, H_{i}', H_{i}')

(18)

where H_IT denotes the text-vision cross-modal interaction representation, and H_IC denotes the description-vision semantic interaction representation.
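The gated fusion of Equations (16) and (17) can be sketched for a single pair of feature vectors as follows. The sketch assumes that the global and regional features share the same dimension d and that the gate is a per-dimension vector produced by a d × 2d weight matrix. These shapes, the function names, and the toy values are assumptions and are not stated in the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(h_global, v_region, W_gate):
    """Fuse global and regional features with an element-wise gate.

    W_gate has one row per output dimension and 2d columns, matching the
    concatenation [h_global; v_region] (an assumed shape).
    """
    concat = h_global + v_region  # list concatenation, length 2d
    gate = [sigmoid(sum(w * x for w, x in zip(row, concat))) for row in W_gate]
    fused = [g * h + (1.0 - g) * v for g, h, v in zip(gate, h_global, v_region)]
    return fused, gate


h_global = [1.0, 3.0]
v_region = [3.0, 1.0]
W_gate = [[0.0, 0.0, 0.0, 0.0],  # zero weights give gate = 0.5 everywhere
          [0.0, 0.0, 0.0, 0.0]]
fused, gate = gated_fusion(h_global, v_region, W_gate)
print(gate, fused)  # [0.5, 0.5] [2.0, 2.0]
```

With a gate of 0.5 the fused feature is the plain average of the two inputs. A trained gate shifts this balance per dimension.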

Specifically, given an input feature map H ∈ ℝ^{T×d}, a global feature vector z is first obtained through global average pooling (GAP) over the sequence length T, and a channel-wise relevance score vector p ∈ ℝ^d is then computed via a lightweight mapping network:

z_{c} = \mathrm{GAP}(H) = \frac{1}{T} \sum_{i = 1}^{T} H_{i, c}

(19)

p = \mathrm{MLP}_{θ}(z) = σ(W_{2} δ(W_{1} z + b_{1}) + b_{2})

(20)

where W_1 ∈ ℝ^{(C/r)×C} and W_2 ∈ ℝ^{C×(C/r)} are learnable weight matrices with a reduction ratio r (the channel count C equals the feature dimension d, since p ∈ ℝ^d), b_1 and b_2 are bias terms, δ(⋅) denotes the ReLU activation function, and σ(⋅) is the Sigmoid function that normalizes the weights into the range (0, 1).
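The pooling and scoring steps of Equations (19) and (20) can be sketched as below: average each channel over the sequence, then pass the pooled vector through a two-layer bottleneck with ReLU and Sigmoid. The function names and the toy weights (two channels, reduction ratio 2) are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, b):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def channel_relevance(H, W1, b1, W2, b2):
    """Global average pooling over the sequence, then a bottleneck MLP."""
    T = len(H)
    z = [sum(row[c] for row in H) / T for c in range(len(H[0]))]
    hidden = [max(0.0, h) for h in affine(W1, z, b1)]  # ReLU
    p = [sigmoid(s) for s in affine(W2, hidden, b2)]    # scores in (0, 1)
    return z, p


H = [[1.0, 2.0],
     [3.0, 4.0]]                # T = 2 positions, 2 channels
W1, b1 = [[1.0, -1.0]], [0.0]   # reduces 2 channels to 1 hidden unit
W2, b2 = [[1.0], [-1.0]], [0.0, 0.0]
z, p = channel_relevance(H, W1, b1, W2, b2)
print(z, p)
```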

Subsequently, a learnable parameter v_c is introduced as a reference for the canonical pattern. During training, v_c automatically adapts to the mainstream activation distribution of that channel in well-aligned samples. The deviation D_c is used to quantify the extent to which the current feature deviates from this mainstream pattern, thereby suppressing anomalous responses caused by noise or mismatches. The response deviation degree is defined as:

D_{c} = \frac{W {| z_{c} - v_{c} |}^{2}}{{‖ W {| z - v_{c} |}^{2} ‖}_{2}}

(21)

where W denotes a learnable parameter, |⋅| represents the element-wise absolute value operation, and ‖⋅‖_2 denotes the L2 norm of a vector. Based on D_c, an inhibitory soft routing mechanism is designed:

B^{(c)} = \frac{\exp (\frac{p^{(c)} - β \cdot D_{c}}{γ})}{\sum_{k = 1}^{d} \exp (\frac{p^{(k)} - β \cdot D_{k}}{γ})}

(22)

where p^{(c)} is the Sigmoid-normalized relevance score of channel c from Equation (20), γ is the temperature parameter, and β controls the maximum suppression strength. To avoid over-suppressing informative features in the early training stage, β is initialized with a relatively small value and gradually increased as training progresses, enabling a smooth transition from feature exploration to noise suppression.
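The text does not specify how β is increased. As a minimal sketch, assuming a linear ramp over training steps, the schedule could look as follows; the function name `beta_schedule` and the values `beta_start=0.1` and `beta_max=1.0` are hypothetical and are not taken from the paper.

```python
def beta_schedule(step, total_steps, beta_start=0.1, beta_max=1.0):
    """Linearly ramp the suppression strength beta from beta_start to beta_max.

    The linear shape and both endpoint values are illustrative assumptions;
    the paper only states that beta starts small and grows during training.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return beta_start + (beta_max - beta_start) * frac


# Example: beta at the start, midpoint, and end of a 1000-step run.
betas = [beta_schedule(s, 1000) for s in (0, 500, 1000)]
```

Any monotonically increasing schedule (for example, cosine or step-wise) would satisfy the same description.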

The proposed routing mechanism can be interpreted as a joint evaluation process that integrates channel importance and distributional consistency. Specifically, the relevance score p^{(c)} captures the global importance of each channel, while the deviation term D_{c} measures its inconsistency with respect to the learned canonical pattern. By combining these two factors, the model performs a reliability-aware reweighting, in which channels with high importance but large deviation are suppressed, and only those that are both informative and consistent are emphasized. This mechanism serves as an implicit validation process, enabling robust feature selection under noisy or misaligned multimodal conditions. Finally, the original features are modulated channel-wise:

{\tilde{H}}_{IT} = B^{(c)} \cdot H_{IT}, \quad {\tilde{H}}_{IC} = B^{(c)} \cdot H_{IC}

(23)

H_{out} = LayerNorm (H_{t} + CMA ({\tilde{H}}_{IT}, {\tilde{H}}_{IC}, {\tilde{H}}_{IC}))

(24)

This design not only preserves the discriminative power of features through learnable scoring but also promotes feature diversity by explicitly penalizing overly dominant channels. Ablation studies demonstrate that, compared to global weighting or standard attention mechanisms, CIR significantly improves convergence stability and final performance in multimodal tasks.
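The routing steps in Equations (19)–(23) can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the learnable parameter W in Equation (21) is assumed to be a per-channel weight vector `w`, the reference pattern is passed as a vector `v`, and all tensor sizes and the values of `beta` and `gamma` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cir_weights(H, W1, b1, W2, b2, v, w, beta=0.5, gamma=1.0):
    """Return channel routing weights B (shape (d,)) for a feature map H (shape (T, d))."""
    z = H.mean(axis=0)                                    # Eq. (19): GAP over the sequence
    p = sigmoid(W2 @ np.maximum(W1 @ z + b1, 0.0) + b2)   # Eq. (20): relevance in (0, 1)
    dev = w * np.abs(z - v) ** 2                          # weighted squared deviation
    D = dev / (np.linalg.norm(dev) + 1e-12)               # Eq. (21): L2-normalized deviation
    logits = (p - beta * D) / gamma
    logits = logits - logits.max()                        # shift for numerical stability
    return np.exp(logits) / np.exp(logits).sum()          # Eq. (22): softmax over channels


rng = np.random.default_rng(0)
T, d, r = 6, 8, 2
H = rng.normal(size=(T, d))
W1, b1 = rng.normal(size=(d // r, d)), np.zeros(d // r)
W2, b2 = rng.normal(size=(d, d // r)), np.zeros(d)
v = H.mean(axis=0)   # reference equal to the current mean, so every deviation is zero
w = np.ones(d)

B = cir_weights(H, W1, b1, W2, b2, v, w)
H_mod = H * B        # Eq. (23): channel-wise modulation, broadcast over T
```

Shifting the reference `v` away from `z` in a single channel raises that channel's deviation and lowers its routing weight, which is the inhibitory behaviour described above.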

3.5. Modal Fusion and Decoding

To integrate the dual-stream features, the model performs feature fusion via a dynamic gating mechanism:

H_{fused} = g \cdot H_{out} + (1 - g) \cdot H_{g}, \quad g = σ (W_{g} [H_{out}; H_{g}])

(25)

where W_{g} \in ℝ^{2 d \times 1} is a learnable parameter matrix, σ denotes the Sigmoid function, and the resulting gating weights satisfy g \in [0, 1]^{n \times 1}.
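As a shape-level illustration of Equation (25), the following NumPy sketch computes one scalar gate per token and blends the two streams. The sizes `n = 4` and `d = 6` and the random inputs are hypothetical; the concatenated features are multiplied as `[H_out; H_g] @ W_g` so that the stated dimensions (2d × 1 for W_g, n × 1 for g) agree.

```python
import numpy as np


def gated_fusion(H_out, H_g, W_g):
    """Blend two (n, d) feature streams with a per-token scalar gate."""
    concat = np.concatenate([H_out, H_g], axis=-1)   # (n, 2d)
    g = 1.0 / (1.0 + np.exp(-(concat @ W_g)))        # (n, 1), Sigmoid gate
    return g * H_out + (1.0 - g) * H_g, g            # gate broadcasts over d


rng = np.random.default_rng(0)
n, d = 4, 6
H_out = rng.normal(size=(n, d))
H_g = rng.normal(size=(n, d))
W_g = rng.normal(size=(2 * d, 1))

H_fused, g = gated_fusion(H_out, H_g, W_g)
```

Because the gate is a convex weight, every entry of `H_fused` lies between the corresponding entries of the two input streams.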

This study introduces Conditional Random Fields (CRF) as the decoding layer to model the global dependencies of label sequences. Given the fusion feature H, the conditional probability of the label sequence y is defined as

P (y ∣ H) = \frac{1}{Z (H)} \exp (\sum_{i = 1}^{n} ϕ (y_{i - 1}, y_{i}, H, i))

(26)

Z (H) = \sum_{y^{'} \in Y} \exp (\sum_{i = 1}^{n} ϕ ({y^{'}}_{i - 1}, {y^{'}}_{i}, H, i))

(27)

where ϕ (∙) denotes the potential function at position i (including both emission and transition scores), Z(H) is the normalization partition function, and Y represents the space of all valid label sequences.
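To make Equations (26) and (27) concrete, the sketch below evaluates the partition function of a tiny linear-chain CRF in two ways: by enumerating every label sequence and by the forward recursion. The emission and transition scores are made-up numbers, the potential is assumed to decompose as transition plus emission, and the first position is assumed to have no incoming transition.

```python
import itertools
import math


def seq_score(y, emit, trans):
    """Sum of potentials for one label sequence (emission + transition)."""
    s = emit[0][y[0]]
    for i in range(1, len(y)):
        s += trans[y[i - 1]][y[i]] + emit[i][y[i]]
    return s


def log_partition(emit, trans):
    """log Z(H) via the forward algorithm (no overflow guard; fine for small scores)."""
    L = len(emit[0])
    alpha = list(emit[0])
    for i in range(1, len(emit)):
        alpha = [
            math.log(sum(math.exp(alpha[a] + trans[a][b]) for a in range(L))) + emit[i][b]
            for b in range(L)
        ]
    return math.log(sum(math.exp(a) for a in alpha))


# Hypothetical scores: 3 positions, 3 labels.
emit = [[1.0, 0.2, 0.1], [0.3, 1.5, 0.2], [0.1, 0.4, 0.9]]
trans = [[0.5, 0.1, -0.3], [0.0, 0.6, 0.2], [-0.2, 0.1, 0.4]]

log_Z = log_partition(emit, trans)
brute = math.log(sum(math.exp(seq_score(y, emit, trans))
                     for y in itertools.product(range(3), repeat=3)))
prob = math.exp(seq_score((0, 1, 2), emit, trans) - log_Z)  # P(y | H) for one sequence
```

The enumeration costs O(|labels|^n) while the forward recursion costs O(n · |labels|²), which is why the latter is used in practice.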

The optimization objective of the model is to simultaneously ensure the accuracy of label prediction and the consistency of multimodal features. The total loss function L consists of the negative log-likelihood loss L_{NLL} from the CRF decoding layer and the bidirectional KL divergence loss L_{KL} (see Section 3.3.1) for cross-modal feature alignment, weighted by the coefficient λ:

L = \underbrace{- \log P (y | H)}_{L_{NLL}} + λ \cdot \underbrace{(\sum_{i} P_{i} \log \frac{P_{i}}{Q_{i}} + \sum_{i} Q_{i} \log \frac{Q_{i}}{P_{i}})}_{L_{KL}}

(28)
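A small numeric sketch of Equation (28) is given below. Here `P` and `Q` stand for the two distributions being aligned (defined in Section 3.3.1), `log_prob_gold` stands for log P(y | H) from the CRF, and the weight `lam=0.1` is a hypothetical value, since the setting of λ is not stated in this section.

```python
import math


def bidirectional_kl(P, Q, eps=1e-12):
    """KL(P || Q) + KL(Q || P) for two discrete distributions of equal length."""
    kl_pq = sum(p * math.log((p + eps) / (q + eps)) for p, q in zip(P, Q))
    kl_qp = sum(q * math.log((q + eps) / (p + eps)) for p, q in zip(P, Q))
    return kl_pq + kl_qp


def total_loss(log_prob_gold, P, Q, lam=0.1):
    """Negative log-likelihood plus the weighted alignment term."""
    return -log_prob_gold + lam * bidirectional_kl(P, Q)


P = [0.7, 0.2, 0.1]
Q = [0.5, 0.3, 0.2]
loss = total_loss(math.log(0.4), P, Q)
```

The alignment term is symmetric in P and Q and vanishes when the two distributions coincide, so it only penalizes disagreement between them.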

During inference, the model computes the globally optimal label sequence by maximizing the conditional probability:

y^{*} = \arg\max_{y \in Y} P_{θ} (y | H)

(29)

where y^{*} is the predicted optimal label sequence, Y denotes the set of all valid label sequences, and P_{θ} (y | H) is the conditional probability distribution determined by the model parameters θ and the fused features H.
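For a linear-chain CRF, the maximization in Equation (29) is typically solved with the Viterbi algorithm; the paper does not name the decoder, so this is an assumption. The sketch below uses made-up emission and transition scores and assumes the first position has no incoming transition.

```python
def viterbi(emit, trans):
    """Return the highest-scoring label sequence and its score.

    emit[i][b]  : emission score of label b at position i
    trans[a][b] : transition score from label a to label b
    """
    L = len(emit[0])
    score = list(emit[0])
    back = []                                   # back[i-1][b] = best previous label
    for i in range(1, len(emit)):
        new_score, ptr = [], []
        for b in range(L):
            best_a = max(range(L), key=lambda a: score[a] + trans[a][b])
            ptr.append(best_a)
            new_score.append(score[best_a] + trans[best_a][b] + emit[i][b])
        score = new_score
        back.append(ptr)
    last = max(range(L), key=lambda b: score[b])
    path = [last]
    for ptr in reversed(back):                  # follow back-pointers to the start
        path.append(ptr[path[-1]])
    return path[::-1], score[last]


# Hypothetical scores: 3 positions, 3 labels.
emit = [[1.0, 0.2, 0.1], [0.3, 1.5, 0.2], [0.1, 0.4, 0.9]]
trans = [[0.5, 0.1, -0.3], [0.0, 0.6, 0.2], [-0.2, 0.1, 0.4]]

path, best = viterbi(emit, trans)   # path == [0, 1, 2]
```

Since Z(H) does not depend on y, maximizing the unnormalized score is equivalent to maximizing the conditional probability.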

4. Experiments

This section presents the overall experimental setup and evaluation of the proposed VSGN model. We introduce the datasets and implementation details, followed by comparisons with baseline methods. In addition, ablation studies and further analyses are conducted to validate the effectiveness and robustness of the model.

4.1. Dataset

This study conducts experiments on two widely used benchmark datasets for Multimodal Named Entity Recognition (MNER), namely Twitter-2015 [27] and Twitter-2017 [7]. Both datasets are collected from the social media platform Twitter and consist of tweets paired with corresponding images, forming multimodal samples that integrate textual and visual information. Each tweet is manually annotated with four predefined named entity categories: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). These annotations enable the evaluation of models in identifying entities from multimodal contexts where textual information alone may be incomplete or ambiguous. The detailed statistical information of the datasets is summarized in Table 1.

4.2. Parameter Settings

For a fair comparison, the same hyperparameter settings are applied to both the Twitter-2015 and Twitter-2017 datasets. All experiments are implemented using the PyTorch 2.4.0 deep learning framework on a server equipped with an NVIDIA RTX 4090 GPU. For textual representation, the pre-trained bert-base-uncased model is employed as the text encoder, while visual features are extracted using the CLIP-ViT-B/32 model as the image encoder. The hidden dimensions of both textual and visual representations are set to 768. During training, the proposed model is optimized using the AdamW optimizer for 40 epochs with a batch size of 8. The initial learning rate is set to 3 × 10⁻⁵, with a warm-up ratio of 0.01 applied at the beginning of training. To alleviate overfitting, a dropout rate of 0.5 is adopted. In addition, the maximum sentence length is set to 80 tokens, and the neighbor-sampling hyperparameter K (see Section 4.8) is set to 5.

4.3. Compared Baselines

To comprehensively evaluate the performance of multimodal named entity recognition (MNER) models, we adopt standard metrics widely used in sequence labeling tasks: Precision (P), Recall (R), and F1-score (F1). These are computed at the entity level—a predicted entity is considered correct only if both its span and type match the ground truth.
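The entity-level matching rule can be illustrated with a short sketch. It assumes BIO-style tags (the tagging scheme is not stated in this section), and the helper names `extract_entities` and `prf1` are hypothetical.

```python
def extract_entities(tags):
    """Collect (start, end, type) spans from a BIO tag sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):      # sentinel closes a trailing span
        if start is not None and (tag == "O" or tag.startswith("B-") or tag[2:] != etype):
            spans.add((start, i - 1, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, etype = i, tag[2:]                 # lenient: a stray I- opens a span
    return spans


def prf1(gold_seqs, pred_seqs):
    """Entity-level precision, recall, and F1: span and type must both match."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# One entity is correct; the other has the right span but the wrong type.
gold = [["B-PER", "I-PER", "O", "B-ORG", "I-ORG"]]
pred = [["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]]
p, r, f1 = prf1(gold, pred)   # all three equal 0.5
```

A prediction with the correct span but the wrong type counts as both a false positive and a false negative, which is why type confusions such as ORG versus LOC lower precision and recall at once.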

To evaluate how well the proposed model addresses cross-modal fusion, semantic consistency, and image noise suppression, we select several representative unimodal and multimodal models for comparison. The unimodal baselines are traditional sequence labeling models: CNN + BiLSTM + CRF [28], which enhances character-level representations through convolutional operations to improve adaptability to social media text; BiLSTM + CRF [29], which employs bidirectional recurrent neural networks to capture contextual dependencies and uses a CRF layer for structured prediction; and BERT-CRF [30], which introduces a CRF decoding layer on top of the BERT encoder to further enhance sequence labeling performance. The multimodal baselines include several representative MNER approaches. UMGF [22] proposes an object-guided multimodal graph fusion framework for named entity recognition. UMT [8] introduces a multimodal interaction module along with an auxiliary entity span detection task to improve cross-modal representation learning. HVPNet [31] presents a hierarchical visual prefix fusion network that enhances the robustness of multimodal entity and relation extraction. MNER-QG [32] proposes an end-to-end framework for jointly learning multimodal named entity recognition and query localization. DebiasCL [33] employs a debiased contrastive learning strategy to mitigate visual bias in multimodal entity recognition. MGCMT [34] integrates multi-granularity visual cues with enhanced textual representations to improve recognition accuracy. MAF [35] alleviates the image–text mismatch problem through a matching and alignment mechanism that improves cross-modal consistency. AMLR [36] proposes an adaptive multi-scale linguistic enhancement mechanism to facilitate entity-level cross-modal interaction. Vec-MNER [16] further enhances multimodal entity recognition by combining visual enhancement with cross-modal interaction.

To ensure the fairness of the experiment, the results of the baseline models mentioned above are taken directly from their original papers. We follow the standard experimental setup commonly adopted in MNER research, using exactly the same dataset split and evaluation scheme. This approach ensures that the performance gains of our VSGN model are evaluated under consistent conditions and compared against the known best-performing methods.

4.4. Effectiveness

Among the unimodal methods (Table 2 and Table 3), the BERT-CRF framework demonstrates strong performance on sequence labeling tasks. Nevertheless, the proposed VSGN model still achieves superior results on both datasets. Specifically, on the Twitter-2015 dataset, VSGN obtains an overall F1-score of 76.72%, surpassing BERT-CRF by 5.63 percentage points. On the Twitter-2017 dataset, VSGN achieves an F1-score of 87.86%, outperforming BERT-CRF by 4.42 percentage points. These results indicate that incorporating visual information can effectively compensate for the limitations of purely textual representations and improve entity recognition performance.

Compared with existing multimodal methods, the proposed VSGN model achieves competitive or superior performance across both datasets. Relative to the representative multimodal model MGCMT, VSGN improves the overall F1-score by 3.47 percentage points on Twitter-2015 and 2.17 percentage points on Twitter-2017. Relative to the recent AMLR model, VSGN still achieves higher performance, with improvements of 1.41 and 0.93 percentage points on Twitter-2015 and Twitter-2017, respectively. These results demonstrate that the proposed model effectively enhances multimodal feature interaction and improves the robustness of multimodal entity recognition.

Further analysis at the entity level reveals the advantages of VSGN. The model achieves strong performance across most entity categories, particularly for PER and ORG entities. On the Twitter-2017 dataset, VSGN achieves F1-scores of 94.01% for PER, 85.20% for ORG, and 74.83% for MISC, which are among the best results reported by the compared methods. These improvements can be attributed to the collaborative interaction between the graph-structure alignment stream and the channel-wise inhibitory routing (CIR) fusion stream, which enables the model to better capture fine-grained correspondences between visual objects and textual entities while suppressing irrelevant visual noise.

Overall, the experimental results demonstrate that VSGN consistently outperforms a wide range of existing methods on both the Twitter-2015 and Twitter-2017 datasets. This confirms the effectiveness of the proposed model in improving multimodal feature fusion, enhancing cross-modal semantic consistency, and reducing the interference caused by visual noise in multimodal named entity recognition.

4.5. Ablation Study

To evaluate the contribution of each component in the proposed VSGN model, we conducted ablation experiments on the Twitter-2015 and Twitter-2017 datasets. Specifically, we considered four ablated variants: (1) w/o VSG, which removes the visual–semantic guided graph structure learning module; (2) w/o CIR, which replaces the CIR mechanism with simple fusion strategies such as direct concatenation or weighted summation; (3) w/o Description, which removes the generated visual descriptions; and (4) w/o VSG & CIR, which removes both the VSG and CIR modules. The performance is reported in terms of precision (P), recall (R), and F1 score.

As shown in Table 4, the complete VSGN model achieves the best performance on both social media datasets, with F1 scores of 76.72% on Twitter-2015 and 87.86% on Twitter-2017, demonstrating the overall effectiveness of the proposed framework.

When removing the VSG module, the performance drops to 75.76% F1 on Twitter-2015 and 87.03% F1 on Twitter-2017, indicating its important role in modeling cross-modal structural relationships between text tokens and visual regions. By constructing a heterogeneous graph based on visual semantics and employing a GCN with adaptive edge weighting, the model captures local topological dependencies and facilitates structural semantic alignment across modalities.

To further demonstrate the effectiveness of the proposed module, we visualize the attention distribution of the VSG module, as shown in Figure 4. The figure presents the attention map produced by the Visual–Semantic Guided graph structure learning (VSG) module. The attention responses exhibit a structured pattern, where darker colors indicate stronger relevance. Notably, the responses along the diagonal region are relatively darker, suggesting strong and consistent semantic associations between visual features and the corresponding textual tokens. These results indicate that the VSG module effectively captures semantic relationships across modalities and facilitates structured cross-modal interaction.

When replacing the CIR mechanism with simple fusion strategies such as direct concatenation or weighted summation, the F1 scores further decrease to 75.47% and 86.98% on Twitter-2015 and Twitter-2017, respectively. This demonstrates that naive fusion methods are insufficient for handling noisy or redundant visual information. In contrast, the CIR mechanism introduces a channel-wise suppression routing strategy that selectively highlights informative visual channels while suppressing irrelevant ones, enabling more refined feature interaction and improving the quality of multimodal representations.

Removing the generated visual descriptions leads to a more noticeable performance decline, with F1 scores dropping to 74.54% on Twitter-2015 and 86.64% on Twitter-2017. This suggests that visual descriptions provide complementary semantic cues that enhance textual understanding, especially in short and noisy social media contexts.

When both the VSG and CIR modules are removed, the model experiences the most significant degradation, achieving F1 scores of 73.71% and 84.59% on the two datasets, respectively. This result highlights the complementarity of the two modules, where VSG focuses on cross-modal structural alignment at the graph level, while CIR improves feature fusion at the channel level.

Overall, the ablation results demonstrate that each component of VSGN contributes positively to the final performance, and their joint optimization leads to substantial improvements in multimodal named entity recognition.

4.6. Case Study

4.6.1. Successful Cases

To further investigate the effectiveness and robustness of the proposed VSGN model in complex multimodal scenarios, we present three representative cases from the Twitter dataset, as illustrated in Figure 5. These examples are carefully selected to reflect different types of challenges, including insufficient textual context, semantic ambiguity, and noisy social media expressions.

The first case involves the entity “Gigi Buffon” in a sports-related post. Due to the lack of explicit contextual cues in the text, baseline models exhibit significant misclassification errors. Specifically, UMGF and AMLR incorrectly classify “Gigi Buffon” as MISC, while HVPNet predicts it as O, revealing a failure to capture meaningful semantic signals. Similarly, the entity “Leicester City” is consistently misclassified as LOC by all baseline models, suggesting that these methods tend to rely heavily on surface-level lexical patterns rather than deeper semantic understanding. In contrast, our model correctly identifies “Gigi Buffon” as a person (PER) and “Leicester City” as an organization (ORG). This improvement can be attributed to the proposed visual–semantic fusion mechanism, which integrates generated visual descriptions with fine-grained visual features. Such a design provides richer contextual evidence, enabling the model to associate the textual mention with relevant visual semantics, thereby improving entity type discrimination.

The second case focuses on the entity “Phoenix” in the context of the movie Harry Potter and the Order of the Phoenix. This example highlights the challenge of semantic ambiguity, where the same entity can correspond to different types depending on context. In this case, UMGF and AMLR classify “Phoenix” as O, failing to recognize it as a meaningful entity, while HVPNet incorrectly predicts it as PER due to its bias toward dominant entity types in textual patterns. Although HVPNet and AMLR correctly identify “Harry Potter” as PER, they fail to properly model the semantic relationship between “Phoenix” and the film’s context. In contrast, our model successfully classifies “Phoenix” as MISC while maintaining correct recognition of “Harry Potter”. This demonstrates that the VSGN model can effectively leverage cross-modal contextual cues, particularly visual information from the movie scene, to disambiguate entity semantics. The visual–semantic guided interaction allows the model to align textual context with relevant visual concepts, thereby resolving ambiguity that cannot be addressed by text-only or weakly aligned multimodal methods.

The third case involves the entity “GPISDECHS”, which appears in a sports-related social media post and represents an abbreviated organization name. This scenario is particularly challenging due to noisy text, informal expressions, and the absence of explicit semantic clues. As shown in Figure 5, UMGF and HVPNet fail to correctly identify the entity type, predicting it as MISC or O, respectively. Although AMLR correctly classifies “Trevino” as PER, it still misclassifies “GPISDECHS”, indicating its limited ability to handle rare or non-standard entity mentions. In contrast, our model accurately identifies “GPISDECHS” as an organization (ORG) and “Trevino” as a person (PER). This superior performance stems from the proposed visual–semantic guided graph learning mechanism, which explicitly models cross-modal structural relationships. By constructing interactions between textual tokens and visual regions, the model can propagate complementary information across modalities, enabling more robust entity representation even under noisy conditions.

Overall, these case studies demonstrate that the proposed VSGN model consistently outperforms existing approaches in challenging multimodal settings. By effectively integrating visual–semantic cues and modeling cross-modal interactions, the model not only mitigates issues arising from insufficient textual context and semantic ambiguity but also enhances robustness against noisy and informal data. This leads to more accurate and reliable entity recognition, highlighting the practical value of the proposed approach.

4.6.2. Error Analysis

To further analyze the limitations of the proposed model, we randomly sample representative error cases from the test set and categorize them into three types, as illustrated in Figure 6.

The first category is annotation bias. As shown in Figure 6a, the entity “Frank Erwin Center” is labeled as “ORG” in the dataset, while the model predicts it as “LOC”. However, such entities (e.g., stadiums or event venues) can reasonably be interpreted as either organizations or locations depending on the annotation standard. Therefore, this type of error is not solely caused by model deficiency, but also reflects the inherent ambiguity and inconsistency in the annotation scheme.

The second category is irregular social media structure. As illustrated in Figure 6b, the text contains informal expressions such as “RT” and user mentions, along with account-like tokens such as “MensFitnessWire”. Although it is annotated as an organization, the model predicts it as “O”, suggesting that the model tends to treat such tokens as noisy or non-entity elements due to their lack of clear semantic structure. This indicates that non-standard linguistic patterns in social media pose challenges for accurate entity recognition.

The third category is lack of background knowledge. In Figure 6c, the entity “One Piece” is labeled as “MISC”, but the model fails to recognize it and predicts “O”. This is mainly because the textual and visual context provides limited clues, and correctly identifying such entities often requires external knowledge (e.g., recognizing it as a well-known anime). Without sufficient background knowledge, the model struggles to capture the underlying semantics.

Overall, these errors arise from different sources, including annotation bias, structural noise in social media text, and insufficient background knowledge, highlighting the remaining challenges in multimodal named entity recognition.

4.7. t-SNE Visualization of Entity Feature Distributions

To examine the multimodal feature representations learned by the VSGN model and its classification capability, t-SNE was employed to reduce the dimensionality of the entity representations on the Twitter-2017 test set and to visualize them under both the ground-truth labels and the predicted labels, as shown in Figure 7.

The high-dimensional features output by the model are mapped to a two-dimensional space, where points in different colors represent different entity types. Comparing the ground truth label distribution in Figure 7a with the predicted label distribution in Figure 7b, it can be observed that they exhibit a high degree of consistency in both topological structure and distribution density. In the feature space, entities of the same type form tight clusters, while distinct boundaries are formed between different entity types. It is worth noting that the “O” label, representing non-entity categories, exhibits a relatively dispersed distribution pattern.

However, this is not a limitation of the model but rather a reasonable representation consistent with linguistic characteristics. Since the “O” label encompasses background words of various parts of speech, such as verbs, prepositions, and common nouns, its semantic space inherently possesses high diversity and contextual dependency. This dispersed distribution demonstrates that the model not only retains rich background contextual information but also successfully decouples and isolates named entities with clear semantic orientations from the complex background context.

The experimental results further validate that the VSGN model, while capturing fine-grained contextual semantics, maintains precise discriminative capability for core entity categories.

4.8. Sensitivity Analysis of the K Parameter

In the visual semantic-enhanced graph structure learning phase (see Section 3.3.1), the model adopts a fixed-size neighbor sampling strategy, retaining the Top-K neighbors for each node to construct its information aggregation range. Since the setting of the sampling number K directly determines the scale and quality of candidate neighbors in the information aggregation process, the choice of this hyperparameter has a significant impact on model performance. To analyze the influence of K on model performance, experiments were conducted by training the model under different K values while keeping the number of training epochs consistent. The experimental results are shown in Figure 8.
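The fixed-size neighbor sampling can be sketched as a row-wise Top-K selection over a node similarity matrix. The cosine similarity, the exclusion of self-loops, and the graph size below are illustrative assumptions; the actual scoring function is the one defined in Section 3.3.1.

```python
import numpy as np


def top_k_neighbors(sim, k):
    """Keep the k most similar neighbors of every node.

    sim : (n, n) similarity matrix
    Returns the neighbor indices (n, k) and a 0/1 adjacency matrix (n, n).
    """
    sim = sim.astype(float).copy()
    np.fill_diagonal(sim, -np.inf)             # exclude self-loops (assumption)
    idx = np.argsort(-sim, axis=1)[:, :k]      # k largest similarities per row
    adj = np.zeros_like(sim)
    rows = np.arange(sim.shape[0])[:, None]
    adj[rows, idx] = 1.0
    return idx, adj


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))                        # 8 hypothetical node features
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
sim = Xn @ Xn.T                                     # cosine similarity

idx, adj = top_k_neighbors(sim, k=5)
```

Each row of `adj` contains exactly K ones, so K directly bounds how many neighbors contribute to the aggregation for each node.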

When K is small, the model can only capture limited neighborhood structural information, making it difficult to fully capture potential semantic associations between nodes, resulting in relatively low overall performance. As K gradually increases, the model can cover more potential semantic neighbors, allowing for more comprehensive integration of neighborhood information, and model performance improves, reaching its optimum at K = 5. However, when K continues to increase, an excessive number of neighbor nodes introduces some weakly correlated or even noisy information, and these irrelevant or redundant neighborhood features interfere with the core semantic aggregation process, thereby diminishing the model’s representation capability and recognition performance to some extent.

Although the sensitivity to K follows the same trend on both datasets, their absolute performance levels differ markedly, which we attribute primarily to differences in data quality:

(1) The Twitter-2015 dataset is of lower quality: a large number of its samples lack image information, which is crucial for the multimodal named entity recognition task.
(2) Twitter-2015 also contains many highly irrelevant samples, which act as noise and increase the difficulty of model learning. In contrast, the Twitter-2017 dataset is cleaner and more complete, enabling the model to learn the underlying patterns in the data more effectively.

4.9. Low-Resource Experiment

We further conduct experiments under resource-constrained settings. Specifically, we randomly sample 10% to 50% of the original training data to construct subsets with limited resources. Figure 9 presents the performance comparison of VSGN and several baseline models on the Twitter-2015 and Twitter-2017 datasets.

Overall, VSGN consistently outperforms all baselines across different data proportions. Notably, the performance gain is more pronounced in low-resource scenarios, demonstrating the effectiveness of the proposed model in utilizing limited training data.
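One way to construct such subsets reproducibly is seeded sampling without replacement, as sketched below. The seed, the training-set size, and the choice to draw each proportion independently (rather than nesting smaller subsets inside larger ones) are assumptions; the paper states only that 10% to 50% of the training data is randomly sampled.

```python
import random


def sample_subset(train_ids, fraction, seed=42):
    """Draw a reproducible random subset containing `fraction` of the training IDs."""
    ids = list(train_ids)
    k = round(len(ids) * fraction)
    return random.Random(seed).sample(ids, k)   # sampling without replacement


train_ids = range(1000)                          # hypothetical training-set size
subsets = {f: sample_subset(train_ids, f) for f in (0.1, 0.2, 0.3, 0.4, 0.5)}
```

Fixing the seed keeps the subsets identical across models, so differences in the low-resource curves reflect the models rather than the sampled data.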

4.10. Computational Cost Analysis

To evaluate the practical applicability of the proposed VSGN model, we conducted a comprehensive analysis of its computational efficiency, including parameter scale, training time, inference speed, and memory consumption.

As shown in Table 5, VSGN comprises 142.7 million parameters, which is more than AMLR (136.4 million) and BERT-CRF (110.5 million). This increase is primarily attributed to the introduction of the visual–semantic guided graph structure learning (VSG) module and the channel-wise inhibitory routing (CIR) mechanism, both of which require additional computations to model fine-grained cross-modal interactions.

In terms of training efficiency, VSGN requires 1.3 h per training epoch, which is slightly longer than AMLR (1.2 h). This computational overhead stems primarily from dynamic graph construction and routing operations, which introduce additional computations during forward and backward propagation. In terms of inference speed, VSGN achieves 36 samples per second, a decrease compared to AMLR (42 samples/second). This decrease is mainly attributed to graph-based interactions and dynamic routing processes, which increase inference complexity.

In terms of memory consumption, VSGN occupies 13.6 GB of GPU memory, which is higher than BERT-CRF (4.2 GB) but comparable to AMLR (12.4 GB). This overhead primarily stems from the learning of dynamic graph structures and the intermediate representations involved in cross-modal feature interactions. Nevertheless, this memory consumption remains within an acceptable range, indicating that the proposed method achieves a reasonable trade-off between performance gains and resource overhead.

Despite the increased computational cost, the model’s performance improves significantly. Ablation results further demonstrate that removing the VSG module (w/o VSG) reduces the parameter count to 132.3 million and increases inference speed to 47 samples/second, but leads to a significant performance drop (see Section 4.5). This indicates that although the VSG module incurs higher computational costs, it plays a crucial role in enhancing cross-modal representation learning.
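The relative overheads implied by the figures quoted in this section can be computed directly. The snippet below uses only numbers stated above (parameter counts in millions, throughput in samples per second); the helper `rel_change` is introduced here for illustration.

```python
def rel_change(new, base):
    """Relative change of `new` with respect to `base`, in percent."""
    return 100.0 * (new - base) / base


params = {"VSGN": 142.7, "AMLR": 136.4, "BERT-CRF": 110.5, "w/o VSG": 132.3}
speed = {"VSGN": 36, "AMLR": 42, "w/o VSG": 47}

param_vs_amlr = rel_change(params["VSGN"], params["AMLR"])       # extra parameters vs. AMLR
param_vs_bert = rel_change(params["VSGN"], params["BERT-CRF"])   # extra parameters vs. BERT-CRF
speed_vs_amlr = rel_change(speed["VSGN"], speed["AMLR"])         # throughput change vs. AMLR
speed_without_vsg = rel_change(speed["w/o VSG"], speed["VSGN"])  # speed-up when VSG is removed

for name, value in [("params vs AMLR", param_vs_amlr),
                    ("params vs BERT-CRF", param_vs_bert),
                    ("speed vs AMLR", speed_vs_amlr),
                    ("speed w/o VSG vs full", speed_without_vsg)]:
    print(f"{name}: {value:+.1f}%")
```

Expressing the costs as relative changes makes the trade-off easier to weigh against the F1 differences reported in Section 4.5.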

Overall, VSGN achieves a good balance between computational cost and performance. Although it introduces some overhead compared to baseline models, the significant performance gains validate the effectiveness and necessity of the proposed design, making it a practical solution for multimodal named entity recognition tasks.

5. Conclusions and Outlook

This paper proposes a Visual–Semantic Guided Interaction Network for multimodal named entity recognition, aiming to address the limitations of existing methods in handling noisy social media data, insufficient semantic modeling, and coarse cross-modal alignment. The proposed framework introduces an adaptive visual–semantic fusion mechanism to enable symmetric cross-modal interaction, allowing textual representations to be enhanced with complementary visual cues. In particular, a channel-wise inhibitory routing module is incorporated to selectively suppress redundant or noisy cross-modal signals, thereby improving the discriminability of fused features. By jointly modeling fine-grained semantic relationships and adaptive cross-modal interactions, the proposed method achieves more effective and robust multimodal representation learning. Furthermore, a visual–semantic guided graph structure learning module is developed to explicitly model structured relationships between textual tokens and visual regions, enabling fine-grained cross-modal alignment through bidirectional interaction and improving semantic consistency. Extensive experiments on benchmark datasets demonstrate the effectiveness and robustness of the proposed approach. Compared with existing methods, VSGN achieves superior performance in complex scenarios involving ambiguous expressions and noisy contexts, highlighting the importance of integrating structured modeling and cross-modal semantic guidance.

Despite its effectiveness, the proposed framework introduces additional computational overhead due to graph structure learning and cross-modal interaction. In particular, the construction of heterogeneous graphs and dynamic edge modeling increases model complexity, which may affect efficiency in large-scale or real-time applications. Moreover, as the current evaluation is mainly conducted on social media datasets, its applicability to broader domains remains to be further explored.

In future work, we will focus on developing more efficient graph learning strategies to improve scalability while enhancing generalization across diverse scenarios. Additionally, we plan to incorporate large-scale pre-trained multimodal models and extend the framework to more complex tasks, such as event extraction and multimodal reasoning.

Author Contributions

J.Y.: conceptualization, supervision, project administration, writing—review and editing. Z.Z.: methodology, software, investigation, formal analysis, validation, writing—original draft, writing—review and editing. J.Z.: investigation, formal analysis, visualization, writing—original draft. R.L.: resources, data curation, formal analysis, writing—original draft. Z.Q.: conceptualization, supervision, funding acquisition, project administration. All authors have read and agreed to the published version of the manuscript.

Funding

This study was financially supported by the Lanzhou Youth Science and Technology Talent Innovation Project (no. 2025-QN-118) and the National Natural Science Foundation of China (no. 62567008).

Data Availability Statement

The empirical findings of this work rely on the Twitter-2015 and Twitter-2017 benchmark datasets. These collections of anonymized, multi-modal Twitter posts are publicly hosted at https://github.com/CopotronicRifat/TwitterDataMABSA (last accessed 7 January 2026). We confirm that all data were obtained and processed in accordance with Twitter’s Developer Agreement and relevant academic research guidelines. Comprehensive documentation regarding these datasets is available in our cited references [7,27].

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Cui, L.; Wu, Y.; Liu, J.; Yang, S.; Zhang, Y. Template-Based Named Entity Recognition Using BART. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, 1–6 August 2021; pp. 1835–1845.
2. Jia, S.; Ding, L.; Chen, X.; E, S.; Xiang, Y. Incorporating Uncertain Segmentation Information into Chinese NER for Social Media Text. In Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media, Online, 13 December 2020; pp. 51–60.
3. Wang, X.; Tian, J.; Gui, M.; Li, Z.; Ye, J.; Yan, M.; Xiao, Y. PromptMNER: Prompt-Based Entity-Related Visual Clue Extraction and Integration for Multimodal Named Entity Recognition. In Proceedings of the 27th International Conference on Database Systems for Advanced Applications (DASFAA 2022), Virtual Event, 11–14 April 2022; pp. 297–305.
4. Arshad, O.; Gallo, I.; Nawaz, S.; Calefati, A. Aiding Intra-Text Representations with Visual Context for Multimodal Named Entity Recognition. In Proceedings of the IEEE International Conference on Document Analysis and Recognition (ICDAR 2019), Barcelona, Spain, 20–25 September 2019; pp. 337–342.
5. Lu, J.; Zhang, D.; Zhang, J.; Zhang, P. Flat Multi-modal Interaction Transformer for Named Entity Recognition. In Proceedings of the 29th International Conference on Computational Linguistics (COLING 2022), Gyeongju, Republic of Korea, 12–17 October 2022; pp. 2055–2064.
6. Moon, S.; Neves, L.; Carvalho, V. Multimodal Named Entity Recognition for Short Social Media Posts. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 1–6 June 2018; pp. 852–860.
7. Zhang, Q.; Fu, J.; Liu, X.; Huang, X. Adaptive Co-attention Network for Named Entity Recognition in Tweets. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI 2018), New Orleans, LA, USA, 2–7 February 2018; Volume 32, pp. 1–8.
8. Yu, J.; Jiang, J.; Yang, L.; Xia, R. Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online, 5–10 July 2020; pp. 3342–3352.
9. Bao, X.; Tian, M.; Wang, L.; Zha, Z.; Qin, B. Contrastive Pre-training with Multi-level Alignment for Grounded Multimodal Named Entity Recognition. In Proceedings of the 2024 International Conference on Multimedia Retrieval (ICMR ’24), Utrecht, The Netherlands, 1–4 July 2024; pp. 795–803.
10. Xu, B.; Jiang, H.; Wei, J.; Jing, H.; Du, M.; Song, H.; Wang, H.; Xiao, Y. Enhancing Multi-modal Named Entity Recognition through Adaptive Mixup Image Augmentation. In Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025), Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 1802–1812.
11. Wang, J.; Zhou, Y.; He, Q.; Zhang, W. A Dual-Enhanced Hierarchical Alignment Framework for Multimodal Named Entity Recognition. Appl. Sci. 2025, 15, 6034.
12. Wang, P.; Chen, X.; Shang, Z.; Ke, W. Multimodal Named Entity Recognition with Bottleneck Fusion and Contrastive Learning. IEICE Trans. Inf. Syst. 2023, 106, 545–555.
13. Yu, J.; Li, Z.; Wang, J.; Xia, R. Grounded Multimodal Named Entity Recognition on Social Media. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), Toronto, ON, Canada, 9–14 July 2023; pp. 9141–9154.
14. Wang, X.; Gui, M.; Jiang, Y.; Jia, Z.; Bach, N.; Wang, T.; Huang, Z.; Tu, K. ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2022), Seattle, WA, USA, 10–15 July 2022; pp. 3176–3189.
15. Jiang, C.; Wang, Y.; Xiong, B. Dual Similarity Enhanced Hybrid Orthogonal Fusion for Multimodal Named Entity Recognition. Pattern Recognit. 2026, 169, 111940.
16. Wei, P.; Ouyang, H.; Hu, Q.; Zeng, B.; Feng, G.; Wen, Q. VEC-MNER: Hybrid Transformer with Visual-Enhanced Cross-Modal Multi-level Interaction for Multimodal NER. In Proceedings of the 2024 International Conference on Multimedia Retrieval (ICMR ’24), Utrecht, The Netherlands, 1–4 July 2024; pp. 469–477.
17. Mu, Y.; Guo, Z.; Li, X.; Shao, L.; Liu, S.; Li, F.; Mei, G. MCIRP: A multi-granularity cross-modal interaction model based on relational propagation for Multimodal Named Entity Recognition with multiple images. Inf. Process. Manag. 2026, 63, 104384.
18. Chen, D.; Li, Z.; Gu, B.; Chen, Z. Multimodal Named Entity Recognition with Image Attributes and Image Knowledge. In Proceedings of the 26th International Conference on Database Systems for Advanced Applications (DASFAA 2021), Taipei, Taiwan, 11–14 April 2021; pp. 163–178.
19. Guo, A.; Zhao, X.; Tan, Z.; Xiao, W. MGICL: Multi-Grained Interaction Contrastive Learning for Multimodal Named Entity Recognition. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM ’23), Birmingham, UK, 21–25 October 2023; pp. 649–658.
20. Zheng, C.; Wu, Z.; Wang, T.; Cai, Y.; Li, Q. Object-Aware Multimodal Named Entity Recognition in Social Media Posts With Adversarial Learning. IEEE Trans. Multimed. 2021, 23, 2520–2532.
21. Yuan, Y.; Xue, H. Multimodal Information Integration and Retrieval Framework Based on Graph Neural Networks. In Proceedings of the 2025 4th International Conference on Big Data, Information and Computer Network (BDICN ’25), Bangkok, Thailand, 24–26 January 2025; pp. 135–139.
22. Zhang, D.; Wei, S.; Li, S.; Wu, H.; Zhu, Q.; Zhou, G. Multi-Modal Graph Fusion for Named Entity Recognition with Targeted Visual Guidance. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI 2021), Virtual Event, 2–9 February 2021; Volume 35, pp. 14347–14355.
23. Zhao, F.; Li, C.; Wu, Z.; Xing, S.; Dai, X. Learning from Different Text-Image Pairs: A Relation-Enhanced Graph Convolutional Network for Multimodal NER. In Proceedings of the 30th ACM International Conference on Multimedia (MM ’22), Lisbon, Portugal, 10–14 October 2022; pp. 3983–3992.
24. Sang, E.F.; Veenstra, J. Representing Text Chunks. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL 1999), Bergen, Norway, 8–12 June 1999; pp. 173–179.
25. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML 2021), Virtual Event, 18–24 July 2021; Volume 139, pp. 3942–3951.
26. Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of the 39th International Conference on Machine Learning (ICML 2022), Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900.
27. Lu, D.; Neves, L.; Carvalho, V.; Zhang, N.; Ji, H. Visual Attention Model for Name Tagging in Multimodal Social Media. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018), Melbourne, Australia, 15–20 July 2018; pp. 1990–1999.
28. Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv 2015, arXiv:1508.01991.
29. Ma, X.; Hovy, E. End-to-End Sequence Labeling via Bi-Directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Berlin, Germany, 7–12 August 2016; pp. 1064–1074.
30. Devlin, J.; Chang, M.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
31. Chen, X.; Zhang, N.; Li, L.; Yao, Y.; Deng, S.; Tan, C.; Huang, F.; Si, L.; Chen, H. Good Visual Guidance Make a Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, USA, 10–15 July 2022; pp. 1607–1618.
32. Jia, M.; Shen, L.; Shen, X.; Liao, L.; Chen, M.; He, X.; Chen, Z.; Li, J. MNER-QG: An End-to-End MRC Framework for Multimodal Named Entity Recognition with Query Grounding. In Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI 2023), Washington, DC, USA, 7–14 February 2023; pp. 8032–8040.
33. Zhang, X.; Yuan, J.; Li, L.; Liu, J. Reducing the Bias of Visual Objects in Multimodal Named Entity Recognition. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining (WSDM ’23), Singapore, 27 February–3 March 2023; pp. 958–966.
34. Liu, P.; Wang, G.; Li, H.; Liu, J.; Ren, Y.; Zhu, H.; Sun, L. Multi-Granularity Cross-Modal Representation Learning for Named Entity Recognition on Social Media. Inf. Process. Manag. 2024, 61, 103546.
35. Xu, B.; Huang, S.; Sha, C.; Wang, H. MAF: A General Matching and Alignment Framework for Multimodal NER. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining (WSDM ’22), Tempe, AZ, USA, 21–25 February 2022; pp. 1215–1223.
36. Li, E.; Li, T.; Luo, H.; Chu, J.; Duan, L.; Lv, F. Adaptive Multi-Scale Language Reinforcement for Multimodal Named Entity Recognition. IEEE Trans. Multimed. 2025, 27, 5312–5323.

Figure 1. Visual disambiguation in multimodal NER: (a) The text mentions “Konrad Hilton”, and the accompanying image of the hotel building provides visual evidence to reinforce the entity recognition; (b) The text “I love Alibaba” is ambiguous, but the visual content showing the Alibaba logo and building helps classify the entity accurately as an organization.

Figure 2. The overall architecture of the VSGN model.

Figure 3. Implementation details of the node association learner.

Figure 4. Cross-modal attention heat map of VSG. Each cell represents the attention weight between a text token and a visual feature (image or regional). Darker blue indicates stronger correlation, while lighter green/yellow indicates weaker correlation.

Figure 5. The second row shows several representative samples together with their manually labeled entities in the test set of our two Twitter datasets, and the bottom four rows show predicted entities of different methods on these test samples.

Figure 6. Typical error cases of VSGN. (a) Annotation bias (17.9%), where the model predicts “LOC” while the ground truth is “ORG”; (b) Irregular social media structure (33.6%), which leads to prediction failures on complex text; (c) Lack of background knowledge (22.5%), where the model fails to recognize the reference to “One Piece”.

Figure 7. t-SNE visualization of entity feature distributions on the Twitter-2017 test set. (a) Ground Truth Label Distribution: The distribution of true entity labels in the test set, where different colors represent different entity types. (b) Predicted Label Distribution: The distribution of entity labels predicted by VSGN, showing high consistency with the ground truth in topological structure and distribution density.

Figure 8. Sensitivity analysis of the hyperparameter K on the Twitter-2015 and Twitter-2017 datasets. (a) Performance on the Twitter-2015 development set. (b) Performance on the Twitter-2017 development set. The curves illustrate the variation in Precision, Recall, and F1-score with respect to different values of K.

Figure 9. F1 scores of the benchmark models and VSGN in a low-resource setting on the MNER task. (a) Performance comparison on the Twitter-2015 dataset, showing that our model consistently outperforms all baselines across different data proportions. (b) Performance comparison on the Twitter-2017 dataset, demonstrating the effectiveness of our method in utilizing limited training data, with more pronounced performance gains observed in low-resource scenarios.

Table 1. Statistics for the two Twitter datasets.

Entity Type	Twitter-2015 Train	Twitter-2015 Dev	Twitter-2015 Test	Twitter-2017 Train	Twitter-2017 Dev	Twitter-2017 Test
Person	2217	552	1816	2943	626	621
Location	2091	522	1697	731	173	178
Organization	928	247	839	1674	375	395
Miscellaneous	940	225	726	701	150	157
Total	6176	1546	5078	6049	1324	1351
Num of Tweets	4000	1000	3257	3373	723	723

Table 2. Experimental results on the Twitter-2015 dataset.

Modality	Methods	PER (F1)	LOC (F1)	ORG (F1)	MISC (F1)	Overall P	Overall R	Overall F1
Text	BiLSTM + CRF	76.77	72.56	41.33	26.80	85.24	81.58	63.03
	CNN + BiLSTM + CRF	80.86	75.39	47.77	32.61	84.26	83.17	62.45
	BERT-CRF	84.74	80.51	60.27	37.29	84.67	81.18	63.35
Text + Image	UMT	85.24	81.58	63.03	39.45	71.67	75.23	73.41
	UMGF	84.26	83.17	62.45	42.42	74.49	75.21	74.85
	MAF	84.67	81.18	63.35	41.82	71.86	75.10	73.42
	HVPNet	-	-	-	-	73.87	76.82	75.32
	DebiasCL	85.97	81.84	64.02	43.38	74.45	76.13	75.28
	MNER-QG	85.31	81.65	63.41	41.32	77.43	72.15	74.70
	MGCMT	85.84	82.03	63.08	40.81	73.25	75.03	74.13
	VEC-MNER	86.11	81.03	62.86	40.60	74.56	75.23	74.89
	AMLR	85.90	82.19	69.65	40.20	75.45	75.20	75.31
	VSGN	87.49	83.64	63.41	43.32	75.93	77.62	76.72

Table 3. Experimental results on the Twitter-2017 dataset.

Modality	Methods	PER (F1)	LOC (F1)	ORG (F1)	MISC (F1)	Overall P	Overall R	Overall F1
Text	BiLSTM + CRF	85.12	72.68	72.50	52.56	79.42	73.43	76.31
	CNN + BiLSTM + CRF	87.99	77.44	74.02	60.82	80.00	78.76	79.37
	BERT-CRF	90.25	83.05	81.13	62.21	83.32	83.57	83.44
Text + Image	UMT	91.56	84.73	82.24	70.10	85.28	85.34	85.31
	UMGF	91.92	85.22	83.13	69.83	86.54	84.50	85.51
	MAF	91.51	85.80	85.10	68.79	86.13	86.38	86.25
	HVPNet	-	-	-	-	85.84	87.93	86.87
	DebiasCL	93.46	84.15	84.42	67.88	87.59	86.11	86.84
	MNER-QG	90.90	86.19	84.52	71.67	88.26	85.65	86.94
	MGCMT	90.82	86.26	84.21	65.88	85.61	85.79	85.69
	VEC-MNER	93.88	81.27	85.49	73.40	87.42	87.61	87.51
	AMLR	93.22	86.13	84.46	68.42	86.96	86.90	86.93
	VSGN	94.01	84.56	85.20	74.83	87.20	88.52	87.86
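
The Overall columns in Tables 2 and 3 report precision (P), recall (R), and F1. Assuming that F1 is the usual harmonic mean of P and R (the evaluation script is not shown here, so this is an assumption), a reported row can be sanity-checked with a sketch like the following. The function name `f1_from_pr` is hypothetical, and the two rows are copied from Table 3.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) copied from Table 3 (Twitter-2017).
rows = {
    "BERT-CRF": (83.32, 83.57, 83.44),
    "VSGN": (87.20, 88.52, 87.86),
}

for name, (p, r, reported) in rows.items():
    recomputed = f1_from_pr(p, r)
    print(f"{name}: recomputed F1 = {recomputed:.2f}, reported = {reported:.2f}")
```

For these two rows the recomputed value agrees with the reported F1 to two decimals.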

Table 4. Performance comparison of VSGN and its ablated variants.

Methods	Twitter-2015 P	Twitter-2015 R	Twitter-2015 F1	Twitter-2017 P	Twitter-2017 R	Twitter-2017 F1
VSGN	75.93	77.62	76.72	87.20	88.52	87.86
w/o VSG	74.81	76.73	75.76	86.59	87.47	87.03
w/o CIR	74.02	76.97	75.47	86.65	87.32	86.98
w/o Caption	73.77	75.32	74.54	86.32	86.97	86.64
w/o VSG + CIR	73.32	74.10	73.71	83.77	85.43	84.59

Table 5. Computational cost and efficiency comparison.

Model	Params (M)	Inference Speed (samples/s)	Training Time per Epoch (h)	Memory Consumption (GB)
BERT-CRF	110.5	157	0.2	4.2
UMGF	174.3	21	1.9	15.3
AMLR	136.4	42	1.2	12.4
VSGN	142.7	36	1.3	13.6
w/o VSG	132.3	47	1.0	11.9
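
To make the cost of the VSG module in Table 5 easier to read, the sketch below computes the relative change of each metric between the full VSGN and its w/o VSG variant. The numbers are copied from Table 5. The helper `relative_change` and the dictionary keys are illustrative assumptions and are not taken from the authors' code.

```python
def relative_change(full: float, ablated: float) -> float:
    """Percentage change when going from the ablated variant to the full model."""
    return (full - ablated) / ablated * 100.0


# (VSGN, w/o VSG) pairs copied from Table 5; units as given in the table.
table5 = {
    "params": (142.7, 132.3),
    "inference_speed": (36, 47),
    "train_time_per_epoch": (1.3, 1.0),
    "memory": (13.6, 11.9),
}

changes = {
    metric: relative_change(full, ablated)
    for metric, (full, ablated) in table5.items()
}

for metric, pct in changes.items():
    print(f"{metric}: {pct:+.1f}%")
```

On these figures, adding VSG costs roughly 7.9% more parameters, 30% more training time per epoch, and 14.3% more memory, and lowers inference throughput by about 23.4%.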

