Article
Peer-Review Record

A Dual-Enhanced Hierarchical Alignment Framework for Multimodal Named Entity Recognition

Appl. Sci. 2025, 15(11), 6034; https://doi.org/10.3390/app15116034
by Jian Wang, Yanan Zhou, Qi He * and Wenbo Zhang
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 18 March 2025 / Revised: 28 April 2025 / Accepted: 23 May 2025 / Published: 27 May 2025
(This article belongs to the Special Issue Intelligence Image Processing and Patterns Recognition)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper presents a novel framework for Multimodal Named Entity Recognition (MNER) in social media posts that combine text and images. Through its DEHA framework, it addresses the challenges of fine-grained cross-modal alignment and visual noise interference with a global-local contrastive learning strategy.

Key Contributions:

  1. Semantic-Augmented Global Contrast (SAGC): Improves global text-image alignment via semantic similarity and nearest-neighbor expansion to resolve entity ambiguity (see the sketch after this list).
  2. Multi-Scale Spatial Local Contrast (MS-SLC): Performs token-level alignment using a multi-scale visual feature pyramid and gated attention to suppress noisy image regions.
  3. Cross-Modal Fusion + Vision-Constrained CRF: Combines features adaptively and improves entity prediction using a CRF layer guided by visual alignment cues.
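
For readers less familiar with this style of objective, below is a minimal sketch of the global contrastive idea, assuming paired text/image embeddings from the respective encoders. The function name, temperature value, and neighbor-expansion heuristic are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def global_contrast_loss(text_emb, image_emb, temperature=0.07, k_neighbors=2):
    """Illustrative semantic-augmented global contrastive loss.

    Beyond the standard diagonal (matched-pair) positives, each text anchor
    also treats the images of its k most semantically similar texts as soft
    positives -- a rough stand-in for nearest-neighbor expansion.
    Assumes batch size > k_neighbors.
    """
    text_emb = F.normalize(text_emb, dim=-1)    # (B, d)
    image_emb = F.normalize(image_emb, dim=-1)  # (B, d)

    logits = text_emb @ image_emb.t() / temperature  # (B, B) text-to-image similarities

    # Positive mask: the matched pair plus images of the k nearest-neighbor texts.
    with torch.no_grad():
        text_sim = text_emb @ text_emb.t()                   # (B, B) text-to-text similarities
        text_sim.fill_diagonal_(float('-inf'))               # exclude self from neighbor search
        nn_idx = text_sim.topk(k_neighbors, dim=-1).indices  # (B, k)
        pos = torch.eye(len(text_emb), device=text_emb.device)
        pos.scatter_(1, nn_idx, 1.0)                         # mark neighbors as positives
        pos = pos / pos.sum(dim=-1, keepdim=True)            # soft target distribution

    # Cross-entropy against the soft positive distribution.
    return -(pos * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```

The key departure from a vanilla InfoNCE loss is the soft positive mask: images belonging to semantically similar texts also count as down-weighted positives, which is what lets the objective tolerate polysemous or ambiguous posts.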

Strengths:

  • The paper is well written.
  • Achieves state-of-the-art results on Twitter-2015 and Twitter-2017 datasets.
  • Robust against noisy visual content.
  • Well-supported by ablation studies demonstrating each module's contribution.

Weaknesses:

The framework should be compared to "LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition", which also uses the same datasets.
Providing code would further strengthen the paper’s reproducibility and impact.

Answers to Additional Questions:

  • What is the main question addressed by the research?

The paper addresses the challenge of improving fine-grained alignment between textual and visual modalities in the task of Multimodal Named Entity Recognition (MNER), especially in noisy social media contexts. The main goal is to reduce cross-modal semantic deviation and image noise interference that hinder accurate entity recognition.

  • Do you consider the topic original or relevant to the field? Does it address a specific gap in the field?

Yes. Existing methods typically rely on either global-level alignment or static visual features, whereas this work proposes a dual-enhancement mechanism integrating semantic-augmented global contrast and multi-scale spatial local contrast.

  • What does it add to the subject area compared with other published material?
  1. DEHA combines global and local alignment in a unified framework, improving robustness to noisy or semantically ambiguous inputs.
  2. It introduces a semantic-augmented global contrastive learning (SAGC) strategy for handling polysemous words and ambiguous semantic contexts.
  3. It applies multi-scale visual pyramids and a gating mechanism (MS-SLC) to dynamically align token-level text with appropriate visual regions.
  4. The CRF layer is vision-constrained, integrating visual alignment as a decoding prior, an advancement over the standard CRF used in MNER (see the sketch after the next paragraph).

Empirically, DEHA outperforms state-of-the-art baselines (e.g., ICAK, HamLearning, MLNet), achieving F1 scores of 77.42% and 88.79% on the Twitter-2015 and Twitter-2017 datasets, respectively.
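
To make the "decoding prior" in point 4 concrete, here is a rough sketch of one way visual alignment features can bias CRF emissions. It builds on the third-party pytorch-crf package; the module structure, projection layers, and learned gate are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf

class VisionConstrainedCRF(nn.Module):
    """Illustrative decoder: visual alignment scores bias the CRF emissions."""

    def __init__(self, hidden_dim, num_tags):
        super().__init__()
        self.emission = nn.Linear(hidden_dim, num_tags)      # text-side tag scores
        self.visual_prior = nn.Linear(hidden_dim, num_tags)  # aligned visual features -> tag bias
        self.gate = nn.Parameter(torch.tensor(0.1))          # scales the visual contribution
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, token_feats, aligned_visual_feats, tags=None, mask=None):
        # token_feats, aligned_visual_feats: (B, T, hidden_dim); mask: (B, T) bool
        emissions = self.emission(token_feats) \
                    + self.gate * self.visual_prior(aligned_visual_feats)
        if tags is not None:  # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=mask, reduction='mean')
        return self.crf.decode(emissions, mask=mask)  # inference: best tag sequences
```

Because the visual signal enters the emission scores rather than the transition matrix, standard Viterbi decoding is unchanged; only the per-token evidence is modulated.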

  • What specific improvements should the authors consider regarding the methodology?
  1. The current multi-scale grid levels (64×64, 32×32, 16×16) are predefined. Did the authors try to optimize these scales or explore an adaptive scale-selection strategy? (A sketch of the fixed-scale pipeline follows below.)
  2. The visual feature extraction pipeline may increase inference time. Did the authors measure and compare computational complexity against other SOTA models?
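
To ground question 1, the sketch below shows what predefined grid levels and gated token-region attention could look like. The pooling operator, gate design, and head count are assumptions for illustration, not the paper's exact pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleGatedAttention(nn.Module):
    """Illustrative multi-scale local alignment with a noise-suppressing gate."""

    def __init__(self, dim, scales=(64, 32, 16)):
        super().__init__()
        self.scales = scales
        # dim is assumed divisible by num_heads
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.gate = nn.Linear(2 * dim, 1)  # how much visual evidence each token keeps

    def forward(self, tokens, feature_map):
        # tokens: (B, T, dim); feature_map: (B, dim, H, W) from a visual
        # backbone, assumed to have H, W >= the largest grid size.
        out = tokens
        for s in self.scales:
            grid = F.adaptive_avg_pool2d(feature_map, (s, s))  # (B, dim, s, s)
            regions = grid.flatten(2).transpose(1, 2)          # (B, s*s, dim) region tokens
            attended, _ = self.attn(out, regions, regions)     # token-to-region attention
            g = torch.sigmoid(self.gate(torch.cat([out, attended], dim=-1)))
            out = out + g * attended                           # gated residual update
        return out
```

This also makes the compute concern behind question 2 tangible: the 64×64 level alone contributes 4,096 region keys per attention call, so the coarse-to-fine loop dominates inference cost and would be the natural target of an adaptive scale-selection strategy.
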
  • Are the conclusions consistent with the evidence and arguments presented and do they address the main question posed?

Yes. The quantitative improvements and ablation studies support the claim that DEHA improves fine-grained cross-modal alignment and robustness to noise.

  • Are the references appropriate?

Yes, but additional recent work on vision-language foundation models, such as BLIP and Flamingo, could have been discussed, even if not directly used.

  • Any additional comments on the tables and figures.
  • The figures help to explain the framework.
  • The tables are well-structured and support the claims effectively.

Author Response

Please check the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The authors present the Dual-Enhanced Hierarchical Alignment (DEHA) framework for addressing challenges in Multimodal Named Entity Recognition (MNER) in social media contexts. Experiments on Twitter-2015 and Twitter-2017 datasets demonstrate superior performance compared to existing methods.

Advantages

  • The paper addresses a significant problem, particularly for social media: multimodal NER.
  • The proposed dual-enhancement approach (SAGC and MS-SLC) is innovative in handling global semantic and local spatial alignment.
  • The three-level visual pyramid encoding captures visual information at multiple scales effectively.
  • The vision-constrained CRF layer offers an elegant solution to integrating visual information into the final prediction stage.
  • The authors provide comprehensive empirical validation, including ablation studies and case studies that demonstrate the contribution of each component.
  • The framework achieves significant performance improvements over existing methods on different entity types (PER, LOC, ORG, MISC).
  • The paper is well-written and organized.

Limitations

  • The authors could improve Figure 2 by showing clearer input/output specifications for each main block.
  • The paper uses numerous acronyms. I suggest adding a list of acronyms and their definitions to improve accessibility.
  • The authors don't justify the different neural network architectures chosen for the three hierarchical image encoding layers. A rationale for these specific design choices would strengthen the methodology.
  • The framework combines numerous components, and the authors do not discuss or test its computational complexity and practical deployment in real-world applications.
  • The authors could better explain the passage "as indicated by the yellow shaded region in the contrastive matrix depicted in the second column from the left above" (line 237).
  • Table 2 lacks explicit information about the evaluation metrics and the sources of the reported values, and some values for MLNet are missing without explanation.
  • The evaluation is limited to Twitter datasets covering a small set of entity types; the authors should also demonstrate the framework's capabilities on other entity taxonomies.
  • Technical terminology is sometimes introduced without sufficient explanation or context, which may challenge readers who are less familiar with the field. A background section could be helpful.
  • Figure captions should be more descriptive so that readers can understand the visual elements without reading all the main text.

Author Response

Please check the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

This paper introduces a new Dual-Enhanced Hierarchical Alignment (DEHA) framework for multimodal named entity recognition (MNER). The authors focus on addressing key challenges such as cross-modal semantic bias, fine-grained alignment, and image noise. The model is tested on two widely used datasets, Twitter-2015 and Twitter-2017, and achieves strong performance compared to both unimodal and multimodal baselines. The paper is well-organised, clearly written, and includes detailed experiments and ablation studies.

  1. The method section explains the main components (SAGC, MS-SLC, visual CRF) well. However, it would be helpful to briefly explain the intuition or motivation behind each part before giving technical details. This would make the model easier to understand for readers who are not experts in MNER.
  2. The results are complete and well-presented. Still, the paper could include a short discussion on why the model performs better on some entity types (e.g., LOC vs. ORG). This would help readers better understand the strengths of DEHA.
  3. The multi-scale contrastive learning module is a major contribution. While the results show its impact, explaining how it helps align local and global features would make its role clearer.
  4. The related work section is strong, but a more direct comparison with similar models (such as MNER-QA, DMNER, and RLMNER) would improve clarity. A table or bullet list showing key differences would help position DEHA more clearly in the existing literature.
  5. Although the experiments focus on Twitter data, the model design is quite general. A short comment on how DEHA could be used in other domains like medical or e-commerce text would increase its broader impact.
  6. The paper is mostly well-written but would benefit from careful proofreading to fix small grammar issues. For example, at line 349 (“recognition datasets, namely the Twitter-2015...”), consider removing “namely” or rephrasing for smoother reading.

Author Response

Please check the attachment.

Author Response File: Author Response.pdf
