A Dual-Enhanced Hierarchical Alignment Framework for Multimodal Named Entity Recognition
Abstract
1. Introduction
2. Related Work
2.1. Multimodal Named Entity Recognition
2.2. Contrastive Learning
3. Methodology
3.1. Multimodal Feature Encoding Layer
3.1.1. Text Feature Encoding
3.1.2. Three-Level Visual Pyramid Feature Encoding
3.2. Dual-Enhanced Hierarchical Alignment (DEHA)
3.2.1. Semantic-Augmented Global Contrast (SAGC)
3.2.2. Multi-Scale Spatial Local Contrast (MS-SLC)
3.3. Cross-Modal Feature Fusion and Vision-Constrained CRF Prediction Layer
3.3.1. Cross-Modal Feature Fusion
3.3.2. Vision-Constrained CRF Prediction Layer
4. Experiments
4.1. Datasets and Evaluation Metrics
4.2. Compared Baselines
- Compared with ITA [11] and HVPNeT [24], which rely on static alignment strategies and struggle to capture fine-grained semantic differences, DEHA introduces a dynamic multi-scale alignment module that couples a three-level visual feature pyramid with a gated attention mechanism, adaptively associating text entities with multi-granularity visual regions, including global scenes, regional semantics, and local details (a hedged sketch of this kind of gated multi-scale attention is given after this list). In fine-grained entity recognition tasks, such as identifying brand-related ORG entities, the F1 scores on the two datasets improved by 2.75% and 2.76%, respectively, over HVPNeT. ORG recognition benefits from the cross-modal abstract association capability of the Semantic-Augmented Global Contrast (SAGC) module: organization names often involve complex semantic compositions (abbreviations, aliases, and implicit associations with visual symbols), and SAGC resolves entity coreference ambiguity by modeling the global semantic consistency between textual descriptions and visual concepts through contrastive learning.
- Although MPMRC-MNER [25] and ICAK [15] excel in few-shot scenarios and markedly enhance long-tail entity recognition (e.g., niche brands), their dependence on manual expertise and external knowledge bases limits their generalizability. In contrast, DEHA dynamically updates cross-modal alignment knowledge through text-similarity-driven expansion, improving the F1 score on long-tail entities such as “Springfield” (LOC) by 1.11% over MPMRC-MNER and effectively mitigating semantic bias. Geographic entities are often strongly coupled with specific image regions (e.g., landmarks and geographic boundaries), and MS-SLC precisely captures the fine-grained co-occurrence patterns between textual descriptions and visual spatial cues through local feature pyramid alignment.
- Compared to models like DebiasCL [13], MLNet [14], and HamLearning [16], which rely on predefined noise assumptions and fail to address dynamic noise interference, DEHA increases the F1 score on Twitter-2017 by 1.43% to 1.95%. Particularly in cases with complex image backgrounds, such as interference from the Eiffel Tower in Case 1, the MS-SLC module suppresses irrelevant regional feature responses through multi-scale spatial contrast, boosting the recall rate for the LOC category by 3.79%.
- Compared with the multi-level fusion architecture MLNet [14] and the MRC-based MPMRC-MNER [25], DEHA’s vision-constrained CRF layer jointly optimizes label transition probabilities and visual consistency, achieving a precision of 78.29% on Twitter-2015 (an improvement of 1.14%), demonstrating that adaptive feature aggregation effectively reduces misjudgments caused by image interference.
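To make the gated multi-scale attention referenced in the first comparison concrete, the following is a minimal PyTorch sketch of the general technique: each text token attends over a concatenated three-level visual pyramid, and a sigmoid gate decides how much of the attended visual evidence to admit. It is an illustration under stated assumptions, not DEHA's released implementation; the class name GatedPyramidAttention, the 8-head attention, and the toy patch counts are all assumptions.

```python
# Illustrative sketch of gated attention over a three-level visual pyramid.
# Assumption: pyramid features are already projected to the text hidden size d.
import torch
import torch.nn as nn

class GatedPyramidAttention(nn.Module):  # hypothetical module name
    def __init__(self, d: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)
        self.gate = nn.Linear(2 * d, d)  # decides how much visual evidence to admit

    def forward(self, text: torch.Tensor, pyramid: list[torch.Tensor]) -> torch.Tensor:
        # text:    (B, T, d) token representations from the text encoder
        # pyramid: list of (B, N_s, d) patch sequences at global/regional/local scales
        visual = torch.cat(pyramid, dim=1)             # (B, N1 + N2 + N3, d)
        attended, _ = self.attn(text, visual, visual)  # text queries visual keys/values
        g = torch.sigmoid(self.gate(torch.cat([text, attended], dim=-1)))
        return text + g * attended                     # gate suppresses irrelevant regions

# Toy usage: 2 sentences, 16 tokens, hidden size 768; small illustrative patch counts
# (the paper's pyramid uses 64 x 64 / 32 x 32 / 16 x 16 grids).
text = torch.randn(2, 16, 768)
pyramid = [torch.randn(2, n, 768) for n in (256, 64, 16)]
fused = GatedPyramidAttention(768)(text, pyramid)
print(fused.shape)  # torch.Size([2, 16, 768])
```

The sigmoid gate realizes the behaviour attributed to DEHA above: when the attended visual evidence is uninformative for a token, its contribution is scaled toward zero rather than contaminating the text representation.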
4.3. Ablation Study
- w/o SAGC removes the Semantic-Augmented Global Contrast (SAGC) module and replaces it with traditional cross-modal contrastive learning (He et al., 2020 [8]), which relies solely on random negative samples for global alignment, and disables the text-semantic-similarity-driven prototype expansion strategy (a loss sketch contrasting the two negative-sampling schemes is given after this list);
- w/o MS-SLC removes the Multi-Scale Spatial Local Contrast (MS-SLC) module and substitutes single-scale (64 × 64 grid) local feature matching, eliminating the spatial pyramid structure and the cross-scale contrast constraints;
- w/o Cross-Modal Fusion removes the adaptive cross-modal feature fusion layer, replacing it with simple concatenation followed by a fully connected layer;
- w/o Visual Constraint CRF removes the visual constraint CRF prediction layer, replacing it with a standard CRF that depends solely on text label transition probabilities without incorporating visual alignment strength as a decoding constraint.
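The practical difference between the full model and the w/o SAGC variant lies in how negatives enter the global image–text contrast. The sketch below, written under stated assumptions, pairs a standard in-batch InfoNCE objective (the random-negative contrast the ablation falls back to) with a triplet-style hardest-negative term of the kind SAGC is described as adding; the temperature, margin, loss weight, and function names are illustrative, and the prototype expansion step is omitted.

```python
# Hedged sketch: in-batch InfoNCE vs. an added hardest-negative term.
import torch
import torch.nn.functional as F

def info_nce(text_emb: torch.Tensor, img_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric image-text InfoNCE with random (in-batch) negatives."""
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(img_emb, dim=-1)
    logits = t @ v.t() / tau                           # (B, B) scaled cosine similarities
    labels = torch.arange(t.size(0), device=t.device)  # diagonal entries are the positives
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def hardest_negative_term(text_emb: torch.Tensor, img_emb: torch.Tensor,
                          margin: float = 0.2) -> torch.Tensor:
    """Triplet-style penalty on the single hardest in-batch negative per text."""
    sim = F.normalize(text_emb, dim=-1) @ F.normalize(img_emb, dim=-1).t()  # (B, B)
    pos = sim.diag()                                   # similarity of matched pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float("-inf")).max(dim=1).values  # hardest mismatched image
    return F.relu(margin + neg - pos).mean()           # push it below the positive by a margin

# Toy usage: 8 paired text/image embeddings of width 256; 0.5 is an assumed loss weight.
text_emb, img_emb = torch.randn(8, 256), torch.randn(8, 256)
loss = info_nce(text_emb, img_emb) + 0.5 * hardest_negative_term(text_emb, img_emb)
```

Dropping the second term leaves only random negatives, the setting the ablation associates with blurred discriminative boundaries.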
- Removing the SAGC module results in a 2.5% decrease in Precision and a 2.2% reduction in Recall on the Twitter-2017 dataset, indicating that traditional contrastive learning approaches, such as MoCo [8], struggle to address the semantic misalignment between images and text in social media contexts. This module enhances cross-modal semantic consistency by expanding prototypes through textual semantic similarity. Furthermore, the removal of this module leads to a 5.3% drop in Recall on the Twitter-2015 dataset, because the elimination of SAGC’s mechanism for selecting the hardest negative samples forces the model to rely solely on random negative samples, resulting in blurred discriminative boundaries and an inability to filter out irrelevant entities in images.
- When MS-SLC is removed, the F1 scores on Twitter-2015 and Twitter-2017 decline by 3.7% and 2.9%, respectively. MS-SLC captures multi-granularity visual context through a three-level spatial grid (64 × 64 / 32 × 32 / 16 × 16), whereas the single-scale substitute matches local features at only one fixed resolution; the loss of multi-scale feature support results in a 4.5% drop in Recall.
- Removing the cross-modal fusion layer and resorting to simple concatenation also reduces the F1 scores. In scenarios where textual and visual information conflict, adaptive cross-modal fusion suppresses interfering visual features via a gating mechanism, whereas concatenation allows erroneous visual signals to contaminate the text representations and increases the misjudgment rate.
- The standard CRF overlooks the consistency of visual evidence, leading to a 1.3% decrease in Precision and a 1.1% decrease in Recall for “image-dominant entities” on Twitter-2017. The vision-constrained CRF suppresses entity probabilities wherever the alignment strength is low, whereas the standard CRF relies excessively on textual context and erroneously retains such entities (a decoding sketch illustrating this constraint is given after this list).
- Simultaneously removing SAGC and MS-SLC results in F1 score decreases of 4.7% and 4.4% on Twitter-2015 and Twitter-2017, respectively, significantly exceeding the sum of individual module ablation declines, underscoring a nonlinear synergistic effect between global semantic calibration and local spatial contrast.
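To illustrate what the vision-constrained CRF contributes at decoding time, the sketch below subtracts a penalty from entity-label emission scores wherever a token's visual alignment strength is low, and then runs a standard Viterbi pass. The label set, the linear penalty rule, and the penalty weight are assumptions made for this example, not the paper's exact formulation.

```python
# Illustrative sketch: visually constrained emission scores followed by standard Viterbi.
import torch

LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG", "B-MISC", "I-MISC"]

def constrain_emissions(emissions: torch.Tensor, align: torch.Tensor,
                        penalty: float = 2.0) -> torch.Tensor:
    """emissions: (T, L) per-token label scores; align: (T,) alignment strength in [0, 1]."""
    constrained = emissions.clone()
    entity_cols = [i for i, lab in enumerate(LABELS) if lab != "O"]
    # Weak visual support -> larger subtraction from every entity label's score.
    constrained[:, entity_cols] -= penalty * (1.0 - align).unsqueeze(1)
    return constrained

def viterbi(emissions: torch.Tensor, transitions: torch.Tensor) -> list[int]:
    """Standard Viterbi over (T, L) emissions and (L, L) transition scores."""
    T, _ = emissions.shape
    score, back = emissions[0], []
    for t in range(1, T):
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)  # (L, L)
        score, idx = total.max(dim=0)   # best previous label for each current label
        back.append(idx)
    best = [int(score.argmax())]
    for idx in reversed(back):
        best.append(int(idx[best[-1]]))
    return best[::-1]

# Toy usage: 6 tokens with (hypothetical) alignment strengths from an alignment head.
emissions = torch.randn(6, len(LABELS))
transitions = torch.randn(len(LABELS), len(LABELS))
align = torch.tensor([0.9, 0.2, 0.8, 0.1, 0.7, 0.95])
tags = viterbi(constrain_emissions(emissions, align), transitions)
print([LABELS[i] for i in tags])
```

Tokens with weak visual support (0.1–0.2 in the toy example) have their entity emissions pushed down relative to O, which mirrors the suppression of image-dominant entities described above, while a standard CRF would score them from textual evidence and transitions alone.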
Model | Twitter-2015 P (%) | Twitter-2015 R (%) | Twitter-2015 F1 (%) | Twitter-2017 P (%) | Twitter-2017 R (%) | Twitter-2017 F1 (%)
---|---|---|---|---|---|---
DEHA | 85.0 | 83.9 | 84.9 | 86.8 | 85.7 | 86.5
w/o SAGC | 82.5 | 81.7 | 82.1 | 84.3 | 83.5 | 83.9
w/o MS-SLC | 83.8 | 82.2 | 83.0 | 85.2 | 84.2 | 84.7
w/o Cross-Modal Fusion | 81.0 | 80.1 | 80.5 | 83.5 | 82.9 | 83.2
w/o Visual Constraint CRF | 83.9 | 83.0 | 83.4 | 85.5 | 84.6 | 85.0
w/o SAGC + MS-SLC | 79.8 | 78.6 | 79.2 | 82.3 | 81.9 | 82.1
4.4. Case Study
5. Conclusions and Outlook
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
BERT | Bidirectional Encoder Representations from Transformers |
CLIP | Contrastive Language–Image Pretraining |
CRF | Conditional Random Field |
DEHA | Dual-Enhanced Hierarchical Alignment |
Faster R-CNN | Faster Region-based Convolutional Neural Network |
GCN | Graph Convolutional Network |
HamLearning | Hierarchical Aligned Multimodal Learning |
HVPNeT | Hierarchical Visual Prefix Network |
ICAK | Instruction Construction and Knowledge Alignment |
ITA | Image–Text Alignments |
MAF | Matching and Alignment Framework |
MLNet | Multi-Level Network |
MNER | Multimodal Named Entity Recognition |
MoCo | Momentum Contrast |
MPMRC-MNER | Machine Reading Comprehension for Multimodal Named Entity Recognition |
MS-SLC | Multi-Scale Spatial Local Contrast |
NLP | Natural Language Processing |
ResNet | Residual Network |
SAGC | Semantic-Augmented Global Contrast |
SimCSE | Simple Contrastive Learning of Sentence Embeddings |
UMGF | Unified Multimodal Graph Fusion |
UMT | Unified Multimodal Transformer |
References
- Moon, S.; Neves, L.; Carvalho, V. Multimodal named entity disambiguation for noisy social media posts. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Melbourne, Australia, 2018; pp. 2000–2008. [Google Scholar]
- Yadav, V.; Bethard, S. A Survey on Recent Advances in Named Entity Recognition from Deep Learning models. In Proceedings of the 27th International Conference on Computational Linguistics; Association for Computational Linguistics: Santa Fe, NM, USA, 2018; pp. 2145–2158. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar]
- Zhang, Q.; Fu, J.; Liu, X.; Huang, X. Adaptive co-attention network for named entity recognition in tweets. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32, pp. 5674–5681. [Google Scholar]
- Yu, J.; Jiang, J.; Yang, L.; Xia, R. Improving multimodal named entity recognition via entity span detection with unified multimodal transformer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020. [Google Scholar]
- Chen, S.; Aguilar, G.; Neves, L.; Solorio, T. Can images help recognize entities? A study of the role of images for Multimodal NER. In Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021); Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 87–96. [Google Scholar]
- Zhang, D.; Wei, S.; Li, S.; Wu, H.; Zhu, Q.; Zhou, G. Multi-modal Graph Fusion for Named Entity Recognition with Targeted Visual Guidance. Proc. AAAI Conf. Artif. Intell. 2021, 35, 14347–14355. [Google Scholar] [CrossRef]
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
- Gao, T.; Yao, X.; Chen, D. SimCSE: Simple contrastive learning of sentence embeddings. arXiv 2021, arXiv:2104.08821. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; PMLR: Cambridge, MA, USA, 2021; pp. 8748–8763. [Google Scholar]
- Wang, X.; Gui, M.; Jiang, Y.; Jia, Z.; Bach, N.; Wang, T.; Huang, Z.; Huang, F.; Tu, K. ITA: Image-Text Alignments for Multimodal NER. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; Association for Computational Linguistics: Seattle, WA, USA, 2022; pp. 3176–3189. [Google Scholar]
- Xu, B.; Huang, S.; Sha, C.; Wang, H. MAF: A general matching and alignment framework for multimodal named entity recognition. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, Virtual, 21–25 February 2022; pp. 1215–1223. [Google Scholar]
- Zhang, X.; Yuan, J.; Li, L.; Liu, J. Reducing the Bias of Visual Objects in Multimodal Named Entity Recognition. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, Singapore, 27 February–3 March 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 958–966. [Google Scholar]
- Zhai, H.; Lv, X.; Hou, Z.; Tong, X.; Bu, F. MLNet: Multi-level MNER Architecture. Front. Neurorobotics 2023, 17, 1181143. [Google Scholar] [CrossRef] [PubMed]
- Zeng, Q.; Yuan, M.; Wan, J.; Wang, K.; Shi, N.; Che, Q.; Liu, B. ICKA: Instruction Construction and Knowledge Alignment for MNER. Expert Syst. Appl. 2024, 255, 124867. [Google Scholar] [CrossRef]
- Liu, P.; Li, H.; Ren, Y.; Liu, J.; Si, S.; Zhu, H.; Sun, L. Hierarchical Aligned Multimodal Learning for NER on Tweet Posts. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 18680–18688. [Google Scholar]
- Wang, Y.; Liu, X.; Huang, F.; Xiong, Z.; Zhang, W. A multi-modal contrastive diffusion model for therapeutic peptide generation. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 3–11. [Google Scholar]
- Wang, Z.; Xiong, Z.; Huang, F.; Liu, X.; Zhang, W. ZeroDDI: A Zero-Shot Drug-Drug Interaction Event Prediction Method with Semantic Enhanced Learning and Dual-Modal Uniform Alignment. arXiv 2024, arXiv:2407.00891. [Google Scholar]
- Ji, Z.; Chen, K.; Wang, H. Step-wise hierarchical alignment network for image-text matching. arXiv 2021, arXiv:2106.06509. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
- Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
- Lu, D.; Neves, L.; Carvalho, V.; Zhang, N.; Ji, H. Visual attention model for name tagging in multimodal social media. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Melbourne, Australia, 2018; pp. 1990–1999. [Google Scholar]
- Chen, X.; Zhang, N.; Li, L.; Yao, Y.; Deng, S.; Tan, C.; Huang, F.; Si, L.; Chen, H. Good Visual Guidance Make A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction. In Findings of the Association for Computational Linguistics: NAACL 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 1607–1618. [Google Scholar]
- Bao, X.; Tian, M.; Zha, Z.; Qin, B. MPMRC-MNER: Unified MRC Framework for MNER. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 47–56. [Google Scholar]
Entity Type | Twitter-2015 Train | Twitter-2015 Dev | Twitter-2015 Test | Twitter-2017 Train | Twitter-2017 Dev | Twitter-2017 Test
---|---|---|---|---|---|---
Person | 2217 | 552 | 1816 | 2943 | 626 | 621
Location | 2091 | 522 | 1697 | 731 | 173 | 178
Organization | 928 | 247 | 839 | 1674 | 375 | 395
Miscellaneous | 940 | 225 | 726 | 701 | 150 | 157
Total | 6176 | 1546 | 5078 | 6049 | 1324 | 1351
No. of Tweets | 4000 | 1000 | 3257 | 3373 | 723 | 723
Results on Twitter-2015 (single-type F1 and overall precision, recall, and F1):

Modality | Methods | PER (F1) | LOC (F1) | ORG (F1) | MISC (F1) | Pre. (%) | Rec. (%) | F1 (%)
---|---|---|---|---|---|---|---|---
Text | BiLSTM-CRF [2] | 76.77 | 72.56 | 41.33 | 26.80 | 68.14 | 61.09 | 64.42
Text | BERT-CRF [3] | 84.74 | 80.51 | 60.27 | 37.29 | 69.22 | 74.59 | 71.81
Text+Image | UMT [5] | 85.24 | 81.58 | 63.03 | 39.45 | 71.67 | 75.23 | 73.41
Text+Image | ITA [11] | 85.60 | 82.60 | 64.40 | 44.80 | - | - | 75.60
Text+Image | HVPNeT [24] | 85.88 | 82.96 | 62.72 | 41.56 | 73.87 | 76.82 | 75.32
Text+Image | DebiasCL [13] | 85.97 | 81.84 | 64.02 | 43.38 | 74.45 | 76.13 | 75.28
Text+Image | MLNet [14] | - | - | - | - | 75.73 | 75.73 | 75.73
Text+Image | MPMRC-MNER [25] | 85.88 | 83.06 | 66.60 | 42.99 | 77.15 | 75.39 | 76.26
Text+Image | ICAK [15] | 87.01 | 83.85 | 65.87 | 48.28 | 72.36 | 78.75 | 75.42
Text+Image | HamLearning [16] | 85.28 | 82.84 | 64.46 | 42.52 | 77.25 | 75.75 | 76.49
Text+Image | DEHA (Ours) | 87.22 | 84.21 | 65.47 | 48.65 | 78.29 | 80.03 | 77.42
Results on Twitter-2017 (single-type F1 and overall precision, recall, and F1):

Modality | Methods | PER (F1) | LOC (F1) | ORG (F1) | MISC (F1) | Pre. (%) | Rec. (%) | F1 (%)
---|---|---|---|---|---|---|---|---
Text | BiLSTM-CRF [2] | 85.12 | 72.68 | 72.50 | 52.56 | 79.42 | 73.43 | 76.31
Text | BERT-CRF [3] | 90.25 | 83.05 | 81.13 | 62.21 | 83.32 | 83.57 | 83.44
Text+Image | UMT [5] | 91.56 | 84.73 | 82.24 | 70.10 | 85.28 | 85.34 | 85.31
Text+Image | ITA [11] | 91.40 | 84.80 | 84.00 | 68.60 | - | - | 85.72
Text+Image | HVPNeT [24] | 92.05 | 83.35 | 85.31 | 68.50 | 85.84 | 87.93 | 86.87
Text+Image | DebiasCL [13] | 93.46 | 84.15 | 84.42 | 67.88 | 87.59 | 86.11 | 86.84
Text+Image | MLNet [14] | - | - | - | - | 87.36 | 87.36 | 87.36
Text+Image | MPMRC-MNER [25] | 92.80 | 86.80 | 83.40 | 72.79 | 87.10 | 87.16 | 87.13
Text+Image | ICAK [15] | 93.99 | 87.24 | 86.24 | 75.76 | 85.13 | 89.19 | 87.12
Text+Image | HamLearning [16] | 91.43 | 86.26 | 86.66 | 69.17 | 86.99 | 87.28 | 87.13
Text+Image | DEHA (Ours) | 94.06 | 87.94 | 88.07 | 76.10 | 88.32 | 89.60 | 88.79
Model | (a). Hiking in [Yellowstone National Park LOC]. | (b). Visiting the [Louvre Museum ORG] in [Paris LOC]. | (c). [Apple ORG]’s latest innovation!
---|---|---|---
HamLearning [16] | Yellowstone (LOC) ✓ | Louvre Museum (LOC) ×; Paris (LOC) ✓ | Apple (FOOD) ×
DebiasCL [13] | Yellowstone (ORG) × | Louvre Museum (MISC) ×; Paris (LOC) ✓ | Apple (ORG) ✓
ICAK [15] | Yellowstone National (LOC) × | Louvre (ORG) ×; Paris (LOC) ✓ | Apple (FOOD) ×
DEHA (Ours) | Yellowstone National Park (LOC) ✓ | Louvre Museum (ORG) ✓; Paris (LOC) ✓ | Apple (ORG) ✓