Image Captioning Using Enhanced Cross-Modal Attention with Multi-Scale Aggregation for Social Hotspot and Public Opinion Monitoring
Abstract
1. Introduction
- (1) enhancing cross-modal interaction between visual features and query tokens to facilitate more effective information exchange;
- (2) preserving and aggregating visual semantics at multiple levels of abstraction to mitigate information loss during feature compression;
- (3) improving the refinement of semantic representations during caption generation to enhance semantic coherence and structural consistency.
1.1. Motivation for Social-Oriented Image Captioning
1.2. Research Objectives and Innovations
- This paper proposes an enhanced cross-modal attention mechanism (ECMA). ECMA reconstructs the fusion of visual features and query tokens through bidirectional interaction, multi-scale aggregation, and cross-head guided attention, and captures deeper semantic relationships than the native Q-Former;
- ECMA is designed for seamless compatibility with the existing BLIP-2 system. Its lightweight structure strengthens cross-modal coupling without disrupting the pre-trained alignment space. Because no parameters of the ViT or the large language model need to be adjusted, ECMA can be inserted directly into the interaction stage of the Q-Former, preserving the stability and controllability of training. This modular extension also offers a portable design paradigm for other vision–language tasks such as VQA and cross-modal retrieval;
- Improvements across multiple metrics. Experimental results show consistent gains in CIDEr, BLEU, METEOR, and ROUGE-L; in particular, the gain in CIDEr indicates that ECMA strengthens semantic-structure modeling.
2. Materials and Methods
2.1. Related Work
2.1.1. Visual Encoder
2.1.2. Cross-Modal Pre-Trained Models
2.1.3. Limitations of Cross-Modal Attention Mechanisms
- ViT-based visual encoders provide strong visual representations but are typically frozen in large-scale vision–language models, limiting the potential for performance gains through encoder fine-tuning.
- Cross-modal pre-trained models such as BLIP-2 achieve high efficiency via lightweight alignment modules, yet suffer from limited cross-modal interaction and unidirectional attention.
- Current cross-modal attention mechanisms often overlook multi-scale semantic modeling, which is critical for accurately describing complex and socially rich scenes.
2.2. Overview of the BLIP-2 Architecture
2.3. Limitations Analysis of BLIP-2 Cross-Modal Attention
2.3.1. Unidirectional Attention
2.3.2. Excessive Semantic Compression Leads to Information Flow Bottlenecks
2.4. ECMA: Enhanced Cross-Modal Attention
2.4.1. Bidirectional Cross-Modal Attention
- Visual-to-Query (V2Q): The query tokens extract information from the visual features, the same as in the original BLIP-2, as shown in Equation (2): $\mathrm{V2Q}(Q) = \mathrm{softmax}\big(Q K_V^{\top}/\sqrt{d}\big)\, V_V$, where $Q$ denotes the learnable query token embeddings, and $K_V$ and $V_V$ are the key and value matrices projected from the visual feature tokens.
- Query-to-Visual (Q2V): The visual features are adjusted according to the semantic requirements of the query tokens, as shown in Equation (3): $\mathrm{Q2V}(F) = \mathrm{softmax}\big(F K_Q^{\top}/\sqrt{d}\big)\, V_Q$, where $F$ represents the visual feature tokens extracted by the vision encoder, and $K_Q$ and $V_Q$ are the key and value matrices obtained by linear projections of the query tokens $Q$.
- Enhanced visual features: The original visual features are added to the query-guided update to obtain the enhanced features, as shown in Equation (4): $F' = F + \alpha \cdot \mathrm{Q2V}(F)$, where $\alpha$ is a learnable scaling factor, initialized to 0.1, which enables the model to transition smoothly from the original unidirectional attention to bidirectional attention (a minimal implementation sketch follows this list).
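The following PyTorch sketch illustrates Equations (2)–(4) under stated assumptions: it is single-head, uses illustrative tensor shapes and a shared hidden size, and omits the layer norms, feed-forward blocks, and multi-head splitting that a full Q-Former block would include. Only the V2Q/Q2V structure and the scaling factor $\alpha$ initialized to 0.1 are taken from the text above; all other names and sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BidirectionalCrossModalAttention(nn.Module):
    """Single-head sketch of the V2Q / Q2V attention in Equations (2)-(4)."""

    def __init__(self, dim: int = 768):
        super().__init__()
        # Equation (2): keys/values projected from the visual feature tokens.
        self.k_v = nn.Linear(dim, dim)
        self.v_v = nn.Linear(dim, dim)
        # Equation (3): keys/values projected from the query tokens.
        self.k_q = nn.Linear(dim, dim)
        self.v_q = nn.Linear(dim, dim)
        # Equation (4): learnable scaling factor alpha, initialized to 0.1.
        self.alpha = nn.Parameter(torch.tensor(0.1))
        self.scale = dim ** -0.5

    def forward(self, queries: torch.Tensor, visual: torch.Tensor):
        # queries: (B, Nq, D) learnable query token embeddings
        # visual:  (B, Nv, D) visual feature tokens from the frozen ViT
        # V2Q (Equation (2)): query tokens gather information from visual features.
        attn_v2q = F.softmax(queries @ self.k_v(visual).transpose(1, 2) * self.scale, dim=-1)
        queries_updated = attn_v2q @ self.v_v(visual)
        # Q2V (Equation (3)): visual features are adjusted toward the query semantics.
        attn_q2v = F.softmax(visual @ self.k_q(queries).transpose(1, 2) * self.scale, dim=-1)
        visual_update = attn_q2v @ self.v_q(queries)
        # Equation (4): residual enhancement of the visual features.
        visual_enhanced = visual + self.alpha * visual_update
        return queries_updated, visual_enhanced


# Illustrative usage with hypothetical sizes (32 query tokens, 257 ViT patch tokens).
layer = BidirectionalCrossModalAttention(dim=768)
q_tokens = torch.randn(2, 32, 768)
v_tokens = torch.randn(2, 257, 768)
q_out, v_out = layer(q_tokens, v_tokens)  # (2, 32, 768), (2, 257, 768)
```

Because the residual path in Equation (4) starts from the unmodified visual features and $\alpha$ starts near zero, the module initially behaves like the original unidirectional attention and only gradually admits the query-guided update during training.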
2.4.2. Multi-Scale Visual Aggregation
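The body of this subsection is not reproduced in this outline. As an illustration only, the following is a generic sketch of one common way to realize multi-scale visual aggregation: hidden states are collected from several encoder depths and fused with a learned projection. The layer choices, fusion operator, and all names here are assumptions, not necessarily ECMA's exact design.

```python
import torch
import torch.nn as nn


class MultiScaleAggregator(nn.Module):
    """Concatenate features from several encoder depths and project back to one width."""

    def __init__(self, dim: int = 768, num_scales: int = 3):
        super().__init__()
        self.proj = nn.Linear(num_scales * dim, dim)

    def forward(self, features_per_scale: list[torch.Tensor]) -> torch.Tensor:
        # features_per_scale: list of (B, N, D) hidden states from different ViT layers.
        stacked = torch.cat(features_per_scale, dim=-1)  # (B, N, num_scales * D)
        return self.proj(stacked)                        # (B, N, D)


# Hypothetical usage: shallow, intermediate, and final ViT layer outputs.
agg = MultiScaleAggregator(dim=768, num_scales=3)
feats = [torch.randn(2, 257, 768) for _ in range(3)]
fused = agg(feats)  # (2, 257, 768)
```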
2.4.3. Semantic Residual Gating Fusion
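As with the previous subsection, only the heading is reproduced here. The sketch below shows a generic gated residual fusion layer, a common formulation in which a sigmoid gate blends fused semantics back into the original features; the exact gating function used in ECMA may differ, and the module and tensor names are hypothetical.

```python
import torch
import torch.nn as nn


class GatedResidualFusion(nn.Module):
    """Blend fused semantics into the original features through a learned sigmoid gate."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, original: torch.Tensor, fused: torch.Tensor) -> torch.Tensor:
        # The gate decides, per channel, how much of the fused signal to admit.
        g = torch.sigmoid(self.gate(torch.cat([original, fused], dim=-1)))
        # Residual form: the original semantics are always preserved.
        return original + g * fused


# Hypothetical usage on query-token representations.
fusion = GatedResidualFusion(dim=768)
out = fusion(torch.randn(2, 32, 768), torch.randn(2, 32, 768))  # (2, 32, 768)
```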
2.5. Dataset
2.6. Experimental Setup
3. Results and Discussion
3.1. Experimental Results
- BLEU@1 and BLEU@4—Measure n-gram precision between the generated captions and reference captions. BLEU@1 evaluates unigram overlap, while BLEU@4 considers 4-gram sequences, capturing fluency and local coherence (a minimal illustration of this n-gram precision is sketched after this list).
- METEOR—Considers unigram matches with stemming and synonymy, balancing precision and recall. It emphasizes semantic similarity and alignment with reference captions.
- ROUGE-L—Computes the longest common subsequence between generated and reference captions, reflecting sentence-level structural consistency.
- CIDEr—Evaluates consensus between generated captions and multiple references by weighting n-grams according to their rarity, focusing on informativeness and content coverage.
- SPICE—Analyzes the scene-graph-based semantic propositional content of captions, assessing how well objects and visual relationships are described.
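As a minimal, self-contained illustration of what "unigram/4-gram overlap" means in BLEU@1 and BLEU@4, the snippet below computes clipped n-gram precision for a single caption. It is not the official COCO evaluation toolkit used for the results that follow (which additionally applies a brevity penalty and corpus-level aggregation), and the example captions are invented for demonstration.

```python
from collections import Counter


def ngram_precision(candidate: str, references: list[str], n: int) -> float:
    """Clipped n-gram precision of one candidate caption against its references."""
    def ngrams(tokens, k):
        return Counter(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))

    cand = ngrams(candidate.lower().split(), n)
    if not cand:
        return 0.0
    # Clip each candidate n-gram by its maximum count in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, cnt in ngrams(ref.lower().split(), n).items():
            max_ref[gram] = max(max_ref[gram], cnt)
    clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand.items())
    return clipped / sum(cand.values())


# Hypothetical example: unigram (BLEU@1-style) vs. 4-gram (BLEU@4-style) precision.
cand = "a group of people standing in front of a building"
refs = ["a group of people stand in front of a tall building"]
print(ngram_precision(cand, refs, 1), ngram_precision(cand, refs, 4))
```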
| Model | BLEU@1 | BLEU@4 | METEOR | ROUGE-L | CIDEr | SPICE |
|---|---|---|---|---|---|---|
| BLIP | 79.1 | 39.9 | 31.0 | 60.0 | 133.5 | 23.8 |
| ORT [31] | 80.5 | 38.6 | 28.7 | 58.4 | 128.3 | 22.6 |
| AoANet [32] | 80.2 | 38.9 | 29.2 | 58.8 | 129.8 | 22.4 |
| DLCT [33] | 81.4 | 39.8 | 29.5 | 59.1 | 133.8 | 23.0 |
| FlanT5 | 82.2 | 40.7 | 30.8 | 60.8 | 138.0 | 24.7 |
| ECMA | 83.5 | 44.0 | 31.9 | 62.2 | 147.0 | 25.2 |
3.2. Discussion
3.3. Social Hotspot and Public Opinion Monitoring: An Application-Oriented Evaluation
- (1) object omission, where salient objects in complex scenes are missing from the generated captions;
- (2) attribute mismatch, including incorrect color, quantity, or object states;
- (3) imprecise relational description, where spatial or interaction relationships are inaccurately expressed.
4. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Albadarneh, I.A.; Hammo, B.H.; Al-Kadi, O.S. Attention-based transformer models for image captioning across languages: An in-depth survey and evaluation. arXiv 2025, arXiv:2506.05399. [Google Scholar] [CrossRef]
- Fei, J.; Wang, T.; Zhang, J.; He, Z.; Wang, C.; Zheng, F. Transferable Decoding with Visual Entities for Zero-Shot Image Captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 2788–2798. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual Event, 3–7 May 2021. [Google Scholar]
- Fang, Y.; Wang, W.; Xie, B.; Sun, Q.; Wu, L.; Wang, X.; Huang, T.; Wang, X.; Cao, Y. EVA: Exploring the Limits of Masked Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
- Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X.V.; et al. OPT: Open Pre-trained Transformers. Meta AI Technical Report. arXiv 2022, arXiv:2205.01068. [Google Scholar] [CrossRef]
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971. [Google Scholar] [CrossRef]
- Chung, H.W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; et al. Scaling Instruction-Finetuned Language Models. J. Mach. Learn. Res. 2024, 25, 1–53. [Google Scholar]
- Li, D.; Li, J.; Le, H.; Wang, G.; Savarese, S.; Hoi, S.C.H. LAVIS: A one-stop library for language-vision intelligence. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Toronto, ON, Canada, 10–12 July 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 31–41. [Google Scholar]
- Eom, S.; Shim, J.; Koo, G.; Na, H.; Hasegawa-Johnson, M.A. Query-based Cross-Modal Projector Bolstering Mamba Multimodal LLM. In Findings of the Association for Computational Linguistics; EMNLP: Miami, FL, USA, 2024; pp. 14158–14167. [Google Scholar]
- Guo, H.; Xie, Z.; Cao, S.; Wang, B.; Liu, W.; Le, A.; Li, L.; Li, Z. SNS-Bench-VL: Benchmarking multimodal large language models in social networking services. arXiv 2025, arXiv:2505.23065. [Google Scholar]
- Ma, Y.; Ji, J.; Sun, X.; Zhou, Y.; Ji, R. Towards Local Visual Modeling for Image Captioning. Pattern Recognit. 2023, 138, 109420. [Google Scholar] [CrossRef]
- Liu, Y.; Li, X.; Zhang, L.; Wang, Z.; Zheng, Z.; Zhou, Y.; Xie, C. OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning. arXiv 2025, arXiv:2509.01644. [Google Scholar] [CrossRef]
- Kim, S.; Xiao, R.; Georgescu, M.-I.; Alaniz, S.; Akata, Z. COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 14690–14700. [Google Scholar]
- Tschannen, M.; Gritsenko, A.; Wang, X.; Naeem, M.F.; Alabdulmohsin, I.; Parthasarathy, N.; Evans, T.; Beyer, L.; Xia, Y.; Mustafa, B.; et al. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features. arXiv 2025, arXiv:2502.14786. [Google Scholar]
- Zheng, C. The Linear Attention Resurrection in Vision Transformer. arXiv 2025, arXiv:2501.16182. [Google Scholar] [CrossRef]
- Zhou, J.; Jiang, J.; Zhu, Z. Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation. arXiv 2025, arXiv:2510.23894. [Google Scholar]
- Zhan, G.; Liu, Y.; Han, K.; Xie, W.; Zisserman, A. EIP: Enhanced Visual-Language Foundation Models for Image Retrieval. arXiv 2025, arXiv:2502.15682. [Google Scholar]
- Tschannen, M.; Kumar, M.; Steiner, A.; Zhai, X.; Houlsby, N.; Beyer, L. Image Captioners Are Scalable Vision Learners Too. arXiv 2023, arXiv:2306.07915. [Google Scholar] [CrossRef]
- Li, J.; Li, D.; Xiong, C.; Hoi, S. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In Proceedings of the 39th International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
- Guo, Q.; Yao, K.; Chu, W. Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–24 October 2022. [Google Scholar]
- Hu, W.; Dou, Z.; Li, L.H.; Kamath, A.; Peng, N.; Chang, K.-W. Matryoshka Query Transformer for Large Vision-Language Models. arXiv 2024, arXiv:2405.19315. [Google Scholar] [CrossRef]
- Hashmi, K.A.; Badrinarayanan, V.; Udpa, S.; Kundu, A.; Khandelwal, Y. FeatEnHancer: Enhancing Hierarchical Features for Object Detection and Beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023. [Google Scholar]
- Ma, X.; Zhou, C.; Kong, X.; He, J.; Gui, L.; Neubig, G.; May, J.; Zettlemoyer, L. MEGA: Moving Average Equipped Gated Attention. In Proceedings of the 11th International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Wu, M.; Zhang, X.; Sun, X.; Zhou, Y.; Chen, C.; Gu, J.; Sun, X.; Ji, R. DIFNet: Boosting Visual Information Flow for Image Captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 18020–18029. [Google Scholar]
- Wortsman, M.; Dettmers, T.; Zettlemoyer, L.; Morcos, A.; Farhadi, A.; Schmidt, L. Stable and Low-Precision Training for Large-Scale Vision-Language Models. arXiv 2023, arXiv:2304.13013. [Google Scholar]
- Kalra, D.S.; Barkeshli, M. Why Warmup the Learning Rate? Underlying Mechanisms and Improvements. arXiv 2024, arXiv:2406.09405. [Google Scholar] [CrossRef]
- Beck, M.; Pöppel, K.; Spanring, M.; Auer, A.; Prudnikova, O.; Kopp, M.; Klambauer, G.; Brandstetter, J.; Hochreiter, S. xLSTM: Extended Long Short-Term Memory. arXiv 2024, arXiv:2405.04517. [Google Scholar]
- Koryakovskiy, I.; Yakovleva, A.; Buchnev, V.; Isaev, T.; Odinokikh, G. One-Shot Model for Mixed-Precision Quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7939–7949. [Google Scholar]
- Herdade, S.; Kappeler, A.; Boakye, K.; Soares, J. Image captioning: Transforming objects into words. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; ACM: New York, NY, USA, 2019; pp. 11137–11147. [Google Scholar]
- Huang, L.; Wang, W.; Chen, J.; Wei, X.-Y. Attention on attention for image captioning. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Long Beach, CA, USA, 2019; pp. 4634–4643. [Google Scholar]
- Luo, Y.; Ji, J.; Sun, X.; Cao, L.; Wu, Y.; Huang, F.; Lin, C.-W.; Ji, R. Dual-level collaborative Transformer for image captioning. In Proceedings of the AAAI Conference on Artificial Intelligence, AAAI, Virtual, 2–9 February 2021; pp. 2286–2293. [Google Scholar]
- Schumann, C.; Ricco, S.; Prabhu, U.; Ferrari, V.; Pantofaru, C. A Step Toward More Inclusive People Annotations for Fairness. arXiv 2021, arXiv:2105.02317. [Google Scholar]
- Zang, X.; Liu, L.; Wang, M.; Song, Y.; Zhang, H.; Chen, J. PhotoChat: A Human-Human Dialogue Dataset with Photo Sharing Behavior for Joint Image-Text Modeling. arXiv 2021, arXiv:2108.01453. [Google Scholar]
- Honda, U.; Watanabe, T.; Matsumoto, Y. Switching to Discriminative Image Captioning by Relieving a Bottleneck of Reinforcement Learning. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
- Zanella, M.; Ayed, B. Low-Rank Few-Shot Adaptation of Vision-Language Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 16–22 June 2024; pp. 1593–1603. [Google Scholar]
- Chen, X.; Liu, J.; Wang, Y.; Wang, P.; Brand, M.; Wang, G. SuperLoRA: Parameter-Efficient Unified Adaptation for Large Vision Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 16–22 June 2024; pp. 8050–8055. [Google Scholar]
| Bidirectional Cross-Modal Attention | Multi-Scale Visual Aggregation | Semantic Residual Gating Fusion | BLEU@4 | METEOR | CIDEr | ROUGE-L | SPICE |
|---|---|---|---|---|---|---|---|
| × | × | × | 42.8 | 32.1 | 145.7 | 61.8 | 25.5 |
| √ | × | × | 42.8 | 32.0 | 145.5 | 61.9 | 25.5 |
| √ | √ | × | 43.8 | 31.6 | 140.0 | 61.8 | 25.1 |
| √ | √ | √ | 44.0 | 31.9 | 147.0 | 62.2 | 25.5 |
| Bidirectional Cross-Modal Attention | Multi-Scale Visual Aggregation | Semantic Residual Gating Fusion | BLEU@4 | METEOR | CIDEr | ROUGE-L | SPICE |
|---|---|---|---|---|---|---|---|
| × | × | × | 42.5 | 31.7 | 144.6 | 61.5 | 25.4 |
| √ | × | × | 43.2 | 31.4 | 145.3 | 61.6 | 25.0 |
| √ | √ | × | 43.9 | 31.8 | 146.7 | 62.0 | 25.2 |
| √ | √ | √ | 43.9 | 31.8 | 146.8 | 62.0 | 25.2 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
