A Study on Generating Maritime Image Captions Based on Transformer Dual Information Flow
Abstract
1. Introduction
- 1. To address key challenges in maritime image captioning, we propose a dual-information-flow architecture based on the Transformer that integrates segmentation features with grid features and introduces a cross-attention mechanism to strengthen semantic scene modeling, thereby improving the accuracy and domain specificity of the generated descriptions.
- 2. We construct two maritime image datasets: the Maritime Semantic Segmentation Dataset (MSS) and the Maritime Image Description Dataset (MIC). These datasets provide diverse real-world data for maritime image captioning and maritime scene semantic segmentation, supporting further research in both fields.
- 3. We design a multimodal feature fusion module that fuses the two feature streams implicitly: the multi-head attention (MHA) and position-wise feed-forward (PWFF) layers share their parameters across modalities, while the batch normalization layers remain modality-specific. This limits the growth in parameters while improving the fusion of the two feature types (a minimal sketch follows this list).
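To make the shared-parameter design in contribution 3 concrete, the following is a minimal PyTorch sketch, not the paper's implementation: a single set of MHA and PWFF weights serves both the grid-feature and segmentation-feature streams, while each stream keeps its own normalization layers. Layer normalization is used here for brevity where the paper specifies modality-specific batch normalization; all class names and dimensions are illustrative.

```python
import torch
import torch.nn as nn


class SharedFusionLayer(nn.Module):
    """Illustrative dual-stream encoder layer: MHA and PWFF weights are shared
    across the two feature streams; normalization layers are modality-specific."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        # Shared sub-layers: one set of weights serves both modalities.
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.pwff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        # Modality-specific normalization (separate parameters per stream).
        self.grid_norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(2)])
        self.seg_norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(2)])

    def _encode(self, x: torch.Tensor, norms: nn.ModuleList) -> torch.Tensor:
        attn_out, _ = self.mha(x, x, x)      # self-attention with shared weights
        x = norms[0](x + attn_out)           # residual + modality-specific norm
        return norms[1](x + self.pwff(x))    # PWFF with shared weights

    def forward(self, grid_feats: torch.Tensor, seg_feats: torch.Tensor):
        return (self._encode(grid_feats, self.grid_norms),
                self._encode(seg_feats, self.seg_norms))


# Toy usage: a batch of 2 images, 49 grid tokens and 49 segmentation tokens of width 512.
layer = SharedFusionLayer()
grid, seg = torch.randn(2, 49, 512), torch.randn(2, 49, 512)
grid_out, seg_out = layer(grid, seg)
print(grid_out.shape, seg_out.shape)  # torch.Size([2, 49, 512]) each
```

Because only the normalization parameters are duplicated, adding the second stream introduces little parameter overhead, which is the motivation stated in contribution 3.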
2. Related Work
3. Research Methodology
3.1. Transformer Architecture
3.1.1. Multi-Head Attention Mechanism
3.1.2. Position-Wise Feed-Forward Networks
3.2. Grid Feature Extraction Network
3.3. Segmentation Feature Extraction Network
3.4. Encoder Module
3.4.1. Encoding Network
3.4.2. Fusion Module
3.5. Decoder Module
3.5.1. Cross-Attention Mechanism
3.5.2. Decoding Networks
4. Experimental Results and Analyses
4.1. Dataset
4.2. Experimental Environment and Evaluation Indicators
4.3. Comparison and Analysis of Experimental Results
4.3.1. Quantitative Analysis
4.3.2. Qualitative Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE | CIDEr |
---|---|---|---|---|---|---|---|
Transformer | 0.888 | 0.839 | 0.794 | 0.748 | 0.548 | 0.879 | 3.134 |
TransformerS | 0.886 | 0.842 | 0.805 | 0.766 | 0.558 | 0.887 | 3.262 |
TransformerS1 | 0.884 | 0.835 | 0.794 | 0.757 | 0.591 | 0.896 | 3.323 |
TransformerS3 | 0.893 | 0.852 | 0.813 | 0.777 | 0.593 | 0.896 | 3.286 |
Ours | 0.895 | 0.854 | 0.821 | 0.788 | 0.575 | 0.897 | 3.343 |
Method | Segmentation Features | Number of Fusion Layers | FLOPs | Params | CIDEr |
---|---|---|---|---|---|
Transformer | No | 0 | 0.84G | 22.51M | 3.134 |
TransformerS | Yes | 0 | 1.50G | 36.71M | 3.262 |
TransformerS1 | Yes | 1 | 1.50G | 33.56M | 3.323 |
TransformerS3 | Yes | 3 | 1.50G | 27.25M | 3.286 |
Ours | Yes | 2 | 1.50G | 30.41M | 3.343 |
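For context on the FLOPs and Params columns, such figures are typically obtained by counting trainable parameters and profiling one forward pass. A minimal sketch, assuming PyTorch and the fvcore profiler as illustrative tooling (not necessarily what was used for this table), applied to a stand-in encoder layer:

```python
import torch
import torch.nn as nn
from fvcore.nn import FlopCountAnalysis  # illustrative profiler choice

# Stand-in model; the paper's actual architecture is not reproduced here.
model = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
tokens = torch.randn(1, 49, 512)

n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
flops = FlopCountAnalysis(model, (tokens,)).total()  # fvcore counts multiply-adds

print(f"Params: {n_params / 1e6:.2f}M")
print(f"FLOPs:  {flops / 1e9:.2f}G")
```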
Number of Groups | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE | CIDEr |
---|---|---|---|---|---|---|---|
1 | 0.916 | 0.877 | 0.845 | 0.809 | 0.597 | 0.908 | 3.410 |
2 | 0.887 | 0.842 | 0.800 | 0.762 | 0.588 | 0.901 | 3.287 |
3 | 0.917 | 0.873 | 0.838 | 0.800 | 0.578 | 0.899 | 3.374 |
4 | 0.900 | 0.858 | 0.820 | 0.777 | 0.584 | 0.901 | 3.325 |
5 | 0.916 | 0.873 | 0.830 | 0.786 | 0.598 | 0.919 | 3.407 |
Mean | 0.907 | 0.864 | 0.827 | 0.787 | 0.589 | 0.905 | 3.361 |
SD | 0.0133 | 0.0146 | 0.0175 | 0.0186 | 0.0085 | 0.0082 | 0.0535 |
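The Mean and SD rows are consistent with the arithmetic mean over the five groups and the sample standard deviation (ddof = 1). For example, the CIDEr column is reproduced by the following snippet (NumPy assumed):

```python
import numpy as np

# CIDEr scores of the five cross-validation groups from the table above.
cider = np.array([3.410, 3.287, 3.374, 3.325, 3.407])

print(f"Mean: {cider.mean():.3f}")       # 3.361
print(f"SD:   {cider.std(ddof=1):.4f}")  # 0.0535
```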
Method | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | METEOR | ROUGE | CIDEr |
---|---|---|---|---|---|---|---|
GoogleNIC | 0.725 | 0.638 | 0.568 | 0.503 | 0.456 | 0.764 | 2.078 |
M2 | 0.877 | 0.838 | 0.803 | 0.765 | 0.542 | 0.882 | 3.312 |
DFT | 0.882 | 0.843 | 0.808 | 0.770 | 0.546 | 0.882 | 3.314 |
Ours | 0.895 | 0.854 | 0.821 | 0.788 | 0.575 | 0.897 | 3.343 |
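The metrics reported in the tables of this section (BLEU-1 through BLEU-4, METEOR, ROUGE, CIDEr) are the standard captioning metrics. A minimal evaluation sketch, assuming the widely used pycocoevalcap toolkit and two toy image entries with pre-tokenized, lower-cased captions (this is not the paper's evaluation code):

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor  # requires a local Java runtime
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# image id -> list of reference captions / list with one generated caption
gts = {
    "img_1": ["the sea ahead is very wide with a city and a dock in the distance",
              "the sea ahead is particularly wide with a city and a port in the distance"],
    "img_2": ["there are multiple ships and buoys on the sea ahead"],
}
res = {
    "img_1": ["the sea ahead is very wide with a city in the distance"],
    "img_2": ["there are multiple ships and buoys on the sea ahead"],
}

bleu, _ = Bleu(4).compute_score(gts, res)     # [BLEU-1, BLEU-2, BLEU-3, BLEU-4]
meteor, _ = Meteor().compute_score(gts, res)  # METEOR
rouge, _ = Rouge().compute_score(gts, res)    # ROUGE-L
cider, _ = Cider().compute_score(gts, res)    # CIDEr
print(bleu, meteor, rouge, cider)
```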
Image | Captions |
---|---|
(Image 1) | TDSI: The sea ahead is very wide, with a city in the distance. Transformer: The sea ahead is very wide, with islands in the distance. GT1: The sea ahead is very wide, with a city and a dock in the distance. GT2: The sea ahead is particularly wide, with a city and a dock in the distance. GT3: The sea ahead is particularly wide, with a city and a port in the distance. |
(Image 2) | TDSI: The sea ahead has buoys, with the city in the distance. Transformer: The port ahead has several ships docked along the shore. GT1: The sea ahead has two buoys, with the city in the distance. GT2: The sea ahead has two buoys floating, with the city’s coastline in the distance. GT3: There are two buoys on the sea ahead, and in the distance is the city’s coastline. |
(Image 3) | TDSI: There are multiple ships and buoys on the sea ahead. Transformer: The city’s port is ahead. GT1: There are multiple ships and buoys on the sea ahead. GT2: There are ships and buoys on the sea ahead. GT3: There are multiple ships and buoys ahead. |