Image Captioning Using Topic Faster R-CNN-LSTM Networks
Abstract
1. Introduction
2. Related Works
2.1. Image Caption Generation
2.2. Topic Model
3. Methods
3.1. Object Detection
3.2. Topic Detection
3.3. Language Model
4. Experiments
4.1. Dataset
4.2. Experimental Setup
4.3. Evaluation Metrics
4.4. Comparison with Other Models
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Kulkarni, G.; Premraj, V.; Ordonez, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A.C.; Berg, T.L. Baby talk: Understanding and generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1601–1608. [Google Scholar] [CrossRef] [PubMed]
- Çaylı, Ö.; Makav, B.; Kılıç, V.; Onan, A. Mobile application based automatic caption generation for visually impaired. In Intelligent and Fuzzy Techniques: Smart and Innovative Solutions, Proceedings of the INFUS 2020 Conference, Istanbul, Turkey, 21–23 July 2020; Springer: Cham, Switzerland, 2020; pp. 1532–1539. [Google Scholar]
- Sangolgi, V.A.; Patil, M.B.; Vidap, S.S.; Doijode, S.S.; Mulmane, S.Y.; Vadaje, A.S. Enhancing Cross-Linguistic Image Caption Generation with Indian Multilingual Voice Interfaces using Deep Learning Techniques. Procedia Comput. Sci. 2023, 233, 547–557. [Google Scholar] [CrossRef]
- Huang, G.; Hu, H. c-RNN: A Fine-Grained Language Model for Image Captioning. Neural Process. Lett. 2019, 49, 683–691. [Google Scholar] [CrossRef]
- Bineeshia, J. Image Caption Generation Using CNN-LSTM Based Approach. In Proceedings of the ICCAP 2021, Chennai, India, 7–8 December 2021; p. 352. [Google Scholar]
- Khamparia, A.; Pandey, B.; Tiwari, S.; Gupta, D.; Khanna, A.; Rodrigues, J.J. An integrated hybrid CNN–RNN model for visual description and generation of captions. Circuits Syst. Signal Process. 2020, 39, 776–788. [Google Scholar] [CrossRef]
- Verma, A.; Yadav, A.K.; Kumar, M. Automatic image caption generation using deep learning. Multimed. Tools Appl. 2024, 83, 5309–5325. [Google Scholar] [CrossRef]
- Bartosiewicz, M.; Iwanowski, M. The Optimal Choice of the Encoder–Decoder Model Components for Image Captioning. Information 2024, 15, 504. [Google Scholar] [CrossRef]
- Wang, S.; Zhu, Y. A Novel Image Caption Model Based on Transformer Structure. In Proceedings of the ICICSE, Chengdu, China, 19–21 March 2021; pp. 144–148. [Google Scholar] [CrossRef]
- Zeng, X.; Wen, L.; Liu, B.; Qi, X. Deep learning for ultrasound image caption generation based on object detection. Neurocomputing 2020, 392, 132–141. [Google Scholar] [CrossRef]
- Palash, M.A.H.; Nasim, M.D.; Saha, S.; Afrin, F.; Mallik, R.; Samiappan, S. Bangla Image Caption Generation through CNN-Transformer Based Encoder-Decoder Network. In Proceedings of the International Conference on Fourth Industrial Revolution and Beyond 2021, Dhaka, Bangladesh, 10–11 December 2021; Springer: Singapore, 2021; Volume 437, pp. 631–644. [Google Scholar]
- Alam, M.D.S.; Rahman, M.D.S.; Hosen, M.D.I.; Mubin, K.A.; Hossen, S.; Mridha, M.F. Bahdanau Attention Based Bengali Image Caption Generation. In Proceedings of the 2022 International Conference on Decision Aid Sciences and Applications (DASA), Chiangrai, Thailand, 23–25 March 2022; pp. 1073–1077. [Google Scholar] [CrossRef]
- Mishra, S.K.; Dhir, R.; Saha, S.; Bhattacharyya, P. A Hindi image caption generation framework using deep learning. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2021, 20, 1–19. [Google Scholar] [CrossRef]
- Wadhwa, V.; Gupta, B.; Gupta, S. AI Based Automated Image Caption Tool Implementation for Visually Impaired. In Proceedings of the 2021 International Conference on Industrial Electronics Research and Applications (ICIERA), New Delhi, India, 22–24 December 2021; pp. 1–6. [Google Scholar] [CrossRef]
- Kulkarni, C.; Monika, P.; Preeti, B.; Shruthi, S. A novel framework for automatic caption and audio generation. Mater. Today Proc. 2022, 65, 3248–3252. [Google Scholar] [CrossRef]
- Yang, D.; Chen, H.; Hou, X.; Ge, T.; Jiang, Y.; Jin, Q. Visual captioning at will: Describing images and videos guided by a few stylized sentences. In Proceedings of the 31st ACM International Conference on Multimedia, New York, NY, USA, 29 October–3 November 2023; pp. 5705–5715. [Google Scholar] [CrossRef]
- Du, S.; Zhu, H.; Zhang, Y.; Wang, D.; Shi, J.; Xing, N.; Lin, G.; Zhou, H. Controllable Image Captioning with Feature Refinement and Multilayer Fusion. Appl. Sci. 2023, 13, 5020. [Google Scholar] [CrossRef]
- Li, Y.; Zhang, X.; Zhang, T.; Wang, G.; Wang, X.; Li, S. A patch-level region-aware module with a multi-label framework for remote sensing image captioning. Remote Sens. 2024, 16, 3987. [Google Scholar] [CrossRef]
- Peng, R.; He, H.; Wei, Y.; Wen, Y.; Hu, D. Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 3963–3973. [Google Scholar]
- Hofmann, T. Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, San Francisco, CA, USA, 30 July–1 August 1999; pp. 289–296. [Google Scholar]
- Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
- Lee, D.; Seung, H.S. Algorithms for non-negative matrix factorization. Adv. Neural Inf. Process. Syst. 2000, 13, 556–562. [Google Scholar]
- English, J.A.; Kossarian, M.M.; McManis, C.E.; Smith, D.A. Phenomenological Semantic Distance from Latent Dirichlet Allocations (LDA) Classification. U.S. Patent 10,242,002, 1 February 2018. [Google Scholar]
- Yu, D.; Xiang, B. Discovering topics and trends in the field of Artificial Intelligence: Using LDA topic modeling. Expert Syst. Appl. 2023, 225, 120114. [Google Scholar] [CrossRef]
- Inoue, M.; Fukahori, H.; Matsubara, M.; Yoshinaga, N.; Tohira, H. Latent Dirichlet allocation topic modeling of free-text responses exploring the negative impact of the early COVID-19 pandemic on research in nursing. Jpn. J. Nurs. Sci. 2023, 20, e12520. [Google Scholar] [CrossRef]
- Abdalgader, K.; Matroud, A.A.; Al-Doboni, G. Temporal Dynamics in Short Text Classification: Enhancing Semantic Understanding Through Time-Aware Model. Information 2025, 16, 214. [Google Scholar] [CrossRef]
- Wu, L.; Xu, M.; Qian, S.; Cui, J. Image to Modern Chinese Poetry Creation via a Constrained Topic-aware Model. ACM Trans. Multimedia Comput. Commun. Appl. 2020, 16, 1–21. [Google Scholar] [CrossRef]
- Wang, B.; Zheng, X.; Qu, B.; Lu, X. Retrieval topic recurrent memory network for remote sensing image captioning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 13, 256–270. [Google Scholar] [CrossRef]
- Li, C.Y.; Chun, S.A.; Geller, J. Perspective-based Microblog Summarization. Information 2025, 16, 285. [Google Scholar] [CrossRef]
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef]
- Liu, X.; Xu, Q.; Wang, N. A survey on deep neural network-based image captioning. Vis. Comput. 2019, 35, 445–470. [Google Scholar] [CrossRef]
- Zhu, Y.; Lu, S.; Zheng, L.; Guo, J.; Zhang, W.; Wang, J.; Yu, Y. Texygen: A benchmarking platform for text generation models. In Proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 1097–1100. [Google Scholar] [CrossRef]
- Zhao, W.; Wu, X.; Luo, J. Cross-domain image captioning via cross-modal retrieval and model adaptation. IEEE Trans. Image Process. 2020, 30, 1180–1192. [Google Scholar] [CrossRef]
(a) MSCOCO dataset | (b) Flickr30k dataset
---|---|
a picture of an airplane parked at the loading docks | Several men in hard hats are operating a giant pulley system.
a jet airplane on the tarmac at the airport | Workers look down from up above on a piece of equipment.
a plane on the runway at the airport | Two men working on a machine wearing hard hats.
a jumbo jet american airlines plane sitting in a waiting area. | Four men on top of a tall structure.
a large jet liner sitting on top of an airport runway. | Three men on a large rig.
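For readers who want to inspect reference captions like those above, the MSCOCO caption annotations can be read directly with the pycocotools COCO API. The minimal sketch below is not part of the original article; the annotation file path and the choice of image are illustrative placeholders.

```python
# Minimal sketch (not from the article): reading the human reference captions
# of one MSCOCO image with the pycocotools COCO API. The annotation file path
# and the chosen image id are illustrative placeholders.
from pycocotools.coco import COCO

coco_caps = COCO("annotations/captions_val2014.json")  # hypothetical local path

img_id = coco_caps.getImgIds()[0]             # pick an arbitrary image id
ann_ids = coco_caps.getAnnIds(imgIds=img_id)  # caption annotations for that image
annotations = coco_caps.loadAnns(ann_ids)

for ann in annotations:
    print(ann["caption"])
```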
Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
---|---|---|---|---|
c-RNN [4] | 0.612 | 0.449 | 0.323 | 0.237 |
Khamparia et al. [6] | 0.852 | - | - | - |
VGGNet + LSTM [16] | 0.510 | 0.322 | 0.207 | 0.136 |
The proposed approach | 0.638 | 0.507 | 0.405 | 0.304 |
Model | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 |
---|---|---|---|---|
c-RNN [4] | 0.604 | 0.431 | 0.304 | 0.215 |
Du et al. [17] | - | - | - | 0.146 |
Zhao et al. [33] | 0.690 | 0.493 | 0.347 | 0.241 |
The proposed approach | 0.643 | 0.435 | 0.277 | 0.180 |
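The BLEU-1 through BLEU-4 scores in the two tables above are cumulative n-gram precisions computed against all reference captions of each test image. The sketch below illustrates one common way to obtain such scores with NLTK's corpus_bleu; the toy sentences, whitespace tokenization, and smoothing choice are assumptions for the example, not the authors' exact evaluation pipeline.

```python
# Illustrative BLEU-1..BLEU-4 computation against multiple reference captions
# with NLTK; toy data and smoothing method are assumptions, not the paper's
# actual evaluation setup.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [  # one list of tokenized reference captions per test image
    [
        "a plane on the runway at the airport".split(),
        "a jet airplane on the tarmac at the airport".split(),
    ]
]
hypotheses = [  # one tokenized generated caption per test image
    "a large jet sitting on an airport runway".split()
]

smooth = SmoothingFunction().method1
weights = {
    "BLEU-1": (1.0, 0.0, 0.0, 0.0),
    "BLEU-2": (0.5, 0.5, 0.0, 0.0),
    "BLEU-3": (1 / 3, 1 / 3, 1 / 3, 0.0),
    "BLEU-4": (0.25, 0.25, 0.25, 0.25),
}
for name, w in weights.items():
    score = corpus_bleu(references, hypotheses, weights=w, smoothing_function=smooth)
    print(f"{name}: {score:.3f}")
```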
Model | METEOR | ROUGE-L | CIDEr
---|---|---|---|
c-RNN [4] | 0.247 | - | - |
Khamparia et al. [6] | - | 0.863 | 0.493 |
VGGNet + LSTM [16] | 0.170 | - | 0.654 |
The proposed approach | 0.223 | 0.484 | 0.827 |
Model | METEOR | ROUGE-L | CIDEr
---|---|---|---|
c-RNN [4] | 0.190 | - | - |
Du et al. [17] | 0.190 | 0.521 | 1.169 |
Zhao et al. [33] | 0.203 | 0.465 | 0.528 |
The proposed approach | 0.374 | 0.311 | 0.676 |
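METEOR, ROUGE-L, and CIDEr are usually computed with the reference implementations bundled in the COCO caption evaluation toolkit (pycocoevalcap). The sketch below runs the ROUGE-L and CIDEr scorers on toy data (METEOR additionally requires a Java runtime and is omitted); in the standard pipeline a PTBTokenizer pass precedes the scorers, which is skipped here for brevity. All ids, captions, and resulting numbers are illustrative, not results from the paper.

```python
# Sketch of caption scoring with the COCO caption evaluation toolkit
# (pycocoevalcap); toy data only, not the paper's evaluation.
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# image id -> list of reference captions / single generated caption
gts = {
    "img1": ["a plane on the runway at the airport",
             "a jet airplane on the tarmac at the airport"],
    "img2": ["two men working on a machine wearing hard hats",
             "several men in hard hats are operating a giant pulley system"],
}
res = {
    "img1": ["a large jet sitting on an airport runway"],
    "img2": ["two men in hard hats work on a large rig"],
}

for name, scorer in [("ROUGE-L", Rouge()), ("CIDEr", Cider())]:
    corpus_score, per_image_scores = scorer.compute_score(gts, res)
    print(f"{name}: {corpus_score:.3f}")
```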
ResNet + Oe + ATTLSTM + SCS (ours) | c-RNN [4] |
---|---|
穿白襯衫的男人拿著熱狗堡及漢堡 (A man in a white shirt holding a hot dog and a hamburger) | 一個男人拿著食物 (A man holding food) |
穿白襯衫的男人雙手各拿著食物 (A man in a white shirt holding food in each hand) | 一個男人拿著三明治 (A man holding a sandwich)
穿黑衣的男人拿著冰淇淋和飲料 (A man in black clothes holding ice cream and a drink) | 一個男人拿著啤酒 (A man holding a beer)
穿黃衣的男子站在冰箱前面 (A man in yellow clothes standing in front of a fridge) | 一個穿著白衣服的男子拿著食物 (A man in white clothes holding food) |
冰箱上有一個紫色的背包 (A purple backpack is on top of the fridge) |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).