The Optimal Choice of the Encoder–Decoder Model Components for Image Captioning †
Abstract
1. Introduction
2. Deep-Learning Captioning Models
3. Preliminaries
3.1. Encoder–Decoder Image Captioning Model
3.2. Image Feature Extractors
3.3. Recurrent Neural Network
3.4. Word Embedding Models
3.5. Adaptation, Merging, and Word Prediction
3.6. Evaluation Metrics
4. Training and Evaluation
4.1. Dataset and Testing Procedure
4.2. Training
4.3. Evaluation
5. Results
5.1. Feature Extraction and Word Embedding
5.2. Merging and Word Prediction
No. | RNN Size | Adaptation Component Size | Word Prediction Component Size | No. of Model Parameters (mln) | Time of Sentence Generation (ms) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | SPICE
---|---|---|---|---|---|---|---|---|---|---|---
1 | 512 | 512 | 1024 | 33.35 | 6024 | 66.31 | 48.47 | 34.29 | 24.22 | 79.05 | 15.57
2 | 256 | 256 | - | 27.04 | 2814 | 66.96 | 48.74 | 34.32 | 24.28 | 79.07 | 15.54
3 | 128 | 128 | 256 | 24.68 | 2647 | 67.27 | 49.69 | 35.36 | 25.04 | 80.44 | 15.71
4 | 256 | 256 | 512 | 27.31 | 2812 | 67.51 | 49.75 | 35.56 | 25.36 | 82.49 | 16.08
5 | 256 | - | - | 25.56 | 2765 | 65.36 | 47.02 | 32.72 | 22.77 | 75.93 | 14.90
6 | 256 | - | 512 | 26.66 | 2513 | 66.22 | 48.39 | 34.32 | 24.27 | 78.71 | 15.09
7 | 256 | 256 | 256 | 25.32 | 2247 | 67.56 | 49.72 | 35.48 | 25.24 | 81.85 | 15.63
8 | 256 | 256 | 128 | 24.31 | 1905 | 67.18 | 49.47 | 35.19 | 24.90 | 80.73 | 15.40
9 | 512 | 512 | - | 32.29 | 2393 | 65.24 | 47.33 | 33.30 | 23.54 | 75.86 | 14.88
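The component sizes listed above correspond to a merge-type encoder-decoder: pre-extracted image features pass through the adaptation component, word embeddings through the recurrent component, and the merged representation through the word prediction component, which ends in a softmax over the vocabulary. The snippet below is a minimal Keras-style sketch of such a model, given only as an illustration; the vocabulary size, maximum caption length, feature and embedding dimensions, dropout rates, and layer arrangement are assumptions, not the exact configuration evaluated in the paper.

```python
# Illustrative sketch of a merge-type encoder-decoder captioning model.
# VOCAB_SIZE, MAX_LEN, FEATURE_DIM and EMBED_DIM are placeholder values.
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, concatenate
from tensorflow.keras.models import Model

VOCAB_SIZE, MAX_LEN = 10000, 34      # assumed vocabulary size and caption length
FEATURE_DIM, EMBED_DIM = 2048, 300   # e.g., pooled Xception features and GloVe-300 vectors

def build_caption_model(rnn_size=256, adaptation_size=256, word_pred_size=512):
    # Image branch: pre-extracted CNN features -> adaptation component.
    img_in = Input(shape=(FEATURE_DIM,))
    img_feat = Dense(adaptation_size, activation="relu")(Dropout(0.5)(img_in))

    # Text branch: embedded word sequence -> recurrent component.
    txt_in = Input(shape=(MAX_LEN,))
    txt_emb = Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(txt_in)
    txt_feat = LSTM(rnn_size)(Dropout(0.5)(txt_emb))

    # Merge (concatenation) -> word prediction component -> softmax over the vocabulary.
    merged = concatenate([img_feat, txt_feat])
    hidden = Dense(word_pred_size, activation="relu")(merged)
    out = Dense(VOCAB_SIZE, activation="softmax")(hidden)

    model = Model(inputs=[img_in, txt_in], outputs=out)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```

In rows where the adaptation or word prediction size is given as "-", the corresponding component is omitted; in this sketch that would simply mean leaving the matching Dense layer out.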
No. | RNN Size | Adaptation Component Size | Word Prediction Component Size | No. of Model Parameters (mln) | Time of Sentence Generation (ms) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | SPICE
---|---|---|---|---|---|---|---|---|---|---|---
1 | 512 | 512 | 512 | 28.82 | 5565 | 67.53 | 49.67 | 35.39 | 25.19 | 82.48 | 15.75
2 | 256 | 256 | - | 25.18 | 4195 | 66.87 | 48.61 | 34.17 | 24.00 | 78.55 | 15.39
3 | 128 | 128 | 128 | 23.7 | 5100 | 65.86 | 47.94 | 33.54 | 23.40 | 74.92 | 14.77
4 | 256 | 256 | 256 | 25.2 | 6336 | 66.59 | 48.63 | 34.34 | 24.33 | 78.13 | 15.16
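The sentence-generation times reported above refer to producing a caption token by token at inference. A greedy (argmax) decoding loop along the lines of the sketch below is the simplest variant; the `startseq`/`endseq` markers, the Keras `Tokenizer`, and the maximum length of 34 are assumptions about the preprocessing pipeline rather than the authors' exact code.

```python
# Illustrative greedy decoding loop for the merge model sketched earlier.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_features, max_len=34):
    caption = "startseq"                       # assumed start-of-sequence marker
    for _ in range(max_len):
        seq = tokenizer.texts_to_sequences([caption])[0]
        seq = pad_sequences([seq], maxlen=max_len)
        probs = model.predict([photo_features, seq], verbose=0)[0]
        word = tokenizer.index_word.get(int(np.argmax(probs)))
        if word is None or word == "endseq":   # assumed end-of-sequence marker
            break
        caption += " " + word
    return caption.replace("startseq", "").strip()
```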
5.3. Recurrent Neural Network Model
5.4. Experiments on External Images
5.5. Comparison with Transformer-Based Approaches
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Ramachandram, D.; Taylor, G.W. Deep Multimodal Learning: A Survey on Recent Advances and Trends. IEEE Signal Process. Mag. 2017, 34, 96–108. [Google Scholar] [CrossRef]
- Zhang, X.; He, S.; Song, X.; Lau, R.W.; Jiao, J.; Ye, Q. Image captioning via semantic element embedding. Neurocomputing 2020, 395, 212–221. [Google Scholar] [CrossRef]
- Janusz, A.; Kałuża, D.; Matraszek, M.; Grad, Ł.; Świechowski, M.; Ślęzak, D. Learning multimodal entity representations and their ensembles, with applications in a data-driven advisory framework for video game players. Inf. Sci. 2022, 617, 193–210. [Google Scholar] [CrossRef]
- Zhang, W.; Sugeno, M. A fuzzy approach to scene understanding. In Proceedings of the [Proceedings 1993] Second IEEE International Conference on Fuzzy Systems, San Francisco, CA, USA, 28 March–1 April 1993; Volume 1, pp. 564–569. [Google Scholar] [CrossRef]
- Iwanowski, M.; Bartosiewicz, M. Describing images using fuzzy mutual position matrix and saliency-based ordering of predicates. In Proceedings of the 2021 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Luxembourg, 11–14 July 2021; pp. 1–8. [Google Scholar] [CrossRef]
- Kuznetsova, P.; Ordonez, V.; Berg, A.; Berg, T.; Choi, Y. Collective Generation of Natural Image Descriptions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jeju Island, Republic of Korea, 10 July 2012; pp. 359–368. [Google Scholar]
- Li, S.; Kulkarni, G.; Berg, T.L.; Berg, A.C.; Choi, Y. Composing Simple Image Descriptions using Web-scale N-grams. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, Portland, OR, USA, 23–24 June 2011; pp. 220–228. [Google Scholar]
- Mitchell, M.; Han, X.; Dodge, J.; Mensch, A.; Goyal, A.; Berg, A.; Yamaguchi, K.; Berg, T.; Stratos, K.; Daumé, H. Midge: Generating Image Descriptions from Computer Vision Detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics (EACL ’12), Avignon, France, 23–27 April 2012; pp. 747–756. [Google Scholar]
- Farhadi, A.; Hejrati, M.; Sadeghi, M.A.; Young, P.; Rashtchian, C.; Hockenmaier, J.; Forsyth, D. Every Picture Tells a Story: Generating Sentences from Images. In Proceedings of the Computer Vision—ECCV 2010; Daniilidis, K., Maragos, P., Paragios, N., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; pp. 15–29. [Google Scholar]
- Barnard, K.; Duygulu, P.; Forsyth, D.; Blei, D.; Kandola, J.; Hofmann, T.; Poggio, T.; Shawe-Taylor, J. Matching Words and Pictures. J. Mach. Learn. Res. 2003, 3, 1107–1135. [Google Scholar] [CrossRef]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems; Pereira, F., Burges, C., Bottou, L., Weinberger, K., Eds.; Curran Associates, Inc.: New York, NY, USA, 2012; Volume 25. [Google Scholar]
- Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3156–3164. [Google Scholar]
- Ramisa, A.; Yan, F.; Moreno-Noguer, F.; Mikolajczyk, K. BreakingNews: Article Annotation by Image and Text Processing. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1072–1085. [Google Scholar]
- Biten, A.F.; Gómez, L.; Rusiñol, M.; Karatzas, D. Good News, Everyone! Context Driven Entity-Aware Captioning for News Images. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 12458–12467. [Google Scholar]
- Sharma, P.; Ding, N.; Goodman, S.; Soricut, R. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proceedings of the ACL, Melbourne, Australia, 15–20 July 2018. [Google Scholar]
- Changpinyo, S.; Sharma, P.; Ding, N.; Soricut, R. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training to Recognize Long-Tail Visual Concepts. In Proceedings of the CVPR, Los Alamitos, CA, USA, 10 June 2021. [Google Scholar]
- Kiros, R.; Salakhutdinov, R.; Zemel, R. Multimodal Neural Language Models. In Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 22–24 June 2014; pp. 595–603. [Google Scholar]
- Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 7–12 June 2015; pp. 3128–3137. [Google Scholar] [CrossRef]
- Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Los Alamitos, CA, USA, 23–28 June 2014; pp. 580–587. [Google Scholar] [CrossRef]
- Donahue, J.; Hendricks, L.A.; Rohrbach, M.; Venugopalan, S.; Guadarrama, S.; Saenko, K.; Darrell, T. Long-Term Recurrent Convolutional Networks for Visual Recognition and Description. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 677–691. [Google Scholar] [CrossRef]
- Johnson, J.; Karpathy, A.; Fei-Fei, L. Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 18–20 June 2016; pp. 4565–4574. [Google Scholar]
- Xiao, X.; Wang, L.; Ding, K.; Xiang, S.; Pan, C. Dense semantic embedding network for image captioning. Pattern Recognit. 2019, 90, 285–296. [Google Scholar] [CrossRef]
- Toshevska, M.; Stojanovska, F.; Zdravevski, E.; Lameski, P.; Gievska, S. Exploration into Deep Learning Text Generation Architectures for Dense Image Captioning. In Proceedings of the 2020 15th Conference on Computer Science and Information Systems (FedCSIS), Sofia, Bulgaria, 6–9 September 2020; pp. 129–136. [Google Scholar] [CrossRef]
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015; Volume 37, pp. 2048–2057. [Google Scholar]
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6077–6086. [Google Scholar]
- Guo, L.; Liu, J.; Tang, J.; Li, J.; Luo, W.; Lu, H. Aligning Linguistic Words and Visual Semantic Units for Image Captioning. In Proceedings of the 27th ACM International Conference on Multimedia (MM ’19), New York, NY, USA, 21–25 October 2019; pp. 765–773. [Google Scholar] [CrossRef]
- Gu, J.; Wang, G.; Cai, J.; Chen, T. An Empirical Study of Language CNN for Image Captioning. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1231–1240. [Google Scholar]
- Liu, S.; Bai, L.; Hu, Y.; Wang, H. Image Captioning Based on Deep Neural Networks. MATEC Web Conf. 2018, 232, 01052. [Google Scholar] [CrossRef]
- Xu, K.; Wang, H.; Tang, P. Image captioning with deep LSTM based on sequential residual. In Proceedings of the 2017 IEEE International Conference on Multimedia and Expo (ICME), Los Alamitos, CA, USA, 10–14 July 2017; pp. 361–366. [Google Scholar] [CrossRef]
- Mao, J.; Xu, W.; Yang, Y.; Wang, J.; Yuille, A.L. Explain Images with Multimodal Recurrent Neural Networks. arXiv 2014, arXiv:1410.1090. [Google Scholar]
- Dong, H.; Zhang, J.; McIlwraith, D.; Guo, Y. I2T2I: Learning Text to Image Synthesis with Textual Data Augmentation. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; IEEE Press: Piscataway, NJ, USA, 2017; pp. 2015–2019. [Google Scholar] [CrossRef]
- Xian, Y.; Tian, Y. Self-Guiding Multimodal LSTM-When We Do Not Have a Perfect Training Dataset for Image Captioning. IEEE Trans. Image Process. 2019, 28, 5241–5252. [Google Scholar] [CrossRef]
- Rennie, S.J.; Marcheret, E.; Mroueh, Y.; Ross, J.; Goel, V. Self-Critical Sequence Training for Image Captioning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 21–26 July 2017; pp. 1179–1195. [Google Scholar]
- Lu, J.; Xiong, C.; Parikh, D.; Socher, R. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 21–26 July 2017; pp. 3242–3250. [Google Scholar]
- Delbrouck, J.; Dupont, S. Bringing back simplicity and lightliness into neural image captioning. arXiv 2018, arXiv:1810.06245. [Google Scholar]
- Tanti, M.; Gatt, A.; Camilleri, K. What is the Role of Recurrent Neural Networks (RNNs) in an Image Caption Generator? In Proceedings of the 10th International Conference on Natural Language Generation, Santiago de Compostela, Spain, 4–7 September 2017; pp. 51–60. [Google Scholar] [CrossRef]
- Zhou, L.; Xu, C.; Koch, P.A.; Corso, J.J. Image Caption Generation with Text-Conditional Semantic Attention. arXiv 2016, arXiv:1606.04621. [Google Scholar]
- Chen, X.; Zitnick, C.L. Mind’s eye: A recurrent visual representation for image caption generation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 7–12 June 2015; pp. 2422–2431. [Google Scholar] [CrossRef]
- Hessel, J.; Savva, N.; Wilber, M. Image Representations and New Domains in Neural Image Captioning. arXiv 2015, arXiv:1508.02091. [Google Scholar] [CrossRef]
- Song, M.; Yoo, C.D. Multimodal representation: Kneser-ney smoothing/skip-gram based neural language model. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 2281–2285. [Google Scholar] [CrossRef]
- Hendricks, L.; Venugopalan, S.; Rohrbach, M.; Mooney, R.; Saenko, K.; Darrell, T. Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 27–30 June 2016; pp. 1–10. [Google Scholar] [CrossRef]
- You, Q.; Jin, H.; Wang, Z.; Fang, C.; Luo, J. Image Captioning with Semantic Attention. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 27–30 June 2016; pp. 4651–4659. [Google Scholar] [CrossRef]
- Mao, J.; Xu, W.; Yang, Y.; Wang, J.; Yuille, A.L. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN). arXiv 2014, arXiv:1412.6632. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
- Liu, W.; Chen, S.; Guo, L.; Zhu, X.; Liu, J. CPTR: Full Transformer Network for Image Captioning. arXiv 2021, arXiv:2101.10804. [Google Scholar]
- Pan, Y.; Yao, T.; Li, Y.; Mei, T. X-linear attention networks for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Los Alamitos, CA, USA, 14–19 June 2020; pp. 10971–10980. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning; Available online: http://proceedings.mlr.press/v139/radford21a/radford21a.pdf (accessed on 12 July 2024).
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.; Parekh, Z.; Pham, H.; Le, Q.V.; Sung, Y.; Li, Z.; Duerig, T. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. arXiv 2021, arXiv:2102.05918. [Google Scholar]
- Zhou, L.; Palangi, H.; Zhang, L.; Hu, H.; Corso, J.; Gao, J. Unified vision-language pre-training for image captioning and vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13041–13049. [Google Scholar]
- Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 121–137. [Google Scholar]
- Zhang, P.; Li, X.; Hu, X.; Yang, J.; Zhang, L.; Wang, L.; Choi, Y.; Gao, J. VinVL: Making Visual Representations Matter in Vision-Language Models. arXiv 2021, arXiv:2101.00529. [Google Scholar]
- Ding, Z.; Sun, Y.; Xu, S.; Pan, Y.; Peng, Y.; Mao, Z. Recent Advances and Perspectives in Deep Learning Techniques for 3D Point Cloud Data Processing. Robotics 2023, 12, 100. [Google Scholar] [CrossRef]
- Zhang, H.; Wang, C.; Yu, L.; Tian, S.; Ning, X.; Rodrigues, J. PointGT: A Method for Point-Cloud Classification and Segmentation Based on Local Geometric Transformation. IEEE Trans. Multimed. 2024, 26, 8052–8062. [Google Scholar] [CrossRef]
- Wang, C.; Ning, X.; Sun, L.; Zhang, L.; Li, W.; Bai, X. Learning Discriminative Features by Covering Local Geometric Space for Point Cloud Analysis. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5703215. [Google Scholar] [CrossRef]
- Wang, C.; Ning, X.; Li, W.; Bai, X.; Gao, X. 3D Person Re-Identification Based on Global Semantic Guidance and Local Feature Aggregation. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 4698–4712. [Google Scholar] [CrossRef]
- Xue, L.; Yu, N.; Zhang, S.; Panagopoulou, A.; Li, J.; Martín-Martín, R.; Wu, J.; Xiong, C.; Xu, R.; Niebles, J.C.; et al. ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2024; pp. 27091–27101. [Google Scholar]
- Chen, G.; Wang, M.; Yang, Y.; Yu, K.; Yuan, L.; Yue, Y. PointGPT: Auto-regressively Generative Pre-training from Point Clouds. In Proceedings of the Thirty-Seventh Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
- Wang, S.S.; Dong, R.Y. Learning Complex Spatial Relation Model from Spatial Data. J. Comput. 2019, 30, 123–136. [Google Scholar]
- Yang, Z.; Zhang, Y.; ur Rehman, S.; Huang, Y. Image Captioning with Object Detection and Localization. arXiv 2017, arXiv:1706.02430. [Google Scholar]
- Herdade, S.; Kappeler, A.; Boakye, K.; Soares, J. Image Captioning: Transforming Objects into Words. arXiv 2019, arXiv:1906.05963. [Google Scholar]
- Sugano, Y.; Bulling, A. Seeing with Humans: Gaze-Assisted Neural Image Captioning. arXiv 2016, arXiv:1608.05203. [Google Scholar]
- Lebret, R.; Pinheiro, P.O.; Collobert, R. Phrase-Based Image Captioning. In Proceedings of the 32nd International Conference on Machine Learning (ICML’15), Lille, France, 6–11 July 2015; JMLR.org: New York, NY, USA, 2015; Volume 37, pp. 2085–2094. [Google Scholar]
- Li, Y. Image Caption using VGG model and LSTM. Appl. Comput. Eng. 2024, 48, 68–77. [Google Scholar] [CrossRef]
- Bartosiewicz, M.; Iwanowski, M.; Wiszniewska, M.; Frączak, K.; Leśnowolski, P. On Combining Image Features and Word Embeddings for Image Captioning. In Proceedings of the 2023 18th Conference on Computer Science and Intelligence Systems (FedCSIS), Warsaw, Poland, 17–20 September 2023; pp. 355–365. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar] [CrossRef]
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
- Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. arXiv 2016, arXiv:1610.02357. [Google Scholar]
- Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Los Alamitos, CA, USA, 21–26 July 2017. [Google Scholar]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
- Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1724–1734. [Google Scholar] [CrossRef]
- Mikolov, T.; Chen, K.; Corrado, G.S.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations, Scottsdale, AZ, USA, 2–4 May 2013. [Google Scholar]
- Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146. [Google Scholar]
- Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics (ACL ’02), Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef]
- Vedantam, R.; Zitnick, C.L.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 7–12 June 2015; pp. 4566–4575. [Google Scholar] [CrossRef]
- Cui, Y.; Yang, G.; Veit, A.; Huang, X.; Belongie, S. Learning to Evaluate Image Captioning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 18–22 June 2018; pp. 5804–5812. [Google Scholar] [CrossRef]
- Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. SPICE: Semantic propositional image caption evaluation. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 382–398. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
- Chen, X.; Fang, H.; Lin, T.; Vedantam, R.; Gupta, S.; Dollár, P.; Zitnick, C.L. Microsoft COCO Captions: Data Collection and Evaluation Server. arXiv 2015, arXiv:1504.00325. Available online: http://arxiv.org/abs/1504.00325 (accessed on 14 August 2024).
- Xu, N.; Liu, A.; Liu, J.; Nie, W.; Su, Y. Scene graph captioner: Image captioning based on structural visual representation. J. Vis. Commun. Image Represent. 2019, 58, 477–485. [Google Scholar]
- Kingma, D.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Rohrbach, A.; Hendricks, L.A.; Burns, K.; Darrell, T.; Saenko, K. Object Hallucination in Image Captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4035–4045. [Google Scholar] [CrossRef]
- OpenAI. DALL·E 3 System Card. 2023. Available online: https://openai.com/index/dall-e-3-system-card/ (accessed on 12 July 2024).
- OpenAI. Introducing GPT-4o and More Tools to ChatGPT Free Users. 2024. Available online: https://openai.com/index/gpt-4o-and-more-tools-to-chatgpt-free/ (accessed on 12 July 2024).
- Stefanini, M.; Cornia, M.; Baraldi, L.; Cascianelli, S.; Fiameni, G.; Cucchiara, R. From Show to Tell: A Survey on Image Captioning. arXiv 2021, arXiv:2107.06912. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2017; Volume 30. [Google Scholar]
- Radford, A.; Narasimhan, K. Improving Language Understanding by Generative Pre-Training. 2018. Available online: https://openai.com/index/language-unsupervised/ (accessed on 12 July 2024).
- Wiegreffe, S.; Pinter, Y. Attention is not not Explanation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 4 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; pp. 11–20. [Google Scholar] [CrossRef]
- Tan, H.; Bansal, M. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 4 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; pp. 5100–5111. [Google Scholar] [CrossRef]
Image Features | No. of Model Parameters (mln) | Embeddings | Time of Sentence Generation (ms) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | SPICE
---|---|---|---|---|---|---|---|---|---
Vgg19 | 144.47 | Glove | 2046 | 64.1 | 45.83 | 31.86 | 22.34 | 69.62 | 13.93
Vgg19 | 145.32 | FastText | 2090 | 65.42 | 46.89 | 32.72 | 22.93 | 71.79 | 14.46
Vgg16 | 143.26 | Glove | 2166 | 64.25 | 45.62 | 31.63 | 22.09 | 67.35 | 13.64
Vgg16 | 144.11 | FastText | 2086 | 64.47 | 45.73 | 31.54 | 21.86 | 67.76 | 13.81
ResNet50 | 27.97 | Glove | 2468 | 65.33 | 47.26 | 33.26 | 23.44 | 73.12 | 14.43
ResNet50 | 28.81 | FastText | 2016 | 65.97 | 47.82 | 33.79 | 24.02 | 74.47 | 14.71
ResNet152V2 | 62.71 | Glove | 2834 | 64.91 | 46.78 | 32.57 | 22.86 | 70.77 | 14.08
ResNet152V2 | 63.55 | FastText | 2418 | 65.28 | 46.78 | 32.47 | 22.61 | 70.07 | 14.16
MobileNetV2 | 6.45 | Glove | 2096 | 65.39 | 47.14 | 33.04 | 23.24 | 73.03 | 14.55
MobileNetV2 | 7.29 | FastText | 2144 | 65.13 | 47.22 | 33.17 | 23.32 | 73.79 | 14.62
MobileNet | 8.36 | Glove | 3860 | 64.35 | 46.14 | 32.12 | 22.42 | 69.28 | 13.76
MobileNet | 9.2 | FastText | 1952 | 65.02 | 46.93 | 32.85 | 23.02 | 71.24 | 14.31
Xception | 25.24 | Glove | 2414 | 66.59 | 48.63 | 34.34 | 24.33 | 78.13 | 15.16
Xception | 26.08 | FastText | 2052 | 67.01 | 48.8 | 34.45 | 24.3 | 77.64 | 15.18
InceptionV3 | 26.18 | Glove | 1960 | 66.12 | 47.72 | 33.35 | 23.38 | 74.16 | 14.72
InceptionV3 | 27.02 | FastText | 1922 | 66.15 | 47.87 | 33.57 | 23.63 | 75.04 | 14.83
DenseNet201 | 22.67 | Glove | 1828 | 66.35 | 48.41 | 34.26 | 24.18 | 76.54 | 14.96
DenseNet201 | 23.51 | FastText | 1748 | 66.59 | 48.73 | 34.57 | 24.55 | 76.74 | 14.83
DenseNet121 | 11.16 | Glove | 2468 | 65.03 | 47.02 | 32.96 | 23.26 | 71.94 | 14.13
DenseNet121 | 12.00 | FastText | 2360 | 65.39 | 47.09 | 32.89 | 23.09 | 72.36 | 14.25
Sugano [61] | - | - | - | 71.4 | 50.5 | 35.2 | 24.5 | 63.8 | -
Lebret [62] | - | - | - | 73 | 50 | 34 | 23 | - | -
Karpathy [18] | - | - | - | 62.5 | 45 | 32.1 | 23 | 66 | -
Xu [84] | - | - | - | 67.9 | 49.3 | 34.7 | 24.3 | 75.4 | -
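For context, the feature extractors and embeddings compared above can be obtained with standard tooling; the sketch below shows one plausible way to do it with Keras, under the assumption that image features are global-average-pooled CNN activations and that GloVe vectors are read from a local text file (the file path and helper names are illustrative, not taken from the paper).

```python
# Illustrative feature extraction (Xception) and GloVe embedding-matrix loading.
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing import image

# Encoder: Xception without its classifier head, global-average-pooled to 2048-d vectors.
cnn = Xception(weights="imagenet", include_top=False, pooling="avg")

def extract_features(img_path):
    img = image.load_img(img_path, target_size=(299, 299))   # Xception input size
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return cnn.predict(x, verbose=0)                          # shape (1, 2048)

def load_glove_matrix(glove_path, word_index, dim=300):
    # Build an embedding matrix aligned with the tokenizer's word index.
    vectors = {}
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    matrix = np.zeros((len(word_index) + 1, dim))
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix
```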
Image Features | Image | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | Predicted Caption | Ground-Truth Captions |
---|---|---|---|---|---|---|---|---|
DenseNet121 | Figure 2a | 79.54 | 73.06 | 64.87 | 53.42 | 159.42 | A man riding skis down a snow covered slope. | A man on skis is posing on a ski slope. A person on a ski mountain posing for the camera. A man n a red coat stands on the snow on skis. A man riding skis on top of a snow covered slope. A lady is in her ski gear in the snow. |
ResNet152V2 | Figure 2e | 60.00 | 44.72 | 0.00 | 0.00 | 83.88 | A dog jumping in the air to catch a frisbee. | A very cute brown dog with a disc in its mouth. A dog running in the grass with a frisbee in his mouth. A dog carrying a Frisbee in its mouth running on a grass lawn. A dog in a grassy field carrying a frisbee. A brown dog walking across a green field with a frisbee in its mouth. |
VGG19 | Figure 2c | 100.00 | 84.52 | 61.98 | 75.00 | 184.89 | A bathroom with a toilet and a sink. | A bathroom with a sink. toilet and vanity. Tiled bathroom with a couple towels hanging up An old bathroom with a black marble sink. A bathroom with a black sink counter next to a white toilet. The corner of a bathroom with light mint green walls above the tile. |
MobileNet | Figure 2b | 75.00 | 46.29 | 0.00 | 0.00 | 120.58 | A kitchen with a stove and a microwave. | A microwave is sitting idly in the kitchen. A shiny silver metal microwave near wooden cabinets. There are wooden cabinets that have a microwave attached at the bottom of it A microwave sitting next to and underneath kitchen cupboards. A kitchen scene with focus on a silver microwave. |
Picture: Figure 2d

Image Features | DenseNet201 | Xception
---|---|---
BLEU-1 | 38.46 | 75.00
BLEU-2 | 17.90 | 65.47
BLEU-3 | 0.00 | 52.28
BLEU-4 | 0.00 | 41.11
METEOR | 21.90 | 21.81
ROUGE-L | 35.62 | 69.85
CIDEr | 11.12 | 157.97
Predicted caption | A woman in a red dress is holding a white and red toothbrush. | A bride and groom cutting their wedding cake.

Ground-truth captions (Figure 2d): A man and woman standing in front of a cake. A newly wed couple celebrating with a toast. A bride and groom celebrate over a cake. A bride and groom are celebrating with wedding cake. A man and a woman standing next to each other.
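The BLEU-3 and BLEU-4 values of 0.00 above are not anomalies: cumulative BLEU drops to zero whenever a short candidate shares no 3-grams or 4-grams with any reference. This can be checked per image with NLTK, as sketched below using captions quoted from the table; the resulting numbers will not match the reported ones exactly, because the official evaluation applies its own tokenization.

```python
# Rough per-image BLEU check with NLTK (tokenization here is a plain split).
from nltk.translate.bleu_score import sentence_bleu

references = [r.lower().split() for r in [
    "A bride and groom celebrate over a cake.",
    "A bride and groom are celebrating with wedding cake.",
    "A man and woman standing in front of a cake.",
]]
hypothesis = "A bride and groom cutting their wedding cake.".lower().split()

weights = {"BLEU-1": (1, 0, 0, 0), "BLEU-2": (0.5, 0.5, 0, 0),
           "BLEU-3": (1/3, 1/3, 1/3, 0), "BLEU-4": (0.25, 0.25, 0.25, 0.25)}
for name, w in weights.items():
    print(name, round(100 * sentence_bleu(references, hypothesis, weights=w), 2))
```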
No. | RNN Size | Adaptation Component Size | Word Prediction Component Size | No. of Model Parameters (mln) | Time of Sentence Generation (ms) | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | CIDEr | SPICE
---|---|---|---|---|---|---|---|---|---|---|---
1 | 256 | 256 | 512 | 27.19 | 2211 | 67.32 | 49.62 | 35.51 | 25.45 | 81.41 | 15.60
2 | 256 | 256 | 256 | 25.19 | 2234 | 67.98 | 50.22 | 35.83 | 25.42 | 82.19 | 15.76
3 | 256 | 256 | 128 | 24.19 | 1823 | 67.20 | 49.10 | 34.67 | 24.41 | 80.39 | 15.33
Model Configuration | CIDEr | Predicted Caption
---|---|---
Image features: Xception; merge method: concatenate; word prediction component: 512; adaptation component: 0; RNN: LSTM | 2.2237 | A giraffe standing in a field next to a tree.
Image features: VGG16; merge method: concatenate; word prediction component: 256; RNN: LSTM | 1.7200 | A giraffe standing next to a tree in a park.
Image features: Xception; merge method: concatenate; word prediction component: 256; RNN: GRU | 1.6128 | A giraffe standing in a dirt field next to a building.
Image features: MobileNetV2; merge method: concatenate; word prediction component: 256; RNN: LSTM | 1.5162 | A giraffe standing in a fenced in area.
Image features: ResNet50; merge method: concatenate; word prediction component: 256; RNN: LSTM | 1.4194 | A giraffe standing next to a zebra in a zoo.
Image features: Xception; merge method: concatenate; word prediction component: 512; RNN: GRU | 1.2934 | A giraffe standing next to a zebra in a field.
Image features: InceptionV3; merge method: concatenate; word prediction component: 256; RNN: LSTM | 1.2851 | A giraffe standing next to a wooden fence.
Image features: Xception; merge method: concatenate; word prediction component: 512; RNN: LSTM | 0.9393 | A couple of giraffe standing next to each other.

Ground-truth captions: A giraffe standing outside of a building next to a tree. A giraffe standing in a small piece of shade. A giraffe finds some sparse shade in his habitat. Giraffe standing in a holding pen near a tree stump. A giraffe in a zoo enclosure next to a barn.
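Per-caption CIDEr values such as those above can be obtained with the COCO caption evaluation toolkit; the sketch below assumes the pycocoevalcap package (the Python port of the evaluation server cited above) and pre-tokenized, lower-cased captions quoted from the tables in this section. Since CIDEr weights n-grams by document frequencies estimated over the whole reference set, a toy two-image corpus like this will not reproduce the exact values reported here.

```python
# Illustrative CIDEr computation with pycocoevalcap (assumed to be installed).
from pycocoevalcap.cider.cider import Cider

# Pre-tokenized, lower-cased captions; keys are arbitrary image identifiers.
gts = {
    "giraffe": ["a giraffe standing outside of a building next to a tree",
                "a giraffe standing in a small piece of shade",
                "a giraffe finds some sparse shade in his habitat",
                "giraffe standing in a holding pen near a tree stump",
                "a giraffe in a zoo enclosure next to a barn"],
    "wedding": ["a bride and groom celebrate over a cake",
                "a bride and groom are celebrating with wedding cake",
                "a man and woman standing in front of a cake"],
}
res = {
    "giraffe": ["a giraffe standing in a field next to a tree"],
    "wedding": ["a bride and groom cutting their wedding cake"],
}

corpus_score, per_image = Cider().compute_score(gts, res)
print("corpus CIDEr:", corpus_score, "per-image:", per_image)
```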
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).