A Context Semantic Auxiliary Network for Image Captioning
Abstract
1. Introduction
- We propose an Attention-Aware (AA) mechanism that effectively filters out erroneous or irrelevant information.
- We build a Context Semantic Auxiliary Network (CSAN) on top of the AA mechanism; it captures context semantic information from the complete descriptive sentence and brings a clear improvement (an illustrative sketch of the AA mechanism is given after this list).
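The outline above does not reproduce the AA equations, so the following is a minimal PyTorch sketch of one plausible realization, assuming AA acts as an attention block followed by a sigmoid gate that suppresses erroneous or irrelevant attended content (in the spirit of the Attention-on-Attention design cited as [29]); the class name `AttentionAware`, the parameter `d_model`, and all tensor names are illustrative and not taken from the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionAware(nn.Module):
    """Illustrative attention-with-gating block: attend over region features,
    then gate the attended vector so irrelevant content can be suppressed."""

    def __init__(self, d_model: int):
        super().__init__()
        self.region_proj = nn.Linear(d_model, d_model)   # projects region features for scoring
        self.query_proj = nn.Linear(d_model, d_model)    # projects the query for scoring
        self.info = nn.Linear(2 * d_model, d_model)      # candidate information vector
        self.gate = nn.Linear(2 * d_model, d_model)      # filter gate over the candidate

    def forward(self, query: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # query: (B, d) decoder state; regions: (B, N, d) visual region features
        scores = torch.bmm(self.region_proj(regions),
                           self.query_proj(query).unsqueeze(2))        # (B, N, 1)
        weights = F.softmax(scores, dim=1)                             # attention over N regions
        context = (weights * regions).sum(dim=1)                       # attended context (B, d)
        fused = torch.cat([query, context], dim=-1)                    # (B, 2d)
        # sigmoid gate decides how much of the candidate information to keep
        return torch.sigmoid(self.gate(fused)) * self.info(fused)      # filtered output (B, d)
```

In the captioning setting, the decoder LSTM hidden state would serve as `query` and the Faster R-CNN region features from Section 3.1 as `regions`.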
2. Related Works
2.1. Image Captioning
2.2. Attention Mechanism
3. Methods
3.1. Visual Feature Extraction
3.2. Base Network
3.3. Context Semantic Aware Network
3.3.1. Attention-Aware Module
3.3.2. Context Semantic Aware Network
3.4. Training
4. Experiments
4.1. Dataset
4.2. Experiment Settings
4.3. Quantitative Analysis
4.4. Ablation Analysis
- CSAA and CVAA were removed from CSAN, leaving only the two-layer LSTM; this variant is referred to as the base network.
- CSAA was added to the base network (base + CSAA) to verify that learning contextual semantic information through CSAA is necessary for improving the image captioning model.
- Building on the second setting, CVAA was further added (base + CSAA + CVAA) to verify the impact of visual feature information when context semantic information is already available; the three settings are summarized in the configuration sketch after this list.
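The list above can be read as three configurations of the same model; the sketch below shows one hypothetical way to encode them as flags (the names `use_csaa` and `use_cvaa` are assumptions for illustration and do not come from the authors' code).

```python
from dataclasses import dataclass


@dataclass
class AblationConfig:
    """Toggles for the two auxiliary branches removed or restored in the ablation."""
    use_csaa: bool = False  # context semantic attention-aware branch
    use_cvaa: bool = False  # context visual attention-aware branch


# The three settings reported in the ablation table of Section 4.4 (flag names are illustrative).
ABLATION_SETTINGS = {
    "base": AblationConfig(),
    "base + CSAA": AblationConfig(use_csaa=True),
    "base + CSAA + CVAA": AblationConfig(use_csaa=True, use_cvaa=True),
}
```

Each configuration corresponds to one row of the ablation results table below.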
4.5. Qualitative Analysis
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164.
- Donahue, J.; Hendricks, L.A.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Darrell, T.; Saenko, K. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2625–2634.
- Jia, X.; Gavves, E.; Fernando, B.; Tuytelaars, T. Guiding the long-short term memory model for image caption generation. In Proceedings of the 2015 IEEE International Conference on Computer Vision, ICCV, Santiago, Chile, 7–13 December 2015; pp. 2407–2415.
- Wang, C.; Yang, H.; Bartz, C.; Meinel, C. Image Captioning with Deep Bidirectional LSTMs. In Proceedings of the 2016 ACM Conference on Multimedia Conference, MM, Amsterdam, The Netherlands, 15–19 October 2016; pp. 988–997.
- Mao, J.; Huang, J.; Toshev, A.; Camburu, O.; Yuille, A.L.; Murphy, K. Generation and Comprehension of Unambiguous Object Descriptions. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; IEEE Computer Society: Washington, DC, USA, 2016; pp. 11–20.
- Dai, B.; Ye, D.; Lin, D. Rethinking the Form of Latent States in Image Captioning. In Proceedings of the Computer Vision—ECCV, Munich, Germany, 8–14 September 2018; Lecture Notes in Computer Science; Volume 11209, pp. 294–310.
- You, Q.; Jin, H.; Wang, Z.; Fang, C.; Luo, J. Image Captioning with Semantic Attention. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Las Vegas, NV, USA, 27–30 June 2016; pp. 4651–4659.
- Wang, Y.; Lin, Z.; Shen, X.; Cohen, S.; Cottrell, G.W. Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 7378–7387.
- Yao, T.; Pan, Y.; Li, Y.; Qiu, Z.; Mei, T. Boosting image captioning with attributes. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4894–4902.
- Li, N.; Chen, Z. Image Captioning with Visual-Semantic LSTM. In Proceedings of the 2018 Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI, Stockholm, Sweden, 13–19 July 2018; pp. 793–799.
- Huang, F.; Li, Z.; Chen, S.; Zhang, C.; Ma, H. Image Captioning with Internal and External Knowledge. In Proceedings of the CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, CIKM, Virtual Event, Ireland, 19–23 October 2020; pp. 535–544.
- Yao, T.; Pan, Y.; Li, Y.; Mei, T. Exploring Visual Relationship for Image Captioning. In Proceedings of the Computer Vision—ECCV, Munich, Germany, 8–14 September 2018; Volume 11218, pp. 711–727.
- Chen, F.; Ji, R.; Sun, X.; Wu, Y.; Su, J. GroupCap: Group-Based Image Captioning With Structured Relevance and Diversity Constraints. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1345–1353.
- Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2048–2057.
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the ICLR, San Diego, CA, USA, 7–9 May 2015.
- Luo, Y.; Ji, J.; Sun, X.; Cao, L.; Wu, Y.; Huang, F.; Lin, C.; Ji, R. Dual-level Collaborative Transformer for Image Captioning. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI, Virtual Event, 2–9 February 2021; pp. 2286–2293.
- Jiang, W.; Li, X.; Hu, H.; Lu, Q.; Liu, B. Multi-Gate Attention Network for Image Captioning. IEEE Access 2021, 9, 69700–69709.
- Xian, T.; Li, Z.; Zhang, C.; Ma, H. Dual Global Enhanced Transformer for image captioning. Neural Netw. 2022, 148, 129–141.
- Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T. SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 6298–6306.
- Fu, K.; Jin, J.; Cui, R.; Sha, F.; Zhang, C. Aligning Where to See and What to Tell: Image Captioning with Region-Based Attention and Scene-Specific Contexts. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2321–2334.
- Liu, C.; Mao, J.; Sha, F.; Yuille, A.L. Attention Correctness in Neural Image Captioning. In Proceedings of the Thirty-First AAAI, San Francisco, CA, USA, 4–9 February 2017; Singh, S.P., Markovitch, S., Eds.; pp. 4176–4182.
- Pedersoli, M.; Lucas, T.; Schmid, C.; Verbeek, J. Areas of Attention for Image Captioning. In Proceedings of the IEEE International Conference on Computer Vision, ICCV, Venice, Italy, 22–29 October 2017; pp. 1251–1259.
- Guo, L.; Liu, J.; Lu, S.; Lu, H. Show, Tell, and Polish: Ruminant Decoding for Image Captioning. IEEE Trans. Multim. 2020, 22, 2149–2162.
- Song, Z.; Zhou, X.; Mao, Z.; Tan, J. Image Captioning with Context-Aware Auxiliary Guidance. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021; pp. 2584–2592.
- Liang, C.; Wang, W.; Zhou, T.; Miao, J.; Luo, Y.; Yang, Y. Local-Global Context Aware Transformer for Language-Guided Video Segmentation. arXiv 2022, arXiv:2203.09773.
- Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language Modeling with Gated Convolutional Networks. In Proceedings of the 2017 34th International Conference on Machine Learning, ICML, Sydney, Australia, 6–11 August 2017; Proceedings of Machine Learning Research; Precup, D., Teh, Y.W., Eds.; Volume 70, pp. 933–941.
- Anderson, P.; He, X.; Buehler, C.; Teney, D.; Johnson, M.; Gould, S.; Zhang, L. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6077–6086.
- Ke, L.; Pei, W.; Li, R.; Shen, X.; Tai, Y. Reflective Decoding Network for Image Captioning. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, ICCV, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8887–8896.
- Huang, L.; Wang, W.; Chen, J.; Wei, X. Attention on Attention for Image Captioning. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision, ICCV, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4633–4642.
- Farhadi, A.; Hejrati, S.M.M.; Sadeghi, M.A.; Young, P.; Rashtchian, C.; Hockenmaier, J.; Forsyth, D.A. Every Picture Tells a Story: Generating Sentences from Images. In Proceedings of the Computer Vision—ECCV; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6314, pp. 15–29.
- Kulkarni, G.; Premraj, V.; Ordonez, V.; Dhar, S.; Li, S.; Choi, Y.; Berg, A.C.; Berg, T.L. BabyTalk: Understanding and Generating Simple Image Descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 2891–2903.
- Mitchell, M.; Dodge, J.; Goyal, A.; Yamaguchi, K.; Stratos, K.; Han, X.; Mensch, A.C.; Berg, A.C.; Berg, T.L.; Daume, H., III. Midge: Generating Image Descriptions From Computer Vision Detections. In Proceedings of the EACL, Avignon, France, 23–27 April 2012; Association for Computational Linguistics: Stroudsburg, PA, USA, 2012; pp. 747–756.
- Cho, K.; van Merrienboer, B.; Gülçehre, Ç.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP, Doha, Qatar, 25–29 October 2014; pp. 1724–1734.
- Liang, C.; Wang, W.; Zhou, T.; Yang, Y. Visual Abductive Reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, New Orleans, LA, USA, 19–20 June 2022; pp. 15544–15554.
- Lu, J.; Xiong, C.; Parikh, D.; Socher, R. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 375–383.
- Yu, D.; Fu, J.; Mei, T.; Rui, Y. Multi-level Attention Networks for Visual Question Answering. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA, 21–26 July 2017; pp. 4187–4195.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, NIPS, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the CVPR, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- Lin, T.; Maire, M.; Belongie, S.J.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV, Zurich, Switzerland, 6–12 September 2014; Lecture Notes in Computer Science; Volume 8693, pp. 740–755.
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 2002 40th Annual Meeting of the Association for Computational Linguistics, ACL, Philadelphia, PA, USA, 6–12 July 2002; pp. 311–318.
- Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the 2005 ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, June 2005; pp. 65–72.
- Lin, C.Y. Rouge: A package for automatic evaluation of summaries. In Proceedings of the Text Summarization Branches Out, Barcelona, Spain, 25–26 July 2004; pp. 74–81.
- Vedantam, R.; Zitnick, C.L.; Parikh, D. CIDEr: Consensus-based image description evaluation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Boston, MA, USA, 7–12 June 2015; pp. 4566–4575.
- Anderson, P.; Fernando, B.; Johnson, M.; Gould, S. SPICE: Semantic Propositional Image Caption Evaluation. In Proceedings of the Computer Vision—ECCV, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Volume 9909, pp. 382–398.
- Li, Z.; Li, Y.; Lu, H. Improve Image Captioning by Self-attention. In Proceedings of the Neural Information Processing—26th International Conference, ICONIP, Sydney, NSW, Australia, 12–15 December 2019; Volume 1143, pp. 91–98.
- Zhong, X.; Nie, G.; Huang, W.; Liu, W.; Ma, B.; Lin, C. Attention-guided image captioning with adaptive global and local feature fusion. J. Vis. Commun. Image Represent. 2021, 78, 103138.
- Zha, Z.; Liu, D.; Zhang, H.; Zhang, Y.; Wu, F. Context-Aware Visual Policy Network for Fine-Grained Image Captioning. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 710–722.
- Fei, Z. Attention-Aligned Transformer for Image Captioning. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI, Virtual, 22 February–1 March 2022; pp. 607–615.
- Wei, H.; Li, Z.; Zhang, C.; Ma, H. The synergy of double attention: Combine sentence-level and word-level attention for image captioning. Comput. Vis. Image Underst. 2020, 201, 103068.
- Wang, J.; Wang, W.; Wang, L.; Wang, Z.; Feng, D.D.; Tan, T. Learning visual relationship and context-aware attention for image captioning. Pattern Recognit. 2020, 98, 107075.
- Yang, L.; Hu, H.; Xing, S.; Lu, X. Constrained LSTM and Residual Attention for Image Captioning. ACM Trans. Multim. Comput. Commun. Appl. 2020, 16, 75:1–75:18.
- Zhang, Y.; Shi, X.; Mi, S.; Yang, X. Image captioning with transformer and knowledge graph. Pattern Recognit. Lett. 2021, 143, 43–49.
- Wang, C.; Gu, X. Local-global visual interaction attention for image captioning. Digit. Signal Process. 2022, 130, 103707.
- Yang, Z.; Yuan, Y.; Wu, Y.; Cohen, W.W.; Salakhutdinov, R.R. Review networks for caption generation. Adv. Neural Inf. Process. Syst. 2016, 29, 2361–2369.
- Jiang, W.; Ma, L.; Jiang, Y.; Liu, W.; Zhang, T. Recurrent Fusion Network for Image Captioning. In Proceedings of the Computer Vision—ECCV, Munich, Germany, 8–14 September 2018; Volume 11206, pp. 510–526.
- Chen, F.; Xie, S.; Li, X.; Tang, J.; Pang, K.; Li, S.; Wang, T. Show, Rethink, And Tell: Image Caption Generation With Hierarchical Topic Cues. In Proceedings of the IEEE International Conference on Multimedia and Expo, ICME, Shenzhen, China, 5–9 July 2021; pp. 1–6.
- Wang, C.; Shen, Y.; Ji, L. Geometry Attention Transformer with position-aware LSTMs for image captioning. Expert Syst. Appl. 2022, 201, 117174.
Model | B-1 | B-2 | B-3 | B-4 | METEOR | ROUGE-L | CIDEr | SPICE |
---|---|---|---|---|---|---|---|---|
BUTD [27] | 79.0 | 60.4 | 47.8 | 36.3 | 27.0 | 56.4 | 113.5 | 20.3 |
AoA-Net [29] | 80.2 | - | - | 39.1 | 29.2 | 58.8 | 129.8 | 22.3 |
GCN-LSTM [12] | 77.4 | - | - | 37.1 | 28.1 | 57.2 | 117.1 | 21.4 |
RDN [28] | 77.5 | 61.8 | 47.9 | 36.8 | 27.2 | 56.8 | 115.3 | 20.5 |
SAC [46] | 77.2 | - | - | 36.8 | 28.0 | 57.1 | 116.3 | 21.2 |
Attin + RD [23] | - | - | - | 36.8 | 28.1 | 57.5 | 116.5 | 21.2 |
BUTD + CAAG [24] | - | - | - | 38.4 | 28.6 | 58.6 | 128.8 | 22.1 |
Multi-gate [17] | 78.4 | 62.8 | 48.9 | 37.5 | 28.2 | 57.8 | 117.5 | 21.6 |
ASIA [47] | 78.5 | 62.2 | 48.5 | 37.8 | 27.7 | - | 116.7 | - |
CA-VNP [48] | - | - | - | 38.6 | 28.3 | 58.5 | 125.0 | 22.1 |
AAT [49] | 78.6 | - | - | 38.2 | 29.2 | 58.3 | 126.3 | 21.6 |
BUTD + CSAN | 79.1 | 63.0 | 49.5 | 38.7 | 28.8 | 59.1 | 130.0 | 22.2 |
RDN + CSAN | 79.5 | 63.2 | 49.8 | 38.8 | 29.0 | 59.7 | 129.5 | 22.3 |
AoA-Net + CSAN | 80.3 | 64.5 | 52.5 | 39.6 | 29.4 | 60.0 | 133.5 | 22.6 |
Model | B-1 | B-2 | B-3 | B-4 | METEOR | ROUGE-L | CIDEr | SPICE |
---|---|---|---|---|---|---|---|---|
Soft-Attn [14] | 66.7 | 43.4 | 28.8 | 19.1 | 18.5 | - | - | - |
Adaptive [35] | 67.7 | 49.4 | 35.4 | 25.1 | 20.4 | - | 53.1 | - |
BUTD [27] | 76.4 | - | - | 27.3 | 21.7 | 56.6 | - | - |
DAIC [50] | 64.5 | 46.4 | 33.5 | 24.3 | 20.4 | 46.7 | 61.6 | - |
ARL [51] | 69.8 | 51.7 | 37.8 | 27.7 | 21.5 | 48.5 | 57.4 | - |
cLSTM-RA [52] | 70.5 | 52.5 | 37.6 | 27.1 | 21.9 | 49.4 | 57.7 | - |
Trans-KG [53] | 78.4 | - | - | 26.8 | 21.7 | - | 56.6 | - |
LGVIA [54] | 75.4 | 57.6 | 39.0 | 28.2 | 25.4 | 53.7 | 58.0 | - |
Adaptive + CSAN | 71.6 | 54.3 | 39.1 | 26.3 | 22.3 | 51.5 | 60.2 | 17.6 |
BUTD + CSAN | 77.3 | 59.2 | 44.3 | 29.6 | 23.0 | 59.8 | 68.7 | 20.7 |
c5
Model | B-1 | B-2 | B-3 | B-4 | METEOR | ROUGE-L | CIDEr | SPICE |
---|---|---|---|---|---|---|---|---|
Adaptive [35] | 74.8 | 58.4 | 44.4 | 33.6 | 26.4 | 55.0 | 104.2 | 19.7 |
ReviewNet [55] | 72.0 | 55.0 | 41.4 | 31.3 | 25.6 | 53.3 | 96.5 | 18.5 |
BUTD [27] | 80.2 | 64.1 | 49.1 | 36.9 | 27.6 | 57.1 | 117.9 | 21.5 |
RF-Net [56] | 80.4 | 64.9 | 50.1 | 38.0 | 28.2 | 58.2 | 122.9 | - |
RDN [28] | 80.2 | - | - | 37.3 | 28.1 | 57.4 | 121.2 | - |
AoA-Net [29] | 81.0 | 65.8 | 51.5 | 39.6 | 29.3 | 58.9 | 126.9 | 21.7 |
HTC [57] | 80.2 | 64.8 | 51.0 | 38.5 | 28.6 | 58.4 | 124.2 | - |
GAT [58] | 81.1 | 66.1 | 51.8 | 39.8 | 29.1 | 59.1 | 127.8 | - |
CA-VPN [48] | 81.6 | 64.3 | 50.8 | 37.9 | 27.4 | 57.6 | 120.9 | - |
Ours | 80.7 | 66.0 | 52.6 | 41.0 | 30.9 | 61.2 | 130.4 | 22.7 |
c40
Model | B-1 | B-2 | B-3 | B-4 | METEOR | ROUGE-L | CIDEr | SPICE |
---|---|---|---|---|---|---|---|---|
Adaptive [35] | 92.0 | 84.5 | 74.4 | 63.7 | 35.9 | 70.5 | 105.9 | 67.3 |
ReviewNet [55] | 90.0 | 81.2 | 70.5 | 59.7 | 34.7 | 68.6 | 96.9 | 64.9 |
BUTD [27] | 95.2 | 88.8 | 79.4 | 68.5 | 36.7 | 72.4 | 120.5 | 71.5 |
RF-Net [56] | 95.0 | 89.3 | 80.1 | 69.2 | 37.2 | 73.1 | 125.1 | - |
RDN [28] | 95.3 | - | - | 69.5 | 37.8 | 73.3 | 125.2 | - |
AoA-Net [29] | 95.2 | 89.6 | 81.3 | 70.9 | 38.6 | 74.9 | 129.6 | 72.6 |
HTC [57] | 95.1 | 89.0 | 81.2 | 70.4 | 38.4 | 73.6 | 128.9 | - |
GAT [58] | 95.1 | 89.7 | 81.5 | 71.4 | 38.4 | 74.7 | 130.8 | - |
CA-VPN [48] | 95.6 | 87.8 | 80.3 | 69.5 | 37.3 | 73.7 | 124.5 | - |
Ours | 95.1 | 89.3 | 82.7 | 73.0 | 40.7 | 76.5 | 134.5 | 73.4 |
Settings/Metrics | B-1 | B-4 | METEOR | ROUGE-L | CIDEr | SPICE |
---|---|---|---|---|---|---|
base | 73.2 | 35.6 | 26.0 | 52.8 | 118.8 | 20.8 |
base + CSAA | 78.4 | 38.4 | 28.4 | 58.7 | 128.4 | 22.1 |
base + CSAA + CVAA | 79.1 | 38.7 | 28.8 | 59.1 | 130.0 | 22.2 |