Open Access Article

Captioning Transformer with Stacked Attention Modules

Zhu, X. 1,2,†,‡, Li, L. 1,2,*,‡, Liu, J. 3,‡, Peng, H. 1,2,‡ and Niu, X. 1,2,‡
1 Information Security Center, State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 National Engineering Laboratory for Disaster Backup and Recovery, Beijing University of Posts and Telecommunications, Beijing 100876, China
3 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
* Author to whom correspondence should be addressed.
† Current address: School of Cyberspace Security, Beijing University of Posts and Telecommunications, P.O. Box 145, Haidian District, Beijing 100876, China.
‡ These authors contributed equally to this work.
Appl. Sci. 2018, 8(5), 739
Received: 30 March 2018 / Revised: 1 May 2018 / Accepted: 3 May 2018 / Published: 7 May 2018


Image captioning is a challenging task, and it is important for helping machines better understand the meaning of an image. In recent years, image captioning models have usually used a long short-term memory (LSTM) network as the decoder to generate the sentence, and these models show excellent performance. Although the LSTM can memorize dependencies, its structure is complicated and inherently sequential across time. Recent works have shown that the Transformer addresses these issues in machine translation. Inspired by that success, we develop a Captioning Transformer (CT) model with stacked attention modules, introducing the Transformer to the image captioning task. The CT model contains only attention modules and has no dependence across time steps. It can not only memorize dependencies within the sequence but also be trained in parallel. Moreover, we propose multi-level supervision to help the Transformer achieve better performance. Extensive experiments are carried out on the challenging MSCOCO dataset, and the proposed Captioning Transformer achieves competitive performance compared with some state-of-the-art methods.
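The abstract does not give the details of the attention modules, so the following is only a minimal sketch of generic masked scaled dot-product self-attention, the building block commonly used in Transformer decoders. It is not the authors' CT implementation. The function names (`softmax`, `causal_self_attention`) and the toy input are hypothetical and chosen for illustration. The sketch shows why such a decoder has no dependence across time steps: the output at each position is computed from the inputs alone, never from the output of a previous position, so all positions could be evaluated in parallel during training.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention with a causal mask (illustrative only).

    Each argument is a list of T vectors (lists of floats). Position t may
    attend only to positions 0..t, so no future word is visible.
    """
    d_k = len(keys[0])
    scale = 1.0 / math.sqrt(d_k)
    outputs, all_weights = [], []
    # Each iteration reads only the inputs, never an earlier iteration's
    # result, so the iterations are independent and could run in parallel.
    for t, q in enumerate(queries):
        # Scores against positions 0..t only (this is the causal mask).
        scores = [scale * sum(a * b for a, b in zip(q, k))
                  for k in keys[: t + 1]]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values[: t + 1]))
               for j in range(len(values[0]))]
        outputs.append(out)
        # Pad masked (future) positions with zero weight for readability.
        all_weights.append(weights + [0.0] * (len(keys) - t - 1))
    return outputs, all_weights


# Toy "sentence" of three 2-dimensional word vectors (hypothetical data).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
```

In this toy run the first position can see only itself, so its attention weight on itself is 1 and its output equals its own value vector. An LSTM decoder, by contrast, would need the hidden state from step t-1 before it could compute step t.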
Keywords: image caption; image understanding; deep learning; computer vision

Figure 1

This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

MDPI and ACS Style

Zhu, X.; Li, L.; Liu, J.; Peng, H.; Niu, X. Captioning Transformer with Stacked Attention Modules. Appl. Sci. 2018, 8, 739.


Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.

Appl. Sci. EISSN 2076-3417. Published by MDPI AG, Basel, Switzerland.