Separate Syntax and Semantics: Part-of-Speech-Guided Transformer for Image Captioning

Abstract: Transformer-based image captioning models have recently achieved remarkable performance by using new fully attentive paradigms. However, existing models generally follow the conventional language model of predicting the next word conditioned on the visual features and partially generated words. They treat the predictions of visual and nonvisual words equally and usually tend to produce generic captions. To address these issues, we propose a novel part-of-speech-guided transformer (PoS-Transformer) framework for image captioning. Specifically, a self-attention part-of-speech prediction network is first presented to model the part-of-speech tag sequences for the corresponding image captions. Then, different attention mechanisms are constructed for the decoder to guide the caption generation by using the part-of-speech information. Benefiting from the part-of-speech guiding mechanisms, the proposed framework not only adaptively adjusts the weights between visual features and language signals for the word prediction, but also facilitates the generation of more fine-grained and grounded captions. Finally, multitask learning is introduced to train the whole PoS-Transformer network in an end-to-end manner. Our model was trained and tested on the MSCOCO and Flickr30k datasets, achieving CIDEr scores of 1.299 and 0.612, respectively. The qualitative experimental results indicated that the captions generated by our method conformed better to grammatical rules.

With the success of deep learning, image captioning models have recently achieved great progress. A typical deep neural network for image captioning generally follows an encoder-decoder paradigm, where a deep convolutional neural network (CNN) is introduced as the encoder to learn visual representations from the input image, while a recurrent neural network (RNN) serves as the decoder to recursively predict each word. Recently, transformer-based image captioning models have shown superior performance to the conventional CNN-RNN models by using fully attentive paradigms. Despite great advances made in the model architectures, existing models still have two limitations: (i) they treat the predictions of visual and nonvisual words equally at each time step, leading to ambiguous inference; (ii) they tend to generate minimal sentences of the kind that are common in the datasets. Consequently, how to organize phrases and words to accurately express the semantics of an image remains a challenging task.
Neuroscience research on language processing has demonstrated that the brain contains partially separate systems for processing syntax and semantics [9,10], which provides us with a new perspective for overcoming the limitations of existing image captioning models. Naturally, the traditional encoder-decoder framework can be improved by imposing an analogous separation. Considering that in English the part-of-speech (PoS) tag sequences contain rich grammatical rules available to infer the corresponding words, in this paper, we intend to improve the grounding performance of image captioning by using the PoS information. (We use the Stanford constituency parser to obtain the PoS tags of captions. URL: https://www.nltk.org/book/ch05.html (accessed on 15 November 2022).) Figure 1a illustrates an example of an image caption with its corresponding PoS tags. From Figure 1a, we can observe that the different parts of speech of the words play specific grammatical roles in the caption. For example, the determiners (DET) and adjectives (ADJ) are generally used to modify the nouns (NOUN). The adpositions (ADP), such as in and on, play the role of connecting two noun phrases so as to establish their semantic relationship. All the PoS tags play an important role in generating the caption since they correspond to words one by one. Consequently, it is essential to master the PoS of each word for generating grammatically correct sentences. Besides, some PoS tags, such as ADJ and NOUN, are closely related to the visual features of the image, while some PoS tags, such as the second ADP (corresponding to the word on) in the PoS tag sequence, are irrelevant to any visual features. As a result, there is a need to find more ways to highlight the PoS information contained in sentences so that it can provide additional guidance for a captioner to distinguish between visual and nonvisual words.
Figure 1 (caption fragment). Then, the obtained PoS information is used to guide the visual and linguistic attention for the word prediction. "<BOS>" and "<EOS>" denote the beginning and end of all the sentences, respectively. "<EOP>" is short for "<End of PoS>", which is the end of all the PoS sequences.
Aiming to obtain the syntactic information contained in the sequence of PoS tags, we first introduce a PoS predictor to predict the PoS tag of the next word, which can be integrated with the image captioning model seamlessly. As shown in Figure 1b, the PoS tag of the next word is predicted based on the previous words while the PoS information provided by the PoS predictor is utilized to guide the generation of the next word. For instance, after the words a and red as well as their PoS tags are generated, the PoS predictor uses the word embeddings of a and red as inputs to predict the PoS tag NOUN. Meanwhile, the PoS information of DET, ADJ, and NOUN are utilized by the image caption model to predict the next word firetruck. Unlike the existing transformer-based captioners that treat all word predictions equally, the sequence of partially generated tags can help evaluate the effect of visual features and language signals on the word prediction. As illustrated in Figure 1a, when the word on is to be generated, the visual features are actually not very helpful at the current time step. However, the conventional transformer-based image captioners take no effective measures but simply concatenate attended visual features and language signals in each decoder layer, i.e., the irrelevant visual features are also used to predict the next word. As a result, the captioners are easily distracted by irrelevant visual concepts, leading to the generation of incorrect words. In contrast, after the partial PoS tag ADP of the next word on is available, the captioners can exploit the information of partially generated PoS tags to balance the effect of visual features and language context, e.g., the language cues would be paid more attention to at the current time step, which facilitates the generation of the correct word on.
In order to make a transformer-based image captioning model effectively align the generated words with the visual or nonvisual features of an image and further generate the grammatically correct captions with the help of the PoS information, we propose a PoS-Transformer framework based on a new learning paradigm. Specifically, the process of generating captions is divided into two stages: PoS prediction and caption generation. The PoS tag of the next word is predicted in the first stage, which is much easier than predicting the next word directly, since the number of PoS tags is far less than that of words. In the second stage, two different PoS-guided attention modules are proposed on top of the PoS guiding information, visual features, and linguistic context, which enables the decoder to adaptively attend to visual features and language signals. As a result, the PoS predictor, the PoS-guided attention modules, and the encoder-decoder captioning network closely collaborate to enhance the performance of image captioning. The main contributions of our work can be summarized as follows:
• We propose two kinds of PoS-guided attention mechanisms based on the PoS information, adaptively adjusting the effect of visual features and language signals on the word prediction, to encourage the generation of more grounded captions.
• We incorporate the PoS prediction model and the PoS-guided attention modules into the transformer-based captioning architecture to build a unified end-to-end image captioning framework, boosting the performance of image captioning by separating syntax and semantics for the prediction of each word.
• We optimize the proposed PoS-Transformer network by a multitask learning method on the Flickr30k and MSCOCO benchmark datasets, respectively. Extensive experiments demonstrate the effectiveness of our method.
The remainder of this paper is organized as follows. Section 2 introduces the related work, especially the prevailing deep-learning-based methods. Our proposed framework and its multitask learning for image captioning are detailed in Section 3. The experimental results are reported in Section 4. Finally, Section 5 concludes the paper.

Related Work
Image captioning. The mainstream image captioning methods generally follow the encoder-decoder paradigm, where image features extracted by a CNN are fed into an RNN to generate the corresponding sentence. For example, Xu et al. [11] first utilized soft and hard attention mechanisms to attend to the different CNN grid features of an image when generating each word. Lu et al. [12] presented an adaptive attention mechanism to determine where to attend to visual features for the word prediction. After that, Anderson et al. [13] further introduced an attention mechanism over the region-based features extracted by an object detector. Despite progress made on the basis of visual attention mechanisms over object features, these approaches suffer from catastrophic forgetting in long-term memory, leading to limited performance improvement. To overcome the limitations of RNN-based image captioning models, plenty of transformer-based models [14–22], following fully attentive paradigms, have recently been presented and have improved the performance remarkably. For example, Herdade et al. [15] developed an object relation transformer (ORT) captioning model, which explicitly incorporated spatial relationships between region features through geometric attention. Li et al. [23] introduced entangled attention into a transformer-based sequence modeling framework that performs attention over visual features and semantic attributes simultaneously. Recently, a large number of methods have been explored to improve image understanding with the help of a scene graph, as it contains rich semantic information. For example, Yang et al. [24] proposed a method that first used the sentence's scene graph to learn a dictionary, and then incorporated it with the image's scene graph for the description generation. Yao et al. [25] presented a model that integrated both the semantic and spatial object relationships as image representation. Since the scene graph constructed a series of semantic relationship information, the model achieved comparable results. Zhao et al. [26] proposed a multilevel cross-modal alignment (MCA) module to align the image scene graph with the sentence's scene graph at different levels. Although the existing captioning approaches have achieved impressive results, they still follow the conventional way of modeling language and suffer from the limitations mentioned above.
PoS-based image captioning. Recently, some works have also introduced the PoS information into image captioning models [27][28][29]. However, these methods are all based on long short-term memory (LSTM) networks, while our model exploits the transformer-based captioning architecture and fully attentive paradigm, which is essentially different from them. The model proposed by Zhang et al. [27] is the most related to ours; they integrated the PoS information with two popular image captioning models. However, their models suffered from dependencies between distant positions since the hidden states of LSTM were used to predict the PoS sequences. He et al. [28] utilized PoS tags as switches to guide the generation of the visual words. However, they required an external PoS tagger in both the training and test stages, which was limited in practice. In our PoS-Transformer, a PoS prediction network, as a part of the framework, is seamlessly integrated with other parts of PoS-Transformer. Consequently, the captions can be generated word by word at the inference time without any extra PoS taggers. Deshpande et al. [29] used the part-of-speech information to generate diverse captions. They first predicted a PoS sequence for an image and then employed the PoS sequence as the guiding information to generate image captions. However, they quantized the space of PoS tag sequences by using a classification model, which harmed the generation of fine-grained captions. Unlike existing PoS-based image captioning models, our proposed PoS-Transformer framework is able to process both word sequences and PoS sequences in parallel during training. On one hand, by means of cross-attention, PoS-Transformer establishes the relationship between the visual features and PoS information as well as the relationship between the partially generated words and PoS information. On the other hand, PoS-Transformer also captures the self-attention within the PoS information, which is helpful to adaptively adjust the weights between visual features and language signals for the word prediction.

Approach
The proposed PoS-Transformer model aims to guide the process of caption generation with the part-of-speech information on top of the Transformer architecture. Notably, our method follows a novel learning paradigm, which maintains the PoS and word information in separate streams for image captioning. Specifically, PoS-Transformer is composed of four parts: (1) a visual subencoder that exploits the deep visual representation on the basis of a self-attention mechanism; (2) a language subencoder that represents language signals; (3) a self-attention PoS predictor (SAPP) which is used to predict the category of PoS and obtain the PoS information for generating the next word in the captioning process; (4) a PoS-guided multimodal decoder which provides two alternative attention mechanisms, i.e., single attention (SAT) and dual attention (DAT), to integrate and decode visual features, language signals, and PoS information. Figure 2 illustrates the overall architecture of the proposed PoS-Transformer model.

Dual-Way Encoder
Different from the local operator essence of convolution [3,30], the full transformer captioning networks, effectively accessing information globally via self-attention mechanism, have recently been proposed and achieved promising performance. However, the existing transformer-based captioning architectures are still based on the conventional language model, which generates the captions word by word regardless of the grammatical structures, leading to the limitations mentioned above. Consequently, it is essential to construct a novel image captioning architecture, which not only separates syntactic structure and word semantics, but has the ability to guide the usage of visual and language information. To reach this goal, inspired by the ETA model [23], we first propose a dual-way encoder that contains a visual subencoder and a language subencoder to obtain the visual features and language signals attended to, respectively.
(1) Visual subencoder: In Figure 3, the region-based visual features of an image extracted by a pretrained Faster-RCNN model are utilized as the input of the visual subencoder. Given a set of region-based visual features $V = \{v_1, v_2, \ldots, v_N\}$ extracted from an input image, where $N$ is the number of visual regions in the image, the visual features $V$ are first projected to a $d$-dimensional space via a fully connected layer to match the visual subencoder's dimensionality. Then, the projected features $V^0 = \{v^0_1, v^0_2, \ldots, v^0_N\} \in \mathbb{R}^{N \times d}$ are input into the visual subencoder with $L$ attention blocks. To be specific, the output of the $l$th ($0 \le l < L$) layer is input into a multihead module (MH) [31] in the $(l+1)$th layer, which is then followed by an AddNorm operation:

$$\tilde{V}^{l+1} = \mathrm{AddNorm}(\mathrm{MH}(V^l, V^l, V^l)),$$

and a positionwise feed-forward network (FFN) [31] is adopted to further transform the outputs, which is also encapsulated within the AddNorm operation:

$$V^{l+1} = \mathrm{AddNorm}(\mathrm{FFN}(\tilde{V}^{l+1})).$$

Eventually, we can obtain $V^L$, i.e., the output of our visual subencoder, which represents the considered visual features on the basis of the self-attention mechanism.
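The block structure described for the visual subencoder (self-attention, then a positionwise feed-forward network, each wrapped in AddNorm) can be sketched in plain Python. This is a minimal illustration under simplifying assumptions that are not taken from the paper: a single attention head with identity query/key/value projections, a layer normalization without learned gain or bias, and a ReLU feed-forward network. All function names are hypothetical.

```python
import math

def matmul(a, b):
    # Plain matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(row, eps=1e-6):
    # Assumption: no learned gain or bias.
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]

def add_norm(x, sublayer_out):
    # Residual connection followed by layer normalization.
    return [layer_norm([a + b for a, b in zip(xr, sr)])
            for xr, sr in zip(x, sublayer_out)]

def self_attention(x):
    # Assumption: one head, identity projections for queries, keys, values.
    d = len(x[0])
    x_t = [list(col) for col in zip(*x)]
    scores = matmul(x, x_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, x)

def ffn(x, w1, w2):
    # Positionwise feed-forward network with a ReLU in between (biases omitted).
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def encoder_block(x, w1, w2):
    x_tilde = add_norm(x, self_attention(x))
    return add_norm(x_tilde, ffn(x_tilde, w1, w2))
```

Stacking `encoder_block` L times on the projected region features would correspond to the L attention blocks of the subencoder; a practical implementation would use several heads and learned projections.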
(2) Language subencoder: Consider a caption $Y = \{Y_1, Y_2, \ldots, Y_M\}$, where $Y_i$ denotes the $i$th word in the sentence and $M$ is the number of words. To match the language subencoder's dimensionality, all tokens are first embedded to $d$-dimensional vectors through an embedding matrix and then fed into the positional encoding module for the relative and absolute position information. Finally, we obtain the initial input features $W^0$, which are input to the language subencoder with $L$ attention blocks. Different from the visual subencoder, the output of the $l$th ($0 \le l < L$) layer is passed into the masked multihead (MMH) module [31] to ensure that the prediction for the $t$th word $w_t$ depends only on the previous words $w_{1:t-1}$, and the output of the $(l+1)$th layer is denoted as follows: Recursively, the output of the $L$th layer, denoted as $W^L$, can be obtained and used as the language signals to be fed into the following decoder.
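The masking used by the MMH module can be illustrated with a small sketch. It assumes a square matrix of raw attention scores for one head and shows only how a causal mask restricts each position to itself and earlier positions; the function name is hypothetical, and this is not the paper's implementation.

```python
import math

def causal_attention_weights(scores):
    """Turn raw scores into attention weights where row t sees only columns 0..t."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]                      # positions up to and including t
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]   # numerically stable softmax
        z = sum(exps)
        masked = [0.0] * (len(row) - t - 1)         # future positions get zero weight
        weights.append([e / z for e in exps] + masked)
    return weights
```

With all scores equal, the first position attends only to itself, the second splits its weight evenly over two positions, and so on, which is the property that keeps the prediction at a given step from seeing later words.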

Self-Attention PoS Predictor
In the self-attention PoS predictor, we also use a randomly initialized word-embedding matrix and positional encoding to project the input tokens. The PoS prediction model takes the projected features $P^0$ as the initial input to the first of its $N$ attention blocks. Similar to the language subencoder, the output of the $(n+1)$th layer can be represented as: Finally, the output of the $N$th decoder stack is used as the PoS information to predict the probability distribution of the next word's PoS as follows:

$$p(s_t \mid Y_{t-1}) = \mathrm{softmax}(P^N_{t-1} W_{PoS} + b_{PoS}),$$

where $P^N_{t-1}$ denotes the hidden state corresponding to the $(t-1)$th PoS, $W_{PoS} \in \mathbb{R}^{d \times C}$ is the embedding matrix, $b_{PoS} \in \mathbb{R}^{C}$ is the bias vector, $Y_{t-1}$ denotes the previously generated words, and $C$ is the number of PoS classes. Meanwhile, as shown in Figure 3, the PoS information $P^N_{t-1}$ is then passed to the PoS-guided multimodal decoder to guide the caption generation.
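The final classification step of the PoS predictor, a linear map from a d-dimensional hidden state to C class scores followed by a softmax, can be sketched as below. The function name and the toy numbers are illustrative assumptions only.

```python
import math

def pos_distribution(hidden, w_pos, b_pos):
    """Softmax over C PoS classes from one hidden state.

    hidden: list of d floats; w_pos: d x C list of rows; b_pos: list of C floats.
    """
    num_classes = len(b_pos)
    logits = [
        sum(h * w_pos[i][c] for i, h in enumerate(hidden)) + b_pos[c]
        for c in range(num_classes)
    ]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

For instance, with d = 2 and C = 3, a hidden state that aligns with the first column of the weight matrix puts most of the probability mass on the first tag.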

PoS-Guided Multimodal Decoder
(1) PoS-guided single attention: Different from the traditional transformer decoder, we introduce a single cross-attention over the fused features of visual features and language signals by virtue of the PoS information.
As shown in Figure 3, for the $(l+1)$th layer, the input $F^l$ is fed into an MMH module, followed by the AddNorm operation:

$$\tilde{F}^{l+1} = \mathrm{AddNorm}(\mathrm{MMH}(F^l, F^l, F^l)).$$

Note that $F^0 = W^L$. Subsequently, the output $\tilde{F}^{l+1}$ is fed into one multihead cross-attention module to perform the attention task over visual features $V^L$ as follows: Since the PoS information is beneficial for both visual words and nonvisual words, it is used to attend to the fused features of visual features and language signals during training. Meanwhile, it is also added to the considered fused features, to provide the decoder with the PoS information. To be specific, we utilize the PoS information $P^N$ as the query vectors to perform the cross-attentions over $\tilde{F}^{l+1}$ as follows: Finally, the output of the multimodal decoder can be obtained as follows:
(2) PoS-guided dual attention: Although the single attention mechanism utilizes the PoS information to facilitate the generation of grounded captions, it cannot adaptively adjust the weights between visual features and language signals at each decoding time step. Inspired by the ETA model [23], we first introduce the dual attention mechanism into the multimodal decoder, which employs the PoS information to attend to the visual features and language signals, respectively. In addition, a gated controller module is inserted after the dual attention module, which enables the decoder to dynamically adjust the weights between the visual features and language signals.
As depicted in Figure 4, the dual attention module is inserted between the MMH and FFN modules, which allows the decoder block to apply attention over the output visual features $V^L$ and language signals $W^L$ of the dual-way encoder simultaneously. Similar to the single attention, we have:

$$\tilde{F}^{l+1} = \mathrm{AddNorm}(\mathrm{MMH}(F^l, F^l, F^l)),$$

where $F^0 = P^N$. Then, the output $\tilde{F}^{l+1}$ is passed into two multihead cross-attention modules to perform the attention task over language signals $W^L$ and visual features $V^L$: Next, as shown in Figure 4, the gated controller module is introduced into the decoder to dynamically specify the weights of $S^{l+1}$ and $V^{l+1}$ on the word prediction. Concretely, the context gate $C^{l+1}$ of the gated controller is determined by the visual features $V^{l+1}$, the language signals $S^{l+1}$, and the current self-attention output $\tilde{F}^{l+1}$:

$$C^{l+1} = \sigma\left([V^{l+1}, S^{l+1}, \tilde{F}^{l+1}]\, W_C\right),$$

where $C^{l+1} \in \mathbb{R}^{M \times 1}$, $W_C \in \mathbb{R}^{3d \times 1}$, and $[\cdot]$ and $\sigma(\cdot)$ denote the vector concatenation and the sigmoid function, respectively. The gate value $C^{l+1}$ and its complement $(1 - C^{l+1})$ control the flow of visual features $V^{l+1}$ and language signals $S^{l+1}$, respectively:

$$E^{l+1} = C^{l+1} \odot V^{l+1} + (1 - C^{l+1}) \odot S^{l+1},$$

where $\odot$ represents the Hadamard product and $E^{l+1} \in \mathbb{R}^{M \times d}$ denotes the output of the gated controller module. Finally, the output $F^L$ of the PoS-guided SAT or DAT module is input into the word classifier to predict the next possible word as follows:

$$p(y_t \mid Y_{t-1}, V) = \mathrm{softmax}(F^L_{t-1} W_{word} + b_{word}),$$

where $F^L_{t-1}$ is the hidden state corresponding to the $(t-1)$th word, $W_{word} \in \mathbb{R}^{d \times D}$ is the embedding matrix, $b_{word} \in \mathbb{R}^{D}$ is the bias vector, and $D$ is the size of the vocabulary.
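The gated controller can be sketched from the description above: a sigmoid gate computed from the concatenation of the attended visual features, the attended language signals, and the self-attention output, followed by a convex combination of the two attended streams. The sketch below assumes per-position lists of length d, a weight vector of length 3d, and no bias term (none is mentioned in the text); the function name is hypothetical.

```python
import math

def gated_fusion(v, s, f, w_c):
    """Blend visual features v and language signals s with a learned gate.

    v, s, f: M x d lists of rows; w_c: list of 3*d gate weights.
    Returns (fused M x d output, list of M gate values).
    """
    fused, gates = [], []
    for v_t, s_t, f_t in zip(v, s, f):
        concat = v_t + s_t + f_t                       # [V, S, F] concatenation
        logit = sum(x * w for x, w in zip(concat, w_c))
        c = 1.0 / (1.0 + math.exp(-logit))             # sigmoid gate in (0, 1)
        gates.append(c)
        # The gate scales the visual stream, its complement the language stream.
        fused.append([c * a + (1.0 - c) * b for a, b in zip(v_t, s_t)])
    return fused, gates
```

A gate close to 1 lets the visual stream dominate (plausible for a visual word such as a noun), while a gate close to 0 favors the language stream (plausible for a word such as on).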

Training Details
As shown in Figures 3 and 4, the SAT-based and DAT-based multimodal decoder have the same input visual features, language signals, and PoS information as well as the same output vectors. The two outputs of our models are utilized to predict the next word and its PoS tag, which, respectively, correspond to two different objective functions. Thus, in practice, the network weights of these two models can be trained concurrently by a supervised multitask learning.
For an input image, denote its region-based visual feature vector as $V$, the corresponding ground-truth caption as $Y^* = \{y^*_0, \cdots, y^*_T\}$, and the ground-truth PoS tags as $S^* = \{s^*_0, \cdots, s^*_T\}$. For the self-attention PoS predictor, the cross-entropy (XE) loss for the PoS prediction is:

$$L_{PoS}(\phi) = -\sum_{t=1}^{T} \log p_\phi\left(s^*_t \mid y^*_{0:t-1}\right),$$

where $\phi$ represents the parameters of the SAPP network. The parameters $\theta$ of our image captioning model (including the dual-way encoder and the PoS-guided multimodal decoder) are optimized by minimizing the following cross-entropy loss $L_{word}$ between the generated captions and the ground truths:

$$L_{word}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y^*_t \mid y^*_{0:t-1}, V\right).$$

Combining the word prediction loss $L_{word}$ with the PoS prediction loss $L_{PoS}$, the total loss function for our proposed PoS-Transformer framework can be defined as:

$$L = L_{word} + \lambda L_{PoS},$$

where $\lambda$ is a trade-off factor between the PoS loss and the word loss. Thus, all the parameters of the PoS-Transformer network can be optimized by minimizing the total loss function. As can be seen from Figure 2, when minimizing the XE loss $L_{word}$, the parameters $\phi$ of the SAPP network will also be optimized, which indicates that the word prediction can be considered as the leading task of the whole model. At the same time, when the XE loss $L_{PoS}$ is minimized, only the PoS predictor in the whole framework will be updated. Thus, the training of SAPP plays the role of an auxiliary task for the main task. By means of the ground-truth PoS tags, the PoS prediction model can be well optimized, which provides the main task with the auxiliary optimization direction of the parameters $\phi$. Consequently, with the guidance of SAPP, the image captioning part of our whole framework can be encouraged to generate more grounded and fine-grained captions.
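The multitask objective, a word-level cross-entropy plus a λ-weighted PoS cross-entropy, can be sketched as follows. The sketch assumes that the model's per-step probability distributions are already available as lists and that the losses are summed over time steps (whether the paper sums or averages is not stated here); the function names are hypothetical.

```python
import math

def cross_entropy(prob_rows, targets):
    """Negative log-likelihood of the target index at each time step, summed."""
    return -sum(math.log(probs[t]) for probs, t in zip(prob_rows, targets))

def total_loss(word_probs, word_targets, pos_probs, pos_targets, lam):
    """Word loss plus lam times the PoS loss (lam is the trade-off factor)."""
    l_word = cross_entropy(word_probs, word_targets)
    l_pos = cross_entropy(pos_probs, pos_targets)
    return l_word + lam * l_pos
```

Setting `lam` to 0 recovers plain caption training, and larger values put more weight on predicting the PoS tags correctly.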
At inference time, PoS-Transformer need not employ any PoS tagger to tag each word in the generated sentences, since it utilizes the current hidden state of the SAPP network as the PoS information to guide the caption generation.
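The inference procedure can be pictured as a loop in which the PoS predictor and the decoder are called alternately on the words generated so far. The sketch below is a hypothetical greedy variant written only to show the data flow: `sapp_step` and `decoder_step` are placeholder callables standing in for the trained networks, and greedy selection replaces the beam search used in the experiments.

```python
def greedy_caption(sapp_step, decoder_step, bos="<BOS>", eos="<EOS>", max_len=20):
    """Generate a caption word by word without any external PoS tagger.

    sapp_step(words) -> (pos_hidden, pos_tag): PoS information for the next word,
        computed from the previously generated words only.
    decoder_step(words, pos_hidden) -> next word, guided by the PoS hidden state.
    Both callables are hypothetical stand-ins for trained models.
    """
    words = [bos]
    pos_tags = []
    while len(words) <= max_len:
        pos_hidden, pos_tag = sapp_step(words)
        next_word = decoder_step(words, pos_hidden)
        if next_word == eos:
            break
        words.append(next_word)
        pos_tags.append(pos_tag)
    return words[1:], pos_tags
```

The point of the sketch is that the only PoS signal entering `decoder_step` is the predictor's own hidden state, so no tagger is run on the partial caption.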

Datasets
MSCOCO [32]: This popular benchmark dataset contains 123k images, each of which is equipped with five manually annotated sentences. We adopted the offline Karpathy splits [33], which assign 113k images for training, 5k images for validation, and 5k images for testing. Following the same settings as in prior studies, we converted all sentences to lowercase, deleted the punctuation characters, tokenized each caption, and constructed a vocabulary of 9487 words by selecting the words that appeared more than five times.
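The preprocessing steps listed above (lowercasing, punctuation removal, tokenization, and frequency filtering) can be sketched with the Python standard library. Whitespace tokenization and the function names are assumptions made for illustration; the actual tokenizer used for the experiments is not specified here.

```python
import string
from collections import Counter

def tokenize(caption):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    cleaned = caption.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def build_vocab(captions, min_count=5):
    """Keep the words that appear more than min_count times across all captions."""
    counts = Counter(token for caption in captions for token in tokenize(caption))
    return sorted(word for word, n in counts.items() if n > min_count)
```

Words below the threshold would typically be mapped to a special unknown token, although that detail is likewise an assumption and not stated in the text.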

Evaluation Metrics
To evaluate the performance of different captioning methods, we used the full set of the standard evaluation metrics, including BLEU [36], METEOR [37], ROUGE-L [38], CIDEr [39], and SPICE [40]. All these metrics were calculated directly by using the MSCOCO caption evaluation tool (https://github.com/tylin/coco-caption (accessed on 15 November 2022)). BLEU is an n-gram precision-based metric, METEOR performs unigram matching, and SPICE computes an F1-score over caption scene-graph tuples, i.e., the balance between the precision and the recall. Notably, CIDEr is specially designed to evaluate image captioning models. It measures the similarity between the captions to be evaluated and the reference captions by using the TF-IDF weights of each n-gram. The number of times an n-gram $w_k$ occurs in a reference sentence $s_{ij}$ is denoted by $h_k(s_{ij})$, or $h_k(c_i)$ for the candidate sentence $c_i$. The TF-IDF weighting $g_k(s_{ij})$ for each n-gram $w_k$ can be formulated as:

$$g_k(s_{ij}) = \frac{h_k(s_{ij})}{\sum_{w_l \in \Omega} h_l(s_{ij})} \log\left(\frac{|I|}{\sum_{I_p \in I} \min\left(1, \sum_q h_k(s_{pq})\right)}\right),$$

where $\Omega$ is the vocabulary of all n-grams and $I$ is the set of all images in the dataset. The CIDEr score for n-grams of length $n$ is computed by using the average cosine similarity between the candidate sentence and the reference sentences, which accounts for both precision and recall:

$$\mathrm{CIDEr}_n(c_i, S_i) = \frac{1}{m} \sum_{j} \frac{g^n(c_i) \cdot g^n(s_{ij})}{\|g^n(c_i)\| \, \|g^n(s_{ij})\|}, \qquad \mathrm{CIDEr}(c_i, S_i) = \sum_{n=1}^{N} w_n \, \mathrm{CIDEr}_n(c_i, S_i),$$

where $S_i = \{s_{i1}, \ldots, s_{im}\}$ is the set of $m$ reference sentences for image $i$ and $g^n(\cdot)$ is the vector formed by the weights $g_k(\cdot)$ of all n-grams of length $n$. Empirically, the uniform weights $w_n = 1/N$ work the best and $N = 4$. The higher the CIDEr score, the better the quality of the generated caption.
(2) Implementation details: For the self-attention PoS predictor and language subencoder, we utilized randomly initialized word embeddings $W^0$, whose dimensionality was equal to $d$, and then summed the input vectors and their sinusoidal positional encodings [8]. For the visual subencoder, we used the pretrained Up-Down model [13] to extract the 2048-dimensional bottom-up features of the detected objects and linearly projected them to the 512-dimensional input visual vectors. Following the same settings as in [31], the latent dimensionality in each head was set to $d_h = d/h = 64$, where the latent dimensionality $d$ was 512. The number of attention blocks $L$ in the visual subencoder, language subencoder, and PoS-guided multimodal decoder ranged in {1, 2, 4, 6} and that of the PoS prediction model $N$ was set to 3. During the training stage, we used the Adam optimizer [41] with 20,000 warm-up steps and a batch size of 10. Our models were first trained for 30 epochs with the cross-entropy loss and then further optimized with the CIDEr reward [42] for an additional 30 epochs with a fixed learning rate of $5 \times 10^{-6}$. In the inference stage, the beam search strategy was adopted [8] with a beam size of three.
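The CIDEr computation described in the Evaluation Metrics section (TF-IDF weighting of n-grams followed by an average cosine similarity over the references, averaged uniformly over n = 1 to 4) can be sketched on a toy corpus as follows. This is a plain illustration of those formulas and not the evaluation tool itself, whose implementation includes further refinements that are omitted here. Two choices are assumptions of this sketch: the document frequency of an n-gram unseen in any reference is clamped to 1 to avoid division by zero, and all function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def doc_freq(refs_per_image, n):
    """Number of images whose references contain each n-gram at least once."""
    df = Counter()
    for refs in refs_per_image:
        seen = set()
        for ref in refs:
            seen.update(ngrams(ref, n))
        df.update(seen)
    return df

def tfidf(tokens, n, df, num_images):
    counts = ngrams(tokens, n)
    total = sum(counts.values())
    # Assumption: clamp the document frequency to 1 for unseen n-grams.
    return {g: (c / total) * math.log(num_images / max(1, df[g]))
            for g, c in counts.items()}

def cosine(a, b):
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cider(candidate, refs, refs_per_image, max_n=4):
    """Uniformly weighted average of CIDEr_n for n = 1..max_n."""
    num_images = len(refs_per_image)
    score = 0.0
    for n in range(1, max_n + 1):
        df = doc_freq(refs_per_image, n)
        cand_vec = tfidf(candidate, n, df, num_images)
        sims = [cosine(cand_vec, tfidf(ref, n, df, num_images)) for ref in refs]
        score += (1.0 / max_n) * sum(sims) / len(sims)
    return score
```

Here `refs_per_image` is a list with one entry per image, each entry being the list of tokenized reference captions for that image, and `refs` is the entry belonging to the image being scored.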

Ablation Studies
To validate the impacts of different modules and settings in our models on the captioning performance, we conducted extensive ablations including different numbers of encoding and decoding layers L, different values of the hyperparameter λ, and different PoS-guided attention mechanisms.
(1) Effect of encoding and decoding layers: To investigate the impact of the number of encoding and decoding layers, we applied the single-attention-based PoS-Transformer (SAT-PoS-Transformer) model with different numbers of stacked blocks L ∈ {1, 2, 4, 6} on Flickr30k, as well as the dual-attention-based PoS-Transformer (DAT-PoS-Transformer) model on MSCOCO and Flickr30k, respectively. For simplicity, the numbers of stacked blocks in the encoder and decoder were set to the same value. Table 1 shows the performance of SAT-PoS-Transformer and DAT-PoS-Transformer with different L's on Flickr30k. We can observe that these two models achieved the best performance when using four encoding and four decoding layers. This was due to the fact that deeper layers enabled the encoder of the captioner to represent more complicated relationships between objects and the decoder to provide more discriminative latent vectors for the prediction of words. However, if the number of layers becomes large, the risk of overfitting also increases. Table 2 reports the performance of DAT-PoS-Transformer with different L's on the MSCOCO dataset. Similarly, we can see that the generated image captions by our proposed models reached the highest scores on all metrics when L = 4. Thus, all subsequent experiments used four layers.
(2) Effect of the hyperparameter λ: To analyze the impact of the hyperparameter λ on the captioning performance, we applied our PoS-Transformer models with different values of λ on MSCOCO and Flickr30k, respectively. The experimental results of SAT-PoS-Transformer and DAT-PoS-Transformer on Flickr30k are shown in Table 3. It can be seen that DAT-PoS-Transformer with λ = 0.75 had the highest scores on most metrics, with BLEU-4 and ROUGE-L scores only slightly lower than the highest values of 0.287 and 0.492, respectively. SAT-PoS-Transformer reached its best overall performance when λ = 1.00. As can be seen from Table 4, when the coefficient λ of the PoS loss function increased to 0.50, DAT-PoS-Transformer obtained the highest scores in terms of all metrics on MSCOCO.
(3) Effect of single attention and dual attention: As shown in Tables 1 and 3, the image captions generated by SAT-PoS-Transformer with L = 4 and λ = 1.00 reached the highest scores on most metrics. It can be also observed from Table 3 that DAT-PoS-Transformer significantly outperformed SAT-PoS-Transformer in terms of all metrics. Based on the dual attention mechanism, the best CIDEr score increased from 0.601 to 0.612 on the Flickr30k dataset, which validated the superiority of dual attention over single attention.

Quantitative Analysis
According to the ablation studies, we compared our best DAT-PoS-Transformer model with the competitive methods on Flickr30k and MSCOCO datasets.
(1) Results on the MSCOCO Karpathy test splits: In Table 5, we compared DAT-PoS-Transformer with LSTM [43], SCST [42], ADP-ATT [12], LSTM-A [44], Up-Down [13], RFNet [45], GCN-LSTM [25], SGAE [24], AVSG [26], and ORT [15] on the offline COCO Karpathy test split. In addition, we also compared DAT-PoS-Transformer with part-of-speech-based image captioning methods such as PoS-Guiding [28], Inject+PoS [27], PoS-SCAN [46], and CNM [47]. LSTM introduced a deep model with two attention mechanisms to distill information in images down to the most salient objects. LSTM-A improved LSTM by emphasizing semantic attributes at the decoding stage. ADP-ATT introduced a visual sentinel and sentinel gate to adaptively determine whether to attend to the visual regions for the word prediction. Up-Down and RFNet improved the attention mechanism by having it learn to identify selective spatial regions, which further boosted the performance of the caption generation. ORT developed an object relation transformer captioning model which explicitly incorporated spatial relationships between region features through geometric attention. GCN-LSTM, SGAE, and AVSG used a scene graph which contained rich semantic information to improve the image understanding. As can be seen from Table 5, compared with the existing PoS-based methods, our method had better performance on most metrics when optimized with the self-critical loss [42]. Remarkably, the CIDEr score and BLEU-4 score of our model could reach 129.9% and 39.3%, which were 2% and 4% better than the best comparison model CNM [47], respectively. In addition, unlike [28], which exploited PoS tags as switches to decide whether or not to utilize visual features at each time step, our method did not need any PoS tagger in the test stage.
Compared to [27], which also introduced a PoS prediction model to image captioning, our PoS-Transformer model not only overcame the limitation of dependencies between distant positions in language modeling, but also incorporated the novel PoS-guided attention module to more flexibly adapt to the variation of PoS for each word. Furthermore, compared with the strong baseline (Transformer), which followed the traditional language model, the proposed PoS-Transformer model achieved better performance on all metrics, which demonstrated the effectiveness of our model with the PoS guidance and dual attention mechanism.
(2) Results on the Flickr30k dataset: We also compared DAT-PoS-Transformer to other methods trained with the cross-entropy loss on the Flickr30k dataset. As can be seen in Table 6, our method surpassed all other approaches in terms of BLEU-1∼BLEU-4 and CIDEr, although its METEOR and ROUGE-L scores were worse than those of Inject+PoS [27]. Remarkably, our model improved on the CIDEr score of the Inject+PoS model by 0.143 points (from 0.469 to 0.612). Thus, our method achieved better performance in comparison with the existing PoS-based models. Notably, our model had superior performance over the strong baseline (the original Transformer model) on all metrics, which further validated that it was effective at generating captions with PoS guidance.

Intuitively, the descriptions generated by PoS-Transformer were more precise and distinguishable compared to those of the Transformer baseline. The reason was that, by introducing the PoS information guidance, our model was encouraged to align the visual words with the grounding visual features, while the generated captions conformed to the grammatical rules better. More specifically, our model could generate more fine-grained and grounded captions than the original Transformer model. Taking the fifth image as an example, the Transformer baseline only generated the simple sentence "a baseball player holding a bat". Instead, our model generated the caption "a baseball game in progress with the batter up at the plate", which was more fine-grained and had the same semantic meaning as the ground truth. In addition, for the last image, our model generated the feasible sentence "a large bird with a long beak walking on a beach", while the Transformer baseline inferred the simple but wrong sentence "a bird that flying in the air". Notably, the PoS tags generated by our model included two more ADJ tags (large and long) and one more NOUN tag (beak), which made the description more vivid and detailed.
Additionally, it can be seen from Figure 5 that, in most cases, the self-attention PoS predictor was able to precisely predict the PoS tags. It is worth noting that the corresponding word could also be inferred correctly even if its PoS tag was incorrect, which implied that the PoS predictor actually played the role of an auxiliary task and that, by means of the beam search strategy [8], the proposed model was able to correct errors in the PoS tags to some extent.
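To make the role of beam search in this kind of recovery more concrete, the following is a minimal, generic sketch rather than the decoder used in this work. The toy model, its tokens (A, B, x, y), and its probabilities are hypothetical and chosen only to show that keeping several hypotheses lets a sequence with a weaker first step overtake the locally preferred one.

```python
import math

# Hypothetical toy model: log-probabilities of the next token given a prefix.
# The tokens and numbers are made up for illustration only.
TOY_MODEL = {
    (): {"A": math.log(0.6), "B": math.log(0.4)},
    ("A",): {"x": math.log(0.5), "y": math.log(0.5)},
    ("B",): {"x": math.log(0.9), "y": math.log(0.1)},
}


def next_log_probs(prefix):
    """Return the next-token log-probabilities for a given prefix."""
    return TOY_MODEL[prefix]


def beam_search(score_fn, steps, beam_size):
    """Keep the `beam_size` highest-scoring partial sequences at every step."""
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, log_p in score_fn(seq).items():
                candidates.append((seq + (token,), score + log_p))
        # Sort by cumulative score, best first, and prune to the beam size.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


greedy = beam_search(next_log_probs, steps=2, beam_size=1)
wide = beam_search(next_log_probs, steps=2, beam_size=2)
```

With a beam size of 1, the search commits to the locally best first token A. With a beam size of 2, the hypothesis starting with B survives the first step and ends with the higher overall score.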
We further visualized the attended image regions and the variations of the gate values in the gate controller during caption generation in Figure 6. For each word, we mainly analyzed the gate value of the gate controller in the last decoding block, since it was directly used to infer the next word. From Figure 6, we can observe that the proposed model was able to correctly attend to the corresponding image regions when predicting the visual words, e.g., baseball, game, and batter, while preventing itself from attending to any image region when a nonvisual word was being generated, such as a, progress, and the. To be specific, our model assigned a large gate value (over 0.9) to visual words. Note that some nonvisual words following a NOUN, such as in and up, may also be assigned gate values larger than 0.5, which was reasonable since these words actually represented the relationships between objects, i.e., they were closely related to the visual words. The visualization experiment further demonstrated that our PoS-Transformer model effectively took advantage of the PoS information to adaptively adjust the effect of visual features and language signals on the word prediction.
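As an illustration of how a scalar gate can trade off visual features against language signals, the sketch below assumes a simple convex combination controlled by a sigmoid gate. This form, the function names, and the toy vectors are assumptions made for illustration; the gate controller of PoS-Transformer may be defined differently.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, language, gate_logit):
    """Blend a visual context vector with a language signal.

    Assumed (hypothetical) form: fused = g * visual + (1 - g) * language,
    with g = sigmoid(gate_logit). A gate near 1 favors the visual features,
    a gate near 0 favors the language signal.
    """
    g = sigmoid(gate_logit)
    fused = [g * v + (1.0 - g) * l for v, l in zip(visual, language)]
    return g, fused


# Toy two-dimensional signals, chosen so the mixing weights are easy to read.
visual = [1.0, 0.0]
language = [0.0, 1.0]

# A large positive logit mimics a visual word; a large negative one, a nonvisual word.
g_visual, fused_visual = gated_fusion(visual, language, gate_logit=3.0)
g_nonvisual, fused_nonvisual = gated_fusion(visual, language, gate_logit=-3.0)
```

Under this assumed form, a gate value above 0.9 means the fused representation is dominated by the visual context, which matches the behavior reported for visual words, while a gate value near 0 leaves the prediction almost entirely to the language signal.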

Conclusions
In this paper, we presented PoS-Transformer, a novel transformer-based framework for image captioning that separates the grammatical structures and word semantics of captions and incorporates the PoS guiding information into the modeling. PoS-Transformer seamlessly integrated the PoS prediction module with the transformer-based captioner for more grounded and fine-grained image captioning. By virtue of the two proposed attention mechanisms, the PoS-Transformer decoder effectively exploited the PoS information to guide the caption generation, which not only adaptively adjusted the weights between visual and language signals for more grounded captioning, but also leveraged the PoS information to generate more fine-grained sentences. Extensive experiments as well as ablation studies demonstrated that our method could significantly boost the performance of image captioning on top of the transformer-based architecture and substantially outperform other PoS-based image captioning models on the Flickr30k and MSCOCO datasets.
The current PoS-Transformer model focuses on introducing syntactic structures into the conventional language model in image captioning, which can play a better role in robot interaction, preschool education, and other application fields. Additional visual and semantic encoding approaches, such as exploiting the image attributes and the relative geometric relations between objects, are not yet integrated with PoS-Transformer, although it has been validated that these approaches can provide much richer visual and semantic information to facilitate high-quality caption generation. In our future work, we will further enrich the representations of visual and semantic concepts to boost the performance of PoS-Transformer.