Attention-Guided Image Captioning through Word Information

Image captioning generates written descriptions of an image. In recent image captioning research, attention regions seldom cover all objects, so generated captions may lack object details and stray from reality. In this paper, we propose a word guided attention (WGA) method for image captioning. First, WGA extracts word information from the embedded word and the memory cell by applying transformation and multiplication. Then, WGA applies the word information to the attention results and obtains the attended feature vectors via elementwise multiplication. Finally, we apply WGA with words from different time steps to obtain previous word guided attention (PW) and current word guided attention (CW) in the decoder. Experiments on the MSCOCO dataset show that our proposed WGA achieves competitive performance against state-of-the-art methods, with PW reaching a 39.1 Bilingual Evaluation Understudy (BLEU-4) score and a 127.6 Consensus-Based Image Description Evaluation (CIDEr-D) score, and CW reaching a 39.1 BLEU-4 score and a 127.2 CIDEr-D score on the Karpathy test split.


Introduction
Image captioning is an interdisciplinary research problem spanning computer vision and natural language processing, aiming to generate natural descriptions of images. In recent years, image captioning has made great progress with the rapid development of classification [1], object detection [2], and machine translation. However, many problems remain, such as recognizing objects and their interactions and aligning objects with words, which make it a challenging task [3][4][5][6][7].
Inspired by attention mechanisms [8] and sequence-to-sequence models [9] exploited in machine translation tasks, the encoder-decoder framework [10][11][12][13][14] has been widely used for image captioning. In such a framework, images are encoded into feature vectors by a pretrained image classification model, object detection model, or semantic segmentation model, and then decoded into words via an RNN. Within the RNN [15], decoding proceeds sequentially, generating words one by one. Before the attention mechanism was introduced, there was little optimization of this framework. The attention mechanism [16], which comes from machine translation, can guide the generation of words by weighting the features so as to connect a portion of an image with a word at each time step.
Currently, attention mechanisms are widely applied in image captioning [17] systems. Because attention directly determines the caption of an image, the inference direction determined by the attention modules must be correct. However, attention is generally concentrated and superficial, which tends to leave the decoder knowing little about, or mistaking, the objects in the image, such as the "dog" in Figure 1b. In detail, the decoder may be misled into simply listing nouns in the generated sentence while ignoring the relationships among the objects, for example the relation between "person" and "motorcycle" in Figure 1a and what the "dog" is doing in Figure 1b. Moreover, an attended region represents only one word, which means that the decoder may overlook the details of an object, for instance the word "little" depicting the "girl" in Figure 1c. To address this issue, we propose word guided attention (WGA), which is created from word information, to bring novel, specific guidance to the decoder. First, we design a new information processing scheme for words with several transformations and activation functions, similar to GLU [18]. This processing involves memory cell weighting, embedded words, and basic attention. Based on this process, we construct a WGA module in the decoder. Subsequently, we utilize the WGA and propose respective methods for words from different time steps. Fused with the previous step word, previous word guided attention (PW) is achieved. In addition, the current step word constitutes current word guided attention (CW).
In this paper, we apply self-attention [19] as the basic attention unit in both the encoder and decoder phases. In the encoder, self-attention builds relations among objects by weighting the feature vectors extracted from an image. In the decoder, self-attention points out the major objects in an image and plays a guiding role in PW/CW. Furthermore, we propose PW to expand the scope of described objects and intensify their relationships via word-level attention. On the other hand, CW concentrates on the current saliency region to obtain more detailed content and deeper relations.
We evaluate our method on the MSCOCO dataset and perform quantitative and qualitative analysis. The results show that our WGA is effective. The proposed PW/CW model is superior to other published image caption models. The main contributions of our paper include the following:

•
We propose a novel word guided attention module for image captioning to determine the relationships among the attention features of an encoded image.

•
We use the WGA with the previous step word and the current step word. With the previous word, the WGA concentrates on covering more objects in the scene and describing the relevance among them. With the current step word, the WGA is devoted to obtaining more details and deeper relation information from the current attention region.

Image Captioning
Recent image captioning approaches are based on the encoder-decoder framework, which benefits from the development of deep learning and machine translation [8]. For example, an end-to-end CNN-LSTM framework is proposed to encode an image into CNN feature vectors and decode them into a sentence [20]. In [21], high-level semantic information is incorporated into a CNN-LSTM framework. In [22], a two-layer LSTM is applied, with attention performed between the two stages. Moreover, some complicated information, such as attributes and relationships, is integrated so that the generated captions cover an image more completely [23][24][25].

Attention Mechanism
The attention mechanism [26], which originates from simulating the human perception system, has been widely employed and has made great progress in sequence-to-sequence tasks. In image captioning, attention is an essential part of the model. In [16], a weighted candidate vector is proposed to teach the decoder to focus on the right regions of an image using normalization and the SoftMax function. Since then, many studies on attention mechanisms have emerged for image captioning, such as adaptive attention [27] and SCA-CNN [28].
Machine translation also offers much inspiration for image captioning. For instance, a novel attention mechanism derived from words is proposed in [29]. In [19], self-attention is proposed and obtains state-of-the-art results.

Methods
We first introduce the WGA module. Then we present how the WGA works for different image captioning phases.

WGA
A basic attention unit $f_{Att}(\cdot)$ provides the weighted feature vectors $\hat{V}$ for queries, keys, and values (denoted as $Q$, $K$, and $V$, respectively), as shown in Figure 2a. First, $Q$, $K$, and $V$ are linearly transformed independently. Then, the similarity weight between $Q$ and $K$ is measured by the dot-product with scale correction and the SoftMax function. Finally, matrix multiplication is performed between the similarity weight and $V$. Thus, a basic attention unit $\hat{V} = f_{Att}(Q, K, V)$ can be formulated as

$$\mathrm{sim}_{i,j} = \frac{(W_Q q_i + b_Q)^\top (W_K k_j + b_K)}{\sqrt{D}}, \qquad \hat{v}_i = \sum_j \mathrm{SoftMax}_j(\mathrm{sim}_{i,j})\,(W_V v_j + b_V),$$

where $q_i \in Q$, $k_j \in K$, and $v_j \in V$; $W_Q, W_K, W_V \in \mathbb{R}^{D \times D}$ and $b_* \in \mathbb{R}^D$ are the linear transformations and biases of the queries, keys, and values, respectively; and $D$ is the dimension of a vector. $\mathrm{sim}_{i,j}$ denotes the similarity score between $q_i$ and $k_j$ computed via the dot-product, and $\hat{v}_i \in \hat{V}$ is the attended feature vector. A basic attention unit outputs preliminarily attended feature vectors, which can direct a language model to generate more nouns and build their relationships effectively. However, it can be too forceful, inducing correlations among irrelevant objects, and may ignore some inconspicuous objects.
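As a minimal NumPy sketch of the basic attention unit described above (function and parameter names are ours, not from the paper): each query is scored against every key with a scaled dot-product, the scores are normalized with SoftMax, and the values are weighted accordingly.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def basic_attention(Q, K, V, W_q, W_k, W_v, b_q, b_k, b_v):
    """Basic attention unit f_Att: linearly transform Q, K, V, score
    query-key pairs with the scaled dot-product, normalize with SoftMax,
    and weight the transformed values."""
    D = Q.shape[-1]
    q, k, v = Q @ W_q + b_q, K @ W_k + b_k, V @ W_v + b_v
    sim = (q @ k.T) / np.sqrt(D)      # similarity scores sim_{i,j}
    alpha = softmax(sim, axis=-1)     # attention weights per query
    return alpha @ v                  # attended vectors v_hat_i
```

Each row of the attention-weight matrix sums to 1, so every attended vector is a convex combination of the transformed values.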
Therefore, we propose the WGA module $f_{WGA}(\cdot)$, as shown in Figure 2b, to extract guiding information from a generated word. The WGA module generates a word guiding weighting (WGW) $\beta_i$, which is conditioned on the attention unit result $\hat{V}$, and adopts elementwise multiplication between $\beta_i$ and $\hat{V}$ to output the word guided attended feature vectors $V'$ through a residual connection. $V' = f_{WGA}(Q, K, V, X, M)$ can be determined by

$$v'_i = \hat{v}_i + \beta_i \odot \hat{v}_i, \qquad \beta_i = f_{WGW}(\hat{v}_i, W_e x_T, m_T),$$

where $v'_i \in V'$, $f_{WGW}(\cdot)$ denotes the WGW process, $[\cdot,\cdot]$ is a function for concatenating two vectors, $x \in X$ is the embedded word, $m_T \in M$ is a memory cell, and $T$ is the set of time steps. $W_\lambda, W_\varphi \in \mathbb{R}^{D \times D}$ and $b_* \in \mathbb{R}^D$ are two linear transformation matrices and their biases, and $W_e$ is the word embedding matrix. The WGW $f_{WGW}(\cdot)$ is central to the WGA, obtaining the guiding weight information from a word. As shown in Figure 3, WGW first employs the memory cell $M$ and the individual word $X$ to strengthen the influence of generated words on the subsequent sentence content via an activation function and elementwise multiplication:

$$g_{x_T} = \tanh(W_\lambda W_e x_T + b_\lambda) \odot (W_\varphi m_T + b_\varphi).$$

Then, WGW applies linear transformations to merge the word context information $g_{x_T}$ with $\hat{v}_i$, and finally the SoftMax function is used to obtain the weighting $\beta_i = f_{WGW}(\hat{v}_i, W_e x_T, m_T)$:

$$\beta_i = \mathrm{SoftMax}\big(W_{\beta} \tanh(W_{\hat{v}} \hat{v}_i + W_g g_{x_T}) + b_\beta\big),$$

where $W_\beta, W_{\hat{v}}, W_g \in \mathbb{R}^{D \times D}$ and $b_* \in \mathbb{R}^D$ are linear transformation matrices and the corresponding biases, and $\tanh$ denotes an activation function.
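A plausible NumPy sketch of the WGW/WGA data flow follows; the exact fusion of word embedding and memory cell is an assumed GLU-style form (the paper defines it via its own equations), and all function and parameter names here are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def wgw(v_hat, word_emb, memory, W_lam, W_phi, W_v, W_g, W_beta):
    """Word guiding weighting (WGW): gate the embedded word with the memory
    cell (assumed GLU-style fusion), merge the word context g_x with the
    attended vectors, and normalize with SoftMax to obtain beta."""
    g_x = np.tanh(word_emb @ W_lam) * (memory @ W_phi)  # word context g_x
    h = np.tanh(v_hat @ W_v + g_x @ W_g)                # merge with v_hat
    return softmax(h @ W_beta, axis=-1)                 # guiding weights beta_i

def wga(v_hat, word_emb, memory, params):
    """WGA: elementwise multiplication of beta with the attended features,
    plus a residual connection back to v_hat."""
    beta = wgw(v_hat, word_emb, memory, *params)
    return v_hat + beta * v_hat
```

The residual connection means WGA can only modulate the basic attention result, never discard it, which matches the guiding (rather than replacing) role described above.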

Image Captioning Model
We incorporated the WGA into an image captioning model based on the two-LSTM encoder-decoder framework.
For an image $I$, CNN feature vectors $A = \{a_1, a_2, \ldots, a_n\}$ are extracted, where $a_n \in \mathbb{R}^D$, $n \in N$, and $N$ is the number of feature vectors in the image. In the encoder phase $Enc$, we not only obtain the feature vectors but also feed $A$ to the basic attention unit (Figure 2a) $f_{Att}(Q, K, V)$, where $Q$, $K$, and $V$ all take $A$ as input. $Enc$ can be formulated as

$$\hat{A}_E = f_{Att}(A, A, A),$$

where $\hat{A}_E = \{\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_n\}$ is the encoded result of an image.
In the decoder, we generate a sequence of caption words $y$ from the encoding results $\hat{A}_E$. The two-LSTM framework is composed of a language LSTM $\mathrm{LSTM}_{lang}$ and an attention LSTM $\mathrm{LSTM}_{Att}$, as shown in Figure 4. We input the mean-pooled visual vector $\bar{a} = \frac{1}{n}\sum \hat{A}_E$ and the $t$-th time step embedded word $W_e x_t$ to the attention LSTM $\mathrm{LSTM}_{Att}$, which can be defined as

$$h^{Att}_t, m^{Att}_t = \mathrm{LSTM}_{Att}([\bar{a}; W_e x_t], h^{Att}_{t-1}),$$

where $t \in T$, and $h^{Att}_t, m^{Att}_t \in \mathbb{R}^D$ are the hidden state and memory cell of $\mathrm{LSTM}_{Att}$, respectively.
To make the WGA produce a marked effect in the decoder, we inserted it between $\mathrm{LSTM}_{lang}$ and $\mathrm{LSTM}_{Att}$ to guide the language model. As shown in Figure 4, $A'_t$ is obtained from $f_{WGA}(\cdot)$ fed with $\hat{A}_E$, $W_e x_T$, $m_T$, and $h^{Att}_t$, which is formulated as

$$A'_t = f_{WGA}(h^{Att}_t, \hat{A}_E, \hat{A}_E, W_e x_T, m_T),$$

where $Q$ is replaced with $h^{Att}_t$, and $K$ and $V$ are both replaced with $\hat{A}_E$. The choices of $W_e x_T$ and $m_T$ are discussed after the language LSTM model.
The input to $\mathrm{LSTM}_{lang}$ is the concatenation of the WGA weighted feature vectors $A'_t$ and the current hidden state of $\mathrm{LSTM}_{Att}$. Therefore, $\mathrm{LSTM}_{lang}$ can be presented as

$$h^{lang}_t, m^{lang}_t = \mathrm{LSTM}_{lang}([A'_t; h^{Att}_t], h^{lang}_{t-1}),$$

where $t \in T$ and $h^{lang}_t, m^{lang}_t \in \mathbb{R}^D$ are the hidden state and memory cell of $\mathrm{LSTM}_{lang}$, respectively. We can then obtain a probability distribution $y_t$ for the caption prediction at time step $t$:

$$p(y_t \mid y_{1:t-1}) = \mathrm{SoftMax}(W_h h^{lang}_t),$$

where $W_h \in \mathbb{R}^{D \times D}$, $y_{1:T}$ refers to the generated caption, and SoftMax is the activation function.
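The decoder step just described (attention LSTM, then WGA, then language LSTM, then SoftMax) can be sketched schematically. In this sketch the LSTM cells and the WGA are injected as callables, memory cells are omitted for brevity, and the WGA is assumed to return a single pooled feature vector; all names are illustrative, not the paper's.

```python
import numpy as np

def decoder_step(A_enc, word_emb, states, attn_lstm, lang_lstm, wga, W_h):
    """One step of the two-LSTM decoder: the attention LSTM reads the
    mean-pooled visual vector and the embedded word, WGA re-weights the
    encoded features A_enc, and the language LSTM predicts the next-word
    distribution y_t."""
    h_att, h_lang = states
    a_bar = A_enc.mean(axis=0)                               # mean-pooled visual vector
    h_att = attn_lstm(np.concatenate([a_bar, word_emb]), h_att)
    A_t = wga(h_att, A_enc)                                  # word guided attended feature
    h_lang = lang_lstm(np.concatenate([A_t, h_att]), h_lang)
    logits = h_lang @ W_h
    p = np.exp(logits - logits.max())                        # SoftMax over the vocabulary
    return p / p.sum(), (h_att, h_lang)
```

Iterating this function over time steps, feeding each sampled word's embedding back in, yields the full caption.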
As stated earlier, we fed different $W_e x_T$ and $m_T$ into the WGA to realize different generation improvements, as shown in Figure 5.

Previous word guided attention. To better describe the entire scene, we made use of the previous word information $W_e x_{t-1}$ and the memory cell $m^{lang}_{t-1}$ (Figure 5a):

$$A'_t = f_{WGA}(h^{Att}_t, \hat{A}_E, \hat{A}_E, W_e x_{t-1}, m^{lang}_{t-1}).$$

We believe that the word information from the previous step protects the logical correctness of the current word's generation. Furthermore, this information is a summary of the previous attention region: it can guide the model to select the correct attention region for the current step, so that neglected attention regions are effectively utilized and the WGA covers more objects and relations.
Current word guided attention. For a more detailed description of the current attention region, we applied the current memory cell $m^{Att}_t$ of $\mathrm{LSTM}_{Att}$ and simulated generating the current word via $h^{Att}_t$ and gated linear units (Figure 5b):

$$\tilde{x}_t = (W_{Att} \hat{A}^D_t + b_{Att}) \odot \delta(W_{gate} h^{Att}_t + b_{gate}), \qquad A'_t = f_{WGA}(h^{Att}_t, \hat{A}_E, \hat{A}_E, \tilde{x}_t, m^{Att}_t),$$

where $\hat{A}^D_t$ is the result of attending $\hat{A}_E$ with the basic attention unit $f_{Att}(\cdot)$, $W_{Att}, W_{gate} \in \mathbb{R}^{D \times D}$, $b_* \in \mathbb{R}^D$, and $\delta$ is a sigmoid function. We believe that the current word information can help the model focus on the significant current attention region by weighting the feature vectors. Thus, the salient region can provide more details, not only about objects but also about deeper relations or status.
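The gated-linear-unit simulation of the not-yet-generated current word can be sketched as below; the exact form is an assumption based on the description above (a linear transform of the attended visual summary, gated by a sigmoid of the attention-LSTM hidden state), and the function name is ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def simulate_current_word(A_d, h_att, W_att, W_gate):
    """GLU-style pseudo-word for CW (assumed form): the attended visual
    summary A_d is linearly transformed and elementwise-gated by the
    attention-LSTM hidden state h_att."""
    return (A_d @ W_att) * sigmoid(h_att @ W_gate)
```

The sigmoid gate keeps each channel of the pseudo-word in a soft range scaled by the visual summary, so the decoder state controls how much of the current region's content flows into the WGA.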

Training and Objectives
Training with cross-entropy loss. We first trained the image captioning model using the cross-entropy loss $L_{XE}$:

$$L_{XE}(\theta) = -\sum_{t=1}^{T} \log p_\theta(y^*_t \mid y^*_{1:t-1}),$$

where $y^*_{1:T}$ refers to the ground-truth caption. Optimization using the CIDEr-D score. We then followed the approach of self-critical sequence training (SCST) [30] to optimize the model:

$$L_R(\theta) = -\mathbb{E}_{y_{1:T} \sim p_\theta}[r(y_{1:T})],$$

where the reward $r(\cdot)$ is calculated with the Consensus-Based Image Description Evaluation (CIDEr-D) metric [31]. The gradient is approximated as

$$\nabla_\theta L_R(\theta) \approx -(r(y^s_{1:T}) - r(\hat{y}_{1:T}))\,\nabla_\theta \log p_\theta(y^s_{1:T}),$$

where $y^s$ is the caption sampled from the predicted probabilities and $\hat{y}$ is the result of greedy decoding.
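In SCST the gradient amounts to weighting the sampled caption's total log-probability by the advantage r(y^s) − r(ŷ), with greedy decoding serving as the baseline. A minimal sketch of this surrogate loss (names are illustrative):

```python
import numpy as np

def scst_loss(log_probs, r_sample, r_greedy):
    """SCST surrogate loss: the sampled caption's summed log-probability is
    weighted by the advantage r(y^s) - r(y_hat); differentiating this with
    respect to the model parameters reproduces the SCST gradient."""
    advantage = r_sample - r_greedy   # greedy decoding is the baseline
    return -advantage * np.sum(log_probs)
```

When the sample beats the greedy caption (positive advantage) the loss pushes the sample's log-probability up; when it does worse, the sample is suppressed, so no separate learned baseline is needed.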

Dataset
The proposed method was implemented on the MSCOCO [32] dataset. The MSCOCO dataset contains 123,287 images, including 82,783 training images and 40,504 validation images, with 5 captions for each. The Karpathy split [33] was adopted to obtain 113,287 images for training, 5000 images for validation, and 5000 images for testing. We collected words occurring more than 4 times in all sentences of the MSCOCO dataset and obtained a dictionary of 10,369 words. One metric we used to evaluate our method was the Bilingual Evaluation Understudy score (BLEU-N) [34], which can be calculated as

$$\mathrm{BLEU}\text{-}N = BP \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases},$$

where $r$ and $c$ denote the lengths of the ground-truth captions and the generated captions, respectively, $p_n$ is the $n$-gram precision, and $w_n$ is the uniform weight $1/N$. In addition, we also adopted CIDEr-D [31], the Metric for Evaluation of Translation with Explicit Ordering (METEOR) [35], Semantic Propositional Image Caption Evaluation (SPICE) [36], and the Recall-Oriented Understudy for Gisting Evaluation (ROUGE-L) [37] to evaluate our method. These metrics calculate the similarity between the generated caption and the ground-truth captions, and higher values represent better results.

Implementation Details
The Faster R-CNN model [2], trained on ImageNet [38] and Visual Genome [39], was exploited to extract bottom-up feature vectors from images. These vectors have 2048 dimensions and were transformed into 1024-D vectors to match the hidden size of the LSTM in the decoder. In the training phase with cross-entropy loss, we adopted an initial learning rate of 4e-4, decayed by a factor of 0.8 every 2 epochs, and the ADAM optimizer was employed over a total of 30 epochs. For training with CIDEr-D score optimization, we set the initial learning rate to 4e-5 and decayed it by 50% when performance on the validation split did not improve within another 20 epochs. In addition, we set the image batch size to 10 during training, and the beam size was 2 during testing.

Quantitative Analysis
To validate the performance of our method, we gathered results on the Karpathy test split from other methods. These methods are based on certain well-known frameworks or improved attention, including LSTM [20], which encodes an image into CNN features and decodes them into a series of words using LSTM; SCST [30], which proposes sequence training with evaluation metrics using reinforcement learning; Adaptive-Attention [27], which proposes an adaptive attention model with a visual sentinel; RFNet [40], which proposes a novel recurrent fusion network to exploit multiple sources of information from encoders; UpDown [22], which applies two LSTMs to weight bottom-up image features; research [41,42] that contributes new attention mechanisms using semantic-enhanced image features and spatiotemporal memory, respectively; research [43] that provides a special decoding phase improved by a ruminant mechanism; research [44] that leverages object attributes to structure linguistically-aware attention, addressing the lack of high-level understanding; VRAtt-Soft [45], which proposes novel visual relationship attention via contextualized embedding for individual regions; and research [46] that extends the caption model by incorporating extra explicit knowledge from a memory bank. The results, shown as percentages, are presented in Tables 1 and 2. We report the performance of the methods trained with cross-entropy loss in Table 1, where it can be seen that our PW is superior to the other methods in all metrics. CW also achieves comparable performance and is slightly better than PW. In addition, we present the comparison among methods trained with cross-entropy loss and then optimized via the CIDEr-D score in Table 2. The results demonstrate that CW achieves the best performance among all methods, and PW is second best. Furthermore, we collected four models with different initial training parameters to perform an ensemble evaluation; the comparison is in Table 3.
Our method obtains satisfactory results compared with the others.

Qualitative Analysis
We report some examples of images and the corresponding captions gathered from our PW, CW, the baseline, and the ground truth. Note that we reimplemented the UpDown [22] model sharing the parameters of our models as the baseline. From Table 4, where we mark the improvements in blue, we found that the baseline rigidly describes the prominent objects without exact relations between objects or a detailed depiction of them. Moreover, our models were superior in two ways: a) Our models attend to the whole image and capture nearly all components in it. For the first example, the baseline only recognizes "women" and a shallow relation but ignores other objects such as the wine and glasses. In contrast, the captions from PW contain more objects and count them more correctly, including "wine" and "glasses", and CW also captures the background of the "room". The other three examples confirm this conclusion. b) Our models can obtain the connections between objects in greater scope and depth with PW and CW. As seen in the first example, PW identifies the objects "two women" and "wine glasses" and then builds the relationship between the "women" and connects the "women" with the "wine glasses" through "holding". Meanwhile, CW can guide the model in another direction. In the same example, CW determines the relationship among "women", "wine", and "glasses" and then describes it with "drinking". The other examples support the same conclusion. PW and CW have these advantages because they guide the model to distribute attention for different purposes. As we can see in Table 4, PW and CW are adept at building relations between objects due to the self-attention and the basic WGA. PW determines how the model covers the objects in an image, and CW more deeply assesses the details of the current attention region, which we will show in Section 4.5.
Table 4. Samples of image captions generated by our PW, CW, and baseline as well as ground truths.

Image Captions
Baseline: A couple of women standing next to each other. Our PW: Two women standing next to each other holding wine glasses.
Our CW: Two women drinking wine in a room. GT1: Two young women are sharing a bottle of wine. GT2: Two female friends posing with a bottle of wine. GT3: Two women posing for a photo with drinks in hand.
Baseline: A group of people walking down a street. Our PW: A group of people standing in the street with an umbrella.
Our CW: A group of people standing under an umbrella. GT1: Several people standing on a sidewalk under an umbrella. GT2: Some people standing on a dark street with an umbrella. GT3: Some people standing on a dark street with an umbrella.
Baseline: A close up of a horse in a field. Our PW: A white horse standing in the grass in a field.
Our CW: A white horse grazing in a field of grass. GT1: A horse eating grass in a green field. GT2: A while horse bending down eating grass. GT3: A tall black and white horse standing on a lush green field.
Baseline: A group of people on skis in the snow. Our PW: A group of people riding skis down a snow covered slope.
Our CW: Two men are skiing down a snow covered slope. GT1: Two cross country skiers heading onto the trail. GT2: Two guys cross country ski in a race. GT3: Skiers on their skis ride on the slope while others watch.

Ablative Studies
To quantify the influence of our WGA models, we compared PW and CW against other methods under the same training regime. First, the UpDown method was defined as the baseline model, which adopts two LSTMs and attention to generate captions. Second, we employed the metrics of B@1, B@4, ROUGE-L, and CIDEr-D to evaluate the models trained after CIDEr-D score optimization. Finally, we refer to self-attention as self-att, the encoder phase as Enc, and the decoder phase as Dec in Table 5. Effect of self-attention. To evaluate the influence of self-attention, we successively extended the baseline with self-attention in the decoder and encoder. In the decoder, self-attention was located between the two-layer LSTM and became the backbone of the residual construction. From Table 5, we observe that replacing the original attention with self-attention brings benefits, improving the B@4 and CIDEr-D scores of the baseline by 0.9 and 2.0, respectively. In the encoder phase, we utilized self-attention to highlight the principal parts of the image. From Table 5, we can conclude that the weighted feature representations have an effective impact on the model: further self-attention improved the B@4 and CIDEr-D scores by 0.8 and 1.1, respectively.
Effect of word guided attention. We conducted further experiments to test the performance of the PW and CW modules. These two were designed following self-attention and constitute the WGA model during the decoding phase. In Table 5, we obtain a B@4 increase of 0.7 and a CIDEr-D increase of 1.8 for PW. On the other hand, CW improved the B@4 and CIDEr-D scores of self-att(Enc+Dec) by 0.7 and 1.4, respectively. Unfortunately, the B@4 and CIDEr-D scores were 38.8 and 126.2, respectively, when combining PW with CW by concatenating them. We think that PW and CW guide the model in distinct directions, and combining them confuses the inference. Even so, the basic word guided module still works, which is why the combination is always better than self-att(Enc+Dec).
To qualitatively present the influence of WGA, we visualized the sentences generated by the ablated models. Following Table 4, we present the ablation results in Table 6. As we can see, the sentences became increasingly rich from the baseline to PW/CW. The captions of PW/CW also confirmed their respective model characteristics. For example, PW and CW obtained different styles of improvement compared with the caption of self-att(Enc+Dec) in the last example. PW added a "skis" object and built the connection between "people" and "skis", described as "riding skis down". CW replaced "A group of people" with "Two men" for more detailed information and deeper captioning.
Table 6. Visualization of the generated captions of the ablated models, where the colored words are the improvements from the previous caption.

Image Captions
Baseline: A couple of women standing next to each other. +self-att(Dec): A couple of women standing next to each other. +self-att(Enc+Dec): Two women are holding wine glasses in a room. Our PW: Two women standing next to each other holding wine glasses.
Our CW: Two women drinking wine in a room.
Baseline: A group of people walking down a street +self-att(Dec): A group of people standing in the street. +self-att(Enc+Dec): A group of people standing with an umbrella. Our PW: A group of people standing in the street with an umbrella.
Our CW: A group of people standing under an umbrella.
Baseline: A close up of a horse in a field. +self-att(Dec): A horse standing in a field. +self-att(Enc+Dec): A horse in the grass in a field. Our PW: A white horse standing in the grass in a field.
Our CW: A white horse grazing in a field of grass.
Baseline: A group of people on skis in the snow. +self-att(Dec): A man riding skis in the snow. +self-att(Enc+Dec): A group of people skiing down a snow covered slope. Our PW: A group of people riding skis down a snow covered slope.
Our CW: Two men are skiing down a snow covered slope.

Conclusions
In this paper, we propose a novel attention guided by word information (WGA) for image captioning, which aims to extract more valuable information from images. The proposed attention contains a novel word guiding weighting (WGW), which is built upon the extended word information, and a residual structure. The WGA can therefore provide varied semantic information to compensate for the missing objects and image details in the captioning model. We then propose different applications of WGA in the decoder, obtaining previous word guided attention (PW) and current word guided attention (CW) at different time steps. We demonstrate that PW can expand the insight of the model to cover more objects in the image, while CW can focus on the current region to extract further information. More remarkably, we achieve competitive performance against other methods, and the experimental results show that our proposed method is stable and general.
In the future, we will explore how to fuse word information in the encoder to guide the captioning model. For PW, the key is to find the point at which the word information can be embedded. On the other hand, how to simulate word information at the current time step to construct CW remains an issue. In addition, the gap between the image features and word information needs to be bridged.

Data Availability Statement: Publicly available datasets were analyzed in this study. This data can be found here: https://cocodataset.org/ (accessed on 25 September 2021).