Middle-Level Attribute-Based Language Retouching for Image Caption Generation

Abstract: Image caption generation is an attractive research area that focuses on generating natural language sentences to describe the visual content of a given image. It is an interdisciplinary subject combining computer vision (CV) and natural language processing (NLP). Existing image captioning methods mainly focus on generating the final image caption directly, which may lose significant identification information of the objects contained in the raw image. Therefore, we propose a new middle-level attribute-based language retouching (MLALR) method to solve this problem. Our proposed MLALR method uses the middle-level attributes predicted from the object regions to retouch the intermediate image description, which is generated by our language generation model. The advantage of our MLALR method is that it can correct descriptive errors in the intermediate image description and make the final image caption more accurate. Moreover, evaluation on the benchmark datasets MSCOCO, Flickr8K, and Flickr30K validated the impressive performance of our MLALR method with the evaluation metrics BLEU, METEOR, ROUGE-L, CIDEr, and SPICE.


Introduction
Image caption generation is an attractive research topic in the field of artificial intelligence that has emerged in recent years. It is an interdisciplinary subject combining computer vision (CV) and natural language processing (NLP). It focuses on generating a readable sentence to depict the visual content of a given image. Representative applications of image caption generation include assisting visually impaired people in perceiving the visual content of surrounding scenes, providing visual intelligence to chatting robots, image retrieval, and semantic visual research. Therefore, image caption generation is a significant part of scene understanding.
Benefiting from the rapid development of deep learning, many novel methods have been proposed for image caption generation in recent years. Researchers have used convolutional neural networks (CNNs) [1] to extract the visual information of a given image and a decoder, such as a recurrent neural network (RNN) [2], to generate the corresponding language description. Briefly, most methods of image caption generation are based on an encoder-decoder framework. CNN-based models, such as VGG16 [3] and ResNet-101 [4], are the most common encoders. Meanwhile, RNN-based models, such as the Gated Recurrent Unit (GRU) [5], Bidirectional RNNs (Bi-RNNs) [6], and the Long Short-Term Memory cell (LSTM) [7], are commonly used decoders.
However, most methods for image caption generation based on the encoder-decoder framework have an obvious limitation: the image feature extracted by CNNs is a fixed-length feature vector, which may lose significant identification information of the objects contained in the raw image. To address this limitation, we propose a middle-level attribute-based language retouching (MLALR) method. The main contributions of this paper are as follows:

• We propose a new MLALR method for image captioning.
• We extract different information from the raw image, including the global image feature, local image features, and the middle-level attributes, all of which complement each other.
• We use the predicted middle-level attributes to retouch the intermediate image caption and generate the final well-described image caption.
Remainder. The remainder of this paper is organized as follows. We review the related work of image caption generation in Section 2. In Section 3, we introduce our proposed method in detail. Then, the results and discussion are given in Section 4. Finally, we summarize our work with future research in Section 5.

Related Work
In this section, we review the related work of image caption generation. The commonly used framework for image captioning is the encoder-decoder framework, which has been extensively studied recently [14][15][16][17][18][19][20][21][22]. The related work can be broadly divided into two categories: attention-based methods and attribute-based methods.

Attention-Based Methods
The attention mechanism is a powerful technique inspired by human visual attention, and it is widely used in CV and NLP. Therefore, various attention-based methods have been proposed for image caption generation.
The "hard" and "soft" attention methods, proposed by [23], are used to better understand the visual content of the raw image. In [23], the "hard" and "soft" attention methods were respectively applied to two different image captioning frameworks. "Hard" attention is not dependent on all hidden states of the language model, and the gradient in "hard" attention needs to be estimated by Monte Carlo-based sampling. On the other hand, "soft" attention is used to calculate the weights of all input feature vectors and to generate an encoded feature vector. Therefore, "soft" attention is a parameterized method, and it can be embedded in the language model for training directly. Most existing methods use "soft" attention as the basic attention mechanism.
To address the issue that the visual attention mechanism is active all the time, Ref. [24] proposed an adaptive encoder-decoder framework that automatically decides whether to use the attention mechanism or not. In [24], some words can be generated from the previously generated words alone, such as "sign" after "behind a red stop"; in such cases, the attention mechanism is unnecessary, as it also is for non-visual words such as "of" and "the". On the other hand, the attention mechanism is activated when generating words that have little relationship with the previously generated words. The shortcoming of the method in [24] is that the attention mechanism might be suppressed all the time when generating image captions.
Furthermore, to address problems related to missing and misjudged objects, Ref. [25] proposed a global-local attention (GLA) method to solve these issues during the image captioning process. The advantage of the method in [25] is that the image features are split into two parts, one of which is used as the local image features and is integrated by "soft" attention. The drawback of GLA is that the model is not fine-tuned on different datasets. In [26], the authors proposed a novel attention-based model for automatic image captioning named "Areas of attention", which can be trained without bounding-box supervision. The contribution of the method in [26] is that the corresponding image areas can be marked when generating words at each time step.
The authors in [27] proposed a sequence-to-sequence RNN for image caption generation. Different from previous methods, [27] treated the input image as a sequence of detected objects and generated the corresponding captions by using an attention mechanism. The advantage of the method in [27] is that a sequential attention layer is introduced when generating each word. However, the middle-level attributes of the objects were not taken into consideration in [27].
Furthermore, Ref. [28] proposed a bottom-up and top-down method for image captioning and visual question answering (VQA), which also uses an attention mechanism to integrate the objects and other salient image regions. The advantage of the method in [28] is that "soft" and "hard" attention are combined to achieve image captioning.
As described above, the attention mechanism is usually used to integrate different pieces of image information during the word generation process. The power of the attention mechanism relies on extracting sufficient image information.

Attribute-Based Methods
There are various attribute-based methods that have been proposed for image caption generation. In [29], the high-level semantic information of the raw image was extracted for image captioning and VQA, and this achieved impressive performance improvement. The advantage of the method from [29] is that non-parametric attribute prediction and parametric attribute prediction are used as two different methods to extract the visual attribute labels.
Moreover, in [30], an LSTM-A method was proposed for image captioning. The authors in [30] explored the impact of high-level visual attribute labels by using five different frameworks with different insertion locations of attributes, including "Leveraging only Attributes", "Inserting Image First", "Feeding Attribute First", "Inputting Image each time step", and "Inputting Attributes each time step". In [30], the role of high-level visual attributes, such as "outdoor", "riding", and "market", can be observed. Meanwhile, the middle-level attributes of images are integrated with low-level image features in the method proposed in [31], which enhanced the precision of the attribute labels. The low-level image features capture fine details of objects, including edges, corners, pixels, and so on.
Furthermore, for some specific application scenarios, such as supermarkets, researchers have taken user-contributed tags as additional image information to recognize the specific objects contained in the raw image. The user-contributed tags are attributes that can reflect the user's attention, such as a "camera" or "surfboard" held by users. For example, Ref. [32] combined visual attention and user attention simultaneously for social image captioning. In addition, due to the good performance and robustness of some existing image captioning methods, they can also be extended and applied to other fields, such as VQA and video captioning [29,33,34]. The main difference between image captioning and VQA is whether a machine can respond well to the input question information.
From all of the above methods, we can observe that the process of image caption generation can be broadly divided into several parts, including visual feature extraction from the raw images, information fusion of visual features and attribute labels, and descriptive sentence generation.
The problem with the existing image captioning methods is that they use an encoder-decoder framework to generate the final image caption directly, which ignores the role of the middle-level attributes of objects. Therefore, our research focuses on using the middle-level attributes of objects to retouch the intermediate image caption generated by our language model and to generate a final well-described image caption.

Materials and Methods
In this section, we introduce the contributions of this paper, including global and local image feature extraction, the prediction of the middle-level attributes of objects, the language generation model, and language retouching of the intermediate image caption. The framework of our proposed method can be seen in Figure 1. The main purpose of our work is to correct descriptive errors in the intermediate image caption and to make the final image caption more accurate. The main idea of our research is to use the middle-level attributes to retouch the intermediate image caption generated by our language generation model, which can solve the problem discussed before.

Image Feature Extraction
The image features we used are the global image feature and local image features. The global image feature is used to give the language generation model a general understanding of the raw image. Meanwhile, the local image features are used as fine-grained information for image captioning.
In our method, the global image feature is extracted by ResNet-101 [4], which was pre-trained on the ImageNet classification dataset [35]. Since the dimensionality of the last fully connected layer of the ResNet-101 model is 2048, the extracted global image feature of each image is a 2048-dimensional vector, denoted as G (see Equation (1)).
The local image features we use are the feature vectors of object regions. Hence, we used faster-RCNN [8] as the local image feature extractor, as it can generate object features and object regions separately. The object features are used as local image features, and the object regions are used to predict the middle-level attribute labels, as described in the next subsection (see Table 1).
Table 1. Possible objects and object regions contained in raw images.

Categories        Optional Values
Objects           child, man, woman, desk, polar bear, rock, cat, ...
Object regions    the corresponding bounding-box: (P_x, P_y, P_w, P_h)

The column named "Optional Values" in Table 1 denotes the possible objects and the corresponding object regions in the raw images. The bounding-box (P_x, P_y, P_w, P_h) can be used to denote an object region, where P_x and P_y represent the coordinates of the center point of the bounding-box, and P_w and P_h denote the width and height of the bounding-box, respectively.
Since the backbone network of faster-RCNN is ResNet-152, the generated object feature vectors are all 2048-dimensional and are denoted as the set O (see Equation (2)).
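As a concrete illustration of the global feature extraction step, below is a minimal sketch assuming the torchvision implementation of ResNet-101; the exact preprocessing and the layer from which the 2048-dimensional feature is taken may differ from the authors' setup.

```python
# Minimal sketch: extracting the global image feature G with a pre-trained ResNet-101.
# Assumes torchvision; the 2048-d vector is the pooled feature feeding the classifier.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

resnet = models.resnet101(pretrained=True)
resnet.fc = torch.nn.Identity()   # drop the 1000-way classifier, keep the 2048-d feature
resnet.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")
with torch.no_grad():
    G = resnet(preprocess(image).unsqueeze(0)).squeeze(0)  # global feature G, shape (2048,)
print(G.shape)
```

The local object features O and the object regions would be produced analogously by a faster-RCNN detector; that part is omitted here because its interface depends on the specific detection codebase used.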

Middle-Level Attributes Prediction
In this part, we mainly focus on predicting the middle-level attribute labels from object regions, which can be used to retouch the intermediate image description generated by our language generation model. The process can be broadly divided into three steps, including extracting the object regions, predefining the middle-level attributes, and training and applying the middle-level attribute predictors.
From the above subsection, we know that the object features and object regions of the raw image can be generated simultaneously by the faster-RCNN model. Hence, we used the object regions as the raw data to predict the valuable middle-level attribute labels.
Since the categories of the objects contained in the raw image can be recognized by faster-RCNN, we only need to predict the middle-level attribute labels of these objects. We adopted valuable middle-level attributes of human and non-human objects predefined in previous works [36,37].
As described in [36], the images in the PubFig dataset consist of positive and negative examples, each of which is labelled with "0" or "1" for every attribute to indicate whether that attribute is present. The attributes in [36] include expression, lighting, scene, etc. Thus, the PubFig dataset can also be used to train binary classifiers that recognize the presence or absence of describable aspects of visual appearance, such as gender, age, etc. Therefore, the attribute labels used to retouch a human's appearance in our research are a subset of those used in [36]. The details of the human attributes are shown in Table 2, including gender, age, and hair color.
The ImageNet dataset [35] contains a subset [37] which has 9600 images collected from 384 synsets, and each image is paired with 25 object attributes. The 25 semantic appearance attributes in [37] can be divided into four categories: "color", "pattern", "shape", and "texture". Hence, the middle-level attributes of non-human objects used in this research are semantic appearance attributes, which are similar to those in [37] and are illustrated in Table 3.
Finally, we combined multiple attribute predictors into an ensemble attribute predictor to predict the valuable middle-level attributes from the object regions. We used the VGG16 [3] model as the basic classifier, which was trained six times separately for different aims, as shown in Table 4.
We used the PubFig dataset to train our attribute predictors VGG16(GENDER), VGG16(AGE), and VGG16(HAIR COLOR) to predict human attributes. The attribute predictors VGG16(SHAPE), VGG16(COLOR), and VGG16(TEXTURE) were pre-trained on the subset of the ImageNet dataset to recognize the middle-level attributes of non-human objects. After that, these pre-trained attribute predictors were combined into an ensemble attribute predictor to predict the middle-level attributes from the object regions generated by faster-RCNN. It is worth noting that only when the probability of a predicted attribute is greater than a predefined threshold is the predicted attribute applied to retouch the intermediate image description. Here, the threshold was set to 70%.
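To make the ensemble prediction step concrete, the following is a minimal sketch assuming six separately fine-tuned VGG16 classifiers and the 70% threshold described above. The loader, weight file names, and label lists are illustrative assumptions rather than the authors' exact implementation (the label values are taken from examples appearing later in the paper).

```python
# Hedged sketch: an ensemble of per-attribute VGG16 classifiers with a confidence threshold.
# `load_predictor`, the weight files, and the label lists are illustrative assumptions.
import torch
import torch.nn.functional as F
import torchvision

THRESHOLD = 0.70  # only confident attribute predictions are used for retouching

def load_predictor(weights_path, num_classes):
    model = torchvision.models.vgg16()
    model.classifier[6] = torch.nn.Linear(4096, num_classes)   # re-trained head
    model.load_state_dict(torch.load(weights_path))
    return model.eval()

LABELS = {
    "gender":     ["male", "female"],                               # human attributes
    "age":        ["child", "youth", "middle-aged", "senior"],
    "hair color": ["black", "blonde", "brown", "gray"],
    "shape":      ["long", "round", "rectangular", "square"],       # non-human attributes
    "color":      ["white", "gray", "green", "yellow", "red"],
    "texture":    ["smooth", "rough", "furry", "metallic", "vegetation"],
}
PREDICTORS = {name: load_predictor(f"vgg16_{name.replace(' ', '_')}.pth", len(labels))
              for name, labels in LABELS.items()}

def predict_attributes(region_tensor, groups):
    """Return the confident middle-level attribute labels for one object region."""
    attributes = {}
    for group in groups:
        with torch.no_grad():
            probs = F.softmax(PREDICTORS[group](region_tensor.unsqueeze(0)), dim=1)[0]
        conf, idx = probs.max(dim=0)
        if conf.item() > THRESHOLD:                  # discard low-confidence predictions
            attributes[group] = LABELS[group][idx.item()]
    return attributes
```

For a human object region, `groups` would cover gender, age, and hair color; for a non-human region, shape, color, and texture.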

Language Generation Model
The language generation model we used in this work is a stacked two-layer RNN with LSTM cells. The advantage of the LSTM cell is that it can maintain long-term dependencies in the information to a certain extent. Additionally, the two LSTM layers are used to retain the information of the global image feature and the local image features, respectively.
The detailed calculations of the LSTM cell are denoted as Equations (3)-(5). The input, forget, output, and memory gates are denoted as i_t^(l), f_t^(l), o_t^(l), and c_t^(l), respectively, where l ∈ {1, 2} indicates the first layer or the second layer of our language generation model.
where ⊙ denotes element-wise multiplication.
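Since Equations (3)-(5) are not reproduced in this text, the following is the standard LSTM cell update consistent with the gate names above; it is given as a reference formulation, and the authors' exact parameterization may differ.

```latex
% Standard LSTM cell update for layer l at time step t (reference formulation):
\begin{aligned}
i_t^{(l)} &= \sigma\!\left(W_i x_t^{(l)} + U_i h_{t-1}^{(l)} + b_i\right) && \text{(input gate)} \\
f_t^{(l)} &= \sigma\!\left(W_f x_t^{(l)} + U_f h_{t-1}^{(l)} + b_f\right) && \text{(forget gate)} \\
o_t^{(l)} &= \sigma\!\left(W_o x_t^{(l)} + U_o h_{t-1}^{(l)} + b_o\right) && \text{(output gate)} \\
c_t^{(l)} &= f_t^{(l)} \odot c_{t-1}^{(l)} + i_t^{(l)} \odot \tanh\!\left(W_c x_t^{(l)} + U_c h_{t-1}^{(l)} + b_c\right) && \text{(memory cell)} \\
h_t^{(l)} &= o_t^{(l)} \odot \tanh\!\left(c_t^{(l)}\right)
\end{aligned}
```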
The main difference between the first layer and the second layer of our language generation model is the input information. As shown in Figure 2, at time step t, the input information of the first layer, x_t^(1), consists of the global image feature G and the previously generated word s_{t-1}. Hence, the calculation of x_t^(1) can be defined as Equation (8).
where s_{t-1} belongs to the set S, which is the sentence generated by our language generation model in the form of a word sequence, as given in Equation (9).
where L is the length of the generated descriptive sentence. Moreover, the input information of the second layer contains the fused feature vector v_t and the hidden state h_t^(1) of the first layer. In our model, the fused feature vector v_t is calculated from the local image features O and the hidden state h_{t-1}^(2) by using the attention mechanism. It is worth noting that the middle-level attributes are not included in the input information of our attention mechanism. In this way, the role of the middle-level attributes will not be weakened by the attention mechanism.
The cosine function is used to measure the similarity between each object feature vector o_j and the hidden state h_{t-1}^(2), as shown in Equation (10).
Then, the weight value α(o_j) of each object feature vector o_j can be calculated, as shown in Equation (11).
After that, the fused feature vector v_t at time step t can be calculated from the weight values α(o_j) and the object feature vectors o_j, as given in Equation (12).
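Equations (10)-(12) are likewise not reproduced here. A formulation consistent with the description above is sketched below, assuming a softmax normalization of the cosine similarities and omitting any learned projection that may be needed to match the dimensions of o_j and h_{t-1}^(2).

```latex
% Cosine-similarity attention over object features (softmax normalization assumed):
\begin{aligned}
e_{t,j} &= \cos\!\left(o_j, h_{t-1}^{(2)}\right)
         = \frac{o_j \cdot h_{t-1}^{(2)}}{\lVert o_j \rVert \, \lVert h_{t-1}^{(2)} \rVert}, \\
\alpha(o_j) &= \frac{\exp(e_{t,j})}{\sum_{k} \exp(e_{t,k})}, \qquad
v_t = \sum_{j} \alpha(o_j)\, o_j .
\end{aligned}
```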
Hence, the input information of the second layer, x_t^(2), is calculated as Equation (13).
Furthermore, the probability of generating each word s_t is calculated based on the previously generated words {s_1, s_2, ..., s_{t-1}} and the global and local image features; the detailed operation is shown in Equation (14).
Therefore, the probability of generating the entire sentence can be calculated as the product of the probabilities of the individual words, as shown in Equation (15).
The loss function of our language model is the negative log-likelihood loss, which is defined as Equation (16).
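For reference, since Equations (14)-(16) are not shown in this text, the standard forms implied by the description above are as follows; the output layer (a linear transform of the second-layer hidden state followed by a softmax) is an assumption.

```latex
% Word probability, sentence probability, and negative log-likelihood loss:
\begin{aligned}
p(s_t \mid s_1, \ldots, s_{t-1}, G, O) &= \mathrm{softmax}\!\left(W_p\, h_t^{(2)} + b_p\right), \\
p(S \mid G, O) &= \prod_{t=1}^{L} p(s_t \mid s_1, \ldots, s_{t-1}, G, O), \\
\mathcal{L} &= -\log p(S \mid G, O)
            = -\sum_{t=1}^{L} \log p(s_t \mid s_1, \ldots, s_{t-1}, G, O).
\end{aligned}
```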
Finally, we used Self-Critical Sequence Training (SCST) [38] to achieve CIDEr optimization. The negative expected reward is defined as Equation (17), where r indicates the score function.
The gradient of the negative expected reward can be calculated as Equation (18), where r(Ŝ) is the reward obtained by the current model. The image description generated by our language generation model is used as an intermediate image caption, which will be retouched by the predicted middle-level attribute labels.
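For reference, the SCST objective and gradient estimate described by Equations (17) and (18) take the following standard form from [38], where Ŝ^s denotes a caption sampled from the current model and r(Ŝ) is the reward of the caption decoded by the current model, used as the baseline.

```latex
% Self-critical sequence training (SCST) objective and gradient estimate, following [38]:
\begin{aligned}
L(\theta) &= -\,\mathbb{E}_{\hat{S}^{s} \sim p_\theta}\!\left[r(\hat{S}^{s})\right], \\
\nabla_\theta L(\theta) &\approx -\left(r(\hat{S}^{s}) - r(\hat{S})\right)
  \nabla_\theta \log p_\theta(\hat{S}^{s}),
\end{aligned}
```

where r is the CIDEr score function used for optimization.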

Language Retouching of Intermediate Image Captioning
As we mentioned before, the generated intermediate image caption loses the information of the descriptive middle-level attributes. Therefore, the intermediate image caption is retouched by using the predicted middle-level attributes. It consists of two steps: (1) traversing the intermediate image caption to search for the same fragment as the key index, and (2) replacing the searched fragments with the corresponding short phrases.
Before retouching the intermediate image caption, the middle-level attribute labels generated by our ensemble attribute predictor are combined with the object label to form a short phrase (examples in Table 5). The object labels and the corresponding object regions are generated by the faster-RCNN model. Here, the short phrase is generated without grammar rules. Since the structure of the short phrase is relatively simple, we only need to arrange the middle-level attributes in order, just like a static template. For human objects, the order of the middle-level attributes is "age", "gender", "hair color". For non-human objects, the order of the words is "shape", "color", "texture", followed by the object label.
In the search step, each object label is used as a key index to search for the matching fragment in the intermediate image caption. After that, the matched fragment is replaced by the corresponding short phrase according to the key index. For example, in the intermediate image caption "a polar bear is standing on a rock with its mouth open", the word "rock" is replaced by "gray rough rock" after one instance of the searching and replacing steps. Once all searching and replacing steps are completed, the language retouching of the intermediate image caption is complete.
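As a concrete illustration, the following is a minimal sketch of the search-and-replace retouching step under the template ordering described above; the phrase-assembly details and data structures are illustrative rather than the authors' exact implementation (in particular, the connective wording of human phrases such as "with ... hair" is omitted).

```python
# Hedged sketch of language retouching: build a short phrase per detected object and
# substitute it for the matching object word in the intermediate caption.
HUMAN_ORDER = ["age", "gender", "hair color"]      # template order for human objects
NON_HUMAN_ORDER = ["shape", "color", "texture"]    # template order for non-human objects

def build_phrase(obj_label, attributes, is_human):
    """Arrange confident attributes in template order, then append the object label."""
    order = HUMAN_ORDER if is_human else NON_HUMAN_ORDER
    words = [attributes[k] for k in order if k in attributes]
    return " ".join(words + [obj_label]) if words else obj_label

def retouch(caption, detections):
    """detections: list of (object_label, attributes_dict, is_human) from the predictors."""
    for obj_label, attributes, is_human in detections:
        phrase = build_phrase(obj_label, attributes, is_human)
        if obj_label in caption:                               # key-index search
            caption = caption.replace(obj_label, phrase, 1)    # replace one fragment
    return caption

# Example from the paper:
caption = "a polar bear is standing on a rock with its mouth open"
detections = [("rock", {"color": "gray", "texture": "rough"}, False)]
print(retouch(caption, detections))
# -> "a polar bear is standing on a gray rough rock with its mouth open"
```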

Datasets
The datasets we used in our work are the MS COCO dataset [39], Flickr8K dataset [40], Flickr30K dataset [41], PubFig dataset [36], and a subset [37] of the ImageNet dataset. The first three are well-known datasets that can be used for object detection, image segmentation, and captioning. The PubFig dataset and the subset of the ImageNet dataset were used for training our middle-level attribute predictors (see Table 4) to predict the valuable attributes of human and non-human objects, respectively.
The official MS COCO dataset consists of 82,783 training images, 40,504 validation images, and 40,775 test images. However, in our work, we used the 'Karpathy' split [24] for reporting results, as in previous works. Therefore, the MS COCO dataset is split into 113,287 training images, 5000 validation images, and 5000 test images. Additionally, the Flickr8K dataset contains 6000 training images, 1000 validation images, and 1000 testing images. The Flickr30K dataset is an extension of the Flickr8K dataset and consists of 31,783 images; we used 28,000 images for training, 1000 images for validation, and 1000 images for testing. The above three datasets all contain five reference captions for each image.
The PubFig dataset consists of ∼10,000 images, and we used ∼7000 images for training, 1500 images for validation, and 1500 images for testing. The subset of ImageNet contains 9600 images, which are split into 6900 training images, 1350 validation images, and 1350 testing images.
We evaluated the quality of the generated captions with five widely used metrics: BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. BLEU can be used to evaluate the co-occurrences of n-grams between the reference sentences and the generated captions. METEOR is based on the harmonic mean of uni-gram precision and recall, and its evaluation is at the corpus level. ROUGE-L is based on the Longest Common Subsequence (LCS) of reference sentences and generated captions and is used to capture sentence-level structure. Different from the above evaluation metrics, CIDEr and SPICE are human consensus metrics. CIDEr can measure the similarity of the generated captions against a set of human-written sentences, and SPICE is used to measure how effectively the image captions recover attributes, objects, and the relationships between them.
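To illustrate how such scores can be computed in practice, the following is a minimal sketch using the coco-caption toolkit (distributed as the pycocoevalcap package); the example image id and caption strings are illustrative, and the toolkit expects tokenized, lower-cased captions keyed by image id.

```python
# Hedged sketch: scoring generated captions with the coco-caption (pycocoevalcap) toolkit.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# dict: image id -> list of caption strings (illustrative data only)
references = {"391895": ["a man riding a bike down a dirt road",
                         "a person on a bicycle in the countryside"]}
hypotheses = {"391895": ["a middle-aged male riding on a dirt bike in the middle of a field"]}

bleu_scores, _ = Bleu(4).compute_score(references, hypotheses)   # BLEU-1 .. BLEU-4
cider_score, _ = Cider().compute_score(references, hypotheses)
print("BLEU-4:", bleu_scores[3], "CIDEr:", cider_score)
```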

Experiment Setting and Result
The experiment described in our paper was divided into two parts: the training part and the test part (see Figure 3). In the training part, we aimed to obtain the model parameters by training our language generation model and middle-level attribute predictors. In the test part, we separately used the language generation model and the ensemble attribute predictor to generate the intermediate image caption and the middle-level attribute labels. Then, the attribute labels were used to retouch the intermediate image caption to generate a final well-described sentence.
In our language model, the Adam algorithm is used to optimize the cost function (see Equation (16)). The values of β1 and β2 used for the Adam algorithm are 0.9 and 0.999, respectively. The embedding size, which is used to map each generated word to a feature vector, is set to 1000. The number of LSTM hidden units is 1000, and the attention hidden size is set to 512. Furthermore, the sentence generation strategy we used in this research is the beam search method, which can ensure the quality of the generated intermediate sentence.
Finally, at each time step, based on the top-n sentences kept from the previous step, the new top-n sentences with the highest probabilities are selected. In our research, the value of n was set to 3. The GPU (graphics processing unit) we used is an NVIDIA TITAN X (Pascal).
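The following is a minimal sketch of beam search with beam width n = 3, as described above; the `step` function, which returns the candidate next words and their probabilities from the language generation model, is an assumed interface rather than the authors' implementation.

```python
# Hedged sketch of beam search decoding with beam width n = 3.
import math

def beam_search(step, start_token, end_token, beam_width=3, max_len=20):
    """step(seq) -> iterable of (next_token, probability) pairs from the language model."""
    beams = [(0.0, [start_token])]                    # (log probability, partial sentence)
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == end_token:                  # finished sentences are carried over
                candidates.append((logp, seq))
                continue
            for token, prob in step(seq):
                candidates.append((logp + math.log(prob), seq + [token]))
        # Keep only the top-n sentences with the highest probabilities (n = 3 here).
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(seq[-1] == end_token for _, seq in beams):
            break
    return max(beams, key=lambda c: c[0])[1]          # best complete sentence
```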

Results of Middle-Level Attribute Prediction
In the middle-level attribute prediction phase, the pre-trained VGG16 predictors are combined into an ensemble attribute predictor to predict the attribute labels of human and non-human objects. When the probability of the attribute predicted by a VGG16 predictor is not greater than the predefined threshold, the generated attribute label is not used to retouch the intermediate image caption. Figure 4 illustrates the results of the predicted middle-level attributes of humans. We show four representative results: a girl with brown hair (Figure 4a), a young female with blonde hair (Figure 4b), a middle-aged man with black hair (Figure 4c), and a senior man with brown hair (Figure 4d). From the results of the predicted middle-level attributes, we can observe that our ensemble attribute predictor can accurately predict the identification information of humans, covering people of different ages, genders, and hair colors.
Furthermore, we also display the results of the predicted middle-level attributes of non-human objects (see Figure 5). The displayed images cover four different categories, including animal (i.e., Figure 5a), artifact (i.e., Figure 5b), natural object (i.e., Figure 5c), and plant (i.e., Figure 5d). As observed, our ensemble attribute predictor can accurately predict the color information and texture information of non-human objects. For instance, the color 'green' and the texture 'vegetation' of cabbage are predicted precisely (see Figure 5d). Nevertheless, the shape information of animals may be lost (see Figure 5a). The reason is that the predicted shape labels of animals, including long, round, rectangular, and square, all have low probabilities. Therefore, the predicted shape labels of animals would not be applied to retouch the intermediate image caption.

Results of Retouched Image Captioning
We compared our proposed method with recent state-of-the-art methods for image captioning in the literature, such as soft attention [23], hard attention [23], Log-bilinear [21], ATT [18], F-G attention [42], GLA [25], and Topdown [28]. 'OUR' indicates our proposed method. Since the evaluation results of these methods on the MS COCO, Flickr8K, and Flickr30K datasets have been published publicly, we can compare with them directly. The evaluation results were generated by the coco-caption code.
Tables 6-8 show the detailed comparison results of our method with previous works. We observe that our proposed method achieves impressive performance when using the predicted middle-level attribute labels to retouch the intermediate image caption generated by the language model. Our main assumption is that the global image feature provides an overview of the raw image, which is significant for the machine to understand the visual content in a rough manner. However, in order to accurately identify the details of an image, the local image features and the middle-level attributes need to be taken into account.
Moreover, we observe that the performance of our method for image caption generation improves noticeably when using middle-level attribute labels to retouch the intermediate image description. Our conjecture is that the middle-level attributes provide fine-grained identification information of objects, which avoids the loss of visual information. The main difference between middle-level attributes and object features is that the middle-level attribute information is more detailed and more accurate, and can highlight deeper identification information of objects than object features.
A shortcoming of the attention mechanism is that it will weaken partial input information and generate one fused feature vector. If we try to inject the middle-level attributes and object features into the attention mechanism simultaneously, the role of the middle-level attributes will be weakened, since the middle-level attributes contain fine-grained identification information of objects.
Furthermore, we display some results of image captioning on the MS COCO, Flickr8K, and Flickr30K datasets in Figures 6 and 7. Figure 6 illustrates the sampled MS COCO images and their corresponding descriptions with and without using attributes. In Figure 6a-c, we can observe that the final image captions generated by our method accurately depict the details of the objects when using the middle-level attributes to retouch the intermediate image caption, including the age and gender of humans, and the color and texture of non-human objects. Meanwhile, from Figure 7, which illustrates the sampled Flickr images and their descriptions, we can observe results similar to those in Figure 6.
However, in Figure 6d, the generated final image caption has a descriptive error when using the middle-level attributes to retouch the intermediate image caption: the word "boat" is retouched by "yellow" and "white" simultaneously. The reason the middle-level attributes negatively affected the generated caption is that there are two instances of the same object, "boat", with different colors. Therefore, we need to be cautious about such issues in future work. One optional approach is to generate multiple image captions which depict different aspects of a given image. Similarly, the youth male is mistakenly described as a middle-aged male in Figure 7d.
Finally, the results above show that our proposed method can solve the problem of missing middle-level attributes, to some degree, and can make the final image caption more accurate.
Example Before/After captions from Figures 6 and 7 (Before: intermediate caption; After: retouched final caption):
Before: a woman in a dress holding a tennis racket playing tennis. After: a young female with brown hair in a white dress holding a tennis racket playing tennis.
Before: a polar bear is standing on a rock with its mouth open. After: a white furry polar bear is standing on a gray rough rock with its mouth open.
Before: a boy riding a snowboard down a snow-covered slope. After: a young boy riding a green smooth snowboard down a snow-covered slope.
Before: a yellow boat sitting on top of a sandy beach. After: a yellow long white smooth boat sitting on top of a sandy beach.
Before: a child and a woman playing with a ball in a field. After: a child and a young female with brown hair playing with a white ball in a field.
Before: a woman is riding a bicycle with a man on the back of it. After: a youth female with blonde hair is riding a white metallic bicycle with a middle-aged male on the back of it.
Before: a man riding on a dirt bike in the middle of a field. After: a middle-aged male riding on a dirt bike in the middle of a field.
After: a youth boy with a red coat riding a red skateboard up the side of a ramp.

Conclusions
In this research, we addressed a problem of existing methods: the encoder-decoder framework is used to generate the final image caption directly, which may ignore the significant identification information carried by the middle-level attributes of the raw image, so the generated descriptions of objects are not accurate enough.
We propose an MLALR method for image caption generation, which can solve the aforementioned problem and make the final generated image caption more accurate. Our proposed MLALR method first uses the global and the local image features to generate the intermediate image caption. Then, it uses the middle-level descriptive attributes, which are predicted from the object regions of the raw image, to retouch the intermediate image caption according to the object index.
We validated our proposed method with several well-known evaluation metrics: BLEU, METEOR, ROUGE-L, CIDEr, and SPICE. The evaluation results and the final generated image descriptions show that our proposed method can correct descriptive errors in the intermediate image description to some degree and make the final generated image caption more accurate. However, the middle-level attributes used in this research are describable aspects of visual appearance; we did not consider whether people wear accessories, such as glasses, necklaces, watches, etc. Therefore, in future work, we will try to extract information about such accessories from the original image to make the final image captions more detailed.