Panoptic Segmentation-Based Attention for Image Captioning

Abstract: Image captioning is the task of generating textual descriptions of images. In order to obtain a better image representation, attention mechanisms have been widely adopted in image captioning. However, in existing models with detection-based attention, the rectangular attention regions are not fine-grained, as they contain irrelevant regions (e


Introduction
Image captioning, the task of automatically generating natural language descriptions of images, has received increasing attention in computer vision and natural language processing. This task has several important practical applications; for example, it can help people with visual impairments. It therefore requires accurate recognition of the objects and a thorough understanding of the images. With the advances in deep neural networks, image captioning models now tend to use the "encoder-decoder" framework. In this framework, a convolutional neural network (CNN) is used to encode images into vectors, and a recurrent neural network (RNN) or one of its variants, such as the LSTM [1], is used to generate captions step by step.
The main problem in image captioning is the coarse representation of images. In the vanilla encoder-decoder framework, the encoder simply compresses an entire image into a global representation. This representation is coarse and has two drawbacks. First, it is fixed and thus does not correspond to the dynamic decoding process of caption generation. Second, it does not contain the spatial structures of the image. In order to obtain a fine-grained image representation, visual attention mechanisms [2][3][4][5][6] have been widely adopted. These mechanisms aim to focus on a specific region while generating the corresponding word.
However, the image features in existing attention-based methods are not fine-grained. Xu et al. [2] proposed a top-down attention mechanism that represents the image with the features from a convolutional layer of the CNN, allowing the model to preserve the spatial information and dynamically attend to different regions when generating words.
In the top-down attention mechanism, however, the attention regions correspond to a uniform grid of receptive fields. Since the sizes and shapes of these receptive fields are equal, they are independent of the content of the image. With the advances in object detection, detection-based attention mechanisms [3][4][5] were proposed to enable the model to attend to the detected salient image regions. Compared with the top-down attention mechanism, the detection-based attention mechanism performs better, as it can generate variable numbers and sizes of rectangular attention regions. However, it is still not fine-grained enough, because the rectangular attention regions may contain other objects or background.
In this paper, we introduce a novel attention mechanism called panoptic segmentation-based attention, as illustrated in Figure 1. This mechanism comes from the goal of finding more fine-grained attention regions. Naturally, image segmentation, a more fine-grained form of detection, was taken into consideration, and we first considered instance segmentation. Instance segmentation [7,8] is a more challenging task than object detection, as it requires not only detecting all objects in an image but also segmenting the instances of each object class. The results from instance segmentation only contain the main part of the object and do not include the background or overlapped regions. Therefore, we can obtain more fine-grained attention regions with the aid of instance segmentation. However, instance segmentation only segments instances of "things" classes (countable objects with specific shapes; e.g., cars and persons), neglecting "stuff" classes [9] (amorphous background regions; e.g., sky and grass). Losing the information of the stuff regions may weaken the model's ability to understand the scene. Recently, Kirillov et al. [10] proposed panoptic segmentation, which performs segmentation for all classes (things and stuff) in the image. Inspired by this work, we built our attention mechanism upon this idea. Based on the segmentation regions generated from panoptic segmentation, our method extracts image features based on the shape of the segmentation regions and generates captions based on the attention-weighted features. As shown in Figure 2, compared with the detection-based attention mechanisms, whose attention regions contain irrelevant regions around the objects, the attention regions of our approach contain one instance in each region with irrelevant regions masked out, which is more fine-grained and can avoid the negative impact of the background or overlapped regions. Moreover, while the detection-based attention mechanisms only detect things classes and extract scene information in a top-down way [3,6], our approach processes things and stuff classes independently via a dual-attention module. Incorporating the features of stuff regions can provide richer context information and yield better performance.
The main contributions of this paper are:
• We introduce a novel panoptic segmentation-based attention mechanism, together with a dual-attention module, that can focus on more fine-grained regions at the mask level when generating captions. To the best of our knowledge, we are the first to incorporate panoptic segmentation into image captioning.
• We explore and evaluate the impact of combining segmentation features and stuff regions on image captioning. Our study reveals the significance of the fine-grained attention region features and the scene information provided by stuff regions.
• We evaluate our proposed method on the MSCOCO [11] dataset. Results show that our approach outperforms the baseline detection-based attention approach, improving the CIDEr score from 112.3 to 119.4, and achieves competitive performance against state-of-the-art methods.

Related Work
Image Captioning. Image captioning models with the encoder-decoder framework have been widely studied. In recent works [2][3][4][12][13][14][15][16], attention mechanisms have been introduced to the encoder-decoder framework to obtain a better image representation. Xu et al. [2] first proposed an attention mechanism in image captioning, where weighted CNN features from a convolutional layer are fed to the encoder-decoder framework. Lu et al. [15] developed an adaptive attention mechanism to decide whether to attend to the image or the caption at each time step. Leveraging object detection, subsequent works perform attention in different ways. You et al. [13] trained a set of visual concept detectors and performed attention over the detected concepts. Jin et al. [3] generated salient regions of an image by selective search [17] and fed these regions to an attention-based decoder. Similarly to Jin et al. [3], Pedersoli et al. [4] generated object detection proposals and applied a spatial transformer network [18] to them to obtain more accurate regions. Recent detection-based attention methods [6,19,20] use Faster R-CNN [21] to generate detection regions, which significantly increases the quality of the generated captions. However, their attention regions are not fine-grained, as they contain other objects or background within their rectangular area.
As object detection models only detect things classes, scene information needs to be provided to the captioning model. With the aid of latent Dirichlet allocation [22], Fu et al. [23] generated topic vectors from the corpus of captions to represent the scene-specific contexts. Anderson et al. [6] trained an object detector on the Visual Genome [24] dataset, which provides annotations of richer categories (e.g., tree, water). The features of the detected regions in an image are then averaged and regarded as scene information. Unlike the above methods, our method not only has more fine-grained attention regions but also incorporates stuff regions to obtain richer context information.
Two works that incorporate segmentation into their attention mechanisms are similar to ours. Liu et al. [25] proposed the mask pooling module in video captioning to pool the features according to the shape of the masks. This module is similar to our feature extraction process, but it only considers the things classes and therefore lacks scene information from stuff classes. Zhang et al. [26] proposed the FCN-LSTM network, which incorporates the segmentation information generated by a semantic segmentation model, FCN [27], into attention. However, their attention regions are the same as in [2], which are not fine-grained, and the segmentation information is merely used to guide the attention. Moreover, their models are incapable of distinguishing instances of the same class in an image, as they use semantic segmentation. As opposed to their methods, our method incorporates panoptic segmentation to distinguish not only things and stuff classes but also their instances, with fine-grained attention regions. Our work also places emphasis on demonstrating the advantage of fine-grained segmentation regions over detection regions.
Other works have focused on other issues in image captioning. Rennie et al. [28] used reinforcement learning to directly optimize the evaluation metric. Dai et al. [29] studied the impact of an RNN with 2D hidden states. Some recent works [30,31] explored the use of graph convolutional networks (GCN) to encode images and better model visual relationships. Other works attempt to increase the diversity of the captions [32,33].
Instance and Semantic Segmentation. Instance segmentation [7,8,34,35] and semantic segmentation [27][36][37][38] are two similar tasks but usually employ different approaches. The aim of instance segmentation is to detect and segment each object instance in an image. Most instance segmentation approaches [7,8,34] modify object detection networks to output a ranked list of segments instead of bounding boxes. Hence, instance segmentation can distinguish individual object instances, but only for things classes. Semantic segmentation aims to assign a class label to each pixel in an image. It is capable of segmenting stuff and things classes but does not distinguish the individual instances.
Due to the inherent difference mentioned above, although both semantic and instance segmentation techniques aim to segment an image, they had not been unified hitherto. Recently, Kirillov et al. [10] proposed "panoptic segmentation", which unifies the above tasks and requires jointly segmenting things and stuff at the instance level. Our method was developed upon panoptic segmentation and performs attention over the segmentation regions.

Captioning Models
In this section, we first describe the generic encoder-decoder image captioning framework (Section 3.1). Then, we describe the up-down attention model (Section 3.2). Our panoptic segmentation-based attention mechanism is based on the up-down attention model with features from segmentation regions. For comparison, we also describe a baseline detection-based attention model with features from detection regions. Finally, we introduce the dual-attention module that we propose for panoptic segmentation features (Section 3.3).

Encoder-Decoder Framework
First, we briefly introduce the encoder-decoder framework [39]. This framework takes an image $I$ as input and generates a sequence of words $w = \{w_0, \dots, w_T\}$.
In this framework, captions are generated by an LSTM. At a high level, the hidden state of the LSTM is modeled as:
$$h_t = \mathrm{LSTM}(x_t, h_{t-1})$$
where $x_t$ is the input vector and $h_{t-1}$ is the previous hidden state. For notational convenience, we do not show the propagation of the memory cell.
At each time step $t$, the probability distribution of the output word is given by:
$$p_\theta(w_t \mid w_0, \dots, w_{t-1}, I) = \mathrm{softmax}(W_h h_t) \tag{2}$$
Here we omit the bias term. $W_h \in \mathbb{R}^{\Sigma \times d}$, where $\Sigma$ is the size of the vocabulary and $d$ is the dimension of the hidden state. $\theta$ denotes the parameters of the model.
Given the target ground truth sentence $w^* = \{w^*_0, \dots, w^*_T\}$, the encoder-decoder framework is trained to maximize the probability of $w^*$. By applying the chain rule to model the joint probability over $w^*_0, \dots, w^*_T$, the objective is to minimize the sum of the negative log likelihood:
$$L(\theta) = -\sum_{t=1}^{T} \log p_\theta(w^*_t \mid w^*_0, \dots, w^*_{t-1}, I)$$
where $T$ is the total length of the caption.
In the vanilla encoder-decoder framework, the image is only input once, at $t = 0$, to inform the LSTM about the image contents. For $t \ge 1$, the input $x_t$ is the previously generated word, given by:
$$x_t = E\,w_{t-1}, \quad t \ge 1$$
with the image feature $\mathrm{CNN}(I) \in \mathbb{R}^D$ supplied as the input at $t = 0$, where $D$ is the dimension of the image features and $E$ is the word embedding matrix. The beginning of the sentence $w_0$ and the end of the sentence $w_T$ are marked with a BOS token and an EOS token, respectively.
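As a minimal sketch of the training objective described above, the following NumPy snippet computes the summed negative log likelihood of a target caption from a sequence of LSTM hidden states. The function names, the toy dimensions, and the random inputs are illustrative assumptions and are not taken from our implementation; the LSTM itself is not modeled.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def caption_nll(hidden_states, W_h, target_ids):
    """Sum of negative log likelihoods of the target words.

    hidden_states: (T, d) array, one hidden state h_t per decoding step.
    W_h:           (V, d) output projection (bias omitted, as in the text).
    target_ids:    (T,) integer indices of the ground-truth words.
    """
    probs = softmax(hidden_states @ W_h.T)              # (T, V) word distributions
    picked = probs[np.arange(len(target_ids)), target_ids]
    return float(-np.log(picked).sum())

# Toy example with random values (hypothetical sizes).
rng = np.random.default_rng(0)
T, d, V = 4, 8, 10
h = rng.normal(size=(T, d))
W_h = rng.normal(size=(V, d))
targets = np.array([1, 5, 2, 9])
loss = caption_nll(h, W_h, targets)
```

With an all-zero projection matrix every word is equally likely, so the loss reduces to $T \log V$, which is a convenient sanity check.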

Up-Down Attention Model
We adopt the framework of the up-down attention model [6] with our segmentation-region/detection-region features. This model is composed of an attention LSTM, which generates the attention weights, and a language LSTM, which generates words. Their hidden states are denoted by $h^1_t$ and $h^2_t$, respectively. Given $k$ image regions, the features of these regions $v$ are given by:
$$v = \{v_1, \dots, v_k\}, \quad v_i \in \mathbb{R}^D \tag{5}$$
The input to the attention LSTM is the concatenation of the mean-pooled image feature, the previous hidden state of the language LSTM, and the previously generated word:
$$x^1_t = [\bar{I}, h^2_{t-1}, E\,w_{t-1}]$$
where the mean-pooled image feature $\bar{I}$ provides the attention LSTM with the global content of the image. Note that, unlike [6], where $\bar{I} = \frac{1}{k}\sum_{i=1}^{k} v_i$ is the average feature of the detected regions, in this paper $\bar{I}$ is the average feature of the uniform grid of image regions. In the baseline detection-based attention model, the image features $v$ are the features of detection regions. In order to obtain a fine-grained representation of the images, the image features $v$ in our method are the features of segmentation regions. The definitions of detection/segmentation regions are given in Sections 4.1 and 4.2, and the image features $v$ are described in Section 5.
The input to the language LSTM is the concatenation of the attention-weighted image feature and the hidden state of the attention LSTM:
$$x^2_t = [I_t, h^1_t]$$
where the attention-weighted image feature $I_t$ is the weighted sum of the $v_i$:
$$I_t = \sum_{i=1}^{k} \alpha^i_t v_i, \qquad \alpha_t = \mathrm{softmax}(a_t), \qquad a^i_t = W_a^\top \tanh(W_{av} v_i + W_{ah} h^1_t)$$
where $\alpha^i_t$ are the normalized attention weights, $W_a \in \mathbb{R}^{A}$, $W_{av} \in \mathbb{R}^{A \times D}$, and $W_{ah} \in \mathbb{R}^{A \times d}$; $A$ is the dimension of the attention layer. For the sake of simplicity, we denote $\{a^1_t, \dots, a^k_t\}$ by $a_t$ and $\{\alpha^1_t, \dots, \alpha^k_t\}$ by $\alpha_t$. Then, the hidden state of the language LSTM is used to generate the distribution of the next word following (2), with $h_t$ replaced by $h^2_t$ as in [6]. The other parts of the model remain the same as defined in Section 3.1.
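The attention step of the up-down model can be sketched in a few lines of NumPy. The additive score form $W_a^\top \tanh(W_{av} v_i + W_{ah} h^1_t)$ follows [6] and is consistent with the parameter shapes stated above, but it should be read as an assumption about the exact form used; the function name and the toy dimensions are hypothetical.

```python
import numpy as np

def soft_attention(v, h1, W_a, W_av, W_ah):
    """Attend over k region features.

    v:    (k, D) region features.
    h1:   (d,)   hidden state of the attention LSTM.
    W_a:  (A,), W_av: (A, D), W_ah: (A, d) attention parameters.
    Returns the normalized weights (k,) and the attended feature (D,).
    """
    scores = np.tanh(v @ W_av.T + W_ah @ h1) @ W_a   # (k,) unnormalized scores
    scores = scores - scores.max()                   # stabilize the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()    # normalized weights
    attended = alpha @ v                             # weighted sum of regions
    return alpha, attended

# Toy example with random values (hypothetical sizes).
rng = np.random.default_rng(0)
k, D, d, A = 5, 6, 4, 3
v = rng.normal(size=(k, D))
h1 = rng.normal(size=d)
alpha, I_t = soft_attention(
    v, h1,
    rng.normal(size=A), rng.normal(size=(A, D)), rng.normal(size=(A, d)),
)
```

When all scores are equal (for example, with $W_a = 0$), the weights become uniform and the attended feature is simply the mean of the region features.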
Reinforcement training [28] is also used to directly optimize the CIDEr [40] metric. For a sampled sentence $w$, a reward function $r(w)$ denoting the CIDEr score of $w$ is used to measure the quality of the sentence. Following [28], training with this reward increases the probability of sampled captions that have a higher CIDEr score.
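For reference, the policy-gradient estimate used in self-critical training, as we understand [28], can be written as below. This is a sketch rather than a statement of our exact training code: the symbols $w^s$ (a caption sampled from the model) and $\hat{w}$ (the caption obtained by greedy decoding, used as the baseline) are not introduced in the text above and are assumptions about the variant being followed.

```latex
\nabla_\theta L(\theta) \approx -\bigl(r(w^s) - r(\hat{w})\bigr)\,\nabla_\theta \log p_\theta(w^s)
```

Under this estimate, sampled captions whose CIDEr reward exceeds that of the greedy caption have their probability increased, and those that fall below it have their probability decreased.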

Dual-Attention Module for Panoptic Segmentation Features
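In panoptic segmentation, the segmentation regions include things and stuff classes. As they convey different kinds of information, they have to be processed separately.
In this section, we describe the Dual-Attention Module we propose for panoptic segmentation features. Our framework and the Dual-Attention Module are shown in Figure 3. In this case, the image features $v$ in (5) include the features of things classes $v_t$ and the features of stuff classes $v_s$; i.e., $v = [v_t, v_s]$, where $v_t$ and $v_s$ are given by:
$$v_t = \{v^t_1, \dots, v^t_{k_t}\}, \qquad v_s = \{v^s_1, \dots, v^s_{k_s}\}$$
where $k_t$ is the number of things regions and $k_s$ is the number of stuff regions. The attention process is applied separately to each group, giving one attention-weighted feature for the things regions and one for the stuff regions:
$$I^t_t = \sum_{i=1}^{k_t} \alpha^{t,i}_t v^t_i, \qquad I^s_t = \sum_{j=1}^{k_s} \alpha^{s,j}_t v^s_j$$
where $\alpha^{t,i}_t$ and $\alpha^{s,j}_t$ are the normalized attention weights over the things and stuff regions, respectively. The final image feature $I_t$ is the concatenation of the attention-weighted features $I^s_t$ and $I^t_t$. Other parts of the model remain the same.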
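The dual-attention idea can be sketched as two independent attention passes whose outputs are concatenated. In this sketch each pathway has its own parameter set, the score function follows the additive form of [6], and the concatenation order is (stuff, things); all three choices, as well as the function names and toy sizes, are assumptions made for illustration. Images with no stuff regions or no things regions are not handled here.

```python
import numpy as np

def attend(v, h1, W_a, W_av, W_ah):
    # Additive attention over the rows of v, returning the weighted sum.
    scores = np.tanh(v @ W_av.T + W_ah @ h1) @ W_a
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v

def dual_attention(v_things, v_stuff, h1, things_params, stuff_params):
    """Attend to things and stuff regions separately, then concatenate.

    v_things: (k_t, D), v_stuff: (k_s, D), h1: (d,).
    Each params tuple is (W_a, W_av, W_ah) for one pathway (assumed separate).
    Returns a (2 * D,) vector.
    """
    i_things = attend(v_things, h1, *things_params)
    i_stuff = attend(v_stuff, h1, *stuff_params)
    return np.concatenate([i_stuff, i_things])

# Toy example with random values (hypothetical sizes).
rng = np.random.default_rng(0)
D, d, A = 6, 5, 4

def make_params():
    return (rng.normal(size=A), rng.normal(size=(A, D)), rng.normal(size=(A, d)))

v_things = rng.normal(size=(3, D))   # k_t = 3 things regions
v_stuff = rng.normal(size=(2, D))    # k_s = 2 stuff regions
h1 = rng.normal(size=d)
I_t = dual_attention(v_things, v_stuff, h1, make_params(), make_params())
```

Because the two softmax normalizations are independent, a large number of things regions cannot drive the weight on stuff regions toward zero, which is one motivation for processing the two groups separately.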

Attention Regions
In this section, we describe how to obtain the detection and segmentation regions for later feature extraction.

Detection Regions
We first describe how to obtain the detection regions for the detection-based attention model. Given an image, the output of an object detection model is a set of bounding boxes (i.e., the rectangular boxes that contain the instances):
$$b = \{b_1, \dots, b_L\}, \quad b_i = (x_{\min}, y_{\min}, x_{\max}, y_{\max})$$
where $b_i$ contains the coordinates of the $i$-th bounding box and $L$ is the number of instances in the image. Here we do not show the output category prediction.
As current object detection models only detect things classes, $b$ does not contain regions that belong to stuff classes.

Segmentation Regions
Next, we describe how to obtain the segmentation regions for the segmentation-based attention model. Given an image, a segmentation model generates a set of binary masks, one for each instance in the image:
$$m = \{m_1, \dots, m_L\}, \quad m_i \in \{0, 1\}^{H \times W} \tag{21}$$
where $m_i$ is the mask indicating which pixels belong to the $i$-th instance, and $H$ and $W$ are the height and width of the image, respectively. Here we also omit the output category prediction. Note that in panoptic segmentation $m$ contains things and stuff classes, while in instance segmentation $m$ only contains things classes.

Feature Extraction
In this section, we describe how we extract the features $v$ of the segmentation/detection regions that are used in the image captioning model.
For segmentation-region features, we extract the image features by convolutional feature masking [41], denoted as the CFM approach. As shown in Figure 4, given an image $I$, we first obtain the masks $m$ of the image as in (21). Meanwhile, the feature maps $\mathrm{CNN}(I)$ of the image are extracted from the convolutional layer of a pre-trained CNN. The masks are resized to match the size of the feature maps. The final segmentation-region features are given by:
$$v_i = \mathrm{CNN}(I) \odot \tilde{m}_i$$
where $\tilde{m}_i$ denotes the resized mask of the $i$-th region and the element-wise product $\odot$ is applied to every channel of the output feature maps. For detection-region features, we scale the coordinates to match the size of the feature maps. The final detection-region features are given by:
$$v_i = \mathrm{crop}(\mathrm{CNN}(I), \tilde{b}_i)$$
where $\tilde{b}_i$ denotes the rescaled coordinates of the $i$-th bounding box and the crop() operation crops the output feature maps based on these coordinates. Similarly, this operation is applied to every channel of the output feature maps. Note that, unlike [41], in order to obtain richer information, the pixel values of the resized masks are obtained by averaging without thresholding.
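The two feature extraction paths can be sketched as follows. This is an illustration under stated assumptions, not our released code: the mask is resized by block averaging, which requires the image size to be an integer multiple of the feature-map size; region features are reduced to a vector by a plain spatial mean, which is our reading of the 2048-dimensional "mean output" mentioned in the implementation details; and the box rounding rule (floor for the minimum, ceil for the maximum) is hypothetical.

```python
import numpy as np

def resize_mask_by_averaging(mask, out_h, out_w):
    """Downsample a binary (H, W) mask to (out_h, out_w) by block averaging.

    The result is a soft mask in [0, 1]; no thresholding is applied.
    """
    H, W = mask.shape
    assert H % out_h == 0 and W % out_w == 0, "sketch assumes integer block sizes"
    blocks = mask.reshape(out_h, H // out_h, out_w, W // out_w)
    return blocks.mean(axis=(1, 3))

def segmentation_region_feature(feature_maps, mask):
    """Convolutional feature masking: mask every channel, then mean-pool."""
    C, h, w = feature_maps.shape
    soft_mask = resize_mask_by_averaging(mask.astype(np.float64), h, w)
    masked = feature_maps * soft_mask          # broadcast over the channel axis
    return masked.mean(axis=(1, 2))            # (C,) region feature

def detection_region_feature(feature_maps, box, image_hw):
    """Crop the feature maps with a rescaled bounding box, then mean-pool."""
    C, h, w = feature_maps.shape
    H, W = image_hw
    x_min, y_min, x_max, y_max = box
    x0 = int(np.floor(x_min * w / W))
    y0 = int(np.floor(y_min * h / H))
    x1 = min(w, max(x0 + 1, int(np.ceil(x_max * w / W))))
    y1 = min(h, max(y0 + 1, int(np.ceil(y_max * h / H))))
    return feature_maps[:, y0:y1, x0:x1].mean(axis=(1, 2))

# Toy example: 3-channel 2x2 feature maps for a 4x4 image.
feature_maps = np.ones((3, 2, 2))
mask = np.zeros((4, 4), dtype=np.uint8)
mask[:2, :2] = 1                               # instance fills the top-left quadrant
seg_feat = segmentation_region_feature(feature_maps, mask)
det_feat = detection_region_feature(feature_maps, (0, 0, 2, 2), (4, 4))
```

In this toy case the soft mask keeps one of the four feature-map cells, so the segmentation feature is 0.25 per channel, while the cropped detection feature is 1.0 per channel.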

Dataset
Extensive experiments were performed to evaluate our proposed method. All the results were based on the MSCOCO dataset. For offline validation and testing, we adopted the "Karpathy" split [42], which has been widely used in prior work. The training, validation, and test sets contained 113,287, 5000, and 5000 images, respectively, along with five captions per image. We truncated captions longer than 16 words and removed the words that appeared fewer than five times, resulting in a vocabulary of 9587 words.
The COCO-Stuff [9] dataset contains 80 things classes, 91 stuff classes, and one unlabeled class. The stuff classes are organized in a hierarchical way and belong to 15 parent categories. We omitted the unlabeled class. Since the stuff regions are often scattered and image captioning does not need a fine classification of the stuff sub-classes, we used a compact representation for stuff. The regions of the sub-classes that belong to the same parent category were merged by adding the masks of these sub-classes together, resulting in 15 parent stuff categories.
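The merging of stuff sub-class masks into parent categories can be sketched as below. The function name and the small sub-class-to-parent mapping are illustrative assumptions and do not reproduce the full COCO-Stuff hierarchy. The sketch uses a logical OR, which equals adding the masks when the sub-class masks are disjoint, as is the case for the output of a semantic segmentation model.

```python
import numpy as np

def merge_stuff_masks(sub_masks, parent_of):
    """Merge binary sub-class masks into one mask per parent category.

    sub_masks: dict mapping sub-class name -> (H, W) binary array.
    parent_of: dict mapping sub-class name -> parent category name.
    Returns a dict mapping parent category name -> (H, W) uint8 mask.
    """
    merged = {}
    for name, mask in sub_masks.items():
        parent = parent_of[name]
        m = mask.astype(bool)
        merged[parent] = m if parent not in merged else (merged[parent] | m)
    return {parent: m.astype(np.uint8) for parent, m in merged.items()}

# Toy example on a 2x3 image with a hypothetical mapping.
parent_of = {"grass": "plant", "tree": "plant", "clouds": "sky"}
sub_masks = {
    "grass":  np.array([[0, 0, 0], [1, 1, 0]]),
    "tree":   np.array([[0, 0, 0], [0, 0, 1]]),
    "clouds": np.array([[1, 1, 1], [0, 0, 0]]),
}
parent_masks = merge_stuff_masks(sub_masks, parent_of)
```

Here the scattered "grass" and "tree" pixels end up in a single "plant" region, so the captioning model receives one feature per parent category instead of several fragments.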

Implementation Details
In our experiments with the up-down attention model, the hidden states of the language LSTM and the attention LSTM and the word embeddings all had a dimension of 1000. The dimension $A$ of the attention layer was 512. We used the Adam [43] optimizer with an initial learning rate of $5 \times 10^{-4}$. The weight decay and momentum were $1 \times 10^{-4}$ and 0.9, respectively. We set the batch size to 100 and trained the models for up to 50 epochs. In order to further boost performance, we trained the models with reinforcement learning for another 50 epochs.
We used a pre-trained ResNet-101 [44] as our CNN model to extract image features. The image feature $v_i$ was the mean output of the last convolutional layer of ResNet-101, and thus had a dimension of 2048.
As there is currently no available model that jointly performs both parts of panoptic segmentation, we used Mask R-CNN [7] to perform instance segmentation for things classes and DeepLab [36] to perform semantic segmentation for stuff classes. Their outputs were merged to represent the result of panoptic segmentation. The minimum detection confidence of Mask R-CNN was 0.6 and the non-maximum suppression threshold was 0.5. The code of our method can be accessed at https://github.com/jamiechoi1995/PanoSegAtt.
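One simple way to merge the two outputs is sketched below. Only the 0.6 confidence threshold comes from the text; the function name, the input format (a list of instance masks with scores plus a per-pixel stuff label map), and the rule that things pixels take priority over stuff pixels are assumptions made for illustration and may differ from the released code.

```python
import numpy as np

def merge_to_panoptic(instance_masks, instance_scores, stuff_label_map,
                      stuff_ids, min_score=0.6):
    """Combine instance masks (things) and a semantic label map (stuff).

    instance_masks:  list of (H, W) binary arrays from instance segmentation.
    instance_scores: list of detection confidences, one per instance mask.
    stuff_label_map: (H, W) integer array of semantic labels.
    stuff_ids:       labels in stuff_label_map that denote stuff categories.
    Returns (things, stuff), two lists of boolean (H, W) masks.
    """
    things = [m.astype(bool)
              for m, s in zip(instance_masks, instance_scores) if s >= min_score]
    occupied = np.zeros(stuff_label_map.shape, dtype=bool)
    for m in things:
        occupied |= m
    stuff = []
    for sid in stuff_ids:
        # Assumed rule: pixels already claimed by a thing are removed from stuff.
        m = (stuff_label_map == sid) & ~occupied
        if m.any():
            stuff.append(m)
    return things, stuff

# Toy example on a 2x2 image with hypothetical labels 7 and 8.
instance_masks = [np.array([[1, 1], [0, 0]]), np.array([[0, 0], [1, 0]])]
instance_scores = [0.9, 0.3]                  # the second instance is filtered out
stuff_label_map = np.array([[7, 7], [7, 8]])
things, stuff = merge_to_panoptic(instance_masks, instance_scores,
                                  stuff_label_map, stuff_ids=[7, 8])
```

In this example one thing mask survives the confidence filter, and the stuff region with label 7 is reduced to the single pixel not covered by that thing.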
We denote the model that uses panoptic segmentation features by PanopticSegAtt, and denote the baseline model that uses detection features by DetectionAtt. To evaluate the impact of stuff regions, we also propose InstanceSegAtt, a model that only uses instance segmentation features (i.e., without stuff regions), as another baseline.

Evaluation
In this section, we first compare our results with state-of-the-art models on the MSCOCO dataset. To demonstrate the effect of the segmentation-region features, we conducted qualitative and quantitative analyses of the difference between InstanceSegAtt and DetectionAtt. Moreover, to demonstrate the effect of the features of stuff regions, we also conducted qualitative and quantitative analyses of the difference between PanopticSegAtt and InstanceSegAtt. We report results using the COCO captioning evaluation tool [11], which reports the following metrics: BLEU [45], METEOR [46], ROUGE-L [47], and CIDEr [40]. Table 1 shows the overall results on the MSCOCO dataset.
Compared with the method of Zhang et al. [26], which is most similar to our method, our PanopticSegAtt model is better in all metrics by a large margin. We attribute this to the more fine-grained attention regions and the incorporation of panoptic segmentation in our method. We also compared our PanopticSegAtt with PanopticSegAtt (w/o Dual-Attend). The full model improved the CIDEr score from 118.2 to 119.4, which shows that with the dual-attention module, our model can generate more accurate captions.
We then compared our method with the typical attention methods [2][3][4][13][14][15]. For example, SCA-CNN [14] uses spatial and channel-wise attention in the CNN. Lu et al. [15] adaptively attend to the image and caption during decoding. Our PanopticSegAtt model significantly outperforms these methods in all metrics, which demonstrates the power of our panoptic segmentation-based attention mechanism. We also compared our method with state-of-the-art methods [48][49][50][51][52]. Our PanopticSegAtt model outperforms these methods in most of the metrics, especially on the CIDEr metric, which is considered to be the metric most aligned with human judgments. Note that Stack-Cap [53] has higher scores, as its model has three LSTMs to perform coarse-to-fine decoding, which is more complex than our method. Since Anderson et al. [6] used the extra Visual Genome [24] dataset to train the object detector, their attention regions are much richer than ours. Thus, we did not compare with these two methods directly.
Figure 5 shows the distribution of the CIDEr scores of the captions in the Karpathy test split for DetectionAtt, InstanceSegAtt, and PanopticSegAtt. Captions with a score from 0 to 1 are not accurate enough to describe the images. In this interval, DetectionAtt has the most captions, InstanceSegAtt is second, and PanopticSegAtt has the fewest. Captions with a score from 1 to 3 can accurately describe the images. In this interval, PanopticSegAtt has the most captions, InstanceSegAtt is second, and DetectionAtt has the fewest. In the interval where the score is greater than 3, the number of captions is almost the same for the three methods. This is because the images in this interval are simple, so that all the models work well for them. The above results indicate the advantages of using the segmentation features and the stuff regions in our method. We also evaluated our model on the online COCO test server, as shown in Table 2. Our PanopticSegAtt model achieves scores comparable to those of the state-of-the-art models.
We used Mask R-CNN [7] to perform both instance segmentation and object detection, which resulted in equal numbers of attention regions in both tasks. Thus, the difference between segmentation regions and detection regions is that the segmentation regions are more fine-grained and better match the shape of the instances. We compared InstanceSegAtt with DetectionAtt to demonstrate the impact of segmentation-region features.
As shown in Table 1, the comparison against DetectionAtt verifies the effectiveness of using segmentation-region features. Our InstanceSegAtt model improves the CIDEr score from 112.3 to 117.3 compared with the DetectionAtt model. The performance gap between InstanceSegAtt and DetectionAtt demonstrates that using features from more fine-grained regions is beneficial.
The training curves in terms of the CIDEr metric are shown in Figure 6. Throughout the 50 epochs of training, the InstanceSegAtt model consistently surpasses the DetectionAtt model. This result indicates that, since segmentation regions do not include irrelevant regions that have a negative impact on captioning models during training, using segmentation-region features leads to better convergence and performance.
Dense Annotation Split. In order to better evaluate the effect of using features of fine-grained segmentation regions, we present a new split of the COCO dataset called the dense annotation split. This is based on the intuition that the segmentation-based attention model ought to distinguish instances even if they overlap, while the detection-based attention model may be confused by the features of the rectangular regions, which include irrelevant regions. The dense annotation split consists of images in which some of the instances are highly overlapped. We generated this split by selecting images for which the IoU (intersection-over-union) between any two of their instances was over 0.5, resulting in 12,967, 570, and 608 images from the training, validation, and testing sets of the Karpathy split [42], respectively. We then evaluated the performance of InstanceSegAtt and DetectionAtt on this split.
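The selection rule for the dense annotation split can be sketched as follows. The text does not state whether the IoU is computed on bounding boxes or on masks; this sketch assumes bounding boxes in $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$ format, and the function names are hypothetical.

```python
from itertools import combinations

def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_densely_annotated(boxes, threshold=0.5):
    """True if any pair of instances in the image overlaps with IoU above the threshold."""
    return any(box_iou(a, b) > threshold for a, b in combinations(boxes, 2))

# Toy example: the first image has two heavily overlapping instances.
overlapping = [(0, 0, 10, 10), (0, 0, 10, 8)]       # IoU = 80 / 100
separate = [(0, 0, 10, 10), (20, 20, 30, 30)]       # IoU = 0
keep_first = is_densely_annotated(overlapping)
keep_second = is_densely_annotated(separate)
```

Applying such a filter to each image of an existing split selects the subset in which at least one pair of instances is highly overlapped.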
As shown in Figure 7, compared with the DetectionAtt model that uses detection-region features, InstanceSegAtt is better at handling images with overlapped objects, as the segmentation-region features do not overlap with each other. For example, in the first column of Figure 7, the detection regions of the person and the snowboard are highly overlapped. InstanceSegAtt correctly generates the word "snowboard", while DetectionAtt cannot. Similarly, in the second column of Figure 7, the dense detection regions make it hard for DetectionAtt to generate "a woman" as InstanceSegAtt does. In the third column of Figure 7, the detection regions of the two giraffes contain each other, which could confuse the DetectionAtt model. In the last column of Figure 7, the rectangular detection region of the boy in the second row contains the region of the man; therefore, DetectionAtt cannot correctly recognize their relationship. Table 3 shows the evaluation results on the dense annotation test split. The performance gap between InstanceSegAtt and DetectionAtt is larger than their gap in Table 1, which demonstrates that the segmentation-based attention model has an advantage in handling densely annotated images. We consider that the features of overlapped regions make it hard for DetectionAtt to distinguish the individual instances in overlapped regions. Thus, DetectionAtt has lower scores than InstanceSegAtt. It can also be observed that the performance of PanopticSegAtt is better than that of InstanceSegAtt. This suggests that the contextual information from stuff-region features helps the model to recognize partially occluded objects. Such a conclusion is widely acknowledged in object detection [56][57][58]. The above results demonstrate that, with fine-grained attention regions, the model can not only avoid the negative impact of irrelevant regions but also benefit from the context information, and is more capable of distinguishing instances in images with overlapped objects.

With Stuff versus without Stuff
Stuff regions play an important role in image captioning, as they provide the context information (scene, location, etc.) to the model. As shown in Table 1, when comparing PanopticSegAtt with InstanceSegAtt, PanopticSegAtt further improves the CIDEr score by 2.1, which shows that using features of stuff regions also enhances performance. To qualitatively demonstrate the benefit of using features of stuff regions, we show example captions generated by InstanceSegAtt and PanopticSegAtt in Figure 8. Compared with the InstanceSegAtt model, which does not have the features from stuff regions, PanopticSegAtt can generate captions with richer scene information. For example, in the first column of Figure 8, the segmentation regions generated by panoptic segmentation contain the brick wall region in the image and provide the feature of the brick wall to the captioning model. Thus, the caption generated by PanopticSegAtt contains the background information "brick wall". Similarly, in the second column of Figure 8, the purple area provides the context information of the image, so PanopticSegAtt can generate the phrase "with plants". In the third column of Figure 8, while InstanceSegAtt does not generate the scene of the photo, PanopticSegAtt correctly generates the scene phrase "in a field". In the fourth column of Figure 8, the purple area provides the location of the train to the PanopticSegAtt model, which generates the more accurate scene word "mountain", while InstanceSegAtt generates the scene word "field".
The above results demonstrate that, with the aid of stuff regions, our panoptic segmentation-based attention method can generate captions with richer context information.

Conclusions
In this paper, we present a novel panoptic segmentation-based attention mechanism for image captioning, which provides more fine-grained regions for attention with the aid of panoptic segmentation. Our method achieves competitive performance against state-of-the-art methods. Qualitative and quantitative evaluation results show that our approach has better scene and instance recognition of an image compared with the detection-based attention method, which demonstrates the benefit of using features of fine-grained segmentation regions in image captioning. Our research provides a novel perspective for researchers and practitioners on how to improve the performance of image captioning. The above results indicate that extracting fine-grained image features is a promising research topic for future work.
We plan to use models developed for the panoptic segmentation task to generate segmentation regions for things and stuff in a more elegant way. Investigating different ways to incorporate stuff-region features into the captioning model is also a future research direction.

Figure 1. An overview of our proposed method. Given an image, we first generate segmentation regions of the image and use a convolutional neural network to extract the segmentation-region features. For readability, we have applied a color map to each segment. The segmentation-region features are then fed to the LSTM to generate the captions.

Figure 2. Comparison of the feature extraction procedures between a detection-based attention model and our panoptic segmentation-based attention model: (a) The feature extraction procedure of the detection-based attention model. The attention regions are the rectangular regions annotated with colored edges. (b) The feature extraction procedure of our panoptic segmentation-based attention model. The attention regions are the fine-grained regions annotated in the colored map. With panoptic segmentation, our method can not only generate fine-grained attention regions, which avoids the negative impact caused by irrelevant regions, but also perform attention on stuff class regions.

Figure 3. Overview of our framework and the Dual-Attention Module used for handling the panoptic segmentation features. The features of the things and stuff classes are fed into separate pathways to perform attention individually. The attended features are then concatenated and fed into the LSTM to generate captions.

Figure 4. An overview of the convolutional feature masking (CFM) approach for feature extraction. The image is fed to the segmentation model to obtain masks and to the CNN to obtain feature maps. The masks are resized to have the same size as the feature maps. The feature maps are then multiplied by the resized masks to obtain the segmentation-region features.

Figure 6. Comparison of CIDEr scores on the validation set for InstanceSegAtt and DetectionAtt during 50 epochs of training.

Figure 7. Examples of generated captions and the attention regions of DetectionAtt and InstanceSegAtt: (a) original image; (b) detection regions generated by the object detection model; (c) captions generated by DetectionAtt; (d) instance segmentation regions generated by the instance segmentation model; (e) captions generated by InstanceSegAtt. Images are selected from the dense annotation split. Bold text indicates where InstanceSegAtt has included more detail in the captions compared to DetectionAtt. Results of PanopticSegAtt are not shown, as the differences between the captions of PanopticSegAtt and InstanceSegAtt are not obvious in these images.

Figure 8. Examples of captions and the attention regions of InstanceSegAtt and PanopticSegAtt: (a) original image; (b) instance segmentation regions generated by the instance segmentation model; (c) captions generated by InstanceSegAtt; (d) segmentation regions generated by the panoptic segmentation model; (e) captions generated by PanopticSegAtt. Images were selected from the Karpathy test split. Bold text indicates where PanopticSegAtt has included more detail in the captions compared to InstanceSegAtt.

Table 1. The results obtained on the MSCOCO Karpathy test split [42]. † indicates ensemble models. Higher is better in all columns. Scores are multiplied by a factor of 100.

Table 2. The results obtained on the online MSCOCO test server. † indicates ensemble models. Higher is better in all columns. Scores are multiplied by a factor of 100.

Table 3. The results obtained on the MSCOCO dense annotation test split. Higher is better in all columns. Scores are multiplied by a factor of 100.