Article
Peer-Review Record

Panoptic Segmentation-Based Attention for Image Captioning

Appl. Sci. 2020, 10(1), 391; https://doi.org/10.3390/app10010391
by Wenjie Cai 1, Zheng Xiong 1, Xianfang Sun 2, Paul L. Rosin 2, Longcun Jin 1,* and Xinyi Peng 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 12 December 2019 / Revised: 30 December 2019 / Accepted: 1 January 2020 / Published: 4 January 2020
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

The topic is very timely, needed, and hot. It covers a clear gap, and there is a lot of research currently going on in this area. Therefore, this paper could serve as an excellent future reference for researchers working in the area. The overall quality is good and meets the minimum requirements of the journal. The content is relevant and useful for the research community. However, it has many minor issues, ranging from the literature coverage to the lack of suitable directions, that must be fixed properly, as follows:

1. The abstract is long and does not communicate the problem well; it should present the proposed work more clearly.
2. One of my major complaints is the poor introduction, which fails to motivate the reader to read the paper. This is one of the most important sections of any paper, and I feel that the authors neglect this section compared with the adequate detail in the other sections. Hence, the introduction needs to be rewritten and expanded in order to motivate the theme, and the authors must show, while expanding the introduction, why this is a relevant scientific problem to solve. On a positive note, I think that the list of contributions at the end of the introduction is adequate.
3. Please make sure that all keywords have been used in the abstract and the title.
4. The related work does not cover the literature and state of the art of the presented work appropriately. Therefore, I strongly recommend that the authors add more literature on this point. Some very relevant and recent work that the authors may use to address this point: https://ieeexplore.ieee.org/abstract/document/8851750; https://arxiv.org/pdf/1706.02430.pdf; https://ieeexplore.ieee.org/abstract/document/7792460
5. Some equations require further elaboration, such as Equations 8–10. I strongly recommend that the authors go over all equations again and make sure a consistent explanation is available for each.
6. Results should be further analyzed and discussed; more details and further discussion of Table 3 are needed.
7. The conclusions section should state what you have achieved from the study, the contributions of the study to academia and practice, and recommendations for future work.
8. Thorough proofreading is required (best by a native English speaker).

Author Response

Point 1: The abstract is long and does not communicate the problem well; it should present the proposed work more clearly.


 

Response 1: Thank you for your valuable suggestion. We have simplified the abstract and presented our proposed work more clearly.

 

L1: “Image captioning is a task to generate textual …” -> “Image captioning is a task generating textual …”

 

L3: “... in existing image captioning models …” -> “… in existing models …”

 

L4: “… contain irrelevant regions (e.g., background or overlapped regions) besides the object.” -> “… contain irrelevant regions (e.g., background or overlapped regions) around the object, making the model generate inaccurate captions.”

 

L5: “In this work, we propose panoptic segmentation-based attention, a novel attention mechanism that …” -> “To address this issue, we propose panoptic segmentation-based attention that …”

 

L7: “… the main part of an instance), which is …” -> “… the main part of an instance). Our approach extracts feature vectors from the corresponding segmentation regions, which is …”

 

L9: “Experimental results lead to two significant observations. First, the model can better recognize the overlapped objects in the image with our fine-grained segmentation-region features. Second, the model can better understand the scene when scene regions are included in the attention regions. Our approach achieves competitive performance against state-of-the-art methods, which demonstrate the power of the panoptic segmentation-based attention mechanism.” -> “Experimental results show that our model can recognize the overlapped objects and understand the scene better. Our approach achieves competitive performance against state-of-the-art methods.”
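To make the revised abstract sentence about extracting feature vectors from segmentation regions concrete, the following is a minimal sketch of masked average pooling over panoptic segmentation masks (the function name, tensor shapes, and use of PyTorch are illustrative assumptions, not the exact implementation):

import torch

def pool_region_features(feature_map, masks, eps=1e-6):
    # feature_map: (C, H, W) tensor from a CNN backbone.
    # masks:       (R, H, W) binary masks from panoptic segmentation,
    #              one per region (thing instances and stuff regions).
    # Returns an (R, C) matrix with one feature vector per region,
    # obtained by averaging the backbone features inside each mask.
    C, H, W = feature_map.shape
    R = masks.shape[0]
    feats = feature_map.reshape(C, H * W)                 # (C, H*W)
    m = masks.to(feature_map.dtype).reshape(R, H * W)     # (R, H*W)
    area = m.sum(dim=1, keepdim=True).clamp(min=eps)      # (R, 1)
    return (m @ feats.t()) / area                         # (R, C)

In practice, the masks would first be resized to the spatial resolution of the backbone feature map before pooling.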

 

 

Point 2: One of my major complaints is the poor introduction, which fails to motivate the reader to read the paper. This is one of the most important sections of any paper, and I feel that the authors neglect this section compared with the adequate detail in the other sections. Hence, the introduction needs to be rewritten and expanded in order to motivate the theme, and the authors must show, while expanding the introduction, why this is a relevant scientific problem to solve.

 

Response 2: Thank you for your valuable suggestion. We have rewritten and expanded the introduction to motivate the theme and explain the importance of the problem.

 

L19: “This task has long been considered difficult as it requires fine-grained representation and a thorough understand of the images.” -> “This task has several important practical applications. For example, it can help people with visual impairments. Therefore, it requires accurate recognition of the objects and a thorough understanding of the images.”

 

L25: “However, in the vanilla encoder-decoder framework, …” -> “The main problem in image captioning is the coarse representation of images. In the vanilla encoder-decoder framework, …”

 

L31: “The development of visual attention mechanisms generally follows a coarse-to-fine route. The initial encoder-decoder framework without any attention mechanism uses the parameters from the fully-connected layer of the CNN to represent the image. Later, Xu …” -> “However, the image features in existing attention-based methods are not fine-grained. Xu …”

 

L39: “… detection-based attention mechanisms were proposed to attend …” -> “… detection-based attention mechanisms were proposed to enable the model to attend …”

 

L43: “… in the attention regions.” -> “… in the rectangular attention regions.”

 

L49: Add “The results from instance segmentation only contain the main part of the object and do not include the background or overlapped regions. Therefore, we can obtain more fine-grained attention regions with the aid of instance segmentation.” after “… each object class.”

 

L51: Add “Losing information of the stuff regions may weaken the model's ability to understand the scene.” after “… e.g., sky and grass).”

 

L54: “… our method performs attention over the features of these segmentation regions, whereas the attention regions of detection-based attention mechanisms contain irrelevant regions besides the object in their attention regions. As shown in Figure 2, compared with the detection-based attention mechanism, the attention regions …” -> “… our method extracts image features based on the shape of the segmentation regions and generates captions based on the attention-weighted features. As shown in Figure 2, compared with the detection-based attention mechanisms that contain irrelevant regions around the objects in their attention regions, the attention regions …”

 

L61: “… our approach treats things and stuff classes independently. The features of stuff regions …” -> “… our approach processes things and stuff classes independently via a dual-attention module. Incorporating the features of stuff regions …”
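To illustrate the dual-attention module mentioned in the revision above, here is a minimal sketch that attends over thing-region and stuff-region features in separate branches (the class names, additive-attention form, and concatenation step are illustrative assumptions rather than the exact implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    # Additive attention over a set of region feature vectors.
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.w_v = nn.Linear(feat_dim, attn_dim)
        self.w_h = nn.Linear(hidden_dim, attn_dim)
        self.w_a = nn.Linear(attn_dim, 1)

    def forward(self, feats, h):
        # feats: (N, feat_dim) region features; h: (hidden_dim,) decoder state.
        scores = self.w_a(torch.tanh(self.w_v(feats) + self.w_h(h))).squeeze(-1)
        alpha = F.softmax(scores, dim=0)                 # attention weights (N,)
        return (alpha.unsqueeze(-1) * feats).sum(dim=0)  # attended feature (feat_dim,)

class DualAttention(nn.Module):
    # Separate attention branches for thing regions and stuff regions;
    # the two attended vectors are concatenated and fed to the caption decoder.
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.thing_attn = SoftAttention(feat_dim, hidden_dim, attn_dim)
        self.stuff_attn = SoftAttention(feat_dim, hidden_dim, attn_dim)

    def forward(self, thing_feats, stuff_feats, h):
        return torch.cat([self.thing_attn(thing_feats, h),
                          self.stuff_attn(stuff_feats, h)], dim=-1)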

 

 

Point 3: Please make sure that all keywords have been used in the abstract and the title. 


 

Response 3: Thank you for your valuable suggestion. We updated the manuscript by changing the keywords from {Image captioning; Attention mechanism; Multimodal; Encoder-decoder framework; Image description} to {Image captioning; Attention mechanism; Panoptic Segmentation}.

 

 

Point 4: The related work doesn’t cover the literature and state-of-the-art of the presented work appropriately. Therefore, I strongly recommend that the authors add more literature about this point. Some very relevant and recent work that the authors may use to address this point: https://ieeexplore.ieee.org/abstract/document/8851750; https://arxiv.org/pdf/1706.02430.pdf; https://ieeexplore.ieee.org/abstract/document/7792460 


 

Response 4: Thank you for your valuable suggestion. We have updated the manuscript by citing [1] and state-of-the-art Graph Convolutional Network-based methods [2,3] in the related work section to make it more comprehensive.

 

L85: Add [1] in “Recent detection-based attention methods [1, …] …”

 

L110: Add “Some recent works [2,3] explore the use of Graph Convolutional Networks (GCN) to encode images and improve visual relationship modeling.” after “… 2D hidden states.”

 

[1] Yang Z, Zhang Y J, ur Rehman S, et al. Image captioning with object detection and localization[C]//International Conference on Image and Graphics. Springer, Cham, 2017: 109-118.

[2] Yang X, Tang K, Zhang H, et al. Auto-encoding scene graphs for image captioning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019: 10685-10694.

[3] Yao T, Pan Y, Li Y, et al. Exploring visual relationship for image captioning[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 684-699.

 

 

Point 5: Some equations require further elaborations, such as equations 8-10. So, I strongly recommend that the authors have another go on all equations and make sure a consistent explanation for each is available. 


 

Response 5: We apologize for the confusion. In fact, in Equations 8–10, each bold symbol is the vectorized version of the corresponding non-bold quantity. Such a representation is also used in [1,2]. To avoid misunderstanding, we have added a further explanation of the above equations. We have also checked the other equations.

 

L155: Added a further explanation of the vectorized notation used in Equations 8–10.

Eqs. 17 and 18: We fixed a mistake: a missing index “i” was added.

 

[1] Anderson P, He X, Buehler C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6077-6086.

[2] Zhang Z, Wu Q, Wang Y, et al. Fine-grained and semantic-guided visual attention for image captioning[C]//2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018: 1709-1717.
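For context, the following is a sketch of the standard top-down attention formulation of [1], under the assumption that Equations 8–10 take this common form (the exact symbols in the manuscript may differ):

a_{i,t} = \mathbf{w}_a^{\top} \tanh\!\left( W_{va}\,\mathbf{v}_i + W_{ha}\,\mathbf{h}_t \right), \qquad \boldsymbol{\alpha}_t = \operatorname{softmax}(\mathbf{a}_t), \qquad \hat{\mathbf{v}}_t = \sum_{i=1}^{k} \alpha_{i,t}\,\mathbf{v}_i

Here the bold \mathbf{a}_t = (a_{1,t}, \ldots, a_{k,t})^{\top} is the vectorized version of the scalar scores a_{i,t}, which is the notational convention referred to above.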

 

 

Point 6: Results should be further analyzed and discussed; more details and further discussion of Table 3 are needed.


 

Response 6: Thank you for your valuable suggestion. We have updated the manuscript by adding further discussion of the performance of PanoSegAtt in Table 3.

 

L292: Add “It is also observed that the performance of PanoSegAtt is better than that of InstanceSegAtt. This suggests that the context information from stuff region features helps the model recognize partially occluded objects. Such a conclusion is widely acknowledged in object detection [1,2,3].” after “… compared with InstanceSegAtt.”

 

L293: “… the model can avoid the negative impact from irrelevant regions and is more capable of distinguishing instances in densely annotated images.” -> “… the model can not only avoid the negative impact from irrelevant regions but also benefit from the context information and is more capable of distinguishing instances in images with overlapped objects.”

 

 

[1] Divvala S K, Hoiem D, Hays J H, et al. An empirical study of context in object detection[C]//2009 IEEE Conference on computer vision and Pattern Recognition. IEEE, 2009: 1271-1278.

[2] Borji A, Iranmanesh S M. Empirical Upper-bound in Object Detection and More[J]. arXiv preprint arXiv:1911.12451, 2019.

[3] Heitz G, Koller D. Learning spatial context: Using stuff to find things[C]//European conference on computer vision. Springer, Berlin, Heidelberg, 2008: 30-43.

 

 

Point 7: The conclusions section should state what you have achieved from the study, the contributions of the study to academia and practice, and recommendations for future work.

 

Response 7: Thank you for your valuable suggestion. We have updated the conclusion section to include the above suggestions.

 

L318: Add “Our method achieves competitive performance against state-of-the-art methods.” after “… segmentation.”

 

L321: Add “Our research provides a novel perspective for academia and practice on how to improve the performance of image captioning. The above results indicate that extracting fine-grained image features is a promising research topic for future work.” after “… captioning.”

 

 

Point 8: Thorough proofreading is required (best by a native English speaker) 


 

Response 8: Thank you for your valuable suggestion. We have done a thorough proofreading of the manuscript.

 

L37: “    However, the attention regions in the top-down attention mechanism correspond to …” -> “However, in the top-down attention mechanism, the attention regions correspond to …”

 

L67: “… into the image captioning task.” -> “… into image captioning.”

 

Figure 2, last row: “… irrelevant regions, but also …” -> “… irrelevant regions but also …”

 

L95: “… obtain richer scene context.” -> “… obtain richer context information.”

 

L98: “… the shape of the mask.” -> “… the shape of the masks.”

 

L101: “… from a semantic segmentation model …” -> “… from a semantic segmentation model FCN …”

 

L106: “… the advantage of fine-grained segmentation regions compared to detection regions.” -> “… the advantage of fine-grained segmentation regions over detection regions.”

 

L102: “Moreover, as they use semantic segmentation, their model is incapable of distinguishing instances of the same class in an image.” -> “Moreover, their models are incapable of distinguishing instances of the same class in an image as they use semantic segmentation.”

 

L110: “Other work attempts to …” -> “Other works attempt to …”

 

L133: “Captions are generally generated by LSTM.” -> “In this framework, captions are generated by LSTM.”

 

Figure 3, L2: “… are fed in to the …” -> “… are fed into the …”

 

L168-169: “things” -> things, “stuff” -> stuff

 

L173: Add “Next, we describe how to obtain the segmentation regions for the segmentation-based attention model.” before “Given an image, …”

 

L176: “… is the number of instances in the image.” -> “… is the number of the instances in the image.”

 

L181: “… are the masks that denote …” -> “… is the mask indicating …”

 

L203: “… in the image captioning task, …” -> “… in image captioning, …”

 

L221, L223: “analysis” -> “analyses”

 

Table 3: “… on the dense annotation test split.” -> “… on the MSCOCO dense annotation test split.”

 

L278: “… handling densely annotated images …” -> “… handling images with overlapped objects …”

 

L290: “… make it hard for detection-based models to …” -> “… make it hard for DetectionAtt to …”

 

L297: “… in the image captioning task, …” -> “… in image captioning, …”

 

L303: “… can generate richer scene information.” -> “… can generate captions with richer scene information.”

 

L322: “… in the panoptic segmentation task …” -> “… in panoptic segmentation …”

Author Response File: Author Response.pdf

Reviewer 2 Report

The paper describes an improved image captioning model making use of masked segmentation instead of the conventional detection-based approach. The paper is well written and mostly clear in its explanations. My only suggestion for improvement would be to extend the discussion on reinforcement learning a bit, as it seems somewhat short.

Author Response

Point 1: My only suggestion for improvement would be to extend the discussion on reinforcement learning a bit, as it seems somewhat short.


 

Response 1: Thank you for your valuable suggestion. We have updated the manuscript by extending the description of reinforcement learning.

Add: an extended description of reinforcement learning, starting from L159 in the previous manuscript.
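For context, reinforcement learning in image captioning is most commonly applied as self-critical sequence training (SCST); assuming the extended description follows that standard formulation, it can be summarized as

L_{RL}(\theta) = -\,\mathbb{E}_{w^{s} \sim p_{\theta}}\!\left[ r(w^{s}) \right], \qquad \nabla_{\theta} L_{RL}(\theta) \approx -\left( r(w^{s}) - r(\hat{w}) \right) \nabla_{\theta} \log p_{\theta}(w^{s}),

where w^{s} is a caption sampled from the model p_{\theta}, \hat{w} is the greedily decoded baseline caption, and r(\cdot) is a sentence-level reward such as the CIDEr score.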

 

Point 2: English language and style are fine/minor spell check required.

 

Response 2: Thank you for your suggestion. We have done a thorough proofreading of the manuscript.

 

L37: “    However, the attention regions in the top-down attention mechanism correspond to …” -> “However, in the top-down attention mechanism, the attention regions correspond to …”

 

L67: “… into the image captioning task.” -> “… into image captioning.”

 

Figure 2, last row: “… irrelevant regions, but also …” -> “… irrelevant regions but also …”

 

L95: “… obtain richer scene context.” -> “… obtain richer context information.”

 

L98: “… the shape of the mask.” -> “… the shape of the masks.”

 

L101: “… from a semantic segmentation model …” -> “… from a semantic segmentation model FCN …”

 

L106: “… the advantage of fine-grained segmentation regions compared to detection regions.” -> “… the advantage of fine-grained segmentation regions over detection regions.”

 

L102: “Moreover, as they use semantic segmentation, their model is incapable of distinguishing instances of the same class in an image.” -> “Moreover, their models are incapable of distinguishing instances of the same class in an image as they use semantic segmentation.”

 

L110: “Other work attempts to …” -> “Other works attempt to …”

 

L133: “Captions are generally generated by LSTM.” -> “In this framework, captions are generated by LSTM.”

 

Figure 3, L2: “… are fed in to the …” -> “… are fed into the …”

 

L168-169: “things” -> things, “stuff” -> stuff

 

L173: Add “Next, we describe how to obtain the segmentation regions for the segmentation-based attention model.” before “Given an image, …”

 

L176: “… is the number of instances in the image.” -> “… is the number of the instances in the image.”

 

L181: “… are the masks that denote …” -> “… is the mask indicating …”

 

L203: “… in the image captioning task, …” -> “… in image captioning, …”

 

L221, L223: “analysis” -> “analyses”

 

Table 3: “… on the dense annotation test split.” -> “… on the MSCOCO dense annotation test split.”

 

L278: “… handling densely annotated images …” -> “… handling images with overlapped objects …”

 

L290: “… make it hard for detection-based models to …” -> “… make it hard for DetectionAtt to …”

 

L297: “… in the image captioning task, …” -> “… in image captioning, …”

 

L303: “… can generate richer scene information.” -> “… can generate captions with richer scene information.”

 

L322: “… in the panoptic segmentation task …” -> “… in panoptic segmentation …”

Author Response File: Author Response.pdf
