Peer-Review Record

Structure Preserving Convolutional Attention for Image Captioning

Appl. Sci. 2019, 9(14), 2888; https://doi.org/10.3390/app9142888
by Shichen Lu 1,2,†,‡, Ruimin Hu 1,2,*,‡, Jing Liu 3, Longteng Guo 3 and Fei Zheng 4
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 11 June 2019 / Revised: 16 July 2019 / Accepted: 16 July 2019 / Published: 19 July 2019
(This article belongs to the Section Computing and Artificial Intelligence)

Round 1

Reviewer 1 Report

This paper introduces an architecture for automatic image captioning. The introduction is clearly written and adequate. The rest of the paper would benefit from a revision for grammar and typos (it feels as though the introduction and the rest were written by different people). The references are adequate, the experimentation is appropriately set up, and the results seem correct. However, the authors should provide a more extended discussion of some of the choices that have been made, e.g.,

- Why is ResNet-101 used as the encoder? I agree that it is a reasonable choice, but the authors should explicitly provide the arguments for such a choice.

- Also, the cross-channel attention method looks like a simple 2D convolution across the feature-map channels. Is there any other novelty related to it?
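To make the question concrete: the reviewer's reading of cross-channel attention is roughly the following sketch, in which a single 2D convolution mixes all encoder channels at each spatial position to produce one attention logit per position, normalized with a softmax. This is a minimal illustration of that reading, not the authors' implementation; the function name, kernel shape, and feature dimensions (e.g. a ResNet-101 conv5 map of 2048 × 7 × 7) are assumptions.

```python
import numpy as np

def conv_attention(features, kernel):
    """Cross-channel attention read as one 2D convolution: the kernel
    mixes all C input channels at each spatial location, yielding a
    single attention logit per position (C x H x W -> H x W)."""
    C, H, W = features.shape
    kC, kH, kW = kernel.shape
    assert kC == C, "kernel must span all feature channels"
    pad = kH // 2
    # zero-pad spatially so the output keeps the H x W grid
    padded = np.pad(features, ((0, 0), (pad, pad), (pad, pad)))
    logits = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            patch = padded[:, i:i + kH, j:j + kW]
            logits[i, j] = np.sum(patch * kernel)
    # softmax over all spatial positions -> attention map summing to 1
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
feats = rng.standard_normal((2048, 7, 7))    # assumed ResNet-101 conv5 output
kern = rng.standard_normal((2048, 3, 3)) * 0.01
att = conv_attention(feats, kern)
print(att.shape, float(att.sum()))
```

If the proposed module is more than this (e.g. structure-preserving constraints beyond the convolution itself), the paper should state explicitly where the novelty lies.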

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

In this paper a convolutional attention module that can preserve the spatial structure of an image is presented. The paper includes an appropriate state of the art, which is also used for comparison with the proposed methods.

The level of English must be improved. Authors whose primary language is not English are advised to seek help in preparing the paper.

It would be interesting to include the computation time required per method. In addition, better qualitative examples should be given in order to compare soft attention and the proposed attention map, since the current ones seem quite subjective (for instance, according to the reviewer, it is clear that a man is throwing a ball in both the soft-attention method and the proposed method).

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
