by Haonan Zhou, Xiaoping Du, Lurui Xia et al.

Reviewer 1: Mihaela Hnatiuc
Reviewer 2: Tai Fei
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Reviewer 5: Pyke Tin

Round 1

Reviewer 1 Report

This paper can be published in the present form.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

The content in this paper is too dense, and many things are not clearly explained. I am very confused about the difference between image captioning and image classification. Further, the authors choose few-shot learning as their backbone structure and make many modifications to it, but it is not clear to me what the few-shot learning framework looks like or where the modifications are made. Too many concepts and notations are mentioned in this paper, yet their connections and meanings are not clearly presented.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

This work proposes an interesting SFRC model for remote sensing that can be useful for many applications. Here are some of my comments.

1. The abstract could be made more precise and shortened.

2. Figure captions need to be more precise.

3. Some of the formatting needs to be corrected; in particular, the spacing between lines needs to be consistent.

4. What are the other applications this SFRC model could be used for?

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

In this paper, the authors proposed a deep neural network system for remote sensing image captioning under a few-shot setting. Self-supervised learning was employed on images only to learn visual features. Then an LSTM was trained as the decoder to generate captions. A reinforcement learning approach was further used to optimize the system. Experiments were done on three public datasets to show the superiority of the performance. An ablation study was also included to show the importance of each module. I have a few concerns:

1. The writing of this paper makes it quite hard to follow in many places. Please rephrase those passages and correct the English grammatical errors while revising the manuscript. The descriptions are also too redundant; please simplify your language. In Lines 568-571, the meaning of the sentence is unclear, and it is duplicated.

2. In Line 368, “Where is the weight of the model”: what does “where” refer to here? What does “stop gradient” mean, and why do you call it stop gradient? (A generic sketch of the stop-gradient idea appears after this list.)

3. In Figure 2, how could “random crop and resize” generate a square mask in Image x_m?

4. In Line 397, the motivation behind using (f_7(x_m), f_5(x_n)) is unclear. Why did you choose 5 and 7 to contrast?

5. Section 4.5 doesn’t really make sense. The difference between training with 60% and with 100% of the data is quite significant, which cannot prove your claim that this method works in the few-shot setting. It would be more informative to show that, under the 60% setting, the performance of other methods decreases drastically while your method still generates meaningful output.
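For reference on points 2 and 4 above, the following is a minimal, generic sketch of how a stop-gradient is typically used when contrasting two augmented views in self-supervised learning (a SimSiam-style formulation under assumed names, not the authors' exact model). The detach() call is what the literature calls the stop-gradient: it cuts the backward pass through the target branch, which helps prevent representation collapse.

```python
# Illustrative sketch only -- assumed encoder/predictor modules, not the
# SFRC implementation. Shows the stop-gradient (detach) on the target branch.
import torch.nn.functional as F

def contrastive_step(encoder, predictor, x_m, x_n):
    """One step on two augmented views x_m, x_n of the same image."""
    z_m, z_n = encoder(x_m), encoder(x_n)      # representations of both views
    p_m, p_n = predictor(z_m), predictor(z_n)  # predictions from a small MLP head

    # Stop gradient: detach() removes z from the autograd graph, so no
    # gradient flows into the target branch. Without it, both branches can
    # collapse to a constant output that trivially maximizes similarity.
    loss = -(F.cosine_similarity(p_m, z_n.detach()).mean()
             + F.cosine_similarity(p_n, z_m.detach()).mean()) / 2
    return loss
```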

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 5 Report

General Comments

In this paper, the authors propose a self-learning method named SFRC for few-shot remote sensing image captioning. Without relying on additional labeled remote sensing data or external knowledge, SFRC improves performance in few-shot scenarios by improving the manner and efficiency of learning on limited data. The authors conduct percentage-sampling few-shot experiments (sketched below) to test the performance of the SFRC method in few-shot remote sensing image captioning with fewer samples. They also conduct ablation experiments on the key designs in SFRC.
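As a minimal sketch of what such percentage-sampling experiments usually look like (hypothetical names, not the authors' code): the same model is trained on random fractions of the captioned training pairs and evaluated each time.

```python
# Hypothetical sketch of percentage-sampling few-shot splits.
import random

def sample_split(train_pairs, fraction, seed=0):
    """Return a random `fraction` of (image, captions) pairs, e.g. 0.6 for 60%."""
    rng = random.Random(seed)
    return rng.sample(train_pairs, int(len(train_pairs) * fraction))

# Train and evaluate once per fraction, e.g. 20%, 40%, 60%, 80%, 100%:
# splits = {f: sample_split(train_pairs, f) for f in (0.2, 0.4, 0.6, 0.8, 1.0)}
```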

The paper is well organized.

Particular Comments

1. Page 6, Line 270, Equation (2): The notation of the W subscripts is not clear and should be explained.

2. Page 7, Line 277: How do you compute h_{t-1} before you calculate h_t? (See the sketch after this list.)

3. Page 8, Lines 329-330: What are the two different strategies you mentioned?

4. Page 8, Lines 334-335: The definitions of q and e look the same and should be checked; this may affect Equation (5) at Line 346.
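For context on point 2: in a standard LSTM decoder the hidden state is defined recursively, so an explicit initial state h_0 must exist before the first step can be computed. Setting h_0 to zeros or to a learned projection of the image feature is a common convention (an assumption here, not necessarily the paper's choice):

```latex
% Recurrence of a standard LSTM decoder: h_0 must be supplied explicitly.
% Initializing from the image feature v is one common convention
% (an assumption, not necessarily the paper's choice).
h_t = \mathrm{LSTM}\bigl(w_t,\, h_{t-1}\bigr), \quad t = 1, \dots, T,
\qquad h_0 = W_{\mathrm{init}}\, v \;\;\text{or}\;\; h_0 = \mathbf{0}.
```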

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

I have no further comments.

Reviewer 4 Report

The authors addressed my concerns.