Article
Peer-Review Record

On the Limitations of Visual-Semantic Embedding Networks for Image-to-Text Information Retrieval

J. Imaging 2021, 7(8), 125; https://doi.org/10.3390/jimaging7080125
by Yan Gong *, Georgina Cosma * and Hui Fang
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 30 June 2021 / Revised: 21 July 2021 / Accepted: 23 July 2021 / Published: 26 July 2021
(This article belongs to the Special Issue Deep Learning for Visual Contents Processing and Analysis)

Round 1

Reviewer 1 Report

This paper analyses current visual-semantic embedding (VSE) networks for information retrieval in depth using various methods and summarizes their limitations. It points to directions for future research by comparing the strengths and limitations of current VSE networks. These analyses explore the essence of how VSE networks work, which is important for promoting the use of VSE networks in practice. The paper appears to be good, with a solid contribution. My comments are as follows:

  1. Please explain why it is necessary to evaluate the performance of the models on retrieving all 5 descriptions, given that their performance on retrieving any 1 of the 5 descriptions has already been evaluated.
  2. Why is the Recall/Precision curve in Figure 4 needed, given that the F1-measure has already been computed from Recall and Precision?
  3. Please explain how the classes of UNITER in Table 4 align with those of VSRN.
  4. Can ‘Objects Quantity Error’, title 5 in Table 5, be replaced by ‘Objects Counting Error’?

 

Author Response

We are grateful to the Reviewers for their positive and constructive comments that helped us improve the quality of the manuscript. Please see the attachment.

Author Response File: Author Response.pdf

 

Reviewer 2 Report

The paper only compares visual-semantic embedding networks, without any additional contributions. It identifies limitations but does not provide empirical studies of any proposed solution. I would strongly recommend that the authors address these limitations and propose a method.

The authors should also perform experiments on sample complexity: how do these methods perform with different amounts of training samples?

It would be great if the authors addressed retrieval results on adversarial images. How robust are the models' retrieval results on adversarial images?

Author Response

We are grateful to the Reviewers for their positive and constructive comments that helped us improve the quality of the manuscript. Please see the attachment.

Author Response File: Author Response.pdf

 

Round 2

Reviewer 2 Report

The paper has been improved, but it does not present a state-of-the-art technique. Still, it is good enough for this journal.
