Submit to this Journal Review for this Journal Propose a Special Issue

Article Menu

Share Help Cite Discuss in SciProfiles

Open AccessArticle

Peer-Review Record

MFVC: Urban Traffic Scene Video Caption Based on Multimodal Fusion

Electronics 2022, 11(19), 2999; https://doi.org/10.3390/electronics11192999

by Mingxing Li^1,2

, Hao Zhang^1,2, Cheng Xu^1,2,*

, Chenyang Yan^1,2, Hongzhe Liu^1,2 and Xuewei Li^1,2

Reviewer 1: Anonymous

Reviewer 2:

Tasneem Miqdady

Electronics 2022, 11(19), 2999; https://doi.org/10.3390/electronics11192999

Submission received: 18 August 2022 / Revised: 14 September 2022 / Accepted: 18 September 2022 / Published: 21 September 2022

(This article belongs to the Special Issue Multimodal Signal, Image and Video Analysis and Application)

Round 1

Reviewer 1 Report

This study proposed a new video caption model MFVC based on the Transformer by introducing audio modal data and attention bottleneck module to improve the model effect with less operation cost through the attention bottleneck module. Comparative experiments were performed on general datasets MSR-VTT, MSVD, and self-built dataset BUUISE. The result showed that the model can obtain good results on all the datasets.

The findings reported in this study are quite refreshing. I would like to state that the study had been meticulously conducted and the research gaps had been pointed out and filled in. There are still some issues to be addressed prior to the acceptance.

1. The following statement will require some citations to support. "At the same time, the scientists also found that the operation process of the attention takes the weighted sum between multiple vectors, which is very suitable for the fusion of multi-modal information."

2. The final section should be "Conclusions". The authors should discuss their results with previous findings in the section of discussion.

3. A section of "Limitations of the Study" or "Suggestions to the Future Studies " would be suggested.

Author Response

Response to Reviewer 1 Comments

Point 1: The following statement will require some citations to support. "At the same time, the scientists also found that the operation process of the attention takes the weighted sum between multiple vectors, which is very suitable for the fusion of multi-modal information."

Response 1: I have added the reference to the part you proposed as requested. The additions is highlighted in the pdf file.

Point 2: The final section should be "Conclusions". The authors should discuss their results with previous findings in the section of discussion.

Response 2: The conclusion section has been added as you requested, and the content of the discussion section has been added. The additions is highlighted in the pdf file.

Point 3: A section of "Limitations of the Study" or "Suggestions to the Future Studies " would be suggested.

Response 3: I have supplemented the conclusion according to your suggestions. The specific contents refer to the two aspects of your suggestion. And the additions is highlighted in the pdf file.

Author Response File: Author Response.pdf

Reviewer 2 Report

In the present work, the authors tried to model a video caption algorithm to overcome the single information algorithms for urban traffic scenes. For this, the authors employed a multi-modal fusion algorithm based on the Transformer of audio modal data and attention bottleneck module to increase the available information captured. The outcomes then are revealed and compared to other models. After reviewing the manuscript, the following are my comments.

- The paper is thorough, well written, has a strong literature review, a clear description of the methodology, and makes its contribution clear to the reader by significant results. However, the Authors should also proofread these parts of the paper for minor grammatical and spelling errors.

- The second part: Discussion and conclusion sections need major reconstruction and presentation; the authors did not discuss the results sufficiently, and greater clarification of models comparison results is needed. Moreover, a section of conclusions that highlights the key findings and points out the limitations of the study is missed.

Author Response

Response to Reviewer 2 Comments

Point 1: The paper is thorough, well written, has a strong literature review, a clear description of the methodology, and makes its contribution clear to the reader by significant results. However, the Authors should also proofread these parts of the paper for minor grammatical and spelling errors.

Response 1: The full text has been reviewed per your request, and several grammatical and spelling corrections have been made. The additions is highlighted in the text

Point 2: The second part: Discussion and conclusion sections need major reconstruction and presentation; the authors did not discuss the results sufficiently, and greater clarification of models comparison results is needed. Moreover, a section of conclusions that highlights the key findings and points out the limitations of the study is missed.

Response 2: The discussion and conclusion sections have been revised and supplemented as requested. Additions include further clarification of the results of model comparisons and pointing out study limitations. The additions is highlighted in the text

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

According to the modified version of the manuscript, The reviewer thinks it's now ready for publishing.

Article Menu

MFVC: Urban Traffic Scene Video Caption Based on Multimodal Fusion

Further Information

Guidelines

MDPI Initiatives

Follow MDPI