Article
Peer-Review Record

Bimodal Fusion Network with Multi-Head Attention for Multimodal Sentiment Analysis

Appl. Sci. 2023, 13(3), 1915; https://doi.org/10.3390/app13031915
by Rui Zhang 1,2, Chengrong Xue 1,2, Qingfu Qi 3, Liyuan Lin 2,*, Jing Zhang 1,2 and Lun Zhang 1,2
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 2 January 2023 / Revised: 27 January 2023 / Accepted: 29 January 2023 / Published: 2 February 2023

Round 1

Reviewer 1 Report

The authors proposed a bimodal fusion network that has a cross-modal attention mechanism to perform multimodal sentiment analysis.

Overall, the paper is well-written and well-organised. However, please consider the comments below for further improvement:

1. Line 62 - Are the terms "vector bias" and "vector offset" the same? Please clarify.

2. Are the alignment problem and the vector offset problem both solved by the cross-modal attention mechanism? Please clarify with respect to the equations: are both vector bias and misalignment alleviated by the cross-modal attention mechanism? If possible, highlight which step or equation alleviates which problem, or whether the same step addresses both.

3. Eq. 1 - should it be PFE or PFN? Please check for consistency.

4. It is unclear in Section 3.2 where the vector offset problem is remedied. Is Eq. 10 where vector bias is alleviated, or is it addressed through a series of steps?

5. EF-LSTM performed competitively. Could EF-LSTM have performed just as well if combined with the cross-modal attention mechanism used in BMAN?

6. The results report only overall performance. No further investigation, such as an ablation study, is done to demonstrate the effectiveness of the components that remedy either alignment or vector bias. There is no evidence that the improvements come specifically from the cross-modal attention mechanism, the bimodal encoder, or the decoder.

7. How does one conclude that CTC+EF-LSTM ignores differences between modalities and vector offset when the combination was specifically chosen? Could a more closely related model be chosen for comparison instead?

8. The performance of BMAN is better on the unaligned dataset than on the aligned one; it is unclear from the results whether the proposed remedies for alignment actually improve the results.

9. While the proposed technique showed better performance than the other models, there is no evidence that the vector bias problem has been solved, as claimed in the conclusion at line 383. Could a side investigation be done regarding this?

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Thank you for submitting the article. Please follow the suggestions below:

1. In this sentence - "The sentiment analysis (SA) of text research ………." - what do you mean by "scene"? Please explain.

2. In this sentence - "Therefore, multimodal sentiment analysis has become a popular research direction" - please cite some research studies that support the emergence of multimodal SA.

3. Please define in simple words what unimodal and multimodal mean.

4. Mention any limitations, if any; if there are none, list some applications of your proposed approach.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
