Article
Peer-Review Record

Enhancing Dimensional Emotion Recognition from Speech through Modulation-Filtered Cochleagram and Parallel Attention Recurrent Network

Electronics 2023, 12(22), 4620; https://doi.org/10.3390/electronics12224620
by Zhichao Peng 1,*, Hua Zeng 1, Yongwei Li 2, Yegang Du 3 and Jianwu Dang 4,5,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 24 September 2023 / Revised: 31 October 2023 / Accepted: 9 November 2023 / Published: 12 November 2023
(This article belongs to the Special Issue Applied AI in Emotion Recognition)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors addressed the issue of predicting emotion from speech signals. They developed a new way (a multi-resolution modulation-filtered cochleagram) to extract features from the speech data. A new attention model (PA-Net) was then used for training the predictive model and compared to various other models such as LSTM. Overall, the work is solid. The introduction and comparison to prior work is well done. The methods are clearly explained and so were the results. They also achieved state-of-the-art performance, suggesting the usefulness of their work. Some minor questions:

1. The partition of the data into development and validation sets is not clear. The authors cited a few references, but they should briefly explain how the split is done, especially whether there is any risk of data leakage if the data are not partitioned early and clearly, whether the models are trained and validated on different individuals, etc. (see the illustrative sketch after this list).

2. It is common in many machine learning papers to bold the performance metric of the best-performing model when comparing many models, so the authors may consider doing this so that readers can make a quick assessment.
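
As an editorial illustration of the data-partitioning concern in comment 1, the following is a minimal sketch of a speaker-independent development/validation split. The arrays, dimensions, and speaker grouping are hypothetical and are not taken from the paper; the point is only that grouping by speaker prevents the same individual from appearing in both partitions.

```python
# Illustrative only: a speaker-independent train/validation split.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

features = np.random.randn(200, 40)          # 200 utterances, 40-dim features (hypothetical)
labels = np.random.rand(200, 2)              # valence/arousal targets (hypothetical)
speakers = np.repeat(np.arange(20), 10)      # 20 speakers, 10 utterances each (hypothetical)

# Grouping by speaker guarantees that no individual appears in both
# partitions, which is the data-leakage concern raised above.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(splitter.split(features, labels, groups=speakers))
assert set(speakers[train_idx]).isdisjoint(speakers[val_idx])
```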

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This research paper proposes a framework for dimensional emotion recognition using auditory features and a parallel attention recurrent network. 

After reading and analyzing the paper, I found that the paper is well written and the research appears to be sound. However, I can make the following observations:

-The text presents a well-structured and informative research paper. However, you might want to include a brief discussion at the end of the "Experimental Results and Analysis" section to mention the advantages, disadvantages, and limitations of this methodological approach. You should also highlight the key finding of your paper.

- There are some minor issues: in line 138, the title "speech feature extraction" is written in lowercase, and in Equation (5) the meaning of the notation ? is not specified.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

1. In the experimental results and analysis section, the emotional speech data is not fully explained.

2. Write more details about RECOLA and SEWA datasets.

3. The information provided in Figure 4 is very limited.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

In the present paper, the authors propose a model for dimensional emotion recognition that relies on two key components: the modulation-filtered cochleagram module and the Parallel Attention Recurrent Neural Network (PA-net). Initially, they extract modulation-filtered cochleagrams at various resolutions from speech signals through auditory signal processing. Subsequently, they use the PA-net to establish multiple temporal dependencies from diverse scales of features, enabling the tracking of dynamic variations in dimensional emotions within auditory modulation sequences.
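
For illustration only, the following is a minimal sketch of what a parallel attention recurrent arrangement over multi-resolution feature streams could look like. The module names, dimensions, and per-frame gating scheme are assumptions for the sketch and do not reproduce the authors' PA-net.

```python
# Hypothetical sketch: parallel GRU branches over multi-resolution feature
# streams, fused for frame-level valence/arousal regression.
import torch
import torch.nn as nn

class ParallelAttentionRNN(nn.Module):
    def __init__(self, feat_dims, hidden=64):
        super().__init__()
        # One recurrent branch per feature resolution.
        self.branches = nn.ModuleList(
            [nn.GRU(d, hidden, batch_first=True) for d in feat_dims])
        # A simple per-frame attention gate for each branch (illustrative choice).
        self.attn = nn.ModuleList(
            [nn.Linear(hidden, 1) for _ in feat_dims])
        self.head = nn.Linear(hidden * len(feat_dims), 2)  # valence, arousal

    def forward(self, streams):
        # streams: list of tensors, each (batch, time, feat_dim_i)
        fused = []
        for x, gru, attn in zip(streams, self.branches, self.attn):
            h, _ = gru(x)                   # (batch, time, hidden)
            w = torch.sigmoid(attn(h))      # per-frame attention gate in [0, 1]
            fused.append(h * w)             # emphasize informative frames
        return self.head(torch.cat(fused, dim=-1))  # (batch, time, 2)

# Example with three modulation-filtered cochleagram streams (dimensions hypothetical).
model = ParallelAttentionRNN(feat_dims=[32, 32, 32])
streams = [torch.randn(4, 100, 32) for _ in range(3)]
print(model(streams).shape)  # torch.Size([4, 100, 2])
```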

 

Experiments conducted on the RECOLA dataset demonstrate that, at the feature level, the modulation-filtered cochleagram outperforms other assessed features in predicting valence and arousal. The good performance is particularly interesting in the presence of high noise levels. Overall, the results show that PA-net achieves the highest predictive performance for both valence and arousal when compared to state-of-the-art regression models.

 

I found the paper to be highly valuable and informative, and I recommend its publication. The research addresses an important aspect of understanding emotional states through speech and offers significant contributions to the field of emotional recognition. 

I have only a few minor comments to provide for further improvement:

1. In the tables, it would be useful to mark the models with the best performance in bold and include a brief summary sentence in the caption to provide an overview of the comparison results.

 

2. The figure on page 11 lacks both a title and a caption; a clear caption for the results would be helpful for better understanding. Slightly increasing the size of the image may also improve the visibility of the overlapping graphs.

 

3. I recommend replacing the "Summary" section with "Conclusion" and, if already clear, adding a sentence about future work.

 

Author Response

Please see the attachment.

Author Response File: Author Response.pdf
