Speech Emotion Recognition Based on Multiple Acoustic Features and Deep Convolutional Neural Network
Round 1
Reviewer 1 Report
The authors describe a neural-network-based system for detecting emotions from audio streams. In the introduction, they adequately state the problems with previously proposed approaches and describe the main contributions of the paper in bullet points.
For reproducibility purposes, the authors provide both the mathematical formulae describing each component and the architecture of the overall system. As the implementation/codebase is not provided, I would also have expected more detail on the training phase: how the hyperparameters were chosen, and which criteria were used. As it stands, the authors only report the number of epochs required for training and do not elaborate on how the training was carried out.
The authors also provide a good set of experimental settings in which different competing models for the same task are compared. The authors might also want to report the training time, to characterize the tradeoff between it and the accuracy of the reported results. Still, the authors make a good point in showing the differences across models via the number of total trainable parameters, from which the reader might argue that a system with fewer parameters is preferable, as it will be more robust and less prone to human error. Reporting such counts is straightforward, as sketched below.
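A minimal sketch of this kind of reporting, in PyTorch. The architecture here is a placeholder, not the authors' actual 1D-DCNN; the 40 input features per frame and the 7 emotion classes (as in Emo-DB) are assumptions for illustration:

```python
import torch.nn as nn

# Placeholder 1-D CNN for utterance-level emotion classification.
model = nn.Sequential(
    nn.Conv1d(in_channels=40, out_channels=64, kernel_size=5),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),  # pool over time to a fixed-size vector
    nn.Flatten(),
    nn.Linear(64, 7),         # 7 emotion classes, e.g. Emo-DB
)

# Count only parameters that are actually updated during training.
n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {n_trainable:,}")
```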
The paper is well written and easy to read, and no major grammatical issues were found. The suggested edits are very minor, and I am confident they can be carried out in a few days of work.
Author Response
Please see the attachment
Author Response File: Author Response.docx
Reviewer 2 Report
The topic of speech emotion recognition (SER) is very relevant, and over the past decade quite a lot of articles have been published on it, including articles on selecting hand-crafted features and on using 1D-DCNNs for SER. That is why the originality/novelty, the significance of the content, and the interest to readers are in great doubt.
There are serious questions about the chief contributions of the proposed article:
1. What was the motivation for analyzing exactly the listed set of acoustic features? In numerous state-of-the-art SER challenges, the community uses well-known feature sets such as the INTERSPEECH Emotion Challenge set, the Geneva Minimalistic Acoustic Parameter Set (GeMAPS), and the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS). What is wrong with these feature sets? They include most of the features listed by the authors of the article. What is the originality/novelty of the proposed feature set? (An extraction sketch for eGeMAPS is given after this list.)
2. The second stated contribution is a reduction in the complexity of deep learning frameworks for SER, but this contribution is not substantiated in the article. The authors evaluate only classification accuracy and do not give estimates of computational complexity. It is therefore impossible to confirm the value of this contribution, to estimate the gain in computational complexity of the 1D-DCNN relative to a 2D-DCNN, or to judge whether this gain compensates for the decrease in accuracy from 96.07% to 93.31% on the Emo-DB dataset. (A back-of-the-envelope comparison is sketched after this list.)
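Regarding point 1: a minimal sketch of extracting the eGeMAPS feature set with audEERING's open-source opensmile Python package (pip install opensmile); the audio file name is a placeholder:

```python
import opensmile

# eGeMAPSv02 functionals: 88 summary features per utterance.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("speech_sample.wav")  # pandas DataFrame
print(features.shape)  # (1, 88)
```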
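Regarding point 2: a back-of-the-envelope sketch of the kind of complexity estimate that is missing, comparing parameters and multiply-accumulate (MAC) counts for a single 1-D versus 2-D convolutional layer. All input shapes and layer sizes here are hypothetical, not taken from the paper:

```python
import torch
import torch.nn as nn

# Hypothetical single layers: a 1-D conv over framewise features vs. a
# 2-D conv over a spectrogram.
conv1d = nn.Conv1d(in_channels=1, out_channels=64, kernel_size=9)
conv2d = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=(3, 3))

def n_params(module):
    return sum(p.numel() for p in module.parameters())

out1 = conv1d(torch.randn(1, 1, 400))       # e.g. 400 feature frames
out2 = conv2d(torch.randn(1, 1, 128, 128))  # e.g. a 128x128 spectrogram

# MACs ~= (number of weights) * (number of output spatial positions).
macs_1d = conv1d.weight.numel() * out1.shape[-1]
macs_2d = conv2d.weight.numel() * out2.shape[-2] * out2.shape[-1]
print(f"1D: {n_params(conv1d):>5} params, ~{macs_1d:>10,} MACs")
print(f"2D: {n_params(conv2d):>5} params, ~{macs_2d:>10,} MACs")
```

Even such a per-layer estimate would let readers weigh the claimed complexity gain against the reported drop in accuracy.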
Author Response
Please see the attachment
Author Response File: Author Response.docx
Round 2
Reviewer 2 Report
Comments for author File: Comments.pdf
Author Response
The revised manuscript and the response to the reviewers are attached
Author Response File: Author Response.pdf
Round 3
Reviewer 2 Report
The authors have corrected the manuscript in accordance with the recommendations of the reviewer, and the article can be accepted in its present form.