Performance Improvement of Speech Emotion Recognition Systems by Combining 1D CNN and LSTM with Data Augmentation
Round 1
Reviewer 1 Report
This paper proposes Speech Emotion Recognition Using 1-D CLDNN with Data Augmentation. The paper is well-written and provides valuable insights, but some concerns should be addressed:
• The paper's title should be revised to reflect its content accurately and to be more concise and engaging. The authors should consider proposing a more suitable title that reflects the key findings of the research.
• The abstract should include the significant findings and be structured clearly so that it is easy to follow. It should briefly summarize the paper's main findings to help readers understand the research.
• The introduction should provide a logical framework for the research, with each paragraph flowing seamlessly into the next. The authors should improve the coherence between paragraphs and offer more connections. The introduction should also clearly highlight the research contributions and aim, articulating the research question and its importance.
• The figures should be presented clearly and concisely, with informative captions that provide sufficient information to help readers understand them.
• The research problem should be stated clearly and concisely to increase readability.
• The authors should clarify the connections between the forecasting and Data Augmentation methods and how each method contributes to the research. They should also provide a detailed explanation of the solution method used and its effectiveness.
• The figures and charts should be high quality, easy to read, visually appealing, and effectively communicate the research findings. Figure 1 should be revised to include more data or more detailed information.
• The authors should provide more insightful and actionable managerial insights based on the research findings and clearly articulate the practical implications of the research for managers and organizations.
• The authors should carefully proofread and edit the paper to correct typos and grammatical errors. They should consider having someone else review the paper for any errors that they may have missed. Finally, the following relevant papers from MDPI journals should be cited to demonstrate the research's relevance to the specific field of study:
1. A novel automatic modulation classification method using attention mechanism and hybrid parallel neural network. Applied Sciences. 2021 Feb 2;11(3):1327.
2. A survey on machine learning-based performance improvement of wireless networks: PHY, MAC and network layer. Electronics. 2021 Jan 29;10(3):318.
3. A novel machine learning approach combined with optimization models for eco-efficiency evaluation. Applied Sciences. 2020 Jul 28;10(15):5210.
4. A Novel Pipeline Age Evaluation: Considering Overall Condition Index and Neural Network Based on Measured Data. Machine Learning and Knowledge Extraction. 2023 Feb 20;5(1):252-68.
5. Radar specific emitter identification based on open-selective kernel residual network. Digital Signal Processing. 2023 Jan 9:103913.
Extensive editing of English language required
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
This study proposes a machine learning model for speech emotion recognition combining CNNs, LSTMs, and CLDNNs. Its main contribution consists in improving the accuracy of speech emotion recognition when compared to related research.
The document is easy to read and follow.
The document is well supported with references, although some sections lack them.
Authors should address the following issues:
In line 21 of the Abstract please correct to “…compared to related research…”
At the end of line 29 authors should include bibliographic references to the referred technologies.
In line 30 authors should clarify the statement. Is speech emotion recognition only performed with artificial intelligence technology?! If not, the statement should be something like: “Speech emotion recognition can be performed using artificial intelligence technology which identifies the emotional state…”
In lines 35 to 40 authors should include bibliographic references.
In line 52 please correct to “…overcome…”
In line 63 authors should clarify the meaning of “…data portability…”
In line 108 please correct to “…some research focused…”
The paragraph in line 120 only references a paper using data augmentation. Authors should include more references addressing this topic in the context of speech recognition.
In line 242 authors should clarify in the text how each MFCC was reshaped into a one-dimensional data format.
In Table 5 the bold text should be standardized.
In line 423 authors say that “…just a little lower (less than 0.1%) than those of [5] and [7]…”. Please include in the text a possible explanation.
The English needs minor revision and spell checking.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
- The Introduction section at some point refers to 28 different papers not tied to a specific claim, followed by a multitude of examples with no references. Overall, the Introduction section needs some restructuring with appropriate references per example/claim.
- Is Figure 1 really needed? It is just three percentages that can be easily conveyed through text. The tables that show the proportions of emotions in each database on the other hand are very helpful to the reader.
- It is nice that the authors provide extensive details about the utilized CNN and CLDNN models.
- The evaluation results are extensive and the authors provide comparisons with multiple state-of-the-art works.
- There are some grammar/syntax issues here and there that need editing. For instance "To make the designed system can recognize" at the very beginning of the abstract.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
This paper has potential for publication, but at least two issues need to be resolved first, as follows.
- the first point is that the paper lacks a clear contextualization of the contribution; we are all currently discussing the opportunity of automatically recognizing and generating text with transformers, while the authors of this paper propose to recognize/classify emotions as well. I think that the average reader could be a little confused and would benefit from a short summary of the long history that has led up to this point, starting from digital voice transmission (e.g., AA. VV. Design and experimental evaluation of an adaptive playout delay control mechanism for packetized audio for use over the internet, Multimedia Tools and Applications, Volume 14, Issue 1, Pages 23-53, May 2001, doi: 10.1023/A:1011303506685) to self-attention mechanisms (choose whatever citation you want);
- the second point is that it is not clear to me (and should be explained) why and how LSTMs are sufficient for the aim of this paper, rather than the better-performing transformers;
- third and final, I find the dataset on which these neural networks are trained very questionable. This aspect can be relevant especially in the presence of biased or imbalanced datasets. To this end, please extend the discussion and refer to: AA. VV. Survey on deep learning with class imbalance. J Big Data 6, 27 (2019). https://doi.org/10.1186/s40537-019-0192-5
Satisfactory
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
The authors answered all my comments and this version is acceptable.
Minor editing of English language required
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
As to point 1), my evaluation does not change, as none of my indications has been followed. We have too many papers asking to be published just because they use a NN; a new one should be published only if it motivates the need for its presence. This paper fails to do so. I had suggested to the authors how to motivate the presence of such a paper, which, in addition, still uses an LSTM while transformers are available. At this point, I am not interested in blocking the publication, yet if my opinion is asked, my opinion is as above. I am confident the Editor can make the right decision alone, as my suggestions have been intentionally neglected.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 3
Reviewer 4 Report
Green light; acceptable quality.