Article
Peer-Review Record

Novel Task-Based Unification and Adaptation (TUA) Transfer Learning Approach for Bilingual Emotional Speech Data

Information 2023, 14(4), 236; https://doi.org/10.3390/info14040236
by Ismail Shahin 1,*,†, Ali Bou Nassif 2,†, Rameena Thomas 1,† and Shibani Hamsa 3,†
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 24 January 2023 / Revised: 2 April 2023 / Accepted: 4 April 2023 / Published: 12 April 2023

Round 1

Reviewer 1 Report

See the attached annotated file.


Overall evaluation: too little information was provided by the authors to explain their experiment, and unnecessary references were included. Please focus on your datasets and methods, and explain them clearly and in detail.

Comments for author File: Comments.pdf

Author Response

Rebuttal-Reviewer 1


Overall evaluation: too little information was provided by the authors to explain their experiment, and unnecessary references were included. Please focus on your datasets and methods, and explain them clearly and in detail.

 

We thank the reviewer for the time and effort devoted to reviewing our paper. We found all comments very useful and constructive, and we have addressed them accordingly. The article has been revised as recommended to include more information about the proposed approach, the datasets used, the evaluation process, and the experimental setup. We have added a new figure to better explain the concept; please see Figure 2 on page 7. Experimental evaluation has been conducted on the three datasets, and the results are given in Tables 3, 4, and 6 on pages 14 and 16. Computational complexity is also assessed and reported in Table 5 on page 16.

Reviewer 2 Report

Gathering training data for machine learning models has become a costly process, and there is a need for increasingly high-performance learning algorithms trained on already existing data. Transfer learning may increase a model’s capacity for learning by transferring knowledge from one domain to another, related one. This article describes research where a transfer learning method is exploited to classify multi-dimensional features from various speech signal datasets. The importance of a novel transfer learning system, Task-based Unification and Adaptation (TUA), is brought forward in this work, aimed at bridging gaps between large-scale upstream training and downstream customization. Different available speech datasets were used to test the proposed system for multidimensional feature coding and transfer learning with respect to average speech emotion recognition, yielding recognition rates between 84.7% and 91.2% for the different datasets. As a model framework for emotion recognition, the proposed system utilizes the benefits of transfer learning to obtain high-performance recognition of speech features relating to emotions across multilingual datasets without repeating complex and time-consuming training procedures. While, for the English and Arabic datasets tested here, the model achieves good results in comparison with previously proposed state-of-the-art models, the framework developed in this research has important drawbacks in terms of computation time and the resources required to build an operable system.

This being acknowledged, the authors should better explain the wider interest of operationalizing such a resource-heavy approach for emotion recognition across no more than two languages. Despite the interest of transfer learning, as explained in the paper, the authors should provide a clear rationale for why it is warranted in the specific context of this study.

Also, the paper dwells on technical details, while the overall structure suffers from the lack of a clear study rationale for the rather narrow and somewhat limited context of emotional feature recognition in speech patterns. The comparison with previously proposed alternative methods needs to be presented in a more detailed and convincing manner, especially in the results section.

Figure 1: The framework block diagram shown here is so basic that it becomes useless. The diagram needs to be improved substantially to convey the essentials of the feature extraction step and the ensuing classification criteria. These also need to be explained more clearly in the text.

 

Figure 6: there is a conceptual problem with the definition of what the authors refer to as emotional speech features in this figure. While categories such as "slow" and "fast" may indirectly reflect emotional states, it is difficult to say which states these would be; an anxious emotional state, for example, may result in a slower speech rate in some individuals and a faster rate in others. Also, discriminating between slow and fast speech rates within or across languages does not justify a computationally heavy model framework, as this can be achieved at low cost by relatively simple technical means. The same holds for the features "soft" and "loud". With "neutral" being the control condition here, this leaves "angry" as the only valid feature of "emotional valence" in the true sense of this concept. Please clearly explain the difference between Figures 2, 3, 4, 5 and 6 in this regard.

The conclusion section needs to be improved to clearly summarize 1) what justifies the costly procedure developed here, applied to the limited speech recognition example dealt with and beyond, and 2) how the approach could be applied to other feature recognition problems in other domains where high-performance transfer learning has become an attractive solution, given the cost of producing new data for training the algorithms.

 

Author Response

Rebuttal-Reviewer 2

Gathering training data for machine learning models has become a costly process, and there is a need for increasingly high-performance learning algorithms trained on already existing data. Transfer learning may increase a model’s capacity for learning by transferring knowledge from one domain to another, related one. This article describes research where a transfer learning method is exploited to classify multi-dimensional features from various speech signal datasets. The importance of a novel transfer learning system, Task-based Unification and Adaptation (TUA), is brought forward in this work, aimed at bridging gaps between large-scale upstream training and downstream customization. Different available speech datasets were used to test the proposed system for multidimensional feature coding and transfer learning with respect to average speech emotion recognition, yielding recognition rates between 84.7% and 91.2% for the different datasets. As a model framework for emotion recognition, the proposed system utilizes the benefits of transfer learning to obtain high-performance recognition of speech features relating to emotions across multilingual datasets without repeating complex and time-consuming training procedures. While, for the English and Arabic datasets tested here, the model achieves good results in comparison with previously proposed state-of-the-art models, the framework developed in this research has important drawbacks in terms of computation time and the resources required to build an operable system.

We thank the reviewer for the time and effort devoted to reviewing our paper. We found all comments very useful and constructive, and we have addressed them accordingly.

This being acknowledged, the authors should better explain the wider interest of operationalizing such a resource-heavy approach for emotion recognition across no more than two languages. Despite the interest of transfer learning, as explained in the paper, the authors should provide a clear rationale for why it is warranted in the specific context of this study.

We thank the reviewer for this comment. The manuscript has been updated to include the relevance of transfer learning to emotion recognition. Please see pages 2-3, label [R2,1].
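For context, the following is a minimal, generic sketch of the transfer-learning pattern under discussion: an upstream-trained speech encoder is reused and only a small emotion-classification head is adapted to a new emotional speech dataset. It is illustrative only and does not reproduce the paper's TUA architecture; all names and dimensions (backbone, EmotionHead, the feature sizes) are hypothetical.

```python
# Generic transfer-learning sketch (illustrative; not the TUA system from the paper).
# A backbone pretrained on a large source corpus is frozen, and only a lightweight
# emotion-classification head is trained on the smaller downstream dataset.
import torch
import torch.nn as nn


class EmotionHead(nn.Module):
    """Lightweight classifier trained on the target (downstream) dataset."""

    def __init__(self, feat_dim: int, num_emotions: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_emotions)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.fc(features)


# Hypothetical pretrained feature extractor (stands in for any upstream-trained encoder).
backbone = nn.Sequential(nn.Linear(180, 256), nn.ReLU(), nn.Linear(256, 128))

for param in backbone.parameters():      # freeze the upstream knowledge
    param.requires_grad = False

head = EmotionHead(feat_dim=128, num_emotions=6)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)  # only the head is updated

logits = head(backbone(torch.randn(8, 180)))              # example forward pass
```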

Also, the paper dwells on technical details, while the overall structure suffers from the lack of a clear study rationale for the rather narrow and somewhat limited context of emotional feature recognition in speech patterns. The comparison with previously proposed alternative methods needs to be presented in a more detailed and convincing manner, especially in the results section.

Thank you for your comment. We appreciate your feedback on the paper's overall structure and the presentation of the comparison between alternative methods:

  1. Regarding the study rationale, the main focus of the paper is to investigate the effectiveness of various machine learning algorithms in recognizing emotional features in speech patterns. This is a relevant and important area of research given the potential applications in fields such as healthcare, human-computer interaction, and affective computing.
  2. However, we acknowledge that the paper could have presented a more detailed study rationale to provide a clearer context for the research.
  3. Regarding the comparison between alternative methods, we agree that this is an important aspect of the paper, and we have made efforts to present it in a clear and convincing manner in the results section. We have compared the performance of the proposed algorithms with several alternative methods for the three different datasets. We have also provided statistical analyses to demonstrate the superiority of our proposed algorithms in terms of accuracy, precision, recall, and F1-score. Please see pages 10-16.

Figure 1: The framework block diagram shown here is so basic that it becomes useless. The diagram needs to be improved substantially to convey the essentials of the feature extraction step and the ensuing classification criteria. These also need to be explained more clearly in the text.

Thanks for your advice. A detailed block schematic has been added to the manuscript. Please see Figure 2 on page 7.

Figure 6: there is a conceptual problem with the definition of what the authors refer to as emotional speech features in this figure. While categories such as "slow" and "fast" may indirectly reflect emotional states, it is difficult to say which states these would be; an anxious emotional state, for example, may result in a slower speech rate in some individuals and a faster rate in others. Also, discriminating between slow and fast speech rates within or across languages does not justify a computationally heavy model framework, as this can be achieved at low cost by relatively simple technical means. The same holds for the features "soft" and "loud". With "neutral" being the control condition here, this leaves "angry" as the only valid feature of "emotional valence" in the true sense of this concept. Please clearly explain the difference between Figures 2, 3, 4, 5 and 6 in this regard.

The Speech Under Simulated and Actual Stress (SUSAS) dataset uses descriptors such as loud, slow, and fast instead of emotion labels like angry or happy for several reasons:

  1. Stress as the primary focus: The primary focus of the SUSAS dataset is on speech under stress, which may not necessarily be associated with specific emotions such as anger or happiness. Descriptors like loud, slow, and fast can capture important aspects of speech production that are relevant to stress, such as changes in pitch or speaking rate.
  2. Multidimensionality of emotions: Emotions can be complex and multidimensional, and it may not always be accurate or useful to categorize them into discrete labels such as angry or happy. Descriptors like loud, slow, and fast can capture different aspects of speech production that may be related to different emotions.
  3. Avoidance of subjective interpretation: Emotion labels like angry or happy can be subjective and open to interpretation, which can make it difficult to compare results across studies or between raters. Descriptors like loud, slow, and fast are more objective and easier to measure consistently.

 

Overall, the use of descriptors like loud, slow, and fast in the SUSAS dataset allows for a more nuanced and objective analysis of speech production under stress, without the limitations of discrete emotion labels.

The conclusion section needs to be improved to clearly summarize 1) what justifies the costly procedure developed here, applied to the limited speech recognition example dealt with and beyond, and 2) how the approach could be applied to other feature recognition problems in other domains where high-performance transfer learning has become an attractive solution, given the cost of producing new data for training the algorithms.

We thank the reviewer for this comment. The conclusion section has been updated as advised. Please see the conclusion on page 16, label [R6,3].

Author Response File: Author Response.docx

Reviewer 3 Report

The article focuses on deep machine transfer learning technologies used for automatic recognition of human emotions through speech acoustic characteristics. The experiments were conducted using three datasets: ESD, SUSAS, and RAVDESS. The topic is certainly relevant. From a comparative analysis, it can be inferred that the quality of results still depends on the input data, which is influenced by human characteristics and emotional reactions to various events. However, in my opinion, there are some shortcomings that need to be addressed.

 

1) The first part of the article notes that human speech is a simple and effective form of communication. However, it is well known that speech emotion recognition can be difficult in noisy environments, even with noise reduction. One solution to this problem is to consider various physiological signals, or video or audiovisual analysis. For example, the RAVDESS dataset used by the authors is an audiovisual dataset. Therefore, this gap needs to be filled and a description should be added, for example of the best results for the audio modality, for video with different evaluation metrics (https://paperswithcode.com/sota/facial-expression-recognition-on-ravdess), and for audiovisual analysis (https://paperswithcode.com/sota/facial-emotion-recognition-on-ravdess) on the RAVDESS dataset. Including such a description would expand the references to previous work of the global scientific community (from 2021 to 2022), which is regularly presented at conferences that focus on audio processing or multimodal data processing (such as CVPR, INTERSPEECH, ICASSP, etc.) or in journals ranked in Q1. Currently, the introduction and previous work sections mention that emotion recognition can be achieved through analysis of audio, video, heart rate, and so on, but this is not explored further.

 

2) The question arises as to why these three datasets were used instead of others. It would be reasonable to mention other datasets and explain why they were not suitable for this study.

 

3) The section on experiments needs to be expanded and include more details. Have any data augmentation techniques been used? Have any speed and training variations been applied (such as cosine annealing)? If not, why not?

 

4) The style of the article needs minor revisions to address spelling and punctuation errors, but it is generally easy to read.

 

It seems to me that all the proposed additions will only improve this article. The article still needs to be worked on and expanded (especially the first parts).

Author Response

Rebuttal-Reviewer 3

The article focuses on deep machine transfer learning technologies used for automatic recognition of human emotions through speech acoustic characteristics. The experiments were conducted using three datasets: ESD, SUSAS, and RAVDESS. The topic is certainly relevant. From a comparative analysis, it can be inferred that the quality of results still depends on the input data, which is influenced by human characteristics and emotional reactions to various events. However, in my opinion, there are some shortcomings that need to be addressed.

We thank the reviewer for the time and effort devoted to reviewing our paper. We found all comments very useful and constructive, and we have addressed them accordingly.

1) The first part of the article notes that human speech is a simple and effective form of communication. However, it is well known that speech emotion recognition can be difficult in noisy environments, even with noise reduction. One solution to this problem is to consider various physiological signals, or video or audiovisual analysis. For example, the RAVDESS dataset used by the authors is an audiovisual dataset. Therefore, this gap needs to be filled and a description should be added, for example of the best results for the audio modality, for video with different evaluation metrics (https://paperswithcode.com/sota/facial-expression-recognition-on-ravdess), and for audiovisual analysis (https://paperswithcode.com/sota/facial-emotion-recognition-on-ravdess) on the RAVDESS dataset. Including such a description would expand the references to previous work of the global scientific community (from 2021 to 2022), which is regularly presented at conferences that focus on audio processing or multimodal data processing (such as CVPR, INTERSPEECH, ICASSP, etc.) or in journals ranked in Q1. Currently, the introduction and previous work sections mention that emotion recognition can be achieved through analysis of audio, video, heart rate, and so on, but this is not explored further.

Thank you so much for the valuable advice. This research focuses solely on using audio signals for emotion recognition, rather than incorporating other modalities such as facial expressions, body language, or text; model development, execution, and evaluation are carried out on the basis of speech signals only.

2) The question arises as to why these three datasets were used instead of others. It would be reasonable to mention other datasets and explain why they were not suitable for this study.

Thank you so much for your valuable advice. In this study, we considered three distinct datasets in order to incorporate various challenges.

The first dataset mentioned is RAVDESS, which is an English dataset. This dataset likely presents challenges related to differences in accents, dialects, and speaking styles among the speakers included in the dataset.

The second dataset is ESD, which is an Arabic dataset. This dataset likely presents challenges related to differences between the Arabic and English languages, as well as differences in accents, dialects, and speaking styles among the speakers included in the dataset.

The third dataset used is SUSAS, which incorporates speech samples from stressful talking conditions. This dataset likely presents challenges related to variations in speech patterns and characteristics that may occur during periods of stress or anxiety, such as changes in pitch, volume, and tempo.

Overall, the consideration of these three diverse datasets likely allows for a more comprehensive understanding of the challenges involved in analyzing speech data across different languages, accents, dialects, and speaking styles, as well as in different emotional and contextual situations. 

3) The section on experiments needs to be expanded and include more details. Have any data augmentation techniques been used? Have any speed and training variations been applied (such as cosine annealing)? If not, why not?

Thank you for your valuable advice. We have augmented the data by mixing it with noise at ratios of 2:1 and 3:1 to scale the dataset and to improve the noise robustness of the system. We used the Adam optimizer with a scheduled learning rate. We train the model for 150 epochs with an initial learning rate of 0.0005; after the 10th epoch, the learning rate is halved every ten epochs. Please see page 10, label [R3,3].
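As an illustration of the schedule described above, here is a minimal sketch assuming a PyTorch training loop; the noise-mixing helper, the placeholder network, and all variable names are hypothetical and not taken from the paper's code.

```python
# Illustrative sketch only (not the authors' implementation): additive-noise
# augmentation at 2:1 and 3:1 clean-to-noise ratios, plus the stepwise learning-rate
# schedule described in the response (LR 0.0005 for the first 10 epochs, then halved
# every further 10 epochs).
import torch
import torch.nn as nn


def mix_with_noise(clean: torch.Tensor, noise: torch.Tensor, ratio: float) -> torch.Tensor:
    """Mix a clean waveform with noise at a given clean:noise ratio (e.g. 2.0 or 3.0)."""
    noise = noise[..., : clean.shape[-1]]            # trim noise to the clip length
    return (ratio * clean + noise) / (ratio + 1.0)   # simple weighted sum


def lr_at_epoch(epoch: int, base_lr: float = 5e-4) -> float:
    """Base LR for epochs 0-9, then halved every additional 10 epochs."""
    if epoch < 10:
        return base_lr
    return base_lr * 0.5 ** ((epoch - 10) // 10 + 1)


model = nn.Sequential(nn.Linear(180, 128), nn.ReLU(), nn.Linear(128, 6))  # placeholder network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)

for epoch in range(150):
    for group in optimizer.param_groups:             # apply the scheduled learning rate
        group["lr"] = lr_at_epoch(epoch)
    # train_one_epoch(model, optimizer, augmented_batches)  # training step omitted
```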

4) The style of the article needs minor revisions to address spelling and punctuation errors, but it is generally easy to read.

We thank the reviewer for this comment. We have fixed the spelling and punctuation errors.

It seems to me that all the proposed additions will only improve this article. The article still needs to be worked on and expanded (especially the first parts).

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Thank you for revising the manuscript. See the attached file for my comments (mainly on the lack of description of transfer learning in the system).

Comments for author File: Comments.pdf

Author Response

Rebuttal-Reviewer 1

Thank you for revising the manuscript. See the attached file for my comments (mainly on the lack of description of transfer learning in the system).

 

Thank you for the valuable suggestion. The manuscript has been modified as advised.

  1. The details of the transfer learning system are given on pages 8-9 (highlighted and labelled as [R1,1]).
  2. Figure 2 has been modified as advised. Please see page 8.
  3. New benchmark has been added to Table 3 as advised. Please see page 13.
  4. We were unable to add the suggested benchmark to Table 6, since the precision, recall, and F1-score values are not reported in that paper.

Reviewer 2 Report

Although the revisions performed are minimal, the paper may be published in its present form.

Author Response

Rebuttal-Reviewer 2

No comments.

 

Thank you very much for your time.

Author Response File: Author Response.docx

Reviewer 3 Report

The authors have made some improvements to the article, but I have additional comments on the revised version. Firstly, while the drawings are of acceptable quality, presenting them in vector format would be better. Quality graphics are important in articles, and vector graphics provide the best image quality.

 

Secondly, the authors state in their response that their research solely focuses on using audio cues for emotion recognition, excluding other modalities such as facial expression, body language, or text. Although I agree with this statement, it is worth noting that most modern conferences and journals, such as INTERSPEECH, ICASSP, ICMI, CVPR, and other Q1 level journals, use multimodal approaches in most cases. Therefore, it would be useful for the authors to include the best works on at least the video modality in their review of previous studies. This would explain why the authors chose only the audio modality and provide insights for future research.

 

Please see my previous review on this article:

1) The first part of the article notes that human speech is a simple and effective form of communication. However, it is well known that speech emotion recognition can be difficult in noisy environments, even with noise reduction. One solution to this problem is to consider various physiological signals, or video or audiovisual analysis. For example, the RAVDESS dataset used by the authors is an audiovisual dataset. Therefore, this gap needs to be filled and a description should be added, for example of the best results for the audio modality, for video with different evaluation metrics (https://paperswithcode.com/sota/facial-expression-recognition-on-ravdess), and for audiovisual analysis (https://paperswithcode.com/sota/facial-emotion-recognition-on-ravdess) on the RAVDESS dataset. Including such a description would expand the references to previous work of the global scientific community (from 2021 to 2022), which is regularly presented at conferences that focus on audio processing or multimodal data processing (such as CVPR, INTERSPEECH, ICASSP, etc.) or in journals ranked in Q1. Currently, the introduction and previous work sections mention that emotion recognition can be achieved through analysis of audio, video, heart rate, and so on, but this is not explored further.

 

The article deserves attention and has the potential to be even more useful with some adjustments. I strongly recommend considering the shortcomings noted above. It's important to note that this is not a conference paper, where the length of the paper may limit the amount of information that can be included. As a journal article, additional details can only serve to attract readers and enhance the value of the article.

Author Response

Rebuttal-Reviewer 3

The authors have made some improvements to the article, but I have additional comments on the revised version. Firstly, while the drawings are of acceptable quality, presenting them in vector format would be better. Quality graphics are important in articles, and vector graphics provide the best image quality.

Thank you so much for the advice. The figures have been updated as suggested. Please see Fig. 1 on page 7 and Fig. 2 on page 8.

 

Secondly, the authors state in their response that their research solely focuses on using audio cues for emotion recognition, excluding other modalities such as facial expression, body language, or text. Although I agree with this statement, it is worth noting that most modern conferences and journals, such as INTERSPEECH, ICASSP, ICMI, CVPR, and other Q1 level journals, use multimodal approaches in most cases. Therefore, it would be useful for the authors to include the best works on at least the video modality in their review of previous studies. This would explain why the authors chose only the audio modality and provide insights for future research.

 

Thank you for taking the time to review our work on identifying emotions from audio signals. We appreciate your feedback and would like to acknowledge your suggestion regarding the potential benefits of using multimodal approaches for emotion recognition.

We completely agree with your suggestion that incorporating information from multiple modalities, such as audio, visual, and physiological signals, can potentially improve the accuracy of emotion recognition. While our current work focuses on identifying emotions from audio signals alone, we will definitely consider multimodal approaches in our future works.

We believe that exploring the potential benefits of multimodal approaches is an important area for future research in emotion recognition, and we appreciate your valuable feedback. Thank you again for your review, and we hope to continue improving our work in this field.

We will take your valuable comments into account in our future work.

 

The article deserves attention and has the potential to be even more useful with some adjustments. I strongly recommend considering the shortcomings noted above. It's important to note that this is not a conference paper, where the length of the paper may limit the amount of information that can be included. As a journal article, additional details can only serve to attract readers and enhance the value of the article.

Many thanks for your valuable comments. We believe that our paper has improved now that we have addressed all the comments raised by the three reviewers.

Author Response File: Author Response.docx

Round 3

Reviewer 1 Report

I have no more comments.
