Emotion-Recognition System for Smart Environments Using Acoustic Information (ERSSE)
Abstract
1. Introduction
2. Theoretical Framework
2.1. ReM-AM
2.1.1. ODA for ReM-AM
2.1.2. Autonomic Cycle of Intelligent Sound Analysis (ISA)
- Observation (called the CAD module): the sensors obtain the acoustic information, allowing for the precise perception of the sound field in the SE.
- Analysis (called the ISUA module): the information is classified and stored for the next phase.
- Decision Making (called the DM module): determines the necessities of the current environment and identifies the tasks that can be executed in the given context.
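The three modules above form an observe-analyze-decide loop. The following minimal Python sketch illustrates that control flow only; the class names (CADModule, ISUAModule, DMModule), the method names, and the placeholder analysis values are illustrative assumptions and do not reproduce the actual ERSSE implementation.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class SoundEvent:
    """Raw acoustic observation captured in the smart environment (SE)."""
    samples: Any        # audio samples coming from the SE sensors
    sample_rate: int
    metadata: dict


class CADModule:
    """Observation: the sensors capture the sound field of the SE."""
    def observe(self, raw_audio, sample_rate, metadata=None) -> SoundEvent:
        return SoundEvent(raw_audio, sample_rate, metadata or {})


class ISUAModule:
    """Analysis: classify and store the observed information."""
    def analyze(self, event: SoundEvent) -> dict:
        # A real system would extract the acoustic descriptors and run a classifier here.
        return {"emotion": "neutral", "subject": "adult", "event": event}


class DMModule:
    """Decision Making: determine the task to execute in the current context."""
    def decide(self, analysis: dict) -> str:
        return "no action" if analysis["emotion"] == "neutral" else "notify person in charge"


def autonomic_cycle(cad: CADModule, isua: ISUAModule, dm: DMModule,
                    raw_audio, sample_rate) -> str:
    """One pass of the Observation -> Analysis -> Decision Making cycle."""
    event = cad.observe(raw_audio, sample_rate)
    analysis = isua.analyze(event)
    return dm.decide(analysis)


if __name__ == "__main__":
    action = autonomic_cycle(CADModule(), ISUAModule(), DMModule(),
                             raw_audio=[0.0] * 16000, sample_rate=16000)
    print(action)  # -> "no action" with the placeholder analysis above
```

In ERSSE, these three stages correspond to the CAD, ISUA, and DM components described above.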
2.2. Sound Pattern for Emotion Recognition
3. Specification of ERSSE
- Extraction: When a sound event is perceived by the system, the CAD component extracts the acoustic descriptors of the hierarchical pattern shown in Table 1, along with the associated emotion metadata from the TESS dataset.
- Analysis: This task performs the emotional classification of the sound events by analyzing the information previously prepared by the CAD module. To do this, a classification model, trained beforehand on the TESS dataset with the values of the sound descriptors shown in Table 1, is applied to the new sound events. Specifically, each sound event is classified as exhibiting one of the following seven emotions [21,22,23]:
- Anger: Response to interference with the pursuit of a goal.
- Disgust: Repulsion by something.
- Fear: Response to the threat of harm.
- Happiness: Feelings that are enjoyed.
- Sadness: Response to the loss of something.
- Surprise: Response to a sudden unexpected event. This is the briefest emotion.
- Neutral: No reaction.
- Recommendation: The system generates a response from the combination of sound events detected in the SE. The pertinent action to recommend is given in Table 2, which details, for each emotion and subject in the SE, the action to be taken as a response, either shown by the system or deployed in the SE (a minimal lookup sketch is given below).
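To make the Recommendation task concrete, the following minimal Python sketch performs a Table 2 lookup for a detected (emotion, subject) pair. The dictionary holds only an excerpt of the table, and the `recommend` helper and default message are assumptions made for illustration, not part of ERSSE.

```python
# Minimal sketch of the Table 2 lookup (excerpt only); keys and the fallback
# message are illustrative assumptions.
ACTIONS = {
    ("anger", "children"): "Warning signal to the parents/person/authority in charge.",
    ("fear", "adult"): "Alert signal to the person in charge in case of illness.",
    ("sadness", "elder"): "Warning to the person in charge in case of ambient assisted living.",
    ("happiness", "adult"): "No action required.",
}


def recommend(emotion: str, subject: str) -> str:
    """Return the action as a response for a detected (emotion, subject) pair."""
    if emotion.lower() == "neutral":
        return "No action."
    return ACTIONS.get((emotion.lower(), subject.lower()),
                       "No action defined for this pair.")


print(recommend("Fear", "Adult"))  # Alert signal to the person in charge in case of illness.
```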
4. Case Study
4.1. Description of Case Study
4.2. Tasks of the Autonomic Cycle of ERSSE
- Extraction: The CAD component extracts the information available in the SE for the audio descriptors defined in Table 1, using techniques from Ref. [24]; specifically, the Fourier transform is applied in tasks related to noise robustness and dysarthric speech, and the information is separated and pre-processed for the subsequent analysis.
- Analysis: First, this task trains a classification model using the TESS dataset, which includes the descriptors/features defined in Table 1 and the emotions associated with their values. The dataset contains 2800 audio files, each between 1 and 4 s long, portraying different emotions. Once the model is trained, the ISUA receives the information from the previous task and classifies it. The classification model was built using the random forest method, similar to the model presented in Ref. [3], and the dataset was divided into 80% for training, 10% for validation, and 10% for testing, following a fivefold cross-validation scheme. Initially, 33% of the dataset was used, yielding an accuracy of 60% on the test set and 66% on the validation set. The data subset was then expanded to 66%, and finally to 85%, to see whether the results improved; with this last subset, the system achieved an accuracy of 88% on the test set and 92% on the validation set (a sketch of this training setup is given after this list).
- Recommendation: The recommendation system is based on the information in Table 2: from the data produced by the other modules (emotion and subject), it determines the action to take. For this SE (a smart movie theater) and the events detected by the previous tasks, it defines the actions to be carried out; Table 4 lists the recommendations for the events described in the previous section.
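The sketch below (referenced in the Analysis item above) shows one way the Extraction and Analysis tasks could be chained: Fourier-based descriptors are computed per audio file and a random forest is trained with an 80/10/10 split. The folder layout tess/<emotion>/<file>.wav, the feature choices, and the hyperparameters are assumptions for illustration, not the authors' exact pipeline.

```python
from pathlib import Path

import librosa
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def extract_features(wav_path: Path) -> np.ndarray:
    """Fourier-based descriptors for one sound event (frequency, duration, intensity, MFCCs)."""
    y, sr = librosa.load(wav_path, sr=None)
    spectrum = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    dominant_freq = freqs[np.argmax(spectrum)]             # frequency descriptor (Hz)
    duration = librosa.get_duration(y=y, sr=sr)            # duration descriptor (s)
    intensity = float(np.mean(librosa.feature.rms(y=y)))   # intensity descriptor
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)
    return np.hstack([dominant_freq, duration, intensity, mfcc])


# Build the feature matrix from a hypothetical TESS folder layout: tess/<emotion>/<file>.wav
wav_files = sorted(Path("tess").glob("*/*.wav"))
X = np.array([extract_features(p) for p in wav_files])
y = np.array([p.parent.name for p in wav_files])  # emotion label taken from the folder name

# 80% training, 10% validation, 10% testing, as described in the case study.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, stratify=y_tmp,
                                                random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

The stratified two-stage split mirrors the 80/10/10 partition described above; the exact accuracies obtained will depend on the descriptors and hyperparameters actually used.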
4.3. Analysis of the Behavior of the Autonomic Cycle
5. Comparison with Similar Works
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Zhang, S.; Yang, Y.; Chen, C.; Zhang, X.; Leng, Q.; Zhao, X. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects. Expert Syst. Appl. 2023, 237, 121692. [Google Scholar] [CrossRef]
- Ahmed, N.; Al Aghbari, Z.; Girija, S. A systematic survey on multimodal emotion recognition using learning algorithms. Intell. Syst. Appl. 2023, 17, 200171. [Google Scholar] [CrossRef]
- Das, S.; Imtiaz, S.; Neom, N.H.; Siddique, N.; Wang, H. A hybrid approach for Bangla sign language recognition using deep transfer learning model with random forest classifier. Expert Syst. Appl. 2023, 213, 118914. [Google Scholar] [CrossRef]
- Mishra, S.P.; Warule, P.; Deb, S. Variational mode decomposition based acoustic and entropy features for speech emotion recognition. Appl. Acoust. 2023, 212, 109578. [Google Scholar] [CrossRef]
- Bhangale, K.; Kothandaraman, M. Speech Emotion Recognition Based on Multiple Acoustic Features and Deep Convolutional Neural Network. Electronics 2023, 12, 839. [Google Scholar] [CrossRef]
- Li, X.; Shi, X.; Hu, D.; Li, Y.; Zhang, Q.; Wang, Z.; Unoki, M.; Akagi, M. Music Theory-Inspired Acoustic Representation for Speech Emotion Recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 2534–2547. [Google Scholar] [CrossRef]
- Zhang, X.; Zhang, F.; Cui, X.; Zhang, W. Speech Emotion Recognition with Complementary Acoustic Representations. In Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar, 9–12 January 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 846–852. [Google Scholar]
- Cong, G.; Qi, Y.; Li, L.; Beheshti, A.; Zhang, Z.; van den Hengel, A.; Yang, M.; Yan, C.; Huang, Q. StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing. arXiv 2024, arXiv:2402.12636. [Google Scholar]
- Zhang, Z.; Li, L.; Cong, G.; Yin, H.; Gao, Y.; Yan, C.; van den Hengel, A.; Qi, Y. From Speaker to Dubber: Movie Dubbing with Prosody and Duration Consistency Learning. ACM Multimedia, 2024. Available online: https://openreview.net/pdf?id=QHRNR64J1m (accessed on 5 September 2024).
- Cong, G.; Li, L.; Qi, Y.; Zha, Z.; Wu, Q.; Wang, W.; Jiang, B.; Yang, M.; Huang, Q. Learning to Dub Movies via Hierarchical Prosody Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14687–14697. [Google Scholar]
- Godøy, R.I. Perceiving Sound Objects in the Musique Concrète. Front. Psychol. 2021, 12, 672949. [Google Scholar] [CrossRef] [PubMed]
- Turpault, N.; Serizel, R. Training sound event detection on a heterogeneous dataset. arXiv 2020, arXiv:2007.03931. [Google Scholar]
- Santiago, G.; Aguilar, J. Integration of ReM-AM in smart environments. WSEAS Trans. Comput. 2019, 18, 97–100. [Google Scholar]
- Liu, C.; Wang, Y.; Sun, X.; Wang, Y.; Fang, F. Decoding six basic emotions from brain functional connectivity patterns. Sci. China Life Sci. 2023, 66, 835–847. [Google Scholar] [CrossRef] [PubMed]
- Aguilar, J.; Jerez, M.; Exposito, E.; Villemur, T. CARMiCLOC: Context Awareness Middleware in Cloud Computing. In Proceedings of the Latin American Computing Conference (CLEI), Arequipa, Peru, 19–23 October 2015. [Google Scholar]
- Santiago, G.; Aguilar, J. Ontological model for the acoustic management in intelligent environments. Appl. Comput. Inform. 2022. [Google Scholar] [CrossRef]
- Sánchez, M.; Exposito, E.; Aguilar, J. Implementing self-* autonomic properties in self-coordinated manufacturing processes for the Industry 4.0 context. Comput. Ind. 2020, 121, 103247. [Google Scholar] [CrossRef]
- Chalapathi, M.M.V.; Kumar, M.R.; Sharma, N.; Shitharth, S. Ensemble Learning by High-Dimensional Acoustic Features for Emotion Recognition from Speech Audio Signal. Secur. Commun. Netw. 2022, 2022, 8777026. [Google Scholar] [CrossRef]
- Pichora-Fuller, M.K.; Dupuis, K. Toronto Emotional Speech Set (TESS); Version 1.0; Borealis: Toronto, Canada, 2020. [Google Scholar] [CrossRef]
- Zou, Z.; Ergan, S. Towards emotionally intelligent buildings: A convolutional neural network based approach to classify human emotional experience in virtual built environments. Adv. Eng. Inform. 2023, 55, 101868. [Google Scholar] [CrossRef]
- Cordero, J.; Aguilar, J.; Aguilar, K.; Chávez, D.; Puerto, E. Recognition of the Driving Style in Vehicle Drivers. Sensors 2020, 20, 2597. [Google Scholar] [CrossRef] [PubMed]
- Salazar, C.; Aguilar, J.; Monsalve-Pulido, J.; Montoya, E. Affective recommender systems in the educational field. A systematic literature review. Comput. Sci. Rev. 2021, 40, 100377. [Google Scholar]
- Ekman, P.; Cordaro, D. What is Meant by Calling Emotions Basic. Emot. Rev. 2011, 3, 364–370. [Google Scholar] [CrossRef]
- Loweimi, E.; Yue, Z.; Bell, P.; Renals, S.; Cvetkovic, Z. Multi-Stream Acoustic Modelling Using Raw Real and Imaginary Parts of the Fourier Transform. IEEE/ACM Trans. Audio Speech Lang. Process. 2023, 31, 876–890. [Google Scholar] [CrossRef]
Table 1. Sound events pattern.

| Descriptor | Description |
|---|---|
| Subject | Adult/Child/Elder. |
| Frequency | Frequencies present in a sound event (Hz). |
| Wavelength | Distance between wave crests; it depends on the medium through which the sound wave travels. |
| Duration | Length of the sound event, normally considered in seconds. |
| Intensity | Power carried by the acoustic wave. |
| Valence | Emotional value of a sound: positive/negative/neutral. |
| Type of SE | Smart classroom, ambient assisted living, smart concert hall, etc. |
| Behavior | Current behavior of the user. |
| Physiological conditions | Heart rate, blood pressure, face color. |
Table 2. Actions as a response according to the detected emotion and subject.

| Emotion + Subject | Action as Response |
|---|---|
| Anger + Children | Warning signal to the parents/person/authority in charge. |
| Anger + Adult | Warning to the person in charge in case of illness. |
| Anger + Elder | Warning to the person in charge in case of ambient assisted living. |
| Disgust + Children | Warning signal to the parents/person/authority in charge. |
| Disgust + Adult | Warning to the person in charge in case of illness. |
| Disgust + Elder | Warning to the person in charge in case of ambient assisted living. |
| Fear + Children | Alert signal to the parents/person/authority in charge (supervision required). |
| Fear + Adult | Alert signal to the person in charge in case of illness. |
| Fear + Elder | Alert signal to the person in charge in case of ambient assisted living. |
| Happiness + Children | No action required. |
| Happiness + Adult | No action required. |
| Happiness + Elder | No action required. |
| Sadness + Children | Warning signal to the parents/person/authority in charge (supervision required). |
| Sadness + Adult | Warning to the person in charge in case of illness. |
| Sadness + Elder | Warning to the person in charge in case of ambient assisted living. |
| Surprise + Children | Warning signal to the parents/person/authority in charge. |
| Surprise + Adult | Warning to the person in charge in case of illness. |
| Surprise + Elder | Warning to the person in charge in case of ambient assisted living. |
| Neutral | No action. |
Table 3. Tasks of the autonomic cycle of ERSSE.

| Tasks | Steps |
|---|---|
| Extraction | Identification of acoustic descriptors |
| Analysis | Data classification |
| Recommendation | Definition of actions according to the detected emotion |
Table 4. Recommendations for the events detected in the case study.

| Event | Emotion and Subject Detected | Action Proposed |
|---|---|---|
| 1 | Laughs + Adult/Children/Elder | Adding to the soundscape. |
| 2 | Disgust + Adult/Children/Elder | Warning signal to the authority in charge. |
| 3 | Disgust + Children | Warning signal to the parents/person/authority in charge. |
| 4 | Fear + Adult/Children/Elder | Warning signal to the authority in charge. |
Share and Cite
Santiago, G.; Aguilar, J.; García, R. Emotion-Recognition System for Smart Environments Using Acoustic Information (ERSSE). Information 2024, 15, 677. https://doi.org/10.3390/info15110677