User Identity Protection in Automatic Emotion Recognition through Disguised Speech
Abstract
1. Introduction
1.1. Mental Health and Affective Speech
1.2. Privacy Concerns Related to Speech
1.3. Speech Disguising
1.4. Contribution
- Identification of acoustic features that are not affected by disguising speech;
- Evaluation of acoustic features extracted from disguised speech for affect recognition, and comparison with features extracted from non-disguised speech;
- Demonstration of transfer learning of acoustic features from non-disguised speech to disguised speech for affect recognition, and analysis of their generalisability.
2. Materials and Methods
2.1. Emotion Recognition System
2.1.1. Hardware Components
2.1.2. Software Components
2.2. Data Sets
2.3. Identity Protection
2.4. Acoustic Features
2.4.1. Emobase
2.4.2. ComParE
2.4.3. eGeMAPS
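All three feature sets (emobase, ComParE and eGeMAPS) are defined in the openSMILE toolkit [22,23]. As a minimal sketch of the extraction step, the snippet below uses the opensmile Python package to compute the functionals of each set for a single audio file; the file path is a placeholder, and the exact openSMILE configuration used in the study may differ.

```python
# Minimal sketch: extracting the three feature sets with the opensmile
# Python package. The input path is a placeholder (assumption); the
# study's exact extraction configuration is not shown in this excerpt.
import opensmile

feature_sets = {
    "emobase": opensmile.FeatureSet.emobase,       # 988 functionals
    "ComParE": opensmile.FeatureSet.ComParE_2016,  # 6373 functionals
    "eGeMAPS": opensmile.FeatureSet.eGeMAPSv02,    # 88 functionals
}

for name, fs in feature_sets.items():
    smile = opensmile.Smile(
        feature_set=fs,
        feature_level=opensmile.FeatureLevel.Functionals,
    )
    # Returns a pandas DataFrame with one row of functionals per file.
    features = smile.process_file("speech.wav")
    print(name, features.shape)
```

Each call to process_file returns one row of functionals per recording, with 988, 6373 and 88 columns for emobase, ComParE and eGeMAPS, respectively.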
2.5. Statistical Analysis
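The sketch below illustrates one way to carry out the per-feature comparisons summarised in the list that follows: each feature is compared between paired non-disguised and disguised versions of the same recordings, and features whose differences are not statistically significant are retained. The Wilcoxon signed-rank test and the 0.05 threshold here are assumptions; the excerpt does not name the test used in the study.

```python
# Minimal sketch, assuming paired recordings and a Wilcoxon signed-rank
# test (the study's actual test is not shown in this excerpt).
import numpy as np
from scipy.stats import wilcoxon

def unaffected_features(X_orig, X_disg, alpha=0.05):
    """Return indices of features with no significant difference.

    X_orig, X_disg: arrays of shape (n_recordings, n_features),
    row-aligned so that row i is the same recording in both arrays.
    """
    keep = []
    for j in range(X_orig.shape[1]):
        # Note: features that are identical across conditions would need
        # special handling (wilcoxon rejects all-zero differences).
        stat, p = wilcoxon(X_orig[:, j], X_disg[:, j])
        if p >= alpha:  # no statistically significant difference
            keep.append(j)
    return keep
```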
- For the emobase feature set, 257 of the 988 features showed no statistically significant differences between the non-disguised and disguised speech signals. Several functionals of mfcc, fftMag, ZCR, energy, loudness and intensity are not affected by the speech alteration.
- For the ComParE feature set, 2491 of the 6373 features showed no statistically significant differences between non-disguised and disguised speech signals. Some mfcc, fftMag, audiospec, HNR, ZCR, energy, RASTA, jitter and shimmer functionals are not affected by the speech alteration procedure. The full lists of emobase and ComParE features tested are available through the above-mentioned git repository.
- For the eGeMAPS feature set, 24 of the 88 features showed no statistically significant differences. The full list of those features is shown below:
2.6. Classification Methods
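The result tables in Section 4 report six classifiers: random forest (RF), decision tree (DT), k-nearest neighbours (KNN), naive Bayes (NB), support vector machine (SVM) and linear discriminant analysis (LDA). A minimal scikit-learn sketch of this classifier battery is given below; the hyperparameters are library defaults and an assumption, as the study's settings are not shown in this excerpt.

```python
# Minimal sketch of the six classifiers named in the result tables.
# Hyperparameters are scikit-learn defaults (an assumption).
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

classifiers = {
    "RF": RandomForestClassifier(),
    "DT": DecisionTreeClassifier(),
    "KNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
    "SVM": SVC(),
    "LDA": LinearDiscriminantAnalysis(),
}

def train_all(X_train, y_train):
    """Fit each classifier on standardized acoustic functionals."""
    fitted = {}
    for name, clf in classifiers.items():
        model = make_pipeline(StandardScaler(), clf)
        model.fit(X_train, y_train)
        fitted[name] = model
    return fitted
```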
3. Experimentation
3.1. Experiment 1
3.2. Experiment 2
3.3. Experiment 3
3.4. Experiment 4
4. Results
4.1. Experiment 1
4.2. Experiment 2
4.3. Experiment 3
4.4. Experiment 4
5. Discussion
Limitations
- the use of an off-the-shelf pitch-shifting method, which could influence the performance of the affect recognition system;
- the fact that pitch is shifted by a constant factor of 2, whereas a different or a variable factor could yield different results (a sketch follows this list);
- feature selection is performed through a statistical approach, and more sophisticated feature selection methods [25] might improve the results further;
- the disguised-speech affect recognition system is evaluated using data collected in lab settings rather than real-world settings;
- the hardware used for the proposed system is a combination of a Matrix Creator board and a Raspberry Pi 3 B+ with a 1.4 GHz 64-bit quad-core processor.
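As an illustration of the second limitation, the sketch below applies off-the-shelf pitch shifting by a constant factor of 2 (one octave, i.e., 12 semitones) using librosa. This is an assumed stand-in for illustration; the excerpt does not show the exact disguising tool used in the study.

```python
# Minimal sketch: pitch shifting by a constant factor of 2 (one octave
# = 12 semitones) with librosa. File paths are placeholders; the
# study's actual disguising tool is an assumption here.
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=None)  # keep the original rate
y_disguised = librosa.effects.pitch_shift(y, sr=sr, n_steps=12)
sf.write("speech_disguised.wav", y_disguised, sr)
```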
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Dimitrov, Y.; Gospodinova, Z.; Žnidaršič, M.; Ženko, B.; Veleva, V.; Miteva, N. Social Activity Modelling and Multimodal Coaching for Active Aging. In Proceedings of the Personalized Coaching for the Wellbeing of an Ageing Society, COACH’2019, Rhodes, Greece, 5–7 June 2019.
2. Haider, F.; Pollak, S.; Zarogianni, E.; Luz, S. SAAMEAT: Active Feature Transformation and Selection Methods for the Recognition of User Eating Conditions. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, ICMI ’18, Boulder, CO, USA, 16–20 October 2018; ACM: New York, NY, USA, 2018; pp. 564–568.
3. Haider, F.; Luz, S. Attitude recognition using multi-resolution cochleagram features. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; IEEE: Manhattan, NY, USA, 2019; pp. 3737–3741.
4. Luz, S.; la Fuente, S.D. A Method for Analysis of Patient Speech in Dialogue for Dementia Detection. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018; Kokkinakis, D., Ed.; European Language Resources Association (ELRA): Paris, France, 2018.
5. Hrovat, A.; Znidarsic, M.; Zenko, B.; Vucnik, M.; Mohorcic, M. SAAM: Supporting Active Ageing: Use Cases and User-Side Architecture. In Proceedings of the 2018 27th European Conference on Networks and Communications (EuCNC), Ljubljana, Slovenia, 18–21 June 2018.
6. Bondi, M.W.; Salmon, D.P.; Kaszniak, A.W. The neuropsychology of dementia. In Neuropsychological Assessment of Neuropsychiatric Disorders, 2nd ed.; Oxford University Press: New York, NY, USA, 1996; pp. 164–199.
7. Hart, R.P.; Kwentus, J.A.; Taylor, J.R.; Harkins, S.W. Rate of forgetting in dementia and depression. J. Consult. Clin. Psychol. 1987, 55, 101–105.
8. Lopes, P.N.; Brackett, M.A.; Nezlek, J.B.; Schütz, A.; Sellin, I.; Salovey, P. Emotional intelligence and social interaction. Personal. Soc. Psychol. Bull. 2004, 30, 1018–1034.
9. Haider, F.; De La Fuente Garcia, S.; Albert, P.; Luz, S. Affective Speech for Alzheimer’s Dementia Recognition. In Proceedings of the LREC Workshop on Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments (RaPID), Marseille, France, 11 May 2020; Kokkinakis, D., Lundholm Fors, K., Themistocleous, C., Antonsson, M., Eckerström, M., Eds.; European Language Resources Association (ELRA): Paris, France, 2020; pp. 67–73.
10. Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. In Proceedings of the Ninth European Conference on Speech Communication and Technology, Lisbon, Portugal, 4–8 September 2005; pp. 1516–1520.
11. Becker, J.; Boller, F.; Lopez, O.; Saxton, J.; McGonigle, K. The natural history of Alzheimer’s disease: Description of study cohort and accuracy of diagnosis. Arch. Neurol. 1994, 51, 585–594.
12. European Parliament and the Council. Regulation (EU) 2016/679 on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of Such Data, and Repealing Directive 95/46/EC (General Data Protection Regulation). 2016. Available online: https://op.europa.eu/en/publication-detail/-/publication/3e485e15-11bd-11e6-ba9a-01aa75ed71a1 (accessed on 23 November 2021).
13. Nautsch, A.; Jiménez, A.; Treiber, A.; Kolberg, J.; Jasserand, C.; Kindt, E.; Delgado, H.; Todisco, M.; Hmani, M.A.; Mtibaa, A.; et al. Preserving privacy in speaker and speech characterisation. Comput. Speech Lang. 2019, 58, 441–480.
14. Dimitrievski, A.; Zdravevski, E.; Lameski, P.; Trajkovik, V. Addressing Privacy and Security in Connected Health with Fog Computing. In Proceedings of the 5th EAI International Conference on Smart Objects and Technologies for Social Good, GoodTechs ’19, Valencia, Spain, 25–27 September 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 255–260.
15. Haider, F.; Luz, S. A System for Real-Time Privacy Preserving Data Collection for Ambient Assisted Living. In Proceedings of INTERSPEECH, Graz, Austria, 15–19 September 2019; pp. 2374–2375.
16. Perrot, P.; Aversano, G.; Chollet, G. Voice disguise and automatic detection: Review and perspectives. In Progress in Nonlinear Speech Processing; Springer: Berlin/Heidelberg, Germany, 2007; pp. 101–117.
17. Zheng, L.; Li, J.; Sun, M.; Zhang, X.; Zheng, T.F. When Automatic Voice Disguise Meets Automatic Speaker Verification. IEEE Trans. Inf. Forensics Secur. 2020, 16, 824–837.
18. Haider, F.; Luz, S. Affect Recognition Through Scalogram and Multi-Resolution Cochleagram Features. In Proceedings of Interspeech 2021, Brno, Czechia, 30 August–3 September 2021; pp. 4478–4482.
19. Eyben, F.; Scherer, K.R.; Schuller, B.W.; Sundberg, J.; André, E.; Busso, C.; Devillers, L.Y.; Epps, J.; Laukka, P.; Narayanan, S.S.; et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 2016, 7, 190–202.
20. Lajmi, L. An Improved Packet Loss Recovery of Audio Signals Based on Frequency Tracking. J. Audio Eng. Soc. 2018, 66, 680–689.
21. Boersma, P.; Weenink, D. Praat: Doing Phonetics by Computer [Computer Program], Version 6.1.38; 2018. Available online: http://www.praat.org/ (accessed on 1 March 2021).
22. Eyben, F.; Weninger, F.; Groß, F.; Schuller, B. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM International Conference on Multimedia, Barcelona, Spain, 21–25 October 2013; Association for Computing Machinery: New York, NY, USA, 2013; pp. 835–838.
23. Eyben, F.; Wöllmer, M.; Schuller, B. openSMILE: The Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Italy, 25–29 October 2010; Association for Computing Machinery: New York, NY, USA, 2010; pp. 1459–1462.
24. Haider, F.; Salim, F.A.; Conlan, O.; Luz, S. An Active Feature Transformation Method for Attitude Recognition of Video Bloggers. In Proceedings of INTERSPEECH, Hyderabad, India, 2–6 September 2018; pp. 431–435.
25. Haider, F.; Pollak, S.; Albert, P.; Luz, S. Emotion recognition in low-resource settings: An evaluation of automatic feature selection methods. Comput. Speech Lang. 2020, 65, 101119.
26. Haider, F.; Pollak, S.; Albert, P.; Luz, S. Extracting Audio-Visual Features for Emotion Recognition Through Active Feature Selection. In Proceedings of the 2019 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Ottawa, ON, Canada, 11–14 November 2019; IEEE: New York, NY, USA, 2019; pp. 1–5.
27. Pathak, M.A.; Raj, B.; Rane, S.D.; Smaragdis, P. Privacy-preserving speech processing: Cryptographic and string-matching frameworks show promise. IEEE Signal Process. Mag. 2013, 30, 62–74.
UAR of each classifier and feature set in Experiment 1:

Features | RF | DT | KNN | NB | SVM | LDA | Avg.
---|---|---|---|---|---|---|---
emobase | 0.6835 | 0.5052 | 0.2460 | 0.6051 | 0.7308 | 0.5574 | 0.5547
ComParE | 0.7059 | 0.5368 | 0.2281 | 0.3953 | 0.7949 | 0.8001 | 0.5768
eGeMAPS | 0.7063 | 0.4918 | 0.3885 | 0.4854 | 0.6858 | 0.6616 | 0.5699
Avg. | 0.6986 | 0.5113 | 0.2875 | 0.4953 | 0.7372 | 0.6730 | -
UAR of each classifier and feature set in Experiment 2:

Features | RF | DT | KNN | NB | SVM | LDA | Avg.
---|---|---|---|---|---|---|---
emobase | 0.6657 | 0.4588 | 0.2759 | 0.5865 | 0.7358 | 0.5417 | 0.5441
ComParE | 0.7063 | 0.5211 | 0.2016 | 0.2440 | 0.7388 | 0.7629 | 0.5291
eGeMAPS | 0.6335 | 0.4529 | 0.3705 | 0.4818 | 0.6759 | 0.6720 | 0.5478
Avg. | 0.6685 | 0.4776 | 0.2827 | 0.4374 | 0.7168 | 0.6589 | -
UAR of each classifier and feature set in Experiment 3:

Features | RF | DT | KNN | NB | SVM | LDA | Avg.
---|---|---|---|---|---|---|---
emobase | 0.5624 | 0.4172 | 0.2162 | 0.4838 | 0.6103 | 0.4673 | 0.4595
ComParE | 0.6513 | 0.4479 | 0.2161 | 0.1429 | 0.1435 | 0.1344 | 0.2893
eGeMAPS | 0.5062 | 0.3698 | 0.2623 | 0.3470 | 0.5391 | 0.1339 | 0.3597
Avg. | 0.5733 | 0.4116 | 0.2315 | 0.3246 | 0.4310 | 0.2452 | -
UAR of each classifier and feature set in Experiment 4:

Features | RF | DT | KNN | NB | SVM | LDA | Avg.
---|---|---|---|---|---|---|---
emobase | 0.5731 | 0.4121 | 0.2665 | 0.5331 | 0.6250 | 0.5075 | 0.4862
ComParE | 0.6832 | 0.4793 | 0.2541 | 0.1429 | 0.1839 | 0.1231 | 0.3111
eGeMAPS | 0.5540 | 0.4467 | 0.2623 | 0.3305 | 0.4988 | 0.4375 | 0.4216
Avg. | 0.6034 | 0.4460 | 0.2610 | 0.3355 | 0.4359 | 0.3560 | -
Accuracy, UAR and per-class recall (%) for each experiment:

Experiment | Accuracy | UAR | Anger | Boredom | Disgust | Fear | Happy | Sad | Neutral
---|---|---|---|---|---|---|---|---|---
EXP.1 | 81.31 | 80.01 | 90.55 | 88.89 | 82.61 | 71.01 | 57.75 | 80.65 | 88.61
EXP.2 | 77.01 | 76.29 | 81.10 | 83.95 | 76.09 | 69.57 | 52.11 | 83.87 | 87.34
EXP.3 | 69.91 | 65.13 | 98.43 | 76.54 | 36.96 | 57.97 | 43.66 | 79.03 | 63.29
EXP.4 | 72.71 | 68.32 | 98.43 | 86.42 | 50.00 | 63.77 | 18.31 | 79.03 | 82.28
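The UAR values above are unweighted average recalls, i.e., the mean of the per-class recall columns (for EXP.1, the seven per-class values average to 80.01). The snippet below computes this metric as scikit-learn's macro-averaged recall, which coincides with balanced accuracy; the labels are illustrative.

```python
# UAR (unweighted average recall) is the mean of per-class recalls,
# i.e., scikit-learn's macro-averaged recall / balanced accuracy.
from sklearn.metrics import recall_score, balanced_accuracy_score

def uar(y_true, y_pred):
    return recall_score(y_true, y_pred, average="macro")

# Illustrative labels: per-class recalls are 1.0, 0.5 and 0.0,
# so UAR = 0.5 even though plain accuracy is 0.6.
y_true = ["anger", "anger", "happy", "happy", "sad"]
y_pred = ["anger", "anger", "happy", "sad", "happy"]
assert uar(y_true, y_pred) == balanced_accuracy_score(y_true, y_pred)
print(uar(y_true, y_pred))  # 0.5
```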
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).