High-Level CNN and Machine Learning Methods for Speaker Recognition
Abstract
1. Introduction
2. Materials and Methods
2.1. Dataset
2.2. CNN-Based Deep Learning Approach
2.2.1. Feature Maps: Fourier Spectrograms and MFCC
2.2.2. Architectures
2.3. Machine Learning-Based Approach
- Feature Extraction;
- Feature Selection;
- Classifier training.
2.3.1. Feature Extraction and Selection
2.3.2. Naïve Bayes Classifier
3. Results
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Emotion | Male Speakers | Female Speakers | Tot.
---|---|---|---
Anger | 868 | 449 | 1317
Guilt | 638 | 338 | 976
Fear | 675 | 360 | 1035
Surprise | 611 | 313 | 924
Disgust | 1003 | 542 | 1545
Happiness | 918 | 483 | 1401
Sadness | 890 | 453 | 1343
Neutral | 220 | 108 | 328
Tot. | 5823 | 3046 | 8869
Accuracy | Input | Net |
---|---|---
90.15% | Spectrogram, grayscale | CNN1 |
88.85% | Spectrogram, colored | CNN1 |
82.27% | MFCC, grayscale | CNN1 |
83.17% | MFCC, colored | CNN1 |
83.43% | MFCC, colored | AlexNet |
89.28% | Spectrogram, grayscale | AlexNet |
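The grayscale spectrogram inputs listed above are log-magnitude time–frequency images fed to the CNNs. A minimal numpy sketch of how such a feature map can be built from a raw waveform (a generic windowed-FFT pipeline, assumed here; the frame length, hop size, and normalization are illustrative choices, not the authors' exact preprocessing):

```python
import numpy as np

def log_spectrogram(signal, frame_len=512, hop=256):
    """Grayscale log-magnitude spectrogram via a framed FFT (Hann window)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1))   # (n_frames, frame_len // 2 + 1)
    spec = 20.0 * np.log10(mag + 1e-10)         # amplitude in dB
    # Normalize to [0, 1] so the map can be fed to a CNN as a grayscale image.
    spec -= spec.min()
    spec /= spec.max() + 1e-10
    return spec.T                               # (freq_bins, time_frames)

# Example: 1 s of a 440 Hz tone sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
img = log_spectrogram(np.sin(2 * np.pi * 440 * t))
```

With these parameters the 440 Hz tone shows up as a bright horizontal line near frequency bin 14 (bin spacing is 16000/512 = 31.25 Hz); a colored variant would simply map the same values through a colormap before feeding the network.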
Metric | Value
---|---
FP rate range (min–max) | 0.000–0.010
FP rate (weighted avg.) | 0.002
TP rate range (min–max) | 0.364–0.978
TP rate (weighted avg.) | 0.871
AUC (weighted avg.) | 0.985
Precision (weighted avg.) | 0.875
Recall (weighted avg.) | 0.871
F1 score (weighted avg.) | 0.872
Mean Absolute Error | 0.0044
RMS Error | 0.0655
Relative Absolute Error | 13.09%
Accuracy | 87.09%
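The weighted averages in the table above are support-weighted means of the per-class metrics, i.e. each class contributes in proportion to its number of true samples. A minimal numpy sketch of that computation from a multiclass confusion matrix (illustrative only, not the authors' code):

```python
import numpy as np

def weighted_metrics(cm):
    """Support-weighted precision, recall and F1 from a confusion matrix.

    cm[i, j] = number of samples of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    support = cm.sum(axis=1)        # true samples per class
    predicted = cm.sum(axis=0)      # predicted samples per class
    precision = np.divide(tp, predicted,
                          out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, support,
                       out=np.zeros_like(tp), where=support > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(tp), where=denom > 0)
    w = support / support.sum()     # class weights = relative support
    return w @ precision, w @ recall, w @ f1

# Toy two-class example: 8/10 and 9/10 correct.
p, r, f = weighted_metrics([[8, 2], [1, 9]])
```

Note that the weighted recall equals the overall accuracy here (trace of the matrix over the total count), which is why the table's recall (0.871) matches its accuracy (87.09%).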
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Costantini, G.; Cesarini, V.; Brenna, E. High-Level CNN and Machine Learning Methods for Speaker Recognition. Sensors 2023, 23, 3461. https://doi.org/10.3390/s23073461