# Speaker Recognition Using Constrained Convolutional Neural Networks in Emotional Speech


## Abstract


## 1. Introduction

## 2. Proposed Speaker Recognition Architecture

## 3. Quantized CNN

#### 3.1. The 8-Bit Floating Point Quantization

An FP8 number is defined by the sign bit s, the mantissa M, and the exponent, and its value is given by

$$x={\left(-1\right)}^{s}\cdot {2}^{{E}_{b}}\cdot \left(1+\frac{M}{{2}^{m}}\right)$$

where ${E}_{b}=E-\mathrm{bias}$ denotes the biased exponent. It should be noted that M can take values from 0 to ${2}^{m}-1$, whereas $E_b$ can take values from $-{2}^{e-1}$ to ${2}^{e-1}-1$. The value of the bias and the ranges of the exponent and mantissa M depend on the number of bits reserved to represent the exponent and the mantissa.

The first format that we analyze here is defined as (s, e, m) = (1, 4, 3). In this case, bias = 8, the biased exponent takes values in the range from −8 to 7, whereas the mantissa can take values from 0 to ${2}^{3}-1=7$. This format is further denoted as FP8_v1. The second format that we analyze here is defined as (s, e, m) = (1, 5, 2) and it became popular recently as a part of a hybrid method for DNN training and inference [27]. In this case, bias = 16, the biased exponent takes values in the range from −16 to 15, whereas the mantissa can take values from 0 to ${2}^{2}-1=3$. This format is further denoted as FP8_v2.

Determination of the FP8 representation for a given input sample $x_1$ can be done using Equation (4) and the following steps:

- Step 1: Find the parameter s, following the rule:$$s=\left\{\begin{array}{ll}0,& {x}_{1}\ge 0\\ 1,& {x}_{1}<0\end{array}\right.$$
- Step 2: Find the biased exponent value $E_b$ by calculating the binary logarithm of the input sample:$${E}_{b}=\lfloor {\mathrm{log}}_{2}\left(\left|{x}_{1}\right|\right)\rfloor$$
- Step 3: Find the mantissa value M as:$$M=\mathrm{round}\left({2}^{m}\left(\frac{\left|{x}_{1}\right|}{{2}^{{E}_{b}}}-1\right)\right)$$
- Step 4: Calculate the quantized value using Equation (4).
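As an illustration, the four steps can be sketched in Python. This is a minimal sketch, assuming round-to-nearest for the mantissa and simple clamping of the biased exponent to its representable range; subnormal handling and clipping at the format limits are omitted.

```python
import math

def fp8_quantize(x, e=4, m=3):
    """Quantize a real number to an (s, e, m) minifloat, e.g. FP8 (1, 4, 3).

    Follows Steps 1-4: sign bit, biased exponent E_b = floor(log2|x|),
    mantissa M = round(2^m * (|x| / 2^E_b - 1)), then rebuild the value.
    """
    if x == 0.0:
        return 0.0
    s = 0 if x >= 0 else 1                                 # Step 1: sign
    e_b = math.floor(math.log2(abs(x)))                    # Step 2: biased exponent
    e_b = max(-2 ** (e - 1), min(2 ** (e - 1) - 1, e_b))   # clamp to representable range
    big_m = round(2 ** m * (abs(x) / 2 ** e_b - 1))        # Step 3: mantissa
    if big_m == 2 ** m:                                    # rounding overflowed the mantissa
        big_m = 0
        e_b = min(2 ** (e - 1) - 1, e_b + 1)
    return (-1) ** s * 2 ** e_b * (1 + big_m / 2 ** m)     # Step 4: rebuild the value

# FP8_v1 = (1, 4, 3): 0.1 maps to 2^-4 * (1 + 5/8) = 0.1015625
print(fp8_quantize(0.1))
# FP8_v2 = (1, 5, 2): same input, coarser mantissa
print(fp8_quantize(-0.1, e=5, m=2))
```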

#### 3.2. Binary Quantization

The binary representation $y_b$ of a weight w is obtained using the following rule [23]:

$${y}_{b}=\left\{\begin{array}{ll}y,& w\ge 0\\ -y,& w<0\end{array}\right.$$

where $\pm y$ are the representative levels of the binary quantizer.
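A minimal NumPy sketch of such a binary weight quantizer (assuming a single pair of representative levels ±y; the deterministic rule in [23] corresponds to y = 1):

```python
import numpy as np

def binarize(w, y=1.0):
    """Map every weight to +y or -y according to its sign (w >= 0 -> +y)."""
    return np.where(w >= 0, y, -y)

w = np.array([0.3, -0.7, 0.0, -0.1])
print(binarize(w))  # every weight collapses to +/- y
```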

#### 3.3. Ternary Quantization

The ternary representation $y_t$ of a weight w is obtained using the following rule [25]:

$${y}_{t}=\left\{\begin{array}{ll}{y}_{1},& w>{\Delta }_{1}\\ 0,& {\Delta }_{2}\le w\le {\Delta }_{1}\\ {y}_{2},& w<{\Delta }_{2}\end{array}\right.$$

where $\Delta_1$ and $\Delta_2$ represent decision-making thresholds. Here, we apply a symmetric ternary scalar quantizer, so that $y_1 = -y_2 = y$ and $\Delta_1 = -\Delta_2 = \Delta$. Furthermore, we simplify the design additionally by setting the absolute value of the decision thresholds to half the absolute value of the representative levels ($\Delta = y/2$), so that the final design is:

$${y}_{t}=\left\{\begin{array}{ll}y,& w>y/2\\ 0,& \left|w\right|\le y/2\\ -y,& w<-y/2\end{array}\right.$$
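The final design can be sketched as follows. This is a minimal NumPy illustration; the representative level y is a design parameter, and y = 1/16, the value that appears alongside the ternary rows of the SQNR tables, is one such choice.

```python
import numpy as np

def ternarize(w, y=1.0 / 16):
    """Symmetric ternary quantizer: thresholds at +/- y/2, outputs {-y, 0, +y}."""
    out = np.zeros_like(w)
    out[w > y / 2] = y     # large positive weights -> +y
    out[w < -y / 2] = -y   # large negative weights -> -y
    return out             # everything in the dead zone stays 0

w = np.array([0.05, -0.01, 0.2, -0.2])
print(ternarize(w))  # small weights are zeroed, large ones snap to +/- 1/16
```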

## 4. Results and Discussion

The quality of each quantized model is also assessed through the signal-to-quantization-noise ratio (SQNR):

$$\mathrm{SQNR}=10{\mathrm{log}}_{10}\left(\frac{\frac{1}{N}{\sum }_{i=1}^{N}{\left({w}_{i}-\mu \right)}^{2}}{{D}_{w}}\right),\qquad {D}_{w}=\frac{1}{N}{\sum }_{i=1}^{N}{\left({w}_{i}-{w}_{i}^{q}\right)}^{2}$$

where $D_w$ is the weight distortion introduced during the quantization process, ${w}_{i}^{q}$ are the quantized and ${w}_{i}$ the original values of the weights, μ is the mean value of the original weights, whereas N is the total number of weights.
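With these definitions, SQNR is straightforward to compute. A short NumPy sketch, assuming SQNR is the ratio of the variance of the original weights to the distortion $D_w$, expressed in decibels:

```python
import numpy as np

def sqnr_db(w, w_q):
    """SQNR in dB: variance of the original weights over the distortion D_w."""
    d_w = np.mean((w - w_q) ** 2)           # distortion D_w
    var = np.mean((w - np.mean(w)) ** 2)    # signal power around the mean
    return 10 * np.log10(var / d_w)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, 10_000)             # toy stand-in for network weights
print(sqnr_db(w, np.round(w, 2)))           # coarser quantization -> lower SQNR
print(sqnr_db(w, np.round(w, 3)))
```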

#### Comparison with Other Models

The classification accuracy gain (CAG) is defined as

$$\mathrm{CAG}=a{c}_{p}-a{c}_{c}$$

where $ac_p$ represents the accuracy of the proposed model, whereas $ac_c$ is the accuracy of the compared model. Positive values of CAG indicate better performance of the proposed model. CAG values are given for all analyzed models and all emotional styles in Figure 3, providing a detailed comparison with the model from [17]. We can highlight that the proposed full-precision model provides better performance for all emotions, although the proposed network has far fewer parameters, demonstrating the suitability of rectangular kernels. Analyzing the quantized CNN models, the proposed model achieves better performance than the model from [17] for all emotional states in the case of both FP8 formats. However, better results in the case of ternary quantization are achieved only for anger and sadness, whereas anger is the only emotional style for which better results are achieved in the case of binary quantization. Such behavior is not unusual, as the model from [17] has more than three times as many parameters, and the performance of low-precision models also relies on the model depth. However, more parameters and larger depth lead to longer processing and higher storage demands.
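As a worked example, for the neutral style the full-precision accuracies reported in the tables are 99.12% for the proposed model and 98.83% for the model from [17], giving a CAG of 0.29 percentage points in favor of the proposed model:

```python
def cag(ac_p, ac_c):
    """Classification accuracy gain: positive values favor the proposed model."""
    return ac_p - ac_c

# Neutral style, full-precision models (values from the accuracy tables)
print(round(cag(99.12, 98.83), 2))  # 0.29
```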

## 5. Summary and Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

1. Kinnunen, T.; Li, H. An overview of text-independent speaker recognition: From features to supervectors. *Speech Commun.* **2010**, *52*, 12–40.
2. Reynolds, D.A. An overview of automatic speaker recognition technology. In Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, USA, 13–17 May 2002; Volume 4, p. IV-4072.
3. Delić, V.; Perić, Z.; Sečujski, M.; Jakovljević, N.; Nikolić, J.; Mišković, D.; Simić, N.; Suzić, S.; Delić, T. Speech technology progress based on new machine learning paradigm. *Comput. Intell. Neurosci.* **2019**, *2019*, 4368036.
4. Soong, F.K.; Rosenberg, A.E.; Juang, B.-H.; Rabiner, L.R. A vector quantization approach to speaker recognition. *AT&T Tech. J.* **1987**, *66*, 14–26.
5. Furui, S. Cepstral analysis technique for automatic speaker verification. *IEEE Trans. Acoust. Speech Signal Process.* **1981**, *29*, 254–272.
6. Sturim, D.E.; Campbell, W.M.; Reynolds, D.A. Classification Methods for Speaker Recognition. In *Speaker Classification I*; Lecture Notes in Computer Science; Müller, C., Ed.; Springer: Berlin/Heidelberg, Germany, 2007; Volume 4343.
7. Nijhawan, G.; Soni, M.K. Speaker recognition using support vector machine. *Int. J. Comput. Appl.* **2014**, *87*, 7–10.
8. Dehak, N.; Kenny, P.; Dehak, R.; Dumouchel, P.; Ouellet, P. Front-end factor analysis for speaker verification. *IEEE Trans. Audio Speech Lang. Process.* **2011**, *19*, 788–798.
9. Kenny, P. *Joint Factor Analysis of Speaker and Session Variability: Theory and Algorithms*; Tech. Rep. CRIM-06/08-13; CRIM: Montreal, QC, Canada, 2005.
10. Mandarić, I.; Vujović, M.; Suzić, S.; Nosek, T.; Simić, N.; Delić, V. Initial analysis of the impact of emotional speech on the performance of speaker recognition on new Serbian emotional database. In Proceedings of the 29th Telecommunications Forum (TELFOR), Belgrade, Serbia, 23–24 November 2021; pp. 1–4.
11. Dai, T. Using quantized neural network for speaker recognition on edge computing devices. *J. Phys. Conf. Ser.* **2021**, *1992*, 02217.
12. Kitamura, T. Acoustic analysis of imitated voice produced by a professional impersonator. In Proceedings of the 9th Annual Conference of the International Speech Communication Association (Interspeech), Brisbane, Australia, 22–23 September 2008; pp. 813–816.
13. Ghiurcau, M.V.; Rusu, C.; Astola, J. Speaker recognition in an emotional environment. In Proceedings of the Signal Processing and Applied Mathematics for Electronics and Communications (SPAMEC 2011), Cluj-Napoca, Romania, 26–28 August 2011; pp. 81–84.
14. Wu, W.; Zheng, F.; Xu, M.; Bao, H. Study on speaker verification on emotional speech. In Proceedings of the INTERSPEECH 2006—ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, 17–21 September 2006; pp. 2102–2105.
15. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-vectors: Robust DNN embeddings for speaker recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2018, Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333.
16. Sarma, B.D.; Das, R.K. Emotion invariant speaker embeddings for speaker identification with emotional speech. In Proceedings of the 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand, 7–10 December 2020; pp. 610–615.
17. Lukic, Y.; Vogt, C.; Dürr, O.; Stadelmann, T. Speaker identification and clustering using convolutional neural networks. In Proceedings of the IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), Vietri sul Mare, Italy, 13–16 September 2016; pp. 1–6.
18. McLaren, M.; Lei, Y.; Scheffer, N.; Ferrer, L. Application of convolutional neural networks to speaker recognition in noisy conditions. In Proceedings of the INTERSPEECH 2014, the 15th Annual Conference of the International Speech Communication Association, Singapore, 14–18 September 2014; pp. 686–690.
19. Shafik, A.; Sedik, A.; El-Rahiem, B.; El-Rabaie, E.-S.; El Banby, G.; El-Samie, F.; Khalaf, A.; Song, O.-Y.; Iliyasu, A. Speaker identification based on Radon transform and CNNs in the presence of different types of interference for robotic applications. *Appl. Acoust.* **2021**, *177*, 107665.
20. Anvarjon, T.; Mustaqeem; Kwon, S. Deep-Net: A Lightweight CNN-Based Speech Emotion Recognition System Using Deep Frequency Features. *Sensors* **2020**, *20*, 5212.
21. Anvarjon, T.; Mustaqeem; Choeh, J.; Kwon, S. Age and Gender Recognition Using a Convolutional Neural Network with a Specially Designed Multi-Attention Module through Speech Spectrograms. *Sensors* **2021**, *21*, 5892.
22. Badshah, A.M.; Rahim, N.; Ullah, N.; Ahmad, J.; Muhammad, K.; Lee, M.Y.; Kwon, S.; Baik, S.W. Deep features-based speech emotion recognition for smart affective services. *Multimed. Tools Appl.* **2019**, *78*, 5571–5589.
23. Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1. *arXiv* **2016**, arXiv:1602.02830v3.
24. Peric, Z.; Denic, B.; Savic, M.; Vucic, N.; Simic, N. Binary Quantization Analysis of Neural Networks Weights on MNIST Dataset. *Elektronika ir Elektrotechnika* **2021**, *27*, 41–47.
25. Zhu, C.; Han, S.; Mao, H.; Dally, W. Trained Ternary Quantization. *arXiv* **2017**, arXiv:1612.01064v3.
26. IEEE Std 754-2019 (Revision of IEEE 754-2008); IEEE Standard for Floating-Point Arithmetic; Institute of Electrical and Electronics Engineers: New York, NY, USA, 2019; pp. 1–84.
27. Sun, X.; Choi, J.; Chen, C.-Y.; Wang, N.; Venkataramani, S.; Cui, X.; Zhang, W.; Gopalakrishnan, K. Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32, pp. 4900–4909.
28. Wang, N.; Choi, J.; Brand, B.; Chen, C.-Y.; Gopalakrishnan, K. Training deep neural networks with 8-bit floating point numbers. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 7686–7695.
29. Nikolic, J.; Peric, Z.; Aleksic, D.; Tomic, S.; Jovanovic, A. Whether the support region of three-bit uniform quantizer has a strong impact on post-training quantization for MNIST dataset? *Entropy* **2021**, *23*, 1699.
30. Peric, Z.; Savic, M.; Simic, N.; Denic, B.; Despotovic, V. Design of a 2-Bit Neural Network Quantizer for Laplacian Source. *Entropy* **2021**, *23*, 933.
31. Peric, Z.; Denic, B.; Savic, M.; Despotovic, V. Design and analysis of binary scalar quantizer of Laplacian source with applications. *Information* **2020**, *11*, 501.
32. Peric, Z.; Savic, M.; Dincic, M.; Vucic, N.; Djosic, D.; Milosavljevic, S. Floating Point and Fixed Point 32-bits Quantizers for Quantization of Weights of Neural Networks. In Proceedings of the 12th International Symposium on Advanced Topics in Electrical Engineering (ATEE), Bucharest, Romania, 25–27 March 2021; pp. 1–4.
33. Ye, F.; Yang, J. A deep neural network model for speaker identification. *Appl. Sci.* **2021**, *11*, 3603.
34. Kwon, S. A CNN-Assisted Enhanced Audio Signal Processing for Speech Emotion Recognition. *Sensors* **2020**, *20*, 183.
35. Sohn, J.; Kim, N.S.; Sung, W. A statistical model-based voice activity detection. *IEEE Signal Process. Lett.* **1999**, *6*, 1–3.
36. Kienast, M.; Sendlmeier, W.F. Acoustical analysis of spectral and temporal changes in emotional speech. In Proceedings of the ITRW on Speech and Emotion, Newcastle upon Tyne, UK, 5–7 September 2000; pp. 92–97.
37. Hershey, S.; Chaudhuri, S.; Ellis, D.P.W.; Gemmeke, J.F.; Jansen, A.; Moore, R.C.; Plakal, M.; Platt, D.; Saurous, R.A.; Seybold, B.; et al. CNN architectures for large-scale audio classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 131–135.
38. Shi, L.; Du, K.; Zhang, C.; Ma, H.; Yan, W. Lung Sound Recognition Algorithm Based on VGGish-BiGRU. *IEEE Access* **2019**, *7*, 139438–139449.

**Figure 1.** Examples of spectrograms for various styles of emotional speech obtained from the SEAC database: (**a**) neutral; (**b**) anger; (**c**) joy; (**d**) fear; and (**e**) sadness.

**Figure 3.** Classification accuracy gain over the model from [17].

| Layer | Arguments | Number of Parameters |
|---|---|---|
| Convolution2D | Filters = 16, kernel size = (9, 3), input shape = (128, 170, 1) | 448 |
| MaxPooling2D | Pool size = (2, 2) | |
| Convolution2D | Filters = 32, kernel size = (3, 1) | 1568 |
| MaxPooling2D | Pool size = (2, 2) | |
| Flatten | | |
| Dense_1 | Nodes = 128 | 4,989,056 |
| Dropout | Rate = 0.2 | |
| Dense_2 | Nodes = 23 | 2967 |
| Total number of parameters | | 4,994,039 |
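The parameter counts in the table can be reproduced with the standard formulas for convolutional and dense layers. A minimal sketch, assuming 'valid' padding and stride 1 for the convolutions, which matches the totals above:

```python
def conv2d_params(filters, kh, kw, in_ch):
    # weights per filter = kh * kw * in_ch, plus one bias per filter
    return filters * (kh * kw * in_ch + 1)

def dense_params(nodes, in_dim):
    # weight matrix plus one bias per node
    return nodes * (in_dim + 1)

h, w = 128, 170                              # input spectrogram size
c1 = conv2d_params(16, 9, 3, 1)              # 448
h, w = (h - 9 + 1) // 2, (w - 3 + 1) // 2    # 'valid' conv, then 2x2 max-pool
c2 = conv2d_params(32, 3, 1, 16)             # 1568
h, w = (h - 3 + 1) // 2, (w - 1 + 1) // 2
flat = h * w * 32                            # flatten size fed to Dense_1
d1 = dense_params(128, flat)                 # 4,989,056
d2 = dense_params(23, 128)                   # 2967
print(c1 + c2 + d1 + d2)                     # 4994039
```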

| Fold | Number of Spectrograms |
|---|---|
| 1 | 607 |
| 2 | 587 |
| 3 | 593 |
| 4 | 583 |
| 5 | 684 |
| Total | 3054 |

| Fold | Classification Accuracy (%) | Weighted F1 | Weighted Precision | Weighted Recall |
|---|---|---|---|---|
| 1 | 99.51 | 1.00 | 1.00 | 1.00 |
| 2 | 99.49 | 1.00 | 0.99 | 0.99 |
| 3 | 99.66 | 1.00 | 1.00 | 1.00 |
| 4 | 98.46 | 0.98 | 0.98 | 0.98 |
| 5 | 99.12 | 0.99 | 0.99 | 0.99 |
| Average | 99.248 | 0.994 | 0.992 | 0.992 |

| | Emotion | Number of Spectrograms |
|---|---|---|
| Training | Neutral | 2370 |
| Testing | Neutral | 684 |
| | Anger | 513 |
| | Joy | 608 |
| | Fear | 579 |
| | Sadness | 635 |

**Classification Accuracy (%) per Emotion**

| Proposed Model | Neutral | Anger | Fear | Sadness | Joy |
|---|---|---|---|---|---|
| Full-precision | 99.12 | 86.16 | 84.46 | 79.84 | 85.69 |
| FP8 (1, 4, 3) | 99.12 | 86.35 | 84.46 | 79.84 | 85.69 |
| FP8 (1, 5, 2) | 99.12 | 86.55 | 84.46 | 79.84 | 85.86 |
| Ternary quant. | 97.22 | 86.35 | 83.07 | 76.54 | 84.54 |
| Binary quant. | 94.01 | 83.82 | 77.37 | 69.29 | 83.06 |

| Proposed Model | SQNR (dB) |
|---|---|
| FP8 (1, 4, 3) | 30.98 |
| FP8 (1, 5, 2) | 25.58 |
| Ternary quant. (1/16) | 5.90 |
| Binary quant. | −31.483 |

**Weighted Precision per Emotion**

| Proposed Model | Neutral | Anger | Fear | Sadness | Joy |
|---|---|---|---|---|---|
| Full-precision | 0.99 | 0.88 | 0.88 | 0.84 | 0.89 |
| FP8 (1, 4, 3) | 0.99 | 0.88 | 0.88 | 0.84 | 0.89 |
| FP8 (1, 5, 2) | 0.99 | 0.88 | 0.88 | 0.84 | 0.89 |
| Ternary quant. | 0.97 | 0.88 | 0.87 | 0.82 | 0.87 |
| Binary quant. | 0.95 | 0.86 | 0.84 | 0.72 | 0.87 |

**Weighted Recall per Emotion**

| Proposed Model | Neutral | Anger | Fear | Sadness | Joy |
|---|---|---|---|---|---|
| Full-precision | 0.99 | 0.86 | 0.84 | 0.80 | 0.86 |
| FP8 (1, 4, 3) | 0.99 | 0.86 | 0.84 | 0.80 | 0.86 |
| FP8 (1, 5, 2) | 0.99 | 0.86 | 0.84 | 0.80 | 0.86 |
| Ternary quant. | 0.97 | 0.86 | 0.83 | 0.77 | 0.85 |
| Binary quant. | 0.94 | 0.84 | 0.77 | 0.69 | 0.83 |

**Weighted F1 Score per Emotion**

| Proposed Model | Neutral | Anger | Fear | Sadness | Joy |
|---|---|---|---|---|---|
| Full-precision | 0.99 | 0.86 | 0.85 | 0.79 | 0.86 |
| FP8 (1, 4, 3) | 0.99 | 0.86 | 0.85 | 0.79 | 0.86 |
| FP8 (1, 5, 2) | 0.99 | 0.86 | 0.85 | 0.79 | 0.86 |
| Ternary quant. | 0.97 | 0.86 | 0.83 | 0.75 | 0.85 |
| Binary quant. | 0.94 | 0.84 | 0.77 | 0.67 | 0.83 |

**Table 10.** The CNN model from [17].

| Layer | Arguments | Number of Parameters |
|---|---|---|
| Convolution2D | Filters = 32, kernel size = (4, 4), input shape = (128, 170, 1) | 544 |
| MaxPooling2D | Pool size = (4, 4), strides = (2, 2) | |
| Convolution2D | Filters = 64, kernel size = (4, 4) | 32,832 |
| MaxPooling2D | Pool size = (4, 4), strides = (2, 2) | |
| Flatten | | |
| Dense_1 | Nodes = 230 | 15,662,310 |
| Dropout | Rate = 0.5 | |
| Dense_2 | Nodes = 115 | 26,565 |
| Dense_3 | Nodes = 23 | 2668 |
| Total number of parameters | | 15,724,919 |

**Table 11.** Classification accuracy (%) per emotion of the CNN model from [17]: full-precision and additionally quantized scenarios.

| CNN Model from [17] | Neutral | Anger | Fear | Sadness | Joy |
|---|---|---|---|---|---|
| Full-precision | 98.83 | 78.75 | 83.94 | 78.43 | 84.70 |
| FP8 (1, 4, 3) | 98.98 | 78.75 | 84.28 | 78.58 | 84.70 |
| FP8 (1, 5, 2) | 98.83 | 78.75 | 84.11 | 77.95 | 84.87 |
| Ternary quant. | 98.83 | 82.65 | 85.49 | 75.59 | 86.35 |
| Binary quant. | 95.47 | 76.41 | 81.17 | 72.91 | 83.39 |

**Table 12.** SQNR for various quantization models applied to the model from [17].

| CNN from [17] | SQNR (dB) |
|---|---|
| FP8 (1, 4, 3) | 30.94 |
| FP8 (1, 5, 2) | 25.54 |
| Ternary quant. (1/16) | 5.219 |
| Binary quant. | −33.94 |

| Layer | Arguments | Number of Parameters |
|---|---|---|
| Convolution2D | Filters = 64, kernel size = (3, 3), strides = (1, 1), input shape = (128, 170, 1) | 640 |
| MaxPooling2D | Pool size = (2, 2), strides = (2, 2) | |
| Convolution2D | Filters = 128, kernel size = (3, 3), strides = (1, 1) | 73,856 |
| MaxPooling2D | Pool size = (2, 2), strides = (2, 2) | |
| Convolution2D | Filters = 256, kernel size = (3, 3), strides = (1, 1) | 295,168 |
| Convolution2D | Filters = 256, kernel size = (3, 3), strides = (1, 1) | 590,080 |
| MaxPooling2D | Pool size = (2, 2), strides = (2, 2) | |
| Convolution2D | Filters = 512, kernel size = (3, 3), strides = (1, 1) | 1,180,160 |
| Convolution2D | Filters = 512, kernel size = (3, 3), strides = (1, 1) | 2,359,808 |
| MaxPooling2D | Pool size = (2, 2), strides = (2, 2) | |
| Flatten | | |
| Dense_1 | Nodes = 4096 | 184,553,472 |
| Dense_2 | Nodes = 4096 | 16,781,312 |
| Dense_3 | Nodes = 23 | 94,231 |
| Total number of parameters | | 205,928,727 |

**Classification Accuracy (%) per Emotion**

| VGGish-Based Architecture | Neutral | Anger | Fear | Sadness | Joy |
|---|---|---|---|---|---|
| Full-precision | 98.83 | 79.93 | 85.84 | 79.21 | 85.03 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Simić, N.; Suzić, S.; Nosek, T.; Vujović, M.; Perić, Z.; Savić, M.; Delić, V.
Speaker Recognition Using Constrained Convolutional Neural Networks in Emotional Speech. *Entropy* **2022**, *24*, 414.
https://doi.org/10.3390/e24030414
