# Automatic Speaker Recognition System Based on Gaussian Mixture Models, Cepstral Analysis, and Genetic Selection of Distinctive Features


## Abstract


## 1. Introduction

## 2. Related Work

## 3. Methods

#### 3.1. Speech Signal Acquisition

#### 3.2. Signal Pre-Processing

The maximum frame energy E_max is first determined from a given speaker’s utterance, according to the following equation:

The energy of each frame E_i is compared with the experimentally designated threshold p_r (Equation (2)), which additionally depends on the average value Ē of the energies of the particular frames; this enables appropriate elimination of silence in a diverse acoustic environment:

The voicing decision uses the maximum value r_max of the autocorrelation function, which must be compared with the empirically calculated voicing threshold:

The voiced frames are then used to determine the fundamental frequency F_0, which is a vital descriptor of the speech signal.

where P_r is the power of the currently examined frame; P_s is a statistical value of the average power of the frame; and p_p is an empirically calculated threshold.
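The energy-based silence elimination described above can be sketched as follows. This is a minimal illustration, not the paper’s implementation: the frame length and the threshold fraction `p_r` are illustrative values, whereas the paper optimizes its thresholds empirically.

```python
import numpy as np

def remove_silence(signal, frame_len=256, p_r=0.3):
    """Drop low-energy (silent) frames. The decision threshold is a fraction
    p_r of the mean frame energy, so it adapts to the overall acoustic level
    of the recording, as the text describes for diverse environments."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = np.sum(frames ** 2, axis=1)      # E_i for each frame
    threshold = p_r * energies.mean()           # depends on the average energy Ē
    return frames[energies >= threshold]
```

Frames whose energy falls below the adaptive threshold are discarded before any further feature extraction.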

The fundamental frequency is estimated with two methods: the autocorrelation method F_0ac and the cepstral method F_0c. These two ways of calculating F_0 have different resistances to signal noise. Making correct use of this property enables us to identify the signal frames that do not fulfil a defined quality criterion (Equation (7)) [18]. According to the literature [18], calculating the fundamental frequency by means of the autocorrelation method is more exact than calculating it by means of the cepstral method, but the former is less resistant to signal noise. Thus, the smaller the difference between the fundamental frequencies determined by the two methods, the less noise in the signal frame concerned:

where p_f represents the optimized threshold value.
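The dual F_0 estimate and the agreement criterion can be sketched numerically. This is a simplified illustration under stated assumptions: the function names, the search range of 50–400 Hz, and the threshold value passed as `p_f` are all illustrative, and no windowing or peak refinement is applied.

```python
import numpy as np

def f0_autocorr(frame, fs, fmin=50.0, fmax=400.0):
    """F0 from the maximum of the autocorrelation function within the allowed lag range."""
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

def f0_cepstral(frame, fs, fmin=50.0, fmax=400.0):
    """F0 from the peak of the real cepstrum within the allowed quefrency range."""
    spectrum = np.abs(np.fft.rfft(frame)) + 1e-12   # avoid log(0)
    cep = np.fft.irfft(np.log(spectrum))
    lo, hi = int(fs / fmax), int(fs / fmin)
    q = lo + int(np.argmax(cep[lo:hi]))
    return fs / q

def frame_is_clean(frame, fs, p_f=15.0):
    """Quality criterion: the frame is kept when the two F0 estimates agree,
    i.e. when the noise level in the frame is low (p_f, in Hz, is a placeholder)."""
    return abs(f0_autocorr(frame, fs) - f0_cepstral(frame, fs)) < p_f
```

For a clean harmonic frame the two estimates nearly coincide, so the frame passes the criterion; a noisy frame drives the two estimates apart and is rejected.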

#### 3.3. Generation of Distinctive Features

where H_i(k) is the filter function in the i-th sub-band. This operation maps an N-point linear DFT spectrum to the M-point filtered spectrum on the mel scale. From 20 to 30 filters are used as a standard; numerous studies on speaker recognition systems have shown that using a small number of filters has a negative impact on speaker recognition effectiveness.

where c_j is the j-th mel-cepstral coefficient.

where f_g is the upper frequency of the voice signal; f_d is the lower frequency of the voice signal; and M is the number of filters.
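A standard triangular mel filterbank of the kind described above can be built as follows. This is a generic sketch, not the paper’s exact filter design: the mel formula is the common 2595·log10(1 + f/700) variant, and the band limits `f_low`/`f_high` (standing in for f_d and f_g) and filter count are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_fft, fs, n_filters=24, f_low=300.0, f_high=3400.0):
    """Triangular filters H_i(k), equally spaced on the mel scale between the
    lower (f_d) and upper (f_g) voice-band limits. Returns an
    (n_filters, n_fft//2 + 1) matrix mapping an N-point DFT spectrum
    to M mel sub-bands."""
    mels = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):               # rising slope
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):              # falling slope
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb
```

Multiplying a magnitude spectrum by this matrix yields the M filtered sub-band energies whose logarithms feed the cepstral transform.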

#### 3.4. Selection of Distinctive Features

In the first step, a matrix of mutual information I(yh_i; y) is created between the distinctive features yh_i subject to selection and the class membership vector y, together with a matrix of mutual information among the features themselves, I(yh_i; yh_j). Because calculating the mutual information among the variables considered is time-consuming, the presented genetic algorithm precomputes these initial data, which initiate its further operation. The initial data consist of the mutual information calculated over the full set of features subject to selection. They are then reused in every fitness evaluation of the particular individuals (feature subsets) within a population.
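The precompute-once/evaluate-many pattern described above can be sketched as follows. This is a simplified illustration: the histogram-based mutual information estimator, the bin count, and the mRMR-style fitness (mean relevance minus mean redundancy) are assumptions standing in for the paper’s exact criterion.

```python
import numpy as np

def mutual_info(a, b, bins=8):
    """Histogram estimate of the mutual information I(a; b), in nats."""
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

def precompute_mi(X, y, bins=8):
    """One-off tables reused by every fitness evaluation of the GA:
    relevance[i] = I(yh_i; y), redundancy[i, j] = I(yh_i; yh_j)."""
    n = X.shape[1]
    relevance = np.array([mutual_info(X[:, i], y, bins) for i in range(n)])
    redundancy = np.array([[mutual_info(X[:, i], X[:, j], bins)
                            for j in range(n)] for i in range(n)])
    return relevance, redundancy

def fitness(mask, relevance, redundancy):
    """Score of a feature subset (boolean mask): high relevance to the class,
    low redundancy among the chosen features. Single-feature subsets skip the
    redundancy term (the diagonal would count self-information)."""
    idx = np.flatnonzero(mask)
    if idx.size == 0:
        return -np.inf
    rel = relevance[idx].mean()
    red = redundancy[np.ix_(idx, idx)].mean() if idx.size > 1 else 0.0
    return rel - red
```

Because `relevance` and `redundancy` are computed once on the full feature set, each generation of the GA only performs cheap table lookups.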

where r_k is a pseudo-random value from the range [0; 1], drawn for the k-th individual, and g_k is the index of the drawn individual that was deemed well fitted and selected for the further genetic operation (crossover).
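Fitness-proportionate (roulette-wheel) selection of this kind can be sketched as below. This is a minimal version assuming non-negative fitness values; the function name and arguments are illustrative.

```python
import numpy as np

def roulette_selection(fitnesses, n_parents, rng=None):
    """Roulette-wheel selection: for the k-th pick, a pseudo-random r_k in
    [0, 1) is compared with the cumulative fitness distribution, and g_k is
    the index of the individual whose slice it falls on. Fitness values are
    assumed non-negative and not all zero."""
    rng = rng or np.random.default_rng()
    f = np.asarray(fitnesses, dtype=float)
    cum = np.cumsum(f) / f.sum()               # cumulative "wheel"
    return [int(np.searchsorted(cum, rng.random())) for _ in range(n_parents)]
```

Better-fitted individuals occupy wider slices of the wheel and are therefore drawn for crossover more often, while weaker individuals retain a small, nonzero chance of selection.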

#### 3.5. Normalization of Distinctive Features

#### 3.6. Modeling the Speaker’s Voice

The training data form a set of observations X = {x_1, x_2, …, x_T}, where T denotes the number of d-dimensional vectors of distinctive features.

The initial model parameters λ = {w_i, µ_i, Σ_i} for i = 1, …, C may be set in a pseudo-random or deterministic way. These parameters are the expected values (µ_i) and covariance matrices (Σ_i), as well as the distribution weights (w_i), where the weights of all the distributions sum to 1. Next, we calculate the probability density function of the occurrence of the d-dimensional vectors of distinctive features originating from the training dataset of a given speaker in the created model of his/her voice. This function may be approximated by means of a weighted sum of C Gaussian distributions, and for a single observation t, it assumes the form of [32]:
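The weighted-sum-of-Gaussians density just described can be evaluated as follows. This is a minimal sketch assuming diagonal covariance matrices (a common simplification in GMM speaker models); the function name and parameter layout are illustrative.

```python
import numpy as np

def gmm_pdf(x, weights, means, variances):
    """p(x | λ): weighted sum of C multivariate Gaussians with diagonal
    covariances. `weights` sum to 1; `means` and `variances` each hold C
    d-dimensional vectors."""
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        d = len(mu)
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.prod(var))
        expo = -0.5 * np.sum((x - mu) ** 2 / var)   # Mahalanobis term, diagonal Σ
        total += w * norm * np.exp(expo)
    return total
```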

where Pr(i|x_t, λ) denotes the a posteriori probability of the occurrence of the i-th distribution in model λ when feature vector x_t is observed. According to the assumption of the EM algorithm, if the inequality Q(λ,$\overline{\lambda}$) ≥ Q(λ,λ) holds, then p(X|$\overline{\lambda}$) ≥ p(X|λ) as well.

The first step is expectation, i.e., the calculation of Pr(i|x_t, λ) (Equation (31)) [32]; the second is maximization, which determines the parameters of the new model $\overline{\lambda}=\left\{{\overline{w}}_{i},{\overline{\mu}}_{i},{\overline{\mathsf{\Sigma}}}_{i}\right\}$ (Equations (32)–(34)) [32] that maximize the function described by Equation (30). Each subsequent step makes use of the quantities obtained in the previous one. The model training process terminates when the likelihood function no longer increases adequately or when a maximum number of iterations has been reached:
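One EM iteration of the kind outlined above can be sketched as follows. This is an illustrative diagonal-covariance version: the responsibility computation corresponds to the E-step (Equation (31)-style), and the three updates mirror the weight, mean, and covariance re-estimations (Equations (32)–(34)-style); the exact equations are not reproduced from the paper.

```python
import numpy as np

def em_step(X, weights, means, vars_):
    """One EM iteration for a diagonal-covariance GMM.
    E-step: responsibilities Pr(i | x_t, λ).
    M-step: re-estimated weights, means, and diagonal covariances."""
    T, d = X.shape
    C = len(weights)
    resp = np.zeros((T, C))
    for i in range(C):
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.prod(vars_[i]))
        expo = -0.5 * np.sum((X - means[i]) ** 2 / vars_[i], axis=1)
        resp[:, i] = weights[i] * norm * np.exp(expo)
    resp /= resp.sum(axis=1, keepdims=True)          # E-step posteriors
    Nk = resp.sum(axis=0)                            # effective counts
    new_w = Nk / T                                   # weight update
    new_mu = (resp.T @ X) / Nk[:, None]              # mean update
    new_var = np.array([(resp[:, i, None] * (X - new_mu[i]) ** 2).sum(axis=0) / Nk[i]
                        for i in range(C)])          # diagonal covariance update
    return new_w, new_mu, new_var
```

Iterating `em_step` until the likelihood gain falls below a tolerance (or an iteration cap is hit) reproduces the stopping rule described in the text.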

Identification consists of determining to which voice model λ_k (for k = 1, …, N, where N is the number of voices in a given dataset) the recognized fragment of voice, represented by the set X of distinctive feature vectors, most probably belongs. To this end, a discrimination function g_k(X) (Equation (37)) first computes, for each model, the conditional probability that the specific model λ_k represents the specific vectors of distinctive features X [32]:

where p(X|λ_k) is the likelihood function originating from the speaker model (29) and denotes the probability that the recognized set of vectors X is represented by voice model λ_k. Moreover, p(λ_k) represents the prior probability of the voice in the dataset; as each voice is equally probable, p(λ_k) = 1/N. Finally, p(X) is the probability of the occurrence of a given feature vector set X in a speech signal, which is the same for each of the voice models; it is used for normalization purposes and, when only the ranking matters (according to criterion (40)) in the search for the most probable model, it can be ignored. The discrimination function then assumes the form of [32]:

where p(x_t|λ_k) is determined according to Equation (27).
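Closed-set identification by this criterion reduces to an argmax over log-likelihoods, since the equal priors p(λ_k) = 1/N and the common p(X) term cancel in the ranking. The sketch below assumes each speaker model is exposed as a callable returning p(x_t|λ_k) for a single feature vector; the dictionary interface is an illustrative assumption.

```python
import numpy as np

def identify_speaker(X, models):
    """Closed-set identification: pick argmax_k of the discrimination
    function g_k(X) = sum_t log p(x_t | λ_k). `models` maps a speaker id to a
    callable returning p(x_t | λ_k) for one feature vector."""
    best_id, best_score = None, -np.inf
    for speaker_id, pdf in models.items():
        score = sum(np.log(pdf(x) + 1e-300) for x in X)   # log-likelihood g_k(X)
        if score > best_score:
            best_id, best_score = speaker_id, score
    return best_id
```

Working in the log domain avoids numerical underflow when many frames are multiplied together.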

#### 3.7. Decision-Making System

- H_0 (null hypothesis)—voice signal X comes from speaker k;
- H_1 (alternative hypothesis)—voice signal X comes from another speaker ~k from the population.

Assuming that the null hypothesis is represented by model λ_hyp, and that the alternative hypothesis is represented by model ${\lambda}_{\overline{hyp}}$, the relation appears as follows:

where p(x_t|λ_k) denotes the probability of the occurrence of feature vector x_t in speaker model λ_k. According to the above, this relationship may be presented as follows:
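The hypothesis test above is a log-likelihood ratio between the claimed-speaker model and the alternative (background) model. The sketch below is a generic illustration: averaging the ratio over the T frames and the decision threshold `theta` are common practice, not details taken from the paper.

```python
import numpy as np

def verify_speaker(X, claimed_pdf, background_pdf, theta=0.0):
    """Verification as a likelihood-ratio test:
    Λ(X) = (1/T) * sum_t [ log p(x_t | λ_hyp) − log p(x_t | λ_~hyp) ].
    H0 (the claimed speaker) is accepted when Λ(X) exceeds the operating
    threshold theta."""
    llr = np.mean([np.log(claimed_pdf(x) + 1e-300) - np.log(background_pdf(x) + 1e-300)
                   for x in X])
    return llr > theta, llr
```

Shifting `theta` trades false acceptances for false rejections, which is how an equal-error-rate operating point is tuned.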

#### 3.8. Normalization of the Speaker Recognition Result

- Test normalization—takes place online; the test recording is verified against the declared speaker model and a group of other cohort models, and the mean and deviation of those results are assigned to the speaker under consideration;
- Zero normalization—the model is verified against initial utterances that do not come from the modeled speaker, and the mean and deviation of those results are assigned to the speaker under consideration;
- Symmetric normalization—calculates the average value of the normalized results of the zero and test normalizations;
- Combined normalization—a combination of the zero and test normalizations that assumes that their results are independent random variables;
- Zero-test normalization—a combination of the zero and test normalizations in which test normalization is performed first, followed by zero normalization of the obtained verification results;
- Test-zero normalization—a combination of the zero and test normalizations in which zero normalization is performed first, followed by test normalization of the obtained verification results.

where µ_Z and µ_T are the corresponding mean values resulting from the zero and test normalizations, and δ_Z and δ_T are the standard deviations from those normalizations. Moreover, an appropriate selection of the models making up the cohort, which takes place when determining the terms of Equation (45), is a vital element of the above-mentioned normalization.
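The zero and zero-test normalizations can be sketched as below. This is a simplified illustration: in practice each cohort model would be Z-normalized with its own impostor statistics, whereas this sketch reuses one impostor score set for all of them, and the function names are illustrative.

```python
import numpy as np

def z_norm(score, impostor_scores):
    """Zero normalization: standardize a raw verification score with the
    mean and standard deviation of the model's scores against impostor
    (non-target) utterances."""
    mu_z, sigma_z = np.mean(impostor_scores), np.std(impostor_scores)
    return (score - mu_z) / sigma_z

def zt_norm(score, impostor_scores, cohort_scores):
    """Zero-test normalization: Z-norm first, then standardize the result
    with the statistics of the (also Z-normalized) scores that a cohort of
    other models produced on the same test utterance."""
    z = z_norm(score, impostor_scores)
    z_cohort = [z_norm(s, impostor_scores) for s in cohort_scores]
    mu_t, sigma_t = np.mean(z_cohort), np.std(z_cohort)
    return (z - mu_t) / sigma_t
```

Normalizing scores this way makes a single global decision threshold meaningful across different speaker models and test conditions.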

## 4. Results and Discussion

#### 4.1. ASR System Effectiveness Test in Train and Test Segment Length Function

#### 4.2. Results of Speaker Equal-Error Rate in Particular Voice Datasets

#### 4.3. Effect of Coding Applied in Telephony on Speaker Recognition Effectiveness

#### 4.4. Speaker Identification in an Open Set of Voices

#### 4.5. Validation of the ASR System Effectiveness Results and Comparison with the Competition

#### 4.6. Results of Genetic Selection of Distinctive Features and Genetic Optimization of Internal Parameters of the ASR System

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

- Bochenek, A.; Reicher, M. Anatomia Człowieka; PZWL Press: Warsaw, Poland, 2010. [Google Scholar]
- Gandhi, A.; Patil, H.A. Feature Extraction from Temporal Phase for Speaker Recognition. In Proceedings of the 2018 International Conference on Signal Processing and Communications (SPCOM), Bangalore, India, 16–19 July 2018; pp. 382–386. [Google Scholar]
- Martin, A.; Przybocki, M. 2002 NIST Speaker Recognition Evaluation LDC2004S04; Linguistic Data Consortium: Philadelphia, PA, USA, 2004. [Google Scholar]
- Krishnamoorthy, P.; Jayanna, H.S.; Prasanna, S.R.M. Speaker Recognition under Limited Data Condition by Noise Addition. Expert Syst. Appl.
**2011**, 38, 13487–13490. [Google Scholar] [CrossRef] - Shen, X.; Zhai, Y.; Wang, Y.; Chen, H. A Speaker Recognition Algorithm Based on Factor Analysis. In Proceedings of the 2014 7th International Congress on Image and Signal Processing, CISP 2014, Dalian, China, 14–16 October 2014; pp. 897–901. [Google Scholar]
- Zergat, K.Y.; Selouani, S.A.; Amrouche, A. Feature Selection Applied to G.729 Synthesized Speech for Automatic Speaker Recognition. In Proceedings of the Colloquium in Information Science and Technology, CIST 2018, Marrakech, Morocco, 21–24 October 2018; pp. 178–182. [Google Scholar]
- Bharath, K.P.; Rajesh Kumar, M. ELM Speaker Identification for Limited Dataset Using Multitaper Based MFCC and PNCC Features with Fusion Score. Multimed. Tools Appl.
**2020**, 79, 28859–28883. [Google Scholar] - Garofolo, J.S.; Lamel, L.F.; Fisher, W.M.; Fiscus, J.G.; Pallett, D.S.; Dahlgren, N.L.; Zue, V. TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1; Linguistic Data Consortium: Philadelphia, PA, USA, 1993. [Google Scholar]
- Xu, Q.; Wang, M.; Xu, C.; Xu, L. Speaker Recognition Based on Long Short-Term Memory Networks. In Proceedings of the 2020 IEEE 5th International Conference on Signal and Image Processing (ICSIP 2020), Nanjing, China, 23–25 October 2020; pp. 318–322. [Google Scholar]
- Veaux, C.; Yamagishi, J.; MacDonald, K. CSTR VCTK Corpus English Multi-speaker Corpus for CSTR Voice Cloning Toolkit; CSTR: Edinburgh, UK, 2017. [Google Scholar]
- Hu, Z.; Fu, Y.; Xu, X.; Zhang, H. I-Vector and DNN Hybrid Method for Short Utterance Speaker Recognition. In Proceedings of the 2020 IEEE International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA), Chongqing, China, 6–8 November 2020; pp. 67–71. [Google Scholar]
- Szwelnik, T.; Pęzik, P.; Dróżdż, Ł. SNUV—Spelling and NUmbers Voice Database; Voice Lab: Łódź, Poland, 2012. [Google Scholar]
- Kabir, M.M.; Mridha, M.F.; Shin, J.; Jahan, I.; Ohi, A.Q. A Survey of Speaker Recognition: Fundamental Theories, Recognition Methods and Opportunities. IEEE Access
**2021**, 9, 79236–79263. [Google Scholar] [CrossRef] - Mohd Hanifa, R.; Isa, K.; Mohamad, S. A Review on Speaker Recognition: Technology and Challenges. Comput. Electr. Eng.
**2021**, 90, 107005. [Google Scholar] [CrossRef] - Bai, Z.; Zhang, X.L. Speaker Recognition Based on Deep Learning: An Overview. Neural Netw.
**2021**, 140, 65–99. [Google Scholar] [CrossRef] [PubMed] - Dobrowolski, A.P. Transformacje Sygnałów od Teorii do Praktyki; BTC Press: Legionowo, Poland, 2018. [Google Scholar]
- Kamiński, K. System Automatycznego Rozpoznawania Mówcy Oparty na Analizie Cepstralnej Sygnału Mowy i Modelach Mieszanin Gaussowskich. Ph.D. Thesis, Military University of Technology, Warsaw, Poland, 2018. [Google Scholar]
- Ciota, Z. Metody Przetwarzanie Sygnałów Akustycznych w Komputerowej Analizie Mowy; EXIT: Warsaw, Poland, 2010. [Google Scholar]
- Pawłowski, Z. Foniatryczna Diagnostyka Wykonawstwa Emisji Głosu Śpiewaczego i Mówionego; Impuls Press: Cracow, Poland, 2005. [Google Scholar]
- Oppenheim, A.V.; Schafer, R.W. From Frequency to Quefrency: A History of the Cepstrum. IEEE Signal Process. Mag.
**2004**, 21, 95–106. [Google Scholar] [CrossRef] - Majda, E. Automatyczny System Wiarygodnego Rozpoznawania Mówcy Oparty na Analizie Cepstralnej Sygnału Mowy. Ph.D. Thesis, Military University of Technology, Warsaw, Poland, 2013. [Google Scholar]
- Davis, S.B.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. ASSP
**1980**, 28, 357–366. [Google Scholar] [CrossRef] [Green Version] - Sahidullah, M.; Saha, G. Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition. Speech Commun.
**2012**, 54, 543–565. [Google Scholar] [CrossRef] - Charbuillet, C.; Gas, B.; Chetouani, M.; Zarader, J.L. Optimizing Feature Complementarity by Evolution Strategy Application to Automatic Speaker Verification. Speech Commun.
**2009**, 51, 724–731. [Google Scholar] [CrossRef] [Green Version] - Young, S.; Evermann, G.; Gales, M.; Hain, T.; Kershaw, D.; Liu, X.; Moore, G.; Odell, J.; Ollason, D.; Povey, D.; et al. The HTK Book; Version 2.1; Cambridge University: Cambridge, UK, 1995. [Google Scholar]
- Harrag, A.; Saigaa, D.; Boukharouba, K.; Drif, M. GA-based feature subset selection: Application to Arabic speaker recognition system. In Proceedings of the 2011 11th International Conference on Hybrid Intelligent Systems (HIS), Malacca, Malaysia, 5–8 December 2011; pp. 383–387. [Google Scholar]
- Kamiński, K.; Dobrowolski, A.P.; Majda, E. Selekcja cech osobniczych sygnału mowy z wykorzystaniem algorytmów genetycznych. Bull. Mil. Univ. Technol.
**2016**, 65, 147–158. [Google Scholar] [CrossRef] - Zamalloa, M.; Bordel, G.; Rodriguez, L.J.; Penagarikano, M. Feature Selection Based on Genetic Algorithms for Speaker Recognition. In Proceedings of the 2006 IEEE Odyssey—The Speaker and Language Recognition Workshop, San Juan, PR, USA, 28–30 June 2006; pp. 1–8. [Google Scholar]
- Ludwig, O.; Nunes, U. Novel Maximum-Margin Training Algorithms for Supervised Neural Networks. IEEE Trans. Neural Netw.
**2010**, 21, 972–984. [Google Scholar] [CrossRef] [PubMed] - Al-Ali, A.K.H.; Dean, D.; Senadji, B.; Chandran, V.; Naik, G.R. Enhanced Forensic Speaker Verification Using a Combination of DWT and MFCC Feature Warping in the Presence of Noise and Reverberation Conditions. IEEE Access
**2017**, 5, 15400–15413. [Google Scholar] [CrossRef] [Green Version] - Piotrowski, Z.; Wojtuń, J.; Kamiński, K. Subscriber authentication using GMM and TMS320C6713DSP. Przegląd Elektrotechniczny
**2012**, 88, 127–130. [Google Scholar] - Tran, D.; Tu, L.; Wagner, M. Fuzzy Gaussian mixture models for speaker recognition. In Proceedings of the International Conference on Spoken Language Processing ICSLP 1998, Sydney, Australia, 30 November–4 December 1998; p. 798. [Google Scholar]
- Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data via the EM algorithm. J. R. Stat. Soc.
**1977**, 39, 1–38. [Google Scholar] - Janicki, A.; Staroszczyk, T. Klasyfikacja mówców oparta na modelowaniu GMM-UBM dla mowy o różnej jakości. Przegląd Telekomunikacyjny—Wiadomości Telekomunikacyjne
**2011**, 84, 1469–1474. [Google Scholar] - Kamiński, K.; Dobrowolski, A.P.; Majda, E. Evaluation of functionality speaker recognition system for downgraded voice signal quality. Przegląd Elektrotechniczny
**2014**, 90, 164–167. [Google Scholar] - Reynolds, D.A.; Quatieri, T.F.; Dunn, R.B. Speaker Verification Using Adapted Gaussian Mixture Models. Digit. Signal Process.
**2000**, 10, 19–41. [Google Scholar] [CrossRef] [Green Version] - Kamiński, K.; Dobrowolski, A.P.; Majda, E. Voice identification in the open set of speakers. Przegląd Elektrotechniczny
**2015**, 91, 206–210. [Google Scholar] - Büyük, O.; Arslan, M.L. Model selection and score normalization for text-dependent single utterance speaker verification. Turk. J. Electr. Eng. Comput. Sci.
**2012**, 20, 1277–1295. [Google Scholar] [CrossRef] - Kamiński, K.; Dobrowolski, A.P.; Majda, E.; Posiadała, D. Optimization of the automatic speaker recognition system for different acoustic paths. Przegląd Elektrotechniczny
**2015**, 91, 89–92. [Google Scholar] - Kamiński, K.; Dobrowolski, A.P.; Tatoń, R. Automatic The assessment of efficiency of the automatic speaker recognition system for voices registered using a throat microphone. In Proceedings of the XII Conference on Reconnaissance and Electronic Warfare Systems, Oltarzew, Poland, 19–21 November 2018; SPIE: Bellingham, WA, USA, 2019; Volume 11055, pp. 165–171. [Google Scholar]
- Kamiński, K.; Dobrowolski, A.P. The impact of compression of speech signal, background noise and acoustic disturbances on the effectiveness of speaker identification. In Proceedings of the XI Conference on Reconnaissance and Electronic Warfare Systems, Oltarzew, Poland, 21–23 November 2016; SPIE: Bellingham, WA, USA, 2017; p. 104180L. [Google Scholar]
- Bagwell, C. The Main SoX Sound eXchange Manual. 2014. Available online: https://sox.sourceforge.net/ (accessed on 1 September 2022).
- Valin, J. The Speex Codec Manual Version 1.2 Beta 3; Xiph.Org Foundation: Somerville, MA, USA, 2007. [Google Scholar]
- Kabal, P. ITU-T G.723.1 Speech Coder: A Matlab Implementation; TSP Lab Technical Report; McGill University: Montreal, QC, Canada, 2009. [Google Scholar]

**Figure 1.**Advantage of voice biometrics over face and fingerprint biometrics during COVID-19 pandemic.

**Figure 3.**Diagram of generating distinctive features applied in the ASR System [17].

**Figure 4.**A method of setting weighted cepstral features from real cepstrum based on trapezoidal weighted function [17].

**Figure 5.**Depicting crossover and mutation operations within a genetic algorithm [17].

**Figure 6.**An example of modeling a 2-dimensional training dataset by a weighted sum of 3 Gaussian distributions [17].

**Figure 7.**Speaker identification and verification effectiveness for variable lengths of speech signal: (**a**) identification; (**b**) verification.

**Figure 9.**Results of the optimization of the internal parameters of the ASR System with the use of a genetic algorithm: (**a**) value of the penalty function in relation to the number of generations; (**b**) effectiveness of speaker identification in relation to the calculation time.

| Name of the Voice Dataset | NIST | TIMIT | Proprietary (Digits and Numbers) | Proprietary (Intonation Differentiation) | Proprietary (Multisession) | SNUV | Proprietary (i.e., Throat Microphone) | VCTK | Total Dataset |
|---|---|---|---|---|---|---|---|---|---|
| Dataset size (number of voices) | 330 | 630 | 100 | 50 | 50 | 210 | 50 | 109 | 1529 |
| EER (%) | 1.52 | 0.48 | 1.00 | 0.00 | 6.00 | 3.33 | 4.00 | 0.92 | 2.03 |

**Table 2.**Results of speaker identification (I) and verification (V) effectiveness for the common coding standards.

| Training\Testing | Uncoded (I) | Uncoded (V) | G.711 (I) | G.711 (V) | GSM 06.10 (I) | GSM 06.10 (V) | G.723.1 (I) | G.723.1 (V) | SPEEX (I) | SPEEX (V) |
|---|---|---|---|---|---|---|---|---|---|---|
| Uncoded | 96.3 | 98.0 | 95.7 | 97.7 | 82.3 | 96.1 | 80.8 | 95.8 | 90.5 | 96.6 |
| G.711 | 96.0 | 98.1 | 96.1 | 97.9 | 84.0 | 96.3 | 79.9 | 96.0 | 91.8 | 96.9 |
| GSM 06.10 | 85.9 | 96.5 | 86.9 | 96.7 | 93.5 | 97.3 | 72.5 | 93.9 | 84.4 | 95.5 |
| G.723.1 | 84.5 | 96.4 | 84.2 | 96.0 | 71.6 | 93.9 | 92.5 | 97.2 | 86.3 | 95.9 |
| SPEEX | 92.6 | 97.7 | 93.1 | 97.4 | 83.1 | 96.2 | 85.2 | 96.6 | 95.2 | 97.6 |

| Dataset Size | Training Signal | Test Signal | Comparative Parameter | Result for This ASR System | Result for the Compared ASR System | Methods Used | References of the ASR System Compared |
|---|---|---|---|---|---|---|---|
| 100 random voices from the “test” folder | 5 utterances | 5 utterances | IR | 99.00% | 80.00% | MFCC, GMM-UBM | [4] |
| 38 random voices (19 male, 19 female) from the “test” folder | No data—7 utterances were assumed | No data—3 utterances were assumed | IR | 100.00% | 82.84% | MFCC, LFA-SVM | [5] |
| 130 random voices from the “train” folder | 8 utterances | 2 utterances | EER | Uncoded speech: 0.77%; synthesized speech (G.729): 2.31% | Uncoded speech: 0.91%; synthesized speech (G.729): 12.50% | MFCC, LDA, GPLDA | [6] |
| 124 random voices from the “test” folder | 6 utterances, limited to 7 s | 4 utterances, limited to 7 s | IR | Clean speech: 98.39%; AWGN noise (30 dB): 95.16% | Clean speech: 97.52%; AWGN noise (30 dB): 86.70% | Multi-taper (MFCC + PNCC), ELM | [7] |

| Parameter | Value |
|---|---|
| Number of reduced individual features | 38 |
| Increase of IR for the NIST SRE 2002 dataset | 21% |
| Increase of IR for the TIMIT dataset | 20% |

| Stage | Parameter |
|---|---|
| Silence cutting | Frame length |
| | Frame overlap |
| | Threshold |
| Signal filtration | High-pass filter order |
| | Cut-off frequency |
| Cepstrum filtration | Low-pass filter order |
| | Cut-off frequency |
| Fundamental frequency filtration | Minimum fundamental frequency |
| | Maximum fundamental frequency |
| Feature generation | Frame length |
| | Frame overlap |
| | Voicing threshold |
| | Power threshold |
| | Fundamental frequency difference threshold |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Kamiński, K.A.; Dobrowolski, A.P.
Automatic Speaker Recognition System Based on Gaussian Mixture Models, Cepstral Analysis, and Genetic Selection of Distinctive Features. *Sensors* **2022**, *22*, 9370.
https://doi.org/10.3390/s22239370
