
Search Results (19)

Search Parameters:
Keywords = i-vector

17 pages, 543 KB  
Article
Speaker-Attributed Training for Multi-Speaker Speech Recognition Using Multi-Stage Encoders and Attention-Weighted Speaker Embedding
by Minsoo Kim and Gil-Jin Jang
Appl. Sci. 2024, 14(18), 8138; https://doi.org/10.3390/app14188138 - 10 Sep 2024
Cited by 3 | Viewed by 4589
Abstract
Automatic speech recognition (ASR) aims to transcribe naturally spoken human speech into text inputs for machines. In multi-speaker environments, where multiple speakers talk simultaneously with a large amount of overlap, conventional ASR systems trained on recordings of single talkers may degrade significantly. This paper proposes a multi-speaker ASR method that incorporates speaker embedding information as an additional input. An embedding vector was extracted for each speaker in the training set, and all of the embedding vectors were stacked to construct a total speaker profile matrix. This profile matrix makes it possible, at test time, to find the embedding vectors closest to the speakers in an input recording, which helps to recognize the individual speakers’ voices mixed in the input. Furthermore, the proposed method efficiently reuses the decoder of an existing speaker-independent ASR model, eliminating the need to retrain the entire system. Various speaker embedding methods such as i-vector, d-vector, and x-vector were adopted, and the experimental results show absolute word error rate (WER) improvements of 0.33% and 0.95% (3.9% and 11.5% relative) without and with the speaker profile, respectively. Full article
(This article belongs to the Special Issue Speech Recognition and Natural Language Processing)
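
The profile-matrix lookup this abstract describes reduces to stacking the training speakers' embeddings into a matrix and matching a test embedding to its nearest rows by cosine similarity. A minimal sketch (function names and toy embeddings are ours, not the paper's):

```python
import numpy as np

def build_profile_matrix(embeddings):
    """Stack per-speaker embedding vectors into a profile matrix (one row per speaker)."""
    return np.vstack(embeddings)

def closest_speakers(profile, query, k=1):
    """Return indices of the k profile rows most cosine-similar to a query embedding."""
    p = profile / np.linalg.norm(profile, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = p @ q
    return np.argsort(sims)[::-1][:k]

# toy profile: four "speakers" with 3-dimensional embeddings
profile = build_profile_matrix([[1.0, 0.0, 0.0],
                                [0.0, 1.0, 0.0],
                                [0.0, 0.0, 1.0],
                                [0.7, 0.7, 0.0]])
best = closest_speakers(profile, np.array([0.9, 0.1, 0.0]), k=2)
```

Real i-/d-/x-vectors are hundreds of dimensions, but the lookup is the same.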

13 pages, 456 KB  
Article
Robust Detection of Background Acoustic Scene in the Presence of Foreground Speech
by Siyuan Song, Yanjue Song and Nilesh Madhu
Appl. Sci. 2024, 14(2), 609; https://doi.org/10.3390/app14020609 - 10 Jan 2024
Cited by 2 | Viewed by 1695
Abstract
The characterising sound required for an Acoustic Scene Classification (ASC) system is contained in the ambient signal. In practice, however, this is often distorted by, for example, foreground speech from speakers in the surroundings. Previously, based on the iVector framework, we proposed different strategies to improve classification accuracy when foreground speech is present. In this paper, we extend these methods to deep-learning (DL)-based ASC systems to improve their robustness to foreground speech. ResNet models are proposed as the baseline, in combination with multi-condition training at different signal-to-background ratios (SBRs). For further robustness, we first investigate noise-floor-based Mel-FilterBank Energies (NF-MFBE) as the input feature of the ResNet model. Next, speech presence information obtained from a speech enhancement (SE) system is incorporated within the ASC framework. As the speech presence information is time-frequency specific, it allows the network to learn to distinguish better between background signal regions and foreground speech. While the proposed modifications improve the performance of ASC systems when foreground speech is dominant, performance is slightly worse in scenarios with low-level or absent foreground speech. Therefore, as a last consideration, ensemble methods are introduced to integrate the classification scores of different models in a weighted manner. The experimental study systematically validates the contribution of each proposed modification, and, for the final system, it is shown that with the proposed input features and meta-learner, the classification accuracy is improved at all tested SBRs. For an SBR of 20 dB in particular, absolute improvements of up to 9% can be obtained. Full article
(This article belongs to the Special Issue Deep Learning Based Speech Enhancement Technology)
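
The score-level ensemble mentioned at the end of this abstract amounts to a weighted sum of per-model class scores before the final decision. A minimal sketch with fixed weights rather than a trained meta-learner (all names and toy scores are illustrative):

```python
import numpy as np

def weighted_ensemble(score_mats, weights):
    """Fuse per-model (utterance x class) score matrices with fixed weights,
    then pick the argmax class per utterance."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the weights form a convex combination
    fused = sum(wi * s for wi, s in zip(w, score_mats))
    return fused.argmax(axis=1)

# two toy models scoring 3 utterances over 2 scene classes
m1 = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
m2 = np.array([[0.6, 0.4], [0.7, 0.3], [0.1, 0.9]])
pred = weighted_ensemble([m1, m2], weights=[0.5, 0.5])
```

A meta-learner, as in the paper, would replace the fixed weights with a small model trained on validation scores.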

21 pages, 6268 KB  
Article
On Training Targets and Activation Functions for Deep Representation Learning in Text-Dependent Speaker Verification
by Achintya Kumar Sarkar and Zheng-Hua Tan
Acoustics 2023, 5(3), 693-713; https://doi.org/10.3390/acoustics5030042 - 17 Jul 2023
Cited by 3 | Viewed by 3430
Abstract
Deep representation learning has gained significant momentum in advancing text-dependent speaker verification (TD-SV) systems. When designing deep neural networks (DNN) for extracting bottleneck (BN) features, the key considerations include training targets, activation functions, and loss functions. In this paper, we systematically study the impact of these choices on the performance of TD-SV. For training targets, we consider speaker identity, time-contrastive learning (TCL), and auto-regressive prediction coding, with the first being supervised and the last two being self-supervised. Furthermore, we study a range of loss functions when speaker identity is used as the training target. With regard to activation functions, we study the widely used sigmoid function, rectified linear unit (ReLU), and Gaussian error linear unit (GELU). We experimentally show that GELU is able to reduce the error rates of TD-SV significantly compared to sigmoid, irrespective of the training target. Among the three training targets, TCL performs the best. Among the various loss functions, cross-entropy, joint-softmax, and focal loss functions outperform the others. Finally, the score-level fusion of different systems is also able to reduce the error rates. To evaluate the representation learning methods, experiments are conducted on the RedDots 2016 challenge database consisting of short utterances for TD-SV systems based on classic Gaussian mixture model-universal background model (GMM-UBM) and i-vector methods. Full article
(This article belongs to the Collection Featured Position and Review Papers in Acoustics Science)
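
For reference, the GELU activation compared against sigmoid in this abstract is commonly computed with its tanh approximation. A plain-Python sketch (the authors' exact implementation may differ):

```python
import math

def gelu(x):
    """Tanh approximation of the Gaussian error linear unit."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def sigmoid(x):
    """Logistic sigmoid, the classic bottleneck-layer activation."""
    return 1.0 / (1.0 + math.exp(-x))
```

Unlike sigmoid, GELU is unbounded above and weights inputs by their magnitude, which is one intuition for why it behaves differently as a bottleneck-feature activation.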

23 pages, 7840 KB  
Article
A Smart Image Encryption Technology via Applying Personal Information and Speaker-Verification System
by Shih-Yu Li, Chun-Hung Lee and Lap-Mou Tam
Sensors 2023, 23(13), 5906; https://doi.org/10.3390/s23135906 - 26 Jun 2023
Cited by 4 | Viewed by 2024
Abstract
In this paper, a framework for authorization and personal image protection that applies user accounts, passwords, and personal I-vectors as the keys for ciphering the image content was developed and connected. There were two main systems in this framework. The first involved a speaker verification system, wherein the user entered their account information and password to log into the system and provided a short voice sample for identification, and then the algorithm transferred the user’s voice (biometric) features, along with their account and password details, to a second image encryption system. For the image encryption process, the account name and password presented by the user were applied to produce the initial conditions for hyper-chaotic systems to generate private keys for image-shuffling and ciphering. In the final stage, the biometric features were also applied to protect the content of the image, so the encryption technology would be more robust. The final results of the encryption system were acceptable, as a lower correlation was obtained in the cipher images. The voice database we applied was the Pitch Tracking Database from the Graz University of Technology (PTDB-TUG), which provided the microphone and laryngoscope signals of 20 native English speakers. For image processing, four standard testing images from the University of Southern California–Signal and Image Processing Institute (USC-SIPI), including Lena, F-16, Mandrill, and Peppers, were presented to further demonstrate the effectiveness and efficiency of the smart image encryption algorithm. Full article
(This article belongs to the Special Issue Clear Reasoning about Security)
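
The credential-seeded chaotic shuffling can be illustrated with a much simpler stand-in: the paper uses hyper-chaotic systems, but the pattern — derive an initial condition from the account and password, iterate a chaotic map, and sort the trajectory to obtain a pixel permutation — is the same. The logistic map and all names here are illustrative, not the paper's construction:

```python
import hashlib

def password_seed(account, password):
    """Derive an initial condition in (0, 1) from account+password via SHA-256."""
    h = hashlib.sha256((account + ":" + password).encode()).digest()
    return (int.from_bytes(h[:8], "big") % (10 ** 9) + 1) / (10 ** 9 + 2)

def logistic_permutation(seed, n, r=3.99):
    """Iterate the logistic map x <- r*x*(1-x) and sort the trajectory
    to get a deterministic shuffle order for n pixels."""
    x, traj = seed, []
    for _ in range(n):
        x = r * x * (1.0 - x)
        traj.append(x)
    return sorted(range(n), key=traj.__getitem__)

perm = logistic_permutation(password_seed("alice", "s3cret"), 8)
```

The same seed always yields the same permutation, so the legitimate user can invert the shuffle; a different password yields a different trajectory.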

28 pages, 11874 KB  
Article
Speaker Counting Based on a Novel Hive Shaped Nested Microphone Array by WPT and 2D Adaptive SRP Algorithms in Near-Field Scenarios
by Ali Dehghan Firoozabadi, Pablo Adasme, David Zabala-Blanco, Pablo Palacios Játiva and Cesar Azurdia-Meza
Sensors 2023, 23(9), 4499; https://doi.org/10.3390/s23094499 - 5 May 2023
Viewed by 2540
Abstract
Sound source localization (SSL), speech enhancement, and speaker tracking are among the main fields of speech processing, and most algorithms in these fields require the number of speakers to be known for real-world implementation. In this article, a novel method for estimating the number of speakers is proposed, based on a hive-shaped nested microphone array (HNMA), the wavelet packet transform (WPT), and 2D sub-band adaptive steered response power (SB-2DASRP) with phase transform (PHAT) and maximum likelihood (ML) filters, followed by agglomerative clustering and the elbow criterion to obtain the number of speakers in near-field scenarios. The proposed HNMA eliminates aliasing and imaging and prepares suitable signals for the speaker-counting method. The Blackman–Tukey spectral estimation method is then selected to detect the relevant frequency components of the recorded signal. The WPT is used for smart sub-band processing that focuses on the frequency bins of the speech signal. In addition, the SRP method is implemented in 2D and adaptively, with ML and PHAT filters, on the sub-band signals. The SB-2DASRP peak positions are extracted over various time frames based on a standard deviation (SD) criterion, and the final number of speakers is estimated by unsupervised agglomerative clustering and the elbow criterion. The proposed HNMA-SB-2DASRP method is compared with the frequency-domain magnitude squared coherence (FD-MSC), i-vector probabilistic linear discriminant analysis (i-vector PLDA), ambisonics features of the correlational recurrent neural network (AF-CRNN), and speaker counting by density-based classification and clustering decision (SC-DCCD) algorithms in noisy and reverberant environments, demonstrating the superiority of the proposed method for real implementation. Full article
(This article belongs to the Special Issue Localising Sensors through Wireless Communication)
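
Steered-response-power methods like the one above build on phase-transform-weighted cross-correlation between microphone pairs. A minimal GCC-PHAT delay estimator (not the paper's 2D sub-band variant; names and signals are illustrative):

```python
import numpy as np

def gcc_phat(sig, ref, fs=1):
    """GCC-PHAT: whiten the cross-spectrum so only phase remains,
    then return the delay (in samples / fs) at the correlation peak."""
    n = len(sig) + len(ref)
    S = np.fft.rfft(sig, n=n)
    R = np.fft.rfft(ref, n=n)
    cross = S * np.conj(R)
    cross /= np.abs(cross) + 1e-12  # PHAT weighting: unit magnitude
    cc = np.fft.irfft(cross, n=n)
    shift = np.argmax(np.abs(cc))
    if shift > n // 2:              # wrap negative lags
        shift -= n
    return shift / fs

x = np.zeros(64); x[20] = 1.0   # impulse at sample 20
y = np.zeros(64); y[25] = 1.0   # same impulse delayed by 5 samples
delay = gcc_phat(y, x)
```

SRP steers a grid of candidate source positions and sums such correlations over all microphone pairs; the peak of that power map gives the source location.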

23 pages, 4407 KB  
Article
Empirical Comparison between Deep and Classical Classifiers for Speaker Verification in Emotional Talking Environments
by Ali Bou Nassif, Ismail Shahin, Mohammed Lataifeh, Ashraf Elnagar and Nawel Nemmour
Information 2022, 13(10), 456; https://doi.org/10.3390/info13100456 - 27 Sep 2022
Cited by 4 | Viewed by 2828
Abstract
Speech signals carry various bits of information relevant to the speaker, such as age, gender, accent, language, health, and emotions. Emotions are conveyed through modulations of facial and vocal expressions. This paper conducts an empirical comparison of performance between the classical classifiers Gaussian Mixture Model (GMM), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Artificial Neural Network (ANN); the deep learning classifiers Long Short-Term Memory (LSTM), Convolutional Neural Network (CNN), and Gated Recurrent Unit (GRU); and the i-vector approach, for a text-independent speaker verification task in neutral and emotional talking environments. The deep models undergo hyperparameter tuning using the Grid Search optimization algorithm. The models are trained and tested using a private Arabic Emirati Speech Database, the Ryerson Audio–Visual Database of Emotional Speech and Song (RAVDESS), and the public Crowd-Sourced Emotional Multimodal Actors (CREMA) database. Evaluation was carried out using the Equal Error Rate (EER) along with Area Under the Curve (AUC) scores. Experimental results illustrate that deep architectures do not necessarily outperform classical classifiers: the GMM yields the lowest EER values and the best AUC scores across all datasets amongst the classical classifiers, and the i-vector model surpasses all the fine-tuned deep models (CNN, LSTM, and GRU) on both evaluation metrics in neutral as well as emotional speech. In addition, the GMM outperforms the i-vector on the Emirati and RAVDESS databases. Full article
(This article belongs to the Special Issue Signal Processing Based on Convolutional Neural Network)

20 pages, 2415 KB  
Article
Defending against FakeBob Adversarial Attacks in Speaker Verification Systems with Noise-Adding
by Zesheng Chen, Li-Chi Chang, Chao Chen, Guoping Wang and Zhuming Bi
Algorithms 2022, 15(8), 293; https://doi.org/10.3390/a15080293 - 17 Aug 2022
Cited by 10 | Viewed by 3187
Abstract
Speaker verification systems use human voices as an important biometric to identify legitimate users, adding a security layer that protects voice-controlled Internet-of-Things smart homes against illegal access. Recent studies have demonstrated that speaker verification systems are vulnerable to adversarial attacks such as FakeBob. The goal of this work is to design and implement a simple and lightweight defense system that is effective against FakeBob. We specifically study two opposite pre-processing operations on input audio in speaker verification systems: denoising, which attempts to remove or reduce perturbations, and noise-adding, which adds small noise to an input audio. Through experiments, we demonstrate that both methods are able to weaken the ability of FakeBob attacks significantly, with noise-adding achieving even better performance than denoising. Specifically, with denoising, the targeted attack success rate of FakeBob attacks can be reduced from 100% to 56.05% in GMM speaker verification systems, and from 95% to only 38.63% in i-vector speaker verification systems, respectively. With noise-adding, those numbers can be further lowered to 5.20% and 0.50%, respectively. As a proactive measure, we study several possible adaptive FakeBob attacks against the noise-adding method. Experiment results demonstrate that noise-adding can still provide a considerable level of protection against these countermeasures. Full article
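
The noise-adding defense is conceptually simple: perturb the input audio with small random noise at a chosen signal-to-noise ratio before it reaches the verifier, disrupting the carefully crafted adversarial perturbation. A hedged sketch (the SNR value and all names are ours, not the paper's settings):

```python
import numpy as np

def add_defensive_noise(audio, snr_db=30.0, rng=None):
    """Add zero-mean Gaussian noise scaled to a target SNR (in dB)
    relative to the input signal's power."""
    rng = np.random.default_rng(rng)
    power = np.mean(audio ** 2)
    noise_power = power / (10.0 ** (snr_db / 10.0))
    return audio + rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)

# toy 1-second, 440 Hz tone at 16 kHz standing in for an utterance
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
defended = add_defensive_noise(clean, snr_db=30.0, rng=0)
```

The SNR trades off usability against defense strength: high SNR barely affects legitimate users but may leave adversarial perturbations intact.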

8 pages, 322 KB  
Article
The XMUSPEECH System for Accented English Automatic Speech Recognition
by Fuchuan Tong, Tao Li, Dexin Liao, Shipeng Xia, Song Li, Qingyang Hong and Lin Li
Appl. Sci. 2022, 12(3), 1478; https://doi.org/10.3390/app12031478 - 29 Jan 2022
Cited by 5 | Viewed by 3783
Abstract
In this paper, we present the XMUSPEECH systems for Track 2 of the Interspeech 2020 Accented English Speech Recognition Challenge (AESRC2020). Track 2 is an Automatic Speech Recognition (ASR) task where the non-native English speakers have various accents, which reduces the accuracy of the ASR system. To solve this problem, we experimented with acoustic models and input features. Furthermore, we trained a TDNN-LSTM language model for lattice rescoring to obtain better results. Compared with our baseline system, we achieved relative word error rate (WER) improvements of 40.7% and 35.7% on the development set and evaluation set, respectively. Full article

11 pages, 1578 KB  
Communication
Evaluating the Performance of Speaker Recognition Solutions in E-Commerce Applications
by Olja Krčadinac, Uroš Šošević and Dušan Starčević
Sensors 2021, 21(18), 6231; https://doi.org/10.3390/s21186231 - 17 Sep 2021
Cited by 6 | Viewed by 2600
Abstract
Two important tasks in many e-commerce applications are verifying the identity of the user accessing the system and determining the level of rights the user has for accessing and manipulating the system’s resources. The performance of these tasks depends directly on the certainty with which the user’s identity can be established. The main research focus of this paper is a user identity verification approach based on voice recognition techniques. The paper presents research results on the usage of open-source speaker recognition technologies in e-commerce applications, with an emphasis on evaluating the performance of the algorithms they use. Four open-source speaker recognition solutions (SPEAR, MARF, ALIZE, and HTK) have been evaluated under mismatched conditions between the training and recognition phases. In practice, mismatched conditions arise from varying lengths of spoken sentences, different types of recording devices, and the use of different languages in the training and recognition phases. All tests in this research were performed in laboratory conditions using a specially designed framework for multimodal biometrics. The obtained results are consistent with the findings of recent research showing that i-vectors and solutions based on probabilistic linear discriminant analysis (PLDA) continue to be the dominant speaker recognition approaches for text-independent tasks. Full article
(This article belongs to the Special Issue Sensor-Based Biometrics Recognition and Processing)

22 pages, 985 KB  
Article
Resultant Information Descriptors, Equilibrium States and Ensemble Entropy
by Roman F. Nalewajski
Entropy 2021, 23(4), 483; https://doi.org/10.3390/e23040483 - 19 Apr 2021
Cited by 5 | Viewed by 2722
Abstract
In this article, sources of information in electronic states are reexamined and a need for the resultant measures of the entropy/information content, combining contributions due to probability and phase/current densities, is emphasized. Probability distribution reflects the wavefunction modulus and generates classical contributions to Shannon’s global entropy and Fisher’s gradient information. The phase component of molecular states similarly determines their nonclassical supplements, due to probability “convection”. The local-energy concept is used to examine the phase equalization in the equilibrium, phase-transformed states. Continuity relations for the wavefunction modulus and phase components are reexamined, the convectional character of the local source of the resultant gradient information is stressed, and latent probability currents in the equilibrium (stationary) quantum states are related to the horizontal (“thermodynamic”) phase. The equivalence of the energy and resultant gradient information (kinetic energy) descriptors of chemical processes is stressed. In the grand-ensemble description, the reactivity criteria are defined by the populational derivatives of the system average electronic energy. Their entropic analogs, given by the associated derivatives of the overall gradient information, are shown to provide an equivalent set of reactivity indices for describing the charge transfer phenomena. Full article
(This article belongs to the Special Issue Entropic and Complexity Measures in Atomic and Molecular Systems)

20 pages, 4854 KB  
Article
Weighted Cluster-Range Loss and Criticality-Enhancement Loss for Speaker Recognition
by Jianye Mo and Li Xu
Appl. Sci. 2020, 10(24), 9004; https://doi.org/10.3390/app10249004 - 16 Dec 2020
Cited by 6 | Viewed by 2325
Abstract
While traditional i-vector-based methods are popular in the field of speaker recognition, deep learning has recently found more and more application in end-to-end models due to its attractive performance. One effective practice is the integration of an attention mechanism into Convolutional Neural Networks (CNNs). In this work, a lightweight dual-path attention block is proposed by combining self-attention and the Convolutional Block Attention Module (CBAM), which helps to capture more multi-source features with negligible extra time expense. Additionally, a Weighted Cluster-Range Loss (WCRL) is proposed to enhance the identification performance of the Cluster-Range Loss (CRL) on indecisive samples. Furthermore, to address the low efficiency of the initial training stage of CRL, a novel Criticality-Enhancement Loss (CEL) is also presented. Both of the proposed loss functions significantly promote training efficiency and globally improve recognition performance. Experimental results are presented to show the effectiveness of the proposed scheme, which achieves a competitive top-1 accuracy of 92.0%, top-5 accuracy of 97.6%, and Equal Error Rate (EER) of 3.5% on the VoxCeleb1 dataset. Full article
(This article belongs to the Section Computing and Artificial Intelligence)

17 pages, 397 KB  
Article
Robust Deep Speaker Recognition: Learning Latent Representation with Joint Angular Margin Loss
by Labib Chowdhury, Hasib Zunair and Nabeel Mohammed
Appl. Sci. 2020, 10(21), 7522; https://doi.org/10.3390/app10217522 - 26 Oct 2020
Cited by 13 | Viewed by 6352
Abstract
Speaker identification is gaining popularity, with notable applications in security, automation, and authentication. For speaker identification, deep-convolutional-network-based approaches, such as SincNet, are used as an alternative to i-vectors. Convolution performed by parameterized sinc functions in SincNet has demonstrated superior results in this area. This system optimizes a softmax loss, which is integrated into the classification layer responsible for making predictions. Since the nature of this loss is only to increase interclass distance, it is not always an optimal design choice for biometric authentication tasks such as face and speaker recognition. To overcome the aforementioned issues, this study proposes a family of models that improve upon the state-of-the-art SincNet model. The proposed models AF-SincNet, Ensemble-SincNet, and ALL-SincNet serve as potential successors to the successful SincNet model. The proposed models are compared on a number of speaker-recognition datasets, such as TIMIT and LibriSpeech, each with its own unique challenges. Performance improvements are demonstrated compared to competitive baselines. In interdataset evaluation, the best reported model not only consistently outperformed the baselines and prior models, but also generalized well to unseen and diverse tasks such as Bengali speaker recognition. Full article
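
The parameterized sinc convolution behind SincNet builds each band-pass kernel as the difference of two windowed low-pass sinc filters; in SincNet the two cutoff frequencies are the learnable parameters. A minimal static sketch of one such kernel (window choice and sizes are illustrative, not SincNet's exact configuration):

```python
import math

def sinc(x):
    """Normalized sinc: sin(pi*x)/(pi*x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def sinc_bandpass(f1, f2, length=65):
    """Band-pass FIR kernel as the difference of two low-pass sinc filters,
    Hamming-windowed; f1 < f2 are normalized cutoffs in (0, 0.5)."""
    m = length // 2
    kernel = []
    for n in range(length):
        t = n - m
        h = 2 * f2 * sinc(2 * f2 * t) - 2 * f1 * sinc(2 * f1 * t)
        w = 0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))  # Hamming
        kernel.append(h * w)
    return kernel

k = sinc_bandpass(0.05, 0.15)
```

Because only the two cutoffs are learned per filter, SincNet's first layer has far fewer parameters than a free-form convolution of the same length.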

15 pages, 1358 KB  
Article
Emotional Variability Analysis Based I-Vector for Speaker Verification in Under-Stress Conditions
by Barlian Henryranu Prasetio, Hiroki Tamura and Koichi Tanno
Electronics 2020, 9(9), 1420; https://doi.org/10.3390/electronics9091420 - 1 Sep 2020
Cited by 4 | Viewed by 2891
Abstract
Emotional conditions cause changes in the speech production system, producing acoustical characteristics that differ from those of neutral conditions. The presence of emotion degrades the performance of a speaker verification system. In this paper, we propose a speaker modeling approach that accommodates the presence of emotion in speech segments by compactly extracting a speaker representation. The speaker model is estimated by following a procedure similar to the i-vector technique, but it treats the emotional effect as the channel variability component. We name this method emotional variability analysis (EVA). EVA represents the emotion subspace separately from the speaker subspace, like the joint factor analysis (JFA) model. The effectiveness of the proposed system is evaluated by comparing it with a standard i-vector system on the speaker verification task of the Speech Under Simulated and Actual Stress (SUSAS) dataset with three different scoring methods, with evaluation in terms of the equal error rate (EER). In addition, we also conducted an ablation study for a more comprehensive analysis of the EVA-based i-vector. Based on the experimental results, the proposed system outperformed the standard i-vector system and achieved state-of-the-art results in the verification task for under-stressed speakers. Full article
(This article belongs to the Special Issue Applications of Bioinspired Neural Network)
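
The equal error rate used for evaluation here is the operating point where the false-accept and false-reject rates coincide. A small brute-force sketch over candidate thresholds (the toy scores are ours, not the paper's data):

```python
def equal_error_rate(genuine, impostor):
    """Sweep every observed score as a threshold; report the point where
    false-accept rate (FAR) and false-reject rate (FRR) are closest."""
    best = (1.0, None)  # (|FAR - FRR|, EER estimate)
    for t in sorted(genuine + impostor):
        far = sum(s >= t for s in impostor) / len(impostor)
        frr = sum(s < t for s in genuine) / len(genuine)
        if abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2.0)
    return best[1]

eer = equal_error_rate(genuine=[0.9, 0.8, 0.7, 0.3],
                       impostor=[0.6, 0.2, 0.1, 0.05])
```

Production toolkits interpolate the ROC curve instead of sweeping raw scores, but the definition is the same.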

18 pages, 4505 KB  
Article
Automatic Language Identification Using Speech Rhythm Features for Multi-Lingual Speech Recognition
by Hwamin Kim and Jeong-Sik Park
Appl. Sci. 2020, 10(7), 2225; https://doi.org/10.3390/app10072225 - 25 Mar 2020
Cited by 15 | Viewed by 5427
Abstract
Conventional speech recognition systems can handle input speech only in a specific single language. To realize multi-lingual speech recognition, the language must first be identified from the input speech. This study proposes an efficient Language IDentification (LID) approach for multi-lingual systems. Standard LID tasks rely on the common acoustic features used in speech recognition. However, these features may convey insufficient language-specific information, as they aim to discriminate the general tendency of phonemic information. This study investigates another type of feature characterizing language-specific properties, while taking computational complexity into account. We focus on speech rhythm features providing the prosodic characteristics of speech signals. Rhythm features represent the tendencies of consonants and vowels in a language, so consonants and vowels must first be classified from the speech signal. For rapid classification, we employ Gaussian Mixture Model (GMM)-based learning in which two GMMs, corresponding to consonants and vowels, are first trained and then used to classify them. From the classification results, we estimate the tendencies of the two phonemic groups, such as the durations of consonantal and vocalic intervals, and calculate rhythm metrics called the R-vector. In experiments on several speech corpora, the automatically extracted R-vector exhibited language tendencies similar to those reported in conventional linguistics studies. In addition, the proposed R-vector-based LID approach demonstrated LID performance superior or comparable to conventional approaches in spite of its low computational complexity. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
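
Classic rhythm metrics built from consonantal and vocalic interval durations include the vocalic proportion (%V) and the standard deviations of interval durations (often written ΔC, ΔV). A toy sketch, assuming the consonant/vowel labels are already available from a classifier (names and values are illustrative, not the paper's exact R-vector definition):

```python
import statistics

def rhythm_metrics(intervals):
    """Compute %V (vocalic proportion of total duration) and the population
    standard deviations of consonantal/vocalic interval durations.
    `intervals` is a list of (kind, duration) pairs, kind in {"C", "V"}."""
    c = [d for k, d in intervals if k == "C"]
    v = [d for k, d in intervals if k == "V"]
    pct_v = sum(v) / (sum(c) + sum(v))
    return {"%V": pct_v,
            "deltaC": statistics.pstdev(c),
            "deltaV": statistics.pstdev(v)}

# toy C/V interval sequence with durations in seconds
m = rhythm_metrics([("C", 0.08), ("V", 0.12), ("C", 0.10),
                    ("V", 0.12), ("C", 0.06)])
```

Stacking several such statistics into one vector per utterance gives a rhythm feature of the kind the R-vector generalizes.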

12 pages, 451 KB  
Proceeding Paper
On the Use of Fisher Vector Encoding for Voice Spoofing Detection
by Jahangir Alam
Proceedings 2019, 31(1), 37; https://doi.org/10.3390/proceedings2019031037 - 20 Nov 2019
Cited by 2 | Viewed by 1816
Abstract
Recently, the vulnerability of automatic speaker recognition systems to spoofing attacks has received significant interest among researchers. A robust speaker recognition system demands not only high recognition accuracy but also robustness to spoofing attacks. Several spoofing and countermeasure challenges have been organized to draw attention to this problem in the speaker recognition community. Low-level descriptors designed to detect artifacts in spoofed speech are found to be the most effective countermeasures against spoofing attacks. In this work, we used Fisher vector encoding of low-level descriptors extracted from speech signals. The idea behind Fisher vector encoding is to determine the amount of change induced by the descriptors of a signal on a background probability model, which is typically a Gaussian mixture model. The Fisher vector encodes the amount of change of the model parameters required to optimally fit the incoming data. For performance evaluation of the proposed approach, we carried out spoofing detection experiments on the 2015 edition of the automatic speaker verification spoofing and countermeasures challenge (ASVspoof2015) and report results on the evaluation set. As baseline systems, we used the standard Gaussian mixture model and i-vector/PLDA paradigms. For a fair comparison, Constant Q cepstral coefficient (CQCC) features were used as low-level descriptors in all systems. With the Fisher vector-based approach, we achieved an equal error rate (EER) of 0.1145% on the known attacks, 1.223% on the unknown attacks, and 0.668% on average. Moreover, with a single decision threshold, this approach yielded an EER of 1.05% on the evaluation set. Full article
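
Fisher vector encoding with respect to a diagonal-covariance GMM can be sketched for the first-order (mean) statistics: compute component posteriors for each descriptor, then accumulate whitened, posterior-weighted residuals against each component mean. This is a generic sketch of the technique, not the author's implementation:

```python
import numpy as np

def fisher_vector_means(X, weights, means, var):
    """First-order Fisher vector of descriptors X (N x D) w.r.t. a
    diagonal-covariance GMM (K components): for each component, the
    posterior-weighted, variance-whitened mean residual."""
    N, D = X.shape
    K = len(weights)
    # log-posteriors (responsibilities) of each component for each descriptor
    log_p = -0.5 * (((X[:, None, :] - means[None]) ** 2) / var[None]).sum(-1)
    log_p += np.log(weights)[None] - 0.5 * np.log(var).sum(-1)[None]
    log_p -= log_p.max(axis=1, keepdims=True)  # stabilize before exp
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    fv = []
    for k in range(K):
        diff = (X - means[k]) / np.sqrt(var[k])
        fv.append((gamma[:, [k]] * diff).sum(0) / (N * np.sqrt(weights[k])))
    return np.concatenate(fv)  # length K * D

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))  # toy 2-D descriptors
fv = fisher_vector_means(X, weights=np.array([0.5, 0.5]),
                         means=np.array([[-1.0, 0.0], [1.0, 0.0]]),
                         var=np.ones((2, 2)))
```

Full Fisher vectors also include second-order (variance) statistics and are usually power- and L2-normalized before classification.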
