Open Access Article
Appl. Sci. 2019, 9(8), 1597

Unsupervised Learning of Total Variability Embedding for Speaker Verification with Random Digit Strings

Department of Electrical and Computer Engineering and the Institute of New Media and Communications, Seoul National University, Seoul 08826, Korea
Author to whom correspondence should be addressed.
Received: 13 March 2019 / Revised: 5 April 2019 / Accepted: 12 April 2019 / Published: 17 April 2019
(This article belongs to the Special Issue Advanced Biometrics with Deep Learning)


Recently, the increasing demand for voice-based authentication systems has encouraged researchers to investigate methods for verifying users with short, randomized pass-phrases drawn from a constrained vocabulary. The conventional i-vector framework, a state-of-the-art utterance-level feature extraction technique for speaker verification, is not optimal for this task, since it is known to suffer severe performance degradation on short-duration speech utterances. More recent approaches that use deep learning to embed the speaker variability in a non-linear fashion have shown impressive performance on various speaker verification tasks. However, since most of these techniques are trained in a supervised manner and therefore require speaker labels for the training data, they are difficult to apply when labeled data are scarce. In this paper, we propose a novel technique for extracting an i-vector-like feature based on the variational autoencoder (VAE), which is trained in an unsupervised manner to obtain a latent variable representing the variability within a Gaussian mixture model (GMM) distribution. The proposed framework is compared with the conventional i-vector method on the TIDIGITS dataset. Experimental results show that the proposed method copes with the performance deterioration caused by short utterance duration. Furthermore, the performance of the proposed approach improves significantly when applied in conjunction with the conventional i-vector framework.
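The core idea in the abstract — an encoder that maps a variable-length utterance to a Gaussian posterior over a fixed-dimensional latent, whose mean serves as an i-vector-like embedding scored by similarity — can be sketched as follows. This is a minimal illustrative toy, not the authors' implementation: the linear encoder, random weights, feature dimensions, and mean-pooling of frames are all assumptions made for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frames, W_mu, W_logvar):
    """Map variable-length frame-level features to a Gaussian posterior.
    The posterior mean acts as the fixed-size, i-vector-like embedding."""
    pooled = frames.mean(axis=0)          # crude utterance-level statistic
    return pooled @ W_mu, pooled @ W_logvar

def reparameterize(mu, logvar, rng):
    """VAE reparameterization trick: sample z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def cosine(a, b):
    """Score two embeddings for verification by cosine similarity."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical dimensions: 20-dim acoustic features, 4-dim latent.
feat_dim, latent_dim = 20, 4
W_mu = rng.standard_normal((feat_dim, latent_dim)) * 0.1
W_logvar = rng.standard_normal((feat_dim, latent_dim)) * 0.1

utterance = rng.standard_normal((150, feat_dim))   # e.g., 150 feature frames
mu, logvar = encode(utterance, W_mu, W_logvar)
z = reparameterize(mu, logvar, rng)
```

In a real system the encoder would be a trained neural network and the VAE would be fit to GMM-aligned statistics, but the shape of the pipeline — frames in, a (mu, logvar) pair out, cosine scoring of the means — is the same.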
Keywords: speech embedding; deep learning; speaker recognition
This is an open access article distributed under the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

MDPI and ACS Style

Kang, W.H.; Kim, N.S. Unsupervised Learning of Total Variability Embedding for Speaker Verification with Random Digit Strings. Appl. Sci. 2019, 9, 1597.


Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.

Appl. Sci. EISSN 2076-3417, published by MDPI AG, Basel, Switzerland.