Open Access Article

Supervector Extraction for Encoding Speaker and Phrase Information with Neural Networks for Text-Dependent Speaker Verification

ViVoLab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, 50018 Zaragoza, Spain
* Authors to whom correspondence should be addressed.
This paper is an extended version of our paper published at the IberSPEECH 2018 conference.
Appl. Sci. 2019, 9(16), 3295; https://doi.org/10.3390/app9163295
Received: 24 June 2019 / Revised: 6 August 2019 / Accepted: 6 August 2019 / Published: 11 August 2019
PDF [1115 KB, uploaded 11 August 2019]
Abstract

In this paper, we propose a new differentiable neural network with an alignment mechanism for text-dependent speaker verification. Unlike previous works, we do not extract the embedding of an utterance from global average pooling over the temporal dimension. Our system replaces this reduction mechanism with a phonetic phrase alignment model to preserve the temporal structure of each phrase, since the phonetic information is relevant to the verification task. Moreover, we can apply a convolutional neural network as the front-end and, because the alignment process is differentiable, train the network to produce a supervector for each utterance that is discriminative with respect to the speaker and the phrase simultaneously. This choice has the advantage that the supervector encodes both the phrase and the speaker information, providing good performance in text-dependent speaker verification tasks. The verification process is performed using a basic similarity metric. The new model using alignment to produce supervectors was evaluated on the RSR2015-Part I database, providing competitive results compared to similarly sized networks that use global average pooling to extract embeddings. Furthermore, we also evaluated this proposal on RSR2015-Part II. To our knowledge, this system achieves the best published results on this second part.
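The pooling described above can be illustrated with a minimal sketch. This is not the authors' implementation; it only contrasts global average pooling with an alignment-weighted pooling, assuming the front-end yields frame-level features of shape (T, D) and the alignment model yields soft posteriors of shape (T, S) over S phonetic states (function names and shapes are illustrative):

```python
import numpy as np

def supervector(features, posteriors):
    """Alignment-based pooling: one weighted mean per phonetic state.

    features:   (T, D) frame-level features from the front-end
    posteriors: (T, S) soft assignment of each frame to S states
    returns:    (S * D,) supervector, preserving phonetic structure
    """
    # Normalize so each state's frame weights sum to 1
    weights = posteriors / (posteriors.sum(axis=0, keepdims=True) + 1e-8)
    state_means = weights.T @ features    # (S, D) per-state averages
    return state_means.reshape(-1)        # concatenate -> (S * D,)

def gap_embedding(features):
    """Global average pooling baseline: temporal structure is discarded."""
    return features.mean(axis=0)          # (D,)
```

Because every operation is a differentiable matrix product, gradients flow through the pooling into the front-end during training. Verification then reduces to a simple similarity (e.g., cosine) between enrollment and test supervectors.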
Keywords: text-dependent speaker verification; HMM alignment; deep neural networks; supervectors
This is an open access article distributed under the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Cite This Article

MDPI and ACS Style

Mingote, V.; Miguel, A.; Ortega, A.; Lleida, E. Supervector Extraction for Encoding Speaker and Phrase Information with Neural Networks for Text-Dependent Speaker Verification. Appl. Sci. 2019, 9, 3295.


Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.

Appl. Sci. EISSN 2076-3417, published by MDPI AG, Basel, Switzerland.