
An Analysis of the Short Utterance Problem for Speaker Characterization

ViVoLab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, 50018 Zaragoza, Spain
This paper is an extended version of our paper published in IberSPEECH2018.
Appl. Sci. 2019, 9(18), 3697
Received: 11 July 2019 / Revised: 27 August 2019 / Accepted: 29 August 2019 / Published: 5 September 2019
Speaker characterization has always been conditioned by the length of the evaluated utterances. Although systems perform well with large amounts of audio, performance degrades significantly when short utterances are considered. In this work we present an analysis of the short utterance problem from an alternative point of view. From our perspective, performance on short utterances is highly influenced by the phonetic similarity between enrollment and test utterances: both should contain similar phonemes for proper discrimination, and performance degrades otherwise. In this study we also interpret short utterances as incomplete long utterances in which some acoustic units are either unbalanced or missing altogether. These missing units make the speaker representations unreliable: they are biased with respect to their reference counterparts, obtained from long utterances. These undesired shifts increase the intra-speaker variability, causing a significant loss of performance. According to our experiments, short utterances (3–60 s) can perform as accurately as long utterances simply by ensuring matched phonetic distributions. This analysis is determined by the current embedding extraction approach, which is based on the accumulation of local short-time information. Thus, it is applicable to most state-of-the-art embeddings, including traditional i-vectors and Deep Neural Network (DNN) x-vectors.
Keywords: speaker recognition; short utterances; phonetic content

MDPI and ACS Style

Viñals, I.; Ortega, A.; Miguel, A.; Lleida, E. An Analysis of the Short Utterance Problem for Speaker Characterization. Appl. Sci. 2019, 9, 3697.
