
Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System

Department of Computer Science and Engineering, Sogang University, Seoul 04107, Korea
*
Author to whom correspondence should be addressed.
Electronics 2020, 9(10), 1706; https://doi.org/10.3390/electronics9101706
Received: 19 August 2020 / Revised: 14 October 2020 / Accepted: 15 October 2020 / Published: 17 October 2020
(This article belongs to the Special Issue Human Computer Interaction for Intelligent Systems)
One of the most important components of a text-independent speaker verification system is speaker embedding generation. Previous studies demonstrated that multi-layer aggregation based on shortcut connections improves the representational power of a speaker embedding system. However, this aggregation introduces a relatively large number of model parameters and increases unspecified variability. Therefore, in this study, we propose self-attentive multi-layer aggregation with feature recalibration and deep length normalization for a text-independent speaker verification system. To reduce the number of model parameters, we use a ResNet with scaled channel width and layer depth as the baseline. To control variability during training, we apply a self-attention mechanism to perform multi-layer aggregation with dropout regularization and batch normalization. We then apply a feature recalibration layer to the aggregated feature using fully connected layers and nonlinear activation functions. Finally, deep length normalization is applied to the recalibrated feature during training. Experimental results on the VoxCeleb1 evaluation dataset show that the performance of the proposed methods is comparable to that of state-of-the-art models (equal error rates of 4.95% and 2.86% when training on the VoxCeleb1 and VoxCeleb2 datasets, respectively).
Keywords: text-independent speaker verification system; self-attentive pooling; multi-layer aggregation; feature recalibration; deep length normalization; speaker embedding; shortcut connections; convolutional neural networks; ResNet
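The embedding pipeline the abstract describes — attention-weighted pooling of frame-level features into an utterance-level embedding, squeeze-and-excitation-style feature recalibration through fully connected layers with nonlinear activations, and length normalization of the result — can be sketched as follows. This is a minimal NumPy illustration under assumed dimensions and random weights, not the paper's actual ResNet configuration or trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attentive_pool(H, W, v):
    """Collapse frame-level features H (T, D) into one utterance-level vector.

    Attention scores are computed per frame via a tanh projection, then
    softmax-normalized and used as weights for a sum over frames.
    """
    scores = np.tanh(H @ W) @ v            # (T,) one score per frame
    a = np.exp(scores - scores.max())      # numerically stable softmax
    a = a / a.sum()
    return a @ H                           # (D,) weighted sum over frames

def recalibrate(e, W1, W2):
    """Feature recalibration: FC -> ReLU -> FC -> sigmoid gate, then
    elementwise rescaling of the aggregated feature (SE-style gating)."""
    s = np.maximum(0.0, e @ W1)            # bottleneck projection + ReLU
    g = 1.0 / (1.0 + np.exp(-(s @ W2)))    # per-dimension gates in (0, 1)
    return g * e

def length_normalize(e, eps=1e-8):
    """Project the embedding onto the unit hypersphere."""
    return e / (np.linalg.norm(e) + eps)

# Illustrative sizes: 50 frames, 64-dim features, 16-dim bottleneck.
T, D, R = 50, 64, 16
H = rng.standard_normal((T, D))            # stand-in for frame-level features
W = rng.standard_normal((D, D))
v = rng.standard_normal(D)
W1 = rng.standard_normal((D, R))
W2 = rng.standard_normal((R, D))

emb = length_normalize(recalibrate(self_attentive_pool(H, W, v), W1, W2))
print(emb.shape)                           # (64,) unit-norm speaker embedding
```

In a trained system, `W`, `v`, `W1`, and `W2` would be learned jointly with the ResNet backbone, and the pooled input would be the concatenation of multiple layers' outputs rather than a single feature map.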
Figure 1

MDPI and ACS Style

Seo, S.; Kim, J.-H. Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System. Electronics 2020, 9, 1706.

