MDPI - Publisher of Open Access Journals

23 pages, 520 KB

Open Access Article

Investigation of Text-Independent Speaker Verification by Support Vector Machine-Based Machine Learning Approaches

by Odin Kohler and Masudul Imtiaz

Electronics 2025, 14(5), 963; https://doi.org/10.3390/electronics14050963 - 28 Feb 2025

Cited by 4 | Viewed by 2915

Speaker verification is a common issue that has innumerable biomedical security applications. Speaker verification comes in two different forms: text-independent and text-dependent. Each of these forms can be implemented via many different machine learning and deep learning techniques. From our research, we found that there is significantly less work implementing text-independent speaker verification using machine learning techniques than there is using deep learning techniques. Because of this gap, we were motivated to build our own SVM and CNN models for text-independent speaker verification and compare them to other systems using SVMs or deep learning techniques. We limited ourselves to SVMs because they are commonly used for speech recognition and have achieved very high accuracies. The main motivation behind this was two-fold. The first reason is to demonstrate that SVMs can and have been successfully used for text-independent speaker verification at a level comparable to deep learning techniques; the second reason is to make work using SVMs for text-independent speaker verification more accessible so it can be expanded upon easily. The analysis and comparison conducted in this paper will demonstrate how SVMs achieve results comparable to deep learning techniques and allow future researchers to more easily find SVMs used for text-independent speaker verification and derive a sense of what is being implemented in the field.

(This article belongs to the Special Issue Advanced Machine Learning, Pattern Recognition, and Deep Learning Technologies: Methodologies and Applications)

16 pages, 7288 KB

Open Access Article

mmSafe: A Voice Security Verification System Based on Millimeter-Wave Radar

by Zhanjun Hao, Jianxiang Peng, Xiaochao Dang, Hao Yan and Ruidong Wang

Sensors 2022, 22(23), 9309; https://doi.org/10.3390/s22239309 - 29 Nov 2022

Cited by 4 | Viewed by 3728

Abstract

With the increasing popularity of smart devices, users can control their mobile phones, TVs, cars, and smart furniture by using voice assistants, but voice assistants are susceptible to intrusion by outsider speakers or playback attacks. In order to address this security issue, a millimeter-wave radar-based voice security authentication system is proposed in this paper. First, the speaker’s fine-grained vocal cord vibration signal is extracted by eliminating static object clutter and motion effects; second, the weighted Mel Frequency Cepstrum Coefficients (MFCCs) are obtained as biometric features; and finally, text-independent security authentication is performed by the WMHS (Weighted MFCCs and Hog-based SVM) method. This system is highly adaptable and can authenticate designated speakers, resist intrusion by other unspecified speakers as well as playback attacks, and is secure for smart devices. Extensive experiments have verified that the system achieves a 93.4% speaker verification accuracy and a 5.8% miss detection rate for playback attacks.

(This article belongs to the Special Issue Communication, Security, and Privacy in IoT)

17 pages, 3285 KB

Open Access Article

Pseudo-Phoneme Label Loss for Text-Independent Speaker Verification

by Mengqi Niu, Liang He, Zhihua Fang, Baowei Zhao and Kai Wang

Appl. Sci. 2022, 12(15), 7463; https://doi.org/10.3390/app12157463 - 25 Jul 2022

Cited by 3 | Viewed by 2827

Abstract

Compared with text-independent speaker verification (TI-SV) systems, text-dependent speaker verification (TD-SV) counterparts often have better performance for their efficient utilization of speech content information. On this account, some TI-SV methods tried to boost performance by incorporating an extra automatic speech recognition (ASR) component to explore content information, such as c-vector. However, the introduced ASR component requires a large amount of annotated data and consumes high computation resources. In this paper, we propose a pseudo-phoneme label (PPL) loss for the TI-SV task by integrating content cluster loss at the frame level and speaker recognition loss at the segment level in a unified network by multitask learning, without additional data requirement and exhausting computation. By referring to HuBERT, we generate pseudo-phoneme labels to adjust a frame level feature distribution by deep cluster to ensure each cluster corresponds to an implicit pronunciation unit in the feature space. We compare the proposed loss with the softmax loss, center loss, triplet loss, log-likelihood-ratio cost loss, additive margin softmax loss and additive angular margin loss on the VoxCeleb database. Experimental results demonstrate the effectiveness of our proposed method.

(This article belongs to the Section Computing and Artificial Intelligence)

11 pages, 1578 KB

Open Access Communication

Evaluating the Performance of Speaker Recognition Solutions in E-Commerce Applications

by Olja Krčadinac, Uroš Šošević and Dušan Starčević

Sensors 2021, 21(18), 6231; https://doi.org/10.3390/s21186231 - 17 Sep 2021

Cited by 6 | Viewed by 2765

Abstract

Two important tasks in many e-commerce applications are identity verification of the user accessing the system and determining the level of rights that the user has for accessing and manipulating the system’s resources. The performance of these tasks is directly dependent on the certainty of establishing the identity of the user. The main research focus of this paper is a user identity verification approach based on voice recognition techniques. The paper presents research results connected to the usage of open-source speaker recognition technologies in e-commerce applications with an emphasis on evaluating the performance of the algorithms they use. Four open-source speaker recognition solutions (SPEAR, MARF, ALIZE, and HTK) have been evaluated in cases of mismatched conditions during training and recognition phases. In practice, mismatched conditions are influenced by various lengths of spoken sentences, different types of recording devices, and the usage of different languages in training and recognition phases. All tests conducted in this research were performed in laboratory conditions using the specially designed framework for multimodal biometrics. The obtained results show consistency with the findings of recent research which proves that i-vectors and solutions based on probabilistic linear discriminant analysis (PLDA) continue to be the dominant speaker recognition approaches for text-independent tasks.

(This article belongs to the Special Issue Sensor-Based Biometrics Recognition and Processing)

14 pages, 3446 KB

Open Access Article

Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System

by Soonshin Seo and Ji-Hwan Kim

Electronics 2020, 9(10), 1706; https://doi.org/10.3390/electronics9101706 - 17 Oct 2020

Cited by 4 | Viewed by 3117

Abstract

One of the most important parts of a text-independent speaker verification system is speaker embedding generation. Previous studies demonstrated that shortcut connections-based multi-layer aggregation improves the representational power of a speaker embedding system. However, model parameters are relatively large in number, and unspecified variations increase in the multi-layer aggregation. Therefore, in this study, we propose a self-attentive multi-layer aggregation with feature recalibration and deep length normalization for a text-independent speaker verification system. To reduce the number of model parameters, we set the ResNet with the scaled channel width and layer depth as a baseline. To control the variability in the training, we apply a self-attention mechanism to perform multi-layer aggregation with dropout regularizations and batch normalizations. Subsequently, we apply a feature recalibration layer to the aggregated feature using fully-connected layers and nonlinear activation functions. Further, deep length normalization is used on a recalibrated feature in the training process. Experimental results using the VoxCeleb1 evaluation dataset showed that the performance of the proposed methods was comparable to that of state-of-the-art models (equal error rate of 4.95% and 2.86%, using the VoxCeleb1 and VoxCeleb2 training datasets, respectively).

(This article belongs to the Special Issue Human Computer Interaction for Intelligent Systems)

12 pages, 1672 KB

Open Access Article

Addressing Text-Dependent Speaker Verification Using Singing Speech

by Yan Shi, Juanjuan Zhou, Yanhua Long, Yijie Li and Hongwei Mao

Appl. Sci. 2019, 9(13), 2636; https://doi.org/10.3390/app9132636 - 28 Jun 2019

Cited by 7 | Viewed by 3459

Abstract

Automatic speaker verification (ASV) has achieved significant progress in recent years. However, it is still very challenging to generalize ASV technologies to new, unknown and spoofing conditions. Most previous studies focused on extracting the speaker information from natural speech. This paper attempts to address speaker verification from another perspective. The speaker identity information was exploited from singing speech. We first designed and released a new corpus for speaker verification based on singing and normal reading speech. Then, the speaker discrimination was compared and analyzed between natural and singing speech in different feature spaces. Furthermore, the conventional Gaussian mixture model, the dynamic time warping and the state-of-the-art deep neural network were investigated. They were used to build text-dependent ASV systems with different training-test conditions. Experimental results show that the voiceprint information in the singing speech was more distinguishable than that in the normal speech. A relative reduction of more than 20% in equal error rate was obtained on both the gender-dependent and gender-independent 1 s-1 s evaluation tasks.

(This article belongs to the Section Acoustics and Vibrations)

Search Results (6)
