Search Results (26)

Search Parameters:
Keywords = voice activity detection (VAD)

17 pages, 1071 KiB  
Article
Empirical Analysis of Learning Improvements in Personal Voice Activity Detection Frameworks
by Yu-Tseng Yeh, Chia-Chi Chang and Jeih-Weih Hung
Electronics 2025, 14(12), 2372; https://doi.org/10.3390/electronics14122372 - 10 Jun 2025
Viewed by 502
Abstract
Personal Voice Activity Detection (PVAD) has emerged as a critical technology for enabling speaker-specific detection in multi-speaker environments, surpassing the limitations of conventional Voice Activity Detection (VAD) systems that merely distinguish speech from non-speech. PVAD systems are essential for applications such as personalized voice assistants and robust speech recognition, where accurately identifying a target speaker’s voice amidst background speech and noise is crucial for both user experience and computational efficiency. Despite significant progress, PVAD frameworks still face challenges related to temporal modeling, integration of speaker information, class imbalance, and deployment on resource-constrained devices. In this study, we present a systematic enhancement of the PVAD framework through four key innovations: (1) a Bi-GRU (Bidirectional Gated Recurrent Unit) layer for improved temporal modeling of speech dynamics, (2) a cross-attention mechanism for context-aware speaker embedding integration, (3) a hybrid CE-AUROC (Cross-Entropy and Area Under Receiver Operating Characteristic) loss function to address class imbalance, and (4) Cosine Annealing Learning Rate (CALR) for optimized training convergence. Evaluated on LibriSpeech datasets under varied acoustic conditions, the proposed modifications demonstrate significant performance gains over the baseline PVAD framework, achieving 87.59% accuracy (vs. 86.18%) and 0.9481 mean Average Precision (vs. 0.9378) while maintaining real-time processing capabilities. These advancements address critical challenges in PVAD deployment, including robustness to noisy environments, with the hybrid loss function reducing false negatives by 12% in imbalanced scenarios. The work provides practical insights for implementing personalized voice interfaces on resource-constrained devices. Future extensions will explore quantized inference and multi-modal sensor fusion to further bridge the gap between laboratory performance and real-world deployment requirements. Full article
(This article belongs to the Special Issue Emerging Trends in Generative-AI Based Audio Processing)
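The abstract names two training-side ingredients that are easy to make concrete: a loss mixing cross-entropy with an AUROC-oriented term, and cosine annealing of the learning rate. The sketch below is illustrative only and is not the authors' exact formulation; it pairs cross-entropy with a common pairwise soft-ranking surrogate for AUROC, and the mixing weight `alpha` and network are placeholders.

```python
# Hedged PyTorch sketch: hybrid CE + AUROC-surrogate loss and cosine annealing.
import torch
import torch.nn.functional as F

def hybrid_ce_auroc_loss(logits, targets, alpha=0.5):
    """logits: (N, 2) frame logits; targets: (N,) 0/1 target-speaker labels."""
    ce = F.cross_entropy(logits, targets)

    scores = logits[:, 1] - logits[:, 0]           # higher => more likely target speech
    pos, neg = scores[targets == 1], scores[targets == 0]
    if len(pos) == 0 or len(neg) == 0:
        return ce                                   # degenerate batch: fall back to CE

    # Soft AUROC: probability that a positive frame outranks a negative frame.
    diffs = pos.unsqueeze(1) - neg.unsqueeze(0)     # (P, N) pairwise score gaps
    auroc_surrogate = torch.sigmoid(diffs).mean()   # differentiable value in (0, 1)
    return alpha * ce + (1.0 - alpha) * (1.0 - auroc_surrogate)

loss = hybrid_ce_auroc_loss(torch.randn(32, 2), torch.randint(0, 2, (32,)))

# Cosine Annealing Learning Rate (standard PyTorch scheduler, as named in the abstract).
model = torch.nn.Linear(40, 2)                      # stand-in for the PVAD network
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=50)
```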

18 pages, 3278 KiB  
Article
Efficient Detection of Mind Wandering During Reading Aloud Using Blinks, Pitch Frequency, and Reading Rate
by Amir Rabinovitch, Eden Ben Baruch, Maor Siton, Nuphar Avital, Menahem Yeari and Dror Malka
AI 2025, 6(4), 83; https://doi.org/10.3390/ai6040083 - 18 Apr 2025
Cited by 2 | Viewed by 961
Abstract
Mind wandering is a common issue among schoolchildren and academic students, often undermining the quality of learning and teaching effectiveness. Current detection methods mainly rely on eye trackers and electrodermal activity (EDA) sensors, focusing on external indicators such as facial movements but neglecting voice detection. These methods are often cumbersome, uncomfortable for participants, and invasive, requiring specialized, expensive equipment that disrupts the natural learning environment. To overcome these challenges, a new algorithm has been developed to detect mind wandering during reading aloud. Based on external indicators like the blink rate, pitch frequency, and reading rate, the algorithm integrates these three criteria to ensure the accurate detection of mind wandering using only a standard computer camera and microphone, making it easy to implement and widely accessible. An experiment with ten participants validated this approach. Participants read aloud a text of 1304 words while the algorithm, incorporating the Viola–Jones model for face and eye detection and pitch-frequency analysis, monitored for signs of mind wandering. A voice activity detection (VAD) technique was also used to recognize human speech. The algorithm achieved 76% accuracy in predicting mind wandering during specific text segments, demonstrating the feasibility of using noninvasive physiological indicators. This method offers a practical, non-intrusive solution for detecting mind wandering through video and audio data, making it suitable for educational settings. Its ability to integrate seamlessly into classrooms holds promise for enhancing student concentration, improving the teacher–student dynamic, and boosting overall teaching effectiveness. By leveraging standard, accessible technology, this approach could pave the way for more personalized, technology-enhanced education systems. Full article
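To make the "integrates these three criteria" step tangible, here is a minimal, hypothetical fusion rule: flag a segment when at least two of the three indicators deviate from baseline. The thresholds and feature definitions below are placeholders for illustration, not values from the paper.

```python
# Hedged sketch of combining blink rate, pitch variability, and reading rate.
import numpy as np

def mind_wandering_flag(blink_rate, pitch_std, reading_rate,
                        blink_thr=0.45, pitch_thr=12.0, rate_thr=2.1):
    """Majority vote over three per-segment indicators (all thresholds hypothetical)."""
    votes = 0
    votes += blink_rate > blink_thr        # elevated blink rate (blinks/s)
    votes += pitch_std < pitch_thr         # flattened pitch variability (Hz)
    votes += reading_rate < rate_thr       # slowed reading rate (words/s)
    return votes >= 2

segments = [(0.30, 20.5, 2.6), (0.55, 9.8, 1.7)]
flags = [mind_wandering_flag(*s) for s in segments]   # -> [False, True]
```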

20 pages, 20407 KiB  
Article
VAD-CLVA: Integrating CLIP with LLaVA for Voice Activity Detection
by Andrea Appiani and Cigdem Beyan
Information 2025, 16(3), 233; https://doi.org/10.3390/info16030233 - 16 Mar 2025
Viewed by 1585
Abstract
Voice activity detection (VAD) is the process of automatically determining whether a person is speaking and identifying the timing of their speech in audiovisual data. Traditionally, this task has been tackled by processing either audio signals or visual data, or by combining both modalities through fusion or joint learning. In our study, drawing inspiration from recent advancements in visual-language models, we introduce a novel approach leveraging Contrastive Language-Image Pretraining (CLIP) models. The CLIP visual encoder analyzes video segments focusing on the upper body of an individual, while the text encoder processes textual descriptions generated by a Generative Large Multimodal Model, i.e., the Large Language and Vision Assistant (LLaVA). Subsequently, embeddings from these encoders are fused through a deep neural network to perform VAD. Our experimental analysis across three VAD benchmarks showcases the superior performance of our method compared to existing visual VAD approaches. Notably, our approach outperforms several audio-visual methods despite its simplicity and without requiring pretraining on extensive audio-visual datasets. Full article
(This article belongs to the Special Issue Application of Machine Learning in Human Activity Recognition)
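The fusion stage described in the abstract can be sketched as a small classifier over two precomputed embeddings. The dimensions, hidden size, and two-class head below are illustrative assumptions; the encoders themselves (CLIP image encoder, CLIP text encoder over LLaVA-generated descriptions) are not reproduced here.

```python
# Hedged sketch of the embedding-fusion classifier only.
import torch
import torch.nn as nn

class FusionVAD(nn.Module):
    def __init__(self, dim_visual=512, dim_text=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim_visual + dim_text, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),            # speaking vs. not speaking
        )

    def forward(self, visual_emb, text_emb):
        return self.mlp(torch.cat([visual_emb, text_emb], dim=-1))

model = FusionVAD()
logits = model(torch.randn(8, 512), torch.randn(8, 512))  # one embedding pair per video segment
```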

17 pages, 4873 KiB  
Article
An Ensemble Approach for Speaker Identification from Audio Files in Noisy Environments
by Syed Shahab Zarin, Ehzaz Mustafa, Sardar Khaliq uz Zaman, Abdallah Namoun and Meshari Huwaytim Alanazi
Appl. Sci. 2024, 14(22), 10426; https://doi.org/10.3390/app142210426 - 13 Nov 2024
Viewed by 1183
Abstract
Automatic noise-robust speaker identification is essential in various applications, including forensic analysis, e-commerce, smartphones, and security systems. Audio files containing suspect speech often include background noise, as they are typically not recorded in soundproof environments. To this end, we address the challenges of noise robustness and accuracy in speaker identification systems. An ensemble approach is proposed that combines two different neural network architectures, an RNN and a DNN, using softmax. This approach enhances the system’s ability to identify speakers accurately even in noisy environments. Using softmax, we combine voice activity detection (VAD) with a multilayer perceptron (MLP). The VAD component aims to remove noisy frames from the recording. The softmax function addresses the residual noise traces by assigning a higher probability to the speaker’s voice than to the noise. We tested our proposed solution on the Kaggle speaker recognition dataset and compared it to two baseline systems. Experimental results show that our approach outperforms the baseline systems, achieving a 3.6% and 5.8% increase in test accuracy. Additionally, we compared the proposed MLP system with Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) classifiers. The results demonstrate that the MLP with VAD and softmax outperforms the LSTM by 23.2% and the BiLSTM by 6.6% in test accuracy. Full article
(This article belongs to the Special Issue Advances in Intelligent Information Systems and AI Applications)
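The pipeline shape described here, dropping likely noise-only frames with a VAD and then classifying the remaining frames with a softmax-output MLP, can be illustrated with a very small sketch. The energy threshold, feature dimensions, and speaker count below are assumptions for demonstration, not the paper's configuration.

```python
# Hedged sketch: energy-based frame dropping followed by an MLP speaker classifier.
import numpy as np
from sklearn.neural_network import MLPClassifier

def energy_vad(frames, threshold_db=-35.0):
    """frames: (num_frames, frame_len) waveform frames; keep frames above the energy threshold."""
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return frames[energy_db > threshold_db]

rng = np.random.default_rng(0)
kept = energy_vad(rng.normal(scale=0.01, size=(50, 400)))   # low-energy noise frames mostly dropped

X_train = rng.normal(size=(200, 40))        # stand-in MFCC-like features per retained frame
y_train = rng.integers(0, 5, size=200)      # 5 hypothetical speakers

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)
clf.fit(X_train, y_train)
speaker_posteriors = clf.predict_proba(X_train[:3])   # softmax-style probability per speaker
```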

19 pages, 3360 KiB  
Article
ATC-SD Net: Radiotelephone Communications Speaker Diarization Network
by Weijun Pan, Yidi Wang, Yumei Zhang and Boyuan Han
Aerospace 2024, 11(7), 599; https://doi.org/10.3390/aerospace11070599 - 22 Jul 2024
Cited by 1 | Viewed by 2122
Abstract
This study addresses the challenges that high-noise environments and complex multi-speaker scenarios present in civil aviation radio communications. A novel radiotelephone communications speaker diarization network is developed specifically for these circumstances. To improve the precision of the speaker diarization network, three core modules are designed: voice activity detection (VAD), end-to-end speaker separation for air–ground communication (EESS), and probabilistic knowledge-based text clustering (PKTC). First, the VAD module uses attention mechanisms to separate silence from irrelevant noise, resulting in pure dialogue commands. Subsequently, the EESS module distinguishes between controllers and pilots by leveraging voiceprint differences, resulting in effective speaker segmentation. Finally, the PKTC module addresses the issue of pilot voiceprint ambiguity using text clustering, introducing a novel flight prior knowledge-based text-related clustering model. To achieve robust speaker diarization in multi-pilot scenarios, this model uses prior knowledge-based graph construction, radar data-based graph correction, and probabilistic optimization. This study also includes the development of the specialized ATCSPEECH dataset, which demonstrates significant performance improvements over both the AMI and ATCO2 PROJECT datasets. Full article
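A tiny sketch of the controller/pilot attribution idea only: compare each segment's speaker embedding to a controller reference embedding with cosine similarity. The embeddings and threshold are hypothetical; the paper's EESS module is considerably richer than this.

```python
# Hedged sketch of voiceprint-based controller vs. pilot attribution.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def attribute_segments(segment_embeddings, controller_embedding, thr=0.7):
    """Label each speech segment by similarity to a known controller voiceprint."""
    return ["controller" if cosine(e, controller_embedding) >= thr else "pilot"
            for e in segment_embeddings]
```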

23 pages, 4689 KiB  
Article
Orthogonalization of the Sensing Matrix Through Dominant Columns in Compressive Sensing for Speech Enhancement
by Vasundhara Shukla and Preety D. Swami
Appl. Sci. 2023, 13(15), 8954; https://doi.org/10.3390/app13158954 - 4 Aug 2023
Viewed by 1261
Abstract
This paper introduces a novel speech enhancement approach called dominant columns group orthogonalization of the sensing matrix (DCGOSM) in compressive sensing (CS). DCGOSM optimizes the sensing matrix using particle swarm optimization (PSO), ensuring separate basis vectors for speech and noise signals. By utilizing an orthogonal matching pursuit (OMP) based CS signal reconstruction with this optimized matrix, noise components are effectively avoided, resulting in lower noise in the reconstructed signal. The reconstruction process is accelerated by iterating only through the known speech-contributing columns. DCGOSM is evaluated against various noise types using speech quality measures such as SNR, SSNR, STOI, and PESQ. Compared to other OMP-based CS algorithms and deep neural network (DNN)-based speech enhancement techniques, DCGOSM demonstrates significant improvements, with maximum enhancements of 42.54%, 62.97%, 27.48%, and 8.72% for SNR, SSNR, PESQ, and STOI, respectively. Additionally, DCGOSM outperforms DNN-based techniques by 20.32% for PESQ and 8.29% for STOI. Furthermore, it reduces recovery time by at least 13.2% compared to other OMP-based CS algorithms. Full article
(This article belongs to the Special Issue Advances in Speech and Language Processing)
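For readers unfamiliar with the recovery step that DCGOSM plugs into, here is a generic orthogonal matching pursuit reconstruction of a sparse signal from compressive measurements using scikit-learn. The PSO-based optimization and orthogonalization of the sensing matrix, the paper's contribution, is not reproduced; the random matrix below is a stand-in.

```python
# Generic OMP-based compressive-sensing recovery (illustrative stand-in for the sensing matrix).
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(1)
n, m, k = 256, 96, 8                       # signal length, measurements, sparsity
Phi = rng.normal(size=(m, n))              # sensing matrix (DCGOSM would optimize this)
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.normal(size=k)   # k-sparse signal
y = Phi @ x                                # compressive measurements

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k)
omp.fit(Phi, y)
x_hat = omp.coef_                          # recovered sparse coefficients
print(np.linalg.norm(x - x_hat) / np.linalg.norm(x))   # relative reconstruction error
```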

18 pages, 3789 KiB  
Article
Efficient Pause Extraction and Encode Strategy for Alzheimer’s Disease Detection Using Only Acoustic Features from Spontaneous Speech
by Jiamin Liu, Fan Fu, Liang Li, Junxiao Yu, Dacheng Zhong, Songsheng Zhu, Yuxuan Zhou, Bin Liu and Jianqing Li
Brain Sci. 2023, 13(3), 477; https://doi.org/10.3390/brainsci13030477 - 11 Mar 2023
Cited by 13 | Viewed by 3303
Abstract
Clinical studies have shown that speech pauses can reflect the cognitive function differences between Alzheimer’s Disease (AD) and non-AD patients, while the value of pause information in AD detection has not been fully explored. Herein, we propose a speech pause feature extraction and encoding strategy for AD detection based only on the acoustic signal. First, a voice activity detection (VAD) method was constructed to detect pause/non-pause segments and encode them as binary pause sequences that are easier to compute with. Then, an ensemble machine-learning-based approach was proposed for the classification of AD from the participants’ spontaneous speech, based on the VAD Pause feature sequence and common acoustic feature sets (ComParE and eGeMAPS). The proposed pause feature sequence was verified in five machine-learning models. The validation data included two public challenge datasets (ADReSS and ADReSSo, English speech) and a local dataset (10 audio recordings containing five patients and five controls, Chinese speech). Results showed that the VAD Pause feature was more effective than the common feature sets (ComParE: 6373 features and eGeMAPS: 88 features) for AD classification, and that the ensemble method improved the accuracy by more than 5% compared to several baseline methods (8% on the ADReSS dataset; 5.9% on the ADReSSo dataset). Moreover, the pause-sequence-based AD detection method could achieve 80% accuracy on the local dataset. Our study further demonstrates the potential of pause information in speech-based AD detection and contributes a more accessible and general pause feature extraction and encoding method for AD detection. Full article
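A minimal sketch of the encoding idea, assuming a simple energy-based frame VAD: turn per-frame decisions into the binary pause sequence the abstract describes and derive a few summary statistics. The VAD, frame length, threshold, and statistics chosen here are illustrative, not the paper's.

```python
# Hedged sketch: binary pause-sequence encoding from a frame-level energy VAD.
import numpy as np

def binary_pause_sequence(signal, sr, frame_ms=25, threshold_db=-40.0):
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return (energy_db < threshold_db).astype(np.int8)     # 1 = pause, 0 = speech

def pause_statistics(seq, frame_ms=25):
    changes = np.diff(np.concatenate(([0], seq, [0])))
    starts, ends = np.where(changes == 1)[0], np.where(changes == -1)[0]
    durations = (ends - starts) * frame_ms / 1000.0
    return {"pause_count": len(durations),
            "pause_ratio": float(seq.mean()),
            "mean_pause_s": float(durations.mean()) if len(durations) else 0.0}
```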

20 pages, 4001 KiB  
Article
Ultra-Low-Power Voice Activity Detection System Using Level-Crossing Sampling
by Maral Faghani, Hamidreza Rezaee-Dehsorkh, Nassim Ravanshad and Hamed Aminzadeh
Electronics 2023, 12(4), 795; https://doi.org/10.3390/electronics12040795 - 5 Feb 2023
Cited by 10 | Viewed by 4786
Abstract
This paper presents an ultra-low-power voice activity detection (VAD) system to discriminate speech from non-speech parts of audio signals. The proposed VAD system uses level-crossing sampling for voice activity detection. The useless samples in the non-speech parts of the signal are eliminated due to the activity-dependent nature of this sampling scheme. A 40 ms moving window with a 30 ms overlap is exploited as a feature extraction block, within which the output samples of the level-crossing analog-to-digital converter (LC-ADC) are counted as the feature. The only variable used to distinguish speech and non-speech segments in the audio input signal is the number of LC-ADC output samples within a time window. The proposed system achieves an average of 91.02% speech hit rate and 82.64% non-speech hit rate over 12 noise types at −5, 0, 5, and 10 dB signal-to-noise ratios (SNR) over the TIMIT database. The proposed system including LC-ADC, feature extraction, and classification circuits was designed in 0.18 µm CMOS technology. Post-layout simulation results show a power consumption of 394.6 nW with a silicon area of 0.044 mm2, which makes it suitable as an always-on device in an automatic speech recognition system. Full article
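The single feature used by this system, the number of level-crossing samples inside a 40 ms window with 30 ms overlap, translates naturally into a short software simulation. The level grid, number of levels, and decision threshold below are illustrative; the paper's hardware LC-ADC is, of course, not modeled here.

```python
# Hedged software sketch of the level-crossing-count feature and its thresholding.
import numpy as np

def level_crossing_counts(x, sr, n_levels=16, win_ms=40, hop_ms=10):
    levels = np.linspace(x.min(), x.max(), n_levels)
    level_idx = np.digitize(x, levels)              # quantization level of each sample
    crossings = np.abs(np.diff(level_idx)) > 0      # an LC-ADC would emit a sample here
    win, hop = int(sr * win_ms / 1000), int(sr * hop_ms / 1000)
    counts = [int(crossings[s:s + win].sum())
              for s in range(0, len(crossings) - win + 1, hop)]
    return np.array(counts)

def lc_vad(x, sr, count_threshold=30):
    """True for windows whose level-crossing activity suggests speech."""
    return level_crossing_counts(x, sr) > count_threshold
```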

12 pages, 1695 KiB  
Article
Supervised Contrastive Learning for Voice Activity Detection
by Youngjun Heo and Sunggu Lee
Electronics 2023, 12(3), 705; https://doi.org/10.3390/electronics12030705 - 31 Jan 2023
Cited by 4 | Viewed by 3222
Abstract
The noise robustness of voice activity detection (VAD) tasks, which are used to identify the human speech portions of a continuous audio signal, is important for subsequent downstream applications such as keyword spotting and automatic speech recognition. Although various aspects of VAD have been recently studied by researchers, a proper training strategy for VAD has not received sufficient attention. Thus, a training strategy for VAD using supervised contrastive learning is proposed for the first time in this paper. The proposed method is used in conjunction with audio-specific data augmentation methods. The proposed supervised contrastive learning-based VAD (SCLVAD) method is trained using two common speech datasets and then evaluated using a third dataset. The experimental results show that the SCLVAD method is particularly effective in improving VAD performance in noisy environments. For clean environments, data augmentation improves VAD accuracy by 8.0 to 8.6%, but there is no improvement due to the use of supervised contrastive learning. On the other hand, for noisy environments, the SCLVAD method results in VAD accuracy improvements of 2.9% and 4.6% for “speech with noise” and “speech with music”, respectively, with only a negligible increase in processing overhead during training. Full article
(This article belongs to the Special Issue Feature Papers in Circuit and Signal Processing)
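For context, the training objective referred to here is the supervised contrastive loss of Khosla et al., which pulls together embeddings sharing a label (speech or non-speech) and pushes apart the rest. The sketch below is a simplified version of that loss over frame embeddings, not the SCLVAD recipe or its augmentation setup.

```python
# Hedged sketch of a supervised contrastive loss over speech / non-speech frame embeddings.
import torch
import torch.nn.functional as F

def supcon_loss(embeddings, labels, temperature=0.1):
    """embeddings: (N, D) frame embeddings; labels: (N,) 0/1 speech labels."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / temperature                               # (N, N) scaled cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    sim = sim.masked_fill(self_mask, float("-inf"))           # never contrast a sample with itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    per_anchor = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts
    return per_anchor.mean()

loss = supcon_loss(torch.randn(16, 128), torch.randint(0, 2, (16,)))
```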

21 pages, 32597 KiB  
Article
Using Voice Activity Detection and Deep Neural Networks with Hybrid Speech Feature Extraction for Deceptive Speech Detection
by Serban Mihalache and Dragos Burileanu
Sensors 2022, 22(3), 1228; https://doi.org/10.3390/s22031228 - 6 Feb 2022
Cited by 20 | Viewed by 7780
Abstract
In this work, we first propose a deep neural network (DNN) system for the automatic detection of speech in audio signals, otherwise known as voice activity detection (VAD). Several DNN types were investigated, including multilayer perceptrons (MLPs), recurrent neural networks (RNNs), and convolutional neural networks (CNNs), with the best performance being obtained for the latter. Additional postprocessing techniques, i.e., hysteretic thresholding, minimum duration filtering, and bilateral extension, were employed in order to boost performance. The systems were trained and tested using several data subsets of the CENSREC-1-C database, with different simulated ambient noise conditions, and additional testing was performed on a different CENSREC-1-C data subset containing actual ambient noise, as well as on a subset of the TIMIT database. An accuracy of up to 99.13% was obtained for the CENSREC-1-C datasets, and 97.60% for the TIMIT dataset. We proceed to show how the final VAD system can be adapted and employed within an utterance-level deceptive speech detection (DSD) processing pipeline. The best DSD performance is achieved by a novel hybrid CNN-MLP network leveraging a fusion of algorithmically and automatically extracted speech features, and reaches an unweighted accuracy (UA) of 63.7% on the RLDD database, and 62.4% on the RODeCAR database. Full article
(This article belongs to the Special Issue Advances in Deep Learning for Intelligent Sensing Systems)
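Two of the postprocessing steps named in the abstract, hysteretic thresholding and minimum-duration filtering, are simple enough to sketch on per-frame speech posteriors. The thresholds and minimum duration below are illustrative placeholders, not the paper's tuned values.

```python
# Hedged sketch of hysteresis thresholding and minimum-duration filtering of VAD posteriors.
import numpy as np

def hysteresis(posteriors, high=0.7, low=0.3):
    """Enter the speech state above `high`, leave it only below `low`."""
    state, out = 0, np.zeros(len(posteriors), dtype=np.int8)
    for i, p in enumerate(posteriors):
        if state == 0 and p >= high:
            state = 1
        elif state == 1 and p <= low:
            state = 0
        out[i] = state
    return out

def min_duration_filter(decisions, min_frames=3):
    """Remove speech runs shorter than `min_frames`."""
    out = decisions.copy()
    i = 0
    while i < len(out):
        j = i
        while j < len(out) and out[j] == out[i]:
            j += 1
        if out[i] == 1 and (j - i) < min_frames:
            out[i:j] = 0
        i = j
    return out

post = np.array([0.1, 0.8, 0.9, 0.2, 0.75, 0.8, 0.85, 0.9, 0.1])
print(min_duration_filter(hysteresis(post)))   # -> [0 0 0 0 1 1 1 1 0]
```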

21 pages, 1689 KiB  
Article
Object Localization and Tracking System Using Multiple Ultrasonic Sensors with Newton–Raphson Optimization and Kalman Filtering Techniques
by Chung-Wei Juan and Jwu-Sheng Hu
Appl. Sci. 2021, 11(23), 11243; https://doi.org/10.3390/app112311243 - 26 Nov 2021
Cited by 12 | Viewed by 4152
Abstract
In this paper, an object localization and tracking system is implemented with an ultrasonic sensing technique and improved algorithms. The system is composed of one ultrasonic transmitter and five receivers, and uses the principle of ultrasonic ranging measurement to locate the target object. The system locates and tracks the target object in several stages. First, a simple voice activity detection (VAD) algorithm is used to detect the ultrasonic echo signal of each receiving channel, and then a demodulation method with a low-pass filter is used to extract the signal envelope. The time-of-flight (TOF) estimation algorithm is then applied to the signal envelope for range measurement. Due to variations in the position, direction, material, and size of the detected object, and to signal attenuation during ultrasonic propagation, the shape of the echo waveform is easily distorted, and TOF estimation is often inaccurate and unstable. To improve the accuracy and stability of TOF estimation, a new method is proposed that fits the general (GN) model and the double exponential (DE) model to the suitable envelope region using Newton–Raphson (NR) optimization with Levenberg–Marquardt (LM) modification (NRLM). The final stage is object localization and tracking. An extended Kalman filter (EKF) is designed, which inherently considers the interference and outlier problems of range measurement and effectively reduces the interference to target localization under critical measurement conditions. The performance of the proposed system is evaluated experimentally under conditions such as stationary pen localization, stationary finger localization, and moving finger tracking. The experimental results verify the performance of the system and show that the system has a considerable degree of accuracy and stability for object localization and tracking. Full article
(This article belongs to the Special Issue Ultrasound Technology in Industry and Medicine)
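The model-fitting idea can be illustrated with a least-squares fit of a double-exponential envelope to the echo region, using SciPy's Levenberg–Marquardt solver and reading the onset parameter as a TOF estimate. The exact model parameterization, the GN model, and the paper's NRLM scheme are not reproduced; the envelope shape, values, and initial guess below are assumptions for illustration.

```python
# Hedged sketch: LM fit of a double-exponential echo envelope; the onset parameter approximates TOF.
import numpy as np
from scipy.optimize import curve_fit

def double_exp_envelope(t, t0, a, tau_rise, tau_decay):
    dt = np.clip(t - t0, 0.0, None)                      # zero before the echo onset
    return a * (np.exp(-dt / tau_decay) - np.exp(-dt / tau_rise))

t = np.linspace(0, 4e-3, 800)                            # 4 ms acquisition window
true = double_exp_envelope(t, 1.2e-3, 1.0, 4e-5, 4e-4)
noisy = true + 0.02 * np.random.default_rng(2).normal(size=t.size)

p0 = [1.0e-3, 0.8, 5e-5, 5e-4]                           # rough initial guess near the echo
popt, _ = curve_fit(double_exp_envelope, t, noisy, p0=p0, method="lm")
tof_estimate = popt[0]                                   # seconds
```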

15 pages, 17508 KiB  
Article
Detecting Apnea/Hypopnea Events Time Location from Sound Recordings for Patients with Severe or Moderate Sleep Apnea Syndrome
by Georgia Korompili, Lampros Kokkalas, Stelios A. Mitilineos, Nicolas-Alexander Tatlas and Stelios M. Potirakis
Appl. Sci. 2021, 11(15), 6888; https://doi.org/10.3390/app11156888 - 27 Jul 2021
Cited by 7 | Viewed by 3956
Abstract
The most common index for diagnosing Sleep Apnea Syndrome (SAS) is the Apnea-Hypopnea Index (AHI), defined as the average count of apnea/hypopnea events per sleeping hour. Although the AHI is broadly used in automated systems for SAS severity estimation, researchers now focus on detecting the time of individual events rather than merely classifying patients into SAS severity groups. In this work, we therefore aim to detect the exact time location of apnea/hypopnea events. We particularly examine the hypothesis of employing a standard Voice Activity Detection (VAD) algorithm to extract breathing segments during sleep and identify the respiratory events from the severely altered breathing amplitude within each event. The algorithm, which is tested only on severe and moderate patients, is applied to recordings from a tracheal and an ambient microphone. It achieves good sensitivity for apneas, reaching 81% and 70.4% for the two microphones, respectively, and moderate sensitivity for hypopneas, of which approximately 50% were identified. The algorithm also provides an adequate estimate of the Mean Apnea Duration index, defined as the average duration of the detected events, for patients with severe or moderate apnea, with mean errors of 1.7 s and 3.2 s for the two microphones, respectively. Full article
(This article belongs to the Section Electrical, Electronics and Communications Engineering)
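As a small companion to the reported Mean Apnea Duration results, the sketch below shows how the index and a per-event overlap check can be computed from detected event intervals. The event extraction from the VAD output itself is not reproduced; interval values and the overlap criterion are illustrative.

```python
# Hedged sketch of event-level scoring: Mean Apnea Duration and an overlap check.
import numpy as np

def mean_event_duration(events):
    """events: list of (start_s, end_s) detected apnea/hypopnea intervals."""
    durations = [end - start for start, end in events]
    return float(np.mean(durations)) if durations else 0.0

def overlaps(event, annotation, min_overlap_s=1.0):
    """True if the detected event overlaps an annotated one by at least `min_overlap_s`."""
    s, e = max(event[0], annotation[0]), min(event[1], annotation[1])
    return (e - s) >= min_overlap_s

detected = [(120.0, 138.5), (410.2, 422.0)]
print(mean_event_duration(detected))        # -> 15.15 (seconds)
```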

15 pages, 21420 KiB  
Article
A Robust Dual-Microphone Generalized Sidelobe Canceller Using a Bone-Conduction Sensor for Speech Enhancement
by Yi Zhou, Haiping Wang, Yijing Chu and Hongqing Liu
Sensors 2021, 21(5), 1878; https://doi.org/10.3390/s21051878 - 8 Mar 2021
Cited by 5 | Viewed by 3262
Abstract
The use of multiple spatially distributed microphones allows performing spatial filtering along with conventional temporal filtering, which can better reject the interference signals, leading to an overall improvement of the speech quality. In this paper, we propose a novel dual-microphone generalized sidelobe canceller (GSC) algorithm assisted by a bone-conduction (BC) sensor for speech enhancement, which is named BC-assisted GSC (BCA-GSC) algorithm. The BC sensor is relatively insensitive to the ambient noise compared to the conventional air-conduction (AC) microphone. Hence, BC speech can be analyzed to generate very accurate voice activity detection (VAD), even in a high noise environment. The proposed algorithm incorporates the VAD information obtained by the BC speech into the adaptive blocking matrix (ABM) and adaptive noise canceller (ANC) in GSC. By using VAD to control ABM and combining VAD with signal-to-interference ratio (SIR) to control ANC, the proposed method could suppress interferences and improve the overall performance of GSC significantly. It is verified by experiments that the proposed GSC system not only improves speech quality remarkably but also boosts speech intelligibility. Full article
(This article belongs to the Section Sensor Networks)
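The control idea, freezing adaptation while the bone-conduction-derived VAD flags target speech so that the adaptive stages model only noise, can be sketched with a normalized LMS noise canceller. This is a hedged illustration of the gating principle only; the full GSC structure (fixed beamformer, ABM, and SIR-based ANC control) is not reproduced, and all parameters are placeholders.

```python
# Hedged sketch: VAD-gated NLMS adaptive noise cancellation.
import numpy as np

def vad_gated_nlms(primary, reference, vad, L=64, mu=0.1, eps=1e-8):
    """primary: mic signal (speech + noise); reference: noise reference signal;
    vad: per-sample 0/1 flags from the bone-conduction sensor (1 = target speech)."""
    w = np.zeros(L)
    out = np.zeros_like(primary)
    for n in range(L, len(primary)):
        x = reference[n - L:n][::-1]
        y = w @ x                                   # noise estimate
        e = primary[n] - y                          # enhanced output sample
        out[n] = e
        if vad[n] == 0:                             # adapt only during non-speech
            w += mu * e * x / (x @ x + eps)
        # during target speech the weights are frozen, protecting the desired signal
    return out
```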

22 pages, 1420 KiB  
Article
Speech Processing for Language Learning: A Practical Approach to Computer-Assisted Pronunciation Teaching
by Natalia Bogach, Elena Boitsova, Sergey Chernonog, Anton Lamtev, Maria Lesnichaya, Iurii Lezhenin, Andrey Novopashenny, Roman Svechnikov, Daria Tsikach, Konstantin Vasiliev, Evgeny Pyshkin and John Blake
Electronics 2021, 10(3), 235; https://doi.org/10.3390/electronics10030235 - 20 Jan 2021
Cited by 39 | Viewed by 7145
Abstract
This article contributes to the discourse on how contemporary computer and information technology may help improve foreign language learning, not only by supporting a better and more flexible workflow and digitizing study materials, but also by creating completely new use cases made possible by technological improvements in signal processing algorithms. We discuss an approach and propose a holistic solution for teaching the phonological phenomena that are crucial for correct pronunciation, such as phonemes; the energy and duration of syllables and pauses, which construct the phrasal rhythm; and the tone movement within an utterance, i.e., the phrasal intonation. The working prototype of the StudyIntonation Computer-Assisted Pronunciation Training (CAPT) system is a tool for mobile devices, which offers a set of tasks based on a “listen and repeat” approach and gives audio-visual feedback in real time. The present work summarizes the efforts taken to enrich the current version of this CAPT tool with two new functions: the phonetic transcription and rhythmic patterns of model and learner speech. Both are built on the third-party automatic speech recognition (ASR) library Kaldi, which was incorporated into the StudyIntonation signal processing core. We also examine the scope of automatic speech recognition applicability within the CAPT system workflow and evaluate the Levenshtein distance between the transcription made by human experts and the one obtained automatically by our code. We developed a rhythm reconstruction algorithm using acoustic and language ASR models. It is also shown that even when phonemes are produced sufficiently correctly, learners often do not produce correct phrasal rhythm and intonation; therefore, the joint training of sounds, rhythm, and intonation within a single learning environment is beneficial. To mitigate recording imperfections, voice activity detection (VAD) is applied to all processed speech recordings. The try-outs showed that StudyIntonation can create transcriptions and process rhythmic patterns, but some specific problems with connected speech transcription were detected. The learner feedback for pronunciation assessment was also updated: a conventional mechanism based on dynamic time warping (DTW) was combined with a cross-recurrence quantification analysis (CRQA) approach, which resulted in better discriminating ability. The CRQA metrics combined with those of DTW were shown to add to the accuracy of learner performance estimation. The major implications for computer-assisted English pronunciation teaching are discussed. Full article
(This article belongs to the Special Issue Recent Advances in Multimedia Signal Processing and Communications)
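One ingredient of the assessment mechanism mentioned above, dynamic time warping between a model and a learner contour, is easy to show in a few lines. The sketch below is a plain textbook DTW over pitch values with hypothetical contours; the CRQA component and the system's actual feature pipeline are not reproduced.

```python
# Hedged sketch: DTW distance between a model and a learner pitch contour.
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

model_f0 = np.array([120, 130, 150, 170, 160, 140], dtype=float)
learner_f0 = np.array([118, 125, 128, 155, 172, 158, 141], dtype=float)
print(dtw_distance(model_f0, learner_f0))   # lower distance = closer to the model contour
```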

17 pages, 4542 KiB  
Article
A Real-Time Dual-Microphone Speech Enhancement Algorithm Assisted by Bone Conduction Sensor
by Yi Zhou, Yufan Chen, Yongbao Ma and Hongqing Liu
Sensors 2020, 20(18), 5050; https://doi.org/10.3390/s20185050 - 5 Sep 2020
Cited by 20 | Viewed by 6490
Abstract
The quality and intelligibility of the speech are usually impaired by the interference of background noise when using internet voice calls. To solve this problem in the context of wearable smart devices, this paper introduces a dual-microphone, bone-conduction (BC) sensor assisted beamformer and a simple recurrent unit (SRU)-based neural network postfilter for real-time speech enhancement. Assisted by the BC sensor, which is insensitive to the environmental noise compared to the regular air-conduction (AC) microphone, the accurate voice activity detection (VAD) can be obtained from the BC signal and incorporated into the adaptive noise canceller (ANC) and adaptive block matrix (ABM). The SRU-based postfilter consists of a recurrent neural network with a small number of parameters, which improves the computational efficiency. The sub-band signal processing is designed to compress the input features of the neural network, and the scale-invariant signal-to-distortion ratio (SI-SDR) is developed as the loss function to minimize the distortion of the desired speech signal. Experimental results demonstrate that the proposed real-time speech enhancement system provides significant speech sound quality and intelligibility improvements for all noise types and levels when compared with the AC-only beamformer with a postfiltering algorithm. Full article
(This article belongs to the Special Issue Signal Processing and Machine Learning for Smart Sensing Applications)
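The SI-SDR training objective named in the abstract has a compact standard form: project the estimate onto the target, measure the residual, and negate the resulting ratio in dB for minimization. The sketch below follows the common zero-mean definition, which may differ in detail from the paper's implementation.

```python
# Hedged sketch of a scale-invariant SDR loss for waveform-level training.
import torch

def si_sdr_loss(estimate, target, eps=1e-8):
    """estimate, target: (batch, time) waveforms; returns negative SI-SDR for minimization."""
    estimate = estimate - estimate.mean(dim=-1, keepdim=True)
    target = target - target.mean(dim=-1, keepdim=True)
    # project the estimate onto the target to get the optimally scaled reference
    scale = (estimate * target).sum(-1, keepdim=True) / (target.pow(2).sum(-1, keepdim=True) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    si_sdr = 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)
    return -si_sdr.mean()

loss = si_sdr_loss(torch.randn(4, 16000), torch.randn(4, 16000))
```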
