Search Results (165)

Search Parameters:
Keywords = audio speech training

26 pages, 712 KB  
Article
Comparing Multi-Scale and Pipeline Models for Speaker Change Detection
by Alymzhan Toleu, Gulmira Tolegen and Bagashar Zhumazhanov
Acoustics 2026, 8(1), 5; https://doi.org/10.3390/acoustics8010005 (registering DOI) - 25 Jan 2026
Abstract
Speaker change detection (SCD) in long, multi-party meetings is essential for diarization, automatic speech recognition (ASR), and summarization, and is now often performed in the space of pre-trained speech embeddings. However, unsupervised approaches remain dominant when timely labeled audio is scarce, and their behavior under a unified modeling setup is still not well understood. In this paper, we systematically compare two representative unsupervised approaches on the multi-talker audio meeting corpus: (i) a clustering-based pipeline that segments and clusters embeddings/features and scores boundaries via cluster changes and jump magnitude, and (ii) a multi-scale jump-based detector that measures embedding discontinuities at several window lengths and fuses them via temporal clustering and voting. Using a shared front-end and protocol, we vary the underlying features (ECAPA, WavLM, wav2vec 2.0, MFCC, and log-Mel) and test each method’s robustness under additive noise. The results show that embedding choice is crucial and that the two methods offer complementary trade-offs: the pipeline yields low false alarm rates but higher misses, while the multi-scale detector achieves relatively high recall at the cost of many false alarms. Full article
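
For illustration, a minimal numpy sketch of the multi-scale jump idea described above, assuming frame-level speaker embeddings have already been extracted; the window sizes, threshold, and voting rule are invented for the example and are not taken from the paper.

```python
# Minimal sketch of a multi-scale "jump" detector over pre-computed speaker
# embeddings (e.g., one ECAPA/WavLM vector per frame). Illustrative only.
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def jump_scores(embeddings, window):
    """Discontinuity between the mean embedding left and right of each frame."""
    scores = np.zeros(len(embeddings))
    for t in range(window, len(embeddings) - window):
        left = embeddings[t - window:t].mean(axis=0)
        right = embeddings[t:t + window].mean(axis=0)
        scores[t] = cosine_distance(left, right)
    return scores

def detect_changes(embeddings, window_sizes=(10, 25, 50), threshold=0.35):
    # Fuse scales by voting: a frame is a candidate boundary if a majority of
    # scales exceed the threshold there.
    votes = sum((jump_scores(embeddings, w) > threshold).astype(int)
                for w in window_sizes)
    return np.where(votes >= (len(window_sizes) // 2 + 1))[0]

# Toy example: 1000 frames of 192-dim embeddings with a speaker change at frame 500.
rng = np.random.default_rng(0)
mu1, mu2 = rng.normal(size=192), rng.normal(size=192)
emb = np.vstack([mu1 + 0.1 * rng.normal(size=(500, 192)),
                 mu2 + 0.1 * rng.normal(size=(500, 192))])
print(detect_changes(emb))   # indices clustered around frame 500
```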

30 pages, 6201 KB  
Article
AFAD-MSA: Dataset and Models for Arabic Fake Audio Detection
by Elsayed Issa
Computation 2026, 14(1), 20; https://doi.org/10.3390/computation14010020 - 14 Jan 2026
Viewed by 170
Abstract
As generative speech synthesis produces near-human synthetic voices and reliance on online media grows, robust audio-deepfake detection is essential to fight misuse and misinformation. In this study, we introduce the Arabic Fake Audio Dataset for Modern Standard Arabic (AFAD-MSA), a curated corpus of authentic and synthetic Arabic speech designed to advance research on Arabic deepfake and spoofed-speech detection. The synthetic subset is generated with four state-of-the-art proprietary text-to-speech and voice-conversion models. Rich metadata—covering speaker attributes and generation information—is provided to support reproducibility and benchmarking. To establish reference performance, we trained three AASIST models and compared their performance to two baseline transformer detectors (Wav2Vec 2.0 and Whisper). On the AFAD-MSA test split, AASIST-2 achieved perfect accuracy, surpassing the baseline models. However, its performance declined under cross-dataset evaluation. These results underscore the importance of data construction. Detectors generalize best when exposed to diverse attack types. In addition, continual or contrastive training that interleaves bona fide speech with large, heterogeneous spoofed corpora will further improve detectors’ robustness. Full article

14 pages, 1392 KB  
Article
AirSpeech: Lightweight Speech Synthesis Framework for Home Intelligent Space Service Robots
by Xiugong Qin, Fenghu Pan, Jing Gao, Shilong Huang, Yichen Sun and Xiao Zhong
Electronics 2026, 15(1), 239; https://doi.org/10.3390/electronics15010239 - 5 Jan 2026
Viewed by 267
Abstract
Text-to-Speech (TTS) methods typically employ a sequential approach with an Acoustic Model (AM) and a vocoder, using a Mel spectrogram as an intermediate representation. However, in home environments, TTS systems often struggle with issues such as inadequate robustness against environmental noise and limited adaptability to diverse speaker characteristics. The quality of the Mel spectrogram directly affects the performance of TTS systems, yet existing methods overlook the potential of enhancing Mel spectrogram quality through more comprehensive speech features. To address the complex acoustic characteristics of home environments, this paper introduces AirSpeech, a post-processing model for Mel-spectrogram synthesis. We adopt a Generative Adversarial Network (GAN) to improve the accuracy of Mel spectrogram prediction and enhance the expressiveness of synthesized speech. By incorporating additional conditioning extracted from synthesized audio using specified speech feature parameters, our method significantly enhances the expressiveness and emotional adaptability of synthesized speech in home environments. Furthermore, we propose a global normalization strategy to stabilize the GAN training process. Through extensive evaluations, we demonstrate that the proposed method significantly improves the signal quality and naturalness of synthesized speech, providing a more user-friendly speech interaction solution for smart home applications. Full article
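
The abstract describes a GAN-based Mel-spectrogram post-processing model with extra conditioning. The sketch below is not AirSpeech itself; it only illustrates the general pattern of a residual generator refining a coarse mel under an adversarial plus reconstruction loss, with every layer size, loss weight, and conditioning dimension chosen arbitrarily.

```python
# Illustrative GAN-style Mel-spectrogram post-processing loop (not AirSpeech itself).
import torch
import torch.nn as nn

class PostNet(nn.Module):            # generator: refines a coarse mel, given conditioning
    def __init__(self, n_mels=80, cond_dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels + cond_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, n_mels, kernel_size=5, padding=2),
        )

    def forward(self, mel, cond):
        # mel: (B, n_mels, T); cond: (B, cond_dim) broadcast over time
        cond = cond.unsqueeze(-1).expand(-1, -1, mel.size(-1))
        return mel + self.net(torch.cat([mel, cond], dim=1))   # residual refinement

class Critic(nn.Module):             # discriminator over mel frames
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 128, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 128, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, 3, padding=1),
        )

    def forward(self, mel):
        return self.net(mel)

G, D = PostNet(), Critic()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
l1, bce = nn.L1Loss(), nn.BCEWithLogitsLoss()

coarse = torch.randn(4, 80, 200)     # acoustic-model output (dummy data)
target = torch.randn(4, 80, 200)     # ground-truth mel (dummy data)
cond = torch.randn(4, 16)            # speech-feature conditioning (dummy data)

# Discriminator step
fake = G(coarse, cond).detach()
d_loss = bce(D(target), torch.ones_like(D(target))) + bce(D(fake), torch.zeros_like(D(fake)))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: adversarial + reconstruction
fake = G(coarse, cond)
g_loss = bce(D(fake), torch.ones_like(D(fake))) + 10.0 * l1(fake, target)
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```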

36 pages, 1309 KB  
Article
Listen Closely: Self-Supervised Phoneme Tracking for Children’s Reading Assessment
by Philipp Ollmann, Erik Sonnleitner, Marc Kurz, Jens Krösche and Stephan Selinger
Information 2026, 17(1), 40; https://doi.org/10.3390/info17010040 - 4 Jan 2026
Viewed by 330
Abstract
Reading proficiency in early childhood is crucial for academic success and intellectual development. However, more and more children are struggling with reading. According to the last PISA study in Austria, one out of five children is dealing with reading difficulties. The reasons for this are diverse, but an application that tracks children while reading aloud and guides them when they experience difficulties could offer meaningful help. Therefore, this work explores a prototype for a core component that tracks children’s reading using a self-supervised Wav2Vec2 model with a limited amount of data. Self-supervised learning allows models to learn general representations from large amounts of unlabeled audio, which can then be fine-tuned on smaller, task-specific datasets, making it especially useful when labeled data is limited. Our model operates at the phonetic level with the help of the International Phonetic Alphabet (IPA). To implement this, the KidsTALC dataset from Leibniz University Hannover was used, which contains spontaneous speech recordings of German-speaking children. To enhance the training data and improve robustness, several data augmentation techniques were applied and evaluated, including pitch shifting, formant shifting, and speed variation. The models were trained using different data configurations to compare the effects of data variety and quality on recognition performance. The best model trained in this work achieved a phoneme error rate (PER) of 14.3% and a word error rate (WER) of 31.6% on unseen child speech data, demonstrating the potential of self-supervised models for such use cases. Full article
(This article belongs to the Special Issue AI Technology-Enhanced Learning and Teaching)
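
The reported phoneme error rate is the Levenshtein edit distance between predicted and reference phoneme sequences, normalized by the reference length. A self-contained computation (the IPA example transcription is illustrative):

```python
# Phoneme error rate: edit distance over phoneme sequences / reference length.
def edit_distance(ref, hyp):
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[-1][-1]

def per(ref_phones, hyp_phones):
    return edit_distance(ref_phones, hyp_phones) / max(len(ref_phones), 1)

# Example with IPA phonemes for a short German word (illustrative transcription).
reference = ["h", "a", "ʊ", "s"]          # "Haus"
hypothesis = ["h", "a", "ʊ", "z"]         # one substitution
print(f"PER = {per(reference, hypothesis):.2f}")   # 0.25
```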

16 pages, 1427 KB  
Article
Acoustic Vector Sensor–Based Speaker Diarization Using Sound Intensity Analysis for Two-Speaker Dialogues
by Grzegorz Szwoch, Józef Kotus and Szymon Zaporowski
Appl. Sci. 2025, 15(23), 12780; https://doi.org/10.3390/app152312780 - 3 Dec 2025
Viewed by 2126
Abstract
Speaker diarization is a key component of automatic speech recognition (ASR) systems, particularly in interview scenarios where speech segments must be assigned to individual speakers. This study presents a diarization algorithm based on sound intensity analysis using an Acoustic Vector Sensor (AVS). The algorithm determines the azimuth of each speaker, defines directional beams, and detects speaker activity by analyzing intensity distributions within each beam, enabling identification of both single and overlapping speech segments. A dedicated dataset of interview recordings involving five speakers was created for evaluation. Performance was assessed using the Diarization Error Rate (DER) metric and compared with the state-of-the-art Pyannote.audio system. The proposed AVS-based method achieved a lower DER (0.112) than Pyannote (0.213) when overlapping speech was excluded, and a DER of 0.187 with overlapping speech included, demonstrating improved diarization accuracy and better handling of overlapping speech. The algorithm does not require training, operates independently of speaker-specific features, and can be adapted to various acoustic conditions. The results confirm that AVS-based diarization provides a robust and interpretable alternative to neural approaches, particularly suitable for structured two-speaker dialogues such as physician–patient or interviewer–interviewee scenarios. Full article
(This article belongs to the Special Issue Advances in Audio Signal Processing)
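
An acoustic vector sensor measures sound pressure together with particle velocity, so a per-frame intensity vector directly yields a source azimuth. A rough sketch of that idea, not the authors' algorithm: it assumes a 3-channel recording (pressure `p`, velocity components `vx`, `vy`) and hypothetical beam centers at ±30 degrees.

```python
# Sketch: frame-wise azimuth from acoustic intensity, then nearest-beam speaker labels.
import numpy as np

def frame_azimuths(p, vx, vy, frame_len=1024, hop=512):
    """Azimuth per frame from time-averaged intensity components I = <p * v>."""
    azimuths = []
    for start in range(0, len(p) - frame_len, hop):
        sl = slice(start, start + frame_len)
        ix = np.mean(p[sl] * vx[sl])          # x component of active intensity
        iy = np.mean(p[sl] * vy[sl])          # y component
        azimuths.append(np.degrees(np.arctan2(iy, ix)))
    return np.array(azimuths)

def assign_speakers(azimuths, beam_centers=(-30.0, 30.0), beam_width=20.0):
    """Label each frame with the closest beam, or -1 if outside every beam."""
    labels = np.full(len(azimuths), -1)
    for k, center in enumerate(beam_centers):
        inside = np.abs(azimuths - center) <= beam_width / 2
        labels[inside] = k
    return labels

# Dummy plane wave arriving from 30 degrees: every frame should land in beam 1.
t = np.linspace(0, 1, 16000)
s = np.sin(2 * np.pi * 440 * t)
theta = np.radians(30)
az = frame_azimuths(s, np.cos(theta) * s, np.sin(theta) * s)
print(assign_speakers(az)[:5])    # [1 1 1 1 1]
```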

37 pages, 16007 KB  
Review
Speech Separation Using Advanced Deep Neural Network Methods: A Recent Survey
by Zeng Wang and Zhongqiang Luo
Big Data Cogn. Comput. 2025, 9(11), 289; https://doi.org/10.3390/bdcc9110289 - 14 Nov 2025
Viewed by 3217
Abstract
Speech separation, as an important research direction in audio signal processing, has been widely studied by the academic community since its emergence in the mid-1990s. In recent years, with the rapid development of deep neural network technology, speech processing based on deep neural networks has shown outstanding performance in speech separation. While existing studies have surveyed the application of deep neural networks in speech separation from multiple dimensions including learning paradigms, model architectures, loss functions, and training strategies, the literature still lacks a systematic account of the field’s developmental trajectory. To address this, this paper focuses on single-channel supervised speech separation tasks, proposing the technological evolution path “U-Net–TasNet–Transformer–Mamba” as the main thread to systematically analyze the impact of core architectural designs on separation performance across different stages. By reviewing the transition from traditional methods to deep learning paradigms and examining the improvements and integration of deep learning architectures at each stage, this paper summarizes milestone achievements, mainstream evaluation frameworks, and typical datasets in the field, while also outlining future research directions. Through this focused review, we aim to provide researchers in the speech separation field with a clearly articulated map of the technical evolution and a practical reference. Full article
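
Of the surveyed architectures, the TasNet family replaces STFT front-ends with a learned encoder, mask estimator, and decoder operating directly on the waveform. A heavily simplified two-speaker skeleton of that pattern (all dimensions arbitrary, not any specific published model):

```python
# Skeleton of a learned encoder / mask estimator / decoder separator (TasNet-style).
import torch
import torch.nn as nn

class TinySeparator(nn.Module):
    def __init__(self, n_speakers=2, n_filters=256, kernel=16, stride=8):
        super().__init__()
        self.n_speakers = n_speakers
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)
        self.mask_net = nn.Sequential(                      # stand-in for the real separator stack
            nn.Conv1d(n_filters, n_filters, 3, padding=1), nn.ReLU(),
            nn.Conv1d(n_filters, n_filters * n_speakers, 1),
        )
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride, bias=False)

    def forward(self, mixture):                             # mixture: (B, 1, T)
        feats = torch.relu(self.encoder(mixture))           # (B, F, T')
        masks = torch.sigmoid(self.mask_net(feats))         # (B, F * S, T')
        masks = masks.view(mixture.size(0), self.n_speakers, -1, feats.size(-1))
        # Apply each speaker's mask and decode back to waveforms.
        sources = [self.decoder(feats * masks[:, s]) for s in range(self.n_speakers)]
        return torch.stack(sources, dim=1)                  # (B, S, 1, T)

model = TinySeparator()
est = model(torch.randn(2, 1, 16000))
print(est.shape)   # torch.Size([2, 2, 1, 16000])
```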

18 pages, 3175 KB  
Article
AudioFakeNet: A Model for Reliable Speaker Verification in Deepfake Audio
by Samia Dilbar, Muhammad Ali Qureshi, Serosh Karim Noon and Abdul Mannan
Algorithms 2025, 18(11), 716; https://doi.org/10.3390/a18110716 - 13 Nov 2025
Viewed by 1044
Abstract
Deepfake audio refers to voice recordings generated by deep neural networks that replicate a specific individual’s voice, often for deceptive or fraudulent purposes. Although this has been an area of research for quite some time, deepfakes still pose substantial challenges for reliable speaker authentication. To address the issue, we propose AudioFakeNet, a hybrid deep learning architecture that combines Convolutional Neural Networks (CNNs) with Long Short-Term Memory (LSTM) units and Multi-Head Attention (MHA) mechanisms for robust deepfake detection. The CNN extracts spatial and spectral features, the LSTM captures temporal dependencies, and MHA helps the model focus on informative audio segments. The model is trained using Mel-Frequency Cepstral Coefficients (MFCCs) from a publicly available dataset and validated on a self-collected dataset, ensuring reproducibility. Performance comparisons with state-of-the-art machine learning and deep learning models show that the proposed AudioFakeNet achieves higher accuracy, better generalization, and a lower Equal Error Rate (EER). Its modular design allows for broader adaptability in fake-audio detection tasks, offering significant potential across diverse speech synthesis applications. Full article
(This article belongs to the Section Algorithms for Multidisciplinary Applications)
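
As a rough illustration of the described CNN, LSTM, and multi-head-attention pattern over MFCC input (not the authors' implementation; all layer sizes are invented):

```python
# Sketch of a CNN + BiLSTM + multi-head-attention classifier over MFCC frames.
import torch
import torch.nn as nn

class FakeAudioClassifier(nn.Module):
    def __init__(self, n_mfcc=40, hidden=128, heads=4, n_classes=2):
        super().__init__()
        self.cnn = nn.Sequential(                       # spatial/spectral feature extraction
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        cnn_out = 64 * (n_mfcc // 4)                    # channels * reduced MFCC bins
        self.lstm = nn.LSTM(cnn_out, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.head = nn.Sequential(nn.Dropout(0.3), nn.Linear(2 * hidden, n_classes))

    def forward(self, mfcc):                            # mfcc: (B, n_mfcc, T)
        x = self.cnn(mfcc.unsqueeze(1))                 # (B, 64, n_mfcc/4, T)
        x = x.flatten(1, 2).transpose(1, 2)             # (B, T, 64 * n_mfcc/4)
        x, _ = self.lstm(x)                             # temporal dependencies
        x, _ = self.attn(x, x, x)                       # focus on informative segments
        return self.head(x.mean(dim=1))                 # utterance-level logits

logits = FakeAudioClassifier()(torch.randn(8, 40, 300))  # batch of 8 MFCC matrices
print(logits.shape)                                      # torch.Size([8, 2])
```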

38 pages, 2282 KB  
Article
Cross-Lingual Bimodal Emotion Recognition with LLM-Based Label Smoothing
by Elena Ryumina, Alexandr Axyonov, Timur Abdulkadirov, Darya Koryakovskaya and Dmitry Ryumin
Big Data Cogn. Comput. 2025, 9(11), 285; https://doi.org/10.3390/bdcc9110285 - 12 Nov 2025
Viewed by 1949
Abstract
Bimodal emotion recognition based on audio and text is widely adopted in video-constrained real-world applications such as call centers and voice assistants. However, existing systems suffer from limited cross-domain generalization and monolingual bias. To address these limitations, a cross-lingual bimodal emotion recognition method is proposed, integrating Mamba-based temporal encoders for audio (Wav2Vec2.0) and text (Jina-v3) with a Transformer-based cross-modal fusion architecture (BiFormer). Three corpus-adaptive augmentation strategies are introduced: (1) Stacked Data Sampling, in which short utterances are concatenated to stabilize sequence length; (2) Label Smoothing Generation based on Large Language Model, where the Qwen3-4B model is prompted to detect subtle emotional cues missed by annotators, producing soft labels that reflect latent emotional co-occurrences; and (3) Text-to-Utterance Generation, in which emotionally labeled utterances are generated by ChatGPT-5 and synthesized into speech using the DIA-TTS model, enabling controlled creation of affective audio–text pairs without human annotation. BiFormer is trained jointly on the English Multimodal EmotionLines Dataset and the Russian Emotional Speech Dialogs corpus, enabling cross-lingual transfer without parallel data. Experimental results show that the optimal data augmentation strategy is corpus-dependent: Stacked Data Sampling achieves the best performance on short, noisy English utterances, while Label Smoothing Generation based on Large Language Model better captures nuanced emotional expressions in longer Russian utterances. Text-to-Utterance Generation does not yield a measurable gain due to current limitations in expressive speech synthesis. When combined, the two best performing strategies produce complementary improvements, establishing new state-of-the-art performance in both monolingual and cross-lingual settings. Full article
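
The LLM-based label smoothing strategy replaces hard one-hot emotion labels with soft distributions. A minimal sketch of how such soft targets could enter a cross-entropy-style loss; the emotion set, mixing weight `alpha`, and example probabilities are assumptions, not values from the paper:

```python
# Training with soft emotion labels: blend annotator one-hot labels with
# LLM-suggested distributions, then minimize cross-entropy against the blend.
import torch
import torch.nn.functional as F

EMOTIONS = ["neutral", "joy", "sadness", "anger", "fear", "surprise", "disgust"]

def soft_targets(hard_idx, llm_probs, alpha=0.2):
    """alpha controls how much of the LLM distribution is mixed in (assumed value)."""
    one_hot = F.one_hot(hard_idx, num_classes=len(EMOTIONS)).float()
    return (1 - alpha) * one_hot + alpha * llm_probs

def soft_cross_entropy(logits, targets):
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# Dummy batch: the annotator says "joy"; the LLM also hears a hint of "surprise".
hard = torch.tensor([1])
llm = torch.tensor([[0.05, 0.60, 0.0, 0.0, 0.0, 0.35, 0.0]])
logits = torch.randn(1, len(EMOTIONS), requires_grad=True)

loss = soft_cross_entropy(logits, soft_targets(hard, llm))
loss.backward()
print(float(loss))
```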

16 pages, 647 KB  
Article
Implementation of a Generative AI-Powered Digital Interactive Platform for Clinical Language Therapy in Children with Language Delay: A Pilot Study
by Chia-Hui Chueh, Tzu-Hui Chiang, Po-Wei Pan, Ko-Long Lin, Yen-Sen Lu, Sheng-Hui Tuan, Chao-Ruei Lin, I-Ching Huang and Hsu-Sheng Cheng
Life 2025, 15(10), 1628; https://doi.org/10.3390/life15101628 - 18 Oct 2025
Viewed by 1541
Abstract
Early intervention is pivotal for optimizing neurodevelopmental outcomes in children with language delay, where increased language stimulation can optimize therapeutic outcomes. Extending speech–language therapy from clinical settings to the home is a promising strategy; however, practical barriers and a lack of scalable, customizable home-based models limit the implementation of this approach. The integration of AI-powered digital interactive tools could bridge this gap. This pilot feasibility study adopted a single-arm pre–post (before–after) design within a two-phase, mixed-methods framework to evaluate a generative AI-powered interactive platform supporting home-based language therapy in children with either idiopathic language delay or autism spectrum disorder (ASD)-related language impairment: two conditions known to involve heterogeneous developmental profiles. The participants received clinical language assessments and engaged in home-based training using AI-enhanced tablet software, and 2000 audio recordings were collected and analyzed to assess pre- and postintervention language abilities. A total of 22 children aged 2–12 years were recruited, with 19 completing both phases. Based on 6-week cumulative usage, participants were stratified with respect to hours of AI usage into Groups A (≤5 h, n = 5), B (5 < h ≤ 10, n = 5), C (10 < h ≤ 15, n = 4), and D (>15 h, n = 5). A threshold effect was observed: only Group D showed significant gains between baseline and postintervention, with total words (58→110, p = 0.043), characters (98→192, p = 0.043), type–token ratio (0.59→0.78, p = 0.043), nouns (34→56, p = 0.043), verbs (12→34, p = 0.043), and mean length of utterance (1.83→3.24, p = 0.043) all improving. No significant changes were found in Groups A to C. These findings indicate the positive impact of extended use on the development of language. Generative AI-powered digital interactive tools, when they are integrated into home-based language therapy programs, can significantly improve language outcomes in children who have language delay and ASD. This approach offers a scalable, cost-effective extension of clinical care to the home, demonstrating the potential to enhance therapy accessibility and long-term outcomes. Full article
(This article belongs to the Section Medical Research)
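
For reference, the reported lexical measures are simple to compute from transcripts: type–token ratio is unique words over total words, and mean length of utterance is words per utterance. A small illustration on made-up data:

```python
# Type-token ratio (TTR) and mean length of utterance (MLU) from transcribed utterances.
def language_metrics(utterances):
    tokens = [w for u in utterances for w in u.split()]
    types = set(tokens)
    ttr = len(types) / len(tokens) if tokens else 0.0
    mlu = len(tokens) / len(utterances) if utterances else 0.0
    return {"total_words": len(tokens), "ttr": round(ttr, 2), "mlu": round(mlu, 2)}

# Made-up post-intervention sample (English stand-in for the transcribed speech).
sample = ["I want the red ball", "give ball", "I want more juice please"]
print(language_metrics(sample))   # {'total_words': 12, 'ttr': 0.75, 'mlu': 4.0}
```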

17 pages, 2436 KB  
Article
Deep Learning System for Speech Command Recognition
by Dejan Vujičić, Đorđe Damnjanović, Dušan Marković and Zoran Stamenković
Electronics 2025, 14(19), 3793; https://doi.org/10.3390/electronics14193793 - 24 Sep 2025
Cited by 1 | Viewed by 1824
Abstract
We present a deep learning model for the recognition of speech commands in the English language. The dataset is based on the Google Speech Commands Dataset by Warden P., version 0.01, and it consists of ten distinct commands (“left”, “right”, “go”, “stop”, “up”, “down”, “on”, “off”, “yes”, and “no”) along with additional “silence” and “unknown” classes. The dataset is split in a speaker-independent manner, with 70% of speakers assigned to the training set and 15% each to the validation and test sets. All audio clips are sampled at 16 kHz, for a total of 46,146 clips. Audio files are converted into Mel spectrogram representations, which are then used as input to a deep learning model composed of a four-layer convolutional neural network followed by two fully connected layers. The model employs Rectified Linear Unit (ReLU) activation, the Adam optimizer, and dropout regularization to improve generalization. The achieved testing accuracy is 96.05%. Micro- and macro-averaged precision, recall, and F1-scores of 95% are reported to reflect class-wise performance, and a confusion matrix is also provided. The proposed model has been deployed on a Raspberry Pi 5 as a Fog computing device for real-time speech recognition applications. Full article
(This article belongs to the Special Issue Data-Centric Artificial Intelligence: New Methods for Data Processing)
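
A sketch in the same spirit as the described network (Mel-spectrogram input, four convolutional layers, two fully connected layers, ReLU and dropout), with filter counts and pooling chosen arbitrarily rather than taken from the paper:

```python
# Mel spectrogram -> 4-conv CNN -> 2 FC layers, for 12 classes (10 commands + silence + unknown).
import torch
import torch.nn as nn
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)

class CommandNet(nn.Module):
    def __init__(self, n_classes=12):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.conv = nn.Sequential(block(1, 16), block(16, 32), block(32, 64), block(64, 64))
        self.fc = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(128), nn.ReLU(), nn.Dropout(0.5), nn.Linear(128, n_classes)
        )

    def forward(self, waveform):                 # waveform: (B, 16000) one-second clips
        x = mel(waveform).unsqueeze(1)           # (B, 1, n_mels, frames)
        return self.fc(self.conv(x))

model = CommandNet()
logits = model(torch.randn(4, 16000))
print(logits.shape)                              # torch.Size([4, 12])
```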

20 pages, 2930 KB  
Article
Pain Level Classification from Speech Using GRU-Mixer Architecture with Log-Mel Spectrogram Features
by Adi Alhudhaif
Diagnostics 2025, 15(18), 2362; https://doi.org/10.3390/diagnostics15182362 - 17 Sep 2025
Cited by 1 | Viewed by 756
Abstract
Background/Objectives: Automatic pain detection from speech signals holds strong promise for non-invasive and real-time assessment in clinical and caregiving settings, particularly for populations with limited capacity for self-report. Methods: In this study, we introduce a lightweight recurrent deep learning approach, the Gated Recurrent Unit (GRU)-Mixer model, for pain level classification from speech signals. The proposed model maps raw audio inputs into Log-Mel spectrogram features, which are passed through a stacked bidirectional GRU to model the spectral and temporal dynamics of vocal expressions. To extract compact utterance-level embeddings, an adaptive average pooling-based temporal mixing mechanism is applied over the GRU outputs, followed by a fully connected classification head with dropout regularization. This architecture is used for several supervised classification tasks, including binary (pain/non-pain), graded intensity (mild, moderate, severe), and thermal-state (cold/warm) classification. End-to-end training is performed using speaker-independent splits and a class-balanced loss to promote generalization and discourage bias. Audio inputs are normalized to a consistent 3-s window and resampled to 8 kHz for consistency and computational efficiency. Results: Experiments on the TAME Pain dataset show strong classification performance, achieving 83.86% accuracy for binary pain detection and up to 75.36% for multiclass pain intensity classification. Conclusions: As the first deep learning-based classification work on the TAME Pain dataset, this work introduces the GRU-Mixer as an effective benchmark architecture for future studies on speech-based pain recognition and affective computing. Full article
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
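
A rough PyTorch sketch of the described pipeline (log-Mel features, stacked bidirectional GRU, adaptive average pooling over time, dropout plus a linear head); the 3-second, 8 kHz framing follows the abstract, everything else is assumed:

```python
# Log-Mel -> stacked BiGRU -> adaptive average pooling over time -> classification head.
import torch
import torch.nn as nn
import torchaudio

logmel = nn.Sequential(
    torchaudio.transforms.MelSpectrogram(sample_rate=8000, n_mels=64),
    torchaudio.transforms.AmplitudeToDB(),
)

class GRUMixer(nn.Module):
    def __init__(self, n_mels=64, hidden=128, n_classes=2):
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, num_layers=2, batch_first=True, bidirectional=True)
        self.pool = nn.AdaptiveAvgPool1d(1)                  # temporal "mixing" into one vector
        self.head = nn.Sequential(nn.Dropout(0.3), nn.Linear(2 * hidden, n_classes))

    def forward(self, waveform):                             # (B, 24000): 3 s at 8 kHz
        feats = logmel(waveform).transpose(1, 2)             # (B, frames, n_mels)
        out, _ = self.gru(feats)                             # (B, frames, 2*hidden)
        pooled = self.pool(out.transpose(1, 2)).squeeze(-1)  # (B, 2*hidden)
        return self.head(pooled)

logits = GRUMixer()(torch.randn(2, 24000))                   # binary pain / no-pain logits
print(logits.shape)                                          # torch.Size([2, 2])
```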

20 pages, 3942 KB  
Article
Self-Supervised Voice Denoising Network for Multi-Scenario Human–Robot Interaction
by Mu Li, Wenjin Xu, Chao Zeng and Ning Wang
Biomimetics 2025, 10(9), 603; https://doi.org/10.3390/biomimetics10090603 - 9 Sep 2025
Viewed by 1186
Abstract
Human–robot interaction (HRI) via voice command has significantly advanced in recent years, with large Vision–Language–Action (VLA) models demonstrating particular promise in human–robot voice interaction. However, these systems still struggle with environmental noise contamination during voice interaction and lack a specialized denoising network for multi-speaker command isolation in overlapping-speech scenarios. To overcome these challenges, we introduce a method to enhance voice command-based HRI in noisy environments, leveraging synthetic data and a self-supervised denoising network to improve its real-world applicability. Our approach focuses on improving self-supervised network performance in denoising mixed-noise audio through training data scaling. Extensive experiments show our method outperforms existing approaches in simulation and achieves 7.5% higher accuracy than the state-of-the-art method in noisy real-world environments, enhancing voice-guided robot control. Full article
(This article belongs to the Special Issue Intelligent Human–Robot Interaction: 4th Edition)
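
The training-data scaling mentioned above relies on synthetic noisy mixtures. One conventional way to build such pairs, mixing clean speech with noise at a chosen signal-to-noise ratio (not necessarily the authors' recipe):

```python
# Create (noisy, clean) training pairs by mixing clean speech with noise at a target SNR.
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR relative to `clean`."""
    noise = np.resize(noise, clean.shape)                   # loop/trim noise to match length
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.linspace(0, 1, 16000))  # stand-in for a voice command
noise = rng.normal(size=8000)                               # stand-in for background noise
pairs = [(mix_at_snr(clean, noise, snr), clean) for snr in (0, 5, 10, 15)]
print(len(pairs), pairs[0][0].shape)                        # 4 (16000,)
```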

15 pages, 252 KB  
Article
Enhanced Neural Speech Recognition of Quranic Recitations via a Large Audio Model
by Mohammad Alshboul, Abdul Rahman Al Muaitah, Suhad Al-Issa and Mahmoud Al-Ayyoub
Appl. Sci. 2025, 15(17), 9521; https://doi.org/10.3390/app15179521 - 29 Aug 2025
Viewed by 2303
Abstract
In this work, we build on our recent efforts toward developing a neural speech recognition (NSR) system for Quranic recitations that is accessible to people of any age, gender, or expertise level. The Quran recitations by females and males (QRFAM) dataset, a sizable benchmark dataset of audio recordings made by male and female reciters from various age groups and competence levels, was introduced in our prior work. In addition, we previously used various subsets of the QRFAM dataset for training, validation, and testing to build several basic NSR systems based on Mozilla’s DeepSpeech model, and presented our efforts to optimize and enhance these baseline models. In this study, we expand that work by utilizing one of the well-known speech recognition models, Whisper, and we describe the effect of this choice on accuracy, expressed as the word error rate (WER), in comparison to DeepSpeech. Full article
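
The Whisper comparison amounts to transcribing the same recordings and scoring WER against reference text. A minimal sketch using the open-source whisper and jiwer packages; the model size, file path, and reference string are placeholders:

```python
# Transcribe a recitation with Whisper and score word error rate against a reference.
import whisper   # pip install openai-whisper
import jiwer     # pip install jiwer

model = whisper.load_model("small")                         # model size is a placeholder choice
result = model.transcribe("recitation.wav", language="ar")  # file path is a placeholder
hypothesis = result["text"]

reference = "reference transcription goes here"             # replace with the ground-truth text
print(f"WER = {jiwer.wer(reference, hypothesis):.3f}")
```
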
51 pages, 15030 KB  
Review
A Review on Sound Source Localization in Robotics: Focusing on Deep Learning Methods
by Reza Jalayer, Masoud Jalayer and Amirali Baniasadi
Appl. Sci. 2025, 15(17), 9354; https://doi.org/10.3390/app15179354 - 26 Aug 2025
Cited by 3 | Viewed by 3846
Abstract
Sound source localization (SSL) adds a spatial dimension to auditory perception, allowing a system to pinpoint the origin of speech, machinery noise, warning tones, or other acoustic events, capabilities that facilitate robot navigation, human–machine dialogue, and condition monitoring. While existing surveys provide valuable historical context, they typically address general audio applications and do not fully account for robotic constraints or the latest advancements in deep learning. This review addresses these gaps by offering a robotics-focused synthesis, emphasizing recent progress in deep learning methodologies. We start by reviewing classical methods such as time difference of arrival (TDOA), beamforming, steered-response power (SRP), and subspace analysis. Subsequently, we delve into modern machine learning (ML) and deep learning (DL) approaches, discussing traditional ML, neural networks (NNs), convolutional neural networks (CNNs), convolutional recurrent neural networks (CRNNs), and emerging attention-based architectures. Data and training strategies, the two cornerstones of DL-based SSL, are also explored. Studies are further categorized by robot type and application domain to help researchers identify relevant work for their specific contexts. Finally, we highlight current challenges in SSL regarding environmental robustness, sound source multiplicity, and implementation constraints specific to robotics, as well as data and learning strategies in DL-based SSL, and we sketch promising directions toward robust, adaptable, efficient, and explainable DL-based SSL for next-generation robots. Full article
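
Among the classical methods listed, TDOA estimation with the generalized cross-correlation phase transform (GCC-PHAT) is compact enough to illustrate; a standard numpy implementation with example microphone spacing and sample rate:

```python
# TDOA between two microphones via GCC-PHAT, then azimuth from the delay.
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n=n)    # phase transform weighting
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs             # delay in seconds

fs, d, c = 16000, 0.1, 343.0           # sample rate, mic spacing (m), speed of sound (m/s)
rng = np.random.default_rng(0)
src = rng.normal(size=fs)
mic1 = src
mic2 = np.roll(src, 3)                 # simulate a 3-sample inter-mic delay
tau = gcc_phat(mic2, mic1, fs, max_tau=d / c)
angle = np.degrees(np.arcsin(np.clip(tau * c / d, -1, 1)))
print(f"tau = {tau * 1e3:.3f} ms, azimuth ~ {angle:.1f} degrees")
```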

18 pages, 3632 KB  
Article
Multilingual Mobility: Audio-Based Language ID for Automotive Systems
by Joowon Oh and Jeaho Lee
Appl. Sci. 2025, 15(16), 9209; https://doi.org/10.3390/app15169209 - 21 Aug 2025
Viewed by 1095
Abstract
With the growing demand for natural and intelligent human–machine interaction in multilingual environments, automatic language identification (LID) has emerged as a crucial component in voice-enabled systems, particularly in the automotive domain. This study proposes an audio-based LID model that identifies the spoken language directly from voice input without requiring manual language selection. The model architecture leverages two types of feature extraction pipelines: a Variational Autoencoder (VAE) and a pre-trained Wav2Vec model, both used to obtain latent speech representations. These embeddings are then fed into a multi-layer perceptron (MLP)-based classifier to determine the speaker’s language among five target languages: Korean, Japanese, Chinese, Spanish, and French. The model is trained and evaluated using a dataset preprocessed into Mel-Frequency Cepstral Coefficients (MFCCs) and raw waveform inputs. Experimental results demonstrate the effectiveness of the proposed approach in achieving accurate and real-time language detection, with potential applications in in-vehicle systems, speech translation platforms, and multilingual voice assistants. By eliminating the need for predefined language settings, this work contributes to more seamless and user-friendly multilingual voice interaction systems. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
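
A simplified version of the described pipeline (utterance-level speech features pooled into a vector, then an MLP over the five target languages); here MFCC statistics stand in for the learned VAE or Wav2Vec embeddings, and all sizes are assumptions:

```python
# Utterance-level MFCC statistics -> MLP classifier over five target languages.
import torch
import torch.nn as nn
import torchaudio

LANGUAGES = ["ko", "ja", "zh", "es", "fr"]
mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=40)

class LanguageID(nn.Module):
    def __init__(self, n_mfcc=40, hidden=256, n_langs=len(LANGUAGES)):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_mfcc, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_langs),
        )

    def forward(self, waveform):                     # waveform: (B, samples)
        feats = mfcc(waveform)                       # (B, n_mfcc, frames)
        stats = torch.cat([feats.mean(-1), feats.std(-1)], dim=-1)   # utterance embedding
        return self.mlp(stats)

model = LanguageID()
logits = model(torch.randn(3, 16000 * 2))            # three 2-second utterances
print(LANGUAGES[int(logits[0].argmax())])            # untrained model: arbitrary guess
```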
