Search Results (38)

Search Parameters:
Keywords = whispered speech

12 pages, 1025 KB  
Article
Enhancing Whisper Fine-Tuning with Discrete Wavelet Transform-Based LoRA Initialization
by Liang Lan, Molin Fang, Yuxuan Chen, Daliang Wang and Wenyong Wang
Electronics 2026, 15(3), 586; https://doi.org/10.3390/electronics15030586 - 29 Jan 2026
Viewed by 63
Abstract
In low-resource automatic speech recognition (ASR) scenarios, parameter-efficient fine-tuning (PEFT) has become a crucial approach for adapting large pre-trained speech models. Although low-rank adaptation (LoRA) offers clear advantages in efficiency, stability, and deployment friendliness, its performance remains constrained because random initialization fails to capture the time–frequency structural characteristics of speech signals. To address this limitation, this work proposes a structured initialization mechanism that integrates LoRA with the discrete wavelet transform (DWT). By combining wavelet-based initialization, a multi-scale fusion mechanism, and a residual strategy, the proposed method constructs a low-rank adaptation subspace that better aligns with the local time–frequency properties of speech signals. Discrete Wavelet Transform-Based LoRA Initialization (DWTLoRA) enables LoRA modules to incorporate prior modeling of speech dynamics at the start of fine-tuning, substantially reducing the search space of ineffective directions during early training and improving convergence speed, training stability, and recognition accuracy under low-resource conditions. Experimental results on Sichuan dialect speech recognition based on the Whisper architecture demonstrate that the proposed DWTLoRA initialization outperforms standard LoRA and several PEFT baseline methods in terms of character error rate (CER) and training efficiency, confirming the critical role of signal-structure-aware initialization in low-resource ASR. Full article
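
As a rough illustration of the initialization idea described in this abstract, the minimal sketch below builds LoRA matrices for a linear layer whose down-projection rows are derived from discrete-wavelet filters instead of random noise. It assumes PyTorch and PyWavelets; the model width, rank, and choice of the db4 wavelet are illustrative placeholders, not the authors' configuration.

```python
# Sketch: structure-aware LoRA initialization from discrete wavelet filters.
# Assumptions (not from the paper): Whisper-style d_model=768, LoRA rank r=8,
# Daubechies-4 wavelet; B is zero-initialized as in standard LoRA.
import numpy as np
import pywt
import torch
import torch.nn as nn

d_model, rank = 768, 8

def wavelet_rows(n_rows: int, dim: int, wavelet: str = "db4") -> torch.Tensor:
    """Build n_rows basis vectors of length dim from shifted copies of the
    wavelet's decomposition low-pass/high-pass filters."""
    lo = np.asarray(pywt.Wavelet(wavelet).dec_lo)
    hi = np.asarray(pywt.Wavelet(wavelet).dec_hi)
    rows = []
    for i in range(n_rows):
        filt = lo if i % 2 == 0 else hi            # alternate approximation/detail filters
        row = np.zeros(dim)
        start = (i * dim // n_rows) % max(dim - len(filt), 1)
        row[start:start + len(filt)] = filt        # place the filter at a shifted offset
        rows.append(row / (np.linalg.norm(row) + 1e-8))
    return torch.tensor(np.stack(rows), dtype=torch.float32)

class DWTInitLoRA(nn.Module):
    """LoRA adapter whose A matrix starts from wavelet-derived directions."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.A = nn.Parameter(wavelet_rows(rank, base.in_features))   # (r, in)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))   # (out, r)
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = DWTInitLoRA(nn.Linear(d_model, d_model), rank)
print(layer(torch.randn(2, 10, d_model)).shape)    # torch.Size([2, 10, 768])
```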

38 pages, 6181 KB  
Article
An AIoT-Based Framework for Automated English-Speaking Assessment: Architecture, Benchmarking, and Reliability Analysis of Open-Source ASR
by Paniti Netinant, Rerkchai Fooprateepsiri, Ajjima Rukhiran and Meennapa Rukhiran
Informatics 2026, 13(2), 19; https://doi.org/10.3390/informatics13020019 - 26 Jan 2026
Viewed by 231
Abstract
The emergence of low-cost edge devices has enabled the integration of automatic speech recognition (ASR) into IoT environments, creating new opportunities for real-time language assessment. However, achieving reliable performance on resource-constrained hardware remains a significant challenge, especially on the Artificial Internet of Things (AIoT). This study presents an AIoT-based framework for automated English-speaking assessment that integrates architecture and system design, ASR benchmarking, and reliability analysis on edge devices. The proposed AIoT-oriented architecture incorporates a lightweight scoring framework capable of analyzing pronunciation, fluency, prosody, and CEFR-aligned speaking proficiency within an automated assessment system. Seven open-source ASR models—four Whisper variants (tiny, base, small, and medium) and three Vosk models—were systematically benchmarked in terms of recognition accuracy, inference latency, and computational efficiency. Experimental results indicate that Whisper-medium deployed on the Raspberry Pi 5 achieved the strongest overall performance, reducing inference latency by 42–48% compared with the Raspberry Pi 4 and attaining the lowest Word Error Rate (WER) of 6.8%. In contrast, smaller models such as Whisper-tiny, with a WER of 26.7%, exhibited two- to threefold higher scoring variability, demonstrating how recognition errors propagate into automated assessment reliability. System-level testing revealed that the Raspberry Pi 5 can sustain near real-time processing with approximately 58% CPU utilization and around 1.2 GB of memory, whereas the Raspberry Pi 4 frequently approaches practical operational limits under comparable workloads. Validation using real learner speech data (approximately 100 sessions) confirmed that the proposed system delivers accurate, portable, and privacy-preserving speaking assessment using low-power edge hardware. Overall, this work introduces a practical AIoT-based assessment framework, provides a comprehensive benchmark of open-source ASR models on edge platforms, and offers empirical insights into the trade-offs among recognition accuracy, inference latency, and scoring stability in edge-based ASR deployments. Full article
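
A minimal sketch of the two headline measurements in this benchmark, word error rate and real-time factor, is shown below using the openai-whisper and jiwer packages; the audio file, reference transcript, and model size are placeholders rather than the study's data.

```python
# Sketch: measuring WER and real-time factor (RTF) for a Whisper model.
import time
import jiwer
import whisper

model = whisper.load_model("tiny")              # try "tiny", "base", "small", "medium"
audio_path = "sample_utterance.wav"             # placeholder file
reference = "the quick brown fox jumps over the lazy dog"   # placeholder reference

audio = whisper.load_audio(audio_path)
duration_s = len(audio) / 16000                 # whisper resamples to 16 kHz

start = time.perf_counter()
result = model.transcribe(audio_path, language="en")
latency_s = time.perf_counter() - start

wer = jiwer.wer(reference, result["text"].lower())
rtf = latency_s / duration_s                    # <1.0 means faster than real time
print(f"WER={wer:.3f}  latency={latency_s:.2f}s  RTF={rtf:.2f}")
```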

30 pages, 6201 KB  
Article
AFAD-MSA: Dataset and Models for Arabic Fake Audio Detection
by Elsayed Issa
Computation 2026, 14(1), 20; https://doi.org/10.3390/computation14010020 - 14 Jan 2026
Viewed by 239
Abstract
As generative speech synthesis produces near-human synthetic voices and reliance on online media grows, robust audio-deepfake detection is essential to fight misuse and misinformation. In this study, we introduce the Arabic Fake Audio Dataset for Modern Standard Arabic (AFAD-MSA), a curated corpus of authentic and synthetic Arabic speech designed to advance research on Arabic deepfake and spoofed-speech detection. The synthetic subset is generated with four state-of-the-art proprietary text-to-speech and voice-conversion models. Rich metadata—covering speaker attributes and generation information—is provided to support reproducibility and benchmarking. To establish reference performance, we trained three AASIST models and compared their performance to two baseline transformer detectors (Wav2Vec 2.0 and Whisper). On the AFAD-MSA test split, AASIST-2 achieved perfect accuracy, surpassing the baseline models. However, its performance declined under cross-dataset evaluation. These results underscore the importance of data construction. Detectors generalize best when exposed to diverse attack types. In addition, continual or contrastive training that interleaves bona fide speech with large, heterogeneous spoofed corpora will further improve detectors’ robustness. Full article
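
The sketch below illustrates, in simplified form, the kind of transformer baseline the abstract mentions: pooled Whisper encoder features feeding a small bona fide/spoofed classification head. It uses Hugging Face transformers; the model size, pooling, and classifier head are illustrative assumptions and do not reproduce AASIST or the authors' detectors.

```python
# Sketch: a minimal bona fide vs. spoofed classifier over pooled Whisper
# encoder features. Model size, pooling, and the head are illustrative choices.
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
encoder = WhisperModel.from_pretrained("openai/whisper-base").encoder
encoder.eval()

class SpoofHead(nn.Module):
    def __init__(self, d_model: int = 512):        # whisper-base hidden size
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, feats):                       # feats: (batch, time, d_model)
        return self.net(feats.mean(dim=1))          # mean-pool over time -> 2 logits

head = SpoofHead()
waveform = torch.randn(16000 * 3).numpy()           # placeholder: 3 s of audio at 16 kHz
inputs = fe(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    feats = encoder(inputs.input_features).last_hidden_state
logits = head(feats)                                 # train with cross-entropy on real/fake labels
print(logits.shape)                                  # torch.Size([1, 2])
```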

22 pages, 1784 KB  
Article
Automated Severity and Breathiness Assessment of Disordered Speech Using a Speech Foundation Model
by Vahid Ashkanichenarlogh, Arman Hassanpour and Vijay Parsa
Information 2026, 17(1), 32; https://doi.org/10.3390/info17010032 - 3 Jan 2026
Viewed by 259
Abstract
In this study, we propose a novel automated model for speech quality estimation that objectively evaluates perceptual dysphonia severity and breathiness in audio samples, demonstrating strong correlation with expert ratings. The proposed model integrates Whisper encoder embeddings with Mel spectrograms augmented by second-order delta features, combined through a sequential-attention fusion network feature mapping path. This hybrid approach enhances the model’s sensitivity to phonetic content, high-level feature representations, and spectral variations, enabling more accurate predictions of perceptual speech quality. The sequential-attention fusion network feature mapping module captures long-range dependencies through multi-head attention, while LSTM layers refine the learned representations by modeling temporal dynamics. Comparative analysis against state-of-the-art methods for dysphonia assessment demonstrates our model’s stronger correlation with clinicians’ judgments across test samples. Our findings underscore the effectiveness of ASR-derived embeddings alongside the deep feature mapping structure in disordered speech quality assessment, offering a promising pathway for advancing automated evaluation systems. Full article
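
A minimal sketch of the delta-augmented Mel-spectrogram branch described above, using librosa, is shown below; the frame, hop, and Mel-band settings are illustrative and not taken from the paper.

```python
# Sketch: Mel spectrogram augmented with first- and second-order delta features,
# as one input stream of such a model. Settings are illustrative.
import librosa
import numpy as np

y, sr = librosa.load("dysphonic_sample.wav", sr=16000)      # placeholder file

mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=80)
log_mel = librosa.power_to_db(mel)                           # (80, frames)
delta1 = librosa.feature.delta(log_mel, order=1)             # first-order dynamics
delta2 = librosa.feature.delta(log_mel, order=2)             # second-order dynamics

features = np.stack([log_mel, delta1, delta2], axis=0)       # (3, 80, frames)
print(features.shape)
```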

22 pages, 929 KB  
Article
Low-Resource Speech Recognition by Fine-Tuning Whisper with Optuna-LoRA
by Huan Wang, Jie Bin, Chunyan Gou, Lian Yang, Baolin Hou and Mingwei Qin
Appl. Sci. 2025, 15(24), 13090; https://doi.org/10.3390/app152413090 - 12 Dec 2025
Viewed by 1275
Abstract
In low-resource speech recognition, the performance of the Whisper model is often limited by the size of the available training data. To address this challenge, this paper proposes a training optimization method for the Whisper model that integrates Low-Rank Adaptation (LoRA) with the Optuna hyperparameter optimization framework. This combined approach enables efficient fine-tuning and enhances model performance. A dual-metric early stopping strategy, based on validation loss and relative word error rate improvement, is introduced to ensure robust convergence during training. Experimental data were collected from three low-resource languages in Xinjiang, China: Uyghur, Kazakh, and Kyrgyz. Compared to baseline LoRA fine-tuning, the proposed optimization method reduces WER by 20.98%, 6.46%, and 8.72%, respectively, across the three languages. The dual-metric early stopping strategy effectively shortens optimization time while preventing overfitting. Overall, these results demonstrate that the proposed method significantly reduces both WER and computational costs, making it highly effective for low-resource speech recognition tasks. Full article
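
The sketch below shows how an Optuna search over LoRA hyperparameters with a dual-metric stopping rule might be wired up. The finetune_and_evaluate() helper is a hypothetical stand-in for a Whisper + LoRA training loop, and the search ranges and stopping thresholds are illustrative.

```python
# Sketch: an Optuna study over LoRA hyperparameters with a simple dual-metric
# stopping rule (validation loss plus relative WER improvement).
import optuna

def finetune_and_evaluate(rank, alpha, lr, max_epochs=10):
    """Hypothetical stand-in: yields (val_loss, wer) after each epoch.
    Replace with an actual Whisper + LoRA fine-tuning loop."""
    for epoch in range(max_epochs):
        yield 2.0 / (epoch + 1), 0.5 / (epoch + 1)   # dummy numbers for the sketch

def objective(trial: optuna.Trial) -> float:
    rank = trial.suggest_categorical("lora_rank", [4, 8, 16, 32])
    alpha = trial.suggest_categorical("lora_alpha", [16, 32, 64])
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)

    best_wer, best_loss, patience = float("inf"), float("inf"), 0
    for val_loss, wer in finetune_and_evaluate(rank, alpha, lr):
        rel_gain = (best_wer - wer) / best_wer if best_wer != float("inf") else 1.0
        # Dual-metric early stop: neither validation loss nor WER is improving.
        patience = patience + 1 if (val_loss >= best_loss and rel_gain < 0.005) else 0
        best_loss, best_wer = min(best_loss, val_loss), min(best_wer, wer)
        if patience >= 2:
            break
    return best_wer

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```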

17 pages, 2332 KB  
Article
Speech Recognition-Based Analysis of Vessel Traffic Service (VTS) Communications for Estimating Advisory Timing
by Sang-Lok Yoo, Kwang-Il Kim and Cho-Young Jung
Appl. Sci. 2025, 15(22), 11968; https://doi.org/10.3390/app152211968 - 11 Nov 2025
Viewed by 571
Abstract
Vessel Traffic Service systems play a critical role in maritime safety by providing timely advisories to vessels in congested waterways. However, the optimal timing for VTS operator interventions has remained largely unstudied, relying primarily on subjective operator experience rather than empirical evidence. This study presents the first large-scale empirical analysis of VTS operator intervention timing using automated speech recognition technology applied to actual maritime communication data. VHF radio communications were collected from five major VTS centers in Korea over nine months, comprising 171,175 communication files with a total duration of 334.2 h. The recorded communications were transcribed using the Whisper speech-to-text model and processed through natural language processing techniques to extract encounter situations and advisory distances. A tokenization and keyword framework was developed to handle Maritime English and local-language communications, normalize textual numerical expressions, and facilitate cross-site analysis. Results reveal that VTS operator intervention timing varies by encounter type: in head-on and crossing encounters, advisories are provided at mean distances of 3.1 nm and 2.8 nm, respectively. These quantitative benchmarks provide an empirical foundation for developing standardized VTS operational guidelines and decision support systems, ultimately enhancing maritime safety and operational consistency across jurisdictions. Full article
(This article belongs to the Special Issue Risk and Safety of Maritime Transportation)
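
As a simplified illustration of the transcription-and-extraction step described in the abstract above, the sketch below transcribes a recording with Whisper, normalizes a few spelled-out numbers, and pulls out distances expressed in nautical miles; the number-word table and regular expression are illustrative and far simpler than the study's full tokenization and keyword framework.

```python
# Sketch: Whisper transcription of a VHF call plus extraction of advisory
# distances in nautical miles from the transcript.
import re
import whisper

model = whisper.load_model("small")
text = model.transcribe("vhf_call.wav")["text"].lower()      # placeholder audio file

# Normalize a few spelled-out numbers ("three point one miles" -> "3.1 miles").
words_to_digits = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
                   "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
                   "point": "."}
tokens = [words_to_digits.get(t, t) for t in text.split()]
norm = " ".join(tokens)
norm = re.sub(r"(\d) \. (\d)", r"\1.\2", norm)                # rejoin "3 . 1" -> "3.1"

# Distances followed by a nautical-mile keyword.
distances = [float(m) for m in
             re.findall(r"(\d+(?:\.\d+)?)\s*(?:nm|nautical miles?|miles?)", norm)]
print(distances)
```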

17 pages, 2618 KB  
Article
Optimizer-Aware Fine-Tuning of Whisper Small with Low-Rank Adaption: An Empirical Study of Adam and AdamW
by Hadia Arshad, Tahir Abdullah, Mariam Rehman, Afzaal Hussain, Faria Kanwal and Mehwish Parveen
Information 2025, 16(11), 928; https://doi.org/10.3390/info16110928 - 22 Oct 2025
Viewed by 1281
Abstract
Whisper is a transformer-based multilingual model that has demonstrated state-of-the-art performance across numerous languages. However, fine-tuning it efficiently remains difficult under limited computational resources. To address this issue, an experiment was performed using librispeech-train-clean-100 for training, with the test-clean set used for evaluation. To enhance efficiency and reduce computational demands, a parameter-efficient fine-tuning technique, Low-Rank Adaptation, was employed to add a small number of trainable parameters to the frozen layers of the model. The results showed that Low-Rank Adaptation attained strong automatic speech recognition performance while using fewer computational resources, demonstrating its effectiveness for resource-constrained adaptation. This work highlights the promise of Low-Rank Adaptation as a lightweight and scalable fine-tuning strategy for large transformer-based speech models. The baseline Whisper Small model achieved a word error rate of 16.7% without any parameter-efficient adaptation, whereas the model fine-tuned with Low-Rank Adaptation achieved a lower word error rate of 6.08%, demonstrating the effectiveness of the proposed parameter-efficient approach. Full article
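
A minimal sketch of the setup compared in this study, LoRA adapters attached to Whisper Small with the PEFT library and trained with either Adam or AdamW, is shown below; the rank, alpha, target modules, and learning rates are illustrative placeholders.

```python
# Sketch: LoRA adapters on Whisper Small plus the two optimizers under comparison.
import torch
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

lora_cfg = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],      # attention projections in Whisper layers
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()             # only the LoRA matrices are trainable

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer_adam = torch.optim.Adam(trainable, lr=1e-4)
optimizer_adamw = torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.01)
# Train one copy of the model with each optimizer and compare WER on test-clean.
```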

26 pages, 2388 KB  
Article
Facial and Speech-Based Emotion Recognition Using Sequential Pattern Mining
by Younghun Song and Kyungyong Chung
Electronics 2025, 14(20), 4015; https://doi.org/10.3390/electronics14204015 - 13 Oct 2025
Viewed by 1476
Abstract
We propose a multimodal emotion recognition framework that integrates facial expressions and speech transcription (where text is derived from the transcribed speech), with a particular focus on effectively modeling the continuous changes and transitions of emotional states during conversation. Existing studies have primarily relied on single modalities (text or facial expressions). They often perform static emotion classification at specific time points. This approach limits their ability to capture abrupt emotional shifts or the structural patterns of emotional flow within dialogues. To address these limitations, this paper utilizes the MELD dataset to construct emotion sequences based on the order of utterances and introduces an analytical approach using Sequential Pattern Mining (SPM). Facial expressions are detected using DeepFace, while speech is transcribed with Whisper and passed through a BERT-based emotion classifier to infer emotions. The proposed method fuses multimodal results through a weighted voting scheme to generate emotion label sequences for each utterance. These sequences are then used to construct an emotion transition matrix, apply change-point detection, perform SPM, and train an LSTM-based classification model to predict the overall emotional flow of the dialogue. This approach goes beyond single-point judgments by capturing the contextual flow and dynamics of emotions and demonstrates superior performance compared to existing methods through experimental validation. Full article
(This article belongs to the Special Issue Application of Data Mining in Social Media)
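
The sketch below illustrates two of the steps described in the abstract above: weighted voting over per-utterance facial and text emotion predictions, and the construction of an emotion transition matrix from the fused label sequence. The modality weights and example label sequences are illustrative placeholders.

```python
# Sketch: weighted voting over facial and text emotion labels, followed by a
# row-normalized emotion transition matrix over the fused sequence.
from collections import Counter
import numpy as np

LABELS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]
W_FACE, W_TEXT = 0.4, 0.6                      # illustrative modality weights

def fuse(face_label: str, text_label: str) -> str:
    votes = Counter()
    votes[face_label] += W_FACE
    votes[text_label] += W_TEXT
    return votes.most_common(1)[0][0]

# Per-utterance predictions from a face model and a text classifier (placeholders).
face_seq = ["neutral", "joy", "joy", "anger", "anger"]
text_seq = ["neutral", "neutral", "joy", "anger", "sadness"]
fused = [fuse(f, t) for f, t in zip(face_seq, text_seq)]

# Transition matrix P[i, j] = P(next = j | current = i).
idx = {lab: i for i, lab in enumerate(LABELS)}
counts = np.zeros((len(LABELS), len(LABELS)))
for cur, nxt in zip(fused, fused[1:]):
    counts[idx[cur], idx[nxt]] += 1
transition = counts / np.clip(counts.sum(axis=1, keepdims=True), 1, None)
print(fused)
print(transition[idx["joy"]])                  # outgoing transition probabilities from "joy"
```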

29 pages, 1708 KB  
Article
Speech Recognition and Synthesis Models and Platforms for the Kazakh Language
by Aidana Karibayeva, Vladislav Karyukin, Balzhan Abduali and Dina Amirova
Information 2025, 16(10), 879; https://doi.org/10.3390/info16100879 - 10 Oct 2025
Cited by 1 | Viewed by 3864
Abstract
With the rapid development of artificial intelligence and machine learning technologies, automatic speech recognition (ASR) and text-to-speech (TTS) have become key components of the digital transformation of society. The Kazakh language, as a representative of the Turkic language family, remains a low-resource language with limited audio corpora, language models, and high-quality speech synthesis systems. This study provides a comprehensive analysis of existing speech recognition and synthesis models, emphasizing their applicability and adaptation to the Kazakh language. Special attention is given to linguistic and technical barriers, including the agglutinative structure, rich vowel system, and phonemic variability. Both open-source and commercial solutions were evaluated, including Whisper, GPT-4 Transcribe, ElevenLabs, OpenAI TTS, Voiser, KazakhTTS2, and TurkicTTS. Speech recognition systems were assessed using BLEU, WER, TER, chrF, and COMET, while speech synthesis was evaluated with MCD, PESQ, STOI, and DNSMOS, thus covering both lexical–semantic and acoustic–perceptual characteristics. The results demonstrate that, for speech-to-text (STT), the strongest performance was achieved by Soyle on domain-specific data (BLEU 74.93, WER 18.61), while Voiser showed balanced accuracy (WER 40.65–37.11, chrF 80.88–84.51) and GPT-4 Transcribe achieved robust semantic preservation (COMET up to 1.02). In contrast, Whisper performed weakest (WER 77.10, BLEU 13.22), requiring further adaptation for Kazakh. For text-to-speech (TTS), KazakhTTS2 delivered the most natural perceptual quality (DNSMOS 8.79–8.96), while OpenAI TTS achieved the best spectral accuracy (MCD 123.44–117.11, PESQ 1.14). TurkicTTS offered reliable intelligibility (STOI 0.15, PESQ 1.16), and ElevenLabs produced natural but less spectrally accurate speech. Full article
(This article belongs to the Section Artificial Intelligence)
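
As an illustration of the lexical STT metrics used in this comparison, the sketch below computes WER, BLEU, and chrF with the jiwer and sacrebleu packages; the Kazakh reference/hypothesis pairs are placeholders, not the study's data.

```python
# Sketch: computing WER, BLEU, and chrF for STT hypotheses against references.
import jiwer
import sacrebleu

references = ["бүгін ауа райы жақсы", "мен кітап оқып отырмын"]     # placeholder
hypotheses = ["бүгін ауа райы жақсы", "мен кітап оқы отырмын"]      # placeholder

wer = jiwer.wer(references, hypotheses)
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
chrf = sacrebleu.corpus_chrf(hypotheses, [references]).score
print(f"WER={wer:.3f}  BLEU={bleu:.2f}  chrF={chrf:.2f}")
```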

21 pages, 3434 KB  
Article
Deep Learning-Based Compliance Assessment for Chinese Rail Transit Dispatch Speech
by Qiuzhan Zhao, Jinbai Zou and Lingxiao Chen
Appl. Sci. 2025, 15(19), 10498; https://doi.org/10.3390/app151910498 - 28 Sep 2025
Viewed by 550
Abstract
Rail transit dispatch speech plays a critical role in ensuring the safety of urban rail operations. To enable automated and accurate compliance assessment of dispatch speech, this study proposes an improved deep learning model to address the limitations of conventional approaches in terms of accuracy and robustness. Building upon the baseline Whisper model, two key enhancements are introduced: (1) low-rank adaptation (LoRA) fine-tuning to better adapt the model to the specific acoustic and linguistic characteristics of rail transit dispatch speech, and (2) a novel entity-aware attention mechanism that incorporates named entity recognition (NER) embeddings into the decoder. This mechanism enables attention computation between words belonging to the same entity category across different commands and recitations, which helps highlight keywords critical for compliance assessment and achieve precise inter-sentence element alignment. Experimental results on real-world test sets demonstrate that the proposed model improves recognition accuracy by 30.5% compared to the baseline model. In terms of robustness, we evaluate the relative performance retention under severe noise conditions. While Zero-shot, Full Fine-tuning, and LoRA-only models achieve robustness scores of 72.2%, 72.4%, and 72.1%, respectively, and the NER-only variant reaches 88.1%, our proposed approach further improves to 89.6%. These results validate the model’s significant robustness and its potential to provide efficient and reliable technical support for ensuring the normative use of dispatch speech in urban rail transit operations. Full article
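
The sketch below gives a deliberately simplified picture of the entity-aware attention idea described above: an additive bias that boosts attention scores between tokens sharing the same named-entity category. It is an illustration only; the paper's decoder integration is more involved, and all sizes, categories, and the bias strength are placeholders.

```python
# Sketch: single-head self-attention with an additive entity-aware bias.
import torch
import torch.nn.functional as F

d_model, bias_strength = 64, 2.0
tokens = torch.randn(1, 6, d_model)                          # (batch, seq, dim)
# Entity category per token: 0 = none, 1 = station name, 2 = train number (placeholders).
entity_ids = torch.tensor([[0, 1, 1, 0, 2, 2]])

q = k = v = tokens
scores = q @ k.transpose(-2, -1) / d_model ** 0.5            # (1, 6, 6)

same_entity = entity_ids.unsqueeze(-1) == entity_ids.unsqueeze(-2)   # (1, 6, 6)
is_entity = (entity_ids > 0).unsqueeze(-1)                   # ignore the "none" category
bias = bias_strength * (same_entity & is_entity).float()     # boost same-category pairs

attn = F.softmax(scores + bias, dim=-1)
out = attn @ v
print(out.shape)                                             # torch.Size([1, 6, 64])
```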

32 pages, 3609 KB  
Article
BPMN-Based Design of Multi-Agent Systems: Personalized Language Learning Workflow Automation with RAG-Enhanced Knowledge Access
by Hedi Tebourbi, Sana Nouzri, Yazan Mualla, Meryem El Fatimi, Amro Najjar, Abdeljalil Abbas-Turki and Mahjoub Dridi
Information 2025, 16(9), 809; https://doi.org/10.3390/info16090809 - 17 Sep 2025
Cited by 1 | Viewed by 2812
Abstract
The intersection of Artificial Intelligence (AI) and education is revolutionizing learning and teaching in this digital era, with Generative AI and large language models (LLMs) providing even greater possibilities for the future. The digital transformation of language education demands innovative approaches that combine pedagogical rigor with explainable AI (XAI) principles, particularly for low-resource languages. This paper presents a novel methodology that integrates Business Process Model and Notation (BPMN) with Multi-Agent Systems (MAS) to create transparent, workflow-driven language tutors. Our approach uniquely embeds XAI through three mechanisms: (1) BPMN’s visual formalism that makes agent decision-making auditable, (2) Retrieval-Augmented Generation (RAG) with verifiable knowledge provenance from textbooks of the National Institute of Languages of Luxembourg, and (3) human-in-the-loop validation of both content and pedagogical sequencing. To ensure realism in learner interaction, we integrate speech-to-text and text-to-speech technologies, creating an immersive, human-like learning environment. The system simulates intelligent tutoring through agents’ collaboration and dynamic adaptation to learner progress. We demonstrate this framework through a Luxembourgish language learning platform where specialized agents (Conversational, Reading, Listening, QA, and Grammar) operate within BPMN-modeled workflows. The system achieves high response faithfulness (0.82) and relevance (0.85) according to RAGA metrics, while speech integration using Whisper STT and Coqui TTS enables immersive practice. Evaluation with learners showed 85.8% satisfaction with contextual responses and 71.4% engagement rates, confirming the effectiveness of our process-driven approach. This work advances AI-powered language education by showing how formal process modeling can create pedagogically coherent and explainable tutoring systems. The architecture’s modularity supports extension to other low-resource languages while maintaining the transparency critical for educational trust. Future work will expand curriculum coverage and develop teacher-facing dashboards to further improve explainability. Full article
(This article belongs to the Section Information Applications)
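
As a sketch of the speech interface layer mentioned in this abstract (Whisper for speech-to-text, Coqui TTS for the spoken reply), a minimal turn might look like the following; the model names are examples, actual Luxembourgish support depends on which models are available, and the tutoring agents and RAG pipeline are not shown.

```python
# Sketch: the speech interface only — Whisper STT in, Coqui TTS out.
import whisper
from TTS.api import TTS

stt = whisper.load_model("small")
tts = TTS("tts_models/de/thorsten/tacotron2-DDC")            # example voice, not the project's

def tutor_turn(learner_audio_path: str, reply_text: str, out_path: str = "reply.wav"):
    """Transcribe the learner's utterance, then synthesize the tutor's reply.
    In the full system the transcript would go through the agent/RAG pipeline,
    which produces reply_text; that part is omitted here."""
    transcript = stt.transcribe(learner_audio_path)["text"]
    tts.tts_to_file(text=reply_text, file_path=out_path)
    return transcript, out_path

print(tutor_turn("learner_utterance.wav", "Gutt gemaach! Probéier dat nach eng Kéier."))
```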

12 pages, 304 KB  
Article
LoRA-INT8 Whisper: A Low-Cost Cantonese Speech Recognition Framework for Edge Devices
by Lusheng Zhang, Shie Wu and Zhongxun Wang
Sensors 2025, 25(17), 5404; https://doi.org/10.3390/s25175404 - 1 Sep 2025
Cited by 2 | Viewed by 2703
Abstract
To address the triple bottlenecks of data scarcity, oversized models, and slow inference that hinder Cantonese automatic speech recognition (ASR) in low-resource and edge-deployment settings, this study proposes a cost-effective Cantonese ASR system based on LoRA fine-tuning and INT8 quantization. First, Whisper-tiny is parameter-efficiently fine-tuned on the Common Voice zh-HK training set using LoRA with rank = 8. Only 1.6% of the original weights are updated, reducing the character error rate (CER) from 49.5% to 11.1%, a performance close to full fine-tuning (10.3%), while cutting the training memory footprint and computational cost by approximately one order of magnitude. Next, the fine-tuned model is compressed into a 60 MB INT8 checkpoint via dynamic quantization in ONNX Runtime. On a MacBook Pro M1 Max CPU, the quantized model achieves an RTF = 0.20 (offline inference 5 × real-time) and 43% lower latency than the FP16 baseline; on an NVIDIA A10 GPU, it reaches RTF = 0.06, meeting the requirements of high-concurrency cloud services. Ablation studies confirm that the LoRA-INT8 configuration offers the best trade-off among accuracy, speed, and model size. Limitations include the absence of spontaneous-speech noise data, extreme-hardware validation, and adaptive LoRA structure optimization. Future work will incorporate large-scale self-supervised pre-training, tone-aware loss functions, AdaLoRA architecture search, and INT4/NPU quantization, and will establish an mJ/char energy–accuracy curve. The ultimate goal is to achieve CER ≤ 8%, RTF < 0.1, and mJ/char < 1 for low-power real-time Cantonese ASR in practical IoT scenarios. Full article
(This article belongs to the Section Electronic Sensors)
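
A minimal sketch of the post-training INT8 step and the real-time-factor check described above is shown below using ONNX Runtime dynamic quantization; the file paths are placeholders, the ONNX export of the fine-tuned model is not shown, and the dummy input assumes an encoder-style model that takes log-Mel features.

```python
# Sketch: dynamic INT8 quantization of an exported Whisper ONNX model and a
# simple real-time-factor check on CPU.
import time
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="whisper_tiny_lora_fp32.onnx",     # placeholder exported model
    model_output="whisper_tiny_lora_int8.onnx",
    weight_type=QuantType.QInt8,                   # weights stored as INT8
)

sess = ort.InferenceSession("whisper_tiny_lora_int8.onnx",
                            providers=["CPUExecutionProvider"])
# Placeholder input: 30 s of log-Mel features (80 bins x 3000 frames).
mel = np.random.randn(1, 80, 3000).astype(np.float32)

start = time.perf_counter()
sess.run(None, {sess.get_inputs()[0].name: mel})
rtf = (time.perf_counter() - start) / 30.0         # elapsed time / audio duration
print(f"RTF = {rtf:.2f}")                          # <1.0 means faster than real time
```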

15 pages, 252 KB  
Article
Enhanced Neural Speech Recognition of Quranic Recitations via a Large Audio Model
by Mohammad Alshboul, Abdul Rahman Al Muaitah, Suhad Al-Issa and Mahmoud Al-Ayyoub
Appl. Sci. 2025, 15(17), 9521; https://doi.org/10.3390/app15179521 - 29 Aug 2025
Viewed by 2355
Abstract
In this work, we build on our recent efforts toward developing a neural speech recognition (NSR) system for Quranic recitations that is accessible to people of any age, gender, or expertise level. The Quran recitations by females and males (QRFAM) dataset, a sizable benchmark of audio recordings made by male and female reciters from various age groups and competence levels, was introduced in our prior work, where various subsets of it were used for training, validation, and testing to build several baseline NSR systems based on Mozilla’s DeepSpeech model, along with efforts to optimize and enhance those baselines. In this study, we expand this line of work by utilizing one of the well-known speech recognition models, Whisper, and we describe the effect of this choice on the model’s accuracy, expressed as the word error rate (WER), in comparison to that of DeepSpeech. Full article
31 pages, 5187 KB  
Article
Investigation of ASR Models for Low-Resource Kazakh Child Speech: Corpus Development, Model Adaptation, and Evaluation
by Diana Rakhimova, Zhansaya Duisenbekkyzy and Eşref Adali
Appl. Sci. 2025, 15(16), 8989; https://doi.org/10.3390/app15168989 - 14 Aug 2025
Cited by 2 | Viewed by 2788
Abstract
This study focuses on the development and evaluation of automatic speech recognition (ASR) systems for Kazakh child speech, an underexplored domain in both linguistic and computational research. A specialized acoustic corpus was constructed for children aged 2 to 8 years, incorporating age-related vocabulary stratification and gender variation to capture phonetic and prosodic diversity. The data were collected from three sources: a custom-designed Telegram bot, high-quality Dictaphone recordings, and naturalistic speech samples recorded in home and preschool environments. Four ASR models, Whisper, DeepSpeech, ESPnet, and Vosk, were evaluated. Whisper, ESPnet, and DeepSpeech were fine-tuned on the curated corpus, while Vosk was applied in its standard pretrained configuration. Performance was measured using five evaluation metrics: Word Error Rate (WER), BLEU, Translation Edit Rate (TER), Character Similarity Rate (CSRF2), and Accuracy. The results indicate that ESPnet achieved the highest accuracy (32%) and the lowest WER (0.242) for sentences, while Whisper performed well in semantically rich utterances (Accuracy = 33%; WER = 0.416). Vosk demonstrated the best performance on short words (Accuracy = 68%) and yielded the highest BLEU score (0.600) for short words. DeepSpeech showed moderate improvements in accuracy, particularly for short words (Accuracy = 60%), but faced challenges with longer utterances, achieving an Accuracy of 25% for sentences. These findings emphasize the critical importance of age-appropriate corpora and domain-specific adaptation when developing ASR systems for low-resource child speech, particularly in educational and therapeutic contexts. Full article
(This article belongs to the Section Computing and Artificial Intelligence)

23 pages, 8167 KB  
Article
Revisiting the Acoustics of St Paul’s Cathedral, London
by Aglaia Foteinou, Francis Stevens and Damian Murphy
Acoustics 2025, 7(3), 49; https://doi.org/10.3390/acoustics7030049 - 13 Aug 2025
Cited by 1 | Viewed by 2783
Abstract
The acoustics of St Paul’s Cathedral, London, have been discussed in previous studies as a space of historical, cultural, societal, and architectural interest in the capital city of the United Kingdom. This paper presents the results from recent acoustic measurements carried out within the space, making use of state-of-the-art measurement techniques and equipment. The results from these measurements provide a new perspective on the acoustic properties of different and distinct spaces within the cathedral, including coupling effects between the main areas, and the whispering gallery effect that can be heard around the walkway at the base of the dome. The discussion includes the analysis of room acoustic parameters included in the international standards and speech intelligibility parameters, and an indirect comparison between the techniques used here and those used in previous studies of this space. Full article
(This article belongs to the Special Issue The Past Has Ears: Archaeoacoustics and Acoustic Heritage)
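
As an illustration of the kind of room-acoustic parameter analysis mentioned in this abstract, the sketch below estimates a broadband T30 from a measured impulse response via Schroeder backward integration; the impulse-response file is a placeholder and no octave-band filtering is applied.

```python
# Sketch: reverberation time (T30) from an impulse response via Schroeder
# backward integration (ISO 3382-style, broadband only).
import numpy as np
import soundfile as sf

ir, fs = sf.read("st_pauls_ir.wav")             # placeholder impulse response
if ir.ndim > 1:
    ir = ir[:, 0]                               # use the first channel

# Schroeder energy decay curve, in dB relative to total energy.
energy = ir.astype(np.float64) ** 2
edc = np.cumsum(energy[::-1])[::-1]
edc_db = 10 * np.log10(edc / edc[0] + 1e-12)

# Fit the -5 dB to -35 dB portion of the decay and extrapolate to -60 dB.
t = np.arange(len(edc_db)) / fs
mask = (edc_db <= -5) & (edc_db >= -35)
slope, intercept = np.polyfit(t[mask], edc_db[mask], 1)     # dB per second
t30 = -60.0 / slope
print(f"Estimated T30 = {t30:.2f} s")
```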