
Search Results (337)

Search Parameters:
Keywords = automatic speech recognition

17 pages, 14845 KB  
Article
A Collaborative Robotic System for Autonomous Object Handling with Natural User Interaction
by Federico Neri, Gaetano Lettera, Giacomo Palmieri and Massimo Callegari
Robotics 2026, 15(3), 49; https://doi.org/10.3390/robotics15030049 - 27 Feb 2026
Viewed by 35
Abstract
In Industry 5.0, the transition from fixed traditional automation to flexible human–robot collaboration (HRC) requires interfaces that are both intuitive and efficient. This paper introduces a novel, multimodal control system for autonomous object handling, specifically designed to enhance natural user interaction in dynamic work environments. The system integrates a 6-degree-of-freedom (DoF) collaborative robot (UR5e) with a hand-eye RGB-D vision system to achieve robust autonomy. The core technical contribution lies in a vision pipeline utilizing deep learning for object detection and point cloud processing for accurate 6D pose estimation, enabling advanced tasks such as human-aware object handover directly onto the operator’s hand. Crucially, an Automatic Speech Recognition (ASR) module is incorporated, providing a Natural Language Understanding (NLU) layer that allows operators to issue real-time commands for task modification, error correction, and object selection. Experimental results demonstrate that this multimodal approach offers a streamlined workflow aiming to improve operational flexibility compared to traditional HMIs, while enhancing the perceived naturalness of the collaborative task. The system establishes a framework for highly responsive and intuitive human–robot workspaces, advancing the state of the art in natural interaction for collaborative object manipulation. Full article
(This article belongs to the Special Issue Human–Robot Collaboration in Industry 5.0)
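The voice-command layer described above can be illustrated with a minimal keyword-spotting sketch. The trigger phrases and action names below are illustrative assumptions, not the authors' actual command grammar:

```python
# Minimal sketch of an NLU layer that maps ASR transcripts to robot actions.
# COMMANDS is a hypothetical vocabulary chosen for illustration only.
COMMANDS = {
    "hand over": "HANDOVER_TO_OPERATOR",
    "pick up": "PICK_OBJECT",
    "stop": "EMERGENCY_STOP",
    "cancel": "ABORT_TASK",
}

def parse_command(transcript: str) -> str | None:
    """Return the first robot action whose trigger phrase appears in the transcript."""
    text = transcript.lower()
    for phrase, action in COMMANDS.items():
        if phrase in text:
            return action
    return None

print(parse_command("please hand over the wrench"))  # HANDOVER_TO_OPERATOR
```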
31 pages, 2433 KB  
Article
Quality vs. Populism in Short-Video Political Communication: A Multimodal Study of TikTok
by Alicia Rodas-Coloma, Marcos Cabezas-González, Sonia Casillas-Martín and Pedro Nevado-Batalla Moreno
Journal. Media 2026, 7(1), 46; https://doi.org/10.3390/journalmedia7010046 - 25 Feb 2026
Viewed by 210
Abstract
The article examines how framing and actor identity structure attention in short-video politics using a country-level corpus from Ecuador. It assembles 4612 public TikTok videos from official accounts and politically salient hashtags, extracts multimodal text via automatic speech recognition and on-screen OCR, and constructs two continuous indices: a quality index (programmatic, efficacy-oriented content) and a populism index (antagonistic, people-versus-elite cues). Engagement is modeled as a fractional response (binomial GLM with logit link), with robustness checks using OLS on logit(ER) and Poisson counts with an offset for log(plays + 1). Models include affect (positive sentiment and anger), hour/day controls, and actor fixed effects (leader, creator, institution, party, and media). The indices display construct validity: quality aligns with positive/joyful tone and populism with anger. Net of controls, populism is positively and consistently associated with engagement across estimators; quality is small and often null or negative. Effects are heterogeneous: leaders gain under both frames, creators primarily under populism, and media modestly under populism, while institutions face penalties under both, and parties show limited returns. Monthly series reveal event-linked intensification of populism, and hashtag networks are modular, mapping onto institutional, partisan, and creator ecosystems. A design analysis identifies a non-populist pathway—benefit-first micro-explanations, concise captions, targeted hashtags, and joyful/efficacy affect—that raises engagement without antagonism. The study contributes a reproducible, open-source pipeline for survey-free, multimodal framing measurement and clarifies how persona × frame interactions and meso-level discursive structure jointly organize attention in short-video politics. Full article
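The engagement model named in the abstract, a fractional response fit with a binomial GLM and logit link, looks roughly like the following sketch. The data, variable names, and coefficients are synthetic placeholders, not the study's corpus:

```python
# Fractional-logit sketch: engagement rate (ER in (0,1)) regressed on quality
# and populism indices via a binomial GLM with logit link, as in the abstract.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "quality": rng.normal(size=n),
    "populism": rng.normal(size=n),
    "anger": rng.uniform(size=n),
})
# Synthetic ER, loosely increasing in populism (illustrative only).
eta = -2.0 + 0.05 * df["quality"] + 0.4 * df["populism"] + 0.2 * df["anger"]
df["er"] = 1 / (1 + np.exp(-(eta + rng.normal(scale=0.3, size=n))))

X = sm.add_constant(df[["quality", "populism", "anger"]])
model = sm.GLM(df["er"], X, family=sm.families.Binomial())  # fractional logit
print(model.fit().summary())
```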

16 pages, 434 KB  
Article
Modern Speech Recognition for Romanian Language
by Remus-Dan Ungureanu and Mihai Dascalu
Appl. Sci. 2026, 16(4), 1928; https://doi.org/10.3390/app16041928 - 14 Feb 2026
Viewed by 252
Abstract
Despite having approximately 24 million native speakers, Romanian remains a low-resource language for automatic speech recognition (ASR), with few accurate and publicly available systems. To address this gap, this study explores the challenges of adapting modern speech recognition models, such as wav2vec 2.0 and Conformer, to Romanian. Our investigation is a comprehensive analysis of the two models, their capabilities to adapt to Romanian data, and the performance of the trained models. The research also focuses on unique attributes of the Romanian language, data collection techniques, including weakly supervised learning, and processing methodologies. Building on the previously introduced Echo dataset of 378 h, we release CRoWL (Crawled Romanian Weakly Labeled), a weakly supervised dataset of 9000 h created via automatic transcription. We obtain strong results that, to the best of our knowledge, are competitive with or exceed publicly reported results for Romanian under comparable open evaluation settings, with Conformer attaining 3.01% WER on Echo + CRoWL and wav2vec 2.0 reaching 4.04% (Echo) and 4.17% (Echo + CRoWL). In addition to the datasets, we also release our most capable models as open source, along with their training plans, thereby providing a solid foundation for researchers interested in languages with limited representation. Full article
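The WER figures quoted here (e.g., 3.01%) follow the standard definition: word-level edit distance between hypothesis and reference, divided by the number of reference words. A minimal self-contained implementation:

```python
# Word error rate via word-level Levenshtein distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("ana are mere", "ana avea mere"))  # 0.333... (one substitution)
```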

39 pages, 5230 KB  
Review
An In-Depth Review of Speech Enhancement Algorithms: Classifications, Underlying Principles, Challenges, and Emerging Trends
by Nisreen Talib Abdulhusein and Basheera M. Mahmmod
Algorithms 2026, 19(2), 134; https://doi.org/10.3390/a19020134 - 7 Feb 2026
Viewed by 282
Abstract
Speech enhancement aims to improve speech quality and intelligibility in noisy environments and is important in applications such as hearing aids, mobile communications, and automatic speech recognition (ASR). This paper presents a structured review of speech enhancement techniques, classified according to channel configuration and signal processing framework. Both traditional and modern approaches are discussed, including classical signal processing methods, machine learning techniques, and recent deep learning-based models. Furthermore, common noise types, widely used speech datasets, and standard metrics for evaluating speech quality and intelligibility are reviewed. Key challenges such as non-stationary noise, data limitations, reverberation, and generalization to unseen noise conditions are highlighted. This review presents the advancements in speech enhancement and discusses the challenges and trends of the field, providing valuable insights for researchers, engineers, and practitioners in the area. The findings aid in the selection of suitable techniques for improved speech quality and intelligibility, and we conclude that the trend in speech enhancement has shifted from standard algorithms to deep learning methods that can efficiently learn information about speech signals. Full article
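Among the classical signal-processing methods such a review covers, spectral subtraction is one of the simplest. A sketch, with illustrative parameter values that are not drawn from the paper:

```python
# Spectral subtraction: estimate the noise magnitude spectrum from the first
# few frames (assumed speech-free) and subtract it, with a spectral floor.
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(x, fs, noise_frames=10, floor=0.02):
    f, t, X = stft(x, fs=fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    noise = mag[:, :noise_frames].mean(axis=1, keepdims=True)  # noise estimate
    clean = np.maximum(mag - noise, floor * mag)               # spectral floor
    _, y = istft(clean * np.exp(1j * phase), fs=fs, nperseg=512)
    return y

fs = 16000
t = np.arange(fs) / fs
noisy = np.sin(2 * np.pi * 440 * t) + 0.3 * np.random.randn(fs)  # toy signal
enhanced = spectral_subtraction(noisy, fs)
```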

12 pages, 1323 KB  
Proceeding Paper
Edge AI System Using Lightweight Semantic Voting to Detect Segment-Based Voice Scams
by Shao-Yong Lu and Wen-Ping Chen
Eng. Proc. 2025, 120(1), 14; https://doi.org/10.3390/engproc2025120014 - 2 Feb 2026
Viewed by 375
Abstract
Real-time telecom scam detection is difficult without cloud AI, particularly for privacy-sensitive, low-resource environments. We developed a lightweight, offline voice scam detector using on-device audio segmentation, automatic speech recognition (ASR), and semantic similarity. Four detection strategies were implemented. We used Whisper ASR and DeepSeek to process 5 s speech chunks. An analysis of 120 synthetic and paraphrased Mandarin phone call dialogues reveals the A4 voting strategy’s superior performance in optimizing early detection and minimizing false positives, achieving an F1 score of 0.90, a 2.5% false positive rate, and a mean response time of under 4 s. The system is deployable on ESP32 for offline mobile inference. The proposed architecture provides a robust and scalable defense against threats targeting vulnerable user groups, such as older adults. It introduces a new method for real-time voice threat mitigation on devices through interpretable segment-level semantic analysis. Full article
(This article belongs to the Proceedings of 8th International Conference on Knowledge Innovation and Invention)
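The segment-level semantic voting idea can be sketched as follows. The embedding model, scam templates, similarity threshold, and vote count below are assumptions for illustration, not the paper's A4 configuration:

```python
# Semantic voting over transcribed 5 s chunks: embed each chunk, compare it
# against scam-template sentences, and flag the call once enough chunks vote.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
scam_templates = ["your bank account is frozen", "pay a fee to release the package"]
template_emb = model.encode(scam_templates, normalize_embeddings=True)

def flag_call(chunk_transcripts, sim_thresh=0.6, votes_needed=2):
    votes = 0
    for text in chunk_transcripts:
        emb = model.encode([text], normalize_embeddings=True)
        if float(np.max(template_emb @ emb.T)) > sim_thresh:  # cosine similarity
            votes += 1
        if votes >= votes_needed:
            return True  # early detection: stop as soon as the vote passes
    return False

print(flag_call(["hello mom", "your bank account was frozen today",
                 "pay a small fee to release your package now"]))
```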

25 pages, 17750 KB  
Article
A Mixed Reality Tool with Automatic Speech Recognition for 3D CAD Based Visualization and Automatic Dimension Generation in the Industry 5.0 Shipyard
by Aida Vidal-Balea, Antón Valladares-Poncela, Javier Vilar-Martínez, Tiago M. Fernández-Caramés and Paula Fraga-Lamas
Multimodal Technol. Interact. 2026, 10(2), 13; https://doi.org/10.3390/mti10020013 - 1 Feb 2026
Viewed by 324
Abstract
Industry 5.0 is composed of a variety of complex tasks and challenging processes requiring specialized labor and multidisciplinary coordination. Specifically, when it comes to shipbuilding, shipyards leverage advanced technologies, seeking to replace operations that continue to rely on traditional methods, such as 2D blueprints and paper-based documentation, which can lead to inefficiencies and alignment errors in precision-dependent tasks. For this reason, this article focuses on embracing Mixed Reality (MR) technologies to address these challenges in the context of electrical outfitting tasks. The design, development, and evaluation of an MR application tailored for HoloLens 2 smart glasses aim to streamline the workflow for operators, reducing reliance on paper-based documentation and enhancing the precision of assembly processes. The proposed system allows for the precise positioning of 3D models in the real environment, ensuring accurate alignment during assembly. Additionally, it incorporates automatic dimension generation between objects in the scene. To further enhance usability, the application integrates a Galician on-device Automatic Speech Recognition (ASR) system, allowing operators to interact seamlessly with the MR interface using voice commands. The whole system has been exhaustively tested through both usability and functionality evaluations, which validate MR as a viable tool for shipyard assembly and inspection tasks. Full article

12 pages, 550 KB  
Article
Temporal Parameters of Spontaneous Speech as Early Indicators of Alcohol-Related Cognitive Impairment
by Fanni Fruzsina Farkas, Ildikó Hoffmann, Otília Bagi, Janka Gajdics, Bálint Andó, Gábor Gosztolya, Ildikó Kovács, Bence András Lázár and János Kálmán
J. Clin. Med. 2026, 15(3), 1092; https://doi.org/10.3390/jcm15031092 - 30 Jan 2026
Viewed by 201
Abstract
Background/Objectives: Most patients with alcohol use disorder (AUD) suffer from mild cognitive decline, which does not meet the diagnostic criteria of the severe form of alcohol-related cognitive impairment (ARCI). ARCI is associated with executive abnormalities in addictive behaviors and therefore influences relapse and daily functioning. Abnormalities in speech production reflect cognitive disturbances. The aim of this study was to examine the temporal speech parameters (TSPs) in ARCI. Methods: The TSPs were measured with the S-GAP Test® on 34 AUD patients with intact cognitive functions and 31 age- and gender-matched control participants. Results: Ten of the fifteen TSPs differed significantly between the AUD and healthy groups. Speech tempo and the total pause duration rate have significant classification potential. Conclusions: Our exploratory study revealed that filled pause-related temporal parameters appear to be particularly altered in ARCI and indicated that marked TSP alterations could serve as early indicators of cognitive deficits. Full article
(This article belongs to the Special Issue Substance and Behavioral Addictions: Prevention and Diagnosis)
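Two of the temporal speech parameters named in the abstract, total pause duration rate and speech tempo, can be approximated with a naive energy-based pause detector. The thresholds and frame sizes below are assumptions, not the S-GAP Test's segmentation:

```python
# Naive temporal speech parameters: frame-level energy thresholding marks
# pauses; tempo needs a word count from a transcript.
import numpy as np

def temporal_params(x, fs, frame_ms=25, energy_thresh=1e-4, n_words=None):
    frame = int(fs * frame_ms / 1000)
    n = len(x) // frame
    energy = np.array([np.mean(x[i*frame:(i+1)*frame] ** 2) for i in range(n)])
    silent = energy < energy_thresh
    total_s = n * frame_ms / 1000
    return {
        "total_pause_duration_rate": float(silent.mean()),             # pause / total time
        "speech_tempo_wps": (n_words / total_s) if n_words else None,  # words per second
    }

fs = 16000
speech = np.concatenate([np.random.randn(fs), np.zeros(fs // 2)])  # 1 s "speech", 0.5 s pause
print(temporal_params(speech, fs, n_words=3))
```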

12 pages, 1025 KB  
Article
Enhancing Whisper Fine-Tuning with Discrete Wavelet Transform-Based LoRA Initialization
by Liang Lan, Molin Fang, Yuxuan Chen, Daliang Wang and Wenyong Wang
Electronics 2026, 15(3), 586; https://doi.org/10.3390/electronics15030586 - 29 Jan 2026
Viewed by 226
Abstract
In low-resource automatic speech recognition (ASR) scenarios, parameter-efficient fine-tuning (PEFT) has become a crucial approach for adapting large pre-trained speech models. Although low-rank adaptation (LoRA) offers clear advantages in efficiency, stability, and deployment friendliness, its performance remains constrained because random initialization fails to capture the time–frequency structural characteristics of speech signals. To address this limitation, this work proposes a structured initialization mechanism that integrates LoRA with the discrete wavelet transform (DWT). By combining wavelet-based initialization, a multi-scale fusion mechanism, and a residual strategy, the proposed method constructs a low-rank adaptation subspace that better aligns with the local time–frequency properties of speech signals. Discrete Wavelet Transform-Based LoRA Initialization (DWTLoRA) enables LoRA modules to incorporate prior modeling of speech dynamics at the start of fine-tuning, substantially reducing the search space of ineffective directions during early training and improving convergence speed, training stability, and recognition accuracy under low-resource conditions. Experimental results on Sichuan dialect speech recognition based on the Whisper architecture demonstrate that the proposed DWTLoRA initialization outperforms standard LoRA and several PEFT baseline methods in terms of character error rate (CER) and training efficiency, confirming the critical role of signal-structure-aware initialization in low-resource ASR. Full article
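The structured-initialization idea can be sketched as seeding the LoRA factors from the wavelet approximation of the pretrained weight rather than from random Gaussians. This is a loose conceptual sketch only; the paper's multi-scale fusion and residual strategy are more elaborate, and the wavelet, rank, and padding choices here are assumptions:

```python
# Seed LoRA's A factor from the DWT approximation of the pretrained weight.
import torch
import pywt
import numpy as np

def dwt_lora_init(W: torch.Tensor, rank: int = 8, wavelet: str = "db4"):
    # Approximation coefficients of each row capture its coarse structure.
    cA = pywt.wavedec(W.detach().cpu().numpy(), wavelet, level=2, axis=1)[0]
    # Top-r right singular vectors of the approximation give a structured basis.
    _, _, Vt = np.linalg.svd(cA, full_matrices=False)
    A = torch.zeros(rank, W.shape[1])
    # Zero-padding from coefficient space to input space is a simplification.
    A[:, : Vt.shape[1]] = torch.from_numpy(Vt[:rank]).float()
    B = torch.zeros(W.shape[0], rank)  # B = 0 keeps W unchanged at step 0
    return A, B

W = torch.randn(256, 1024)  # a stand-in attention projection weight
A, B = dwt_lora_init(W)
assert torch.allclose(W + B @ A, W)  # no change before training
```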

38 pages, 6181 KB  
Article
An AIoT-Based Framework for Automated English-Speaking Assessment: Architecture, Benchmarking, and Reliability Analysis of Open-Source ASR
by Paniti Netinant, Rerkchai Fooprateepsiri, Ajjima Rukhiran and Meennapa Rukhiran
Informatics 2026, 13(2), 19; https://doi.org/10.3390/informatics13020019 - 26 Jan 2026
Viewed by 704
Abstract
The emergence of low-cost edge devices has enabled the integration of automatic speech recognition (ASR) into IoT environments, creating new opportunities for real-time language assessment. However, achieving reliable performance on resource-constrained hardware remains a significant challenge, especially in Artificial Intelligence of Things (AIoT) deployments. This study presents an AIoT-based framework for automated English-speaking assessment that integrates architecture and system design, ASR benchmarking, and reliability analysis on edge devices. The proposed AIoT-oriented architecture incorporates a lightweight scoring framework capable of analyzing pronunciation, fluency, prosody, and CEFR-aligned speaking proficiency within an automated assessment system. Seven open-source ASR models—four Whisper variants (tiny, base, small, and medium) and three Vosk models—were systematically benchmarked in terms of recognition accuracy, inference latency, and computational efficiency. Experimental results indicate that Whisper-medium deployed on the Raspberry Pi 5 achieved the strongest overall performance, reducing inference latency by 42–48% compared with the Raspberry Pi 4 and attaining the lowest Word Error Rate (WER) of 6.8%. In contrast, smaller models such as Whisper-tiny, with a WER of 26.7%, exhibited two- to threefold higher scoring variability, demonstrating how recognition errors propagate into automated assessment reliability. System-level testing revealed that the Raspberry Pi 5 can sustain near real-time processing with approximately 58% CPU utilization and around 1.2 GB of memory, whereas the Raspberry Pi 4 frequently approaches practical operational limits under comparable workloads. Validation using real learner speech data (approximately 100 sessions) confirmed that the proposed system delivers accurate, portable, and privacy-preserving speaking assessment using low-power edge hardware. Overall, this work introduces a practical AIoT-based assessment framework, provides a comprehensive benchmark of open-source ASR models on edge platforms, and offers empirical insights into the trade-offs among recognition accuracy, inference latency, and scoring stability in edge-based ASR deployments. Full article
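A benchmark harness of the kind described, loading several open-source Whisper variants and timing transcription of the same clip, might look like this sketch; "audio.wav" is a placeholder path, and WER would be computed against a reference transcript:

```python
# Latency benchmark across Whisper model sizes on the same audio clip.
import time
import whisper  # pip install openai-whisper

for size in ["tiny", "base", "small"]:
    model = whisper.load_model(size)
    start = time.perf_counter()
    result = model.transcribe("audio.wav")  # placeholder path
    latency = time.perf_counter() - start
    print(f"whisper-{size}: {latency:.2f}s -> {result['text'][:60]!r}")
```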

26 pages, 712 KB  
Article
Comparing Multi-Scale and Pipeline Models for Speaker Change Detection
by Alymzhan Toleu, Gulmira Tolegen and Bagashar Zhumazhanov
Acoustics 2026, 8(1), 5; https://doi.org/10.3390/acoustics8010005 - 25 Jan 2026
Viewed by 462
Abstract
Speaker change detection (SCD) in long, multi-party meetings is essential for diarization, automatic speech recognition (ASR), and summarization, and is now often performed in the space of pre-trained speech embeddings. However, unsupervised approaches remain dominant when timely labeled audio is scarce, and their behavior under a unified modeling setup is still not well understood. In this paper, we systematically compare two representative unsupervised approaches on the multi-talker audio meeting corpus: (i) a clustering-based pipeline that segments and clusters embeddings/features and scores boundaries via cluster changes and jump magnitude, and (ii) a multi-scale jump-based detector that measures embedding discontinuities at several window lengths and fuses them via temporal clustering and voting. Using a shared front-end and protocol, we vary the underlying features (ECAPA, WavLM, wav2vec 2.0, MFCC, and log-Mel) and test the models' robustness under additive noise. The results show that embedding choice is crucial and that the two methods offer complementary trade-offs: the pipeline yields low false alarm rates but higher misses, while the multi-scale detector achieves relatively high recall at the cost of many false alarms. Full article
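The multi-scale jump-based detector can be sketched as cosine distances between adjacent-window mean embeddings at several window lengths, fused by averaging. The synthetic embeddings and threshold below are illustrative stand-ins for the paper's front-ends and tuned fusion:

```python
# Multi-scale jump detector: cosine distance between mean embeddings of
# adjacent windows, computed at several scales and averaged.
import numpy as np

def jump_scores(emb, win):
    scores = np.zeros(len(emb))
    for t in range(win, len(emb) - win):
        a = emb[t - win:t].mean(axis=0)
        b = emb[t:t + win].mean(axis=0)
        scores[t] = 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return scores

def detect_changes(emb, wins=(5, 10, 20), thresh=0.5):
    fused = sum(jump_scores(emb, w) for w in wins) / len(wins)  # fuse scales
    return np.where(fused > thresh)[0]

rng = np.random.default_rng(1)
m1, m2 = rng.normal(size=64), rng.normal(size=64)  # two synthetic speaker centroids
emb = np.vstack([m1 + 0.3 * rng.normal(size=(100, 64)),
                 m2 + 0.3 * rng.normal(size=(100, 64))])
print(detect_changes(emb))  # indices cluster near the true change at frame 100
```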

23 pages, 725 KB  
Article
From Sound to Risk: Streaming Audio Flags for Real-World Hazard Inference Based on AI
by Ilyas Potamitis
J. Sens. Actuator Netw. 2026, 15(1), 6; https://doi.org/10.3390/jsan15010006 - 1 Jan 2026
Viewed by 1043
Abstract
Seconds count differently for people in danger. We present a real-time streaming pipeline for audio-based detection of hazardous events affecting life and property. The system operates online rather than as a retrospective analysis tool. Its objective is to reduce the latency between the occurrence of a crime, conflict, or accident and the corresponding response by authorities. The key idea is to map reality as perceived by audio into a written story and question the text via a large language model. The method integrates streaming, zero-shot algorithms in an online decoding mode that convert sound into short, interpretable tokens, which are processed by a lightweight language model. CLAP text–audio prompting identifies agitation, panic, and distress cues, combined with conversational dynamics derived from speaker diarization. Lexical information is obtained through streaming automatic speech recognition, while general audio events are detected by a streaming version of the Audio Spectrogram Transformer tagger. Prosodic features are incorporated using pitch- and energy-based rules derived from robust F0 tracking and periodicity measures. The system uses a large language model configured for online decoding and outputs binary (YES/NO) life-threatening risk decisions every two seconds, along with a brief justification and a final session-level verdict. The system emphasizes interpretability and accountability. We evaluate it on a subset of the X-Violence dataset, comprising only real-world videos. We release code, prompts, decision policies, evaluation splits, and example logs to enable the community to replicate, critique, and extend our blueprint. Full article
(This article belongs to the Topic Trends and Prospects in Security, Encryption and Encoding)
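The pitch- and energy-based prosody rules can be sketched as follows; the z-score threshold and the example clip are assumptions, not the paper's tuned policy:

```python
# Prosody rule sketch: flag agitation when F0 and RMS energy both rise well
# above the clip's own baseline (z-score thresholding).
import numpy as np
import librosa

def agitation_flags(y, sr, z_thresh=2.0):
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)  # robust F0 track
    rms = librosa.feature.rms(y=y)[0]
    f0 = np.nan_to_num(f0, nan=np.nanmedian(f0))               # fill unvoiced frames
    zf = (f0 - f0.mean()) / (f0.std() + 1e-9)                  # pitch z-score
    n = min(len(zf), len(rms))
    ze = (rms[:n] - rms[:n].mean()) / (rms[:n].std() + 1e-9)   # energy z-score
    return (zf[:n] > z_thresh) & (ze > z_thresh)               # per-frame flag

y, sr = librosa.load(librosa.ex("trumpet"))  # stand-in clip (downloads an example)
print(agitation_flags(y, sr).mean())         # fraction of flagged frames
```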

19 pages, 1187 KB  
Article
Dual-Pipeline Machine Learning Framework for Automated Interpretation of Pilot Communications at Non-Towered Airports
by Abdullah All Tanvir, Chenyu Huang, Moe Alahmad, Chuyang Yang and Xin Zhong
Aerospace 2026, 13(1), 32; https://doi.org/10.3390/aerospace13010032 - 28 Dec 2025
Viewed by 376
Abstract
Accurate estimation of aircraft operations, such as takeoffs and landings, is critical for airport planning and resource allocation, yet it remains particularly challenging at non-towered airports, where no dedicated surveillance infrastructure exists. Existing solutions, including video analytics, acoustic sensors, and transponder-based systems, are often costly, incomplete, or unreliable in environments with mixed traffic and inconsistent radio usage, highlighting the need for a scalable, infrastructure-free alternative. To address this gap, this study proposes a novel dual-pipeline machine learning framework that classifies pilot radio communications using both textual and spectral features to infer operational intent. A total of 2489 annotated pilot transmissions collected from a U.S. non-towered airport were processed through automatic speech recognition (ASR) and Mel-spectrogram extraction. We benchmarked multiple traditional classifiers and deep learning models, including ensemble methods, long short-term memory (LSTM) networks, and convolutional neural networks (CNNs), across both feature pipelines. Results show that spectral features paired with deep architectures consistently achieved the highest performance, with F1-scores exceeding 91% despite substantial background noise, overlapping transmissions, and speaker variability. These findings indicate that operational intent can be inferred reliably from existing communication audio alone, offering a practical, low-cost path toward scalable aircraft operations monitoring and supporting emerging virtual tower and automated air traffic surveillance applications. Full article
(This article belongs to the Special Issue AI, Machine Learning and Automation for Air Traffic Control (ATC))
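The spectral branch of the dual pipeline, a log-Mel spectrogram per transmission feeding a CNN, can be sketched as follows; parameter values and the filename are illustrative:

```python
# Log-Mel spectrogram extraction for the spectral feature pipeline.
import numpy as np
import librosa

def mel_features(path: str, sr: int = 16000, n_mels: int = 64) -> np.ndarray:
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)  # (n_mels, frames) image for a CNN

# features = mel_features("transmission_0001.wav")  # placeholder filename
```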

31 pages, 5478 KB  
Article
An Intelligent English-Speaking Training System Using Generative AI and Speech Recognition
by Ching-Ta Lu, Yen-Ju Chen, Tai-Ying Wu and Yen-Yu Lu
Appl. Sci. 2026, 16(1), 189; https://doi.org/10.3390/app16010189 - 24 Dec 2025
Viewed by 1053
Abstract
English is the first foreign language most Taiwanese learners encounter, yet few achieve proficient speaking skills. This paper presents a generative AI-based English speaking training system designed to enhance oral proficiency through interactive AI agents. The system employs ChatGPT version 5.2 to generate diverse and tailored conversational scenarios, enabling learners to practice in contextually relevant situations. Spoken responses are captured via speech recognition and analyzed by a large language model, which provides intelligent scoring and personalized feedback to guide improvement. Learners can automatically generate scenario-based scripts according to their learning needs. The D-ID AI system then produces a virtual character of the AI agent, whose lip movements are synchronized with the conversation, thereby creating realistic video interactions. Because learners practice with an AI agent, the system maintains controlled emotional expression, reduces communication anxiety, and helps learners adapt to non-native interaction, fostering more natural and confident speech production. Accordingly, the proposed system supports compelling, immersive, and personalized language learning. The experimental results indicate that repeated practice with the proposed system substantially improves English speaking proficiency. Full article
(This article belongs to the Section Applied Neuroscience and Neural Engineering)

20 pages, 3334 KB  
Article
The Development of Northern Thai Dialect Speech Recognition System
by Jakramate Bootkrajang, Papangkorn Inkeaw, Jeerayut Chaijaruwanich, Supawat Taerungruang, Adisorn Boonyawisit, Bak Jong Min Sutawong, Vataya Chunwijitra and Phimphaka Taninpong
Appl. Sci. 2026, 16(1), 160; https://doi.org/10.3390/app16010160 - 23 Dec 2025
Cited by 1 | Viewed by 627
Abstract
This study investigated the necessary ingredients for the development of an automatic speech recognition (ASR) system for the Northern Thai language. Building an ASR model for such an arguably low-resource language poses challenges both in terms of the quantity and the quality of the corpus. The experimental results demonstrated that the current state-of-the-art deep neural network trained in an end-to-end manner, and pre-trained from a closely related language, such as Standard Thai, often outperformed its traditional HMM-based counterparts. The results also suggested that incorporating northern Thai-specific tonal information and augmenting the character-based end-to-end model with an n-gram language model further improves the recognition performance. Surprisingly, the quality of the transcription of the speech corpus was not found to positively correlate with the recognition performance in the case of the end-to-end system. The results show that the end-to-end ASR system was able to achieve the best word error rate (WER) of 0.94 on out-of-sample data. This is equivalent to 77.02% and 60.34% relative word error rate reduction over the 4.09 and 2.37 WERs of the traditional TDNN-HMM and the vanilla deep neural network baselines. Full article
(This article belongs to the Special Issue Speech Recognition and Natural Language Processing)
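Augmenting a character-based end-to-end model with an n-gram language model is typically done by shallow fusion, interpolating acoustic and LM log-probabilities during beam search. A toy sketch with an assumed bigram table and weight, not the paper's actual LM:

```python
# Shallow fusion: acoustic log-probability plus a weighted n-gram LM term,
# scored per candidate character at each beam-search step.
import math

bigram_lm = {("ห", "ม"): 0.4, ("ม", "า"): 0.5}  # toy Thai character bigrams

def fused_score(prev_char, char, acoustic_logp, lm_weight=0.3):
    lm_p = bigram_lm.get((prev_char, char), 1e-4)  # back-off for unseen pairs
    return acoustic_logp + lm_weight * math.log(lm_p)

print(fused_score("ห", "ม", acoustic_logp=-1.2))
```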

13 pages, 284 KB  
Article
Two-Stage Domain Adaptation for LLM-Based ASR by Decoupling Linguistic and Acoustic Factors
by Lin Zheng, Xuyang Wang, Qingwei Zhao and Ta Li
Appl. Sci. 2026, 16(1), 60; https://doi.org/10.3390/app16010060 - 20 Dec 2025
Viewed by 489
Abstract
Large language models (LLMs) have been increasingly applied in Automatic Speech Recognition (ASR), achieving significant advancements. However, the performance of LLM-based ASR (LLM-ASR) models remains unsatisfactory when applied across domains due to domain shifts between acoustic and linguistic conditions. To address this challenge, we propose a decoupled two-stage domain adaptation framework that separates the adaptation process into text-only and audio-only stages. In the first stage, we leverage abundant text data from the target domain to refine the LLM component, thereby improving its contextual and linguistic alignment with the target domain. In the second stage, we employ a pseudo-labeling method with unlabeled audio data in the target domain and introduce two key enhancements: (1) incorporating decoupled auxiliary Connectionist Temporal Classification (CTC) loss to improve the robustness of the speech encoder under different acoustic conditions; (2) adopting a synchronous LLM tuning strategy, allowing the LLM to continuously learn linguistic alignment from pseudo-labeled transcriptions enriched with domain textual knowledge. The experimental results demonstrate that our proposed methods significantly improve the performance of LLM-ASR in the target domain, achieving a relative word error rate reduction of 19.2%. Full article
(This article belongs to the Special Issue Speech Recognition: Techniques, Applications and Prospects)
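The decoupled auxiliary CTC loss in stage two amounts to adding a weighted CTC term on the speech encoder's frame outputs to the LLM objective. A minimal sketch; tensor shapes and the 0.3 weight are illustrative assumptions:

```python
# Joint stage-two objective: LLM cross-entropy on pseudo-labels plus an
# auxiliary CTC loss on the speech encoder's frame logits.
import torch
import torch.nn as nn

T, N, C, L = 50, 4, 32, 10                      # frames, batch, classes, target length
enc_logits = torch.randn(T, N, C, requires_grad=True)  # encoder frame logits
targets = torch.randint(1, C, (N, L))
ctc = nn.CTCLoss(blank=0)
ctc_loss = ctc(enc_logits.log_softmax(-1), targets,
               torch.full((N,), T, dtype=torch.long),
               torch.full((N,), L, dtype=torch.long))

llm_loss = torch.tensor(2.1)                    # stand-in for LLM cross-entropy
total = llm_loss + 0.3 * ctc_loss               # weighted joint objective
total.backward()
```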
