MDPI - Publisher of Open Access Journals

26 pages, 3422 KB

Open Access Article

Voice-Driven Support System for Speech Practice in Older Adults: An Accessible Web–Mobile Approach

by Lucrecia Llerena, Nancy Rodríguez, Bertha Vásquez, John W. Castro and Alexander Herrera

Algorithms 2026, 19(6), 469; https://doi.org/10.3390/a19060469 - 9 Jun 2026

Viewed by 193

Population aging poses significant challenges to oral communication due to age-related changes in articulation, verbal fluency, and speech pacing, even among older adults without neurodegenerative conditions. Despite advances in voice-based assistive technologies, there remains a lack of integrated engineering solutions that support structured, autonomous speech practice in non-clinical environments. This study proposes a deterministic, rule-based speech evaluation workflow implemented within a hybrid web–mobile assistive system. The workflow integrates audio capture, cloud-based automatic speech recognition (ASR), rule-based pronunciation evaluation, immediate multimodal feedback, and progress monitoring within a unified system architecture. The proposed architecture includes a mobile application for older adults and a web platform for configuration and monitoring by caregivers. A prototyping-oriented methodology was applied, including requirements elicitation, system design, implementation, and usability evaluation using the Thinking Aloud method and the System Usability Scale (SUS). Results showed stable system behavior under controlled evaluation conditions, an average recognition accuracy of 90% during preliminary evaluation sessions, and a response latency of 1.82 s, supporting stable real-time interaction during guided speech exercises. These findings demonstrate the feasibility of the proposed assistive architecture as an accessible and reproducible solution for guided speech support in older adults. Full article

18 pages, 10628 KB

Open Access Article

From Speech to Summary in Turkmen: A Parameter-Efficient Neural Pipeline

by Ualsher Tukeyev and Maksim Ocheretin

Appl. Sci. 2026, 16(12), 5734; https://doi.org/10.3390/app16125734 - 6 Jun 2026

Viewed by 261

Abstract

This paper presents the development of a neural model pipeline for automatic speech recognition (ASR) and text summarization in Turkmen, a low-resource language with agglutinative morphology. For the ASR task, the MMS-1b-all model (Meta) was employed with LoRA adaptation and CTC decoding, fine-tuned on the Common Voice corpus (2733 samples). For summarization, the mBART-50-large model was used with Turkmen-specific tokenization and was trained on a news text corpus (10,248 samples). The following results were achieved: WER = 17.59% for ASR (baseline model: 107.33%) and ROUGE-L = 0.4255 for summarization (zero-shot baseline: 0.2294). The scientific contribution is the creation of a parameter-efficient neural pipeline for speech-to-summary for Turkmen. The developed system can be applied to automated meeting transcription and text data processing in the Turkmen language. Full article

(This article belongs to the Section Computing and Artificial Intelligence)
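
The baseline WER of 107.33% quoted in the abstract above may look impossible at first glance. A minimal sketch of the standard word error rate (word-level edit distance divided by reference length) shows why values above 100% can occur: insertions are counted as errors but do not lengthen the reference. The function name and the toy strings are illustrative assumptions, not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words; curr is the row for i reference words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up toy strings: two substitutions plus two insertions against a
# two-word reference give four errors, so the rate exceeds 1.0 (100%).
toy_wer = word_error_rate("a b", "x y z w")
```

A hypothesis much longer than the reference, as an unadapted model can produce on an unfamiliar language, is the usual way a WER ends up above 100%.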

20 pages, 2019 KB

Open Access Review

Diagnostic Accuracy of Artificial Intelligence in Laryngeal Disorders: An Integrative Review

by Samantha Mairesse, Antonino Maniaci, Giovanni Briganti and Jerome R. Lechien

J. Pers. Med. 2026, 16(6), 301; https://doi.org/10.3390/jpm16060301 - 1 Jun 2026

Viewed by 611

Abstract

Background/Objectives: Laryngeal disorders are among the most prevalent conditions in otolaryngology, yet they remain challenging to diagnose without specialized expertise. Artificial intelligence (AI) systems leveraging machine learning (ML) and deep learning (DL) have demonstrated promising performance for the automatic detection and classification of voice disorders and laryngeal lesions. Methods: This review synthesizes findings from 88 studies published between 2015 and 2025 on AI-based laryngeal disorder detection, considering physioacoustic mechanisms, databases and acquisition protocols, AI architectures and validation strategies, and diagnostic performance. Results: The current literature supports high internal accuracies for binary healthy versus pathological detection (88–99%); meanwhile, performance decreases for higher-level tasks such as pathophysiological category classification and identification, particularly under external validation. From a clinical perspective, clinicians do not infer specific diagnoses from isolated acoustic parameters such as percent jitter or shimmer. Instead, they rely on how these perturbation patterns dynamically evolve during connected speech, where alterations guide perceptual differentiation between underlying disorders. Recurrent sources of bias include dependence on a limited number of historical vowel-based databases, class and demographic imbalance, and limited ecological validity of recording protocols. Additional concerns involve the predominant use of internal cross-validation and insufficient reproducibility or code sharing. Conclusions: Drawing on the literature, an integrative three-level clinical recognition framework is proposed, delineating realistic use cases for AI as a decision-support tool rather than an autonomous diagnostic system. 
Key priorities for future personalized medicine and research are also identified, including diversified multi-center datasets, standardized methodological reporting, rigorous external validation, and compliance with regulatory and ethical requirements for medical AI deployment. Full article

(This article belongs to the Special Issue Personalized Medicine in Otolaryngology: New Challenges and Future Perspectives)

27 pages, 2923 KB

Open Access Article

An Assistant System for Speaker and Sentiment Recognition Using RAM and a Hybrid AI Model

by Fatma Bozyiğit, İrfan Aygün, Oğuzhan Sağlam, Eren Özcan, Emin Borandağ and Bahadır Karasulu

Electronics 2026, 15(8), 1731; https://doi.org/10.3390/electronics15081731 - 19 Apr 2026

Viewed by 887

Abstract

In the age of remote communication and digital archiving, automated analysis of voice data has become increasingly important in various application areas. Despite significant advances in the field of Automatic Speech Recognition, integrating speaker recognition, textual sentiment analysis, and acoustic sentiment detection within a unified real-time processing pipeline remains a challenging task. Current approaches are often limited to monolithic designs or operate in batch processing modes, which restricts their scalability and real-time applicability. To address this gap, this work proposes a novel feature selection method called RAM, along with a hybrid decision-level merging approach combining Conv1D CNN and AutoML-based models. The proposed hybrid framework enables independent model training and integrates its probabilistic outputs through a weighted merging strategy for performance improvement. Furthermore, a scalable microservice-based software architecture has been developed to support real-time processing, feature selection, and model deployment. This design enhances system modularity, flexibility, and integration capability in practical applications. Experimental results show that when the proposed RAM method is used in conjunction with a hybrid AI model, it achieves over 97% accuracy in speaker recognition and over 82% accuracy in emotion classification, even with short audio samples. These findings demonstrate that the proposed approach provides a robust and efficient solution for real-time speech analysis tasks. Full article

(This article belongs to the Special Issue Techniques and Applications of Multimodal Data Fusion)
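
The abstract above describes merging the probabilistic outputs of two independently trained models through a weighted strategy. A minimal sketch of decision-level weighted fusion follows; the weight value, the class labels, and the function names are hypothetical, and the paper's actual weighting scheme is not stated in the abstract.

```python
def fuse_probabilities(p_cnn, p_automl, w_cnn=0.6):
    """Weighted average of two class-probability vectors, renormalized to sum to 1.

    w_cnn is a hypothetical weight for the first model; the second model
    receives 1 - w_cnn.
    """
    if len(p_cnn) != len(p_automl):
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= w_cnn <= 1.0:
        raise ValueError("w_cnn must lie in [0, 1]")

    fused = [w_cnn * a + (1.0 - w_cnn) * b for a, b in zip(p_cnn, p_automl)]
    total = sum(fused)
    if total <= 0.0:
        raise ValueError("fused scores must have a positive sum")
    return [score / total for score in fused]


def predict(labels, fused):
    """Return the label with the highest fused probability."""
    best_index = max(range(len(fused)), key=fused.__getitem__)
    return labels[best_index]


# Illustrative inputs only: the two models disagree on the top class,
# and the fused scores decide between them.
labels = ["speaker_a", "speaker_b", "speaker_c"]
fused = fuse_probabilities([0.5, 0.3, 0.2], [0.2, 0.7, 0.1], w_cnn=0.6)
winner = predict(labels, fused)
```

Because fusion happens on output probabilities rather than on features, each model can be trained and replaced independently, which matches the modular design the abstract emphasizes.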

29 pages, 417 KB

Open Access Feature Paper Article

An AI-Based Security Architecture for Fraud Detection in Cloud Call Centers for Low-Resource Languages: Arabic as a Use Case

by Pinar Boluk and Hana’a Maratouq

Electronics 2026, 15(8), 1718; https://doi.org/10.3390/electronics15081718 - 18 Apr 2026

Viewed by 415

Abstract

Cloud-based telephony platforms face growing fraud risks including voice phishing (vishing), subscription abuse, and organizational impersonation, with detection being especially challenging in low-resource languages such as Arabic. We present an Artificial Intelligence (AI)-based security architecture for fraud detection in Arabic cloud call centers, combining onboarding verification, behavioral monitoring, domain-adapted Automatic Speech Recognition (ASR), semantic transcript search, and Large Language Model (LLM)-based entity verification. The domain-adapted Langa ASR model achieves a Word Error Rate (WER) of 41.0% and Character Error Rate (CER) of 18.2%, outperforming all evaluated commercial baselines. LLM-based entity extraction with multi-call consensus achieves 97.3% company-name accuracy (Generative Pre-trained Transformer 4, GPT-4) and 92.0% in the cost-effective deployed configuration (GPT-3.5 with log-probability filtering). Evaluated on production data from a Middle East and North Africa (MENA)-region provider spanning more than 1000 accounts, the pipeline flagged 47 accounts of which 41 were confirmed fraudulent (directly observed precision 87.2%, 95% confidence interval (CI): 74.3–95.2%; estimated recall 51–82% under conservative base-rate assumptions—not directly measured), providing evidence for the viability of a unified, threat-model-driven architecture for low-resource telephony fraud detection. Full article

(This article belongs to the Special Issue AI-Enhanced Security: Advancing Threat Detection and Defense)
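
The precision figure in the abstract above follows directly from the reported counts: 41 confirmed fraudulent accounts out of 47 flagged. The sketch below reproduces the point estimate and computes an exact binomial interval by bisection. That the reported interval is a Clopper-Pearson interval is an assumption, since the abstract does not name the method, and the function names are illustrative.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target: float) -> float:
    """Bisection on [0, 1] for f(p) = target, where f is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(x: int, n: int, alpha: float = 0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion x / n."""
    # Lower bound: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else _solve_increasing(
        lambda p: 1.0 - binom_cdf(x - 1, n, p), alpha / 2.0)
    # Upper bound: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else _solve_increasing(
        lambda p: 1.0 - binom_cdf(x, n, p), 1.0 - alpha / 2.0)
    return lower, upper


confirmed, flagged = 41, 47          # counts reported in the abstract
precision = confirmed / flagged      # point estimate of precision
ci_low, ci_high = clopper_pearson(confirmed, flagged)
```

With only 47 flagged accounts the interval is wide, which is why the abstract reports it alongside the point estimate instead of the percentage alone.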

29 pages, 8422 KB

Open Access Article

A Transformer-Based Method for Bidirectional French–Lingala Machine Translation in Speech and Text

by Reagan E. Mandiya, Selain K. Kasereka, Christophe B. Wizamo, Milena Savova-Mratsenkova, Ruffin-Benoît M. Ngoie, Tasho Tashev and Nathanaël M. Kasoro

Appl. Sci. 2026, 16(7), 3399; https://doi.org/10.3390/app16073399 - 31 Mar 2026

Viewed by 945

Abstract

Underrepresented languages such as Lingala are a significant part of the world’s cultural and linguistic heritage. Lingala plays a central role in daily communication, business, media, education, and culture for millions of people in the Democratic Republic of Congo (DRC) and the Republic of Congo. However, due to data scarcity and dialectal diversity, natural language processing (NLP) research often overlooks this language. In this paper, we propose a deep neural network pipeline for bidirectional French–Lingala automatic translation, covering both text-to-text and voice-to-text scenarios, by integrating Long Short-Term Memory (LSTM) and Transformer models on a specialized parallel corpus. The Bidirectional Encoder Representations from Transformers (BERT) model is used as a bidirectional source encoder to improve contextual representation, while the Whisper model handles automatic speech recognition as the first stage of the audio translation pipeline. Experimental results show that the standalone Transformer achieves a BLEU score of 35.3, compared to 8.12 for the LSTM SeqToSeq baseline. Fine-tuning with BERT raises the BLEU score to 38.6. Integrating the Whisper ASR module for an end-to-end speech translation task yields a final pipeline BLEU score of 55.4, with a Word Error Rate of 12.3% on the speech recognition sub-task, confirming the effectiveness of each component. These results demonstrate the potential of combining domain-specific pre-trained models with modular neural architectures to achieve competitive translation performance in a critically under-resourced language. Full article

(This article belongs to the Special Issue The Advanced Trends in Natural Language Processing)

26 pages, 3165 KB

Open Access Article

Analysis of Fundamental Frequency Changes in Astronaut Speech in Microgravity and in Terrestrial Conditions

by Natalia Repyuk, Anton Konev, Vladimir Faerman, Dmitry Rulev and Grigory Yashchenko

Acoustics 2026, 8(1), 18; https://doi.org/10.3390/acoustics8010018 - 13 Mar 2026

Viewed by 1110

Abstract

This study investigates the influence of microgravity on the fundamental frequency (F0) of astronauts’ speech. A speech corpus was compiled, including recordings in microgravity and on Earth, matched by speaker and content. The signal processing methodology included filtering with consideration of human auditory perception, segmentation of speech fragments, F0 estimation using digital signal processing techniques, and visualization through fundamental frequency dynamics plots. Results revealed a consistent increase in F0 for most astronauts under microgravity, with maximum values of 450 Hz for female speakers and 245 Hz for male speakers. Elevated F0 levels were observed for approximately 86% of the total duration of speech fragments recorded in microgravity, compared with 14% on Earth. These findings confirm that microgravity affects the speech apparatus and acoustic characteristics of voice. Practical implications include adapting voice-controlled systems and automatic speech recognition for space environments, monitoring crew condition, and studying speech physiology under extreme conditions. Full article

(This article belongs to the Special Issue Advancing Audio/Speech Machine Learning: From Static to Continual Learning)

13 pages, 1494 KB

Open Access Article

Development and Clinical Validation of an Artificial Intelligence-Based Automated Visual Acuity Testing System

by Kelvin Zhenghao Li, Hnin Hnin Oo, Kenneth Chee Wei Liang, Najah Ismail, Jasmine Ling Ling Chua, Jackson Jie Sheng Chng, Yang Wu, Daryl Wei Ren Wong, Sumaya Rani Khan, Boon Peng Yap, Rong Tong, Choon Meng Kiew, Yufei Huang, Chun Hau Chua, Alva Khai Shin Lim and Xiuyi Fan

Life 2026, 16(2), 357; https://doi.org/10.3390/life16020357 - 20 Feb 2026

Viewed by 1187

Abstract

Background: To develop and validate an automated visual acuity (VA) testing system integrating artificial intelligence (AI)–driven speech and image recognition technologies, enabling self-administered, clinic-based VA assessment; Methods: The system incorporated a fine-tuned Whisper speech-recognition model with Silero voice activity detection and pose estimation through facial landmark and ArUco marker detection. A state-driven interface guided users through sequential testing with and without a pinhole. Speech recognition was enhanced using a local Singaporean English dataset. Laboratory validation assessed speech and pose recognition performance, while clinical validation compared automated and manual VA testing at a tertiary eye clinic; Results: The fine-tuned model reduced word error rates from 17.83% to 9.81% for letters and 2.76% to 1.97% for numbers. Pose detection accurately identified valid occluder states. Among 72 participants (144 eyes), automated unaided VA showed good agreement with manual VA (ICC = 0.77, 95% CI 0.62–0.85), while pinhole VA demonstrated moderate agreement (ICC = 0.63, 95% CI 0.25–0.83). Automated testing took longer (132.1 ± 47.5 s vs. 97.1 ± 47.8 s; p < 0.001), but user experience remained positive (mean Likert scale score 4.3 ± 0.8); Conclusions: The AI-based automated VA system delivered accurate, reliable, and user-friendly performance, supporting its feasibility for clinical implementation. Full article

(This article belongs to the Section Biochemistry, Biophysics and Computational Biology)

18 pages, 9134 KB

Open Access Article

An Autonomous Robotic System for Object Retrieval and Delivery: Enhancing Independence for Users Living with Disability and Older Adults

by Jincheng Li, Chenghao Lin, Amna Mazen and Youssef A. Bazzi

Robotics 2026, 15(2), 41; https://doi.org/10.3390/robotics15020041 - 12 Feb 2026

Viewed by 1226

Abstract

As the global population ages, there is a growing need for assistive technologies to help older adults maintain their independence. This work presents a cost-effective autonomous socially assistive robot designed for object retrieval and delivery, enhancing accessibility in home environments. The system is built on the Robot Operating System (ROS) framework and integrates three key components: the Pioneer P3-DX mobile robot for autonomous navigation, the ReactorX-200 robotic arm for pick-and-place operations, and the Kinect v2 RGB-D camera for object detection and localization. Users interact with the robot through natural language processing by issuing voice commands to retrieve various objects. Microsoft Azure-powered speech recognition processes these commands to extract keywords and then localize requested objects on a predefined building map. Pioneer P3-DX, equipped with a Hokuyo LiDAR, enables autonomous navigation and obstacle avoidance, while Kinect v2, integrated with the YOLOv8 algorithm, facilitates object recognition and localization. The robot retrieves and delivers the user’s requested objects while following the shortest available path. Experimental evaluations in a home environment demonstrate the system’s effectiveness in identifying and retrieving requested objects. The subsystems achieve a success rate of 85–95% across more than 50 runs, highlighting their strong performance. The proposed approach provides a proof of concept for future advancements in assistive robotics, demonstrating the seamless integration of advanced technologies into a cost-effective and user-friendly platform. Full article

(This article belongs to the Special Issue AI-Powered Robotic Systems: Learning, Perception and Decision-Making)

12 pages, 1323 KB

Open Access Proceeding Paper

Edge AI System Using Lightweight Semantic Voting to Detect Segment-Based Voice Scams

by Shao-Yong Lu and Wen-Ping Chen

Eng. Proc. 2025, 120(1), 14; https://doi.org/10.3390/engproc2025120014 - 2 Feb 2026

Viewed by 1620

Abstract

Real-time telecom scam detection is difficult without cloud AI, particularly for privacy-sensitive, low-resource environments. We developed a lightweight, offline voice scam detector using on-device audio segmentation, automatic speech recognition (ASR), and semantic similarity. Four detection strategies were implemented. We used Whisper ASR and DeepSeek to process 5 s speech chunks. An analysis of 120 synthetic and paraphrased Mandarin phone call dialogues reveals the A4 voting strategy’s superior performance in optimizing early detection and minimizing false positives, achieving an F1 score of 0.90, a 2.5% false positive rate, and a mean response time of under 4 s. The system is deployable on ESP32 for offline mobile inference. The proposed architecture provides a robust and scalable defense against threats targeting vulnerable user groups, such as older adults. It introduces a new method for real-time voice threat mitigation on devices through interpretable segment-level semantic analysis. Full article

(This article belongs to the Proceedings of 8th International Conference on Knowledge Innovation and Invention)

25 pages, 17750 KB

Open Access Article

A Mixed Reality Tool with Automatic Speech Recognition for 3D CAD Based Visualization and Automatic Dimension Generation in the Industry 5.0 Shipyard

by Aida Vidal-Balea, Antón Valladares-Poncela, Javier Vilar-Martínez, Tiago M. Fernández-Caramés and Paula Fraga-Lamas

Multimodal Technol. Interact. 2026, 10(2), 13; https://doi.org/10.3390/mti10020013 - 1 Feb 2026

Cited by 2 | Viewed by 968

Abstract

Industry 5.0 is composed of a variety of complex tasks and challenging processes requiring specialized labor and multidisciplinary coordination. Specifically, when it comes to shipbuilding, shipyards leverage advanced technologies, seeking to replace operations that continue to rely on traditional methods, such as 2D blueprints and paper-based documentation, which can lead to inefficiencies and alignment errors in precision-dependent tasks. For this reason, this article focuses on embracing Mixed Reality (MR) technologies to address these challenges in the context of electrical outfitting tasks. The design, development and evaluation of a MR application tailored for HoloLens 2 smart glasses aims to streamline the workflow for operators, reducing reliance on paper-based documentation and enhancing the precision of assembly processes. The proposed system allows for the precise positioning of 3D models in the real environment, ensuring accurate alignment during assembly. Additionally, it incorporates automatic dimension generation between objects in the scene. To further enhance usability, the application integrates a Galician on-device Automatic Speech Recognition (ASR) system, allowing operators to interact seamlessly with the MR interface using voice commands. The whole system has been exhaustively tested, both through usability and functionality evaluations, which validate MR as a viable tool for shipyard assembly and inspection tasks. Full article

(This article belongs to the Special Issue Multimodal Interaction Design in Immersive Learning and Training Environments)

26 pages, 29009 KB

Open Access Article

Quantifying the Relationship Between Speech Quality Metrics and Biometric Speaker Recognition Performance Under Acoustic Degradation

by Ajan Ahmed and Masudul H. Imtiaz

Signals 2026, 7(1), 7; https://doi.org/10.3390/signals7010007 - 12 Jan 2026

Cited by 1 | Viewed by 2030

Abstract

Self-supervised learning (SSL) models have achieved remarkable success in speaker verification tasks, yet their robustness to real-world audio degradation remains insufficiently characterized. This study presents a comprehensive analysis of how audio quality degradation affects three prominent SSL-based speaker verification systems (WavLM, Wav2Vec2, and HuBERT) across three diverse datasets: TIMIT, CHiME-6, and Common Voice. We systematically applied 21 degradation conditions spanning noise contamination (SNR levels from 0 to 20 dB), reverberation (RT60 from 0.3 to 1.0 s), and codec compression (various bit rates), then measured both objective audio quality metrics (PESQ, STOI, SNR, SegSNR, fwSNRseg, jitter, shimmer, HNR) and speaker verification performance metrics (EER, AUC-ROC, d-prime, minDCF). At the condition level, multiple regression with all eight quality metrics explained up to 80% of the variance in minDCF for HuBERT and 78% for WavLM, but only 35% for Wav2Vec2; EER predictability was lower (69%, 67%, and 28%, respectively). PESQ was the strongest single predictor for WavLM and HuBERT, while Shimmer showed the highest single-metric correlation for Wav2Vec2; fwSNRseg yielded the top single-metric R² for WavLM, and PESQ for HuBERT and Wav2Vec2 (with much smaller gains for Wav2Vec2). WavLM and HuBERT exhibited more predictable quality-performance relationships compared to Wav2Vec2. These findings establish quantitative relationships between measurable audio quality and speaker verification accuracy at the condition level, though substantial within-condition variability limits utterance-level prediction accuracy. Full article

(This article belongs to the Special Issue Advanced Signal Processing Technologies: Integrating AI, Future Communications, and Innovative Applications)

32 pages, 5708 KB

Open Access Article

Affordable Audio Hardware and Artificial Intelligence Can Transform the Dementia Care Pipeline

by Ilyas Potamitis

Algorithms 2025, 18(12), 787; https://doi.org/10.3390/a18120787 - 12 Dec 2025

Viewed by 3183

Abstract

Population aging is increasing dementia care demand. We present an audio-driven monitoring pipeline that operates either on mobile phones, microcontroller nodes, or smart television sets. The system combines audio signal processing with AI tools for structured interpretation. Preprocessing includes voice activity detection, speaker diarization, automatic speech recognition for dialogs, and speech-emotion recognition. An audio classifier detects home-care–relevant events (cough, cane taps, thuds, knocks, and speech). A large language model integrates transcripts, acoustic features, and a consented household knowledge base to produce a daily caregiver report covering orientation/disorientation (person, place, and time), delusion themes, agitation events, health proxies, and safety flags (e.g., exit seeking and falling). The pipeline targets real-time monitoring in homes and facilities, and it is an adjunct to caregiving, not a diagnostic device. Evaluation focuses on human-in-the-loop review, various audio/speech modalities, and the ability of AI to integrate information and reason. Intended users are low-income households in remote settings where in-person caregiving cannot be secured, enabling remote monitoring support for older adults with dementia. Full article

(This article belongs to the Special Issue AI-Assisted Medical Diagnostics)

11 pages, 216 KB

Open Access Article

RNN-Based F0 Estimation Method with Attention Mechanism

by Ales Jandera, Martin Muzelak and Tomas Skovranek

Information 2025, 16(12), 1089; https://doi.org/10.3390/info16121089 - 7 Dec 2025

Cited by 2 | Viewed by 954

Abstract

Fundamental frequency estimation, also known as F0 estimation, is a crucial task in speech processing and analysis, with significant applications in areas such as speech recognition, speaker identification, and emotion detection. Traditional algorithms, while effective, often encounter challenges in real-time environments due to computational limitations. Recent advances in deep learning, especially in the use of recurrent neural networks (RNNs), have opened new opportunities for enhancing F0 estimation accuracy and efficiency. This paper introduces a novel RNN-based F0 estimation method with an attention mechanism and evaluates its performance against selected state-of-the-art F0 estimation approaches, including standard baseline methods, as well as neural-network-based regression and classification models. By integrating attention mechanisms, the model eliminates the necessity for post-processing steps and enables a more efficient seq2scal estimation process. While the self-attention mechanism used in Transformers captures all pairwise temporal dependencies at a quadratic computational cost, the proposed method’s implementation of the attention mechanism enables it to selectively focus on the most relevant acoustic cues for F0 prediction, enhancing robustness without increasing the model’s complexity. Experimental results using the LibriSpeech and Common Voice datasets demonstrate superior computational efficiency of the proposed method compared to current state-of-the-art RNN-based seq2seq models, while maintaining comparable estimation accuracy. Furthermore, the proposed “RNN-based F0 estimation method with an attention mechanism” achieves the lowest computational complexity among all compared models, while maintaining high accuracy, making it suitable for low-latency, resource-limited deployments and competitive even with standard baseline methods, such as pYIN or CREPE. 
Finally, the performance of the developed RNN-based F0 estimation method with attention mechanism in terms of RMSE and FLOPs demonstrates the potential of attention mechanisms and sequence modelling in achieving high accuracy alongside lightweight F0 estimation suitable for modern speech processing applications, which aligns with the growing trend towards deploying intelligent systems on resource-constrained devices. Full article

(This article belongs to the Special Issue Signal Processing and Machine Learning, 2nd Edition)

17 pages, 342 KB

Open Access Article

Improving Mandarin ASR Performance Through Multimodality

by Rui Jiang, Zhao Yang, Xiao Fu and Jizhong Zhao

Appl. Sci. 2025, 15(22), 12224; https://doi.org/10.3390/app152212224 - 18 Nov 2025

Viewed by 1839

Abstract

In the context of Internet of Things (IoT) applications, accurate and efficient speech recognition is essential for enabling seamless voice-based interactions and control. Mandarin ASR, in particular, presents unique challenges due to the ideographic nature of the Chinese language, where recognition results are not directly correlated with pronunciation. Pinyin, as a representation of Chinese character pronunciation, has an intrinsic connection with Chinese characters, making it a valuable tool for enhancing ASR performance. This paper proposes a multimodal ASR neural network that combines pinyin data from the text modality and speech data from the audio modality as shared inputs to the ASR model. Specifically, the system processes the speech input through a preprocessed WeNet to generate pinyin text, which is then enhanced using a label denoising algorithm to improve its accuracy. The proposed text-acoustic multimodal ASR model improves the overall speech recognition performance by approximately 4%, making it more suitable for IoT applications that require high accuracy in voice commands and interactions. Full article

Search Results (207)
