Search Results (79)

Search Parameters:
Keywords = adaptive speech test

21 pages, 4423 KiB  
Article
CaDCR: An Efficient Cascaded Dynamic Collaborative Reasoning Framework for Intelligent Recognition Systems
by Bowen Li, Xudong Cao, Jun Li, Li Ji, Xueliang Wei, Jile Geng and Ruogu Zhang
Electronics 2025, 14(13), 2628; https://doi.org/10.3390/electronics14132628 - 29 Jun 2025
Viewed by 328
Abstract
To address the high computational cost and energy consumption of deep neural networks in embedded systems, this paper presents CaDCR, a lightweight dynamic collaborative reasoning framework. By integrating a feature discrepancy-guided skipping mechanism with a depth-sensitive early exit mechanism, the framework establishes hierarchical decision logic: it dynamically selects execution paths through network blocks based on the complexity of each input sample and lets simple samples exit early via shallow confidence assessment, thereby forming an adaptive computational resource allocation strategy. CaDCR can both consistently suppress unnecessary computation for simple samples and satisfy hard resource constraints by forcibly terminating inference for all samples. Based on this framework, we design a cascaded inference system tailored for embedded deployment to tackle practical deployment challenges. Experiments on the CIFAR-10/100 and SpeechCommands datasets demonstrate that CaDCR maintains accuracy comparable to or higher than baseline models while reducing computational cost by approximately 40–70% within a controllable accuracy-loss margin. In deployment tests on the STM32 embedded platform, the framework’s performance matches theoretical expectations, further verifying its effectiveness in reducing energy consumption and accelerating inference. Full article
(This article belongs to the Topic Smart Edge Devices: Design and Applications)
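The depth-sensitive early-exit idea in the abstract above can be illustrated with a minimal sketch: attach a classifier head to each stage and stop at the first head whose softmax confidence clears a threshold. This is a generic illustration of confidence-based early exit, not the CaDCR implementation; the function names and the threshold value are invented for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_predict(stage_logits, threshold=0.9):
    """Walk the exit heads in depth order and stop at the first head
    whose top softmax probability clears the confidence threshold;
    otherwise fall through to the deepest head."""
    for stage, logits in enumerate(stage_logits):
        probs = softmax(logits)
        confidence = max(probs)
        if confidence >= threshold:
            return probs.index(confidence), stage
    final = softmax(stage_logits[-1])
    return final.index(max(final)), len(stage_logits) - 1
```

Easy samples exit at shallow heads and pay only part of the network’s cost; hard samples pay for the full depth, which is the kind of input-dependent allocation behind the reported 40–70% average savings.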

28 pages, 1791 KiB  
Article
Speech Recognition-Based Wireless Control System for Mobile Robotics: Design, Implementation, and Analysis
by Sandeep Gupta, Udit Mamodiya and Ahmed J. A. Al-Gburi
Automation 2025, 6(3), 25; https://doi.org/10.3390/automation6030025 - 24 Jun 2025
Viewed by 921
Abstract
This paper describes an innovative wireless mobile robotics control system based on speech recognition, in which an ESP32 microcontroller controls the motors and handles Bluetooth communication while an Android application performs the real-time speech recognition. With speech processed on the Android device and motor commands handled on the ESP32, the study achieves significant performance gains through a distributed architecture while maintaining low latency for feedback control. In experimental tests over a range of 1–10 m, stable command latencies of 110–140 ms with low variation (±15 ms) were observed. The system’s voice and manual button modes both yield over 92% accuracy with the aid of natural language processing, keeping training requirements low and performing strongly in high-noise environments. The novelty of this work lies in an adaptive keyword-spotting algorithm for improved recognition in high-noise environments and a gradual latency management system that optimizes processing parameters in the presence of noise. By providing a user-friendly, real-time speech interface, this work enhances human–robot interaction for future assistive devices, educational platforms, and advanced automated navigation research. Full article
(This article belongs to the Section Robotics and Autonomous Systems)

24 pages, 4466 KiB  
Article
Natural Interaction in Virtual Heritage: Enhancing User Experience with Large Language Models
by Isabel Sánchez-Berriel, Fernando Pérez-Nava and Lucas Pérez-Rosario
Electronics 2025, 14(12), 2478; https://doi.org/10.3390/electronics14122478 - 18 Jun 2025
Viewed by 398
Abstract
In recent years, Virtual Reality (VR) has emerged as a powerful tool for disseminating Cultural Heritage (CH), often incorporating Virtual Humans (VHs) to guide users through historical recreations. The advent of Large Language Models (LLMs) now enables natural, unscripted communication with these VHs, even on limited devices. This paper details a natural interaction system for VHs within a VR application of San Cristóbal de La Laguna, a UNESCO World Heritage Site. Our system integrates Speech-to-Text, LLM-based dialogue generation, and Text-to-Speech synthesis. Adhering to user-centered design (UCD) principles, we conducted two studies: a preliminary study revealing user interest in historically adapted language, and a qualitative test that identified key user experience improvements, such as incorporating feedback mechanisms and gender selection for VHs. The project successfully developed a prioritized user experience, focusing on usability evaluation, immersion, and dialogue quality. We propose a generalist methodology and recommendations for integrating unscripted VH dialogue in VR. However, limitations include dialogue generation latency and reduced quality in non-English languages. While a formative usability test evaluated the process, the small sample size restricts broad generalizations about user behavior. Full article

14 pages, 558 KiB  
Article
External Validation and Extension of a Cochlear Implant Performance Prediction Model: Analysis of the Oldenburg Cohort
by Rieke Ollermann, Robert Böscke, John Neidhardt and Andreas Radeloff
Audiol. Res. 2025, 15(3), 69; https://doi.org/10.3390/audiolres15030069 - 12 Jun 2025
Viewed by 339
Abstract
Background/Objectives: Rehabilitation success with a cochlear implant (CI) varies considerably, and identifying factors for the reliable prediction of speech understanding with a CI remains a challenge. Hoppe and colleagues recently described a predictive model based specifically on Cochlear™ recipients with a four-frequency pure tone average (4FPTA) ≤ 80 dB HL. The aim of this retrospective study was to test the model’s applicability to an independent patient cohort with extended inclusion criteria. Methods: The Hoppe et al. model was applied to CI recipients with varying degrees of hearing loss. Model performance was analyzed for Cochlear™ recipients with 4FPTA ≤ 80 dB HL and for all recipients regardless of 4FPTA. Subgroup analyses were conducted by WRSmax and CI manufacturer. Results: The model yielded comparable results in our patient cohort when the original inclusion criteria were met (n = 24). Extending the model to patients with profound hearing loss (4FPTA > 80 dB HL; n = 238) resulted in a weaker but significant correlation (r = 0.273; p < 0.0001) between the predicted and measured word recognition score at 65 dB with CI (WRS65(CI)). A higher percentage of data points also deviated by more than 20 pp, in either direction. When patients with CIs from different manufacturers were included, the prediction error was likewise higher than in the original cohort. In Cochlear™ recipients with a maximum word recognition score (WRSmax) > 0% (n = 83), we found a moderate correlation between measured and predicted scores (r = 0.3274; p = 0.0025). Conclusions: As long as the same inclusion criteria are used, the Hoppe et al. (2021) prediction model yields similar prediction success in our cohort and thus appears applicable independently of the cohort used. Nevertheless, it has limitations when applied to a broader and more diverse patient cohort. Our data suggest that the model would benefit from adaptations for broader clinical use, as it lacks sufficient sensitivity in identifying poor performers. Full article
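The two headline statistics in this abstract, the Pearson correlation between predicted and measured WRS65(CI) and the share of predictions off by more than 20 percentage points, are straightforward to reproduce on any score table. A minimal stdlib sketch; the function names are illustrative, not from the paper:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def large_deviations(predicted, measured, limit=20.0):
    """Count prediction/measurement pairs differing by more than
    `limit` percentage points, in either direction."""
    return sum(1 for p, m in zip(predicted, measured) if abs(p - m) > limit)
```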

15 pages, 885 KiB  
Article
Adapting the Bogenhausen Dysarthria Scales (BoDyS) to Chilean Spanish Speakers: Face and Content Validation
by Marcela Sanhueza-Garrido, Virginia García-Flores, Sebastián Contreras-Cubillos and Jaime Crisosto-Alarcón
Brain Sci. 2025, 15(6), 604; https://doi.org/10.3390/brainsci15060604 - 4 Jun 2025
Viewed by 716
Abstract
Background: Dysarthria is a neuromotor speech disorder that significantly impacts patients’ quality of life. In Chile, there is a lack of culturally validated instruments for assessing dysarthria. This study aimed to cross-culturally adapt the Bogenhausen Dysarthria Scales (BoDyS) into Chilean Spanish and to conduct face and content validation. Methods: The adaptation process included translation and back-translation, followed by validation by a panel of experts. Clarity, format, and length were evaluated, and the Kappa index (KI), content validity index (CVI), and content validity ratio (CVR) were calculated to confirm item relevance. A pilot test was subsequently conducted with ten speech–language pathologists to apply the adapted version to patients. Results: The adaptation process produced a consensus version that preserved the semantic and cultural characteristics of the original scale. The statistical measures (KI = 1.00; I-CVI = 1.00; S-CVI/Ave = 1.00; S-CVI/UA = 1.00; CVR = 1.00) indicated satisfactory levels of agreement. The pilot test demonstrated the scale’s appropriateness and effectiveness for assessing dysarthria within the Chilean context, although some experts recommended reducing task repetition for patients prone to fatigue. Conclusions: The Chilean version of the BoDyS (BoDyS-CL) is a valid and useful tool for evaluating dysarthria in Chile. This study provides a foundation for further research and the systematic implementation of this scale in local clinical practice. Full article
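The content validity indices reported above follow standard definitions: Lawshe’s CVR is (ne − N/2)/(N/2), and the item-level CVI is the proportion of experts rating an item relevant. A small sketch of the standard formulas, not code from the study:

```python
def content_validity_ratio(n_essential, n_experts):
    """Lawshe's CVR = (ne - N/2) / (N/2): +1 when every expert rates
    the item essential, 0 when exactly half do."""
    half = n_experts / 2
    return (n_essential - half) / half

def item_cvi(n_relevant, n_experts):
    """Item-level content validity index: the proportion of experts
    rating the item relevant."""
    return n_relevant / n_experts
```

With a panel in unanimous agreement, as reported here, every index reaches its ceiling of 1.00.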

16 pages, 1674 KiB  
Article
Validation of a New Stress Induction Protocol Using Speech Improvisation (IMPRO)
by Marina Saskovets, Mykhailo Lohachov and Zilu Liang
Brain Sci. 2025, 15(5), 522; https://doi.org/10.3390/brainsci15050522 - 19 May 2025
Cited by 1 | Viewed by 635
Abstract
Background: Acute stress induction is essential in psychology research for understanding physiological and psychological responses. In this study, ‘acute stress’ refers to a short-term, immediate stress response—distinct from chronic, long-term stress exposure. Traditional methods, such as the Trier Social Stress Test (TSST), have ecological validity and resource-efficiency limitations. This study introduces the Interactive Multitask Performance Response Observation (IMPRO) protocol, a novel stress-induction method utilizing speech improvisation in a dynamic and unpredictable social setting. Methods: Thirty-five healthy adults (aged 18–38 years; 19 males, 16 females) participated in the study. The IMPRO protocol consisted of three speech improvisation tasks with increasing cognitive and social stressors. Salivary cortisol was used as a biochemical marker of acute stress, while electrodermal activity (EDA) provided real-time autonomic arousal measurements. Stress responses were assessed using paired t-tests for cortisol levels and repeated-measures ANOVA for EDA variations across experimental stages. Results: Salivary cortisol levels significantly increased from baseline (M = 2.68 nM, SD = 0.99) to post-task (M = 3.54 nM, SD = 1.25, p = 0.001, Cohen’s d = 0.59), confirming hypothalamic–pituitary–adrenal (HPA) axis activation. EDA showed a significant rise during the anticipation phase (p < 0.001), peaking at the final task and decreasing during recovery (η2 = 0.643). Conclusions: The IMPRO protocol effectively induces acute stress responses, providing a scalable, ecologically valid alternative to traditional stress paradigms. Its low-cost, adaptable design makes it ideal for research in psychology, neuroscience, and behavioral sciences. Future studies should explore its application in clinical populations and group settings. Full article
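The cortisol analysis pairs a paired t-test with Cohen’s d. For paired data, one common convention computes d as the mean of the difference scores over their standard deviation; the abstract does not state which variant the authors used, so the sketch below assumes the difference-score convention and invented function names:

```python
import math
import statistics

def paired_t_and_d(pre, post):
    """Paired t statistic and Cohen's d for two repeated measurements,
    with d taken as mean(diff) / sd(diff) (difference-score convention)."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)   # sample SD, n - 1 denominator
    t = mean_diff / (sd_diff / math.sqrt(n))
    d = mean_diff / sd_diff
    return t, d
```

Note that under this convention t = d·√n, so the two statistics carry the same information for a fixed sample size.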

20 pages, 3422 KiB  
Article
Hands-Free Human–Machine Interfaces Using Piezoelectric Sensors and Accelerometers for Simulated Wheelchair Control in Older Adults and People with Physical Disabilities
by Charoenporn Bouyam, Nannaphat Siribunyaphat, Dollaporn Anopas, May Thu and Yunyong Punsawad
Sensors 2025, 25(10), 3037; https://doi.org/10.3390/s25103037 - 12 May 2025
Viewed by 1572
Abstract
Human–machine interface (HMI) systems are increasingly utilized to develop assistive technologies for individuals with disabilities and older adults. This study proposes two HMI systems using piezoelectric sensors to detect facial muscle activations from eye and tongue movements, and accelerometers to monitor head movements. This system enables hands-free wheelchair control for those with physical disabilities and speech impairments. A prototype wearable sensing device was also designed and implemented. Four commands can be generated using each sensor to steer the wheelchair. We conducted tests in offline and real-time scenarios to assess efficiency and usability among older volunteers. The head–machine interface achieved greater efficiency than the face–machine interface. The simulated wheelchair control tests showed that the head–machine interface typically required twice the time of joystick control, whereas the face–machine interface took approximately four times longer. Participants noted that the head-mounted wearable device was flexible and comfortable. Both modalities can be used for wheelchair control, especially the head–machine interface for patients retaining head movement. In severe cases, the face–machine interface can be used. Moreover, hybrid control can be employed to satisfy specific requirements. Compared to current commercial devices, the proposed HMIs provide lower costs, easier fabrication, and greater adaptability for real-world applications. We will further verify and improve the proposed devices for controlling a powered wheelchair, ensuring practical usability for people with paralysis and speech impairments. Full article
(This article belongs to the Special Issue Wearable Sensors, Robotic Systems and Assistive Devices)

18 pages, 976 KiB  
Article
A Z-Test-Based Evaluation of a Least Mean Square Filter for Noise Reduction
by Alan Rodríguez Bojorjes, Abel Garcia-Barrientos, Marco Cárdenas-Juárez, Ulises Pineda-Rico, Armando Arce, Sharon Macias Velasquez and Obed Pérez Cortés
Acoustics 2025, 7(2), 20; https://doi.org/10.3390/acoustics7020020 - 14 Apr 2025
Viewed by 1114
Abstract
This paper presents a comprehensive evaluation using a Z-test to assess the effectiveness of an adaptive Least Mean Squares (LMS) filter driven by the Steepest Descent Method (SDM). The study uses a male voice recording, captured in a controlled studio environment, to which persistent Gaussian noise was intentionally added to simulate real-world interference. All signal processing was implemented in MATLAB 9.13.0 (R2022b; The MathWorks, Inc., Natick, MA, USA). The adaptive filter demonstrated a significant improvement of 20 dB in Signal-to-Noise Ratio (SNR) following initial optimization of the filter parameter μ. To further assess the LMS filter’s performance, an empirical experiment was conducted with 30 young adults, aged between 20 and 30 years, who were asked to qualitatively distinguish between the clean and noise-corrupted signals in a blind test. Quantitative and statistical evaluation of the participants’ responses revealed that a significant majority, 80%, reliably identified the noise-affected and filtered signals. This outcome highlights the LMS filter’s potential, despite the slow convergence of the SDM, for enhancing signal clarity in noise-contaminated environments, validating its practical application in speech processing and noise reduction. Full article
(This article belongs to the Special Issue Developments in Acoustic Phonetic Research)
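The core of the evaluated approach, an LMS update that estimates the noise from a correlated reference and subtracts the estimate, fits in a few lines. This is a textbook LMS noise canceller, not the authors’ MATLAB code; the tap count and step size μ are illustrative:

```python
import random

def lms_filter(desired, reference, n_taps=8, mu=0.01):
    """Textbook LMS noise canceller: adaptively estimate the noise
    component of `desired` from a correlated `reference` signal and
    subtract the estimate.  Returns the cleaned (error) signal."""
    w = [0.0] * n_taps
    cleaned = []
    for i in range(len(desired)):
        # Most recent n_taps reference samples, zero-padded at the start.
        x = [reference[i - k] if i - k >= 0 else 0.0 for k in range(n_taps)]
        y = sum(wk * xk for wk, xk in zip(w, x))   # current noise estimate
        e = desired[i] - y                          # error = cleaned sample
        w = [wk + 2 * mu * e * xk for wk, xk in zip(w, x)]  # LMS update
        cleaned.append(e)
    return cleaned

# Demo: the reference equals the corrupting noise, so the filter should
# drive the residual toward zero once the weights converge.
random.seed(0)
noise = [random.uniform(-1.0, 1.0) for _ in range(2000)]
cleaned = lms_filter(noise, noise, n_taps=4, mu=0.05)
residual_power = sum(v * v for v in cleaned[1000:]) / 1000
```

The step size μ trades convergence speed against stability, which is why the paper reports optimizing it before measuring the 20 dB SNR gain.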

11 pages, 227 KiB  
Review
Multi-Faceted Assessment of Children with Selective Mutism: Challenges and Practical Suggestions
by Maayan Shorer
Behav. Sci. 2025, 15(4), 472; https://doi.org/10.3390/bs15040472 - 5 Apr 2025
Viewed by 1463
Abstract
The multi-faceted nature of Selective Mutism (SM), and its comorbidity with other disorders, necessitates a comprehensive assessment process. However, evaluating children with SM presents significant challenges, including difficulties in building rapport, establishing an accurate diagnosis, and conducting formal psychological and neuropsychological assessments. This paper explores the key obstacles in assessing children with SM and provides practical recommendations for overcoming these challenges. Effective strategies for reducing anxiety during assessments include extended rapport-building phases, playful and engaging interactions, and the strategic use of parental involvement. Additionally, given the variability in SM symptoms across different settings, a multi-informant and multi-method assessment approach—including clinical observation, structured interviews, and standardized parent- and teacher-report measures—is recommended. This paper also discusses adaptations for formal testing, particularly in cognitive, language, and neurodevelopmental assessments, where SM-related speech avoidance can interfere with standardized evaluations. Nonverbal assessment tools, modifications to testing environments, and alternative response formats are proposed as potential solutions. Furthermore, we highlight the importance of differentiating SM from overlapping conditions, such as autism spectrum disorder and language impairments, to ensure accurate diagnosis and intervention planning. By implementing tailored assessment strategies, clinicians and researchers can improve diagnostic accuracy and better understand the unique needs of children with SM. This, in turn, can inform individualized treatment plans, enhance educational placement decisions, and support the overall well-being of children with SM. Full article
(This article belongs to the Special Issue Approaches to Overcoming Selective Mutism in Children and Youths)
16 pages, 25849 KiB  
Article
A Hybrid Approach to Semantic Digital Speech: Enabling Gradual Transition in Practical Communication Systems
by Münif Zeybek, Bilge Kartal Çetin and Erkan Zeki Engin
Electronics 2025, 14(6), 1130; https://doi.org/10.3390/electronics14061130 - 13 Mar 2025
Viewed by 928
Abstract
Recent advances in deep learning have fostered a transition from the traditional, bit-centric paradigm of Shannon’s information theory to a semantic-oriented approach, emphasizing the transmission of meaningful information rather than mere data fidelity. However, black-box AI-based semantic communication lacks structured discretization and remains dependent on analog modulation, which presents deployment challenges. This paper presents a new semantic-aware digital speech communication system, named Hybrid-DeepSCS, a stepping stone between traditional and fully end-to-end semantic communication. Our system comprises the following parts: a semantic encoder for extracting and compressing structured features, a standard transmitter for digital modulation including source and channel encoding, a standard receiver for recovering the bitstream, and a semantic decoder for expanding the features and reconstructing speech. By adding semantic encoding to a standard digital transmission, our system works with existing communication networks while exploring the potential of deep learning for feature representation and reconstruction. This hybrid method allows for gradual implementation, making it more practical for real-world uses like low-bandwidth speech, robust voice transmission over wireless networks, and AI-assisted speech on edge devices. The system’s compatibility with conventional digital infrastructure positions it as a viable solution for IoT deployments, where seamless integration with legacy systems and energy-efficient processing are critical. Furthermore, our approach addresses IoT-specific challenges such as bandwidth constraints in industrial sensor networks and latency-sensitive voice interactions in smart environments. We test the system under various channel conditions using Signal-to-Distortion Ratio (SDR), PESQ, and STOI metrics. The results show that our system delivers robust and clear speech, connecting traditional wireless systems with the future of AI-driven communication. The framework’s adaptability to edge computing architectures further underscores its relevance for IoT platforms, enabling efficient semantic processing in resource-constrained environments. Full article
(This article belongs to the Special Issue Application of Artificial Intelligence in Wireless Communications)
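Of the three reported metrics, the Signal-to-Distortion Ratio has the simplest closed form: ten times the base-10 log of the clean-signal power over the residual power (PESQ and STOI need dedicated implementations). A minimal sketch, with the function name invented for the example:

```python
import math

def signal_to_distortion_ratio(clean, estimate):
    """SDR in dB: clean-signal power over the power of the residual
    (clean minus estimate).  Higher is better."""
    signal_power = sum(s * s for s in clean)
    error_power = sum((s - e) ** 2 for s, e in zip(clean, estimate))
    return 10 * math.log10(signal_power / error_power)
```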

21 pages, 537 KiB  
Systematic Review
Assessing Pragmatic Skills in People with Intellectual Disabilities
by Sonia Hernández Hernández, Sergio Marín Quinto, Verónica Marina Guillén Martín and Cristina Mumbardó-Adam
Behav. Sci. 2025, 15(3), 281; https://doi.org/10.3390/bs15030281 - 27 Feb 2025
Viewed by 1326
Abstract
People with intellectual disabilities live with significant conceptual, social, and practical limitations that hinder the acquisition, development, and use of language. Pragmatic skills facilitate interpersonal relationships, allowing for the understanding and expression of oneself, as well as the planning, organization, and adaptation of speech depending on the context and interlocutor. These skills imply, therefore, complex higher functions that must be articulated harmoniously for effective communication. Identifying the weaknesses of people with intellectual disability in the pragmatic dimension of language enables the provision of individualized support resources to guarantee their participation and social inclusion. This study presents a systematic review based on the PRISMA guidelines, and it includes the most commonly used assessment tools for pragmatic competence in people with intellectual disabilities over time. Of the 172 articles found, 20 met the inclusion criteria and were finally reviewed. The results show a lack of conformity between instruments in the pragmatic aspects evaluated and a lack of adjustment of the evaluation tools to the characteristics of this population. Therefore, the design of new standardized tests that specifically evaluate the pragmatic skills of people with intellectual disability is required in the near future. A tailored assessment is crucial for defining a complete profile of their communication skills and generating individualized intervention and support programs. Full article

10 pages, 780 KiB  
Article
Validation of the Second Version of the LittlEARS® Early Speech Production Questionnaire (LEESPQ) in Romanian-Speaking Children with Normal Hearing
by Alina-Catalina Ivanov, Luminita Radulescu, Sebastian Cozma, Madalina Georgescu, Bogdan Cobzeanu, Adriana Neagos, Petronela Moraru, Alma Maniu and Corina Butnaru
Audiol. Res. 2025, 15(1), 9; https://doi.org/10.3390/audiolres15010009 - 22 Jan 2025
Viewed by 714
Abstract
Objectives: The objectives of the current study were to validate the LittlEARS® Early Speech Production Questionnaire (LEESPQ) in Romanian and to evaluate the psychometric properties of the Romanian version for children with normal hearing. The LEESPQ was created and tested for the assessment of preverbal and early verbal skills (0–18 months) in children with normal hearing. Methods: The English version of the LEESPQ was adapted into Romanian using a translation/back-translation procedure, with content validation before the questionnaire was applied. The Romanian version was administered to the parents of 232 children with normal hearing, aged between 0 and 18 months. The questionnaire was statistically analyzed to assess its reliability, internal consistency, predictive accuracy, and the influence of gender on children’s scores. Results: Statistical analyses confirmed the LEESPQ’s reliability (α = 0.876) and high predictive accuracy (λ = 0.951). Age correlated strongly with total scores (ρ = 0.67; p < 0.001), supporting the age-dependent progression of speech production milestones. Gender did not significantly affect the scores. Normative curves and minimum expected scores were established for each age group. Conclusions: This study confirmed that the Romanian version of the LEESPQ is a reliable, valid, language-independent instrument, useful for assessing language development in children with normal hearing up to 18 months of age. Full article
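The reported reliability figure (α = 0.876) is Cronbach’s alpha, computed from the item variances and the variance of each respondent’s total score. A generic stdlib implementation of the standard formula, not the study’s analysis code:

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha for `items`, a list of per-item score lists over
    the same respondents (in the same order):
    alpha = k/(k-1) * (1 - sum(item variances) / variance of totals)."""
    k = len(items)
    n = len(items[0])
    item_var_sum = sum(statistics.variance(item) for item in items)
    totals = [sum(item[j] for item in items) for j in range(n)]
    return (k / (k - 1)) * (1 - item_var_sum / statistics.variance(totals))
```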

26 pages, 3823 KiB  
Article
Enhanced Conformer-Based Speech Recognition via Model Fusion and Adaptive Decoding with Dynamic Rescoring
by Junhao Geng, Dongyao Jia, Zihao He, Nengkai Wu and Ziqi Li
Appl. Sci. 2024, 14(24), 11583; https://doi.org/10.3390/app142411583 - 11 Dec 2024
Viewed by 1991
Abstract
Speech recognition is widely applied in fields like security, education, and healthcare. While its development drives global information infrastructure and AI strategies, current models still face challenges such as overfitting, local optima, and inefficiencies in decoding accuracy and computational cost. These issues cause instability and long response times, hindering AI’s competitiveness. Therefore, addressing these technical bottlenecks is critical for advancing national scientific progress and global information infrastructure. In this paper, we propose improvements to the model structure fusion and decoding algorithms. First, based on the Conformer network and its variants, we introduce a weighted fusion method using training loss as an indicator, adjusting the weights, thresholds, and other related parameters of the fused models to balance the contributions of different model structures, thereby creating a more robust and generalized model that alleviates overfitting and local optima. Second, for the decoding phase, we design a dynamic adaptive decoding method that combines traditional decoding algorithms such as connectionist temporal classification and attention-based models. This ensemble approach enables the system to adapt to different acoustic environments, improving its robustness and overall performance. Additionally, to further optimize the decoding process, we introduce a penalty function mechanism as a regularization technique to reduce the model’s dependence on a single decoding approach. The penalty function limits the weights of decoding strategies to prevent over-reliance on any single decoder, thus enhancing the model’s generalization. Finally, we validate our model on the Librispeech dataset, a large-scale English speech corpus containing approximately 1000 h of audio data. Experimental results demonstrate that the proposed method achieves word error rates (WERs) of 3.92% and 4.07% on the development and test sets, respectively, significantly improving over single-model and traditional decoding methods. Notably, the method reduces WER by approximately 0.4% on complex datasets compared to several advanced mainstream models, underscoring its superior robustness and adaptability in challenging acoustic environments. The effectiveness of the proposed method in addressing overfitting and improving accuracy and efficiency during the decoding phase was validated, highlighting its significance in advancing speech recognition technology. Full article
(This article belongs to the Special Issue Deep Learning for Speech, Image and Language Processing)
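A rough sketch of the weighted ensemble decoding idea described in the abstract: per-hypothesis CTC and attention log-probabilities are interpolated, and a penalty on the squared weights discourages over-reliance on either decoder. The weight values, the penalty form, and the function names here are illustrative assumptions, not the paper's exact formulation.

```python
def fused_score(ctc_logp: float, att_logp: float, w_ctc: float, w_att: float,
                penalty_coef: float = 0.1) -> float:
    """Interpolate CTC and attention log-probabilities for one hypothesis.
    An L2-style penalty on the weights (hypothetical form) discourages
    over-reliance on a single decoder."""
    assert abs(w_ctc + w_att - 1.0) < 1e-9, "weights must sum to 1"
    penalty = penalty_coef * (w_ctc ** 2 + w_att ** 2)
    return w_ctc * ctc_logp + w_att * att_logp - penalty

def pick_best(hyps, w_ctc=0.3, w_att=0.7):
    """hyps: list of (text, ctc_logp, att_logp); return the best-scoring text."""
    return max(hyps, key=lambda h: fused_score(h[1], h[2], w_ctc, w_att))[0]

# Toy hypotheses: the attention decoder strongly prefers the first one.
hyps = [("the cat sat", -4.2, -3.1), ("the cats at", -3.9, -5.0)]
print(pick_best(hyps))
```

In a real system the weights would be tuned on a development set (the paper adjusts them using training loss as an indicator), and the dynamic adaptive part would vary them per utterance.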
17 pages, 8195 KiB  
Article
Measuring Speech Intelligibility with Romanian Synthetic Unpredictable Sentences in Normal Hearing
by Oana Astefanei, Sebastian Cozma, Cristian Martu, Roxana Serban, Corina Butnaru, Petronela Moraru, Gabriela Musat and Luminita Radulescu
Audiol. Res. 2024, 14(6), 1028-1044; https://doi.org/10.3390/audiolres14060085 - 1 Dec 2024
Viewed by 1264
Abstract
Background/Objectives: Understanding speech in background noise is a challenging task for listeners with normal hearing and even more so for individuals with hearing impairments. The primary objective of this study was to develop Romanian speech material in noise to assess speech perception in diverse auditory populations, including individuals with normal hearing and those with various types of hearing loss. The goal was to create a versatile tool that can be used in different configurations and expanded for future studies examining auditory performance across various populations and rehabilitation methods. Methods: This study outlines the development of Romanian speech material for speech-in-noise testing, initially presented to normal-hearing listeners to establish baseline data. The material consisted of unpredictable sentences, each with a fixed syntactic structure, generated using speech synthesis from all Romanian phonemes. A total of 50 words were selected and organized into 15 lists, each containing 10 sentences, with five words per sentence. Two evaluation methods were applied in two sessions to 20 normal-hearing volunteers. The first method was an adaptive speech-in-noise recognition test designed to assess the speech recognition threshold (SRT) by adjusting the signal-to-noise ratio (SNR) based on individual performance. The intelligibility of the lists was further assessed at the sentence level to evaluate the training effect. The second method was used to obtain normative data for the SRT, defined as the SNR at which a subject correctly recognizes 50% of the speech material, as well as for the slope, which refers to the steepness of the psychometric function derived from threshold recognition scores measured at three fixed SNRs (−10 dB, −7 dB, and −4 dB) during the measurement phase. Results: The adaptive method showed that the training effect was established after two lists and remained consistent across both sessions. 
During the measurement phase, the fixed SNR method yielded a mean SRT50 of −7.38 dB with a slope of 11.39%. These results provide reliable and comparable data, supporting the validity of the material for both general population testing and future clinical applications. Conclusions: This study demonstrates that the newly developed Romanian speech material is effective for evaluating speech recognition abilities in noise. The training phase successfully mitigated initial unfamiliarity with the material, ensuring that the results reflect realistic auditory performance. The obtained SRT and slope values provide valuable normative data for future auditory assessments. Due to its flexible design, the material can be further developed and extended to accommodate various auditory rehabilitation methods and diverse populations in future studies. Full article
(This article belongs to the Special Issue Rehabilitation of Hearing Impairment: 2nd Edition)
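The adaptive speech-recognition-threshold procedure described above can be illustrated with a minimal staircase sketch: the SNR is lowered after a correct response and raised after an incorrect one, converging on the level where the listener scores about 50%. The 1-up/1-down rule, 2 dB step, and reversal averaging are common conventions assumed here for illustration, not necessarily the study's exact parameters.

```python
def adaptive_srt(trial_correct, start_snr=0.0, step=2.0, max_trials=30):
    """Estimate the SRT with a simple 1-up/1-down staircase.
    trial_correct: callable(snr_db) -> bool, whether the listener repeated
    the sentence correctly at that SNR. Returns the mean SNR at reversal
    points (direction changes) as the threshold estimate."""
    snr = start_snr
    last = None
    reversals = []
    for _ in range(max_trials):
        correct = trial_correct(snr)
        if last is not None and correct != last:
            reversals.append(snr)           # response flipped: a reversal
        last = correct
        snr += -step if correct else step   # harder after success, easier after failure
        if len(reversals) >= 8:
            break
    return sum(reversals) / len(reversals) if reversals else snr

# Simulated listener whose true threshold is -7 dB SNR.
print(adaptive_srt(lambda snr: snr >= -7.0))
```

The slope reported in the study is instead obtained from fixed-SNR measurements (at −10, −7, and −4 dB) by fitting a psychometric function to the recognition scores.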
40 pages, 3414 KiB  
Article
Investigating the Predominance of Large Language Models in Low-Resource Bangla Language over Transformer Models for Hate Speech Detection: A Comparative Analysis
by Fatema Tuj Johora Faria, Laith H. Baniata and Sangwoo Kang
Mathematics 2024, 12(23), 3687; https://doi.org/10.3390/math12233687 - 25 Nov 2024
Cited by 5 | Viewed by 3023
Abstract
The rise in abusive language on social media is a significant threat to mental health and social cohesion. For Bengali speakers, the need for effective detection is critical. However, current methods fall short in addressing the massive volume of content. Improved techniques are urgently needed to combat online hate speech in Bengali. Traditional machine learning techniques, while useful, often require large, linguistically diverse datasets to train models effectively. This paper addresses the urgent need for improved hate speech detection methods in Bengali, aiming to fill the existing research gap. Contextual understanding is crucial in differentiating between harmful speech and benign expressions. Large language models (LLMs) have shown state-of-the-art performance in various natural language tasks due to their extensive training on vast amounts of data. We explore the application of LLMs, specifically GPT-3.5 Turbo and Gemini 1.5 Pro, for Bengali hate speech detection using Zero-Shot and Few-Shot Learning approaches. Unlike conventional methods, Zero-Shot Learning identifies hate speech without task-specific training data, making it highly adaptable to new datasets and languages. Few-Shot Learning, on the other hand, requires minimal labeled examples, allowing for efficient model training with limited resources. Our experimental results show that LLMs outperform traditional approaches. In this study, we evaluate GPT-3.5 Turbo and Gemini 1.5 Pro on multiple datasets. To further enhance our study, we consider the distribution of comments in different datasets and the challenge of class imbalance, which can affect model performance. The BD-SHS dataset consists of 35,197 comments in the training set, 7542 in the validation set, and 7542 in the test set. 
The Bengali Hate Speech Dataset v1.0 and v2.0 include comments distributed across several hate categories: personal hate (629), political hate (1771), religious hate (502), geopolitical hate (1179), and gender-abusive hate (316). The Bengali Hate Dataset comprises 7500 non-hate and 7500 hate comments. GPT-3.5 Turbo achieved accuracies of 97.33%, 98.42%, and 98.53% across the evaluated datasets, consistently and significantly outperforming Gemini 1.5 Pro. These outcomes represent a 6.28% increase in accuracy over traditional methods, which achieved 92.25%. Our research contributes to the growing body of literature on LLM applications in natural language processing, particularly in the context of low-resource languages. Full article
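The Zero-Shot and Few-Shot setups can be sketched as prompt construction: Zero-Shot sends only an instruction and the comment, while Few-Shot prepends a handful of labeled examples. The prompt wording and the HATE/NOT_HATE label set below are hypothetical illustrations, and the call to the actual LLM API (GPT-3.5 Turbo or Gemini 1.5 Pro) is left abstract.

```python
def build_zero_shot_prompt(comment: str) -> str:
    """Zero-shot: ask for a label with no task-specific examples."""
    return (
        "Classify the following Bengali social-media comment as HATE or NOT_HATE.\n"
        f"Comment: {comment}\n"
        "Label:"
    )

def build_few_shot_prompt(comment: str, examples: list[tuple[str, str]]) -> str:
    """Few-shot: prepend a few labeled (comment, label) examples to the query."""
    shots = "\n".join(f"Comment: {c}\nLabel: {label}" for c, label in examples)
    return (
        "Classify each Bengali social-media comment as HATE or NOT_HATE.\n"
        f"{shots}\n"
        f"Comment: {comment}\n"
        "Label:"
    )

# The resulting prompt would be sent to the model via the provider's chat API;
# the returned label is compared against the gold annotation to compute accuracy.
```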