Search Results (305)

Search Parameters:
Keywords = voice assistants

19 pages, 3791 KB  
Article
A Machine Learning Framework for Cognitive Impairment Screening from Speech with Multimodal Large Models
by Shiyu Chen, Ying Tan, Wenyu Hu, Yingxi Chen, Lihua Chen, Yurou He, Weihua Yu and Yang Lü
Bioengineering 2026, 13(1), 73; https://doi.org/10.3390/bioengineering13010073 - 8 Jan 2026
Viewed by 305
Abstract
Background: Early diagnosis of Alzheimer’s disease (AD) is essential for slowing disease progression and mitigating cognitive decline. However, conventional diagnostic methods are often invasive, time-consuming, and costly, limiting their utility in large-scale screening. There is an urgent need for scalable, non-invasive, and accessible screening tools. Methods: We propose a novel screening framework combining a pre-trained multimodal large language model with structured MMSE speech tasks. An artificial intelligence-assisted multilingual Mini-Mental State Examination system (AAM-MMSE) was utilized to collect voice data from 1098 participants in Sichuan and Chongqing. CosyVoice2 was used to extract speaker embeddings, speech labels, and acoustic features, which were converted into statistical representations. Fourteen machine learning models were developed for subject classification into three diagnostic categories: Healthy Control (HC), Mild Cognitive Impairment (MCI), and Alzheimer’s Disease (AD). SHAP analysis was employed to assess the importance of the extracted speech features. Results: Among the evaluated models, LightGBM and Gradient Boosting classifiers exhibited the highest performance, achieving an average AUC of 0.9501 across classification tasks. SHAP-based analysis revealed that spectral complexity, energy dynamics, and temporal features were the most influential in distinguishing cognitive states, aligning with known speech impairments in early-stage AD. Conclusions: This framework offers a non-invasive, interpretable, and scalable solution for cognitive screening. It is suitable for both clinical and telemedicine applications, demonstrating the potential of speech-based AI models in early AD detection.
(This article belongs to the Section Biosignal Processing)
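
As a rough, hypothetical sketch of the classification stage this abstract describes (a gradient-boosted model over per-recording acoustic statistics, ranked afterwards with SHAP), the snippet below wires up LightGBM and SHAP on synthetic stand-in data; the feature names, labels, and settings are invented placeholders, not the authors' pipeline.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
import shap
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Placeholder per-recording acoustic statistics (names are illustrative).
features = ["spectral_entropy_mean", "energy_var", "pause_rate", "f0_std", "mfcc1_mean"]
X = pd.DataFrame(rng.normal(size=(600, len(features))), columns=features)
y = rng.integers(0, 3, size=600)  # 0 = HC, 1 = MCI, 2 = AD (dummy labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
model = lgb.LGBMClassifier(random_state=0).fit(X_tr, y_tr)

# Macro one-vs-rest AUC, one common way to report a three-class screening AUC.
print(roc_auc_score(y_te, model.predict_proba(X_te), multi_class="ovr"))

# SHAP values rank which acoustic statistics drive each class decision,
# e.g., via shap.summary_plot(shap_values, X_te).
shap_values = shap.TreeExplainer(model).shap_values(X_te)
```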

23 pages, 6094 KB  
Systematic Review
Toward Smart VR Education in Media Production: Integrating AI into Human-Centered and Interactive Learning Systems
by Zhi Su, Tse Guan Tan, Ling Chen, Hang Su and Samer Alfayad
Biomimetics 2026, 11(1), 34; https://doi.org/10.3390/biomimetics11010034 - 4 Jan 2026
Viewed by 506
Abstract
Smart virtual reality (VR) systems are becoming central to media production education, where immersive practice, real-time feedback, and hands-on simulation are essential. This review synthesizes the integration of artificial intelligence (AI) into human-centered, interactive VR learning for television and media production. Searches in Scopus, Web of Science, IEEE Xplore, ACM Digital Library, and SpringerLink (2013–2024) identified 790 records; following PRISMA screening, 94 studies met the inclusion criteria and were synthesized using a systematic scoping review approach. Across this corpus, common AI components include learner modeling, adaptive task sequencing (e.g., RL-based orchestration), affect sensing (vision, speech, and biosignals), multimodal interaction (gesture, gaze, voice, haptics), and growing use of LLM/NLP assistants. Reported benefits span personalized learning trajectories, high-fidelity simulation of studio workflows, and more responsive feedback loops that support creative, technical, and cognitive competencies. Evaluation typically covers usability and presence, workload and affect, collaboration, and scenario-based learning outcomes, leveraging interaction logs, eye tracking, and biofeedback. Persistent challenges include latency and synchronization under multimodal sensing, data governance and privacy for biometric/affective signals, limited transparency/interpretability of AI feedback, and heterogeneous evaluation protocols that impede cross-system comparison. We highlight essential human-centered design principles (teacher-in-the-loop orchestration, timely and explainable feedback, and ethical data governance) and outline a research agenda to support standardized evaluation and scalable adoption of smart VR education in the creative industries.
(This article belongs to the Special Issue Biomimetic Innovations for Human–Machine Interaction)

28 pages, 3895 KB  
Article
Advancing Machine Learning Strategies for Power Consumption-Based IoT Botnet Detection
by Almustapha A. Wakili, Saugat Guni, Sabbir Ahmed Khan, Wei Yu and Woosub Jung
Sensors 2025, 25(24), 7553; https://doi.org/10.3390/s25247553 - 12 Dec 2025
Viewed by 562
Abstract
The proliferation of Internet of Things (IoT) devices has amplified botnet risks, while traditional network-based intrusion detection systems (IDSs) struggle under encrypted and/or sparse traffic. Power consumption offers an effective side channel for device-level detection. Yet, prior studies typically focus on a single model family (often a convolutional neural network (CNN)) and rarely assess generalization across devices or compare broader model classes. In this paper, we conduct unified benchmarking and comparison of classical (SVM and RF), deep (CNN, LSTM, and 1D Transformer), and hybrid (CNN + LSTM, CNN + Transformer, and CNN + RF) models on the CHASE’19 dataset and a newly curated three-class botnet dataset, using consistent preprocessing and evaluation across single- and cross-device settings, reporting both accuracy and efficiency (latency and throughput). Experimental results demonstrate that Random Forest achieves the highest single-device accuracy (99.43% on the Voice Assistant with Seed 42), while CNN + Transformer shows a strong accuracy–efficiency trade-off in cross-device scenarios (94.02% accuracy on the combined dataset at ∼60,000 samples/s when using the best-performing Seed 42). These results offer practical guidance for selecting models under accuracy, latency, and throughput constraints and establish a reproducible baseline for power-side-channel IDSs.
(This article belongs to the Special Issue IoT Cybersecurity: 2nd Edition)
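
The following toy sketch illustrates the general power-side-channel idea the paper benchmarks (summary features over device power traces feeding a Random Forest); the synthetic traces and the feature set are assumptions for illustration, not the CHASE’19 setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

def trace_features(trace: np.ndarray) -> np.ndarray:
    """Summarize a 1-D power trace: level, variability, peak, roughness."""
    return np.array([trace.mean(), trace.std(), trace.max(),
                     np.abs(np.diff(trace)).mean()])

# Synthetic traces: a botnet-like device adds periodic activity bursts.
benign = rng.normal(1.0, 0.05, size=(60, 4096))
bursts = 0.3 * (np.sin(np.linspace(0, 120, 4096)) > 0.9)
botnet = rng.normal(1.0, 0.05, size=(60, 4096)) + bursts

X = np.array([trace_features(t) for t in np.vstack([benign, botnet])])
y = np.array([0] * 60 + [1] * 60)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
print(cross_val_score(clf, X, y, cv=5).mean())  # near-perfect on this toy data
```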

20 pages, 14885 KB  
Article
MultiPhysio-HRC: A Multimodal Physiological Signals Dataset for Industrial Human–Robot Collaboration
by Andrea Bussolan, Stefano Baraldo, Oliver Avram, Pablo Urcola, Luis Montesano, Luca Maria Gambardella and Anna Valente
Robotics 2025, 14(12), 184; https://doi.org/10.3390/robotics14120184 - 5 Dec 2025
Viewed by 835
Abstract
Human–robot collaboration (HRC) is a key focus of Industry 5.0, aiming to enhance worker productivity while ensuring well-being. The ability to perceive human psycho-physical states, such as stress and cognitive load, is crucial for adaptive and human-aware robotics. This paper introduces MultiPhysio-HRC, a multimodal dataset containing physiological, audio, and facial data collected during real-world HRC scenarios. The dataset includes electroencephalography (EEG), electrocardiography (ECG), electrodermal activity (EDA), respiration (RESP), electromyography (EMG), voice recordings, and facial action units. The dataset integrates controlled cognitive tasks, immersive virtual reality experiences, and industrial disassembly activities performed manually and with robotic assistance, to capture a holistic view of the participants’ mental states. Rich ground truth annotations were obtained using validated psychological self-assessment questionnaires. Baseline models were evaluated for stress and cognitive load classification, demonstrating the dataset’s potential for affective computing and human-aware robotics research. MultiPhysio-HRC is publicly available to support research in human-centered automation, workplace well-being, and intelligent robotic systems.
(This article belongs to the Special Issue Human–Robot Collaboration in Industry 5.0)

11 pages, 1985 KB  
Concept Paper
Reflections on the Quality of Life of Adults with Down Syndrome from an International Congress
by Rachel Spencer, Robin Gibson, Leigh Creighton, Catherine Watson and Roy McConkey
Disabilities 2025, 5(4), 111; https://doi.org/10.3390/disabilities5040111 - 4 Dec 2025
Viewed by 485
Abstract
People with Down Syndrome often experience more barriers to achieving a good quality of life compared to people without disabilities. A lot of the existing research has focused on the views of parents and professionals, rather than directly including the voices and perspectives of people with Down Syndrome themselves. We wanted to find out how this might be done. At the 2024 World Down Syndrome Conference, over 140 adults with Down Syndrome came together at a one-day Forum to talk about their lives: aspects that are going well and what could be better. The goal was to hear directly from them. This article explains how the Forum was run so that others with Down Syndrome can use a similar process. We describe how Artificial Intelligence (AI) was used to assist the authors in organising and sharing the information from participants, such as grouping what people said into different themes and helping to create plain language reports. This process worked. Eight key themes were found that could help people to have a good life, such as having good relationships with family and friends; having a job; making personal choices; and being respected and included. The list was longer than previously reported in other studies. The Forum gave valuable insights and helped us think of new ideas for supporting people with Down Syndrome to speak up for themselves. Used thoughtfully, AI could be a helpful tool in the future to help people with Down Syndrome share their experiences and needs. More research is needed to understand how people with Down Syndrome can be more involved in making changes through advocacy projects where they take an active role.

44 pages, 10088 KB  
Article
NAIA: A Robust Artificial Intelligence Framework for Multi-Role Virtual Academic Assistance
by Adrián F. Pabón M., Kenneth J. Barrios Q., Samuel D. Solano C. and Christian G. Quintero M.
Systems 2025, 13(12), 1091; https://doi.org/10.3390/systems13121091 - 3 Dec 2025
Viewed by 941
Abstract
Virtual assistants in academic environments often lack comprehensive multimodal integration and specialized role-based architecture. This paper presents NAIA (Nimble Artificial Intelligence Assistant), a robust artificial intelligence framework designed for multi-role virtual academic assistance through a modular monolithic approach. The system integrates Large Language Models (LLMs), Computer Vision, voice processing, and animated digital avatars within five specialized roles: researcher, receptionist, personal skills trainer, personal assistant, and university guide. NAIA’s architecture implements simultaneous voice, vision, and text processing through a three-model LLM system for optimized response quality, Redis-based conversation state management for context-aware interactions, and strategic third-party service integration with OpenAI, Backblaze B2, and SerpAPI. The framework seamlessly connects with the institutional ecosystem through Microsoft Graph API integration, while the frontend delivers immersive experiences via 3D avatar rendering using Ready Player Me and Mixamo. System effectiveness is evaluated through a comprehensive mixed-methods approach involving 30 participants from Universidad del Norte, employing Technology Acceptance Model (TAM2/TAM3) constructs and System Usability Scale (SUS) assessments. Results demonstrate strong user acceptance: 93.3% consider NAIA useful overall, 93.3% find it easy to use and learn, 100% intend to continue using and recommend it, and 90% report confident independent operation. Qualitative analysis reveals high satisfaction with role specialization, intuitive interface design, and institutional integration. The comparative analysis positions NAIA’s distinctive contributions through its synthesis of institutional knowledge integration with enhanced multimodal capabilities and specialized role architecture, establishing a comprehensive framework for intelligent human-AI interaction in modern educational environments.
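
Of the components the abstract names, the Redis-based conversation state is the easiest to sketch. The snippet below is a minimal, hypothetical version; the key schema, TTL, and message format are assumptions for illustration, not NAIA's actual implementation.

```python
import json
import redis

r = redis.Redis(decode_responses=True)  # assumes a local Redis instance

def append_turn(session_id: str, role: str, text: str, ttl_s: int = 3600) -> None:
    """Store one conversation turn and refresh the session's expiry."""
    key = f"session:{session_id}"  # hypothetical key schema
    r.rpush(key, json.dumps({"role": role, "text": text}))
    r.expire(key, ttl_s)

def recent_context(session_id: str, last_n: int = 10) -> list[dict]:
    """Fetch the latest turns to prepend to the next LLM prompt."""
    return [json.loads(t) for t in r.lrange(f"session:{session_id}", -last_n, -1)]

append_turn("u42", "user", "When does the library close today?")
append_turn("u42", "assistant", "The library closes at 10 p.m. today.")
print(recent_context("u42"))
```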

21 pages, 27048 KB  
Article
Evaluating Rich Visual Feedback on Head-Up Displays for In-Vehicle Voice Assistants: A User Study
by Mahmoud Baghdadi, Dilara Samad-Zada and Achim Ebert
Multimodal Technol. Interact. 2025, 9(11), 114; https://doi.org/10.3390/mti9110114 - 16 Nov 2025
Viewed by 711
Abstract
In-vehicle voice assistants face usability challenges due to limitations in delivering feedback within the constraints of the driving environment. The presented study explores the potential of Rich Visual Feedback (RVF) on Head-Up Displays (HUDs) as a multimodal solution to enhance system usability. A user study with 32 participants evaluated three HUD User Interface (UI) designs: the AR Fusion UI, which integrates augmented reality elements for layered, dynamic information presentation; the Baseline UI, which displays only essential keywords; and the Flat Fusion UI, which uses conventional vertical scrolling. To explore HUD interface principles and inform future HUD design without relying on specific hardware, a simulated near-field overlay was used. Usability was measured using the System Usability Scale (SUS), and distraction was assessed with a penalty point method. Results show that RVF on the HUD significantly influences usability, with both content quantity and presentation style affecting outcomes. The minimal Baseline UI achieved the highest overall usability. However, among the two Fusion designs, the AR-based layered information mechanism outperformed the flat scrolling method. Distraction effects were not statistically significant, indicating the need for further research. These findings suggest RVF-enabled HUDs can enhance in-vehicle voice assistant usability, potentially contributing to safer, more efficient driving.
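
For reference, usability here is scored with the standard System Usability Scale rule: ten 1-5 Likert items, odd items contributing (rating - 1), even items contributing (5 - rating), and the sum scaled by 2.5 to a 0-100 range. A minimal sketch (the sample responses are made up):

```python
def sus_score(responses: list[int]) -> float:
    """Standard SUS scoring: 10 items rated 1-5; odd items contribute
    (rating - 1), even items (5 - rating); the sum is scaled by 2.5."""
    assert len(responses) == 10 and all(1 <= x <= 5 for x in responses)
    total = sum(x - 1 if i % 2 == 0 else 5 - x  # index 0 is item 1 (odd)
                for i, x in enumerate(responses))
    return total * 2.5  # final score on a 0-100 scale

print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # made-up respondent: 85.0
```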

38 pages, 2282 KB  
Article
Cross-Lingual Bimodal Emotion Recognition with LLM-Based Label Smoothing
by Elena Ryumina, Alexandr Axyonov, Timur Abdulkadirov, Darya Koryakovskaya and Dmitry Ryumin
Big Data Cogn. Comput. 2025, 9(11), 285; https://doi.org/10.3390/bdcc9110285 - 12 Nov 2025
Viewed by 1900
Abstract
Bimodal emotion recognition based on audio and text is widely adopted in video-constrained real-world applications such as call centers and voice assistants. However, existing systems suffer from limited cross-domain generalization and monolingual bias. To address these limitations, a cross-lingual bimodal emotion recognition method is proposed, integrating Mamba-based temporal encoders for audio (Wav2Vec2.0) and text (Jina-v3) with a Transformer-based cross-modal fusion architecture (BiFormer). Three corpus-adaptive augmentation strategies are introduced: (1) Stacked Data Sampling, in which short utterances are concatenated to stabilize sequence length; (2) Label Smoothing Generation based on Large Language Model, where the Qwen3-4B model is prompted to detect subtle emotional cues missed by annotators, producing soft labels that reflect latent emotional co-occurrences; and (3) Text-to-Utterance Generation, in which emotionally labeled utterances are generated by ChatGPT-5 and synthesized into speech using the DIA-TTS model, enabling controlled creation of affective audio–text pairs without human annotation. BiFormer is trained jointly on the English Multimodal EmotionLines Dataset and the Russian Emotional Speech Dialogs corpus, enabling cross-lingual transfer without parallel data. Experimental results show that the optimal data augmentation strategy is corpus-dependent: Stacked Data Sampling achieves the best performance on short, noisy English utterances, while Label Smoothing Generation based on Large Language Model better captures nuanced emotional expressions in longer Russian utterances. Text-to-Utterance Generation does not yield a measurable gain due to current limitations in expressive speech synthesis. When combined, the two best-performing strategies produce complementary improvements, establishing new state-of-the-art performance in both monolingual and cross-lingual settings.
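
A minimal sketch of the label-smoothing idea described here, with the LLM's cue scores mocked as a dictionary; the label set, mixing weight, and scores are placeholders, not the paper's Qwen3-4B prompting setup.

```python
import numpy as np

EMOTIONS = ["neutral", "joy", "sadness", "anger"]  # toy label set

def soften(hard_label: str, llm_scores: dict[str, float], alpha: float = 0.2) -> np.ndarray:
    """Blend a one-hot annotator label with normalized LLM cue scores:
    (1 - alpha) * one_hot + alpha * llm_distribution."""
    one_hot = np.array([float(e == hard_label) for e in EMOTIONS])
    llm = np.array([llm_scores.get(e, 0.0) for e in EMOTIONS])
    llm = llm / llm.sum() if llm.sum() > 0 else one_hot
    return (1 - alpha) * one_hot + alpha * llm

# The LLM flags latent sadness in an utterance annotated as neutral.
print(soften("neutral", {"neutral": 0.6, "sadness": 0.4}))
# -> [0.92 0.   0.08 0.  ], a soft target for cross-entropy training
```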

19 pages, 901 KB  
Article
End-Users’ Perspectives on Implementation Outcomes of Digital Voice Assistants Delivering a Home-Based Lifestyle Intervention in Older Obese Adults with Type 2 Diabetes Mellitus: A Qualitative Analysis
by Costas Glavas, Jiani Ma, Surbhi Sood, Elena S. George, Robin M. Daly, Eugene Gvozdenko, Barbora de Courten, David Scott and Paul Jansons
Technologies 2025, 13(11), 511; https://doi.org/10.3390/technologies13110511 - 9 Nov 2025
Viewed by 969
Abstract
Managing blood glucose levels and adhering to exercise is challenging for older adults with obesity and type 2 diabetes mellitus (T2DM). Digital voice assistants (DVAs) utilising conversation-based interactions and natural language may overcome barriers to accessing home-based lifestyle programs, but end-user perspectives are essential for implementation. This analysis investigated end-user perspectives on implementation outcomes of a DVA-delivered lifestyle program nested within a randomised controlled trial of 50 older adults (aged 50–75 years) with obesity and T2DM (DVA n = 25; control n = 25). Following trial completion, 10 DVA participants (mean ± SD age 67 ± 4 years) completed semi-structured interviews guided by the Practical Planning for Implementations and Scale-up guide and Proctor’s implementation outcome taxonomy. Over half (60%) were willing to pay for the DVA-delivered program, indicating perceived value. DVA audiovisual and conversation-based modalities enhanced engagement and acceptability. Most end-users found the DVA program feasible as a modality for delivering lifestyle programs, but suggested greater personalisation to bolster sustainability. Overall, the intervention was identified as acceptable and appropriate, suggesting digitally delivered programs may be feasible and sustainable for long-term use. Findings should be interpreted cautiously, given the small sample size and short intervention period. Nevertheless, end-users’ suggestions could inform the implementation of digital health interventions into healthcare systems.

616 KB  
Proceeding Paper
Evaluating Voice Biomarkers and Deep Learning for Neurodevelopmental Disorder Screening in Real-World Conditions
by Hajarimino Rakotomanana and Ghazal Rouhafzay
Eng. Proc. 2025, 118(1), 46; https://doi.org/10.3390/ECSA-12-26523 - 7 Nov 2025
Viewed by 255
Abstract
Voice acoustics have been extensively investigated as potential non-invasive markers for Autism Spectrum Disorder (ASD). Although many studies report high accuracies, they typically rely on highly controlled clinical protocols that reduce linguistic variability. Their data is also recorded using specialized microphone arrays that ensure high-quality recordings. Such dependencies limit their applicability in real-world or in-home screening contexts. In this work, we explore an alternative approach designed to reflect the requirements of mobile-based applications that could assist parents in monitoring their children. We use an open-access dataset of naturalistic storytelling, extracting only the speech segments in which the child is speaking. We applied previously published ASD voice-analysis pipelines to this dataset, which yielded suboptimal performance under these less controlled conditions. We then introduce a deep learning-based method that learns discriminative representations directly from raw audio, eliminating the need for manual feature extraction while being more robust to environmental noise. This approach achieves an accuracy of up to 77% in classifying children with ASD, children with Attention Deficit Hyperactivity Disorder (ADHD), and neurotypical children. Frequency-band occlusion sensitivity analysis on the deep model revealed that ASD speech relied more heavily on the 2000–4000 Hz range, typically developing (TD) speech on both low (100–300 Hz) and high (4000–8000 Hz) bands, and ADHD speech on mid-frequency regions. These spectral patterns may help bring us closer to developing practical, accessible pre-screening tools for parents.
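
The frequency-band occlusion analysis can be sketched as: band-stop filter one band out of each input clip, re-evaluate the classifier, and read the accuracy drop as that band's importance. The bands, sample rate, and stand-in accuracy function below are assumptions, not the authors' exact configuration.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

SR = 22050  # assumed sample rate
BANDS = [(100, 300), (300, 2000), (2000, 4000), (4000, 8000)]  # Hz

def occlude_band(audio: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Suppress [lo, hi] Hz with a band-stop filter."""
    sos = butter(4, [lo, hi], btype="bandstop", fs=SR, output="sos")
    return sosfiltfilt(sos, audio)

def band_importance(accuracy_fn, clips, labels) -> dict:
    """accuracy_fn(clips, labels) -> float; importance of a band is the
    accuracy drop when that band is occluded in every clip."""
    base = accuracy_fn(clips, labels)
    return {band: base - accuracy_fn([occlude_band(a, *band) for a in clips], labels)
            for band in BANDS}

# Toy demo: random clips and a dummy accuracy function standing in for a model.
rng = np.random.default_rng(0)
clips = [rng.normal(size=SR) for _ in range(8)]   # 1 s of noise each
dummy_acc = lambda xs, ys: 0.75                   # placeholder classifier score
print(band_importance(dummy_acc, clips, [0, 1] * 4))  # all drops are 0 here
```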

15 pages, 273 KB  
Article
Equine-Assisted Interventions: Cross Perspectives of Beneficiaries and Their Caregivers from a Qualitative Perspective
by Léa Badin, Elina Van Dendaele and Nathalie Bailly
Geriatrics 2025, 10(6), 145; https://doi.org/10.3390/geriatrics10060145 - 6 Nov 2025
Viewed by 610
Abstract
Background: Although equine-assisted interventions (EAI) are gaining growing attention, their scientific evaluation among individuals with Alzheimer’s disease (AD) living in nursing homes remains limited. This study aimed to explore the lived experiences of an EAI program from the perspectives of the participants living with AD as well as their families and professional caregivers. Methods: Thirty non-directive interviews were conducted between June and July 2024 across several nursing homes in the Centre-Val de Loire region (France). The interviews were recorded, transcribed, and analyzed using thematic analysis. Results: Four main themes emerged from the analysis: (1) the experience with the horse, reflecting a unique relationship with the animal, the activities carried out, and perceived personality traits; (2) the environment of EAI sessions, offering a break from daily routines, encouraging contact with nature, and taking place in a setting specific to this type of intervention; (3) the implementation of the program within the institutional context, highlighting logistical aspects, environmental factors, and adherence; (4) the effects of the intervention, including enhanced social interactions, memory stimulation, emotional engagement, and behavioral benefits. Conclusions: These findings provide insight into the multiple dimensions involved in an EAI program. By giving voice to both participants and their caregivers, this study emphasizes the value of qualitative approaches in deeply understanding the meaning and impact of these non-pharmacological interventions.

27 pages, 1286 KB  
Systematic Review
Smart Speakers for Health and Well-Being of Older Adults: A Mixed-Methods Review
by Michael Joseph Dino, Carla Leinbach, Gerald Dino, Ladda Thiamwong, Chloe Margalaux Villafuerte, Mona Shattell, Justin Pimentel, Maybelle Anne Zamora, Anbel Bautista, John Paul Vitug, Joyline Chepkorir and Nerceilyn Marave
Healthcare 2025, 13(21), 2772; https://doi.org/10.3390/healthcare13212772 - 31 Oct 2025
Viewed by 1381
Abstract
Background: Rapid population aging poses significant challenges to health and wellness systems, necessitating innovative technological interventions. Smart home technologies, particularly voice-activated intelligent assistants (smart speakers), represent a promising avenue for supporting aging populations. Objectives: This study critically examines the empirical literature on smart speakers’ influence on older adults’ health and well-being, mapping the characteristics of existing studies, assessing the current state of this domain, and providing a comprehensive overview. Methods: A mixed-methods systematic review was conducted in accordance with published guidelines. Bibliometric data, article purposes and outcomes, keyword network analysis, and mixed-methods findings from articles retrieved from five major databases were managed through the Covidence and VOSviewer applications. Results: The majority of studies were conducted in the Americas. Bibliometric analysis revealed five predominant thematic clusters: health management, psychological support, social connectedness, technology adoption, and usability. Findings demonstrated multifaceted benefits across several domains. Older adults reported improvements in daily living activities, enhanced emotional well-being, strengthened social connections, and overall health benefits. Qualitative evidence particularly emphasized benefits for medication adherence, routine maintenance, and social support. However, mixed-method synthesis revealed significant barriers to adoption and sustained use, including privacy concerns, technical difficulties, cost constraints, and limited digital literacy among older users. Conclusions: The integration of smart speakers into the homes of older adults offers considerable potential to enhance technological wellness and promote successful aging in place, underscoring the need for structured integration of smart speaker technology and human-centered designs within geriatric care systems.
(This article belongs to the Section Digital Health Technologies)

11 pages, 703 KB  
Article
Distinguishing Between Healthy and Unhealthy Newborns Based on Acoustic Features and Deep Learning Neural Networks Tuned by Bayesian Optimization and Random Search Algorithm
by Salim Lahmiri, Chakib Tadj and Christian Gargour
Entropy 2025, 27(11), 1109; https://doi.org/10.3390/e27111109 - 27 Oct 2025
Viewed by 451
Abstract
Voice analysis and classification for biomedical diagnosis is receiving growing attention as a means of assisting physicians with decision-making in clinical settings. In this study, we develop and test deep feedforward neural networks (DFFNN) to distinguish between healthy and unhealthy newborns. The DFFNN are trained with acoustic features measured from newborn cries, including auditory-inspired amplitude modulation (AAM), Mel Frequency Cepstral Coefficients (MFCC), and prosody. The configuration of the DFFNN is optimized using Bayesian optimization (BO) and the random search (RS) algorithm. Under both optimization techniques, the experimental results show that the DFFNN yielded the highest classification rate when trained with all acoustic features. Specifically, the DFFNN-BO and DFFNN-RS achieved 87.80% ± 0.23 and 86.12% ± 0.33 accuracy, respectively, under a ten-fold cross-validation protocol. Both DFFNN-BO and DFFNN-RS outperformed existing approaches tested on the same database.
(This article belongs to the Section Signal and Data Analysis)
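
As a hedged illustration of the random-search half of the tuning strategy, the sketch below uses scikit-learn's MLPClassifier and RandomizedSearchCV rather than the authors' DFFNN implementation, with an invented search space and synthetic features standing in for the cry-acoustic data.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

# Stand-in for the cry-acoustic feature matrix (AAM + MFCC + prosody).
X, y = make_classification(n_samples=400, n_features=40, random_state=0)

search = RandomizedSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_distributions={
        "hidden_layer_sizes": [(64,), (128,), (64, 64), (128, 64)],
        "alpha": loguniform(1e-5, 1e-1),            # L2 penalty
        "learning_rate_init": loguniform(1e-4, 1e-2),
    },
    n_iter=20, cv=10, scoring="accuracy", random_state=0,  # ten-fold CV
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```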

7 pages, 1456 KB  
Proceeding Paper
Towards a More Natural Urdu: A Comprehensive Approach to Text-to-Speech and Voice Cloning
by Muhammad Ramiz Saud, Muhammad Romail Imran and Raja Hashim Ali
Eng. Proc. 2025, 87(1), 112; https://doi.org/10.3390/engproc2025087112 - 20 Oct 2025
Cited by 12 | Viewed by 1304
Abstract
This paper introduces a comprehensive approach to building natural-sounding Urdu Text-to-Speech (TTS) and voice cloning systems, addressing the lack of computational resources for Urdu. We developed a large-scale dataset of over 100 h of Urdu speech, carefully cleaned and phonetically aligned through an automated transcription pipeline to preserve linguistic accuracy. The dataset was then used to fine-tune Tacotron2, a neural network model originally trained for English, with modifications tailored to Urdu’s phonological and morphological features. To further enhance naturalness, we integrated voice cloning techniques that capture regional accents and produce personalized speech outputs. Model performance was evaluated through mean opinion score (MOS), word error rate (WER), and speaker similarity, showing substantial improvements compared to previous Urdu systems. The results demonstrate clear progress toward natural and intelligible Urdu speech synthesis, while also revealing challenges such as handling dialectal variation and preventing model overfitting. This work contributes an essential resource and methodology for advancing Urdu natural language processing (NLP), with promising applications in education, accessibility, entertainment, and assistive technologies.
(This article belongs to the Proceedings of The 5th International Electronic Conference on Applied Sciences)
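
For context, the word error rate (WER) used in the evaluation is the word-level Levenshtein distance between reference and hypothesis transcripts, divided by the reference length. A minimal reference implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edits to turn the first i reference words into the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / max(len(r), 1)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ~ 0.33
```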

41 pages, 849 KB  
Article
HEUXIVA: A Set of Heuristics for Evaluating User eXperience with Voice Assistants
by Daniela Quiñones, Luis Felipe Rojas, Camila Serrá, Jessica Ramírez, Viviana Barrientos and Sandra Cano
Appl. Sci. 2025, 15(20), 11178; https://doi.org/10.3390/app152011178 - 18 Oct 2025
Viewed by 804
Abstract
Voice assistants have become increasingly common in everyday devices such as smartphones and smart speakers. Improving their user experience (UX) is crucial to ensuring usability, acceptance, and long-term effectiveness. Heuristic evaluation is a widely used method for UX evaluation due to its efficiency in detecting problems quickly and at low cost. Nonetheless, existing usability/UX heuristics were not designed to address the specific challenges of voice-based interaction, which relies on spoken dialog and auditory feedback. To overcome this limitation, we developed HEUXIVA, a set of 13 heuristics tailored to evaluating UX with voice assistants. The proposal was created through a structured methodology and refined in two iterations. We validated HEUXIVA through heuristic evaluations, expert judgment, and user testing. The results offer preliminary but consistent evidence supporting the effectiveness of HEUXIVA in identifying UX issues specific to the voice assistant “Google Nest Mini”. Experts described the heuristics as clear, practical, and easy to use. They also highlighted their usefulness in evaluating interaction features and supporting the overall UX evaluation process. HEUXIVA therefore provides designers, researchers, and practitioners with a specialized tool to improve the quality of voice assistant interfaces and increase user satisfaction.
(This article belongs to the Special Issue Emerging Technologies in Innovative Human–Computer Interactions)
