Search Results (65)

Search Parameters:
Keywords = voice control interface

19 pages, 3031 KB  
Article
Voice, Text, or Embodied AI Avatar? Effects of Generative AI Interface Modalities in VR Museums
by Pakinee Ariya, Perasuk Worragin, Songpon Khanchai, Darin Poollapalin and Phichete Julrode
Informatics 2026, 13(3), 42; https://doi.org/10.3390/informatics13030042 - 11 Mar 2026
Viewed by 653
Abstract
Virtual museums delivered through immersive virtual reality (VR) function as information environments where users access interpretive content while navigating spatially. With the integration of generative artificial intelligence (AI), conversational assistants can dynamically mediate information interaction; however, evidence remains limited regarding how different AI interface representations affect user experience. This study compares three generative AI interface modalities in a VR virtual museum: voice only, voice with synchronized text, and voice with an embodied AI avatar. A controlled experiment with 75 participants examined their effects on user engagement, perceived information quality, and subjective cognitive workload while holding informational content constant. The results indicate that the voice-and-text modality produced the highest perceived information quality, whereas the embodied AI avatar modality yielded the highest user engagement. No significant differences were observed in cognitive workload across modalities. These findings suggest that AI interface modalities play complementary roles in VR-based information interaction and provide design guidance for selecting appropriate AI representations in immersive information systems. Full article

17 pages, 14849 KB  
Article
A Collaborative Robotic System for Autonomous Object Handling with Natural User Interaction
by Federico Neri, Gaetano Lettera, Giacomo Palmieri and Massimo Callegari
Robotics 2026, 15(3), 49; https://doi.org/10.3390/robotics15030049 - 27 Feb 2026
Viewed by 527
Abstract
In Industry 5.0, the transition from fixed traditional automation to flexible human–robot collaboration (HRC) requires interfaces that are both intuitive and efficient. This paper introduces a novel, multimodal control system for autonomous object handling, specifically designed to enhance natural user interaction in dynamic work environments. The system integrates a 6-Degrees of Freedom (DoF) collaborative robot (UR5e) with a hand-eye RGB-D vision system to achieve robust autonomy. The core technical contribution lies in a vision pipeline utilizing deep learning for object detection and point cloud processing for accurate 6D pose estimation, enabling advanced tasks such as human-aware object handover directly onto the operator’s hand. Crucially, an Automatic Speech Recognition (ASR) module is incorporated, providing a Natural Language Understanding (NLU) layer that allows operators to issue real-time commands for task modification, error correction and object selection. Experimental results demonstrate that this multimodal approach offers a streamlined workflow aiming to improve operational flexibility compared to traditional HMIs, while enhancing the perceived naturalness of the collaborative task. The system establishes a framework for highly responsive and intuitive human–robot workspaces, advancing the state of the art in natural interaction for collaborative object manipulation. Full article
(This article belongs to the Special Issue Human–Robot Collaboration in Industry 5.0)

27 pages, 15108 KB  
Article
Inclusive Digital Gaming Platform
by Rodrigo Mendonça, Salvador Lopes, Ângela Oliveira, Paulo Serra and Filipe Fidalgo
Multimedia 2026, 2(1), 4; https://doi.org/10.3390/multimedia2010004 - 27 Feb 2026
Viewed by 420
Abstract
The lack of accessibility in digital gaming platforms remains a significant barrier to equitable user participation. To address this issue, this article presents an inclusive solution developed as a multimedia project designed to promote access to digital games for any user through the ipcb.games platform. The platform offers features that enhance accessibility, including voice-based authentication, voice-assisted registration, facial recognition, visual and auditory feedback, and a simplified interface. It also enables users to submit their own games for subsequent approval and integration. The development process followed a multimedia project methodology, structured into phases of analysis, planning, design, production, testing, and validation. The proposal was informed by a systematic review of scientific literature on digital inclusion and accessibility, complemented by a comparative analysis of existing platforms. During usability testing, the platform was evaluated by approximately 50 teachers from different educational levels, who provided highly positive feedback. Future work includes implementing voice-controlled gameplay, enabling keyboard-based navigation, re-implementing a functional eye-tracking system, and creating pedagogical groups, further strengthening the platform’s role in educational contexts. Full article

20 pages, 636 KB  
Article
Using Denoising Diffusion Model for Predicting Global Style Tokens in an Expressive Text-to-Speech System
by Wiktor Prosowicz and Tomasz Hachaj
Electronics 2025, 14(23), 4759; https://doi.org/10.3390/electronics14234759 - 3 Dec 2025
Viewed by 1039
Abstract
Text-to-speech (TTS) systems based on neural networks have undergone a significant evolution, taking a step forward towards achieving human-like quality and expressiveness, which is crucial for applications such as social media content creation and voice interfaces for visually impaired individuals. An entire branch of research, known as Expressive Text-to-speech (ETTS), has emerged to address the so-called one-to-many mapping problem, which limits the naturalness of generated output. However, most ETTS systems applying explicit style modeling treat the prediction of prosodic features as a regressive, rather than generative, process and, consequently, do not capture prosodic diversity. We address this problem by proposing a novel technique for inference-time prediction of speaking-style features, which leverages a diffusion framework for sampling from a learned space of Global Style Tokens-based embeddings, which are then used to condition a neural TTS model. By incorporating the diffusion model, we can leverage its powerful modeling capabilities to learn the distribution of possible stylistic features and, during inference, sample them non-deterministically, which makes the generated speech more human-like by alleviating prosodic monotony across multiple sentences. Our system blends a regressive predictor with a diffusion-based generator to enable smooth control over the diversity of generated speech. Through quantitative and qualitative (human-centered) experiments, we demonstrated that our system generates expressive human speech with non-deterministic high-level prosodic features. Full article
(This article belongs to the Special Issue Advances in Algorithm Optimization and Computational Intelligence)
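
The core idea, sampling a style embedding from a learned distribution rather than regressing it, can be illustrated with a standard ancestral DDPM loop. The sketch below is only an illustration under assumed settings: the placeholder EpsPredictor network, the 256-dimensional embedding, and the linear beta schedule are not taken from the paper.

```python
# Illustrative sketch (not the authors' code): ancestral DDPM sampling of a
# Global Style Token (GST) style vector that could then condition a TTS model.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class EpsPredictor(nn.Module):
    """Placeholder network predicting the noise added to a style embedding."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 512), nn.SiLU(),
                                 nn.Linear(512, 512), nn.SiLU(),
                                 nn.Linear(512, dim))
    def forward(self, x, t):
        # Condition on the normalized timestep by simple concatenation.
        t_feat = t.float().view(-1, 1) / T
        return self.net(torch.cat([x, t_feat], dim=-1))

@torch.no_grad()
def sample_style_embedding(model, dim=256):
    """Draw one non-deterministic style embedding by reversing the diffusion."""
    x = torch.randn(1, dim)                       # start from pure noise
    for t in reversed(range(T)):
        eps = model(x, torch.tensor([t]))
        coef = (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise   # sigma_t^2 = beta_t variant
    return x                                      # GST-space vector to condition the TTS model

# style = sample_style_embedding(EpsPredictor())  # untrained here, for shape only
```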

22 pages, 2265 KB  
Article
A Secure and Robust Multimodal Framework for In-Vehicle Voice Control: Integrating Bilingual Wake-Up, Speaker Verification, and Fuzzy Command Understanding
by Zhixiong Zhang, Yao Li, Wen Ren and Xiaoyan Wang
Eng 2025, 6(11), 319; https://doi.org/10.3390/eng6110319 - 10 Nov 2025
Viewed by 1317
Abstract
Intelligent in-vehicle voice systems face critical challenges in robustness, security, and semantic flexibility under complex acoustic conditions. To address these issues holistically, this paper proposes a novel multimodal and secure voice-control framework. The system integrates a hybrid dual-channel wake-up mechanism, combining a commercial English engine (Picovoice) with a custom lightweight ResNet-Lite model for Chinese, to achieve robust cross-lingual activation. For reliable identity authentication, an optimized ECAPA-TDNN model is introduced, enhanced with spectral augmentation, sliding window feature fusion, and an adaptive threshold mechanism. Furthermore, a two-tier fuzzy command matching algorithm operating at character and pinyin levels is designed to significantly improve tolerance to speech variations and ASR errors. Comprehensive experiments on a test set encompassing various Chinese dialects, English accents, and noise environments demonstrate that the proposed system achieves high performance across all components: the wake-up mechanism maintains commercial-grade reliability for English and provides a functional baseline for Chinese; the improved ECAPA-TDNN attains low equal error rates of 2.37% (quiet), 5.59% (background music), and 3.12% (high-speed noise), outperforming standard baselines and showing strong noise robustness against the state of the art; and the fuzzy matcher boosts command recognition accuracy to over 95.67% in quiet environments and above 92.7% under noise, substantially outperforming hard matching by approximately 30%. End-to-end tests confirm an overall interaction success rate of 93.7%. This work offers a practical, integrated solution for developing secure, robust, and flexible voice interfaces in intelligent vehicles. Full article
(This article belongs to the Section Electrical and Electronic Engineering)
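
The two-tier fuzzy matching idea (character level first, then pinyin level) can be sketched roughly as follows. This is not the paper's implementation: the command list, thresholds, and the use of difflib and pypinyin are assumptions for illustration only.

```python
# Minimal two-tier fuzzy command matcher: exact-ish character matching first,
# then a pinyin-level pass that absorbs homophone-style ASR confusions.
from difflib import SequenceMatcher
from pypinyin import lazy_pinyin   # converts Chinese characters to pinyin syllables

COMMANDS = ["打开空调", "关闭车窗", "播放音乐"]   # hypothetical in-vehicle commands

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def match_command(asr_text: str, char_thresh=0.8, pinyin_thresh=0.7):
    """Return the best-matching command, tolerating ASR substitution errors."""
    # Tier 1: direct character-level fuzzy match.
    best = max(COMMANDS, key=lambda c: similarity(asr_text, c))
    if similarity(asr_text, best) >= char_thresh:
        return best
    # Tier 2: pinyin-level match, recovering characters that sound alike.
    asr_py = " ".join(lazy_pinyin(asr_text))
    best = max(COMMANDS, key=lambda c: similarity(asr_py, " ".join(lazy_pinyin(c))))
    if similarity(asr_py, " ".join(lazy_pinyin(best))) >= pinyin_thresh:
        return best
    return None   # reject and ask the user to repeat

# A homophone confusion ("车床" for "车窗") fails the character check but is
# recovered at the pinyin level.
print(match_command("关闭车床"))   # -> 关闭车窗
```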

12 pages, 890 KB  
Article
Control Modality and Accuracy on the Trust and Acceptance of Construction Robots
by Daeguk Lee, Donghun Lee, Jae Hyun Jung and Taezoon Park
Appl. Sci. 2025, 15(21), 11827; https://doi.org/10.3390/app152111827 - 6 Nov 2025
Viewed by 841
Abstract
This study investigates how control modalities and recognition accuracy influence construction workers’ trust and acceptance of collaborative robots. Sixty participants evaluated voice and gesture control under varying levels of recognition accuracy while performing tiling together with collaborative robots. Experimental results indicated that recognition accuracy significantly affected perceived enjoyment (PE, p = 0.010), ease of use (PEOU, p = 0.030), and intention to use (ITU, p = 0.022), but not trust, usefulness (PU), or attitude (ATT). Furthermore, the interaction between control modality and accuracy shaped most acceptance factors (PE, p = 0.049; PEOU, p = 0.006; PU, p = 0.006; ATT, p = 0.003, and ITU, p < 0.001) except trust. In general, high recognition accuracy enhanced user experience and adoption intentions. Voice interfaces were favored when recognition accuracy was high, whereas gesture interfaces were more acceptable under low-accuracy conditions. These findings highlight the importance of designing high-accuracy, task-appropriate interfaces to support technology acceptance in construction. The preference for voice interfaces under accurate conditions aligns with the noisy, fast-paced nature of construction sites, where efficiency is paramount. By contrast, gesture interfaces offer resilience when recognition errors occur. The study provides practical guidance for robot developers, interface designers, and construction managers, emphasizing that carefully matching interaction modalities and accuracy levels to on-site demands can improve acceptance and long-term adoption in this traditionally conservative sector. Full article
(This article belongs to the Special Issue Robot Control in Human–Computer Interaction)

17 pages, 2127 KB  
Article
Leveraging Large Language Models for Real-Time UAV Control
by Kheireddine Choutri, Samiha Fadloun, Ayoub Khettabi, Mohand Lagha, Souham Meshoul and Raouf Fareh
Electronics 2025, 14(21), 4312; https://doi.org/10.3390/electronics14214312 - 2 Nov 2025
Cited by 2 | Viewed by 2899
Abstract
As drones become increasingly integrated into civilian and industrial domains, the demand for natural and accessible control interfaces continues to grow. Conventional manual controllers require technical expertise and impose cognitive overhead, limiting their usability in dynamic and time-critical scenarios. To address these limitations, this paper presents a multilingual voice-driven control framework for quadrotor drones, enabling real-time operation in both English and Arabic. The proposed architecture combines offline Speech-to-Text (STT) processing with large language models (LLMs) to interpret spoken commands and translate them into executable control code. Specifically, Vosk is employed for bilingual STT, while Google Gemini provides semantic disambiguation, contextual inference, and code generation. The system is designed for continuous, low-latency operation within an edge–cloud hybrid configuration, offering an intuitive and robust human–drone interface. While speech recognition and safety validation are processed entirely offline, high-level reasoning and code generation currently rely on cloud-based LLM inference. Experimental evaluation demonstrates an average speech recognition accuracy of 95% and end-to-end command execution latency between 300 and 500 ms, validating the feasibility of reliable, multilingual, voice-based UAV control. This research advances multimodal human–robot interaction by showcasing the integration of offline speech recognition and LLMs for adaptive, safe, and scalable aerial autonomy. Full article
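
A minimal sketch of the offline speech-to-text front end is shown below, assuming the Vosk Python bindings and PyAudio. The model path is a placeholder, and interpret_with_llm() is a hypothetical stub standing in for the cloud LLM step that turns a transcript into executable control code.

```python
# Offline STT loop: decode microphone audio with Vosk, hand full utterances to
# an LLM-backed interpreter (stubbed here) that would emit drone control code.
import json
import pyaudio
from vosk import Model, KaldiRecognizer

def interpret_with_llm(transcript: str) -> str:
    """Hypothetical stub for the cloud LLM call described in the paper."""
    return f"# generated control code for: {transcript}"

model = Model("model-small-en")                  # placeholder path to a downloaded Vosk model
recognizer = KaldiRecognizer(model, 16000)

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                 input=True, frames_per_buffer=8000)
stream.start_stream()

while True:
    data = stream.read(4000, exception_on_overflow=False)
    if recognizer.AcceptWaveform(data):          # a complete utterance was decoded
        text = json.loads(recognizer.Result()).get("text", "")
        if text:
            print("heard:", text)                # e.g. "take off and hover at two meters"
            print(interpret_with_llm(text))
```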

28 pages, 4508 KB  
Article
Mixed Reality-Based Multi-Scenario Visualization and Control in Automated Terminals: A Middleware and Digital Twin Driven Approach
by Yubo Wang, Enyu Zhang, Ang Yang, Keshuang Du and Jing Gao
Buildings 2025, 15(21), 3879; https://doi.org/10.3390/buildings15213879 - 27 Oct 2025
Viewed by 1339
Abstract
This study presents a Digital Twin–Mixed Reality (DT–MR) framework for the immersive and interactive supervision of automated container terminals (ACTs), addressing the fragmented data and limited situational awareness of conventional 2D monitoring systems. The framework employs a middleware-centric architecture that integrates heterogeneous subsystems—covering terminal operation, equipment control, and information management—through standardized industrial communication protocols. It ensures synchronized timestamps and delivers semantically aligned, low-latency data streams to a multi-scale Digital Twin developed in Unity. The twin applies level-of-detail modeling, spatial anchoring, and coordinate alignment (from Industry Foundation Classes (IFCs) to east–north–up (ENU) coordinates and Unity space) for accurate registration with physical assets, while a Microsoft HoloLens 2 device provides an intuitive Mixed Reality interface that combines gaze, gesture, and voice commands with built-in safety interlocks for secure human–machine interaction. Quantitative performance benchmarks—latency ≤100 ms, status refresh ≤1 s, and throughput ≥10,000 events/s—were met through targeted engineering and validated using representative scenarios of quay crane alignment and automated guided vehicle (AGV) rerouting, demonstrating improved anomaly detection, reduced decision latency, and enhanced operational resilience. The proposed DT–MR pipeline establishes a reproducible and extensible foundation for real-time, human-in-the-loop supervision across ports, airports, and other large-scale smart infrastructures. Full article
(This article belongs to the Special Issue Digital Technologies, AI and BIM in Construction)
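
The coordinate-alignment step can be illustrated with a small sketch that expresses a geodetic asset position in a local east-north-up (ENU) frame and remaps it to Unity's left-handed, Y-up convention. The use of pymap3d, the reference origin, and the East-to-X, Up-to-Y, North-to-Z mapping are assumptions for illustration, not the paper's code.

```python
# Geodetic position -> local ENU frame anchored at the terminal origin -> Unity axes.
import pymap3d as pm

ORIGIN = (31.2304, 121.4737, 0.0)        # hypothetical terminal reference point (lat, lon, h)

def geodetic_to_unity(lat: float, lon: float, h: float):
    """Return a Unity-style (x, y, z) tuple for a geodetic coordinate."""
    e, n, u = pm.geodetic2enu(lat, lon, h, *ORIGIN)
    return (e, u, n)                      # Unity convention assumed: X = East, Y = Up, Z = North

print(geodetic_to_unity(31.2310, 121.4745, 2.5))
```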

26 pages, 7995 KB  
Article
Smart Home Control Using Real-Time Hand Gesture Recognition and Artificial Intelligence on Raspberry Pi 5
by Thomas Hobbs and Anwar Ali
Electronics 2025, 14(20), 3976; https://doi.org/10.3390/electronics14203976 - 10 Oct 2025
Viewed by 4323
Abstract
This paper outlines the process of developing a low-cost system for home appliance control via real-time hand gesture classification using Computer Vision and a custom lightweight machine learning model. The system strives to enable those with speech or hearing disabilities to interface with smart home devices in real time using hand gestures, similar to what is currently possible with voice-activated ‘smart assistants’. The system runs on a Raspberry Pi 5 to enable future IoT integration and reduce costs, and uses the official camera module v2 and 7-inch touchscreen. Frame preprocessing uses MediaPipe to assign hand coordinates and NumPy tools to normalise them; a machine learning model then predicts the gesture. The model, a feed-forward network consisting of five fully connected layers, was built using Keras 3 and compiled with TensorFlow Lite. Training data utilised the HaGRIDv2 dataset, reduced from its original 23 one- and two-handed gestures to 15 one-handed gestures. Training on this data returned validation metrics of 0.90 accuracy and 0.31 loss. The system can control both analogue and digital hardware via GPIO pins and, when recognising a gesture, averages 20.4 frames per second with no observable delay. Full article
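
A rough sketch of this pipeline (MediaPipe landmarks, normalisation, a small dense classifier) is given below. The layer sizes, normalisation scheme, and gesture-to-action mapping are illustrative assumptions rather than the authors' exact configuration, and the model here is untrained.

```python
# Landmark extraction -> normalisation -> feed-forward gesture classification.
import cv2
import numpy as np
import mediapipe as mp
from tensorflow import keras

NUM_GESTURES = 15

def normalise(landmarks):
    """Make the 21 hand landmarks translation- and scale-invariant."""
    pts = np.array([[lm.x, lm.y] for lm in landmarks], dtype=np.float32)
    pts -= pts[0]                              # wrist-relative coordinates
    scale = float(np.max(np.abs(pts))) or 1.0
    return (pts / scale).flatten()             # 42-dimensional feature vector

model = keras.Sequential([                     # five fully connected layers, sizes assumed
    keras.layers.Input(shape=(42,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(NUM_GESTURES, activation="softmax"),
])
# In practice the trained weights would be loaded here before inference.

hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.6)
cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    result = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if result.multi_hand_landmarks:
        features = normalise(result.multi_hand_landmarks[0].landmark)
        gesture_id = int(np.argmax(model.predict(features[None, :], verbose=0)))
        print("gesture:", gesture_id)          # would be mapped to a GPIO action
```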

15 pages, 6691 KB  
Proceeding Paper
Smart Customizable Spinning System
by Wei-Chuan Lin, Yu-Wen Hsu and Wan-Lin Yu
Eng. Proc. 2025, 108(1), 46; https://doi.org/10.3390/engproc2025108046 - 12 Sep 2025
Viewed by 597
Abstract
As global obesity rates rise, cardiovascular diseases increase and stress-related issues become more severe, heightening public awareness of health and exercise. However, existing spinning fitness equipment lacks personalized customization for individual needs. To address this, we developed a smart customizable spinning system that integrates health monitoring, central computation, flywheel, voice interaction, notification, and query subsystems. Users can set fitness goals based on their personal needs, monitor workout data via sensors, and use voice interaction and control to track their exercise status in real time. The system notifies users of workout progress through a buzzer and Message Queuing Telemetry Transport (MQTT), while the Web interface provides access to past workouts and health records. Additionally, the system supports bilingual functionality (Chinese and English), allowing users to operate it in their preferred language and enhancing global usability. Full article
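
The MQTT-based notification path can be sketched in a few lines, assuming the paho-mqtt client. The broker address, topic name, and payload fields are placeholders rather than the system's actual configuration.

```python
# Publish a workout-progress update that a notification subsystem could subscribe to.
import json
import paho.mqtt.publish as publish

progress = {"user": "rider01", "target_rpm": 90, "current_rpm": 84, "goal_pct": 72}
publish.single("spinning/progress", json.dumps(progress),
               hostname="broker.local", qos=1)   # hypothetical on-premise broker
```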

11 pages, 1005 KB  
Proceeding Paper
Multimodal Fusion for Enhanced Human–Computer Interaction
by Ajay Sharma, Isha Batra, Shamneesh Sharma and Anggy Pradiftha Junfithrana
Eng. Proc. 2025, 107(1), 81; https://doi.org/10.3390/engproc2025107081 - 10 Sep 2025
Cited by 2 | Viewed by 2102
Abstract
Our paper introduces a novel virtual mouse concept driven by gesture detection, eye-tracking, and voice recognition. The system uses computer vision and machine learning to let users command and control the mouse pointer using eye motions, voice commands, or hand gestures. Its main goal is to provide an easy and engaging interface for users who want a more natural, hands-free way of interacting with their computers, as well as for those whose impairments limit bodily motion, such as people with paralysis. By combining multiple input modalities, the system improves accessibility and usability and offers a flexible solution for a wide range of users. The speech recognition function permits hands-free operation via voice instructions, while the eye-tracking component detects and responds to the user’s gaze to provide precise cursor control. Gesture recognition complements these features by letting users execute mouse operations with simple hand movements. This technology not only improves the user experience for people with impairments but also marks a notable development in human–computer interaction, showing how computer vision and machine learning can be used to build more inclusive and flexible user interfaces that improve the accessibility and efficiency of computer use for everyone. Full article

20 pages, 2732 KB  
Article
Redesigning Multimodal Interaction: Adaptive Signal Processing and Cross-Modal Interaction for Hands-Free Computer Interaction
by Bui Hong Quan, Nguyen Dinh Tuan Anh, Hoang Van Phi and Bui Trung Thanh
Sensors 2025, 25(17), 5411; https://doi.org/10.3390/s25175411 - 2 Sep 2025
Cited by 1 | Viewed by 1649
Abstract
Hands-free computer interaction is a key topic in assistive technology, with camera-based and voice-based systems being the most common methods. Recent camera-based solutions leverage facial expressions or head movements to simulate mouse clicks or key presses, while voice-based systems enable control via speech commands, wake-word detection, and vocal gestures. However, existing systems often suffer from limitations in responsiveness and accuracy, especially under real-world conditions. In this paper, we present 3-Modal Human-Computer Interaction (3M-HCI), a novel interaction system that dynamically integrates facial, vocal, and eye-based inputs through a new signal processing pipeline and a cross-modal coordination mechanism. This approach not only enhances recognition accuracy but also reduces interaction latency. Experimental results demonstrate that 3M-HCI outperforms several recent hands-free interaction solutions in both speed and precision, highlighting its potential as a robust assistive interface. Full article
(This article belongs to the Section Sensing and Imaging)

25 pages, 19135 KB  
Article
Development of a Multi-Platform AI-Based Software Interface for the Accompaniment of Children
by Isaac León, Camila Reyes, Iesus Davila, Bryan Puruncajas, Dennys Paillacho, Nayeth Solorzano, Marcelo Fajardo-Pruna, Hyungpil Moon and Francisco Yumbla
Multimodal Technol. Interact. 2025, 9(9), 88; https://doi.org/10.3390/mti9090088 - 26 Aug 2025
Viewed by 2220
Abstract
The absence of parental presence has a direct impact on the emotional stability and social routines of children, especially during extended periods of separation from their family environment, as in the case of daycare centers, hospitals, or when they remain alone at home. At the same time, the technology currently available to provide emotional support in these contexts remains limited. In response to the growing need for emotional support and companionship in child care, this project proposes the development of a multi-platform software architecture based on artificial intelligence (AI), designed to be integrated into humanoid robots that assist children between the ages of 6 and 14. The system enables daily verbal and non-verbal interactions intended to foster a sense of presence and personalized connection through conversations, games, and empathetic gestures. Built on the Robot Operating System (ROS), the software incorporates modular components for voice command processing, real-time facial expression generation, and joint movement control. These modules allow the robot to hold natural conversations, display dynamic facial expressions on its LCD (Liquid Crystal Display) screen, and synchronize gestures with spoken responses. Additionally, a graphical interface enhances the coherence between dialogue and movement, thereby improving the quality of human–robot interaction. Initial evaluations conducted in controlled environments assessed the system’s fluency, responsiveness, and expressive behavior. Subsequently, it was implemented in a pediatric hospital in Guayaquil, Ecuador, where it accompanied children during their recovery. It was observed that this type of artificial intelligence-based software can significantly enhance the experience of children, opening promising opportunities for its application in clinical, educational, recreational, and other child-centered settings. Full article
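
A minimal rclpy sketch of this modular, topic-based design is shown below: one node listens for transcribed voice commands and publishes a matching facial-expression request. The topic names and the command-to-expression mapping are hypothetical, not taken from the paper.

```python
# Bridge node: subscribe to transcribed voice commands, publish an expression request.
import rclpy
from rclpy.node import Node
from std_msgs.msg import String

EXPRESSIONS = {"hola": "smile", "juguemos": "excited", "adios": "sad"}  # assumed mapping

class CompanionBridge(Node):
    def __init__(self):
        super().__init__("companion_bridge")
        self.face_pub = self.create_publisher(String, "/face_expression", 10)
        self.create_subscription(String, "/voice_command", self.on_command, 10)

    def on_command(self, msg: String):
        expression = EXPRESSIONS.get(msg.data.strip().lower(), "neutral")
        out = String()
        out.data = expression
        self.face_pub.publish(out)              # a display module would render it on the LCD
        self.get_logger().info(f"command '{msg.data}' -> expression '{expression}'")

def main():
    rclpy.init()
    rclpy.spin(CompanionBridge())

if __name__ == "__main__":
    main()
```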

20 pages, 3244 KB  
Article
SOUTY: A Voice Identity-Preserving Mobile Application for Arabic-Speaking Amyotrophic Lateral Sclerosis Patients Using Eye-Tracking and Speech Synthesis
by Hessah A. Alsalamah, Leena Alhabrdi, May Alsebayel, Aljawhara Almisned, Deema Alhadlaq, Loody S. Albadrani, Seetah M. Alsalamah and Shada AlSalamah
Electronics 2025, 14(16), 3235; https://doi.org/10.3390/electronics14163235 - 14 Aug 2025
Viewed by 1379
Abstract
Amyotrophic Lateral Sclerosis (ALS) is a neurodegenerative disorder that progressively impairs motor and communication abilities. Globally, the prevalence of ALS was estimated at approximately 222,800 cases in 2015 and is projected to increase by nearly 70% to 376,700 cases by 2040, primarily driven by demographic shifts in aging populations, and the lifetime risk of developing ALS is 1 in 350–420. Despite international advancements in assistive technologies, a recent national survey in Saudi Arabia revealed that 100% of ALS care providers lack access to eye-tracking communication tools, and 92% reported communication aids as inconsistently available. While assistive technologies such as speech-generating devices and gaze-based control systems have made strides in recent decades, they primarily support English speakers, leaving Arabic-speaking ALS patients underserved. This paper presents SOUTY, a cost-effective, mobile-based application that empowers ALS patients to communicate using gaze-controlled interfaces combined with a text-to-speech (TTS) feature in the Arabic language, one of the five most widely spoken languages in the world. SOUTY (i.e., “my voice”) utilizes a personalized, pre-recorded voice bank of the ALS patient and integrated eye-tracking technology to support the formation and vocalization of custom phrases in Arabic. This study describes the full development life cycle of SOUTY, from conceptualization and requirements gathering to system architecture, implementation, evaluation, and refinement. Validation included interviews with experts in Human–Computer Interaction (HCI) and speech pathology, as well as a public survey assessing awareness and technological readiness. The results support SOUTY as a culturally and linguistically relevant innovation that enhances autonomy and quality of life for Arabic-speaking ALS patients. This approach may serve as a replicable model for developing inclusive Augmentative and Alternative Communication (AAC) tools in other underrepresented languages. The system achieved 100% task completion during internal walkthroughs, with mean phrase selection times under 5 s and audio playback latency below 0.3 s. Full article

26 pages, 6831 KB  
Article
Human–Robot Interaction and Tracking System Based on Mixed Reality Disassembly Tasks
by Raúl Calderón-Sesmero, Adrián Lozano-Hernández, Fernando Frontela-Encinas, Guillermo Cabezas-López and Mireya De-Diego-Moro
Robotics 2025, 14(8), 106; https://doi.org/10.3390/robotics14080106 - 30 Jul 2025
Cited by 4 | Viewed by 3513
Abstract
Disassembly is a crucial process in industrial operations, especially in tasks requiring high precision and strict safety standards when handling components with collaborative robots. However, traditional methods often rely on rigid and sequential task planning, which makes it difficult to adapt to unforeseen changes or dynamic environments. This rigidity not only limits flexibility but also leads to prolonged execution times, as operators must follow predefined steps that do not allow for real-time adjustments. Although techniques like teleoperation have attempted to address these limitations, they often hinder direct human–robot collaboration within the same workspace, reducing effectiveness in dynamic environments. In response to these challenges, this research introduces an advanced human–robot interaction (HRI) system leveraging a mixed-reality (MR) interface embedded in a head-mounted device (HMD). The system enables operators to issue real-time control commands using multimodal inputs, including voice, gestures, and gaze tracking. These inputs are synchronized and processed via the Robot Operating System (ROS2), enabling dynamic and flexible task execution. Additionally, the integration of deep learning algorithms ensures precise detection and validation of disassembly components, enhancing accuracy. Experimental evaluations demonstrate significant improvements, including reduced task completion times, enhanced operator experience, and strict adherence to safety standards. This scalable solution offers broad applicability for general-purpose disassembly tasks, making it well-suited for complex industrial scenarios. Full article
(This article belongs to the Special Issue Robot Teleoperation Integrating with Augmented Reality)