Search Results (47)

Search Parameters:
Keywords = gesture semantics

27 pages, 4631 KB  
Article
Multimodal Minimal-Angular-Geometry Representation for Real-Time Dynamic Mexican Sign Language Recognition
by Gerardo Garcia-Gil, Gabriela del Carmen López-Armas and Yahir Emmanuel Ramirez-Pulido
Technologies 2026, 14(1), 48; https://doi.org/10.3390/technologies14010048 - 8 Jan 2026
Viewed by 434
Abstract
Current approaches to dynamic sign language recognition commonly rely on dense landmark representations, which impose high computational cost and hinder real-time deployment on resource-constrained devices. To address this limitation, this work proposes a computationally efficient framework for real-time dynamic Mexican Sign Language (MSL) recognition based on a multimodal minimal angular-geometry representation. Instead of processing complete landmark sets (e.g., MediaPipe Holistic with up to 468 keypoints), the proposed method encodes the relational geometry of the hands, face, and upper body into a compact set of 28 invariant internal angular descriptors. This representation substantially reduces feature dimensionality and computational complexity while preserving linguistically relevant manual and non-manual information required for grammatical and semantic discrimination in MSL. A real-time end-to-end pipeline is developed, comprising multimodal landmark extraction, angular feature computation, and temporal modeling using a Bidirectional Long Short-Term Memory (BiLSTM) network. The system is evaluated on a custom dataset of dynamic MSL gestures acquired under controlled real-time conditions. Experimental results demonstrate that the proposed approach achieves 99% accuracy and 99% macro F1-score, matching state-of-the-art performance while using dramatically fewer features. The compactness, interpretability, and efficiency of the minimal angular descriptor make the proposed system suitable for real-time deployment on low-cost devices, contributing toward more accessible and inclusive sign language recognition technologies. Full article
(This article belongs to the Special Issue Image Analysis and Processing)
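
As a rough illustration of the angular-geometry idea summarized above, the sketch below computes internal joint angles from landmark triplets and classifies a sequence of such descriptors with a BiLSTM. The joint triplets, the 28-descriptor budget, the hidden size, and the class count are editorial assumptions, not the authors' reported configuration.

```python
# Minimal sketch: angular descriptors from landmark triplets + BiLSTM classifier.
# Triplet indices, descriptor count, and hyperparameters are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn

def joint_angle(a, b, c):
    """Internal angle (radians) at vertex b formed by points a-b-c."""
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def frame_to_descriptors(landmarks, triplets):
    """landmarks: (N, 3) keypoint array; triplets: list of (i, j, k) index triples."""
    return np.array([joint_angle(landmarks[i], landmarks[j], landmarks[k])
                     for i, j, k in triplets], dtype=np.float32)

class BiLSTMClassifier(nn.Module):
    def __init__(self, n_features=28, hidden=64, n_classes=20):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                 # x: (batch, time, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])   # classify from the final time step

# Usage: a 30-frame gesture, each frame reduced to 28 angles, classified in one pass.
# logits = BiLSTMClassifier()(torch.randn(1, 30, 28))
```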

16 pages, 1701 KB  
Article
Research on YOLOv5s-Based Multimodal Assistive Gesture and Micro-Expression Recognition with Speech Synthesis
by Xiaohua Li and Chaiyan Jettanasen
Computation 2025, 13(12), 277; https://doi.org/10.3390/computation13120277 - 1 Dec 2025
Viewed by 517
Abstract
Effective communication between deaf–mute and visually impaired individuals remains a challenge in the fields of human–computer interaction and accessibility technology. Current solutions mostly rely on single-modal recognition, which often leads to issues such as semantic ambiguity and loss of emotional information. To address these challenges, this study proposes a lightweight multimodal fusion framework that combines gestures and micro-expressions, which are then processed through a recognition network and a speech synthesis module. The core innovations of this research are as follows: (1) a lightweight YOLOv5s improvement structure that integrates residual modules and efficient downsampling modules, which reduces the model complexity and computational overhead while maintaining high accuracy; (2) a multimodal fusion method based on an attention mechanism, which adaptively and efficiently integrates complementary information from gestures and micro-expressions, significantly improving the semantic richness and accuracy of joint recognition; (3) an end-to-end real-time system that outputs the visual recognition results through a high-quality text-to-speech module, completing the closed-loop from “visual signal” to “speech feedback”. We conducted evaluations on the publicly available hand gesture dataset HaGRID and a curated micro-expression image dataset. The results show that, for the joint gesture and micro-expression tasks, our proposed multimodal recognition system achieves a multimodal joint recognition accuracy of 95.3%, representing a 4.5% improvement over the baseline model. The system was evaluated in a locally deployed environment, achieving a real-time processing speed of 22 FPS, with a speech output latency below 0.8 s. The mean opinion score (MOS) reached 4.5, demonstrating the effectiveness of the proposed approach in breaking communication barriers between the hearing-impaired and visually impaired populations. Full article
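
A minimal sketch of the attention-based fusion step described above, assuming pre-extracted gesture and micro-expression feature vectors; the feature dimension, gating form, and class count are illustrative assumptions rather than the paper's architecture.

```python
# Simplified sketch of attention-weighted fusion of gesture and micro-expression
# features; dimensions and the gating form are assumptions for illustration.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim=256, n_classes=30):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(dim, n_classes)

    def forward(self, gesture_feat, expr_feat):   # both: (batch, dim)
        w = self.gate(torch.cat([gesture_feat, expr_feat], dim=-1))  # (batch, 2)
        fused = w[:, :1] * gesture_feat + w[:, 1:] * expr_feat       # adaptive weighting
        return self.classifier(fused)

# logits = AttentionFusion()(torch.randn(4, 256), torch.randn(4, 256))
```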

28 pages, 4508 KB  
Article
Mixed Reality-Based Multi-Scenario Visualization and Control in Automated Terminals: A Middleware and Digital Twin Driven Approach
by Yubo Wang, Enyu Zhang, Ang Yang, Keshuang Du and Jing Gao
Buildings 2025, 15(21), 3879; https://doi.org/10.3390/buildings15213879 - 27 Oct 2025
Viewed by 1116
Abstract
This study presents a Digital Twin–Mixed Reality (DT–MR) framework for the immersive and interactive supervision of automated container terminals (ACTs), addressing the fragmented data and limited situational awareness of conventional 2D monitoring systems. The framework employs a middleware-centric architecture that integrates heterogeneous subsystems—covering terminal operation, equipment control, and information management—through standardized industrial communication protocols. It ensures synchronized timestamps and delivers semantically aligned, low-latency data streams to a multi-scale Digital Twin developed in Unity. The twin applies level-of-detail modeling, spatial anchoring, and coordinate alignment (from Industry Foundation Classes (IFCs) to east–north–up (ENU) coordinates and Unity space) for accurate registration with physical assets, while a Microsoft HoloLens 2 device provides an intuitive Mixed Reality interface that combines gaze, gesture, and voice commands with built-in safety interlocks for secure human–machine interaction. Quantitative performance benchmarks—latency ≤100 ms, status refresh ≤1 s, and throughput ≥10,000 events/s—were met through targeted engineering and validated using representative scenarios of quay crane alignment and automated guided vehicle (AGV) rerouting, demonstrating improved anomaly detection, reduced decision latency, and enhanced operational resilience. The proposed DT–MR pipeline establishes a reproducible and extensible foundation for real-time, human-in-the-loop supervision across ports, airports, and other large-scale smart infrastructures. Full article
(This article belongs to the Special Issue Digital Technologies, AI and BIM in Construction)
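
The abstract mentions coordinate alignment from IFC to ENU to Unity space. Below is a minimal sketch of the final ENU-to-Unity axis remapping, assuming the site origin has already been geo-referenced; the paper's full registration pipeline (IFC handling, spatial anchoring, HoloLens alignment) is not reproduced here.

```python
# Minimal sketch of the ENU -> Unity remapping step: right-handed East-North-Up
# coordinates mapped to Unity's left-handed, Y-up world frame.
def enu_to_unity(east: float, north: float, up: float) -> tuple[float, float, float]:
    """Map an ENU offset (metres) to Unity world coordinates (x=East, y=Up, z=North)."""
    return (east, up, north)

# Example: an asset 12 m east and 40 m north of the site origin, at ground level.
# print(enu_to_unity(12.0, 40.0, 0.0))  # -> (12.0, 0.0, 40.0)
```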

24 pages, 2879 KB  
Article
Skeleton-Based Real-Time Hand Gesture Recognition Using Data Fusion and Ensemble Multi-Stream CNN Architecture
by Maki K. Habib, Oluwaleke Yusuf and Mohamed Moustafa
Technologies 2025, 13(11), 484; https://doi.org/10.3390/technologies13110484 - 26 Oct 2025
Viewed by 1674
Abstract
Hand Gesture Recognition (HGR) is a vital technology that enables intuitive human–computer interaction in various domains, including augmented reality, smart environments, and assistive systems. Achieving both high accuracy and real-time performance remains challenging due to the complexity of hand dynamics, individual morphological variations, and computational limitations. This paper presents a lightweight and efficient skeleton-based HGR framework that addresses these challenges through an optimized multi-stream Convolutional Neural Network (CNN) architecture and a trainable ensemble tuner. Dynamic 3D gestures are transformed into structured, noise-minimized 2D spatiotemporal representations via enhanced data-level fusion, supporting robust classification across diverse spatial perspectives. The ensemble tuner strengthens semantic relationships between streams and improves recognition accuracy. Unlike existing solutions that rely on high-end hardware, the proposed framework achieves real-time inference on consumer-grade devices without compromising accuracy. Experimental validation across five benchmark datasets (SHREC2017, DHG1428, FPHA, LMDHG, and CNR) confirms consistent or superior performance with reduced computational overhead. Additional validation on the SBU Kinect Interaction Dataset highlights generalization potential for broader Human Action Recognition (HAR) tasks. This advancement bridges the gap between efficiency and accuracy, supporting scalable deployment in AR/VR, mobile computing, interactive gaming, and resource-constrained environments. Full article
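
A minimal sketch of a trainable ensemble tuner of the kind described above, merging per-stream CNN predictions into a final classification; the stream count, class count, and tuner depth are assumptions for illustration.

```python
# Sketch of a trainable ensemble tuner over multi-stream CNN outputs; the number of
# streams, class count, and hidden width are illustrative assumptions.
import torch
import torch.nn as nn

class EnsembleTuner(nn.Module):
    """Learns to combine per-stream class probabilities into one prediction."""
    def __init__(self, n_streams=3, n_classes=14):
        super().__init__()
        self.tuner = nn.Sequential(
            nn.Linear(n_streams * n_classes, 64), nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, stream_probs):      # list of (batch, n_classes) tensors
        return self.tuner(torch.cat(stream_probs, dim=-1))

# streams = [torch.softmax(torch.randn(2, 14), dim=-1) for _ in range(3)]
# logits = EnsembleTuner()(streams)
```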

21 pages, 3148 KB  
Article
A Novel Multimodal Hand Gesture Recognition Model Using Combined Approach of Inter-Frame Motion and Shared Attention Weights
by Xiaorui Zhang, Shuaitong Li, Xianglong Zeng, Peisen Lu and Wei Sun
Computers 2025, 14(10), 432; https://doi.org/10.3390/computers14100432 - 13 Oct 2025
Cited by 2 | Viewed by 1178
Abstract
Dynamic hand gesture recognition based on computer vision aims at enabling computers to understand the semantic meaning conveyed by hand gestures in videos. Existing methods predominately rely on spatiotemporal attention mechanisms to extract hand motion features in a large spatiotemporal scope. However, they cannot accurately focus on the moving hand region for hand feature extraction because frame sequences contain a substantial amount of redundant information. Although multimodal techniques can extract a wider variety of hand features, they are less successful at utilizing information interactions between various modalities for accurate feature extraction. To address these challenges, this study proposes a multimodal hand gesture recognition model combining inter-frame motion and shared attention weights. By jointly using an inter-frame motion attention (IFMA) mechanism and adaptive down-sampling (ADS), the spatiotemporal search scope can be effectively narrowed down to the hand-related regions based on the characteristic of hands exhibiting obvious movements. The proposed inter-modal attention weight (IMAW) loss enables RGB and Depth modalities to share attention, allowing each to adjust its distribution based on the other. Experimental results on the EgoGesture, NVGesture, and Jester datasets demonstrate the superiority of our proposed model over existing state-of-the-art methods in terms of hand motion feature extraction and hand gesture recognition accuracy. Full article
(This article belongs to the Special Issue Multimodal Pattern Recognition of Social Signals in HCI (2nd Edition))
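
A rough sketch of an inter-frame motion attention map in the spirit of the IFMA mechanism described above: regions that change between consecutive frames receive higher weight. The normalization and pooling choices are assumptions, not the paper's formulation.

```python
# Rough sketch: attention derived from inter-frame differences, so moving (hand)
# regions are emphasized; normalization choices are assumptions for illustration.
import torch
import torch.nn.functional as F

def motion_attention(frames):
    """frames: (batch, time, channels, H, W) -> attention maps (batch, time-1, 1, H, W)."""
    diff = (frames[:, 1:] - frames[:, :-1]).abs().mean(dim=2, keepdim=True)  # motion energy
    b, t, c, h, w = diff.shape
    return F.softmax(diff.reshape(b, t, -1), dim=-1).reshape(b, t, c, h, w)  # sums to 1 per frame

# attn = motion_attention(torch.randn(1, 8, 3, 112, 112))
# weighted = frame_features * attn   # assuming features share the spatial resolution
```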

36 pages, 6788 KB  
Article
A Performing Arts ICH-Driven Interaction Design Framework for Rehabilitation Games
by Jing Zhao, Xinran Zhang, Yiming Ma, Yi Liu, Siyu Huo, Xiaotong Mu, Qian Xiao and Yuhong Han
Electronics 2025, 14(18), 3739; https://doi.org/10.3390/electronics14183739 - 22 Sep 2025
Cited by 2 | Viewed by 1455 | Correction
Abstract
The lack of deep engagement strategies that include cultural contextualization in current rehabilitation game design can result in limited user motivation and low adherence in long-term rehabilitation. Integrating cultural semantics into interactive rehabilitation design offers new opportunities to enhance user engagement and emotional resonance in digital rehabilitation therapy, especially when the integration goes deeper than visual presentation alone. This study introduces a framework comprising a “Rehabilitation Mechanism–Interaction Design–Cultural Feature” triadic mapping model and a structured procedure. Following the framework, a hand function rehabilitation game is designed based on Chinese string puppetry, as well as body rehabilitation games based on shadow puppetry and Tai Chi. The hand rehabilitation game utilizes Leap Motion for its gesture-based input and Unity3D for real-time visual feedback and task execution. Functional training gestures such as grasping, wrist rotation, and pinching are mapped to culturally meaningful puppet actions within the game. Through task-oriented engagement and narrative immersion, the design improves cognitive accessibility, emotional motivation, and sustained participation. Evaluations are conducted with rehabilitation professionals and target users. The results demonstrate that the system is promising in integrating motor function training with emotional engagement, validating the feasibility of the proposed triadic mapping framework in rehabilitation game design. This study provides a replicable design strategy for human–computer interaction (HCI) researchers working at the intersection of healthcare, cultural heritage, and interactive media. Full article
(This article belongs to the Special Issue Innovative Designs in Human–Computer Interaction)
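
As a small illustration of the gesture-to-puppet-action mapping described above, the sketch below dispatches a recognized training gesture to an in-game action; the gesture labels and action names are hypothetical, not the game's actual assets.

```python
# Illustrative gesture-to-action dispatch; labels and action names are hypothetical.
PUPPET_ACTIONS = {
    "grasp":        "puppet_pick_up_prop",
    "wrist_rotate": "puppet_turn_head",
    "pinch":        "puppet_pluck_string",
}

def dispatch(gesture: str) -> str:
    """Translate a recognized functional-training gesture into a puppet animation cue."""
    return PUPPET_ACTIONS.get(gesture, "puppet_idle")

# dispatch("grasp")  # -> "puppet_pick_up_prop"
```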

22 pages, 3399 KB  
Article
Integrating Cross-Modal Semantic Learning with Generative Models for Gesture Recognition
by Shuangjiao Zhai, Zixin Dai, Zanxia Jin, Pinle Qin and Jianchao Zeng
Sensors 2025, 25(18), 5783; https://doi.org/10.3390/s25185783 - 17 Sep 2025
Viewed by 964
Abstract
Radio frequency (RF)-based human activity sensing is an essential component of ubiquitous computing, with WiFi sensing providing a practical and low-cost solution for gesture and activity recognition. However, challenges such as manual data collection, multipath interference, and poor cross-domain generalization hinder real-world deployment. Existing data augmentation approaches often neglect the biomechanical structure underlying RF signals. To address these limitations, we present CM-GR, a cross-modal gesture recognition framework that integrates semantic learning with generative modeling. CM-GR leverages 3D skeletal points extracted from vision data as semantic priors to guide the synthesis of realistic WiFi signals, thereby incorporating biomechanical constraints without requiring extensive manual labeling. In addition, dynamic conditional vectors are constructed from inter-subject skeletal differences, enabling user-specific WiFi data generation without the need for dedicated data collection and annotation for each new user. Extensive experiments on the public MM-Fi dataset and our SelfSet dataset demonstrate that CM-GR substantially improves the cross-subject gesture recognition accuracy, achieving gains of up to 10.26% and 9.5%, respectively. These results confirm the effectiveness of CM-GR in synthesizing personalized WiFi data and highlight its potential for robust and scalable gesture recognition in practical settings. Full article
(This article belongs to the Section Biomedical Sensors)

18 pages, 392 KB  
Article
Semantic Restoration of Snake-Slaying in Chan Buddhist Koan
by Yun Wang and Yulu Lv
Religions 2025, 16(8), 973; https://doi.org/10.3390/rel16080973 - 27 Jul 2025
Viewed by 1380
Abstract
In the Chan Buddhism koan (gong’an 公案) tradition, the act of “slaying the snake” functions as a signature gesture imbued with complex, historically layered cultural meanings. Rather than merely examining its motivations, this paper emphasizes tracing the semantic transformations that this motif has undergone across different historical contexts. It argues that “snake-slaying” operated variously as an imperial narrative strategy reinforcing ruling class ideology; as a form of popular resistance by commoners against flood-related disasters; as a dietary practice among aristocrats and literati seeking danyao (elixirs) 丹藥 for reclusion and transcendence; and ultimately, within the Chan tradition, as a method of spiritual cultivation whereby masters sever desires rooted in attachment to both selfhood and the Dharma. More specifically, first, as an imperial narrative logic, snake-slaying embodied exemplary power: both Liu Bang 劉邦 and Guizong 歸宗 enacted this discursive strategy, with Guizong’s legitimacy in slaying the snake deriving from the precedent set by Liu Bang. Second, as a folk strategy of demystification, snake-slaying acquired a moral aura—since the snake was perceived as a malevolent force, its slaying appeared righteous and heroic. Finally, as a mode of self-cultivation among the aristocracy, snake-slaying laid the groundwork for its later internalization. In Daoism, slaying the snake was a means of cultivating the body; in Chan Buddhism, the act is elevated to a higher plane—becoming a way of cultivating the mind. This transformation unfolded naturally, as if predestined. In all cases, the internalization of the snake-slaying motif was not an overnight development: the cultural genes that preceded its appearance in the Chan tradition provided the fertile ground for its karmic maturation and discursive proliferation. Full article
20 pages, 1405 KB  
Article
Multimodal Pragmatic Markers of Feedback in Dialogue
by Ludivine Crible and Loulou Kosmala
Languages 2025, 10(6), 117; https://doi.org/10.3390/languages10060117 - 22 May 2025
Cited by 1 | Viewed by 2265
Abstract
Historically, the field of discourse marker research has moved from relying on intuition to more and more ecological data, with written, spoken, and now multimodal corpora available to study these pervasive pragmatic devices. For some topics, video is necessary to capture the complexity of interactive phenomena, such as feedback in dialogue. Feedback is the process of communicating engagement, alignment, and affiliation (or lack thereof) to the other speaker, and has attracted a lot of attention recently, from fields such as psycholinguistics, conversation analysis, or second language acquisition. Feedback can be expressed by a variety of verbal/vocal and visual/gestural devices, from questions to head nods and, crucially, discourse or pragmatic markers such as “okay, alright, yeah”. Verbal-vocal and visual-gestural forms often co-occur, which calls for more investigation of their combinations. In this study, we analyze multimodal pragmatic markers of feedback in a corpus of French dialogues, where all feedback devices have previously been categorized into either “alignment” (expression of mutual understanding) or “affiliation” (expression of shared stance). After describing the distribution and forms within each modality taken separately, we will focus on interesting multimodal combinations, such as [negative oui ‘yes’ + head tilt] or [mais oui ‘but yes’ + forward head move], thus showing how the visual modality can affect the semantics of verbal markers. In doing so, we will contribute to defining multimodal pragmatic markers, a status which has so far been restricted to verbal markers and manual gestures, at the expense of other devices in the visual modality. Full article
(This article belongs to the Special Issue Current Trends in Discourse Marker Research)

51 pages, 41402 KB  
Article
A Digitally Enhanced Ethnography for Craft Action and Process Understanding
by Xenophon Zabulis, Partarakis Nikolaos, Vasiliki Manikaki, Ioanna Demeridou, Arnaud Dubois, Inés Moreno, Valentina Bartalesi, Nicolò Pratelli, Carlo Meghini, Sotiris Manitsaris and Gavriela Senteri
Appl. Sci. 2025, 15(10), 5408; https://doi.org/10.3390/app15105408 - 12 May 2025
Cited by 2 | Viewed by 3513
Abstract
Traditional ethnographic methods have long been employed to study craft practices, yet they often fall short of capturing the full depth of embodied knowledge, material interactions, and procedural workflows inherent in craftsmanship. This paper introduces a digitally enhanced ethnographic framework that integrates Motion Capture, 3D scanning, audiovisual documentation, and semantic knowledge representation to document both the tangible and dynamic aspects of craft processes. By distinguishing between endurant (tools, materials, objects) and perdurant (actions, events, transformations) entities, we propose a structured methodology for analyzing craft gestures, material behaviors, and production workflows. The study applies this proposed framework to eight European craft traditions—including glassblowing, tapestry weaving, woodcarving, porcelain pottery, marble carving, silversmithing, clay pottery, and textile weaving—demonstrating the adaptability of digital ethnographic tools across disciplines. Through a combination of multimodal data acquisition and expert-driven annotation, we present a comprehensive model for craft documentation that enhances the preservation, education, and analysis of artisanal knowledge. This research contributes to the ongoing evolution of ethnographic methods by bridging digital technology with Cultural Heritage studies, offering a robust framework for understanding the mechanics and meanings of craft practices. Full article

16 pages, 1756 KB  
Article
Multi-Scale Parallel Enhancement Module with Cross-Hierarchy Interaction for Video Emotion Recognition
by Lianqi Zhang, Yuan Sun, Jiansheng Guan, Shaobo Kang, Jiangyin Huang and Xungao Zhong
Electronics 2025, 14(9), 1886; https://doi.org/10.3390/electronics14091886 - 6 May 2025
Cited by 1 | Viewed by 805
Abstract
Video emotion recognition faces significant challenges due to the strong spatiotemporal coupling of dynamic expressions and the substantial variations in cross-scale motion patterns (e.g., subtle facial micro-expressions versus large-scale body gestures). Traditional methods, constrained by limited receptive fields, often fail to effectively balance multi-scale correlations between local cues (e.g., transient facial muscle movements) and global semantic patterns (e.g., full-body gestures). To address this, we propose an enhanced attention module integrating multi-dilated convolution and dynamic feature weighting, aimed at improving spatiotemporal emotion feature extraction. Building upon conventional attention mechanisms, the module introduces a multi-branch parallel architecture. Convolutional kernels with varying dilation rates (1, 3, 5) are designed to hierarchically capture the cross-scale spatiotemporal features of low-scale facial micro-motion units (e.g., brief lip tightening), mid-scale composite expression patterns (e.g., furrowed brows combined with cheek raising), and high-scale limb motion trajectories (e.g., sustained arm-crossing). A dynamic feature adapter is further incorporated to enable context-aware adaptive fusion of multi-source heterogeneous features. We conducted extensive ablation studies and experiments on popular benchmark datasets such as the VideoEmotion-8 and Ekman-6 datasets. Experiments demonstrate that the proposed method enhances joint modeling of low-scale cues (e.g., fragmented facial muscle dynamics) and high-scale semantic patterns (e.g., emotion-coherent body language), achieving stronger cross-database generalization. Full article
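
A minimal sketch of the multi-branch parallel structure with dilation rates 1, 3, and 5 described above; the channel width and the learned fusion weighting are illustrative assumptions.

```python
# Sketch of parallel convolutions with dilation rates 1, 3, 5, fused by learned weights;
# channel sizes and the weighting scheme are assumptions for illustration.
import torch
import torch.nn as nn

class MultiDilationBranch(nn.Module):
    """Parallel 3x3 convolutions at different dilation rates, combined adaptively."""
    def __init__(self, channels=64, rates=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )
        self.weights = nn.Parameter(torch.ones(len(rates)))  # dynamic feature weighting

    def forward(self, x):                      # x: (batch, channels, H, W)
        w = torch.softmax(self.weights, dim=0)
        return sum(w[i] * branch(x) for i, branch in enumerate(self.branches))

# out = MultiDilationBranch()(torch.randn(1, 64, 56, 56))   # spatial size is preserved
```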

29 pages, 4936 KB  
Article
Continuous Arabic Sign Language Recognition Models
by Nahlah Algethami, Raghad Farhud, Manal Alghamdi, Huda Almutairi, Maha Sorani and Noura Aleisa
Sensors 2025, 25(9), 2916; https://doi.org/10.3390/s25092916 - 5 May 2025
Cited by 7 | Viewed by 3525
Abstract
A significant communication gap persists between the deaf and hearing communities, often leaving deaf individuals isolated and marginalised. This challenge is especially pronounced for Arabic-speaking individuals, given the lack of publicly available Arabic Sign Language datasets and dedicated recognition systems. This study is the first to use the Temporal Convolutional Network (TCN) model for Arabic Sign Language (ArSL) recognition. We created a custom dataset of the 30 most common sentences in ArSL. We improved recognition performance by enhancing a Recurrent Neural Network (RNN) incorporating a Bidirectional Long Short-Term Memory (BiLSTM) model. Our approach achieved outstanding accuracy results compared to baseline RNN-BiLSTM models. This study contributes to developing recognition systems that could bridge communication barriers for the hearing-impaired community. Through a comparative analysis, we assessed the performance of the TCN and the enhanced RNN architecture in capturing the temporal dependencies and semantic nuances unique to Arabic Sign Language. The models are trained and evaluated using the created dataset of Arabic sign gestures based on recognition accuracy, processing speed, and robustness to variations in signing styles. This research provides insights into the strengths and limitations of TCNs and the enhanced RNN-BiLSTM by investigating their applicability in sign language recognition scenarios. The results indicate that the TCN model achieved an accuracy of 99.5%, while the original RNN-BiLSTM model initially achieved a 96% accuracy but improved to 99% after enhancement. While the accuracy gap between the two models was small, the TCN model demonstrated significant advantages in terms of computational efficiency, requiring fewer resources and achieving faster inference times. These factors make TCNs more practical for real-time sign language recognition applications. Full article
(This article belongs to the Special Issue Sensor-Based Behavioral Biometrics)
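
A minimal sketch of a dilated, causal temporal-convolution (TCN-style) stack over per-frame pose features, of the kind the study compares against an RNN-BiLSTM; the feature dimension, layer widths, and dilation schedule are assumptions, with 30 output classes matching the 30-sentence dataset.

```python
# Minimal TCN-style classifier sketch; sizes and dilations are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TCNBlock(nn.Module):
    def __init__(self, in_ch, out_ch, dilation):
        super().__init__()
        self.pad = (3 - 1) * dilation                       # left-only (causal) padding
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3, dilation=dilation)

    def forward(self, x):                                   # x: (batch, channels, time)
        return torch.relu(self.conv(F.pad(x, (self.pad, 0))))

class SimpleTCN(nn.Module):
    def __init__(self, n_features=126, n_classes=30):
        super().__init__()
        self.blocks = nn.Sequential(
            TCNBlock(n_features, 64, dilation=1),
            TCNBlock(64, 64, dilation=2),
            TCNBlock(64, 64, dilation=4),
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):                                   # x: (batch, time, n_features)
        h = self.blocks(x.transpose(1, 2))                  # -> (batch, 64, time)
        return self.head(h.mean(dim=-1))                    # temporal average, then classify

# logits = SimpleTCN()(torch.randn(2, 60, 126))             # 60 frames of pose features
```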

25 pages, 2508 KB  
Article
OVSLT: Advancing Sign Language Translation with Open Vocabulary
by Ai Wang, Junhui Li, Wuyang Luan and Lei Pan
Electronics 2025, 14(5), 1044; https://doi.org/10.3390/electronics14051044 - 6 Mar 2025
Cited by 1 | Viewed by 3631
Abstract
Hearing impairments affect approximately 1.5 billion individuals worldwide, highlighting the critical need for effective communication tools between deaf and hearing populations. Traditional sign language translation (SLT) models predominantly rely on gloss-based methods, which convert visual sign language inputs into intermediate gloss sequences before generating textual translations. However, these methods are constrained by their reliance on extensive annotated data, susceptibility to error propagation, and inadequate handling of low-frequency or unseen sign language vocabulary, thus limiting their scalability and practical application. Drawing upon multimodal translation theory, this study proposes the open-vocabulary sign language translation (OVSLT) method, designed to overcome these challenges by integrating open-vocabulary principles. OVSLT introduces two pivotal modules: Enhanced Caption Generation and Description (CGD), and Grid Feature Grouping with Advanced Alignment Techniques. The Enhanced CGD module employs a GPT model enhanced with a Negative Retriever and Semantic Retrieval-Augmented Features (SRAF) to produce semantically rich textual descriptions of sign gestures. In parallel, the Grid Feature Grouping module applies Grid Feature Grouping, contrastive learning, feature-discriminative contrastive loss, and balanced region loss scaling to refine visual feature representations, ensuring robust alignment with textual descriptions. We evaluated OVSLT on the PHOENIX-14T and CSLDaily datasets. The results demonstrated a ROUGE score of 29.6% on the PHOENIX-14T dataset and 30.72% on the CSLDaily dataset, significantly outperforming existing models. These findings underscore the versatility and effectiveness of OVSLT, showcasing the potential of open-vocabulary approaches to surpass the limitations of traditional SLT systems and contribute to the evolving field of multimodal translation. Full article
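
A rough sketch of contrastive alignment between pooled visual features and text embeddings of gesture descriptions, in the spirit of the alignment techniques described above; the projection size, temperature, and symmetric InfoNCE form are editorial assumptions, not OVSLT's exact losses.

```python
# Rough sketch of a symmetric contrastive alignment loss between paired visual and
# textual embeddings; temperature and loss form are assumptions for illustration.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(visual, text, temperature=0.07):
    """visual, text: (batch, dim) paired embeddings; matched pairs share a row index."""
    v = F.normalize(visual, dim=-1)
    t = F.normalize(text, dim=-1)
    logits = v @ t.T / temperature                 # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```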

19 pages, 8196 KB  
Article
Human–Robot Interaction Using Dynamic Hand Gesture for Teleoperation of Quadruped Robots with a Robotic Arm
by Jianan Xie, Zhen Xu, Jiayu Zeng, Yuyang Gao and Kenji Hashimoto
Electronics 2025, 14(5), 860; https://doi.org/10.3390/electronics14050860 - 21 Feb 2025
Cited by 13 | Viewed by 6440
Abstract
Human–Robot Interaction (HRI) using hand gesture recognition offers an effective and non-contact approach to enhancing operational intuitiveness and user convenience. However, most existing studies primarily focus on either static sign language recognition or the tracking of hand position and orientation in space. These approaches often prove inadequate for controlling complex robotic systems. This paper proposes an advanced HRI system leveraging dynamic hand gestures for controlling quadruped robots equipped with a robotic arm. The proposed system integrates both semantic and pose information from dynamic gestures to enable comprehensive control over the robot’s diverse functionalities. First, a Depth–MediaPipe framework is introduced to facilitate the precise three-dimensional (3D) coordinate extraction of 21 hand bone keypoints. Subsequently, a Semantic-Pose to Motion (SPM) model is developed to analyze and interpret both the pose and semantic aspects of hand gestures. This model translates the extracted 3D coordinate data into corresponding mechanical actions in real-time, encompassing quadruped robot locomotion, robotic arm end-effector tracking, and semantic-based command switching. Extensive real-world experiments demonstrate the proposed system’s effectiveness in achieving real-time interaction and precise control, underscoring its potential for enhancing the usability of complex robotic platforms. Full article
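
A minimal sketch of extracting the 21 hand keypoints with MediaPipe Hands, the landmark stage underlying the Depth–MediaPipe framework described above; the depth-sensor fusion that yields metric 3D coordinates is not reproduced here.

```python
# Sketch of 21-keypoint hand landmark extraction with MediaPipe Hands. MediaPipe's z is
# only a relative depth estimate; the paper fuses a depth sensor for metric 3D, not shown.
import cv2
import mediapipe as mp
import numpy as np

hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.5)

def extract_keypoints(bgr_frame):
    """Return a (21, 3) array of normalized hand landmark coordinates, or None."""
    result = hands.process(cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB))
    if not result.multi_hand_landmarks:
        return None
    lm = result.multi_hand_landmarks[0].landmark
    return np.array([[p.x, p.y, p.z] for p in lm], dtype=np.float32)

# keypoints = extract_keypoints(cv2.imread("frame.png"))  # input to a pose/semantic model
```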

17 pages, 8323 KB  
Article
A Symmetrical Leech-Inspired Soft Crawling Robot Based on Gesture Control
by Jiabiao Li, Ruiheng Liu, Tianyu Zhang and Jianbin Liu
Biomimetics 2025, 10(1), 35; https://doi.org/10.3390/biomimetics10010035 - 8 Jan 2025
Cited by 2 | Viewed by 1678
Abstract
This paper presents a novel soft crawling robot controlled by gesture recognition, aimed at enhancing the operability and adaptability of soft robots through natural human–computer interactions. The Leap Motion sensor is employed to capture hand gesture data, and Unreal Engine is used for gesture recognition. Using the UE4Duino, gesture semantics are transmitted to an Arduino control system, enabling direct control over the robot’s movements. For accurate and real-time gesture recognition, we propose a threshold-based method for static gestures and a backpropagation (BP) neural network model for dynamic gestures. In terms of design, the robot utilizes cost-effective thermoplastic polyurethane (TPU) film as the primary pneumatic actuator material. Through a positive and negative pressure switching circuit, the robot’s actuators achieve controllable extension and contraction, allowing for basic movements such as linear motion and directional changes. Experimental results demonstrate that the robot can successfully perform diverse motions under gesture control, highlighting the potential of gesture-based interaction in soft robotics. Full article
(This article belongs to the Special Issue Design, Actuation, and Fabrication of Bio-Inspired Soft Robotics)
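
An illustrative sketch of a threshold-based static-gesture rule of the kind mentioned above, classifying a hand pose from per-finger extension; the extension test, thresholds, and gesture-to-command pairing are hypothetical, not the paper's calibration.

```python
# Illustrative threshold-based static gesture rule; thresholds, gesture names, and the
# mapping to robot commands are hypothetical assumptions.
def finger_extended(tip_dist: float, threshold: float = 0.6) -> bool:
    """tip_dist: normalized fingertip-to-palm distance from a Leap Motion frame."""
    return tip_dist > threshold

def classify_static(tip_dists: list[float]) -> str:
    extended = [finger_extended(d) for d in tip_dists]   # thumb .. pinky
    if all(extended):
        return "open_palm"       # e.g. move forward
    if not any(extended):
        return "fist"            # e.g. stop
    if extended == [False, True, True, False, False]:
        return "two_fingers"     # e.g. turn
    return "unknown"

# classify_static([0.8, 0.9, 0.85, 0.7, 0.75])  # -> "open_palm"
```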