Search Results (106)

Search Parameters:
Keywords = gesture representation

17 pages, 561 KB  
Article
Multimodal Shared Autonomy for Heavy-Load UAV Operations with Physics-Aware Cooperative Control
by Xu Gao, Jingfeng Wu, Yuchen Wang, Can Cao, Lihui Wang, Bowen Wang and Yimeng Zhang
Sensors 2026, 26(6), 1997; https://doi.org/10.3390/s26061997 - 23 Mar 2026
Viewed by 119
Abstract
Heavy-load unmanned aerial vehicles (UAVs) are increasingly being applied in logistics, infrastructure installation, and emergency response missions, where complex payload dynamics and unstructured environments pose significant challenges to safe and efficient operation. Conventional manual teleoperation interfaces, such as dual-joystick control, impose a high cognitive workload and provide limited support for expressing high-level operator intent, while fully autonomous solutions remain difficult to deploy reliably under real-world uncertainty. To address these limitations, this paper proposes the Multimodal Fusion Cooperation Network (MFCN), an end-to-end shared autonomy framework that integrates speech commands, visual gestures, and haptic cues through cross-modal feature fusion to infer operator intent in real time. The fused intent representation is translated into dynamically feasible control commands by a cooperative control policy with embedded physics-aware constraints to suppress payload oscillations and ensure flight stability. Extensive semi-physical simulations and real-world experiments demonstrate that the MFCN significantly improves the task success rate, positioning accuracy, and payload stability while reducing the task completion time and operator cognitive workload compared with manual, unimodal, and heuristic multimodal baselines. Full article
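The abstract describes fusing speech, gesture, and haptic features into a single operator-intent representation. As a rough illustration only, the PyTorch sketch below fuses three modality embeddings with attention; every module name, dimension, and the 6-DoF intent head is a hypothetical stand-in, not the authors' MFCN.

```python
# Minimal sketch of cross-modal feature fusion for operator-intent inference.
# All names and dimensions are illustrative assumptions, not the MFCN code.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, speech_dim=128, gesture_dim=256, haptic_dim=64, d_model=128):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.proj_speech = nn.Linear(speech_dim, d_model)
        self.proj_gesture = nn.Linear(gesture_dim, d_model)
        self.proj_haptic = nn.Linear(haptic_dim, d_model)
        # Self-attention over the three modality tokens fuses them.
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.intent_head = nn.Linear(d_model, 6)  # hypothetical 6-DoF intent vector

    def forward(self, speech, gesture, haptic):
        tokens = torch.stack([
            self.proj_speech(speech),
            self.proj_gesture(gesture),
            self.proj_haptic(haptic),
        ], dim=1)                              # (batch, 3 modality tokens, d_model)
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.intent_head(fused.mean(dim=1))

fusion = CrossModalFusion()
intent = fusion(torch.randn(2, 128), torch.randn(2, 256), torch.randn(2, 64))
print(intent.shape)  # torch.Size([2, 6])
```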
(This article belongs to the Special Issue Advanced Sensors and AI Integration for Human–Robot Teaming)

19 pages, 759 KB  
Article
Dual-Stream BiLSTM–Transformer Architecture for Real-Time Two-Handed Dynamic Sign Language Gesture Recognition
by Enachi Andrei, Turcu Corneliu-Octavian, Culea George, Andrioaia Dragos-Alexandru, Ungureanu Andrei-Gabriel and Sghera Bogdan-Constantin
Appl. Sci. 2026, 16(6), 2912; https://doi.org/10.3390/app16062912 - 18 Mar 2026
Viewed by 105
Abstract
Two-handed dynamic gesture recognition represents a fundamental component of sign language interpretation involving the modeling of temporal dependencies and inter-hand coordination. In this task, a major challenge is modeling asymmetric motion patterns, as well as bidirectional and long-range temporal dependencies. Most existing frameworks rely on early fusion strategies that merge joints, keypoints, or landmarks from both hands in early processing stages, primarily to reduce model complexity and enforce a unified representation. In this work, a novel dual-stream BiLSTM–Transformer model architecture is proposed for two-handed dynamic sign language recognition, where parallel encoders process the trajectories of each hand independently. To capture spatial and temporal dependencies for each hand, an attention-based cross-hand fusion mechanism is employed, with hand landmarks extracted by the MediaPipe Hands framework as a preprocessing step to enable real-time CPU-based inference. Experimental evaluation conducted on custom Romanian Sign Language dynamic gesture datasets indicates that the proposed dual-stream-based system outperforms single-handed baselines, achieving higher recognition accuracy for asymmetric gestures and consistent performance gains for synchronized two-handed gestures. The proposed architecture represents an efficient and lightweight solution suitable for real-time sign language recognition and interpretation. Full article
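A minimal sketch of the dual-stream idea: one BiLSTM encoder per hand plus attention-based cross-hand fusion. The 21-landmark input (per MediaPipe Hands), hidden sizes, and classification head are illustrative assumptions, not the published architecture.

```python
# Sketch of a dual-stream encoder with cross-hand attention; all sizes assumed.
import torch
import torch.nn as nn

class DualHandRecognizer(nn.Module):
    def __init__(self, n_landmarks=21, hidden=64, n_classes=10):
        super().__init__()
        feat = n_landmarks * 3  # (x, y, z) per landmark
        # One BiLSTM encoder per hand, applied independently.
        self.left = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        self.right = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        # Cross-hand attention: each hand's sequence attends to the other's.
        self.cross = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(4 * hidden, n_classes)

    def forward(self, left_seq, right_seq):   # (batch, time, feat)
        l, _ = self.left(left_seq)
        r, _ = self.right(right_seq)
        l2r, _ = self.cross(l, r, r)   # left queries attend to right keys/values
        r2l, _ = self.cross(r, l, l)
        pooled = torch.cat([l2r.mean(1), r2l.mean(1)], dim=-1)
        return self.head(pooled)

model = DualHandRecognizer()
logits = model(torch.randn(2, 30, 63), torch.randn(2, 30, 63))
print(logits.shape)  # torch.Size([2, 10])
```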
(This article belongs to the Section Computing and Artificial Intelligence)

19 pages, 2968 KB  
Article
CBAM-Enhanced CNN-LSTM with Improved DBSCAN for High-Precision Radar-Based Gesture Recognition
by Shiwei Yi, Zhenyu Zhao and Tongning Wu
Sensors 2026, 26(6), 1835; https://doi.org/10.3390/s26061835 - 14 Mar 2026
Viewed by 222
Abstract
In recent years, radar-based gesture recognition technology has been widely applied in industrial and daily life scenarios. However, increasingly complex application scenarios have imposed higher demands on the accuracy and robustness of gesture recognition algorithms, and challenges such as clutter interference, inter-gesture similarity, and spatial–temporal feature ambiguity limit recognition performance. To address these challenges, a novel framework named CECL, which incorporates the Convolutional Block Attention Module (CBAM) into a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) architecture, is proposed for high-accuracy radar-based gesture recognition. The CBAM adaptively highlights discriminative spatial regions and suppresses irrelevant background, and the CNN-LSTM network captures temporal dynamics across gesture sequences. During gesture signal processing, the Blackman window is applied to suppress spectral leakage. Additionally, a combination of wavelet thresholding and dynamic energy nulling is employed to effectively suppress clutter and enhance feature representation. Furthermore, an improved Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm further eliminates isolated sparse noise while preserving dense and valid target signal regions. Experimental results demonstrate that the proposed algorithm achieves 98.33% average accuracy in gesture classification, outperforming other baseline models. It exhibits excellent recognition performance across various distances and angles, demonstrating significantly enhanced robustness. Full article
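CBAM itself is a published, well-documented module; the sketch below follows its standard channel-then-spatial attention design with common default hyperparameters, which may differ from the values used in CECL.

```python
# Minimal CBAM sketch: channel attention followed by spatial attention.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: shared MLP over global avg- and max-pooled features.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: conv over channel-wise avg and max maps.
        self.conv = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)   # channel gate
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(s))             # spatial gate

feat = CBAM(32)(torch.randn(2, 32, 64, 64))
print(feat.shape)  # torch.Size([2, 32, 64, 64])
```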

21 pages, 891 KB  
Article
Unified Visual Synchrony: A Framework for Face–Gesture Coherence in Multimodal Human–AI Interaction
by Saule Kudubayeva, Yernar Seksenbayev, Aigerim Yerimbetova, Elmira Daiyrbayeva, Bakzhan Sakenov, Duman Telman and Mussa Turdalyuly
Big Data Cogn. Comput. 2026, 10(3), 88; https://doi.org/10.3390/bdcc10030088 - 12 Mar 2026
Viewed by 393
Abstract
Multimodal human–AI systems generally consider facial expressions and body motions as separate input streams, leading to disjointed interpretations and diminished emotional coherence. To overcome this issue, we offer the Engagement-Safe Expressive Alignment (ESEA) paradigm and the Unified Visual Synchrony (UVS) framework as its computational implementation. UVS models the coherence between facial expressions and gestures, offering an interpretable visual synchrony signal that can function as adaptive feedback in human–AI interactions. The framework’s key component is the Consistency Index for Affective Synchrony (CIAS), which correlates brief visual segments with scalar synchrony scores through a common latent representation. Facial and gestural signals are processed by modality-specific projection networks into a unified latent space, and CIAS is derived from the similarity and short-term temporal consistency of these latent trajectories. The synchrony index is regarded as an estimation of affective visual coherence within the ESEA paradigm. We formalize the UVS/CIAS framework and conduct a comparative experimental evaluation utilizing matched and mismatched face–gesture segments derived from rendered dialog footage. Utilizing ROC analysis, score distribution comparisons, temporal visualizations, and negative control tests, we illustrate that CIAS effectively captures structured face–gesture alignment that surpasses similarity-based baselines, while also delivering a persistent, time-resolved synchronization signal. These findings establish CIAS as a principled and interpretable feedback signal for future affect-aware, engagement-focused multimodal agents. Full article
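As a hedged illustration of the CIAS idea, the sketch below projects face and gesture sequences into a shared latent space and combines per-frame cosine similarity with a short-term temporal-consistency term; the projections, pooling, and weighting are assumptions, not the trained networks from the paper.

```python
# Sketch of a CIAS-style synchrony score over two modality sequences.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SynchronyScorer(nn.Module):
    def __init__(self, face_dim=64, gesture_dim=96, latent=32):
        super().__init__()
        # Modality-specific projections into a unified latent space (assumed sizes).
        self.face_proj = nn.Linear(face_dim, latent)
        self.gest_proj = nn.Linear(gesture_dim, latent)

    def forward(self, face_seq, gest_seq):     # (batch, time, dim)
        f = F.normalize(self.face_proj(face_seq), dim=-1)
        g = F.normalize(self.gest_proj(gest_seq), dim=-1)
        sim = (f * g).sum(-1)                  # per-frame cosine similarity
        # Penalize frame-to-frame jitter of the similarity trajectory.
        consistency = 1.0 - (sim[:, 1:] - sim[:, :-1]).abs().mean(dim=1)
        return sim.mean(dim=1) * consistency   # scalar synchrony score per clip

score = SynchronyScorer()(torch.randn(4, 50, 64), torch.randn(4, 50, 96))
print(score.shape)  # torch.Size([4])
```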

17 pages, 1701 KB  
Article
CLIP-ArASL: A Lightweight Multimodal Model for Arabic Sign Language Recognition
by Naif Alasmari
Appl. Sci. 2026, 16(5), 2573; https://doi.org/10.3390/app16052573 - 7 Mar 2026
Viewed by 216
Abstract
Arabic sign language (ArASL) is the primary communication medium for Deaf and hard-of-hearing people across Arabic-speaking communities. Most current ArASL recognition systems are based solely on visual features and do not incorporate linguistic or semantic information that could improve generalization and semantic grounding. This paper introduces CLIP-ArASL, a lightweight CLIP-style multimodal approach for static ArASL letter recognition that aligns visual hand gestures with bilingual textual descriptions. The approach integrates an EfficientNet-B0 image encoder with a MiniLM text encoder to learn a shared embedding space using a hybrid objective that combines contrastive and cross-entropy losses. This design supports supervised classification on seen classes and zero-shot prediction on unseen classes using textual class representations. The proposed approach is evaluated on two public datasets, ArASL2018 and ArASL21L. Under supervised evaluation, recognition accuracies of 99.25±0.14% and 91.51±1.29% are achieved, respectively. Zero-shot performance is assessed by withholding 20% of gesture classes during training and predicting them using only their textual descriptions. In this setting, accuracies of 55.2±12.15% on ArASL2018 and 37.6±9.07% on ArASL21L are obtained. These results show that multimodal vision–language alignment supports semantic transfer and enables recognition of unseen classes. Full article
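The hybrid objective combining a CLIP-style symmetric contrastive loss with cross-entropy can be sketched as follows; the temperature and mixing weight are hypothetical values, not those tuned for CLIP-ArASL.

```python
# Sketch of a contrastive + cross-entropy hybrid loss (assumed weighting).
import torch
import torch.nn.functional as F

def hybrid_loss(img_emb, txt_emb, logits, labels, temperature=0.07, alpha=0.5):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sims = img @ txt.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(len(img), device=img.device)
    # Symmetric image->text and text->image contrastive terms.
    contrastive = (F.cross_entropy(sims, targets) +
                   F.cross_entropy(sims.t(), targets)) / 2
    return alpha * contrastive + (1 - alpha) * F.cross_entropy(logits, labels)

loss = hybrid_loss(torch.randn(8, 256), torch.randn(8, 256),
                   torch.randn(8, 32), torch.randint(0, 32, (8,)))
print(loss.item())
```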
(This article belongs to the Special Issue Machine Learning in Computer Vision and Image Processing)

18 pages, 1956 KB  
Article
Dynamic Occlusion-Aware Facial Expression Recognition Guided by AA-ViT
by Xiangwei Mou, Xiuping Xie, Yongfu Song and Rijun Wang
Electronics 2026, 15(4), 764; https://doi.org/10.3390/electronics15040764 - 11 Feb 2026
Viewed by 299
Abstract
In complex natural scenarios, facial expression recognition often encounters partial occlusions caused by glasses, hand gestures, and hairstyles, making it difficult for models to extract effective features and thereby reducing recognition accuracy. Existing methods often employ attention mechanisms to enhance expression-related features, but they fail to adequately address the issue where high-frequency responses in occluded regions can disperse attention weights (e.g., by incorrectly focusing on occluded areas), making it challenging to effectively utilize local cues around the occlusions and limiting performance improvement. To address this issue, this paper proposes a network based on an adaptive attention mechanism (Adaptive Attention Vision Transformer, AA-ViT). First, an Adaptive Attention module (ADA) is designed to dynamically adjust attention scores in occluded regions, enhancing the effective information in features. Next, a Dual-Branch Multi-Layer Perceptron (DB-MLP) replaces the single linear layer to improve feature representation and model classification capability. Additionally, a Random Erasure (RE) strategy is introduced to enhance model robustness. Finally, to address the issue of model training instability caused by class imbalance in the training dataset, a hybrid loss function combining Focal Loss and Cross-Entropy Loss is adopted to ensure training stability. Experimental results show that AA-ViT achieves expression recognition accuracies of 90.66% and 90.01% on the RAF-DB and FERPlus datasets, respectively, representing improvements of 4.58 and 18.9 percentage points over the baseline ViT model, with only a 24.3% increase in parameter count. Compared to existing methods, the proposed approach demonstrates superior performance in occluded facial expression recognition tasks. Full article
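The Focal + Cross-Entropy combination is a standard recipe for class imbalance; a minimal sketch, with gamma and the mixing weight as common illustrative defaults rather than the AA-ViT settings:

```python
# Sketch of a hybrid Focal + Cross-Entropy loss (assumed gamma and weighting).
import torch
import torch.nn.functional as F

def focal_ce_loss(logits, labels, gamma=2.0, alpha=0.5):
    ce = F.cross_entropy(logits, labels, reduction="none")
    pt = torch.exp(-ce)                      # probability of the true class
    focal = ((1 - pt) ** gamma * ce).mean()  # down-weights easy examples
    return alpha * focal + (1 - alpha) * ce.mean()

loss = focal_ce_loss(torch.randn(16, 7), torch.randint(0, 7, (16,)))
print(loss.item())
```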

17 pages, 7804 KB  
Article
A 3D Camera-Based Approach for Real-Time Hand Configuration Recognition in Italian Sign Language
by Luca Ulrich, Asia De Luca, Riccardo Miraglia, Emma Mulassano, Simone Quattrocchio, Giorgia Marullo, Chiara Innocente, Federico Salerno and Enrico Vezzetti
Sensors 2026, 26(3), 1059; https://doi.org/10.3390/s26031059 - 6 Feb 2026
Viewed by 346
Abstract
Deafness poses significant challenges to effective communication, particularly in contexts where access to sign language interpreters is limited. Hand configuration recognition represents a fundamental component of sign language understanding, as configurations constitute a core cheremic element in many sign languages, including Italian Sign Language (LIS). In this work, we address configuration-level recognition as an independent classification task and propose a machine vision framework based on RGB-D sensing. The proposed approach combines MediaPipe-based hand landmark extraction with normalized three-dimensional geometric features and a Support Vector Machine classifier. The first contribution of this study is the formulation of LIS hand configuration recognition as a standalone, configuration-level problem, decoupled from temporal gesture modeling. The second contribution is the integration of sensor-acquired RGB-D depth measurements into the landmark-based feature representation, enabling a direct comparison with estimated depth obtained from monocular data. The third contribution consists of a systematic experimental evaluation on two LIS configuration sets (6 and 16 classes), demonstrating that the use of real depth significantly improves classification performance and class separability, particularly for geometrically similar configurations. The results highlight the critical role of depth quality in configuration-level recognition and provide insights into the design of robust vision-based systems for LIS analysis. Full article
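A minimal sketch of the landmark-to-SVM pipeline under stated assumptions: 21 MediaPipe-style hand landmarks, wrist-centered translation and scale normalization, and synthetic data standing in for the RGB-D recordings.

```python
# Sketch: normalize 3D hand landmarks, then classify with an SVM.
import numpy as np
from sklearn.svm import SVC

def normalize_landmarks(pts):          # pts: (21, 3); wrist assumed at index 0
    pts = pts - pts[0]                 # translation invariance
    scale = np.linalg.norm(pts, axis=1).max()
    return (pts / (scale + 1e-8)).ravel()   # scale invariance, flattened

rng = np.random.default_rng(0)
X = np.stack([normalize_landmarks(rng.normal(size=(21, 3))) for _ in range(60)])
y = rng.integers(0, 6, size=60)        # 6 configuration classes (synthetic labels)
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(X[:3]))
```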
(This article belongs to the Special Issue Sensing and Machine Learning Control: Progress and Applications)

24 pages, 29852 KB  
Article
Dual-Axis Transformer-GNN Framework for Touchless Finger Location Sensing by Using Wi-Fi Channel State Information
by Minseok Koo and Jaesung Park
Electronics 2026, 15(3), 565; https://doi.org/10.3390/electronics15030565 - 28 Jan 2026
Viewed by 315
Abstract
Camera, lidar, and wearable-based gesture recognition technologies face practical limitations such as lighting sensitivity, occlusion, hardware cost, and user inconvenience. Wi-Fi channel state information (CSI) can be used as a contactless alternative to capture subtle signal variations caused by human motion. However, existing CSI-based methods are highly sensitive to domain shifts and often suffer notable performance degradation when applied to environments different from the training conditions. To address this issue, we propose a domain-robust touchless finger location sensing framework that operates reliably even in a single-link environment composed of commercial Wi-Fi devices. The proposed system applies preprocessing procedures to reduce noise and variability introduced by environmental factors and introduces a multi-domain segment combination strategy to increase the domain diversity during training. In addition, the dual-axis transformer learns temporal and spatial features independently, and the GNN-based integration module incorporates relationships among segments originating from different domains to produce more generalized representations. The proposed model is evaluated using CSI data collected from various users and days; experimental results show that the proposed method achieves an in-domain accuracy of 99.31% and outperforms the best baseline by approximately 4% and 3% in cross-user and cross-day evaluation settings, respectively, even in a single-link setting. Our work demonstrates a viable path for robust, calibration-free finger-level interaction using ubiquitous single-link Wi-Fi in real-world and constrained environments, providing a foundation for more reliable contactless interaction systems. Full article
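One reading of the "dual-axis transformer" is sketched below: one encoder attends along the time axis of a CSI tensor and another along the subcarrier axis, with the two views concatenated. All dimensions are illustrative assumptions, and the paper's GNN-based integration module is not reproduced here.

```python
# Sketch of dual-axis (time and subcarrier) attention over a CSI tensor.
import torch
import torch.nn as nn

class DualAxisEncoder(nn.Module):
    def __init__(self, n_subcarriers=64, n_frames=100, d_model=64):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.time_in = nn.Linear(n_subcarriers, d_model)
        self.freq_in = nn.Linear(n_frames, d_model)
        self.time_enc = nn.TransformerEncoder(layer(), num_layers=2)
        self.freq_enc = nn.TransformerEncoder(layer(), num_layers=2)

    def forward(self, csi):                     # csi: (B, frames, subcarriers)
        t = self.time_enc(self.time_in(csi))                   # attend across time
        f = self.freq_enc(self.freq_in(csi.transpose(1, 2)))   # across subcarriers
        return torch.cat([t.mean(1), f.mean(1)], dim=-1)

emb = DualAxisEncoder()(torch.randn(2, 100, 64))
print(emb.shape)  # torch.Size([2, 128])
```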

24 pages, 6118 KB  
Article
Effective Approach for Classifying EMG Signals Through Reconstruction Using Autoencoders
by Natalia Rendón Caballero, Michelle Rojo González, Marcos Aviles, José Manuel Alvarez Alvarado, José Billerman Robles-Ocampo, Perla Yazmin Sevilla-Camacho and Juvenal Rodríguez-Reséndiz
AI 2026, 7(1), 36; https://doi.org/10.3390/ai7010036 - 22 Jan 2026
Viewed by 505
Abstract
The study of muscle signal classification has been widely explored for the control of myoelectric prostheses. Traditional approaches rely on manually designed features extracted from time- or frequency-domain representations, which may limit the generalization and adaptability of EMG-based systems. In this work, an autoencoder-based framework is proposed for automatic feature extraction, enabling the learning of compact latent representations directly from raw EMG signals and reducing dependence on handcrafted features. A custom instrumentation system with three surface EMG sensors was developed and placed on selected forearm muscles to acquire signals associated with five hand movements from 20 healthy participants aged 18 to 40 years. The signals were segmented into 200 ms windows with 75% overlap. The proposed method employs a recurrent autoencoder with a symmetric encoder–decoder architecture, trained independently for each sensor to achieve accurate signal reconstruction, with a minimum reconstruction loss of 3.3×10⁻⁴ V². The encoder's latent representations were then used to train a dense neural network for gesture classification. An overall efficiency of 93.84% was achieved, demonstrating that the proposed reconstruction-based approach provides high classification performance and represents a promising solution for future EMG-based assistive and control applications. Full article
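The windowing step is fully specified by the abstract (200 ms windows, 75% overlap, i.e. a 50 ms hop); a small sketch, assuming a 1 kHz sampling rate for illustration:

```python
# Sketch of 200 ms / 75%-overlap window segmentation (1 kHz rate assumed).
import numpy as np

def segment(signal, fs=1000, win_ms=200, overlap=0.75):
    win = int(fs * win_ms / 1000)          # 200 samples at 1 kHz
    hop = int(win * (1 - overlap))         # 50-sample hop
    return np.stack([signal[i:i + win]
                     for i in range(0, len(signal) - win + 1, hop)])

emg = np.random.randn(3000)                # 3 s of synthetic single-channel EMG
windows = segment(emg)
print(windows.shape)                       # (57, 200)
```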
(This article belongs to the Special Issue Transforming Biomedical Innovation with Artificial Intelligence)

27 pages, 4802 KB  
Article
Fine-Grained Radar Hand Gesture Recognition Method Based on Variable-Channel DRSN
by Penghui Chen, Siben Li, Chenchen Yuan, Yujing Bai and Jun Wang
Electronics 2026, 15(2), 437; https://doi.org/10.3390/electronics15020437 - 19 Jan 2026
Viewed by 403
Abstract
With the ongoing miniaturization of smart devices, fine-grained hand gesture recognition using millimeter-wave radar has attracted increasing attention, yet practical deployment remains challenging in continuous-gesture segmentation, robust feature extraction, and reliable classification. This paper presents an end-to-end fine-grained gesture recognition framework based on frequency-modulated continuous-wave (FMCW) millimeter-wave radar, including gesture design, data acquisition, feature construction, and neural network-based classification. Ten gesture types are recorded (eight valid gestures and two return-to-neutral gestures); for classification, the two return-to-neutral gesture types are merged into a single invalid class, yielding a nine-class task. A sliding-window segmentation method is developed using short-time Fourier transform (STFT)-based Doppler-time representations, and a dataset of 4050 labeled samples is collected. Multiple signal classification (MUSIC)-based super-resolution estimation is adopted to construct range–time and angle–time representations, and instance-wise normalization is applied to Doppler and range features to mitigate inter-individual variability without test leakage. For recognition, a variable-channel deep residual shrinkage network (DRSN) is employed to improve robustness to noise, supporting single-, dual-, and triple-channel feature inputs. Results under both subject-dependent evaluation with repeated random splits and subject-independent leave-one-subject-out (LOSO) cross-validation show that the DRSN architecture consistently outperforms the RefineNet-based baseline, and the triple-channel configuration achieves the best performance (98.88% accuracy). Overall, the variable-channel design enables flexible feature selection to meet diverse application requirements. Full article
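A sketch of producing an STFT-based Doppler-time map with SciPy; the signal, sampling rate, and window parameters are synthetic stand-ins for the FMCW data described above.

```python
# Sketch: Doppler-time magnitude map via STFT of a synthetic radar beat signal.
import numpy as np
from scipy.signal import stft

fs = 2000                                  # slow-time sampling rate (assumed)
t = np.arange(0, 1.0, 1 / fs)
sig = np.exp(1j * 2 * np.pi * (100 * t + 40 * t ** 2))  # time-varying Doppler
f, tt, Z = stft(sig, fs=fs, window="hann", nperseg=128, noverlap=96,
                return_onesided=False)     # two-sided spectrum for complex input
doppler_time = np.abs(Z)                   # (freq bins, time frames) map
print(doppler_time.shape)
```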

26 pages, 3626 KB  
Article
A Lightweight Frozen Multi-Convolution Dual-Branch Network for Efficient sEMG-Based Gesture Recognition
by Shengbiao Wu, Zhezhe Lv, Yuehong Li, Chengmin Fang, Tao You and Jiazheng Gui
Sensors 2026, 26(2), 580; https://doi.org/10.3390/s26020580 - 15 Jan 2026
Viewed by 360
Abstract
Gesture recognition is important for rehabilitation assistance and intelligent prosthetic control. However, surface electromyography (sEMG) signals exhibit strong non-stationarity, and conventional deep-learning models require long training time and high computational cost, limiting their use on resource-constrained devices. This study proposes a Frozen Multi-Convolution Dual-Branch Network (FMC-DBNet) to address these challenges. The model employs randomly initialized and fixed convolutional kernels for training-free multi-scale feature extraction, substantially reducing computational overhead. A dual-branch architecture is adopted to capture complementary temporal and physiological patterns from raw sEMG signals and intrinsic mode functions (IMFs) obtained through variational mode decomposition (VMD). In addition, positive-proportion (PPV) and global-average-pooling (GAP) statistics enhance lightweight multi-resolution representation. Experiments on the Ninapro DB1 dataset show that FMC-DBNet achieves an average accuracy of 96.4% ± 1.9% across 27 subjects and reduces training time by approximately 90% compared with a conventional trainable CNN baseline. These results demonstrate that frozen random-convolution structures provide an efficient and robust alternative to fully trained deep networks, offering a promising solution for low-power and computationally efficient sEMG gesture recognition. Full article
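The frozen random-convolution idea with PPV and GAP pooling resembles ROCKET-style feature extraction; a sketch under that reading, with kernel counts and sizes as assumptions rather than the FMC-DBNet configuration:

```python
# Sketch: training-free features from frozen random 1-D kernels with PPV + GAP.
import torch
import torch.nn as nn

class FrozenRandomConv(nn.Module):
    def __init__(self, in_ch=1, n_kernels=32, sizes=(7, 15, 31)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(in_ch, n_kernels, k, padding=k // 2) for k in sizes)
        for p in self.parameters():
            p.requires_grad = False        # kernels stay at random initialization

    @torch.no_grad()
    def forward(self, x):                  # x: (B, C, T)
        feats = []
        for conv in self.convs:
            a = conv(x)
            feats.append((a > 0).float().mean(-1))  # PPV per kernel
            feats.append(a.mean(-1))                # GAP per kernel
        return torch.cat(feats, dim=-1)    # fed to a small trainable classifier

emb = FrozenRandomConv()(torch.randn(4, 1, 400))
print(emb.shape)  # torch.Size([4, 192])
```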
(This article belongs to the Section Electronic Sensors)

27 pages, 4631 KB  
Article
Multimodal Minimal-Angular-Geometry Representation for Real-Time Dynamic Mexican Sign Language Recognition
by Gerardo Garcia-Gil, Gabriela del Carmen López-Armas and Yahir Emmanuel Ramirez-Pulido
Technologies 2026, 14(1), 48; https://doi.org/10.3390/technologies14010048 - 8 Jan 2026
Viewed by 506
Abstract
Current approaches to dynamic sign language recognition commonly rely on dense landmark representations, which impose high computational cost and hinder real-time deployment on resource-constrained devices. To address this limitation, this work proposes a computationally efficient framework for real-time dynamic Mexican Sign Language (MSL) recognition based on a multimodal minimal angular-geometry representation. Instead of processing complete landmark sets (e.g., MediaPipe Holistic with up to 468 keypoints), the proposed method encodes the relational geometry of the hands, face, and upper body into a compact set of 28 invariant internal angular descriptors. This representation substantially reduces feature dimensionality and computational complexity while preserving linguistically relevant manual and non-manual information required for grammatical and semantic discrimination in MSL. A real-time end-to-end pipeline is developed, comprising multimodal landmark extraction, angular feature computation, and temporal modeling using a Bidirectional Long Short-Term Memory (BiLSTM) network. The system is evaluated on a custom dataset of dynamic MSL gestures acquired under controlled real-time conditions. Experimental results demonstrate that the proposed approach achieves 99% accuracy and a 99% macro F1-score, matching state-of-the-art performance while using dramatically fewer features. The compactness, interpretability, and efficiency of the minimal angular descriptor make the proposed system suitable for real-time deployment on low-cost devices, contributing toward more accessible and inclusive sign language recognition technologies. Full article
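A single internal angular descriptor of the kind described can be computed from three keypoints; the sketch below shows the geometry, while the paper's specific choice of 28 angles is not reproduced.

```python
# Sketch: internal angle at joint b formed by 3D keypoints a-b-c.
import numpy as np

def joint_angle(a, b, c):
    """Angle at b (degrees) between vectors b->a and b->c."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Example: elbow angle from shoulder, elbow, and wrist keypoints.
shoulder = np.array([0.0, 0.0, 0.0])
elbow = np.array([0.3, -0.1, 0.0])
wrist = np.array([0.5, 0.2, 0.0])
print(round(joint_angle(shoulder, elbow, wrist), 1))
```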
(This article belongs to the Special Issue Image Analysis and Processing)

24 pages, 15172 KB  
Article
Real-Time Hand Gesture Recognition for IoT Devices Using FMCW mmWave Radar and Continuous Wavelet Transform
by Anna Ślesicka and Adam Kawalec
Electronics 2026, 15(2), 250; https://doi.org/10.3390/electronics15020250 - 6 Jan 2026
Viewed by 642
Abstract
This paper presents an intelligent framework for real-time hand gesture recognition using Frequency-Modulated Continuous-Wave (FMCW) mmWave radar and deep learning. Unlike traditional radar-based recognition methods that rely on Discrete Fourier Transform (DFT) signal representations and focus primarily on classifier optimization, the proposed system introduces a novel pre-processing stage based on the Continuous Wavelet Transform (CWT). The CWT enables the extraction of discriminative time–frequency features directly from raw radar signals, improving the interpretability and robustness of the learned representations. A lightweight convolutional neural network architecture is then designed to process the CWT maps for efficient classification on edge IoT devices. Experimental validation with data collected from 20 participants performing five standardized gestures demonstrates that the proposed framework achieves an accuracy of up to 99.87% using the Morlet wavelet, with strong generalization to unseen users (82–84% accuracy). The results confirm that the integration of CWT-based radar signal processing with deep learning forms a computationally efficient and accurate intelligent system for human–computer interaction in real-time IoT environments. Full article
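A sketch of the Morlet-wavelet CWT pre-processing stage using PyWavelets, with a synthetic chirp standing in for the radar return; the scale range is an arbitrary illustrative choice.

```python
# Sketch: continuous wavelet transform of a synthetic radar-like signal.
import numpy as np
import pywt

fs = 1000
t = np.arange(0, 1.0, 1 / fs)
sig = np.cos(2 * np.pi * (50 * t + 30 * t ** 2))   # synthetic micro-Doppler-like chirp
scales = np.arange(1, 128)
coeffs, freqs = pywt.cwt(sig, scales, "morl", sampling_period=1 / fs)
cwt_map = np.abs(coeffs)          # time-frequency map fed to a lightweight CNN
print(cwt_map.shape)              # (127, 1000)
```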
(This article belongs to the Special Issue Convolutional Neural Networks and Vision Applications, 4th Edition)

17 pages, 1312 KB  
Article
RGB Fusion of Multiple Radar Sensors for Deep Learning-Based Traffic Hand Gesture Recognition
by Hüseyin Üzen
Electronics 2026, 15(1), 140; https://doi.org/10.3390/electronics15010140 - 28 Dec 2025
Viewed by 455
Abstract
Hand gesture recognition (HGR) systems play a critical role in modern intelligent transportation frameworks by enabling reliable communication between pedestrians, traffic operators, and autonomous vehicles. This work presents a novel traffic hand gesture recognition method that combines nine grayscale radar images captured from multiple millimeter-wave radar nodes into a single RGB representation through an optimized rotation–shift fusion strategy. This transformation preserves complementary spatial information while minimizing inter-image interference, enabling deep learning models to more effectively utilize the distinctive micro-Doppler and spatial patterns embedded in radar measurements. Extensive experimental studies were conducted to verify the model’s performance, demonstrating that the proposed RGB fusion approach provides higher classification accuracy than single-sensor or unfused representations. In addition, the proposed model outperformed state-of-the-art methods in the literature with an accuracy of 92.55%. These results highlight its potential as a lightweight yet powerful solution for reliable gesture interpretation in future intelligent transportation and human–vehicle interaction systems. Full article
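One plausible reading of packing nine grayscale radar maps into a single RGB image is three maps per color channel, each given a small shift before averaging; the sketch below follows that reading with arbitrary shift values, and the paper's optimized rotation–shift fusion strategy is not reproduced.

```python
# Rough sketch: fuse nine grayscale radar maps into one RGB image.
# Shift values are arbitrary stand-ins for the paper's optimized strategy.
import numpy as np

def fuse_to_rgb(maps, shifts=(0, 3, 6)):      # maps: (9, H, W) grayscale
    channels = []
    for c in range(3):
        group = [np.roll(maps[3 * c + i], s, axis=1)   # circular column shift
                 for i, s in enumerate(shifts)]
        channels.append(np.mean(group, axis=0))
    return np.stack(channels, axis=-1)        # (H, W, 3), usable by a standard CNN

rgb = fuse_to_rgb(np.random.rand(9, 64, 64))
print(rgb.shape)  # (64, 64, 3)
```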
(This article belongs to the Special Issue Advanced Techniques for Multi-Agent Systems)

24 pages, 2879 KB  
Article
Skeleton-Based Real-Time Hand Gesture Recognition Using Data Fusion and Ensemble Multi-Stream CNN Architecture
by Maki K. Habib, Oluwaleke Yusuf and Mohamed Moustafa
Technologies 2025, 13(11), 484; https://doi.org/10.3390/technologies13110484 - 26 Oct 2025
Viewed by 1819
Abstract
Hand Gesture Recognition (HGR) is a vital technology that enables intuitive human–computer interaction in various domains, including augmented reality, smart environments, and assistive systems. Achieving both high accuracy and real-time performance remains challenging due to the complexity of hand dynamics, individual morphological variations, and computational limitations. This paper presents a lightweight and efficient skeleton-based HGR framework that addresses these challenges through an optimized multi-stream Convolutional Neural Network (CNN) architecture and a trainable ensemble tuner. Dynamic 3D gestures are transformed into structured, noise-minimized 2D spatiotemporal representations via enhanced data-level fusion, supporting robust classification across diverse spatial perspectives. The ensemble tuner strengthens semantic relationships between streams and improves recognition accuracy. Unlike existing solutions that rely on high-end hardware, the proposed framework achieves real-time inference on consumer-grade devices without compromising accuracy. Experimental validation across five benchmark datasets (SHREC2017, DHG1428, FPHA, LMDHG, and CNR) confirms consistent or superior performance with reduced computational overhead. Additional validation on the SBU Kinect Interaction Dataset highlights generalization potential for broader Human Action Recognition (HAR) tasks. This advancement bridges the gap between efficiency and accuracy, supporting scalable deployment in AR/VR, mobile computing, interactive gaming, and resource-constrained environments. Full article
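Encoding a 3D skeleton sequence as a 2D image (joints along one axis, frames along the other, coordinates as channels) is a common form of data-level fusion; a sketch under that assumption, without the paper's exact fusion scheme or multi-stream ensemble tuner:

```python
# Sketch: encode a 3D skeleton sequence as a 2D spatiotemporal image.
import numpy as np

def skeleton_to_image(seq):            # seq: (frames, joints, 3)
    img = seq.transpose(1, 0, 2)       # (joints, frames, xyz) -> H x W x C
    lo, hi = img.min(), img.max()
    return ((img - lo) / (hi - lo + 1e-8) * 255).astype(np.uint8)

seq = np.random.randn(32, 21, 3)       # 32 frames of 21 hand joints (synthetic)
print(skeleton_to_image(seq).shape)    # (21, 32, 3)
```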
