Search Results (106)

Search Parameters:
Keywords = gesture representation

17 pages, 561 KB  
Article
Multimodal Shared Autonomy for Heavy-Load UAV Operations with Physics-Aware Cooperative Control
by Xu Gao, Jingfeng Wu, Yuchen Wang, Can Cao, Lihui Wang, Bowen Wang and Yimeng Zhang
Sensors 2026, 26(6), 1997; https://doi.org/10.3390/s26061997 - 23 Mar 2026
Viewed by 119
Abstract
Heavy-load unmanned aerial vehicles (UAVs) are increasingly being applied in logistics, infrastructure installation, and emergency response missions, where complex payload dynamics and unstructured environments pose significant challenges to safe and efficient operation. Conventional manual teleoperation interfaces, such as dual-joystick control, impose a high cognitive workload and provide limited support for expressing high-level operator intent, while fully autonomous solutions remain difficult to deploy reliably under real-world uncertainty. To address these limitations, this paper proposes the Multimodal Fusion Cooperation Network (MFCN), an end-to-end shared autonomy framework that integrates speech commands, visual gestures, and haptic cues through cross-modal feature fusion to infer operator intent in real time. The fused intent representation is translated into dynamically feasible control commands by a cooperative control policy with embedded physics-aware constraints to suppress payload oscillations and ensure flight stability. Extensive semi-physical simulations and real-world experiments demonstrate that the MFCN significantly improves the task success rate, positioning accuracy, and payload stability while reducing the task completion time and operator cognitive workload compared with manual, unimodal, and heuristic multimodal baselines. Full article
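The abstract describes fusing speech, gesture, and haptic features into a single operator-intent representation. As a rough illustration only, the PyTorch sketch below fuses three modality embeddings with attention; every module name, dimension, and the 6-DoF intent head is a hypothetical stand-in, not the authors' MFCN.

```python
# Minimal sketch of cross-modal feature fusion for operator-intent inference.
# All names and dimensions are illustrative assumptions, not the MFCN code.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, speech_dim=128, gesture_dim=256, haptic_dim=64, d_model=128):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.proj_speech = nn.Linear(speech_dim, d_model)
        self.proj_gesture = nn.Linear(gesture_dim, d_model)
        self.proj_haptic = nn.Linear(haptic_dim, d_model)
        # Self-attention over the three modality tokens fuses them.
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.intent_head = nn.Linear(d_model, 6)  # hypothetical 6-DoF intent vector

    def forward(self, speech, gesture, haptic):
        tokens = torch.stack([
            self.proj_speech(speech),
            self.proj_gesture(gesture),
            self.proj_haptic(haptic),
        ], dim=1)                              # (batch, 3 modality tokens, d_model)
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.intent_head(fused.mean(dim=1))

fusion = CrossModalFusion()
intent = fusion(torch.randn(2, 128), torch.randn(2, 256), torch.randn(2, 64))
print(intent.shape)  # torch.Size([2, 6])
```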
(This article belongs to the Special Issue Advanced Sensors and AI Integration for Human–Robot Teaming)

19 pages, 759 KB  
Article
Dual-Stream BiLSTM–Transformer Architecture for Real-Time Two-Handed Dynamic Sign Language Gesture Recognition
by Enachi Andrei, Turcu Corneliu-Octavian, Culea George, Andrioaia Dragos-Alexandru, Ungureanu Andrei-Gabriel and Sghera Bogdan-Constantin
Appl. Sci. 2026, 16(6), 2912; https://doi.org/10.3390/app16062912 - 18 Mar 2026
Viewed by 105
Abstract
Two-handed dynamic gesture recognition represents a fundamental component of sign language interpretation involving the modeling of temporal dependencies and inter-hand coordination. In this task, a major challenge is modeling asymmetric motion patterns, as well as bidirectional and long-range temporal dependencies. Most existing frameworks rely on early fusion strategies that merge joints, keypoints, or landmarks from both hands in early processing stages, primarily to reduce model complexity and enforce a unified representation. In this work, a novel dual-stream BiLSTM–Transformer model architecture is proposed for two-handed dynamic sign language recognition, where parallel encoders process the trajectories of each hand independently. To capture spatial and temporal dependencies for each hand, an attention-based cross-hand fusion mechanism is employed, with hand landmarks extracted by the MediaPipe Hands framework as a preprocessing step to enable real-time CPU-based inference. Experimental evaluation conducted on custom Romanian Sign Language dynamic gesture datasets indicates that the proposed dual-stream-based system outperforms single-handed baselines, achieving higher recognition accuracy for asymmetric gestures and consistent performance gains for synchronized two-handed gestures. The proposed architecture represents an efficient and lightweight solution suitable for real-time sign language recognition and interpretation. Full article
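A minimal sketch of the dual-stream idea: one BiLSTM encoder per hand plus attention-based cross-hand fusion. The 21-landmark input (per MediaPipe Hands), hidden sizes, and classification head are illustrative assumptions, not the published architecture.

```python
# Sketch of a dual-stream encoder with cross-hand attention; all sizes assumed.
import torch
import torch.nn as nn

class DualHandRecognizer(nn.Module):
    def __init__(self, n_landmarks=21, hidden=64, n_classes=10):
        super().__init__()
        feat = n_landmarks * 3  # (x, y, z) per landmark
        # One BiLSTM encoder per hand, applied independently.
        self.left = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        self.right = nn.LSTM(feat, hidden, batch_first=True, bidirectional=True)
        # Cross-hand attention: each hand's sequence attends to the other's.
        self.cross = nn.MultiheadAttention(2 * hidden, num_heads=4, batch_first=True)
        self.head = nn.Linear(4 * hidden, n_classes)

    def forward(self, left_seq, right_seq):   # (batch, time, feat)
        l, _ = self.left(left_seq)
        r, _ = self.right(right_seq)
        l2r, _ = self.cross(l, r, r)   # left queries attend to right keys/values
        r2l, _ = self.cross(r, l, l)
        pooled = torch.cat([l2r.mean(1), r2l.mean(1)], dim=-1)
        return self.head(pooled)

model = DualHandRecognizer()
logits = model(torch.randn(2, 30, 63), torch.randn(2, 30, 63))
print(logits.shape)  # torch.Size([2, 10])
```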
(This article belongs to the Section Computing and Artificial Intelligence)

19 pages, 2968 KB  
Article
CBAM-Enhanced CNN-LSTM with Improved DBSCAN for High-Precision Radar-Based Gesture Recognition
by Shiwei Yi, Zhenyu Zhao and Tongning Wu
Sensors 2026, 26(6), 1835; https://doi.org/10.3390/s26061835 - 14 Mar 2026
Viewed by 222
Abstract
In recent years, radar-based gesture recognition technology has been widely applied in industrial and daily life scenarios. However, increasingly complex application scenarios have imposed higher demands on the accuracy and robustness of gesture recognition algorithms, and challenges such as clutter interference, inter-gesture similarity, and spatial–temporal feature ambiguity limit recognition performance. To address these challenges, a novel framework named CECL, which incorporates the Convolutional Block Attention Module (CBAM) into a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) architecture, is proposed for high-accuracy radar-based gesture recognition. The CBAM adaptively highlights discriminative spatial regions and suppresses irrelevant background, and the CNN-LSTM network captures temporal dynamics across gesture sequences. During gesture signal processing, the Blackman window is applied to suppress spectral leakage. Additionally, a combination of wavelet thresholding and dynamic energy nulling is employed to effectively suppress clutter and enhance feature representation. Furthermore, an improved Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm further eliminates isolated sparse noise while preserving dense and valid target signal regions. Experimental results demonstrate that the proposed algorithm achieves 98.33% average accuracy in gesture classification, outperforming other baseline models. It exhibits excellent recognition performance across various distances and angles, demonstrating significantly enhanced robustness. Full article
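CBAM itself is a published, well-documented module; the sketch below follows its standard channel-then-spatial attention design with common default hyperparameters, which may differ from the values used in CECL.

```python
# Minimal CBAM sketch: channel attention followed by spatial attention.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Channel attention: shared MLP over global avg- and max-pooled features.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial attention: conv over channel-wise avg and max maps.
        self.conv = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)   # channel gate
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.conv(s))             # spatial gate

feat = CBAM(32)(torch.randn(2, 32, 64, 64))
print(feat.shape)  # torch.Size([2, 32, 64, 64])
```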

21 pages, 891 KB  
Article
Unified Visual Synchrony: A Framework for Face–Gesture Coherence in Multimodal Human–AI Interaction
by Saule Kudubayeva, Yernar Seksenbayev, Aigerim Yerimbetova, Elmira Daiyrbayeva, Bakzhan Sakenov, Duman Telman and Mussa Turdalyuly
Big Data Cogn. Comput. 2026, 10(3), 88; https://doi.org/10.3390/bdcc10030088 - 12 Mar 2026
Viewed by 393
Abstract
Multimodal human–AI systems generally consider facial expressions and body motions as separate input streams, leading to disjointed interpretations and diminished emotional coherence. To overcome this issue, we offer the Engagement-Safe Expressive Alignment (ESEA) paradigm and the Unified Visual Synchrony (UVS) framework as its computational implementation. UVS models the coherence between facial expressions and gestures, offering an interpretable visual synchrony signal that can function as adaptive feedback in human–AI interactions. The framework’s key component is the Consistency Index for Affective Synchrony (CIAS), which correlates brief visual segments with scalar synchrony scores through a common latent representation. Facial and gestural signals are processed by modality-specific projection networks into a unified latent space, and CIAS is derived from the similarity and short-term temporal consistency of these latent trajectories. The synchrony index is regarded as an estimation of affective visual coherence within the ESEA paradigm. We formalize the UVS/CIAS framework and conduct a comparative experimental evaluation utilizing matched and mismatched face–gesture segments derived from rendered dialog footage. Utilizing ROC analysis, score distribution comparisons, temporal visualizations, and negative control tests, we illustrate that CIAS effectively captures structured face–gesture alignment that surpasses similarity-based baselines, while also delivering a persistent, time-resolved synchronization signal. These findings establish CIAS as a principled and interpretable feedback signal for future affect-aware, engagement-focused multimodal agents. Full article
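As a hedged illustration of the CIAS idea, the sketch below projects face and gesture sequences into a shared latent space and combines per-frame cosine similarity with a short-term temporal-consistency term; the projections, pooling, and weighting are assumptions, not the trained networks from the paper.

```python
# Sketch of a CIAS-style synchrony score over two modality sequences.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SynchronyScorer(nn.Module):
    def __init__(self, face_dim=64, gesture_dim=96, latent=32):
        super().__init__()
        # Modality-specific projections into a unified latent space (assumed sizes).
        self.face_proj = nn.Linear(face_dim, latent)
        self.gest_proj = nn.Linear(gesture_dim, latent)

    def forward(self, face_seq, gest_seq):     # (batch, time, dim)
        f = F.normalize(self.face_proj(face_seq), dim=-1)
        g = F.normalize(self.gest_proj(gest_seq), dim=-1)
        sim = (f * g).sum(-1)                  # per-frame cosine similarity
        # Penalize frame-to-frame jitter of the similarity trajectory.
        consistency = 1.0 - (sim[:, 1:] - sim[:, :-1]).abs().mean(dim=1)
        return sim.mean(dim=1) * consistency   # scalar synchrony score per clip

score = SynchronyScorer()(torch.randn(4, 50, 64), torch.randn(4, 50, 96))
print(score.shape)  # torch.Size([4])
```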

17 pages, 1701 KB  
Article
CLIP-ArASL: A Lightweight Multimodal Model for Arabic Sign Language Recognition
by Naif Alasmari
Appl. Sci. 2026, 16(5), 2573; https://doi.org/10.3390/app16052573 - 7 Mar 2026
Viewed by 216
Abstract
Arabic sign language (ArASL) is the primary communication medium for Deaf and hard-of-hearing people across Arabic-speaking communities. Most current ArASL recognition systems are based solely on visual features and do not incorporate linguistic or semantic information that could improve generalization and semantic grounding. This paper introduces CLIP-ArASL, a lightweight CLIP-style multimodal approach for static ArASL letter recognition that aligns visual hand gestures with bilingual textual descriptions. The approach integrates an EfficientNet-B0 image encoder with a MiniLM text encoder to learn a shared embedding space using a hybrid objective that combines contrastive and cross-entropy losses. This design supports supervised classification on seen classes and zero-shot prediction on unseen classes using textual class representations. The proposed approach is evaluated on two public datasets, ArASL2018 and ArASL21L. Under supervised evaluation, recognition accuracies of 99.25±0.14% and 91.51±1.29% are achieved, respectively. Zero-shot performance is assessed by withholding 20% of gesture classes during training and predicting them using only their textual descriptions. In this setting, accuracies of 55.2±12.15% on ArASL2018 and 37.6±9.07% on ArASL21L are obtained. These results show that multimodal vision–language alignment supports semantic transfer and enables recognition of unseen classes. Full article
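The hybrid objective combining a CLIP-style symmetric contrastive loss with cross-entropy can be sketched as follows; the temperature and mixing weight are hypothetical values, not those tuned for CLIP-ArASL.

```python
# Sketch of a contrastive + cross-entropy hybrid loss (assumed weighting).
import torch
import torch.nn.functional as F

def hybrid_loss(img_emb, txt_emb, logits, labels, temperature=0.07, alpha=0.5):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sims = img @ txt.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(len(img), device=img.device)
    # Symmetric image->text and text->image contrastive terms.
    contrastive = (F.cross_entropy(sims, targets) +
                   F.cross_entropy(sims.t(), targets)) / 2
    return alpha * contrastive + (1 - alpha) * F.cross_entropy(logits, labels)

loss = hybrid_loss(torch.randn(8, 256), torch.randn(8, 256),
                   torch.randn(8, 32), torch.randint(0, 32, (8,)))
print(loss.item())
```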
(This article belongs to the Special Issue Machine Learning in Computer Vision and Image Processing)

18 pages, 1956 KB  
Article
Dynamic Occlusion-Aware Facial Expression Recognition Guided by AA-ViT
by Xiangwei Mou, Xiuping Xie, Yongfu Song and Rijun Wang
Electronics 2026, 15(4), 764; https://doi.org/10.3390/electronics15040764 - 11 Feb 2026
Viewed by 299
Abstract
In complex natural scenarios, facial expression recognition often encounters partial occlusions caused by glasses, hand gestures, and hairstyles, making it difficult for models to extract effective features and thereby reducing recognition accuracy. Existing methods often employ attention mechanisms to enhance expression-related features, but they fail to adequately address the issue where high-frequency responses in occluded regions can disperse attention weights (e.g., by incorrectly focusing on occluded areas), making it challenging to effectively utilize local cues around the occlusions and limiting performance improvement. To address this issue, this paper proposes a network based on an adaptive attention mechanism (Adaptive Attention Vision Transformer, AA-ViT). First, an Adaptive Attention module (ADA) is designed to dynamically adjust attention scores in occluded regions, enhancing the effective information in features. Next, a Dual-Branch Multi-Layer Perceptron (DB-MLP) replaces the single linear layer to improve feature representation and model classification capability. Additionally, a Random Erasure (RE) strategy is introduced to enhance model robustness. Finally, to address the issue of model training instability caused by class imbalance in the training dataset, a hybrid loss function combining Focal Loss and Cross-Entropy Loss is adopted to ensure training stability. Experimental results show that AA-ViT achieves expression recognition accuracies of 90.66% and 90.01% on the RAF-DB and FERPlus datasets, respectively, representing improvements of 4.58 and 18.9 percentage points over the baseline ViT model, with only a 24.3% increase in parameter count. Compared to existing methods, the proposed approach demonstrates superior performance in occluded facial expression recognition tasks. Full article
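The Focal + Cross-Entropy combination is a standard recipe for class imbalance; a minimal sketch, with gamma and the mixing weight as common illustrative defaults rather than the AA-ViT settings:

```python
# Sketch of a hybrid Focal + Cross-Entropy loss (assumed gamma and weighting).
import torch
import torch.nn.functional as F

def focal_ce_loss(logits, labels, gamma=2.0, alpha=0.5):
    ce = F.cross_entropy(logits, labels, reduction="none")
    pt = torch.exp(-ce)                      # probability of the true class
    focal = ((1 - pt) ** gamma * ce).mean()  # down-weights easy examples
    return alpha * focal + (1 - alpha) * ce.mean()

loss = focal_ce_loss(torch.randn(16, 7), torch.randint(0, 7, (16,)))
print(loss.item())
```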

17 pages, 7804 KB  
Article
A 3D Camera-Based Approach for Real-Time Hand Configuration Recognition in Italian Sign Language
by Luca Ulrich, Asia De Luca, Riccardo Miraglia, Emma Mulassano, Simone Quattrocchio, Giorgia Marullo, Chiara Innocente, Federico Salerno and Enrico Vezzetti
Sensors 2026, 26(3), 1059; https://doi.org/10.3390/s26031059 - 6 Feb 2026
Viewed by 346
Abstract
Deafness poses significant challenges to effective communication, particularly in contexts where access to sign language interpreters is limited. Hand configuration recognition represents a fundamental component of sign language understanding, as configurations constitute a core cheremic element in many sign languages, including Italian Sign Language (LIS). In this work, we address configuration-level recognition as an independent classification task and propose a machine vision framework based on RGB-D sensing. The proposed approach combines MediaPipe-based hand landmark extraction with normalized three-dimensional geometric features and a Support Vector Machine classifier. The first contribution of this study is the formulation of LIS hand configuration recognition as a standalone, configuration-level problem, decoupled from temporal gesture modeling. The second contribution is the integration of sensor-acquired RGB-D depth measurements into the landmark-based feature representation, enabling a direct comparison with estimated depth obtained from monocular data. The third contribution consists of a systematic experimental evaluation on two LIS configuration sets (6 and 16 classes), demonstrating that the use of real depth significantly improves classification performance and class separability, particularly for geometrically similar configurations. The results highlight the critical role of depth quality in configuration-level recognition and provide insights into the design of robust vision-based systems for LIS analysis. Full article
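A minimal sketch of the landmark-to-SVM pipeline under stated assumptions: 21 MediaPipe-style hand landmarks, wrist-centered translation and scale normalization, and synthetic data standing in for the RGB-D recordings.

```python
# Sketch: normalize 3D hand landmarks, then classify with an SVM.
import numpy as np
from sklearn.svm import SVC

def normalize_landmarks(pts):          # pts: (21, 3); wrist assumed at index 0
    pts = pts - pts[0]                 # translation invariance
    scale = np.linalg.norm(pts, axis=1).max()
    return (pts / (scale + 1e-8)).ravel()   # scale invariance, flattened

rng = np.random.default_rng(0)
X = np.stack([normalize_landmarks(rng.normal(size=(21, 3))) for _ in range(60)])
y = rng.integers(0, 6, size=60)        # 6 configuration classes (synthetic labels)
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(X[:3]))
```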
(This article belongs to the Special Issue Sensing and Machine Learning Control: Progress and Applications)

24 pages, 29852 KB  
Article
Dual-Axis Transformer-GNN Framework for Touchless Finger Location Sensing by Using Wi-Fi Channel State Information
by Minseok Koo and Jaesung Park
Electronics 2026, 15(3), 565; https://doi.org/10.3390/electronics15030565 - 28 Jan 2026
Viewed by 315
Abstract
Camera, lidar, and wearable-based gesture recognition technologies face practical limitations such as lighting sensitivity, occlusion, hardware cost, and user inconvenience. Wi-Fi channel state information (CSI) can be used as a contactless alternative to capture subtle signal variations caused by human motion. However, existing CSI-based methods are highly sensitive to domain shifts and often suffer notable performance degradation when applied to environments different from the training conditions. To address this issue, we propose a domain-robust touchless finger location sensing framework that operates reliably even in a single-link environment composed of commercial Wi-Fi devices. The proposed system applies preprocessing procedures to reduce noise and variability introduced by environmental factors and introduces a multi-domain segment combination strategy to increase the domain diversity during training. In addition, the dual-axis transformer learns temporal and spatial features independently, and the GNN-based integration module incorporates relationships among segments originating from different domains to produce more generalized representations. The proposed model is evaluated using CSI data collected from various users and days; experimental results show that the proposed method achieves an in-domain accuracy of 99.31% and outperforms the best baseline by approximately 4% and 3% in cross-user and cross-day evaluation settings, respectively, even in a single-link setting. Our work demonstrates a viable path for robust, calibration-free finger-level interaction using ubiquitous single-link Wi-Fi in real-world and constrained environments, providing a foundation for more reliable contactless interaction systems. Full article
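One reading of the "dual-axis transformer" is sketched below: one encoder attends along the time axis of a CSI tensor and another along the subcarrier axis, with the two views concatenated. All dimensions are illustrative assumptions, and the paper's GNN-based integration module is not reproduced here.

```python
# Sketch of dual-axis (time and subcarrier) attention over a CSI tensor.
import torch
import torch.nn as nn

class DualAxisEncoder(nn.Module):
    def __init__(self, n_subcarriers=64, n_frames=100, d_model=64):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.time_in = nn.Linear(n_subcarriers, d_model)
        self.freq_in = nn.Linear(n_frames, d_model)
        self.time_enc = nn.TransformerEncoder(layer(), num_layers=2)
        self.freq_enc = nn.TransformerEncoder(layer(), num_layers=2)

    def forward(self, csi):                     # csi: (B, frames, subcarriers)
        t = self.time_enc(self.time_in(csi))                   # attend across time
        f = self.freq_enc(self.freq_in(csi.transpose(1, 2)))   # across subcarriers
        return torch.cat([t.mean(1), f.mean(1)], dim=-1)

emb = DualAxisEncoder()(torch.randn(2, 100, 64))
print(emb.shape)  # torch.Size([2, 128])
```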

24 pages, 6118 KB  
Article
Effective Approach for Classifying EMG Signals Through Reconstruction Using Autoencoders
by Natalia Rendón Caballero, Michelle Rojo González, Marcos Aviles, José Manuel Alvarez Alvarado, José Billerman Robles-Ocampo, Perla Yazmin Sevilla-Camacho and Juvenal Rodríguez-Reséndiz
AI 2026, 7(1), 36; https://doi.org/10.3390/ai7010036 - 22 Jan 2026
Viewed by 505
Abstract
The study of muscle signal classification has been widely explored for the control of myoelectric prostheses. Traditional approaches rely on manually designed features extracted from time- or frequency-domain representations, which may limit the generalization and adaptability of EMG-based systems. In this work, an autoencoder-based framework is proposed for automatic feature extraction, enabling the learning of compact latent representations directly from raw EMG signals and reducing dependence on handcrafted features. A custom instrumentation system with three surface EMG sensors was developed and placed on selected forearm muscles to acquire signals associated with five hand movements from 20 healthy participants aged 18 to 40 years. The signals were segmented into 200 ms windows with 75% overlap. The proposed method employs a recurrent autoencoder with a symmetric encoder–decoder architecture, trained independently for each sensor to achieve accurate signal reconstruction, with a minimum reconstruction loss of 3.3×10⁻⁴ V². The encoder's latent representations were then used to train a dense neural network for gesture classification. An overall efficiency of 93.84% was achieved, demonstrating that the proposed reconstruction-based approach provides high classification performance and represents a promising solution for future EMG-based assistive and control applications. Full article
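The windowing step is fully specified by the abstract (200 ms windows, 75% overlap, i.e. a 50 ms hop); a small sketch, assuming a 1 kHz sampling rate for illustration:

```python
# Sketch of 200 ms / 75%-overlap window segmentation (1 kHz rate assumed).
import numpy as np

def segment(signal, fs=1000, win_ms=200, overlap=0.75):
    win = int(fs * win_ms / 1000)          # 200 samples at 1 kHz
    hop = int(win * (1 - overlap))         # 50-sample hop
    return np.stack([signal[i:i + win]
                     for i in range(0, len(signal) - win + 1, hop)])

emg = np.random.randn(3000)                # 3 s of synthetic single-channel EMG
windows = segment(emg)
print(windows.shape)                       # (57, 200)
```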
(This article belongs to the Special Issue Transforming Biomedical Innovation with Artificial Intelligence)

27 pages, 4802 KB  
Article
Fine-Grained Radar Hand Gesture Recognition Method Based on Variable-Channel DRSN
by Penghui Chen, Siben Li, Chenchen Yuan, Yujing Bai and Jun Wang
Electronics 2026, 15(2), 437; https://doi.org/10.3390/electronics15020437 - 19 Jan 2026
Viewed by 403
Abstract
With the ongoing miniaturization of smart devices, fine-grained hand gesture recognition using millimeter-wave radar has attracted increasing attention, yet practical deployment remains challenging in continuous-gesture segmentation, robust feature extraction, and reliable classification. This paper presents an end-to-end fine-grained gesture recognition framework based on frequency-modulated continuous-wave (FMCW) millimeter-wave radar, including gesture design, data acquisition, feature construction, and neural network-based classification. Ten gesture types are recorded (eight valid gestures and two return-to-neutral gestures); for classification, the two return-to-neutral gesture types are merged into a single invalid class, yielding a nine-class task. A sliding-window segmentation method is developed using short-time Fourier transform (STFT)-based Doppler-time representations, and a dataset of 4050 labeled samples is collected. Multiple signal classification (MUSIC)-based super-resolution estimation is adopted to construct range–time and angle–time representations, and instance-wise normalization is applied to Doppler and range features to mitigate inter-individual variability without test leakage. For recognition, a variable-channel deep residual shrinkage network (DRSN) is employed to improve robustness to noise, supporting single-, dual-, and triple-channel feature inputs. Results under both subject-dependent evaluation with repeated random splits and subject-independent leave-one-subject-out (LOSO) cross-validation show that the DRSN architecture consistently outperforms the RefineNet-based baseline, and the triple-channel configuration achieves the best performance (98.88% accuracy). Overall, the variable-channel design enables flexible feature selection to meet diverse application requirements. Full article
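A sketch of producing an STFT-based Doppler-time map with SciPy; the signal, sampling rate, and window parameters are synthetic stand-ins for the FMCW data described above.

```python
# Sketch: Doppler-time magnitude map via STFT of a synthetic radar beat signal.
import numpy as np
from scipy.signal import stft

fs = 2000                                  # slow-time sampling rate (assumed)
t = np.arange(0, 1.0, 1 / fs)
sig = np.exp(1j * 2 * np.pi * (100 * t + 40 * t ** 2))  # time-varying Doppler
f, tt, Z = stft(sig, fs=fs, window="hann", nperseg=128, noverlap=96,
                return_onesided=False)     # two-sided spectrum for complex input
doppler_time = np.abs(Z)                   # (freq bins, time frames) map
print(doppler_time.shape)
```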

26 pages, 3626 KB  
Article
A Lightweight Frozen Multi-Convolution Dual-Branch Network for Efficient sEMG-Based Gesture Recognition
by Shengbiao Wu, Zhezhe Lv, Yuehong Li, Chengmin Fang, Tao You and Jiazheng Gui
Sensors 2026, 26(2), 580; https://doi.org/10.3390/s26020580 - 15 Jan 2026
Viewed by 360
Abstract
Gesture recognition is important for rehabilitation assistance and intelligent prosthetic control. However, surface electromyography (sEMG) signals exhibit strong non-stationarity, and conventional deep-learning models require long training time and high computational cost, limiting their use on resource-constrained devices. This study proposes a Frozen Multi-Convolution Dual-Branch Network (FMC-DBNet) to address these challenges. The model employs randomly initialized and fixed convolutional kernels for training-free multi-scale feature extraction, substantially reducing computational overhead. A dual-branch architecture is adopted to capture complementary temporal and physiological patterns from raw sEMG signals and intrinsic mode functions (IMFs) obtained through variational mode decomposition (VMD). In addition, positive-proportion (PPV) and global-average-pooling (GAP) statistics enhance lightweight multi-resolution representation. Experiments on the Ninapro DB1 dataset show that FMC-DBNet achieves an average accuracy of 96.4% ± 1.9% across 27 subjects and reduces training time by approximately 90% compared with a conventional trainable CNN baseline. These results demonstrate that frozen random-convolution structures provide an efficient and robust alternative to fully trained deep networks, offering a promising solution for low-power and computationally efficient sEMG gesture recognition. Full article
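The frozen random-convolution idea with PPV and GAP pooling resembles ROCKET-style feature extraction; a sketch under that reading, with kernel counts and sizes as assumptions rather than the FMC-DBNet configuration:

```python
# Sketch: training-free features from frozen random 1-D kernels with PPV + GAP.
import torch
import torch.nn as nn

class FrozenRandomConv(nn.Module):
    def __init__(self, in_ch=1, n_kernels=32, sizes=(7, 15, 31)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(in_ch, n_kernels, k, padding=k // 2) for k in sizes)
        for p in self.parameters():
            p.requires_grad = False        # kernels stay at random initialization

    @torch.no_grad()
    def forward(self, x):                  # x: (B, C, T)
        feats = []
        for conv in self.convs:
            a = conv(x)
            feats.append((a > 0).float().mean(-1))  # PPV per kernel
            feats.append(a.mean(-1))                # GAP per kernel
        return torch.cat(feats, dim=-1)    # fed to a small trainable classifier

emb = FrozenRandomConv()(torch.randn(4, 1, 400))
print(emb.shape)  # torch.Size([4, 192])
```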
(This article belongs to the Section Electronic Sensors)

27 pages, 4631 KB  
Article
Multimodal Minimal-Angular-Geometry Representation for Real-Time Dynamic Mexican Sign Language Recognition
by Gerardo Garcia-Gil, Gabriela del Carmen López-Armas and Yahir Emmanuel Ramirez-Pulido
Technologies 2026, 14(1), 48; https://doi.org/10.3390/technologies14010048 - 8 Jan 2026
Viewed by 506
Abstract
Current approaches to dynamic sign language recognition commonly rely on dense landmark representations, which impose high computational cost and hinder real-time deployment on resource-constrained devices. To address this limitation, this work proposes a computationally efficient framework for real-time dynamic Mexican Sign Language (MSL) recognition based on a multimodal minimal angular-geometry representation. Instead of processing complete landmark sets (e.g., MediaPipe Holistic with up to 468 keypoints), the proposed method encodes the relational geometry of the hands, face, and upper body into a compact set of 28 invariant internal angular descriptors. This representation substantially reduces feature dimensionality and computational complexity while preserving linguistically relevant manual and non-manual information required for grammatical and semantic discrimination in MSL. A real-time end-to-end pipeline is developed, comprising multimodal landmark extraction, angular feature computation, and temporal modeling using a Bidirectional Long Short-Term Memory (BiLSTM) network. The system is evaluated on a custom dataset of dynamic MSL gestures acquired under controlled real-time conditions. Experimental results demonstrate that the proposed approach achieves 99% accuracy and a 99% macro F1-score, matching state-of-the-art performance while using dramatically fewer features. The compactness, interpretability, and efficiency of the minimal angular descriptor make the proposed system suitable for real-time deployment on low-cost devices, contributing toward more accessible and inclusive sign language recognition technologies. Full article
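A single internal angular descriptor of the kind described can be computed from three keypoints; the sketch below shows the geometry, while the paper's specific choice of 28 angles is not reproduced.

```python
# Sketch: internal angle at joint b formed by 3D keypoints a-b-c.
import numpy as np

def joint_angle(a, b, c):
    """Angle at b (degrees) between vectors b->a and b->c."""
    u, v = a - b, c - b
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Example: elbow angle from shoulder, elbow, and wrist keypoints.
shoulder = np.array([0.0, 0.0, 0.0])
elbow = np.array([0.3, -0.1, 0.0])
wrist = np.array([0.5, 0.2, 0.0])
print(round(joint_angle(shoulder, elbow, wrist), 1))
```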
(This article belongs to the Special Issue Image Analysis and Processing)

24 pages, 15172 KB  
Article
Real-Time Hand Gesture Recognition for IoT Devices Using FMCW mmWave Radar and Continuous Wavelet Transform
by Anna Ślesicka and Adam Kawalec
Electronics 2026, 15(2), 250; https://doi.org/10.3390/electronics15020250 - 6 Jan 2026
Viewed by 642
Abstract
This paper presents an intelligent framework for real-time hand gesture recognition using Frequency-Modulated Continuous-Wave (FMCW) mmWave radar and deep learning. Unlike traditional radar-based recognition methods that rely on Discrete Fourier Transform (DFT) signal representations and focus primarily on classifier optimization, the proposed system introduces a novel pre-processing stage based on the Continuous Wavelet Transform (CWT). The CWT enables the extraction of discriminative time–frequency features directly from raw radar signals, improving the interpretability and robustness of the learned representations. A lightweight convolutional neural network architecture is then designed to process the CWT maps for efficient classification on edge IoT devices. Experimental validation with data collected from 20 participants performing five standardized gestures demonstrates that the proposed framework achieves an accuracy of up to 99.87% using the Morlet wavelet, with strong generalization to unseen users (82–84% accuracy). The results confirm that the integration of CWT-based radar signal processing with deep learning forms a computationally efficient and accurate intelligent system for human–computer interaction in real-time IoT environments. Full article
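A sketch of the Morlet-wavelet CWT pre-processing stage using PyWavelets, with a synthetic chirp standing in for the radar return; the scale range is an arbitrary illustrative choice.

```python
# Sketch: continuous wavelet transform of a synthetic radar-like signal.
import numpy as np
import pywt

fs = 1000
t = np.arange(0, 1.0, 1 / fs)
sig = np.cos(2 * np.pi * (50 * t + 30 * t ** 2))   # synthetic micro-Doppler-like chirp
scales = np.arange(1, 128)
coeffs, freqs = pywt.cwt(sig, scales, "morl", sampling_period=1 / fs)
cwt_map = np.abs(coeffs)          # time-frequency map fed to a lightweight CNN
print(cwt_map.shape)              # (127, 1000)
```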
(This article belongs to the Special Issue Convolutional Neural Networks and Vision Applications, 4th Edition)

17 pages, 1312 KB  
Article
RGB Fusion of Multiple Radar Sensors for Deep Learning-Based Traffic Hand Gesture Recognition
by Hüseyin Üzen
Electronics 2026, 15(1), 140; https://doi.org/10.3390/electronics15010140 - 28 Dec 2025
Viewed by 455
Abstract
Hand gesture recognition (HGR) systems play a critical role in modern intelligent transportation frameworks by enabling reliable communication between pedestrians, traffic operators, and autonomous vehicles. This work presents a novel traffic hand gesture recognition method that combines nine grayscale radar images captured from multiple millimeter-wave radar nodes into a single RGB representation through an optimized rotation–shift fusion strategy. This transformation preserves complementary spatial information while minimizing inter-image interference, enabling deep learning models to more effectively utilize the distinctive micro-Doppler and spatial patterns embedded in radar measurements. Extensive experimental studies were conducted to verify the model’s performance, demonstrating that the proposed RGB fusion approach provides higher classification accuracy than single-sensor or unfused representations. In addition, the proposed model outperformed state-of-the-art methods in the literature with an accuracy of 92.55%. These results highlight its potential as a lightweight yet powerful solution for reliable gesture interpretation in future intelligent transportation and human–vehicle interaction systems. Full article
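One plausible reading of packing nine grayscale radar maps into a single RGB image is three maps per color channel, each given a small shift before averaging; the sketch below follows that reading with arbitrary shift values, and the paper's optimized rotation–shift fusion strategy is not reproduced.

```python
# Rough sketch: fuse nine grayscale radar maps into one RGB image.
# Shift values are arbitrary stand-ins for the paper's optimized strategy.
import numpy as np

def fuse_to_rgb(maps, shifts=(0, 3, 6)):      # maps: (9, H, W) grayscale
    channels = []
    for c in range(3):
        group = [np.roll(maps[3 * c + i], s, axis=1)   # circular column shift
                 for i, s in enumerate(shifts)]
        channels.append(np.mean(group, axis=0))
    return np.stack(channels, axis=-1)        # (H, W, 3), usable by a standard CNN

rgb = fuse_to_rgb(np.random.rand(9, 64, 64))
print(rgb.shape)  # (64, 64, 3)
```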
(This article belongs to the Special Issue Advanced Techniques for Multi-Agent Systems)

24 pages, 2879 KB  
Article
Skeleton-Based Real-Time Hand Gesture Recognition Using Data Fusion and Ensemble Multi-Stream CNN Architecture
by Maki K. Habib, Oluwaleke Yusuf and Mohamed Moustafa
Technologies 2025, 13(11), 484; https://doi.org/10.3390/technologies13110484 - 26 Oct 2025
Viewed by 1819
Abstract
Hand Gesture Recognition (HGR) is a vital technology that enables intuitive human–computer interaction in various domains, including augmented reality, smart environments, and assistive systems. Achieving both high accuracy and real-time performance remains challenging due to the complexity of hand dynamics, individual morphological variations, and computational limitations. This paper presents a lightweight and efficient skeleton-based HGR framework that addresses these challenges through an optimized multi-stream Convolutional Neural Network (CNN) architecture and a trainable ensemble tuner. Dynamic 3D gestures are transformed into structured, noise-minimized 2D spatiotemporal representations via enhanced data-level fusion, supporting robust classification across diverse spatial perspectives. The ensemble tuner strengthens semantic relationships between streams and improves recognition accuracy. Unlike existing solutions that rely on high-end hardware, the proposed framework achieves real-time inference on consumer-grade devices without compromising accuracy. Experimental validation across five benchmark datasets (SHREC2017, DHG1428, FPHA, LMDHG, and CNR) confirms consistent or superior performance with reduced computational overhead. Additional validation on the SBU Kinect Interaction Dataset highlights generalization potential for broader Human Action Recognition (HAR) tasks. This advancement bridges the gap between efficiency and accuracy, supporting scalable deployment in AR/VR, mobile computing, interactive gaming, and resource-constrained environments. Full article
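Encoding a 3D skeleton sequence as a 2D image (joints along one axis, frames along the other, coordinates as channels) is a common form of data-level fusion; a sketch under that assumption, without the paper's exact fusion scheme or multi-stream ensemble tuner:

```python
# Sketch: encode a 3D skeleton sequence as a 2D spatiotemporal image.
import numpy as np

def skeleton_to_image(seq):            # seq: (frames, joints, 3)
    img = seq.transpose(1, 0, 2)       # (joints, frames, xyz) -> H x W x C
    lo, hi = img.min(), img.max()
    return ((img - lo) / (hi - lo + 1e-8) * 255).astype(np.uint8)

seq = np.random.randn(32, 21, 3)       # 32 frames of 21 hand joints (synthetic)
print(skeleton_to_image(seq).shape)    # (21, 32, 3)
```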
