
Search Results (859)

Search Parameters:
Keywords = gesture recognition

28 pages, 3548 KB  
Article
Edge Computing Approach to AI-Based Gesture for Human–Robot Interaction and Control
by Nikola Ivačko, Ivan Ćirić and Miloš Simonović
Computers 2026, 15(4), 241; https://doi.org/10.3390/computers15040241 - 14 Apr 2026
Viewed by 253
Abstract
This paper presents an edge-deployable, vision-based framework for human–robot interaction using an xArm collaborative robot, a single RGB camera mounted on the robot wrist, and lightweight AI-based perception modules. The system enables intuitive, contact-free control by combining hand understanding and object detection within a unified perception–decision–control pipeline. Hand landmarks are extracted using MediaPipe Hands, from which continuous hand trajectories, static gestures, and dynamic gestures are derived. Task objects are detected using a YOLO-based model, and both hand and object observations are mapped into the robot workspace using ArUco-based planar calibration. To ensure stable robot motion, the hand control signal is smoothed using low-pass and Kalman filtering, while dynamic gestures such as waving are recognized using a lightweight LSTM classifier. The complete pipeline runs locally on edge hardware, specifically NVIDIA Jetson Orin Nano and Raspberry Pi 5 with a Hailo AI accelerator. Experimental evaluation includes trajectory stability, gesture recognition reliability, and runtime performance on both platforms. Results show that filtering significantly reduces hand-tracking jitter, gesture recognition provides stable command states for control, and both edge devices support real-time operation, with Jetson achieving consistently lower runtime than Raspberry Pi. The proposed system demonstrates the feasibility of low-cost edge AI solutions for responsive and practical human–robot interaction in collaborative industrial environments. Full article
(This article belongs to the Special Issue Intelligent Edge: When AI Meets Edge Computing)
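The Kalman-based trajectory smoothing this abstract describes can be illustrated with a minimal 1-D constant-velocity filter over a noisy hand coordinate. The paper does not state its filter parameters, so the process noise `q`, measurement noise `r`, and 30 fps frame time below are illustrative assumptions, not the authors' values:

```python
import numpy as np

def kalman_smooth(zs, dt=1 / 30, q=1e-3, r=1e-2):
    """Smooth a noisy 1-D hand-coordinate track with a constant-velocity Kalman filter."""
    x = np.array([zs[0], 0.0])             # state: [position, velocity]
    P = np.eye(2)                          # state covariance
    F = np.array([[1.0, dt], [0.0, 1.0]])  # constant-velocity transition
    H = np.array([[1.0, 0.0]])             # we observe position only
    Q = q * np.eye(2)                      # process noise (assumed)
    R = np.array([[r]])                    # measurement noise (assumed)
    out = []
    for z in zs:
        # predict
        x = F @ x
        P = F @ P @ F.T + Q
        # update with the new landmark measurement
        y = z - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + (K @ y).ravel()
        P = (np.eye(2) - K @ H) @ P
        out.append(x[0])
    return np.array(out)
```

In a real pipeline the filtered position, rather than the raw landmark, would be mapped into the robot workspace, which is what keeps the servo commands from jittering.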

17 pages, 385 KB  
Article
Assessing the Resilience of sEMG Classifiers to Sensor Malfunction and Signal Saturation
by Congyi Zhang, Dalin Zhou, Yinfeng Fang, Dongxu Gao and Zhaojie Ju
Sensors 2026, 26(8), 2386; https://doi.org/10.3390/s26082386 - 13 Apr 2026
Viewed by 348
Abstract
Surface electromyography (sEMG) is widely used for gesture recognition, yet the way classic feature–classifier pipelines fail under realistic signal degradations is still poorly quantified. Existing studies typically report accuracy on clean laboratory data, leaving open how amplitude saturation and channel dropout jointly affect different feature combinations, classifiers, and subjects. In this work, we provide, to our knowledge, the first systematic robustness map of a conventional sEMG pipeline under controlled clipping and single-sensor failure. sEMG from nine subjects performing a multi-session, multi-gesture protocol is windowed (250 ms, 50 ms hop) and represented using four common time-domain features (Root Mean Square, Variance, Zero Crossing, and Waveform Length). We exhaustively evaluate single features and all pairwise fusions with three standard classifiers (Support Vector Machine (RBF kernel), Linear Discriminant Analysis, and Random Forest) over (i) a sweep of symmetric saturation thresholds (10⁻⁶ to 10⁻¹) and (ii) five single-channel dropout scenarios, reporting subject-wise dispersion rather than aggregate scores alone. This design enables explicit characterization of the following: (1) accuracy recovery as clipping weakens for each feature pair; (2) dependency of robustness on which channel fails; and (3) differences among Support Vector Machine, Linear Discriminant Analysis, and Random Forest under identical degradations. The results show that lightweight feature pairs (Root Mean Square + Waveform Length, Variance + Zero Crossing, and Waveform Length + Zero Crossing) coupled with Random Forest form a consistently robust operating point, with performance recovering as clipping weakens and remaining resilient under single-channel dropout.
Beyond robustness, the conventional pipeline trains substantially faster than representative deep learning baselines under a unified end-to-end timing definition, supporting real-time recalibration and repeated robustness sweeps in wearable deployments. Full article
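The windowing, feature extraction, and saturation steps described in this abstract can be sketched as follows. The 250 ms window, 50 ms hop, symmetric clipping, and the four time-domain features follow the abstract; the sampling rate and the clipping level in the usage example are illustrative assumptions:

```python
import numpy as np

def td_features(window):
    """Four classic sEMG time-domain features of one analysis window."""
    rms = np.sqrt(np.mean(window ** 2))              # Root Mean Square
    var = np.var(window)                             # Variance
    zc = np.sum(np.diff(np.signbit(window)) != 0)    # Zero Crossings (sign changes)
    wl = np.sum(np.abs(np.diff(window)))             # Waveform Length
    return np.array([rms, var, zc, wl])

def windowed_features(x, fs=1000, win_ms=250, hop_ms=50, clip=None):
    """Slide a 250 ms window with a 50 ms hop; optionally clip first to simulate saturation."""
    if clip is not None:
        x = np.clip(x, -clip, clip)                  # symmetric amplitude saturation
    win, hop = int(fs * win_ms / 1000), int(fs * hop_ms / 1000)
    starts = range(0, len(x) - win + 1, hop)
    return np.stack([td_features(x[s:s + win]) for s in starts])
```

Sweeping the `clip` argument over a range of thresholds and re-running a classifier on the resulting feature matrices reproduces the kind of degradation sweep the study reports.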

27 pages, 6782 KB  
Article
Development and Evaluation of a Data Glove-Based System for Assisting Puzzle Solving
by Shashank Srikanth Bharadwaj, Kazuma Sato and Lei Jing
Sensors 2026, 26(8), 2341; https://doi.org/10.3390/s26082341 - 10 Apr 2026
Viewed by 347
Abstract
Many hands-on tasks remain difficult to fully automate because they require human dexterity and flexible object handling. Data gloves offer a promising interface for sensing hand–object interactions, but most prior systems focus on gesture recognition or object classification rather than closed-loop, step-by-step task guidance. In this work, we develop and evaluate a tactile-sensing operation support system using an e-textile data glove with 88 pressure sensors, a tactile pressure sheet for placement verification, and a GUI that provides step-by-step instructions. As a core component, a CNN classifies the grasped state as bare hand or one of four discs with 93.3% accuracy using 16,175 training samples collected from five participants. In a user study on the Tower of Hanoi task as a controlled proxy for multi-step manipulation, the system reduced mean solving time by 51.5% (from 242.6 s to 117.8 s), reduced the number of disc movements (35.4 to 15, about 20 fewer moves on average), and lowered perceived workload (NASA-TLX) by 53.1% (from 68.5 to 32.1), while achieving a SUS score of 75. These results demonstrate the feasibility of tactile-based step verification and guidance in a controlled multi-step task; broader generalization requires evaluation with larger and more diverse participant groups and tasks. Full article
(This article belongs to the Section Intelligent Sensors)

7 pages, 1242 KB  
Proceeding Paper
Real-Time Recognition of Dual-Arm Motion Using Joint Direction Vectors and Temporal Deep Learning
by Yi-Hsiang Tseng, Che-Wei Hsu and Yih-Guang Leu
Eng. Proc. 2025, 120(1), 75; https://doi.org/10.3390/engproc2025120075 - 9 Apr 2026
Viewed by 191
Abstract
We developed a dual-arm motion recognition system designed for real-time upper-limb movement analysis using video input. The system integrates MediaPipe Hands for skeletal keypoint detection, a feature extraction pipeline that encodes spatial and temporal characteristics from upper-limb joints, and a three-layer long short-term memory network for temporal modeling and classification. By computing directional vectors from the shoulder to the elbow and wrist, a 168-dimensional feature vector is generated per frame. Sequences of 90 frames are used to capture full motion patterns. The system effectively supports multi-class recognition of coordinated dual-arm gestures, offering applications in rehabilitation, gesture control, and human–computer interaction. Full article
(This article belongs to the Proceedings of 8th International Conference on Knowledge Innovation and Invention)
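The shoulder-to-elbow and shoulder-to-wrist directional vectors described above can be computed per frame as below. This is only one arm's contribution for a single frame; how the paper assembles these into its full 168-dimensional vector (which joints, which normalization) is not specified here, so the unit-vector normalization is an assumption:

```python
import numpy as np

def limb_direction_features(shoulder, elbow, wrist):
    """Unit direction vectors shoulder->elbow and shoulder->wrist for one arm, one frame.

    Each input is a 3-D point (x, y, z); output is a 6-dim feature chunk.
    """
    def unit(v):
        n = np.linalg.norm(v)
        return v / n if n > 0 else v    # guard against zero-length segments
    upper = unit(np.asarray(elbow, float) - np.asarray(shoulder, float))
    reach = unit(np.asarray(wrist, float) - np.asarray(shoulder, float))
    return np.concatenate([upper, reach])
```

Stacking such chunks for both arms over 90 consecutive frames yields the kind of sequence an LSTM classifier would consume.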

16 pages, 1624 KB  
Article
Surface EMG-Based Hand Gesture Recognition Using a Hybrid Multistream Deep Learning Architecture
by Yusuf Çelik and Umit Can
Sensors 2026, 26(7), 2281; https://doi.org/10.3390/s26072281 - 7 Apr 2026
Viewed by 317
Abstract
Surface electromyography (sEMG) enables non-invasive measurement of muscle activity for applications such as human–machine interaction, rehabilitation, and prosthesis control. However, high noise levels, inter-subject variability, and the complex nature of muscle activation hinder robust gesture classification. This study proposes a multistream hybrid deep-learning architecture for the FORS-EMG dataset to address these challenges. The model integrates Temporal Convolutional Networks (TCN), depthwise separable convolutions, bidirectional Long Short-Term Memory (LSTM)–Gated Recurrent Unit (GRU) layers, and a Transformer encoder to capture complementary temporal and spectral patterns, and an ArcFace-based classifier to enhance class separability. We evaluate the approach under three protocols: subject-wise, random split without augmentation, and random split with augmentation. In the augmented random-split setting, the model attains 96.4% accuracy, surpassing previously reported values. In the subject-wise setting, accuracy is 74%, revealing limited cross-user generalization. The results demonstrate the method’s high performance and highlight the impact of data-partition strategies for real-world sEMG-based gesture recognition. Full article
(This article belongs to the Special Issue Machine Learning in Biomedical Signal Processing)
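The ArcFace-based classifier mentioned in this abstract improves class separability by adding an angular margin to the target class before scaling the cosine logits. A minimal NumPy sketch of that logit computation follows; the margin and scale values are conventional ArcFace defaults, not values taken from the paper:

```python
import numpy as np

def arcface_logits(embeddings, weights, labels, margin=0.5, scale=64.0):
    """ArcFace: add angular margin `margin` to the target-class angle, then scale.

    embeddings: (N, D) feature vectors; weights: (C, D) class centers; labels: (N,) ints.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(e @ w.T, -1.0, 1.0)                 # cosine similarity to each class
    theta = np.arccos(cos)
    rows = np.arange(len(labels))
    out = cos.copy()
    out[rows, labels] = np.cos(theta[rows, labels] + margin)  # penalize the target class
    return scale * out
```

During training these logits feed a standard cross-entropy loss; the margin forces embeddings of the same gesture class to cluster more tightly than plain softmax would.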

16 pages, 5700 KB  
Article
A Deep Learning-Based EIT System for Robust Gesture Recognition Under Confounding Factors
by Hancong Wu, Guanghong Huang, Wentao Wang and Yuan Wen
Biosensors 2026, 16(4), 200; https://doi.org/10.3390/bios16040200 - 1 Apr 2026
Viewed by 348
Abstract
Gesture recognition with electrical impedance tomography (EIT) has enormous potential as a tool for human–machine interaction because of its low cost, low complexity, and high temporal resolution. Although high-precision EIT-based gesture recognition has been achieved in ideal scenarios, ensuring its consistent performance under interference remains challenging. This article presents a novel method to alleviate the effect of confounding factors on EIT gesture recognition. An EIT armband was designed to mitigate the effect of contact impedance variation based on equivalent circuit analysis, and a spatial–temporal fusion network, named the Fold Atrous Spatial Pyramid Pooling-Gated Recurrent Unit (FASPP-GRU), was developed for robust gesture classification. The results showed that the proposed two-layer electrode maintained a stable contact impedance when its contact force with the skin was changed. Although confounding factors caused significant changes in baseline forearm impedance, FASPP-GRU achieved 80% accuracy under the effect of limb position changes and dynamic changes in muscle state over time, which outperforms conventional classifiers. With an 87 μs inference time, the proposed system shows strong potential for real-time applications. Full article

15 pages, 287 KB  
Proceeding Paper
Computer Vision for Collaborative Robots in Industry 5.0: A Survey of Techniques, Gaps, and Future Directions
by Himani Varolia, César M. A. Vasques and Adélio M. S. Cavadas
Eng. Proc. 2026, 124(1), 99; https://doi.org/10.3390/engproc2026124099 - 24 Mar 2026
Viewed by 308
Abstract
Collaborative robots are increasingly deployed in human-shared industrial workspaces, where perception is a key enabler for safe interaction, flexible manipulation, and human-aware task execution. In the context of Industry 5.0, computer vision for cobots must meet not only accuracy requirements but also human-centered constraints such as safety, transparency, robustness, and practical deployability. This paper surveys computer-vision approaches used in collaborative robotics and organizes them through a task-driven taxonomy covering detection, segmentation, tracking, pose estimation, action/gesture recognition, and safety monitoring. Beyond a descriptive literature review, the paper provides a task-driven qualitative analytical perspective that relates families of computer vision methods to key industrial constraints, including occlusion, lighting variability, clutter, domain shift, real-time latency, and annotation cost, and summarizes comparative strengths and failure modes using unified criteria. We further discuss challenges related to data availability and evaluation practices, highlighting gaps in reproducibility, standardized metrics, and real-world validation in shared human–robot environments. Finally, we outline implementation and deployment considerations across common software stacks (e.g., Python-based pipelines and MATLAB-based prototyping), emphasizing ROS2 integration, edge inference, and lifecycle maintenance. The survey concludes with research directions toward robust multimodal perception, explainable human-aware vision, and benchmarkable safety-critical perception for next-generation collaborative robotic systems. Full article
(This article belongs to the Proceedings of The 6th International Electronic Conference on Applied Sciences)
23 pages, 5784 KB  
Article
Learning Italian Hand Gesture Culture Through an Automatic Gesture Recognition Approach
by Chiara Innocente, Giorgio Di Pisa, Irene Lionetti, Andrea Mamoli, Manuela Vitulano, Giorgia Marullo, Simone Maffei, Enrico Vezzetti and Luca Ulrich
Future Internet 2026, 18(4), 177; https://doi.org/10.3390/fi18040177 - 24 Mar 2026
Viewed by 303
Abstract
Italian hand gestures constitute a distinctive and widely recognized form of nonverbal communication, deeply embedded in everyday interaction and cultural identity. Despite their prominence, these gestures are rarely formalized or systematically taught, posing challenges for foreign speakers and visitors seeking to interpret their meaning and pragmatic use. Moreover, their ephemeral and embodied nature complicates traditional preservation and transmission approaches, positioning them within the broader domain of intangible cultural heritage. This paper introduces a machine learning–based framework for recognizing iconic Italian hand gestures, designed to support cultural learning and engagement among foreign speakers and visitors. The approach combines RGB–D sensing with depth-enhanced geometric feature extraction, employing interpretable classification models trained on a purpose-built dataset. The recognition system is integrated into a non-immersive virtual reality application simulating an interactive digital totem conceived for public arrival spaces, providing tutorial content, real-time gesture recognition, and immediate feedback within a playful and accessible learning environment. Three supervised machine learning pipelines were evaluated, and Random Forest achieved the best overall performance. Its integration with an Isolation Forest module was further considered for deployment, achieving a macro-averaged accuracy and F1-score of 0.82 under a 5-fold cross-validation protocol. An experimental user study was conducted with 25 subjects to evaluate the proposed interactive system in terms of usability, user engagement, and learning effectiveness, obtaining favorable results and demonstrating its potential as a practical tool for cultural education and intercultural communication. Full article

19 pages, 759 KB  
Article
Dual-Stream BiLSTM–Transformer Architecture for Real-Time Two-Handed Dynamic Sign Language Gesture Recognition
by Enachi Andrei, Turcu Corneliu-Octavian, Culea George, Andrioaia Dragos-Alexandru, Ungureanu Andrei-Gabriel and Sghera Bogdan-Constantin
Appl. Sci. 2026, 16(6), 2912; https://doi.org/10.3390/app16062912 - 18 Mar 2026
Viewed by 257
Abstract
Two-handed dynamic gesture recognition represents a fundamental component of sign language interpretation involving the modeling of temporal dependencies and inter-hand coordination. In this task, a major challenge is modeling asymmetric motion patterns, as well as bidirectional and long-range temporal dependencies. Most existing frameworks rely on early fusion strategies that merge joints, keypoints or landmarks from both hands in early processing stages, primarily to reduce model complexity and enforce a unified representation. In this work, a novel dual-stream BiLSTM–Transformer model architecture is proposed for two-handed dynamic sign language recognition, where parallel encoders process the trajectories of each hand independently. To capture spatial and temporal dependencies for each hand, an attention-based cross-hand fusion mechanism is employed, with hand landmarks extracted by the MediaPipe Hands framework as a preprocessing step to enable real-time CPU-based inference. Experimental evaluation conducted on custom Romanian Sign Language dynamic gesture datasets indicates that the proposed dual-stream-based system outperforms single-handed baselines, achieving improvements in high recognition accuracy for asymmetric gestures and consistent performance gains for synchronized two-handed gestures. The proposed architecture represents an efficient and lightweight solution suitable for real-time sign language recognition and interpretation. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
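The attention-based cross-hand fusion mechanism described above can be sketched as scaled dot-product cross-attention, where each frame of one hand's stream attends over the other hand's stream. The real model would use learned query/key/value projections inside the BiLSTM–Transformer stack; this sketch omits those and fuses by simple concatenation, which is an assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_hand_attention(left, right):
    """Scaled dot-product cross-attention: each left-hand frame attends over right-hand frames.

    left, right: (T, D) per-hand embeddings; returns (T, 2*D) fused features.
    """
    d = left.shape[1]
    attn = softmax(left @ right.T / np.sqrt(d), axis=1)  # (T, T) attention weights
    context = attn @ right                               # right-hand context per left frame
    return np.concatenate([left, context], axis=1)       # stream plus cross-hand context
```

Running the same operation with the streams swapped gives the right hand's fused features, so asymmetric gestures keep a dedicated representation per hand while still seeing the other hand's motion.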

15 pages, 667 KB  
Article
Speech-to-Sign Gesture Translation for Kazakh: Dataset and Sign Gesture Translation System
by Akdaulet Mnuarbek, Akbayan Bekarystankyzy, Mussa Turdalyuly, Dina Oralbekova and Alibek Dyussemkhanov
Computers 2026, 15(3), 188; https://doi.org/10.3390/computers15030188 - 15 Mar 2026
Viewed by 451
Abstract
This paper presents the first prototype of a speech-to-sign language translation system for Kazakh Sign Language (KRSL). The proposed pipeline integrates the NVIDIA FastConformer model for automatic speech recognition (ASR) in the Kazakh language and addresses the challenges of sign language translation in a low-resource setting. Unlike American or British Sign Languages, KRSL lacks publicly available datasets and established translation systems. The pipeline follows a multi-stage process: speech input is converted into text via ASR, segmented into phrases, matched with corresponding gestures, and visualized as sign language. System performance is evaluated using word error rate (WER) for ASR and accuracy metrics for speech-to-sign translation. This study also introduces the first KRSL dataset, consisting of 1200 manually recreated signs, including 95% static images and 5% dynamic gesture videos. To improve robustness under resource-constrained conditions, a Weighted Hybrid Similarity Score (WHSS)-based gesture matching method is proposed. Experimental results show that the FastConformer model achieves an average WER of 10.55%, with 7.8% for isolated words and 13.3% for full sentences. At the phrase level, the system achieves 92.1% accuracy for unigrams, 84.6% for bigrams, and 78.3% for trigrams. The complete pipeline reaches 85% accuracy for individual words and 70% for sentences, with an average latency of 310 ms. These results demonstrate the feasibility and effectiveness of the proposed system for supporting people with hearing and speech impairments in Kazakhstan. Full article
(This article belongs to the Special Issue Machine Learning: Innovation, Implementation, and Impact)
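The Weighted Hybrid Similarity Score (WHSS) used for gesture matching is defined in the paper, and its exact formula is not reproduced here. The sketch below is therefore a hypothetical illustration of the general idea, blending a character-level similarity with a token-level one under assumed equal weights, and matching each recognized phrase to the closest gesture label:

```python
from difflib import SequenceMatcher

def whss(query, candidate, w_char=0.5, w_token=0.5):
    """Hypothetical weighted hybrid similarity: character ratio + token-set Jaccard.

    Weights and component measures are illustrative assumptions, not the paper's WHSS.
    """
    char_sim = SequenceMatcher(None, query, candidate).ratio()
    qt, ct = set(query.split()), set(candidate.split())
    token_sim = len(qt & ct) / len(qt | ct) if qt | ct else 1.0
    return w_char * char_sim + w_token * token_sim

def best_gesture(query, gesture_labels):
    """Pick the dictionary gesture whose label maximizes the hybrid score."""
    return max(gesture_labels, key=lambda g: whss(query, g))
```

A hybrid score of this shape tolerates both small ASR spelling errors (via the character term) and word reordering (via the token term), which is the kind of robustness a low-resource speech-to-sign pipeline needs.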

19 pages, 2968 KB  
Article
CBAM-Enhanced CNN-LSTM with Improved DBSCAN for High-Precision Radar-Based Gesture Recognition
by Shiwei Yi, Zhenyu Zhao and Tongning Wu
Sensors 2026, 26(6), 1835; https://doi.org/10.3390/s26061835 - 14 Mar 2026
Viewed by 451
Abstract
In recent years, radar-based gesture recognition technology has been widely applied in industrial and daily life scenarios. However, increasingly complex application scenarios have imposed higher demands on the accuracy and robustness of gesture recognition algorithms, and challenges such as clutter interference, inter-gesture similarity, and spatial–temporal feature ambiguity limit recognition performance. To address these challenges, a novel framework named CECL, which incorporates the Convolutional Block Attention Module (CBAM) into a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) architecture, is proposed for high-accuracy radar-based gesture recognition. The CBAM adaptively highlights discriminative spatial regions and suppresses irrelevant background, and the CNN-LSTM network captures temporal dynamics across gesture sequences. During gesture signal processing, the Blackman window is applied to suppress spectral leakage. Additionally, a combination of wavelet thresholding and dynamic energy nulling is employed to effectively suppress clutter and enhance feature representation. Furthermore, an improved Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm further eliminates isolated sparse noise while preserving dense and valid target signal regions. Experimental results demonstrate that the proposed algorithm achieves 98.33% average accuracy in gesture classification, outperforming other baseline models. It exhibits excellent recognition performance across various distances and angles, demonstrating significantly enhanced robustness. Full article
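The Blackman-window step mentioned in this abstract tapers each radar chirp before the FFT so that an off-bin target tone leaks far less energy into neighboring bins. A minimal single-chirp sketch (radar front-end parameters omitted):

```python
import numpy as np

def windowed_spectrum(chirp):
    """Apply a Blackman taper to one radar chirp before the FFT to suppress spectral leakage."""
    w = np.blackman(len(chirp))
    return np.abs(np.fft.fft(chirp * w))
```

The trade-off is a slightly wider main lobe: the Blackman window buys roughly -58 dB sidelobes (versus about -13 dB for a rectangular window) at the cost of coarser range resolution.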

36 pages, 14443 KB  
Article
Personalized Wrist–Forearm Static Gesture Recognition Using the Vicara Kai Controller and Convolutional Neural Network
by Jacek Szedel
Sensors 2026, 26(5), 1700; https://doi.org/10.3390/s26051700 - 8 Mar 2026
Viewed by 297
Abstract
Predefined, user-independent gesture sets do not account for individual differences in movement patterns and physical limitations. This study presents a personalized wrist–forearm static gesture recognition system for human–computer interaction (HCI) using the Vicara Kai™ wearable controller and a convolutional neural network (CNN). Unlike systems based on fixed, predefined gestures, the proposed approach enables users to define and train their own gesture sets. During gesture recording, users may either select a gesture pattern from a predefined prompt set or create their own natural, unprompted gestures. A dedicated software framework was developed for data acquisition, preprocessing, model training, and real-time recognition. The developed system was evaluated by optimizing the parameters of a lightweight CNN and examining the influence of sequentially applied changes to the input and network pipelines, including resizing the input layer, applying data augmentation, experimenting with different dropout ratios, and varying the number of learning samples. The performance of the resulting network setup was assessed using confusion matrices, accuracy, and precision metrics for both original gestures and gestures smoothed using the cubic Bézier function. The resulting validation accuracy ranged from 0.88 to 0.94, with an average test-set accuracy of 0.92 and macro precision of 0.92. The system’s resilience to rapid or casual gestures was also evaluated using the receiver operating characteristic (ROC) method, achieving an Area Under the Curve (AUC) of 0.97. The results demonstrate that the proposed approach achieves high recognition accuracy, indicating its potential for a range of practical applications. Full article
(This article belongs to the Special Issue Sensor Systems for Gesture Recognition (3rd Edition))
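The cubic Bézier smoothing evaluated above can be generated as follows. How the paper selects the four control points from a recorded gesture is not specified here, so this sketch simply samples a curve from four given points:

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, n=50):
    """Sample a cubic Bézier curve defined by four control points at n parameter values."""
    t = np.linspace(0.0, 1.0, n)[:, None]
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    # Bernstein-basis blend of the control points
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)
```

The curve interpolates the first and last control points and only approximates the middle two, which is what makes it useful for smoothing jittery gesture trajectories without moving their endpoints.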

17 pages, 1701 KB  
Article
CLIP-ArASL: A Lightweight Multimodal Model for Arabic Sign Language Recognition
by Naif Alasmari
Appl. Sci. 2026, 16(5), 2573; https://doi.org/10.3390/app16052573 - 7 Mar 2026
Viewed by 303
Abstract
Arabic sign language (ArASL) is the primary communication medium for Deaf and hard-of-hearing people across Arabic-speaking communities. Most current ArASL recognition systems are based solely on visual features and do not incorporate linguistic or semantic information that could improve generalization and semantic grounding. This paper introduces CLIP-ArASL, a lightweight CLIP-style multimodal approach for static ArASL letter recognition that aligns visual hand gestures with bilingual textual descriptions. The approach integrates an EfficientNet-B0 image encoder with a MiniLM text encoder to learn a shared embedding space using a hybrid objective that combines contrastive and cross-entropy losses. This design supports supervised classification on seen classes and zero-shot prediction on unseen classes using textual class representations. The proposed approach is evaluated on two public datasets, ArASL2018 and ArASL21L. Under supervised evaluation, recognition accuracies of 99.25±0.14% and 91.51±1.29% are achieved, respectively. Zero-shot performance is assessed by withholding 20% of gesture classes during training and predicting them using only their textual descriptions. In this setting, accuracies of 55.2±12.15% on ArASL2018 and 37.6±9.07% on ArASL21L are obtained. These results show that multimodal vision–language alignment supports semantic transfer and enables recognition of unseen classes. Full article
(This article belongs to the Special Issue Machine Learning in Computer Vision and Image Processing)
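The contrastive half of the hybrid objective described above can be sketched as a symmetric InfoNCE loss over a batch of matched image-text embedding pairs, in the CLIP style. The temperature value is a common default, not necessarily the paper's, and the cross-entropy classification term of the hybrid loss is omitted:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image/text embedding pairs."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature            # (N, N) pairwise similarities
    labels = np.arange(len(img))                  # matched pairs lie on the diagonal

    def xent(l):
        # cross-entropy with the diagonal as the target class, numerically stabilized
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (xent(logits) + xent(logits.T))  # image->text and text->image
```

Because unseen classes can still be described in text, minimizing this alignment loss is what enables the zero-shot predictions the abstract reports.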

17 pages, 1732 KB  
Article
Lightweight Visual Dynamic Gesture Recognition System Based on CNN-LSTM-DSA
by Zhenxing Wang, Ziyan Wu, Ruidi Qi and Xuan Dou
Sensors 2026, 26(5), 1558; https://doi.org/10.3390/s26051558 - 2 Mar 2026
Viewed by 424
Abstract
Addressing the challenges of large-scale gesture recognition models, high computational complexity, and inefficient deployment on embedded devices, this study designs and implements a visual dynamic gesture recognition system based on a lightweight CNN-LSTM-DSA model. The system captures user hand images via a camera, extracts the 3D coordinates of 21 hand keypoints using MediaPipe, and employs a lightweight hybrid model to perform spatial and temporal feature modeling on keypoint sequences, achieving high-precision recognition of complex dynamic gestures. In static gesture recognition, the system determines the gesture state through joint angle calculation and a sliding window smoothing algorithm, ensuring smooth mapping of the servo motor angles and stability of the robotic hand’s movements. In dynamic gesture recognition, the system models the key point time series based on the CNN-LSTM-DSA hybrid model, enabling accurate classification and reproduction of gesture actions. Experimental results show that the proposed system demonstrates good robustness under various lighting and background conditions, with a static gesture recognition accuracy of up to 96%, dynamic gesture recognition accuracy of 90.19%, and an overall response delay of less than 300 ms. Full article
(This article belongs to the Section Sensing and Imaging)
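The joint-angle calculation and sliding-window smoothing used for the static-gesture path can be sketched as below. The paper's exact smoothing algorithm and window length are not given, so a simple moving average over an assumed 5-frame window stands in for it:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by points a-b-c, e.g. a finger's middle joint."""
    v1 = np.asarray(a, float) - np.asarray(b, float)
    v2 = np.asarray(c, float) - np.asarray(b, float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def smooth_angles(angles, win=5):
    """Moving average over a sliding window to keep mapped servo angles from jittering."""
    kernel = np.ones(win) / win
    return np.convolve(angles, kernel, mode="same")
```

Each smoothed angle can then be mapped linearly onto a servo's command range, which is how a keypoint stream ends up driving a robotic hand steadily.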

14 pages, 865 KB  
Essay
Utilizing the Walla Emotion Model to Standardize Terminological Clarity for AI-Driven “Emotion” Recognition
by Peter Walla
Brain Sci. 2026, 16(3), 260; https://doi.org/10.3390/brainsci16030260 - 26 Feb 2026
Viewed by 630
Abstract
The scientific study of affect has been historically characterized by a profound lack of terminological consensus, leading to a state of conceptual fragmentation that persists in psychology, neuroscience, and many other fields. This ambiguity is not merely an academic concern; it has significant consequences for the development of artificial intelligence (AI) systems designed to recognize and respond to human “emotions”. In fact, it has an influence on the entire field of affective computing. The problem is obvious: without a distinct definition of “emotion”, it is difficult to train an algorithm to recognize it. The Walla Emotion Model, also known as the ESCAPE (Emotions Convey Affective Processing Effects) model, provides a potentially helpful and neurobiologically grounded framework to resolve this impasse and to improve any discourse about it, for businesses and even lawmakers aiming at healthy societies. By establishing clear, non-overlapping definitions for affective processing, feelings, and emotions, this model offers a path toward more precise research and more ethically sound affective computing, including AI-driven “emotion” recognition. It introduces a concept that allows for the detection of incongruences between internal states and external signals, with a very clear terminology supporting understandable communication. This is critical for identifying feigned or socially masked inner affective states, a challenge that traditional “face-reading” AI models frequently fail to address. Even tone of voice, body postures, and gestures can be, and often are, voluntarily modified. Through the separation of subcortical affective processing (evaluation of valence; neural activity) from subjective experience (feeling) and external communication (emotion), the Walla model provides a helpful framework for AI designs meant to infer an internal affective state from signals collected in the wild, bypassing verbal self-report.
This paper is purely theoretical; it does not provide algorithm models or other concrete suggestions for training software. Its main purpose is the introduction of a new emotion model, particularly a new terminology considered helpful for proceeding with this endeavor. It is considered important to first enable the clearest possible form of communication about anything related to the term emotion across all disciplines dealing with it. Only then can progress be made. Full article
(This article belongs to the Section Cognitive, Social and Affective Neuroscience)
