Search Results (150)

Search Parameters:
Keywords = skeleton-based action recognition

19 pages, 4367 KB  
Article
A Neuro-Symbolic Approach to Fall Detection via Monocular Depth Estimation
by Yinghai Xu, Bongjun Kim, In-Nea Wang and Junho Jeong
Appl. Sci. 2026, 16(4), 1895; https://doi.org/10.3390/app16041895 - 13 Feb 2026
Viewed by 341
Abstract
Falls remain a critical safety concern in surveillance settings, yet monocular RGB methods often degrade in multi-person scenes with occlusion and loss of three-dimensional cues. This study proposes a neuro-symbolic framework that restores physically interpretable depth proxies from monocular video and fuses them with skeleton-based spatio-temporal inference for robust fall detection. The pipeline estimates per-frame depth and 2D skeletons, recovers world coordinates for key joints, and derives absolute neck height and vertical descent rate for rule-based adjudication, while a neural method operates on joint trajectories; final decisions combine both streams with a logical policy and short-horizon temporal consistency. Experiments in a realistic indoor testbed with multi-person activity compare three configurations—neural, symbolic, and fused. The fused neuro-symbolic method achieved an accuracy of 0.88 and an F1 score of 0.76 on the real surveillance test set, outperforming the neural method alone (accuracy 0.81, F1 0.64) and the symbolic method alone (accuracy 0.77, F1 0.35). Gains arise from complementary error profiles: depth-derived, rule-based cues suppress spurious positives on non-fall frames, while the neural stream recovers true falls near rule boundaries. These findings indicate that integrating monocular depth proxies with interpretable rules improves reliability without additional sensors, supporting deployment in complex, multi-person surveillance environments.
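
The rule-based stream and the fusion policy described above can be sketched compactly. The Python snippet below is a minimal, illustrative reading, assuming per-frame neck heights from the depth proxy and per-frame neural fall scores as inputs; the thresholds, the high-confidence override, and the window length are hypothetical choices, not the authors' published parameters.

```python
import numpy as np

def symbolic_fall_evidence(neck_height_m, fps=30.0,
                           height_thresh=0.6, descent_thresh=-1.0):
    """Rule-based per-frame fall evidence from depth-derived cues.

    neck_height_m: (T,) absolute neck height above the floor, in meters.
    Flags frames where the neck is both low and descending quickly.
    """
    descent_rate = np.gradient(neck_height_m) * fps  # m/s; negative = descending
    return (neck_height_m < height_thresh) & (descent_rate < descent_thresh)

def fuse_streams(neural_scores, symbolic_flags,
                 agree_thresh=0.5, override_thresh=0.9,
                 window=15, min_hits=10):
    """Logical fusion with short-horizon temporal consistency: a fall is
    declared when both streams agree, or the neural score alone is very
    confident, on most frames of a sliding window (all values assumed)."""
    frame_fall = ((symbolic_flags & (neural_scores > agree_thresh))
                  | (neural_scores > override_thresh))
    hits = np.convolve(frame_fall.astype(float), np.ones(window), mode="same")
    return hits >= min_hits
```

The AND branch mirrors the reported behavior that rules suppress spurious positives, while the override branch lets confident neural detections recover falls near rule boundaries.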

5 pages, 801 KB  
Proceeding Paper
From Pose to Pitch: Classifying Baseball Pitch Types with Projection-Gated ST-GCN
by Sergio Huesca-Flores, Gibran Benitez-Garcia, Oswaldo Juarez-Sandoval, Hiroki Takahashi, Hector Perez-Meana and Mariko Nakano-Miyatake
Eng. Proc. 2026, 123(1), 3; https://doi.org/10.3390/engproc2026123003 - 29 Jan 2026
Viewed by 308
Abstract
We present a skeleton-based approach to baseball pitch type classification from broadcast video. We leverage Human Pose Estimation and an ST-GCN architecture, improved with a projection-gated temporal downsampler, to learn kinematic signatures of the pitcher’s body, adaptively selecting the most informative frames and enabling pitch type classification without the need for ball tracking. On the MLB-YouTube dataset, our proposed method reaches ~62% six-class accuracy, highlighting body mechanics as a practical biometric cue.
(This article belongs to the Proceedings of First Summer School on Artificial Intelligence in Cybersecurity)
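
The phrase "projection-gated temporal downsampler" suggests scoring frames with a learned projection and keeping only the most informative ones. A minimal PyTorch sketch of that reading follows; the module name, the (N, T, C) input shape, and the top-k selection are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class ProjectionGatedDownsampler(nn.Module):
    """Scores each frame with a learned projection and keeps the top-k
    highest-scoring frames, preserving their temporal order."""
    def __init__(self, channels, keep_frames):
        super().__init__()
        self.score = nn.Linear(channels, 1)
        self.keep_frames = keep_frames

    def forward(self, x):                                  # x: (N, T, C)
        gate = torch.sigmoid(self.score(x)).squeeze(-1)    # (N, T)
        idx = gate.topk(self.keep_frames, dim=1).indices.sort(dim=1).values
        kept = x.gather(1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        # scale by the gate so frame selection remains differentiable
        return kept * gate.gather(1, idx).unsqueeze(-1)    # (N, k, C)
```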

21 pages, 1491 KB  
Article
DPCA-GCN: Dual-Path Cross-Attention Graph Convolutional Networks for Skeleton-Based Action Recognition
by Khadija Lasri, Khalid El Fazazy, Adnane Mohamed Mahraz, Hamid Tairi and Jamal Riffi
Computation 2025, 13(12), 293; https://doi.org/10.3390/computation13120293 - 15 Dec 2025
Viewed by 745
Abstract
Skeleton-based action recognition has achieved remarkable advances with graph convolutional networks (GCNs). However, most existing models process spatial and temporal information within a single coupled stream, which often obscures the distinct patterns of joint configuration and motion dynamics. This paper introduces the Dual-Path Cross-Attention Graph Convolutional Network (DPCA-GCN), an architecture that explicitly separates spatial and temporal modeling into two specialized pathways while maintaining rich bidirectional interaction between them. The spatial branch integrates graph convolution and spatial transformers to capture intra-frame joint relationships, whereas the temporal branch combines temporal convolution and temporal transformers to model inter-frame dependencies. A bidirectional cross-attention mechanism facilitates explicit information exchange between both paths, and an adaptive gating module balances their respective contributions according to the action context. Unlike traditional approaches that process spatial–temporal information sequentially, our dual-path design enables specialized processing while maintaining cross-modal coherence through memory-efficient chunked attention mechanisms. Extensive experiments on the NTU RGB+D 60 and NTU RGB+D 120 datasets demonstrate that DPCA-GCN achieves competitive joint-only accuracies of 88.72%/94.31% and 82.85%/83.65%, respectively, with exceptional top-5 scores of 96.97%/99.14% and 95.59%/95.96%, while maintaining significantly lower computational complexity compared to multi-modal approaches.
(This article belongs to the Section Computational Engineering)
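
A minimal sketch of the bidirectional cross-attention and adaptive gating described above, assuming both pathways emit sequences of equal length and that the gate pools over the sequence dimension; the memory-efficient chunked attention from the paper is not reproduced here.

```python
import torch
import torch.nn as nn

class BidirectionalCrossAttention(nn.Module):
    """Each pathway attends to the other; a learned sigmoid gate balances
    their contributions. Both inputs are assumed to share shape (N, L, dim)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.t2s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.s2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, spatial, temporal):
        s_upd, _ = self.t2s(spatial, temporal, temporal)  # spatial queries temporal
        t_upd, _ = self.s2t(temporal, spatial, spatial)   # temporal queries spatial
        g = self.gate(torch.cat([s_upd.mean(1), t_upd.mean(1)], dim=-1))
        g = g.unsqueeze(1)                                # (N, 1, dim)
        return g * s_upd + (1 - g) * t_upd                # context-adaptive blend
```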

23 pages, 2403 KB  
Article
LI-AGCN: A Lightweight Initialization-Enhanced Adaptive Graph Convolutional Network for Effective Skeleton-Based Action Recognition
by Qingsheng Xie and Hongmin Deng
Sensors 2025, 25(23), 7282; https://doi.org/10.3390/s25237282 - 29 Nov 2025
Viewed by 823
Abstract
The graph convolutional network (GCN) has become a mainstream technology in skeleton-based action recognition since it was first applied to this field. However, previous studies often overlooked the pivotal role of heuristic model initialization in the extraction of spatial features, impeding the model from achieving its optimal performance. To address this issue, a lightweight initialization-enhanced adaptive graph convolutional network (LI-AGCN) is proposed, which effectively captures spatiotemporal features while maintaining low computational complexity. LI-AGCN employs three coordinate-based input branches (CIB) to dynamically adjust graph structures, which facilitates the extraction of informative spatial features. In addition, the model incorporates a lightweight, multi-scale temporal module to extract temporal features, and employs an attention module that considers the temporal, spatial, and channel dimensions simultaneously to enhance key features. Finally, the performance of our proposed model is evaluated on three large-scale public datasets: NTU RGB+D, NTU RGB+D 120, and UAV-Human. The experimental results demonstrate that LI-AGCN achieves excellent overall performance on these datasets, especially obtaining 90.03% accuracy on the cross-subject benchmark of the NTU RGB+D dataset with only 0.18 million parameters, showcasing the effectiveness of the model.
(This article belongs to the Special Issue Computer Vision Sensing and Pattern Recognition)
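
The adaptive graph structure can be illustrated with a standard pattern from the GCN literature: a fixed skeleton adjacency plus a learnable residual topology. This is a generic sketch, not LI-AGCN's exact coordinate-based branch design.

```python
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    """Graph convolution over a fixed skeleton adjacency plus a learnable
    residual topology that the model adjusts during training."""
    def __init__(self, in_channels, out_channels, A):      # A: (V, V), normalized
        super().__init__()
        self.register_buffer("A", A)
        self.B = nn.Parameter(torch.zeros_like(A))         # learned refinement
        self.theta = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):                                  # x: (N, C, T, V)
        x = torch.einsum("nctv,vw->nctw", x, self.A + self.B)
        return self.theta(x)
```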

14 pages, 2635 KB  
Article
Clustered Federated Spatio-Temporal Graph Attention Networks for Skeleton-Based Action Recognition
by Tao Yu, Sandro Pinto, Tiago Gomes, Adriano Tavares and Hao Xu
Sensors 2025, 25(23), 7277; https://doi.org/10.3390/s25237277 - 29 Nov 2025
Cited by 1 | Viewed by 932
Abstract
Federated learning (FL) for skeleton-based action recognition remains underexplored, particularly under strong client heterogeneity, where regular FedAvg tends to cause client drift and unstable convergence. We introduce Clustered Federated Spatio-Temporal Graph Attention Networks (CF-STGAT), a clustered FL framework that leverages attention-derived spatio-temporal statistics from local STGAT models to dynamically group clients and performs attention-weighted inter-cluster fusion that gently aligns cluster models. Concretely, the server periodically extracts multi-head parameter-based attention descriptors, normalizes and projects them via PCA, and applies K-means to form clusters; a global reference is then computed by attention-similarity weighting and used to regularize each cluster model with a lightweight fusion step. On NTU RGB+D 60/120 (NTU 60/120), CF-STGAT consistently outperforms strong FL baselines with the STGAT backbone, yielding absolute top-1 gains of +0.84/+4.09 (NTU 60, X-Sub/X-Setup) and +7.98/+4.18 (NTU 120, X-Sub/X-Setup) over FedAvg, alongside smoother per-client trajectories and lower terminal test loss. Ablations indicate that attention-guided clustering and inter-cluster fusion are complementary: clustering reduces within-group variance, whereas fusion limits cross-cluster divergence. The approach keeps local training unchanged and adds only server-side statistics and clustering.
(This article belongs to the Special Issue Computer Vision-Based Human Activity Recognition)
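
The server-side procedure maps closely onto off-the-shelf tools. The sketch below follows the steps named in the abstract (normalize, PCA, K-means, similarity weighting) using scikit-learn; descriptor shapes, the cluster count, and the cosine-softmax weighting are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def cluster_clients(descriptors, n_clusters=3, n_components=8):
    """descriptors: (num_clients, D) attention-derived statistics per client.
    Normalize, project via PCA, then group with K-means."""
    z = StandardScaler().fit_transform(descriptors)
    z = PCA(n_components=min(n_components, *z.shape)).fit_transform(z)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(z)

def similarity_weights(cluster_descriptors):
    """Softmax over cosine similarity to the mean descriptor; the resulting
    weights combine cluster models into the global reference that is then
    blended back into each cluster model as a light regularizer."""
    center = cluster_descriptors.mean(axis=0)
    sims = cluster_descriptors @ center / (
        np.linalg.norm(cluster_descriptors, axis=1) * np.linalg.norm(center) + 1e-8)
    return np.exp(sims) / np.exp(sims).sum()
```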

47 pages, 150968 KB  
Article
Adaptive Refined Graph Convolutional Action Recognition Network with Enhanced Features for UAV Ground Crew Marshalling
by Qing Zhou, Liheng Dong, Zhaoxiang Zhang, Yuelei Xu, Feng Xiao and Yingxia Wang
Drones 2025, 9(12), 819; https://doi.org/10.3390/drones9120819 - 26 Nov 2025
Viewed by 566
Abstract
For unmanned aerial vehicle (UAV) ground crew marshalling tasks, the accuracy of skeleton-based action recognition is often limited by the high similarity of motion patterns across action categories as well as variations in individual performance. To address this issue, we propose an adaptive refined graph convolutional network with enhanced features for action recognition. First, a multi-order and motion feature modeling module is constructed, which integrates joint positions, skeletal structures, and angular encodings for multi-granularity representation. Static-domain and dynamic-domain features are then fused to enhance the diversity and expressiveness of the input representations. Second, a data-driven adaptive graph convolution module is designed, where inter-joint interactions are dynamically modeled through a learnable topology. Furthermore, an adaptive refinement feature activation mechanism is introduced to optimize information flow between nodes, enabling fine-grained modeling of skeletal spatial information. Finally, a frame-index semantic temporal modeling module is incorporated, where joint-type semantics and frame-index semantics are introduced in the spatial and temporal dimensions, respectively, to capture the temporal evolution of actions and comprehensively exploit spatio-temporal semantic correlations. The proposed method achieves accuracies of 89.4% and 94.2% under the X-Sub and X-View settings of the NTU-RGB+D 60 benchmark, respectively, and 81.7% and 83.3% under the corresponding settings of NTU-RGB+D 120. On the self-constructed UAV Airfield Ground Crew Dataset, the proposed method attains accuracies of 90.71% and 96.09% under X-Sub and HO settings, respectively. Environmental robustness experiments demonstrate that under complex environmental conditions including illumination variations, haze, rain, shadows, and occlusions, the adoption of the Test + Train strategy reduces the maximum performance degradation from 3.1 percentage points to within 1 percentage point. Real-time performance testing shows that the system achieves an end-to-end inference latency of 24.5 ms (40.8 FPS) on the edge device NVIDIA Jetson Xavier NX, meeting real-time processing requirements and validating the efficiency and practicality of the proposed method on edge computing platforms.
(This article belongs to the Section Artificial Intelligence in Drones (AID))
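
The multi-order inputs (joint positions, skeletal structures, motion) can be derived from raw coordinates in a few lines; the angular encodings mentioned above are omitted here. A sketch under assumed tensor shapes:

```python
import torch

def multi_order_features(joints, parents):
    """joints: (N, C, T, V) coordinate sequence; parents: length-V list of
    parent-joint indices (a root may point to itself, giving a zero bone).
    Returns the joint, bone, and motion inputs for separate branches."""
    bones = joints - joints[..., parents]                  # skeletal structure
    motion = torch.zeros_like(joints)
    motion[..., 1:, :] = joints[..., 1:, :] - joints[..., :-1, :]  # frame diffs
    return joints, bones, motion
```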

24 pages, 2879 KB  
Article
Skeleton-Based Real-Time Hand Gesture Recognition Using Data Fusion and Ensemble Multi-Stream CNN Architecture
by Maki K. Habib, Oluwaleke Yusuf and Mohamed Moustafa
Technologies 2025, 13(11), 484; https://doi.org/10.3390/technologies13110484 - 26 Oct 2025
Viewed by 1847
Abstract
Hand Gesture Recognition (HGR) is a vital technology that enables intuitive human–computer interaction in various domains, including augmented reality, smart environments, and assistive systems. Achieving both high accuracy and real-time performance remains challenging due to the complexity of hand dynamics, individual morphological variations, and computational limitations. This paper presents a lightweight and efficient skeleton-based HGR framework that addresses these challenges through an optimized multi-stream Convolutional Neural Network (CNN) architecture and a trainable ensemble tuner. Dynamic 3D gestures are transformed into structured, noise-minimized 2D spatiotemporal representations via enhanced data-level fusion, supporting robust classification across diverse spatial perspectives. The ensemble tuner strengthens semantic relationships between streams and improves recognition accuracy. Unlike existing solutions that rely on high-end hardware, the proposed framework achieves real-time inference on consumer-grade devices without compromising accuracy. Experimental validation across five benchmark datasets (SHREC2017, DHG1428, FPHA, LMDHG, and CNR) confirms consistent or superior performance with reduced computational overhead. Additional validation on the SBU Kinect Interaction Dataset highlights generalization potential for broader Human Action Recognition (HAR) tasks. This advancement bridges the gap between efficiency and accuracy, supporting scalable deployment in AR/VR, mobile computing, interactive gaming, and resource-constrained environments.
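
Data-level fusion of skeletons into 2D spatiotemporal representations is often realized by rendering the sequence as a pseudo-image. The sketch below shows one common variant of that idea; it is illustrative only, not the paper's exact noise-minimized encoding.

```python
import numpy as np

def skeleton_to_image(seq):
    """Encode a (T, V, 3) joint sequence as a (V, T, 3) pseudo-image:
    rows = joints, columns = frames, channels = x/y/z, min-max normalized
    to 8-bit so an ordinary 2D CNN stream can consume it."""
    lo = seq.min(axis=(0, 1), keepdims=True)
    hi = seq.max(axis=(0, 1), keepdims=True)
    img = (seq - lo) / (hi - lo + 1e-8)
    return (img.transpose(1, 0, 2) * 255).astype(np.uint8)
```

Multiple such images, rendered from different spatial perspectives of the same gesture, would then feed the parallel CNN streams.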

22 pages, 1087 KB  
Article
Modeling the Internal and Contextual Attention for Self-Supervised Skeleton-Based Action Recognition
by Wentian Xin, Yue Teng, Jikang Zhang, Yi Liu, Ruyi Liu, Yuzhi Hu and Qiguang Miao
Sensors 2025, 25(21), 6532; https://doi.org/10.3390/s25216532 - 23 Oct 2025
Viewed by 1245
Abstract
Multimodal contrastive learning has achieved significant performance advantages in self-supervised skeleton-based action recognition. Previous methods are limited by modality imbalance, which reduces alignment accuracy and makes it difficult to combine important spatial–temporal frequency patterns, leading to confusion between modalities and weaker feature representations. To overcome these problems, we explore intra-modality feature-wise self-similarity and inter-modality instance-wise cross-consistency, and discover two inherent correlations that benefit recognition: (i) Global Perspective expresses how action semantics carry a broad and high-level understanding, which supports the use of globally discriminative feature representations. (ii) Focus Adaptation refers to the role of the frequency spectrum in guiding attention toward key joints by emphasizing compact and salient signal patterns. Building upon these insights, we propose MICA, a novel language–skeleton contrastive learning framework comprising two key components: (a) Feature Modulation, which constructs a skeleton–language action conceptual domain to minimize the expected information gain between vision and language modalities. (b) Frequency Feature Learning, which introduces a Frequency-domain Spatial–Temporal block (FreST) that focuses on sparse key human joints in the frequency domain with compact signal energy. Extensive experiments demonstrate that our method achieves remarkable action recognition performance on widely used benchmark datasets, including NTU RGB+D 60 and NTU RGB+D 120. Especially on the challenging PKU-MMD dataset, MICA achieves at least a 4.6% improvement over classical methods such as CrosSCLR and AimCLR, effectively demonstrating its ability to capture internal and contextual attention information.
(This article belongs to the Special Issue Deep Learning for Perception and Recognition: Method and Applications)
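
The frequency-domain idea behind FreST can be illustrated with a simple learnable spectral filter over the time axis. This is only a loose sketch of the concept; FreST itself is more involved, and the per-bin weighting here is an assumption.

```python
import torch
import torch.nn as nn

class FrequencyTemporalFilter(nn.Module):
    """FFT over the time axis, a learned per-bin weighting that can emphasize
    compact, high-energy patterns, then an inverse FFT back to the time domain.
    num_bins must equal T // 2 + 1 for input length T."""
    def __init__(self, num_bins):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_bins))

    def forward(self, x):                                  # x: (N, C, T, V)
        spec = torch.fft.rfft(x, dim=2)                    # complex spectrum
        spec = spec * self.weight.view(1, 1, -1, 1)        # reweight frequency bins
        return torch.fft.irfft(spec, n=x.size(2), dim=2)
```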

23 pages, 27054 KB  
Article
ActionMamba: Action Spatial–Temporal Aggregation Network Based on Mamba and GCN for Skeleton-Based Action Recognition
by Jinglong Wen, Dan Liu and Bin Zheng
Electronics 2025, 14(18), 3610; https://doi.org/10.3390/electronics14183610 - 11 Sep 2025
Cited by 1 | Viewed by 2204
Abstract
Skeleton-based action recognition networks have widely adopted Graph Convolutional Networks (GCN) due to their superior capabilities in modeling data topology, but several key issues still require further investigation. Firstly, the graph convolutional network extracts action features by applying temporal convolution to each key point, which causes the model to ignore the temporal connections between different key points. Secondly, the local receptive field of graph convolutional networks limits their ability to capture correlations between non-adjacent joints. Motivated by the State Space Model (SSM), we propose an Action Spatio-Temporal Aggregation Network, named ActionMamba. Specifically, we introduce a novel embedding module called the Action Characteristic Encoder (ACE), which enhances the coupling of temporal and spatial information in skeletal features by combining intrinsic spatio-temporal encoding with extrinsic space encoding. Additionally, we design an Action Perception Model (APM) based on Mamba and GCN. By effectively combining the excellent feature processing capabilities of GCN with the outstanding global information modeling capabilities of Mamba, APM is able to comprehend the hidden features between different joints and selectively filter information from various joints. Extensive experimental results demonstrate that ActionMamba achieves highly competitive performance on three challenging benchmark datasets: NTU-RGB+D 60, NTU-RGB+D 120, and UAV–Human.
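
Mamba's selective state-space machinery is too involved to reproduce here, but the core contrast with temporal convolution, global rather than windowed temporal context, can be shown with a toy diagonal state-space scan. Everything below is an illustrative stand-in, not the ActionMamba block.

```python
import torch
import torch.nn as nn

class TinySSM(nn.Module):
    """A diagonal state-space recurrence: each channel carries an exponentially
    decaying memory over all past frames, so every output step depends on the
    whole sequence rather than a local temporal window."""
    def __init__(self, channels):
        super().__init__()
        self.log_a = nn.Parameter(torch.full((channels,), -0.5))
        self.b = nn.Parameter(torch.ones(channels))

    def forward(self, x):                                  # x: (N, T, C)
        a = torch.exp(self.log_a).clamp(max=0.999)         # per-channel decay
        h = x.new_zeros(x.size(0), x.size(2))
        outs = []
        for t in range(x.size(1)):                         # plain scan, for clarity
            h = a * h + self.b * x[:, t]
            outs.append(h)
        return torch.stack(outs, dim=1)
```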

18 pages, 1212 KB  
Article
Part-Wise Graph Fourier Learning for Skeleton-Based Continuous Sign Language Recognition
by Dong Wei, Hongxiang Hu and Gang-Feng Ma
J. Imaging 2025, 11(8), 286; https://doi.org/10.3390/jimaging11080286 - 21 Aug 2025
Cited by 1 | Viewed by 1565
Abstract
Sign language is a visual language articulated through body movements. Existing approaches predominantly leverage RGB inputs, incurring substantial computational overhead and remaining susceptible to interference from foreground and background noise. A second fundamental challenge lies in accurately modeling the nonlinear temporal dynamics and inherent asynchrony across body parts that characterize sign language sequences. To address these challenges, we propose a novel part-wise graph Fourier learning method for skeleton-based continuous sign language recognition (PGF-SLR), which uniformly models the spatiotemporal relations of multiple body parts in a globally ordered yet locally unordered manner. Specifically, different parts within different time steps are treated as nodes, while the frequency domain attention between parts is treated as edges to construct a part-level Fourier fully connected graph. This enables the graph Fourier learning module to jointly capture spatiotemporal dependencies in the frequency domain, while our adaptive frequency enhancement method further amplifies discriminative action features in a lightweight and robust fashion. Finally, a dual-branch action learning module featuring an auxiliary action prediction branch to assist the recognition branch is designed to enhance the understanding of sign language. Our experimental results show that the proposed PGF-SLR achieved relative improvements of 3.31%/3.70% and 2.81%/7.33% compared to SOTA methods on the dev/test sets of the PHOENIX14 and PHOENIX14-T datasets. It also demonstrated highly competitive recognition performance on the CSL-Daily dataset, showcasing strong generalization while reducing computational costs in both offline and online settings.
(This article belongs to the Special Issue Advances in Machine Learning for Computer Vision Applications)
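
The underlying tool, the graph Fourier transform, is classical: project node signals onto the eigenbasis of the graph Laplacian. A minimal NumPy sketch follows; the paper's part-level graph construction and attention-derived edge weights are not reproduced.

```python
import numpy as np

def graph_fourier(signal, adjacency):
    """Project node features onto the eigenbasis of the graph Laplacian.
    signal: (V, C) features per node; adjacency: (V, V) symmetric weights.
    Columns of eigvecs are the Fourier modes; eigvecs @ spectrum inverts."""
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    _, eigvecs = np.linalg.eigh(laplacian)
    return eigvecs.T @ signal, eigvecs
```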

18 pages, 3407 KB  
Article
Graph Convolutional Network with Multi-View Topology for Lightweight Skeleton-Based Action Recognition
by Liangliang Wang, Xu Zhang and Chuang Zhang
Symmetry 2025, 17(8), 1235; https://doi.org/10.3390/sym17081235 - 4 Aug 2025
Cited by 1 | Viewed by 2648
Abstract
Skeleton-based action recognition is an important subject in deep learning. Graph Convolutional Networks (GCNs) have demonstrated strong performance by modeling the human skeleton as a natural topological graph, representing the connections between joints. However, most existing methods rely on non-adaptive topologies or insufficiently expressive representations. To address these limitations, we propose a Multi-view Topology Refinement Graph Convolutional Network (MTR-GCN), which is efficient, lightweight, and delivers high performance. Specifically: (1) We propose a new spatial topology modeling approach that incorporates two views. A dynamic view fuses joint information from dual streams in a pairwise manner, while a static view encodes the shortest static paths between joints, preserving the original connectivity relationships. (2) We propose a new MultiScale Temporal Convolutional Network (MSTC), which is efficient and lightweight. (3) Furthermore, we introduce a new temporal topology strategy by modeling temporal frames as a graph, which strengthens the extraction of temporal features. By modeling the human skeleton as both a spatial and a temporal graph, we reveal a topological symmetry between space and time within the unified spatio-temporal framework. The proposed model achieves state-of-the-art performance on several benchmark datasets, including NTU RGB+D (XSub: 92.8%, XView: 96.8%), NTU RGB+D 120 (XSub: 89.6%, XSet: 90.8%), and NW-UCLA (95.7%), demonstrating the effectiveness of our GCN module, TCN module, and overall architecture.
(This article belongs to the Section Computer)
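
One plausible reading of the static view's "shortest static paths between joints" is the hop-distance matrix of the skeleton graph, computable with Floyd-Warshall. A small NumPy sketch, with edges given as index pairs:

```python
import numpy as np

def hop_distance_matrix(edges, num_joints):
    """All-pairs shortest hop counts on the skeleton graph (Floyd-Warshall);
    disconnected pairs stay at inf. edges: iterable of (i, j) joint pairs."""
    d = np.full((num_joints, num_joints), np.inf)
    np.fill_diagonal(d, 0.0)
    for i, j in edges:
        d[i, j] = d[j, i] = 1.0
    for k in range(num_joints):
        d = np.minimum(d, d[:, k:k + 1] + d[k:k + 1, :])
    return d
```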

73 pages, 2833 KB  
Article
A Comprehensive Methodological Survey of Human Activity Recognition Across Diverse Data Modalities
by Jungpil Shin, Najmul Hassan, Abu Saleh Musa Miah and Satoshi Nishimura
Sensors 2025, 25(13), 4028; https://doi.org/10.3390/s25134028 - 27 Jun 2025
Cited by 12 | Viewed by 6882
Abstract
Human Activity Recognition (HAR) systems aim to understand human behavior and assign a label to each action, attracting significant attention in computer vision due to their wide range of applications. HAR can leverage various data modalities, such as RGB images and video, skeleton, depth, infrared, point cloud, event stream, audio, acceleration, and radar signals. Each modality provides unique and complementary information suited to different application scenarios. Consequently, numerous studies have investigated diverse approaches for HAR using these modalities. This survey includes only peer-reviewed research papers published in English to ensure linguistic consistency and academic integrity. This paper presents a comprehensive survey of the latest advancements in HAR from 2014 to 2025, focusing on Machine Learning (ML) and Deep Learning (DL) approaches categorized by input data modalities. We review both single-modality and multi-modality techniques, highlighting fusion-based and co-learning frameworks. Additionally, we cover advancements in hand-crafted action features, methods for recognizing human–object interactions, and activity detection. Our survey includes a detailed dataset description for each modality, as well as a summary of the latest HAR systems, accompanied by a mathematical derivation for evaluating the deep learning model for each modality, and it also provides comparative results on benchmark datasets. Finally, we provide insightful observations and propose effective future research directions in HAR.
(This article belongs to the Special Issue Computer Vision and Sensors-Based Application for Intelligent Systems)

15 pages, 1463 KB  
Article
Spatial–Temporal Heatmap Masked Autoencoder for Skeleton-Based Action Recognition
by Cunling Bian, Yang Yang, Tao Wang and Weigang Lu
Sensors 2025, 25(10), 3146; https://doi.org/10.3390/s25103146 - 16 May 2025
Cited by 3 | Viewed by 2789
Abstract
Skeleton representation learning offers substantial advantages for action recognition by encoding intricate motion details and spatial–temporal dependencies among joints. However, fully supervised approaches necessitate large amounts of annotated data, which are often labor-intensive and costly to acquire. In this work, we propose the Spatial–Temporal Heatmap Masked Autoencoder (STH-MAE), a novel self-supervised framework tailored for skeleton-based action recognition. Unlike coordinate-based methods, STH-MAE adopts heatmap volumes as its primary representation, mitigating noise inherent in pose estimation while capitalizing on advances in Vision Transformers. The framework constructs a spatial–temporal heatmap (STH) by aggregating 2D joint heatmaps across both spatial and temporal axes. This STH is partitioned into non-overlapping patches to facilitate local feature learning, with a masking strategy applied to randomly conceal portions of the input. During pre-training, a Vision Transformer-based autoencoder equipped with a lightweight prediction head reconstructs the masked regions, fostering the extraction of robust and transferable skeletal representations. Comprehensive experiments on the NTU RGB+D 60 and NTU RGB+D 120 benchmarks demonstrate the superiority of STH-MAE, achieving state-of-the-art performance under multiple evaluation protocols.
(This article belongs to the Section Intelligent Sensors)
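
Two of the ingredients, rendering joints as Gaussian heatmaps and randomly masking patches, are easy to sketch. The snippet below is illustrative; the heatmap resolution, sigma, and the 75% mask ratio are common defaults assumed here, not necessarily STH-MAE's settings.

```python
import torch

def joint_heatmaps(joints, height=56, width=56, sigma=2.0):
    """Render 2D joints as Gaussian heatmaps. joints: a (T, V, 2) tensor of
    pixel coordinates; returns a (T, V, height, width) heatmap volume."""
    ys = torch.arange(height, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(width, dtype=torch.float32).view(1, -1)
    T, V, _ = joints.shape
    maps = torch.zeros(T, V, height, width)
    for t in range(T):
        for v in range(V):
            x, y = joints[t, v]
            maps[t, v] = torch.exp(-((xs - x) ** 2 + (ys - y) ** 2)
                                   / (2 * sigma ** 2))
    return maps

def random_patch_mask(num_patches, mask_ratio=0.75):
    """Boolean mask hiding a random subset of patches, MAE-style."""
    n_mask = int(num_patches * mask_ratio)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[torch.randperm(num_patches)[:n_mask]] = True
    return mask
```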

24 pages, 8541 KB  
Article
Feature Fusion Graph Consecutive-Attention Network for Skeleton-Based Tennis Action Recognition
by Pawel Powroznik, Maria Skublewska-Paszkowska, Krzysztof Dziedzic and Marcin Barszcz
Appl. Sci. 2025, 15(10), 5320; https://doi.org/10.3390/app15105320 - 9 May 2025
Cited by 1 | Viewed by 2000
Abstract
Human action recognition has become a key direction in computer vision. Deep learning models, particularly when combined with sensor data fusion, can significantly enhance various applications by learning complex patterns and relationships from diverse data streams. Thus, this study proposes a new model, the Feature Fusion Graph Consecutive-Attention Network (FFGCAN), in order to enhance performance in the classification of the main tennis strokes: forehand, backhand, volley forehand, and volley backhand. The proposed network incorporates seven basic blocks that are combined with two types of module: an Adaptive Consecutive Attention Module and a Graph Self-Attention Module. They are employed to extract joint information at different scales from the motion capture data. By focusing on relevant components, the model enriches the network's comprehension of the tennis motion data and yields a more expressive representation. Moreover, the FFGCAN utilizes a fusion of motion capture data that generates a channel-specific topology map for each output channel, reflecting how joints are connected when the tennis player is moving. The proposed solution was verified on three well-known motion capture datasets, THETIS, Tennis-Mocap, and 3DTennisDS, each containing tennis movements in various formats. A series of experiments were performed, with data divided into training (70%), validation (15%), and test (15%) subsets; testing used five trials. The FFGCAN model obtained very high accuracy, precision, recall, and F1-score, outperforming networks commonly applied to action recognition, such as the Spatial-Temporal Graph Convolutional Network or its modifications. The proposed model demonstrated excellent tennis movement prediction ability.

16 pages, 747 KB  
Article
Dynamic Graph Attention Network for Skeleton-Based Action Recognition
by Zhenhua Li, Fanjia Li and Gang Hua
Appl. Sci. 2025, 15(9), 4929; https://doi.org/10.3390/app15094929 - 29 Apr 2025
Cited by 3 | Viewed by 3633
Abstract
Skeleton-based human action recognition has garnered significant attention for its robustness to background noise and illumination variations. However, existing methods relying on Graph Convolutional Networks (GCNs) and Transformers exhibit inherent limitations: GCNs struggle to model interactions between non-adjacent joints due to predefined skeletal topology, while Transformers accumulate noise through unrestricted global dependency modeling. To address these challenges, we propose a Dynamic Graph Attention Network (DGAN) that dynamically integrates local structural features and global spatiotemporal dependencies. DGAN employs a masked attention mechanism to adaptively adjust node connectivity, forming a dynamic adjacency matrix that extends beyond physical skeletal constraints by selectively incorporating highly correlated joints. Additionally, a node-partition bias strategy is introduced to prioritize attention on collaboratively moving body parts, thereby enhancing discriminative feature extraction. Extensive experiments on the NTU RGB+D 60 and NTU RGB+D 120 datasets validate the effectiveness of DGAN, which outperforms state-of-the-art methods by achieving a balance between local topology preservation and global interaction modeling. Our approach provides a robust framework for skeleton-driven action recognition, demonstrating superior generalization across diverse scenarios.
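
The masked attention that forms the dynamic adjacency can be sketched as attention scores restricted to skeleton edges plus the most correlated extra joints. The module below is an illustrative reading; k, the shapes, and the top-k rule are assumptions, and the node-partition bias is omitted.

```python
import torch
import torch.nn as nn

class MaskedJointAttention(nn.Module):
    """Attention over joints masked to skeleton edges plus the top-k most
    correlated non-adjacent pairs, producing a dynamic adjacency matrix."""
    def __init__(self, dim, A_skeleton, k=4):              # A_skeleton: (V, V)
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.kproj = nn.Linear(dim, dim)
        self.register_buffer("A", A_skeleton.bool())
        self.k = k

    def forward(self, x):                                  # x: (N, V, dim)
        scores = self.q(x) @ self.kproj(x).transpose(1, 2) / x.size(-1) ** 0.5
        topk = torch.zeros_like(scores, dtype=torch.bool)
        topk.scatter_(2, scores.topk(self.k, dim=2).indices, True)
        keep = self.A.unsqueeze(0) | topk                  # edges plus learned links
        scores = scores.masked_fill(~keep, float("-inf"))
        return torch.softmax(scores, dim=2)                # dynamic adjacency (N, V, V)
```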
