Deep Learning Applications for Pose Estimation and Human Action Recognition—2nd Edition

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Intelligent Sensors".

Deadline for manuscript submissions: 15 September 2025 | Viewed by 4296

Special Issue Editors


Guest Editor
Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, 00185 Rome, Italy
Interests: deep learning; machine learning; computer vision; depth estimation; attitude and pose estimation

Guest Editor
Department of Civil and Environmental Engineering, University of Florence, Via S. Marta 3, 50139 Florence, Italy
Interests: navigation and positioning; attitude and pose estimation; 3D modeling; geomatics; sensors; deep learning; computer vision; climate change; cultural heritage preservation; remote sensing

Special Issue Information

Dear Colleagues,

In the last decade, deep learning has drawn significant attention thanks to its robustness and its strong generalization and learning capabilities. It has been tested and successfully deployed across a wide range of real-world tasks with the aim of improving their performance. Among these, pose estimation and human action recognition have benefited from the exceptional results achieved in the deep learning field, although they still show wide margins for improvement.

This Special Issue aims to gather a significant collection of original contributions on these topics. Accurate vehicle and human pose estimation is crucial for several applications, e.g., animal behavior research, gaming and virtual reality, medicine and biotechnology, pedestrian, aerial, and maritime navigation, robotics, and human motion tracking. Furthermore, effective human pose and action recognition makes an important contribution in many fields, such as physical therapy diagnosis and patient rehabilitation, security and surveillance, and the development of employee-free stores.

The relevant topics of this issue include, but are not limited to, the following:

  • Single- and multi-human pose estimation, action recognition, and tracking;
  • Terrestrial, maritime, and aerial robot pose estimation and tracking;
  • Literature reviews and surveys;
  • Datasets and sensors;
  • Interesting applications and ideas focusing on surveillance, autonomous navigation, human–robot interaction, healthcare, sports, etc.

Dr. Paolo Russo
Dr. Fabiana Di Ciaccio
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • deep learning
  • action recognition
  • pose estimation
  • human activities
  • robotics and intelligent systems
  • navigation
  • positioning
  • control
  • datasets
  • sensors
  • embedded systems and devices

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (6 papers)


Research

37 pages, 2036 KiB  
Article
GCN-Transformer: Graph Convolutional Network and Transformer for Multi-Person Pose Forecasting Using Sensor-Based Motion Data
by Romeo Šajina, Goran Oreški and Marina Ivašić-Kos
Sensors 2025, 25(10), 3136; https://doi.org/10.3390/s25103136 - 15 May 2025
Viewed by 206
Abstract
Multi-person pose forecasting predicts the future body poses of multiple individuals over time, a task that couples complex movement dynamics with interaction dependencies. Its relevance spans computer vision, robotics, human–computer interaction, and surveillance, and it is particularly important in sensor-driven applications, where motion capture systems, including vision-based sensors and IMUs, provide crucial data for analyzing human movement. This paper introduces GCN-Transformer, a novel model for multi-person pose forecasting that integrates Graph Convolutional Network and Transformer architectures. We integrated novel loss terms during training to enable the model to learn both interaction dependencies and the trajectories of multiple joints simultaneously. Additionally, we propose a novel evaluation metric, the Final Joint Position and Trajectory Error (FJPTE), which assesses both local movement dynamics and global movement errors by considering the final position together with the trajectory leading up to it, providing a more comprehensive assessment of movement dynamics. Our model uniquely combines scene-level graph-based encoding with personalized attention-based decoding. It is trained and evaluated on the CMU-Mocap, MuPoTS-3D, SoMoF Benchmark, and ExPI datasets, all collected using sensor-based motion capture systems, ensuring applicability in real-world scenarios, and it consistently outperforms existing state-of-the-art (SOTA) models according to the VIM and MPJPE metrics. Specifically, on the MPJPE metric, GCN-Transformer improves over the closest SOTA model by 4.7% on CMU-Mocap, 4.3% on MuPoTS-3D, 5% on the SoMoF Benchmark, and 2.6% on ExPI. Unlike models whose performance fluctuates across datasets, GCN-Transformer performs consistently, proving its robustness in multi-person pose forecasting and providing an excellent foundation for its application in different domains.
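The FJPTE metric combines a final-position term with a trajectory term. Its exact formulation and weighting are defined in the paper; the sketch below only illustrates the general idea, assuming an unweighted sum of the final-frame joint error and the mean per-step trajectory error, with hypothetical array shapes:

```python
import numpy as np

def fjpte(pred, gt):
    """Illustrative sketch of a final-position-plus-trajectory error.

    pred, gt: arrays of shape (T, J, 3) -- T forecast frames, J joints,
    3D joint coordinates. The paper's actual weighting may differ; this
    version simply adds the final-frame error to the mean trajectory error.
    """
    # Mean Euclidean distance over joints at the final forecast frame
    final_err = np.linalg.norm(pred[-1] - gt[-1], axis=-1).mean()
    # Mean Euclidean distance over every frame and joint along the trajectory
    traj_err = np.linalg.norm(pred - gt, axis=-1).mean()
    return final_err + traj_err
```

A perfect forecast scores zero; a constant 1-unit offset along one axis scores 2, one unit from each term.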

16 pages, 1659 KiB  
Article
DualPose: Dual-Block Transformer Decoder with Contrastive Denoising for Multi-Person Pose Estimation
by Matteo Fincato and Roberto Vezzani
Sensors 2025, 25(10), 2997; https://doi.org/10.3390/s25102997 - 9 May 2025
Viewed by 193
Abstract
Multi-person pose estimation is the task of detecting and regressing the keypoint coordinates of multiple people in a single image. Significant progress has been achieved in recent years, especially with the introduction of transformer-based end-to-end methods. In this paper, we present DualPose, a novel framework that enhances multi-person pose estimation by leveraging a dual-block transformer decoding architecture. Class prediction and keypoint estimation are split into parallel blocks so that each sub-task can be improved separately and the risk of interference is reduced. This architecture improves the precision of keypoint localization and the model's capacity to accurately classify individuals. In addition, the Keypoint-Block processes self-attentions in parallel, a strategy that further improves keypoint localization accuracy. DualPose also incorporates a contrastive denoising (CDN) mechanism, leveraging positive and negative samples to stabilize training and improve robustness: by introducing controlled noise into the ground truth, CDN creates a variety of training samples that improve the model's ability to discern valid from incorrect keypoints. DualPose achieves state-of-the-art results, outperforming recent end-to-end methods, as shown by extensive experiments on the MS COCO and CrowdPose datasets. The code and pretrained models are publicly available.
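The contrastive-denoising idea, small perturbations of the ground truth as positives to be denoised and large perturbations as negatives to be rejected, can be sketched roughly as below; the noise scales and the 17-keypoint layout are hypothetical, not the paper's actual hyperparameters:

```python
import numpy as np

def make_cdn_samples(gt_keypoints, pos_scale=0.02, neg_scale=0.2, seed=0):
    """Sketch of contrastive-denoising sample generation (assumed scales).

    gt_keypoints: (J, 2) normalized ground-truth keypoint coordinates.
    Positive samples receive small Gaussian noise the decoder should
    denoise back to the ground truth; negative samples receive large
    noise the decoder should learn to reject as invalid keypoints.
    """
    rng = np.random.default_rng(seed)
    positives = gt_keypoints + rng.normal(0.0, pos_scale, gt_keypoints.shape)
    negatives = gt_keypoints + rng.normal(0.0, neg_scale, gt_keypoints.shape)
    return positives, negatives
```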

19 pages, 30474 KiB  
Article
Multi-Head Attention-Based Framework with Residual Network for Human Action Recognition
by Basheer Al-Tawil, Magnus Jung, Thorsten Hempel and Ayoub Al-Hamadi
Sensors 2025, 25(9), 2930; https://doi.org/10.3390/s25092930 - 6 May 2025
Viewed by 233
Abstract
Human action recognition (HAR) is essential for understanding and classifying human movements and is widely used in real-life applications such as human–computer interaction and assistive robotics. However, recognizing patterns across different temporal scales remains challenging: traditional methods struggle with complex timing patterns, intra-class variability, and inter-class similarities, leading to misclassifications. In this paper, we propose a deep learning framework for efficient and robust HAR. It integrates a residual network (ResNet-18) for spatial feature extraction and a Bi-LSTM for temporal feature extraction, while a multi-head attention mechanism enhances the prioritization of crucial motion details. Additionally, we introduce a motion-based frame selection strategy that uses optical flow to reduce redundancy and enhance efficiency, ensuring accurate, real-time recognition of both simple and complex actions. We evaluate the framework on the UCF-101 dataset, achieving 96.60% accuracy and demonstrating competitive performance against state-of-the-art approaches. Moreover, the framework operates at 222 frames per second (FPS), an effective balance between recognition performance and computational efficiency. We also deployed and tested the framework on a mobile service robot, TIAGo, validating its real-time applicability in real-world scenarios. It effectively models human actions while minimizing frame dependency, making it well suited for real-time applications.
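The motion-based frame selection step can be sketched as follows. The paper uses optical flow; as a dependency-free stand-in, this hypothetical version scores each frame by its mean absolute intensity change from the previous frame and keeps the k highest-motion frames in temporal order:

```python
import numpy as np

def select_key_frames(frames, k):
    """Hypothetical sketch of motion-based frame selection.

    frames: array of shape (N, H, W) grayscale video frames. Frame-to-frame
    intensity difference stands in for optical flow magnitude here; a real
    implementation would use a dense optical flow estimator instead.
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    motion = np.concatenate([[0.0], diffs])   # the first frame scores zero motion
    return np.sort(np.argsort(motion)[-k:])   # top-k by motion, temporal order
```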

29 pages, 9831 KiB  
Article
Quality of Experience (QoE) in Cloud Gaming: A Comparative Analysis of Deep Learning Techniques via Facial Emotions in a Virtual Reality Environment
by Awais Khan Jumani, Jinglun Shi, Asif Ali Laghari, Muhammad Ahmad Amin, Aftab ul Nabi, Kamlesh Narwani and Yi Zhang
Sensors 2025, 25(5), 1594; https://doi.org/10.3390/s25051594 - 5 Mar 2025
Viewed by 821
Abstract
Cloud gaming has rapidly transformed the gaming industry, allowing users to play games on demand from anywhere without the need for powerful hardware. Cloud service providers strive to enhance user Quality of Experience (QoE) using traditional assessment methods, but these methods often fail to capture the actual QoE: some users are not diligent about providing feedback on cloud services, while other players claim they are not receiving the promised services even after being served according to the Service Level Agreement (SLA). This poses a significant challenge for cloud service providers in accurately identifying QoE and improving their actual services. In this paper, we compare our previously proposed technique, EmotionNET, a convolutional neural network (CNN) model that assesses QoE through players' facial expressions during cloud gaming sessions in a virtual reality (VR) environment, with three other deep learning (DL) techniques: ConvoNEXT, EfficientNET, and Vision Transformer (ViT). We trained all four models on our custom-developed dataset, achieving 98.9% training accuracy and 87.8% validation accuracy with EmotionNET. Based on the training and comparison results, EmotionNET clearly predicts and performs better than the other models. Finally, we compared the EmotionNET results on two network datasets (WiFi and mobile data). Our findings indicate that facial expressions are strongly correlated with QoE.

17 pages, 1198 KiB  
Article
Decision Fusion-Based Deep Learning for Channel State Information Channel-Aware Human Action Recognition
by Domonkos Varga
Sensors 2025, 25(4), 1061; https://doi.org/10.3390/s25041061 - 10 Feb 2025
Viewed by 977
Abstract
WiFi channel state information (CSI) has emerged as a promising modality for human action recognition due to its non-invasive nature and robustness in diverse environments. However, most existing methods process CSI channels collectively, potentially overlooking valuable channel-specific information. In this study, we propose a novel architecture, DF-CNN, which treats CSI channels separately and integrates their outputs using a decision fusion (DF) strategy. Extensive experiments demonstrate that DF-CNN significantly outperforms traditional approaches, achieving state-of-the-art performance. We also provide a comprehensive analysis of individual and combined CSI channel evaluations, showcasing the effectiveness of our method. This work establishes the importance of separate channel processing in CSI-based human action recognition and sets a new benchmark for the field.
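The decision-fusion idea, classifying each CSI channel with its own network and then fusing the per-channel decisions, can be sketched as below. Averaging softmax probabilities is one common late-fusion rule and an assumption here, not necessarily the paper's exact strategy:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_decisions(channel_logits):
    """Sketch of late decision fusion over per-channel classifiers.

    channel_logits: list of (num_classes,) logit vectors, one per CSI
    channel, each produced by that channel's own classifier (hypothetical
    models here). Returns the fused predicted class index.
    """
    probs = np.mean([softmax(l) for l in channel_logits], axis=0)
    return int(np.argmax(probs))
```

A confidently wrong channel can thus be outvoted by a more confident correct one, which is the practical appeal of fusing decisions rather than concatenating raw channels.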

16 pages, 2388 KiB  
Article
Mitigating Data Leakage in a WiFi CSI Benchmark for Human Action Recognition
by Domonkos Varga
Sensors 2024, 24(24), 8201; https://doi.org/10.3390/s24248201 - 22 Dec 2024
Cited by 2 | Viewed by 1434
Abstract
Human action recognition using WiFi channel state information (CSI) has gained attention due to its non-intrusive nature and potential applications in healthcare, smart environments, and security. However, the reliability of methods developed for CSI-based action recognition is often contingent on the quality of the datasets and evaluation protocols used. In this paper, we uncovered a critical data leakage issue, which arises from improper data partitioning, in a widely used WiFi CSI benchmark dataset. Specifically, the benchmark fails to separate individuals between the training and test sets, leading to inflated performance metrics as models inadvertently learn individual-specific features rather than generalizable action patterns. We analyzed this issue in depth, retrained several benchmarked models using corrected data partitioning methods, and demonstrated a significant drop in accuracy when individuals were properly separated across training and testing. Our findings highlight the importance of rigorous data partitioning in CSI-based action recognition and provide recommendations for mitigating data leakage in future research. This work contributes to the development of more robust and reliable human action recognition systems using WiFi CSI.
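The corrected partitioning the paper calls for, splitting by person rather than by sample, can be sketched as below (the function and parameter names are illustrative):

```python
import numpy as np

def subject_wise_split(subject_ids, test_fraction=0.3, seed=0):
    """Sketch of leakage-free, subject-wise train/test partitioning.

    subject_ids: per-sample array of person identifiers. Every sample of a
    given person lands entirely in train or entirely in test, so a model
    cannot score well by memorizing individual-specific features.
    Returns boolean train and test masks over the samples.
    """
    rng = np.random.default_rng(seed)
    subjects = np.unique(subject_ids)
    rng.shuffle(subjects)
    n_test = max(1, int(round(test_fraction * len(subjects))))
    test_subjects = set(subjects[:n_test].tolist())
    test_mask = np.array([s in test_subjects for s in subject_ids])
    return ~test_mask, test_mask
```

scikit-learn's GroupShuffleSplit provides the same guarantee off the shelf, with the subject identifier passed as the group label.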
