Deep Learning Applications for Pose Estimation and Human Action Recognition—2nd Edition

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Intelligent Sensors".

Deadline for manuscript submissions: 15 September 2025 | Viewed by 4296

Special Issue Editors


Guest Editor
Department of Computer, Control and Management Engineering, Sapienza University of Rome, Via Ariosto 25, 00185 Rome, Italy
Interests: deep learning; machine learning; computer vision; depth estimation; attitude and pose estimation

Guest Editor
Department of Civil and Environmental Engineering, University of Florence, Via S. Marta 3, 50139 Florence, Italy
Interests: navigation and positioning; attitude and pose estimation; 3D modeling; geomatics; sensors; deep learning; computer vision; climate change; cultural heritage preservation; remote sensing

Special Issue Information

Dear Colleagues,

In the last decade, deep learning has drawn significant attention thanks to its robustness and its strong generalization and learning capabilities. It has been tested and successfully deployed across a wide range of real-world tasks with the aim of improving their performance. Among these, pose estimation and human action recognition have benefited from the exceptional results achieved in the deep learning field, although they still show wide margins for improvement.

This Special Issue aims to gather a significant collection of original contributions on these topics. Accurate vehicle and human pose estimation is crucial for several applications, e.g., animal behavior research, gaming and virtual reality, medicine and biotechnology, pedestrian, aerial, and maritime navigation, robotics, and human motion tracking. Furthermore, effective human pose and action recognition makes an important contribution in many fields, such as physical therapy diagnosis and patient rehabilitation, security and surveillance, and the development of employee-free stores.

The relevant topics of this issue include, but are not limited to, the following:

  • Single- and multi-human pose estimation, action recognition, and tracking;
  • Terrestrial, maritime, and aerial robot pose estimation and tracking;
  • Literature reviews and surveys;
  • Datasets and sensors;
  • Interesting applications and ideas focusing on surveillance, autonomous navigation, human–robot interaction, healthcare, sports, etc.

Dr. Paolo Russo
Dr. Fabiana Di Ciaccio
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • deep learning
  • action recognition
  • pose estimation
  • human activities
  • robotics and intelligent systems
  • navigation
  • positioning
  • control
  • datasets
  • sensors
  • embedded systems and devices

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (6 papers)


Research

37 pages, 2036 KiB  
Article
GCN-Transformer: Graph Convolutional Network and Transformer for Multi-Person Pose Forecasting Using Sensor-Based Motion Data
by Romeo Šajina, Goran Oreški and Marina Ivašić-Kos
Sensors 2025, 25(10), 3136; https://doi.org/10.3390/s25103136 - 15 May 2025
Viewed by 206
Abstract
Multi-person pose forecasting predicts the future body poses of multiple individuals over time, a task that couples complex movement dynamics with interaction dependencies. Its relevance spans computer vision, robotics, human–computer interaction, and surveillance, and it is particularly important in sensor-driven applications, where motion capture systems, including vision-based sensors and IMUs, provide crucial data for analyzing human movement. This paper introduces GCN-Transformer, a novel model for multi-person pose forecasting that integrates Graph Convolutional Network and Transformer architectures. We integrated novel loss terms during training to enable the model to learn both interaction dependencies and the trajectories of multiple joints simultaneously. Additionally, we propose a novel evaluation metric, the Final Joint Position and Trajectory Error (FJPTE), which assesses both local movement dynamics and global movement errors by considering the final position together with the trajectory leading up to it, providing a more comprehensive assessment of movement dynamics. Our model uniquely combines scene-level graph-based encoding with personalized attention-based decoding. It is trained and evaluated on the CMU-Mocap, MuPoTS-3D, SoMoF Benchmark, and ExPI datasets, all collected using sensor-based motion capture systems, ensuring applicability in real-world scenarios, and it consistently outperforms existing state-of-the-art (SOTA) models according to the VIM and MPJPE metrics. Specifically, on the MPJPE metric, GCN-Transformer improves over the closest SOTA model by 4.7% on CMU-Mocap, 4.3% on MuPoTS-3D, 5% on the SoMoF Benchmark, and 2.6% on ExPI. Unlike models whose performance fluctuates across datasets, GCN-Transformer performs consistently, proving its robustness in multi-person pose forecasting and providing an excellent foundation for its application in different domains.
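The FJPTE metric combines a final-position term with a trajectory term. Its exact formulation and weighting are defined in the paper; the sketch below only illustrates the general idea, assuming an unweighted sum of the final-frame joint error and the mean per-step trajectory error, with hypothetical array shapes:

```python
import numpy as np

def fjpte(pred, gt):
    """Illustrative sketch of a final-position-plus-trajectory error.

    pred, gt: arrays of shape (T, J, 3) -- T forecast frames, J joints,
    3D joint coordinates. The paper's actual weighting may differ; this
    version simply adds the final-frame error to the mean trajectory error.
    """
    # Mean Euclidean distance over joints at the final forecast frame
    final_err = np.linalg.norm(pred[-1] - gt[-1], axis=-1).mean()
    # Mean Euclidean distance over every frame and joint along the trajectory
    traj_err = np.linalg.norm(pred - gt, axis=-1).mean()
    return final_err + traj_err
```

A perfect forecast scores zero; a constant 1-unit offset along one axis scores 2, one unit from each term.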

16 pages, 1659 KiB  
Article
DualPose: Dual-Block Transformer Decoder with Contrastive Denoising for Multi-Person Pose Estimation
by Matteo Fincato and Roberto Vezzani
Sensors 2025, 25(10), 2997; https://doi.org/10.3390/s25102997 - 9 May 2025
Viewed by 193
Abstract
Multi-person pose estimation is the task of detecting and regressing the keypoint coordinates of multiple people in a single image. Significant progress has been achieved in recent years, especially with the introduction of transformer-based end-to-end methods. In this paper, we present DualPose, a novel framework that enhances multi-person pose estimation by leveraging a dual-block transformer decoding architecture. Class prediction and keypoint estimation are split into parallel blocks so that each sub-task can be improved separately and the risk of interference is reduced. This architecture improves the precision of keypoint localization and the model's capacity to accurately classify individuals. In addition, the Keypoint-Block processes self-attentions in parallel, a strategy that further improves keypoint localization accuracy. DualPose also incorporates a contrastive denoising (CDN) mechanism, leveraging positive and negative samples to stabilize training and improve robustness: by introducing controlled noise into the ground truth, CDN creates a variety of training samples that improve the model's ability to discern valid from incorrect keypoints. DualPose achieves state-of-the-art results, outperforming recent end-to-end methods, as shown by extensive experiments on the MS COCO and CrowdPose datasets. The code and pretrained models are publicly available.
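The contrastive-denoising idea, small perturbations of the ground truth as positives to be denoised and large perturbations as negatives to be rejected, can be sketched roughly as below; the noise scales and the 17-keypoint layout are hypothetical, not the paper's actual hyperparameters:

```python
import numpy as np

def make_cdn_samples(gt_keypoints, pos_scale=0.02, neg_scale=0.2, seed=0):
    """Sketch of contrastive-denoising sample generation (assumed scales).

    gt_keypoints: (J, 2) normalized ground-truth keypoint coordinates.
    Positive samples receive small Gaussian noise the decoder should
    denoise back to the ground truth; negative samples receive large
    noise the decoder should learn to reject as invalid keypoints.
    """
    rng = np.random.default_rng(seed)
    positives = gt_keypoints + rng.normal(0.0, pos_scale, gt_keypoints.shape)
    negatives = gt_keypoints + rng.normal(0.0, neg_scale, gt_keypoints.shape)
    return positives, negatives
```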

19 pages, 30474 KiB  
Article
Multi-Head Attention-Based Framework with Residual Network for Human Action Recognition
by Basheer Al-Tawil, Magnus Jung, Thorsten Hempel and Ayoub Al-Hamadi
Sensors 2025, 25(9), 2930; https://doi.org/10.3390/s25092930 - 6 May 2025
Viewed by 233
Abstract
Human action recognition (HAR) is essential for understanding and classifying human movements and is widely used in real-life applications such as human–computer interaction and assistive robotics. However, recognizing patterns across different temporal scales remains challenging: traditional methods struggle with complex timing patterns, intra-class variability, and inter-class similarities, leading to misclassifications. In this paper, we propose a deep learning framework for efficient and robust HAR. It integrates a residual network (ResNet-18) for spatial feature extraction and a Bi-LSTM for temporal feature extraction, while a multi-head attention mechanism enhances the prioritization of crucial motion details. Additionally, we introduce a motion-based frame selection strategy that uses optical flow to reduce redundancy and enhance efficiency, ensuring accurate, real-time recognition of both simple and complex actions. We evaluate the framework on the UCF-101 dataset, achieving 96.60% accuracy and demonstrating competitive performance against state-of-the-art approaches. Moreover, the framework operates at 222 frames per second (FPS), an effective balance between recognition performance and computational efficiency. We also deployed and tested the framework on a mobile service robot, TIAGo, validating its real-time applicability in real-world scenarios. It effectively models human actions while minimizing frame dependency, making it well suited for real-time applications.
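The motion-based frame selection step can be sketched as follows. The paper uses optical flow; as a dependency-free stand-in, this hypothetical version scores each frame by its mean absolute intensity change from the previous frame and keeps the k highest-motion frames in temporal order:

```python
import numpy as np

def select_key_frames(frames, k):
    """Hypothetical sketch of motion-based frame selection.

    frames: array of shape (N, H, W) grayscale video frames. Frame-to-frame
    intensity difference stands in for optical flow magnitude here; a real
    implementation would use a dense optical flow estimator instead.
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2))
    motion = np.concatenate([[0.0], diffs])   # the first frame scores zero motion
    return np.sort(np.argsort(motion)[-k:])   # top-k by motion, temporal order
```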

29 pages, 9831 KiB  
Article
Quality of Experience (QoE) in Cloud Gaming: A Comparative Analysis of Deep Learning Techniques via Facial Emotions in a Virtual Reality Environment
by Awais Khan Jumani, Jinglun Shi, Asif Ali Laghari, Muhammad Ahmad Amin, Aftab ul Nabi, Kamlesh Narwani and Yi Zhang
Sensors 2025, 25(5), 1594; https://doi.org/10.3390/s25051594 - 5 Mar 2025
Viewed by 821
Abstract
Cloud gaming has rapidly transformed the gaming industry, allowing users to play games on demand from anywhere without the need for powerful hardware. Cloud service providers strive to enhance user Quality of Experience (QoE) using traditional assessment methods, but these methods often fail to capture the actual QoE: some users are not diligent about providing feedback on cloud services, while other players claim they are not receiving the promised services even after being served according to the Service Level Agreement (SLA). This poses a significant challenge for cloud service providers in accurately identifying QoE and improving their actual services. In this paper, we compare our previously proposed technique, EmotionNET, a convolutional neural network (CNN) model that assesses QoE through players' facial expressions during cloud gaming sessions in a virtual reality (VR) environment, with three other deep learning (DL) techniques: ConvoNEXT, EfficientNET, and Vision Transformer (ViT). We trained all four models on our custom-developed dataset, achieving 98.9% training accuracy and 87.8% validation accuracy with EmotionNET. Based on the training and comparison results, EmotionNET clearly predicts and performs better than the other models. Finally, we compared the EmotionNET results on two network datasets (WiFi and mobile data). Our findings indicate that facial expressions are strongly correlated with QoE.

17 pages, 1198 KiB  
Article
Decision Fusion-Based Deep Learning for Channel State Information Channel-Aware Human Action Recognition
by Domonkos Varga
Sensors 2025, 25(4), 1061; https://doi.org/10.3390/s25041061 - 10 Feb 2025
Viewed by 977
Abstract
WiFi channel state information (CSI) has emerged as a promising modality for human action recognition due to its non-invasive nature and robustness in diverse environments. However, most existing methods process CSI channels collectively, potentially overlooking valuable channel-specific information. In this study, we propose a novel architecture, DF-CNN, which treats CSI channels separately and integrates their outputs using a decision fusion (DF) strategy. Extensive experiments demonstrate that DF-CNN significantly outperforms traditional approaches, achieving state-of-the-art performance. We also provide a comprehensive analysis of individual and combined CSI channel evaluations, showcasing the effectiveness of our method. This work establishes the importance of separate channel processing in CSI-based human action recognition and sets a new benchmark for the field.
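The decision-fusion idea, classifying each CSI channel with its own network and then fusing the per-channel decisions, can be sketched as below. Averaging softmax probabilities is one common late-fusion rule and an assumption here, not necessarily the paper's exact strategy:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def fuse_decisions(channel_logits):
    """Sketch of late decision fusion over per-channel classifiers.

    channel_logits: list of (num_classes,) logit vectors, one per CSI
    channel, each produced by that channel's own classifier (hypothetical
    models here). Returns the fused predicted class index.
    """
    probs = np.mean([softmax(l) for l in channel_logits], axis=0)
    return int(np.argmax(probs))
```

A confidently wrong channel can thus be outvoted by a more confident correct one, which is the practical appeal of fusing decisions rather than concatenating raw channels.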

16 pages, 2388 KiB  
Article
Mitigating Data Leakage in a WiFi CSI Benchmark for Human Action Recognition
by Domonkos Varga
Sensors 2024, 24(24), 8201; https://doi.org/10.3390/s24248201 - 22 Dec 2024
Cited by 2 | Viewed by 1434
Abstract
Human action recognition using WiFi channel state information (CSI) has gained attention due to its non-intrusive nature and potential applications in healthcare, smart environments, and security. However, the reliability of methods developed for CSI-based action recognition is often contingent on the quality of the datasets and evaluation protocols used. In this paper, we uncovered a critical data leakage issue, which arises from improper data partitioning, in a widely used WiFi CSI benchmark dataset. Specifically, the benchmark fails to separate individuals between the training and test sets, leading to inflated performance metrics as models inadvertently learn individual-specific features rather than generalizable action patterns. We analyzed this issue in depth, retrained several benchmarked models using corrected data partitioning methods, and demonstrated a significant drop in accuracy when individuals were properly separated across training and testing. Our findings highlight the importance of rigorous data partitioning in CSI-based action recognition and provide recommendations for mitigating data leakage in future research. This work contributes to the development of more robust and reliable human action recognition systems using WiFi CSI.
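The corrected partitioning the paper calls for, splitting by person rather than by sample, can be sketched as below (the function and parameter names are illustrative):

```python
import numpy as np

def subject_wise_split(subject_ids, test_fraction=0.3, seed=0):
    """Sketch of leakage-free, subject-wise train/test partitioning.

    subject_ids: per-sample array of person identifiers. Every sample of a
    given person lands entirely in train or entirely in test, so a model
    cannot score well by memorizing individual-specific features.
    Returns boolean train and test masks over the samples.
    """
    rng = np.random.default_rng(seed)
    subjects = np.unique(subject_ids)
    rng.shuffle(subjects)
    n_test = max(1, int(round(test_fraction * len(subjects))))
    test_subjects = set(subjects[:n_test].tolist())
    test_mask = np.array([s in test_subjects for s in subject_ids])
    return ~test_mask, test_mask
```

scikit-learn's GroupShuffleSplit provides the same guarantee off the shelf, with the subject identifier passed as the group label.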
