Search Results (10)

Search Parameters:
Keywords = ViTPose

13 pages, 1474 KB  
Article
A Lightweight Fall Detection Framework for Smart-City CCTV Using Distilled Pose and Interpretable Features
by Doniyorjon Mukhtorov and Young Im Cho
Appl. Sci. 2026, 16(10), 4632; https://doi.org/10.3390/app16104632 - 8 May 2026
Viewed by 247
Abstract
Vision-based fall detection for smart-city CCTV must be fast, interpretable, and robust to nuisance alarms. In real surveillance scenes, false alarms are often caused by sitting, crouching, duplicate detections, short-lived pose noise, and brief posture changes that do not correspond to actual falls. This paper presents a lightweight CCTV fall-detection framework evaluated on URFD, Le2i, and UP-Fall. The proposed method combines teacher-guided distillation from ViTPose to a YOLO26s-pose student, custom person detection, full-body ROI extraction, nine interpretable posture features, Random Forest classification, tracking-based duplicate suppression, and post-event false-positive rejection. The distillation stage improves pose mAP50-95 from 66.8% to 70.9% and pose mAP50 from 88.9% to 91.2%. In the final stand/fall setting, Random Forest with false-positive rejection achieves 98.46% accuracy, 98.61% precision, 98.43% recall, and 98.52% F1-score. The main contribution of this work is a practical and interpretable surveillance framework that integrates distilled lightweight pose estimation, posture-based fall representation, and tracking-aware false-positive suppression for robust deployment-oriented fall detection.
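As a quick consistency check on the figures reported above, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the reported 98.61% precision and 98.43% recall. The function name `f1_score` is illustrative and not taken from the paper, and the check assumes the reported F1-score is derived from those two aggregate values rather than averaged per class.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, in percent.
reported_precision = 98.61
reported_recall = 98.43

f1 = f1_score(reported_precision, reported_recall)
print(f"{f1:.2f}")  # prints 98.52
```

The result rounds to 98.52, which agrees with the reported F1-score.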

28 pages, 4247 KB  
Article
BiMS-Pose: Enhancing Human Pose Estimation in Orchard Spraying Scenarios via Bidirectional Multi-Scale Collaboration
by Yuhang Ren, Zichen Yang, Hanxin Chen, Zhuochao Chen and Daojin Yao
Agriculture 2026, 16(5), 606; https://doi.org/10.3390/agriculture16050606 - 6 Mar 2026
Viewed by 427
Abstract
Most 2D human pose estimation frameworks utilize static designs for multi-scale feature fusion, where information from various scales is integrated using fixed weights. A drawback of these approaches is that they often lead to localization biases in complex scenarios. This paper addresses the issues of multi-scale feature mismatch and joint localization biases in pose estimation. From the perspective of feature processing, multi-scale weights must be adapted to the size and position of joints, while joint predictions should adhere to human anatomical constraints. Existing methods lack effective dynamic adaptation, structural constraints, and bidirectional complementarity between high-level semantics and low-level details. They often experience localization biases in occluded scenarios, and the peaks of their heatmaps demonstrate insufficient consistency with the actual positions of the joints. Through theoretical analysis, we identify the causes of performance gaps and propose directions for narrowing them. We propose Bidirectional Multi-Scale Collaborative Pose Estimation (BiMS-Pose), a framework that introduces dynamic weights to adjust feature proportions, establishes bidirectional topological constraints for joint relationships, and integrates a bidirectional attention flow. The framework filters key information from three dimensions, adjusts filtering strategies in real time, and is enhanced by heatmap optimization to improve localization accuracy. Extensive experiments conducted on COCO, MPII, and our self-built Orchard Spraying Pose Dataset (OSPD) demonstrate the effectiveness of BiMS-Pose. In general scenarios, it achieves a significant 1.2 percentage-point increase in average precision (AP) on the COCO val2017 dataset compared to ViTPose while utilizing the same backbone. 
In agricultural orchard spraying scenarios, it effectively addresses interference factors such as changes in illumination, occlusion, and varying shooting distances, achieving 75.4% average precision (AP) and a 90.7% percentage of correct keypoints (PCKh@0.5) on the OSPD dataset. Additionally, it maintains an average frame rate of 18.3 FPS on embedded devices, effectively meeting the requirements for real-time monitoring. This highlights the model’s potential for precise, stable, and practical human pose estimation in both general and agricultural application scenarios.
(This article belongs to the Special Issue Application of Smart Technologies in Orchard Management)
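For readers unfamiliar with the PCKh@0.5 metric cited above: a predicted keypoint is usually counted as correct when its distance to the ground-truth keypoint is within 0.5 times a head-size reference length. A minimal sketch under that assumption follows. The helper name `pckh`, the single per-person `head_size` argument, and the toy coordinates are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import math


def pckh(pred, gt, head_size, alpha=0.5):
    """Fraction of keypoints whose error is at most alpha * head_size."""
    threshold = alpha * head_size
    # Each comparison yields a bool; summing counts the correct keypoints.
    correct = sum(math.dist(p, g) <= threshold for p, g in zip(pred, gt))
    return correct / len(gt)


# Toy example: four keypoints in pixel coordinates, head size of 10 pixels.
gt = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0), (40.0, 40.0)]
pred = [(11.0, 10.0), (20.0, 23.0), (30.0, 30.0), (48.0, 40.0)]

score = pckh(pred, gt, head_size=10.0)
print(score)  # errors are 1, 3, 0 and 8 pixels; three are within 5 -> 0.75
```

Other PCK variants, such as the PCK@0.05 used elsewhere in this listing, follow the same pattern with a different reference length and `alpha`.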

17 pages, 1594 KB  
Article
TransMODAL: A Dual-Stream Transformer with Adaptive Co-Attention for Efficient Human Action Recognition
by Majid Joudaki, Mehdi Imani and Hamid R. Arabnia
Electronics 2025, 14(16), 3326; https://doi.org/10.3390/electronics14163326 - 21 Aug 2025
Viewed by 2994
Abstract
Human Action Recognition has seen significant advances through transformer-based architectures, yet achieving a nuanced understanding often requires fusing multiple data modalities. Standard models relying solely on RGB video can struggle with actions defined by subtle motion cues rather than appearance. This paper introduces TransMODAL, a novel dual-stream transformer that synergistically fuses spatiotemporal appearance features from a pre-trained VideoMAE (Video Masked AutoEncoders) backbone with explicit skeletal kinematics from a state-of-the-art pose estimation pipeline (RT-DETR (Real-Time DEtection Transformer) + ViTPose++). We propose two key architectural innovations to enable effective and efficient fusion: a CoAttentionFusion module that facilitates deep, iterative cross-modal feature exchange between the RGB and pose streams, and an efficient AdaptiveSelector mechanism that dynamically prunes less informative spatiotemporal tokens to reduce computational overhead. Evaluated on three challenging benchmarks, TransMODAL demonstrates robust generalization, achieving accuracies of 98.5% on KTH, 96.9% on UCF101, and 84.2% on HMDB51. These results significantly outperform a strong VideoMAE-only baseline and are competitive with state-of-the-art methods, demonstrating the profound impact of explicit pose guidance. TransMODAL presents a powerful and efficient paradigm for composing pre-trained foundation models to tackle complex video understanding tasks by providing a fully reproducible implementation and strong benchmark results.

11 pages, 1602 KB  
Article
Evaluating Assistive Technology Outcomes in Boccia Athletes with Disabilities Using AI-Based Kinematic Analysis
by Wann-Yun Shieh, Yan-Ying Ju, Shiu-Yuan Yang, I-Chun Chen and Hsin-Yi Kathy Cheng
Bioengineering 2025, 12(7), 684; https://doi.org/10.3390/bioengineering12070684 - 23 Jun 2025
Cited by 2 | Viewed by 1837
Abstract
This study explores how artificial intelligence (AI) can support the evaluation of assistive technology outcomes in adaptive sports, focusing on elite boccia athletes with disabilities. Using a multi-stage motion analysis framework, we integrated OpenPose, ViTPose, and Lifting to estimate seated joint kinematics with greater precision. Match footage from 12 athletes at the 2018 Asia-Pacific Boccia Open was analyzed across five biomechanical phases: preparation, acceleration, peak, release, and follow-through. AI-enhanced 2D and 3D pose estimation methods were applied to assess throwing strategies and motor variability. ViTPose outperformed OpenPose in joint detection accuracy (F1-score: 85% vs. 79.5%), while Lifting improved 3D estimation by reducing joint position error by 16%. Principal Component Analysis revealed greater movement consistency in overhand throws compared to underhand techniques. The proposed pipeline provides an interpretable and scalable method for measuring performance, motor control, and strategy-specific movement outcomes in boccia, offering practical applications for evidence-based coaching, athlete classification, and the design of inclusive assistive sport technologies.

9 pages, 1106 KB  
Article
Automatic Movement Recognition for Evaluating the Gross Motor Development of Infants
by Yin-Zhang Yang, Jia-An Tsai, Ya-Lan Yu, Mary Hsin-Ju Ko, Hung-Yi Chiou, Tun-Wen Pai and Hui-Ju Chen
Children 2025, 12(3), 310; https://doi.org/10.3390/children12030310 - 28 Feb 2025
Cited by 1 | Viewed by 2354
Abstract
Objective: The objective of this study was to detect gross motor abnormalities early through video analysis in Taiwanese infants aged 2–6 months. Background: The current diagnosis of infant developmental delays primarily relies on clinical examinations. However, during clinical visits, infants may show atypical behaviors due to unfamiliar environments, which might not reflect their true developmental status. Methods: This study utilized videos of infants recorded in their home environments. Two pediatric neurologists manually annotated these clips to identify whether an infant possessed the characteristics of gross motor delays through an assessment of his/her gross motor movements. Using transfer learning techniques, four pose recognition models, including ViTPose, HRNet, DARK, and UDP, were applied to the infant gross motor dataset. Four machine learning classification models, including random forest, support vector machine, logistic regression, and XGBoost, were used to predict the developmental status of infants. Results: The experimental results of pose estimation and tracking indicate that the ViTPose model provided the best performance for pose recognition. A total of 227 features related to kinematics, motions, and postures were extracted and calculated. A one-way ANOVA revealed 106 significant features that were retained for constructing prediction models. The results show that a random forest model achieved the best performance with an average F1-score of 0.94, a weighted average AUC of 0.98, and an average accuracy of 94%.
(This article belongs to the Section Pediatric Neurology & Neurodevelopmental Disorders)

21 pages, 9016 KB  
Article
TomatoPoseNet: An Efficient Keypoint-Based 6D Pose Estimation Model for Non-Destructive Tomato Harvesting
by Jipeng Ni, Licheng Zhu, Lizhong Dong, Ruixue Wang, Kaikang Chen, Jianbo Gao, Wenbei Wang, Liming Zhou, Bo Zhao, Jiacheng Rong, Zhenhao Han, Kunlei Lu and Xuguang Feng
Agronomy 2024, 14(12), 3027; https://doi.org/10.3390/agronomy14123027 - 19 Dec 2024
Cited by 6 | Viewed by 3357
Abstract
The non-destructive harvesting of fresh tomatoes with agricultural robots requires the robotic arm to approach the fruit with the correct posture to ensure successful harvesting. However, this process faces significant challenges due to the small size of fruit pedicels, cluttered environments, and varied poses of the tomatoes and pedicels. Accurately identifying, localizing, and estimating the 6D spatial pose of the cutting points is critical for efficient and non-destructive harvesting. To address these challenges, we propose a keypoint-based pose estimation model, TomatoPoseNet, tailored to meet the agronomic requirements of tomato harvesting. The model integrates an efficient fusion block (EFBlock) based on the CSPLayer, referred to as the CSEFLayer, as the backbone network, designed to fuse multiscale features while maintaining efficient computational resource usage. Next, a parallel deep fusion network (PDFN) is utilized as the neck network to integrate features from multiple parallel branches. Furthermore, simple coordinate classification (SimCC) is employed as the head network for keypoint detection, and a StripPooling block is introduced to enhance the model’s ability to capture features of different scales and shapes by applying strip pooling in horizontal and vertical directions. Finally, a geometric model is constructed based on the information about the predicted 3D keypoints to estimate the 6D pose of the cutting points. The results show the following: (1) The average precision for keypoint detection (PCK@0.05) reached 82.51%, surpassing those of ViTPose, HRNet, Lite-HRNet, Hourglass, and RTMPose by 3.78%, 9.46%, 11%, 9.14%, and 10.07%, respectively. (2) The mean absolute errors (MAEs) of the yaw and pitch angles for 6D pose estimation of the cutting points were 2.98° and 3.54°, respectively, with maximum errors within 6.5°, meeting the requirements for harvesting. 
The experimental results demonstrate that the proposed method can accurately locate the 6D pose of cutting points in an unstructured tomato harvesting environment, enabling non-destructive harvesting.

25 pages, 7113 KB  
Article
LidPose: Real-Time 3D Human Pose Estimation in Sparse Lidar Point Clouds with Non-Repetitive Circular Scanning Pattern
by Lóránt Kovács, Balázs M. Bódis and Csaba Benedek
Sensors 2024, 24(11), 3427; https://doi.org/10.3390/s24113427 - 26 May 2024
Cited by 13 | Viewed by 6276
Abstract
In this paper, we propose a novel, vision-transformer-based end-to-end pose estimation method, LidPose, for real-time human skeleton estimation in non-repetitive circular scanning (NRCS) lidar point clouds. Building on the ViTPose architecture, we introduce novel adaptations to address the unique properties of NRCS lidars, namely, the sparsity and unusual rosetta-like scanning pattern. The proposed method addresses a common issue of NRCS lidar-based perception, namely, the sparsity of the measurement, which needs balancing between the spatial and temporal resolution of the recorded data for efficient analysis of various phenomena. LidPose utilizes foreground and background segmentation techniques for the NRCS lidar sensor to select a region of interest (RoI), making LidPose a complete end-to-end approach to moving pedestrian detection and skeleton fitting from raw NRCS lidar measurement sequences captured by a static sensor for surveillance scenarios. To evaluate the method, we have created a novel, real-world, multi-modal dataset, containing camera images and lidar point clouds from a Livox Avia sensor, with annotated 2D and 3D human skeleton ground truth.
(This article belongs to the Section Optical Sensors)

17 pages, 32154 KB  
Article
High-Resolution Multi-Scale Feature Fusion Network for Running Posture Estimation
by Xiaobing Xu and Yaping Zhang
Appl. Sci. 2024, 14(7), 3065; https://doi.org/10.3390/app14073065 - 5 Apr 2024
Cited by 6 | Viewed by 2641
Abstract
Running posture estimation is a specialized task in human pose estimation that has received relatively little research attention due to the lack of appropriate datasets. To address this issue, this paper presents the construction of a new benchmark dataset called “Running Human”, which was specifically designed for running sports. This dataset contains over 1000 images along with comprehensive annotations for 1288 instances of running humans, including bounding boxes and keypoint annotations on the human body. Additionally, a Receptive Field Spatial Pooling (RFSP) module was developed to tackle the challenge of joint occlusion, which is common in running sports images. This module was incorporated into the High-Resolution Network (HRNet) model, resulting in a novel network model named the Running Human Posture Network (RHPNet). By expanding the receptive field and effectively utilizing multi-scale features extracted from the multi-branch network, the RHPNet model significantly enhances the accuracy of running posture estimation. On the Running Human dataset, the proposed method achieved state-of-the-art performance. Furthermore, experiments were conducted on two benchmark datasets. Compared to the state-of-the-art ViTPose-L method, when applied to the COCO dataset, RHPNet demonstrated comparable prediction accuracy while utilizing only one tenth of the parameters and one eighth of the floating-point operations (FLOPs). On the MPII dataset, RHPNet achieves a PCKh@0.5 score of 92.0, which is only 0.5 points lower than the state-of-the-art method, PCT. These experimental results provide strong validation for the effectiveness and excellent generalization ability of the proposed method.

14 pages, 3267 KB  
Article
THANet: Transferring Human Pose Estimation to Animal Pose Estimation
by Jincheng Liao, Jianzhong Xu, Yunhang Shen and Shaohui Lin
Electronics 2023, 12(20), 4210; https://doi.org/10.3390/electronics12204210 - 11 Oct 2023
Cited by 9 | Viewed by 3845
Abstract
Animal pose estimation (APE) boosts the understanding of animal behaviors. Recent vision-based APE has attracted extensive attention due to the advantages of contactless and sensorless applications. One of the main challenges in APE is the lack of high-quality keypoint annotations for different animal species since manually annotating the animal keypoints is very expensive and time-consuming. Existing works alleviate this problem by synthesizing APE data and generating pseudo-labels for unlabeled animal images. However, feature representations learned from synthetic images could not be directly transferred to real-world scenarios, and the generated pseudo-labels are usually noisy, which limits the model’s performance. To address the above challenge, we propose a novel cross-domain vision transformer for APE to Transfer Human pose estimation to Animal pose estimation, termed THANet, as humans share skeleton similarities with some animals. Inspired by the success of ViTPose in HPE, we design a unified vision transformer encoder to extract universal features for both animals and humans followed by two task-specific decoders. We further introduce a simple but effective cross-domain discriminator to bridge the domain gaps between the human pose and the animal pose. We evaluated the proposed THANet on the AP-10K and Animal-Pose benchmarks, and the extensive experiments show that our method achieves a promising performance. Specifically, the proposed vision transformer and cross-domain method significantly improve the model’s accuracy and generalization ability for APE.
(This article belongs to the Special Issue Advances in Computer Vision and Multimedia Information Processing)

14 pages, 19358 KB  
Article
RefinePose: Towards More Refined Human Pose Estimation
by Hao Dong, Guodong Wang, Chenglizhao Chen and Xinyue Zhang
Electronics 2022, 11(23), 4060; https://doi.org/10.3390/electronics11234060 - 6 Dec 2022
Cited by 3 | Viewed by 4486
Abstract
Human pose estimation is a very important research topic in computer vision and attracts more and more researchers. Recently, ViTPose based on heatmap representation refreshed the state of the art for pose estimation methods. However, we find in our experiments that ViTPose still has room for improvement. On the one hand, the PatchEmbedding module of ViTPose uses a convolutional layer with a stride of 14 × 14 to downsample the input image, resulting in the loss of a significant amount of feature information. On the other hand, the two decoding methods (Classical Decoder and Simple Decoder) used by ViTPose are not refined enough: transpose convolution in the Classical Decoder produces the inherent chessboard effect; the upsampling factor in the Simple Decoder is too large, resulting in a blurry heatmap. To this end, we propose a novel pose estimation method based on ViTPose, termed RefinePose. In RefinePose, we design the GradualEmbedding module and the Fusion Decoder to solve these two problems, respectively. More specifically, the GradualEmbedding module only downsamples the image to 1/2 of the original size in each downsampling stage, and it reduces the input image to a fixed size (16 × 112 in ViTPose) through multiple downsampling stages. At the same time, we fuse the outputs of max pooling layers and convolutional layers in each downsampling stage, which retains more meaningful feature information. In the decoding stage, our Fusion Decoder combines bilinear interpolation with max unpooling layers, and gradually upsamples the feature maps to restore the predicted heatmap. In addition, we also design the FeatureAggregation module to aggregate features after sampling (upsampling and downsampling). We validate RefinePose on the COCO dataset. The experiments show that RefinePose achieves better performance than ViTPose.