Robust Human Pose Estimation Method for Body-to-Body Occlusion Using RGB-D Fusion Neural Network
Abstract
1. Introduction
- We propose a novel architecture that integrates depth image features into a bottom-up HPE model. In previous RGB-D-based HPE models [26,27,28], depth images have been used primarily to augment the base features produced by the feature extractor. In contrast, our approach progressively fuses depth features with color-image features at each stage of the HPE process, which substantially improves performance in occlusion scenes; a minimal illustrative sketch of this progressive fusion is given after this list.
- We specifically selected images containing occlusion scenarios and experimentally demonstrated that leveraging features extracted from depth images clearly improves pose estimation performance in these cases. Furthermore, we analyzed how RGB and depth images contribute to pose estimation at each feature fusion stage.
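The sketch below illustrates what "progressively fusing depth features at each stage" could look like in PyTorch. It is a minimal sketch under stated assumptions: the stage layers, the concatenation-plus-1×1-convolution fusion operator, and the names `ProgressiveRGBDFusion`, `rgb_stages`, `depth_stages`, and `fuse` are illustrative choices, not the exact layers of the proposed network (Section 3.1).

```python
import torch
import torch.nn as nn


class ProgressiveRGBDFusion(nn.Module):
    """Illustrative sketch: inject depth features into RGB features at every stage.

    The actual architecture may differ in depth, channel widths, and fusion operator;
    this only demonstrates the progressive-fusion idea as opposed to fusing once at
    the input (early fusion) or once at the output (late fusion).
    """

    def __init__(self, channels: int = 64, num_stages: int = 3):
        super().__init__()
        self.rgb_stages = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_stages)]
        )
        self.depth_stages = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_stages)]
        )
        # 1x1 convolutions merge the concatenated RGB and depth features per stage.
        self.fuse = nn.ModuleList(
            [nn.Conv2d(2 * channels, channels, 1) for _ in range(num_stages)]
        )

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        for rgb_stage, depth_stage, fuse in zip(self.rgb_stages, self.depth_stages, self.fuse):
            rgb_feat = torch.relu(rgb_stage(rgb_feat))
            depth_feat = torch.relu(depth_stage(depth_feat))
            # Progressive fusion: depth cues are injected at every stage.
            rgb_feat = torch.relu(fuse(torch.cat([rgb_feat, depth_feat], dim=1)))
        return rgb_feat
```

For example, `ProgressiveRGBDFusion()(torch.randn(1, 64, 96, 72), torch.randn(1, 64, 96, 72))` returns a fused 64-channel feature map from which keypoint heatmaps could then be predicted.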
2. Related Works
2.1. Human Pose Estimation Methods Based on RGB Images
2.1.1. OpenPose
2.1.2. Scale-Aware High-Resolution Network
2.1.3. Recent Human Pose Estimation Methods
2.1.4. Feature Extractors for Human Pose Estimation
2.2. Human Pose Estimation Methods Based on Fusion of RGB Images and Other Modalities
2.3. Pose Estimation Methods Based on RGB-D Images
3. Human Pose Estimation Method by Progressive Feature Fusion
3.1. Network Architecture
3.2. Loss Function
3.3. Ground Truth for Heatmap Representation
3.4. Generation of Body-to-Body Occlusion Samples
4. Experimental Results
4.1. Dataset
4.2. Performance Metrics in Human Pose Estimation
4.3. Performance Evaluation of Human Pose Estimation
4.4. Performance Evaluation for Body-to-Body Occlusion Subset
4.5. Qualitative Comparison of Human Pose Estimation
5. Discussion and Future Research
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Yue, R.; Tian, Z.; Du, S. Action recognition based on RGB and skeleton data sets: A survey. Neurocomputing 2022, 512, 287–306.
2. Bux, A.; Angelov, P.; Habib, Z. Vision based human activity recognition: A review. In Proceedings of the UK Workshop on Computational Intelligence, Lancaster, UK, 7–9 September 2016; pp. 341–371.
3. Vrigkas, M.; Nikou, C.; Kakadiaris, I.A. A review of human activity recognition methods. Front. Robot. AI 2015, 2, 28.
4. Duan, H.; Zhao, Y.; Chen, K.; Lin, D.; Dai, B. Revisiting skeleton-based action recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 2969–2978.
5. Karim, M.; Khalid, S.; Aleryani, A.; Khan, J.; Ullah, I.; Ali, Z. Human Action Recognition Systems: A Review of the Trends and State-of-the-Art. IEEE Access 2024, 12, 36372–36390.
6. Liu, Z.; Zhu, J.; Bu, J.; Chen, C. A survey of human pose estimation: The body parts parsing based methods. J. Vis. Commun. Image Represent. 2015, 32, 10–19.
7. Wang, P.; Li, W.; Ogunbona, P.; Wan, J.; Escalera, S. RGB-D-based human motion recognition with deep learning: A survey. Comput. Vis. Image Underst. 2018, 171, 118–139.
8. Presti, L.L.; La Cascia, M. 3D skeleton-based human action classification: A survey. Pattern Recognit. 2016, 53, 130–147.
9. Tome, D.; Russell, C.; Agapito, L. Lifting from the deep: Convolutional 3D pose estimation from a single image. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2500–2509.
10. Huang, H.; Wang, Y.; Linghu, K.; Xia, Z. Multi-modal micro-gesture classification via multi-scale heterogeneous ensemble network. In Proceedings of the Workshop & Challenge on Micro-Gesture Analysis for Hidden Emotion Understanding, Jeju, Republic of Korea, 3–9 August 2024.
11. Wang, Y.; Rui, K.; Huang, H.; Xia, Z. Micro-gesture online recognition with dual-stream multi-scale transformer in long videos. In Proceedings of the Workshop & Challenge on Micro-Gesture Analysis for Hidden Emotion Understanding, Jeju, Republic of Korea, 3–9 August 2024.
12. Dang, Q.; Yin, J.; Wang, B.; Zheng, W. Deep learning based 2D human pose estimation: A survey. Tsinghua Sci. Technol. 2019, 24, 663–676.
13. Lan, G.; Wu, Y.; Hu, F.; Hao, Q. Vision-based human pose estimation via deep learning: A survey. IEEE Trans. Hum.-Mach. Syst. 2022, 53, 253–268.
14. Wang, C.; Zhang, F.; Ge, S.S. A comprehensive survey on 2D multi-person pose estimation methods. Eng. Appl. Artif. Intell. 2021, 102, 104260.
15. Gamra, M.B.; Akhloufi, M.A. A review of deep learning techniques for 2D and 3D human pose estimation. Image Vis. Comput. 2021, 114, 104282.
16. Chen, Y.; Tian, Y.; He, M. Monocular human pose estimation: A survey of deep learning-based methods. Comput. Vis. Image Underst. 2020, 192, 102897.
17. Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. ViTPose: Simple vision transformer baselines for human pose estimation. In Proceedings of the Conference on Neural Information Processing Systems, Virtual, 28 November–9 December 2022; pp. 38571–38584.
18. Jiang, T.; Lu, P.; Zhang, L.; Ma, N.; Han, R.; Lyu, C.; Li, Y.; Chen, K. RTMPose: Real-time multi-person pose estimation based on MMPose. arXiv 2023, arXiv:2303.07399.
19. Yang, S.; Quan, Z.; Nie, M.; Yang, W. TransPose: Keypoint localization via transformer. In Proceedings of the International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11802–11812.
20. Cheng, B.; Xiao, B.; Wang, J.; Shi, H.; Huang, T.S.; Zhang, L. HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5386–5395.
21. Zhao, M.; Li, T.; Abu Alsheikh, M.; Tian, Y.; Zhao, H.; Torralba, A.; Katabi, D. Through-wall human pose estimation using radio signals. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7356–7365.
22. Ghafoor, M.; Mahmood, A. Quantification of occlusion handling capability of a 3D human pose estimation framework. IEEE Trans. Multimed. 2022, 25, 3311–3318.
23. Chen, B.; Chin, T.J.; Klimavicius, M. Occlusion-robust object pose estimation with holistic representation. In Proceedings of the Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 2929–2939.
24. Munea, T.L.; Jembre, Y.Z.; Weldegebriel, H.T.; Chen, L.; Huang, C.; Yang, C. The progress of human pose estimation: A survey and taxonomy of models applied in 2D human pose estimation. IEEE Access 2020, 8, 133330–133348.
25. Bragagnolo, L.; Terreran, M.; Allegro, D.; Ghidoni, S. Multi-view Pose Fusion for Occlusion-Aware 3D Human Pose Estimation. arXiv 2024, arXiv:2408.15810.
26. Zhou, G.; Yan, Y.; Wang, D.; Chen, Q. A novel depth and color feature fusion framework for 6D object pose estimation. IEEE Trans. Multimed. 2020, 23, 1630–1639.
27. Kazakos, E.; Nikou, C.; Kakadiaris, I.A. On the fusion of RGB and depth information for hand pose estimation. In Proceedings of the International Conference on Image Processing, Athens, Greece, 7–10 October 2018; pp. 868–872.
28. Wang, Z.; Lu, Y.; Ni, W.; Song, L. An RGB-D based approach for human pose estimation. In Proceedings of the International Conference on Networking Systems of AI, Shanghai, China, 19–20 November 2021; pp. 166–170.
29. Wei, S.E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional Pose Machines. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4724–4732.
30. Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299.
31. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703.
32. Newell, A.; Huang, Z.; Deng, J. Associative embedding: End-to-end learning for joint detection and grouping. In Proceedings of the Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
33. Chen, X.; Yang, C.; Mo, J.; Sun, Y.; Karmouni, H.; Jiang, Y.; Zheng, Z. CSPNeXt: A new efficient token hybrid backbone. Eng. Appl. Artif. Intell. 2024, 132, 107886.
34. Li, Y.; Yang, S.; Liu, P.; Zhang, S.; Wang, Y.; Wang, Z.; Yang, W.; Xia, S. SimCC: A simple coordinate classification perspective for human pose estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 89–106.
35. Yang, C.H.; Kong, K.B.; Min, S.J.; Wee, D.Y.; Jang, H.D.; Cha, G.H.; Kang, S.J. SEFD: Learning to distill complex pose and occlusion. In Proceedings of the International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 14895–14906.
36. Purkrabek, M.; Matas, J. Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle. arXiv 2024, arXiv:2412.01562.
37. Artacho, B.; Savakis, A. Full-BAPose: Bottom Up Framework for Full Body Pose Estimation. Sensors 2023, 23, 3725.
38. Qu, H.; Cai, Y.; Foo, L.G.; Kumar, A.; Liu, J. A characteristic function-based method for bottom-up human pose estimation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023.
39. Bai, X.; Wei, X.; Wang, Z.; Zhang, M. CONet: Crowd and occlusion-aware network for occluded human pose estimation. Neural Netw. 2024, 172, 106109.
40. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
41. Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018.
42. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
43. Wang, C.Y.; Liao, H.Y.M.; Yeh, I.H.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020.
44. Xie, S.; Girshick, R.; Dollar, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
45. Amin, A.; Tamajo, A.; Klugman, I.; Stoev, E.; Fisho, T.; Lim, H.; Kim, H. Real-time 3D multi-person pose estimation using an omnidirectional camera and mmWave radars. In Proceedings of the International Conference on Engineering and Emerging Technologies, Seoul, Republic of Korea, 20–22 October 2023; pp. 1–6.
46. Knap, P.; Hardy, P.; Tamajo, A.; Lim, H.; Kim, H. Real-time omnidirectional 3D multi-person human pose estimation with occlusion handling. In Proceedings of the ACM SIGGRAPH European Conference on Visual Media Production, London, UK, 6–8 November 2023.
47. Knap, P.; Hardy, P.; Tamajo, A.; Lim, H.; Kim, H. Improving real-time omnidirectional 3D multi-person human pose estimation with people matching and unsupervised 2D–3D lifting. In Proceedings of the International Conference on Electronics, Information, and Communication, Jeju Island, Republic of Korea, 10–13 January 2024; pp. 1–4.
48. Sengupta, A.; Jin, F.; Cao, S. NLP based skeletal pose estimation using mmWave radar point-cloud: A simulation approach. In Proceedings of the IEEE Radar Conference, Atlantic City, NJ, USA, 21–24 September 2020; pp. 1–6.
49. An, S.; Ogras, U.Y. Fast and scalable human pose estimation using mmWave point cloud. In Proceedings of the Design Automation Conference, San Francisco, CA, USA, 10–14 July 2022; pp. 889–894.
50. Li, G.; Zhang, Z.; Yang, H.; Pan, J.; Chen, D.; Zhang, J. Capturing human pose using mmWave radar. In Proceedings of the International Conference on Pervasive Computing and Communications Workshops, Austin, TX, USA, 23–27 March 2020; pp. 1–6.
51. Fürst, M.; Gupta, S.T.; Schuster, R.; Wasenmüller, O.; Stricker, D. HPERL: 3D human pose estimation from RGB and LiDAR. In Proceedings of the International Conference on Pattern Recognition, Milano, Italy, 10–15 January 2021; pp. 7321–7327.
52. Ye, D.; Xie, Y.; Chen, W.; Zhou, Z.; Ge, L.; Foroosh, H. LPFormer: LiDAR pose estimation transformer with multi-task network. In Proceedings of the International Conference on Robotics and Automation, Yokohama, Japan, 18–22 May 2024; pp. 16432–16438.
53. Knap, P. Human modelling and pose estimation overview. arXiv 2024, arXiv:2406.19290.
54. Park, S.; Ji, M.; Chun, J. 2D human pose estimation based on object detection using RGB-D information. KSII Trans. Internet Inf. Syst. 2018, 12, 800–816.
55. Hirschmuller, H. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 30, 328–341.
56. Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019.
57. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701.
58. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
| Study | Year | Backbone | Approach | Description |
|---|---|---|---|---|
| Wei et al. [29] | 2016 | CPM | Top-down | Intermediate supervision |
| Cao et al. [30] | 2017 | VGG-19 | Bottom-up | Part affinity fields |
| Cheng et al. [20] | 2020 | HRNet | Bottom-up | Addresses scale-variation challenges in bottom-up pipelines |
| Yang et al. [35] | 2023 | ResNet/HRNet | Top-down | 3D mesh estimation using edge information; knowledge distillation strategy |
| Artacho et al. [37] | 2023 | HRNet | Bottom-up | Waterfall architecture; captures multi-scale features |
| Qu et al. [38] | 2023 | HRNet | Bottom-up | Distance-based loss function for joint heatmaps |
| Purkrabek et al. [36] | 2024 | ViTPose | Top-down | Integration of detection, segmentation, and pose estimation |
| Index | Joint Type | Index | Joint Type |
|---|---|---|---|
| 1 | Base of the spine | 14 | Left knee |
| 2 | Middle of the spine | 15 | Left ankle |
| 3 | Neck | 16 | Left foot |
| 4 | Head | 17 | Right hip |
| 5 | Left shoulder | 18 | Right knee |
| 6 | Left elbow | 19 | Right ankle |
| 7 | Left wrist | 20 | Right foot |
| 8 | Left hand | 21 | Spine |
| 9 | Right shoulder | 22 | Tip of the left hand |
| 10 | Right elbow | 23 | Left thumb |
| 11 | Right wrist | 24 | Tip of the right hand |
| 12 | Right hand | 25 | Right thumb |
| 13 | Left hip | | |
| Index | Joint Type | Index | Joint Type |
|---|---|---|---|
| 1 | Nose | 10 | Right knee |
| 2 | Neck | 11 | Right ankle |
| 3 | Right shoulder | 12 | Left hip |
| 4 | Right elbow | 13 | Left knee |
| 5 | Right wrist | 14 | Left ankle |
| 6 | Left shoulder | 15 | Right eye |
| 7 | Left elbow | 16 | Left eye |
| 8 | Left wrist | 17 | Right ear |
| 9 | Right hip | 18 | Left ear |
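To relate the two joint sets above, one hypothetical correspondence between the 18-keypoint output format and the 25-joint NTU RGB+D skeleton is sketched below. The dictionary name `KEYPOINT_TO_NTU` and every individual pairing (e.g., using Head as an approximation of Nose, leaving eyes and ears unmapped) are illustrative assumptions; the paper's actual ground-truth conversion (Section 3.3) may differ.

```python
# Hypothetical mapping: 18-keypoint index -> NTU RGB+D 25-joint index.
# None marks keypoints with no NTU counterpart (eyes and ears are not annotated).
KEYPOINT_TO_NTU = {
    1: 4,      # Nose           <- Head (approximation)
    2: 3,      # Neck           <- Neck
    3: 9,      # Right shoulder
    4: 10,     # Right elbow
    5: 11,     # Right wrist
    6: 5,      # Left shoulder
    7: 6,      # Left elbow
    8: 7,      # Left wrist
    9: 17,     # Right hip
    10: 18,    # Right knee
    11: 19,    # Right ankle
    12: 13,    # Left hip
    13: 14,    # Left knee
    14: 15,    # Left ankle
    15: None,  # Right eye
    16: None,  # Left eye
    17: None,  # Right ear
    18: None,  # Left ear
}
```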
| Joint Type | σᵢ |
|---|---|
| Nose | 0.026 |
| Neck | 0.026 |
| Eyes | 0.025 |
| Ears | 0.035 |
| Shoulders | 0.079 |
| Elbows | 0.072 |
| Wrists | 0.062 |
| Hips | 0.107 |
| Knees | 0.087 |
| Ankles | 0.089 |
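The σᵢ values above are the per-joint constants used by the Object Keypoint Similarity (OKS) metric. The sketch below shows a standard COCO-style OKS computation with these constants arranged in the 18-keypoint order of the earlier table; the function name, argument layout, and the kᵢ = 2σᵢ convention follow the public COCO evaluation code and are assumptions rather than an excerpt from the paper's evaluation script.

```python
import numpy as np

# Per-joint sigmas from the table above, expanded to the 18-keypoint order:
# nose, neck, right arm (shoulder/elbow/wrist), left arm, right leg (hip/knee/ankle),
# left leg, right/left eye, right/left ear.
SIGMAS = np.array([
    0.026, 0.026,
    0.079, 0.072, 0.062,
    0.079, 0.072, 0.062,
    0.107, 0.087, 0.089,
    0.107, 0.087, 0.089,
    0.025, 0.025, 0.035, 0.035,
])


def oks(pred, gt, visible, area):
    """Object Keypoint Similarity between predicted and ground-truth keypoints.

    pred, gt : (K, 2) arrays of keypoint coordinates in pixels.
    visible  : (K,) boolean array, True where the ground-truth joint is labeled.
    area     : object scale s^2 (e.g., person segment area) used to normalize distances.
    """
    k = 2.0 * SIGMAS                        # COCO convention: k_i = 2 * sigma_i
    d2 = np.sum((pred - gt) ** 2, axis=1)   # squared distance per joint
    e = d2 / (2.0 * area * k ** 2 + np.spacing(1))
    return np.sum(np.exp(-e)[visible]) / max(np.sum(visible), 1)
```

A prediction is counted as correct at a given threshold (e.g., AP50 uses OKS > 0.50) when this similarity exceeds that threshold, which is how the AP columns in the following tables are computed.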
| R | AP | AP50 | AP75 | APM | APL | AR |
|---|---|---|---|---|---|---|
| 1 | 0.49 | 0.84 | 0.49 | 0.31 | 0.51 | 0.50 |
| 2 | 0.50 | 0.85 | 0.49 | 0.32 | 0.51 | 0.51 |
| 3 | 0.51 | 0.86 | 0.51 | 0.33 | 0.52 | 0.52 |
| 4 | 0.45 | 0.79 | 0.46 | 0.18 | 0.47 | 0.43 |
| 5 | 0.24 | 0.46 | 0.23 | 0.09 | 0.26 | 0.25 |
| Feature Fusion Method | AP | AP50 | AP75 | APM | APL | AR |
|---|---|---|---|---|---|---|
| RGB image only | 0.39 | 0.83 | 0.32 | 0.22 | 0.41 | 0.39 |
| Depth image only | 0.28 | 0.69 | 0.28 | 0.11 | 0.28 | 0.29 |
| Early fusion | 0.14 | 0.40 | 0.07 | 0.06 | 0.14 | 0.13 |
| Late fusion | 0.40 | 0.84 | 0.31 | 0.23 | 0.40 | 0.40 |
| Ours (R = 3) | 0.51 | 0.86 | 0.51 | 0.33 | 0.52 | 0.52 |
| Model | #Params | GFLOPs | AP | AP50 | AP75 | APM | APL | AR |
|---|---|---|---|---|---|---|---|---|
| Subset of 8921 images from the NTU RGB+D 120 dataset | | | | | | | | |
| HigherHRNet [20] | 28.7M | 73.1 | 0.49 | 0.88 | 0.49 | 0.35 | 0.50 | 0.51 |
| OpenPose [30] | 52.3M | 411.1 | 0.39 | 0.83 | 0.32 | 0.22 | 0.41 | 0.39 |
| Ours (R = 3) | 68.9M | 593.7 | 0.51 | 0.86 | 0.51 | 0.33 | 0.52 | 0.52 |
| Model | Neck | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | mAP |
|---|---|---|---|---|---|---|---|---|
| Occlusion subset of 383 images from the NTU RGB+D 120 dataset | | | | | | | | |
| HigherHRNet [20] | - | 0.47 | 0.47 | 0.43 | 0.63 | 0.68 | 0.55 | 0.54 |
| OpenPose [30] | 0.62 | 0.49 | 0.44 | 0.33 | 0.65 | 0.69 | 0.63 | 0.54 |
| Ours (R = 3) | 0.67 | 0.78 | 0.54 | 0.38 | 0.86 | 0.74 | 0.70 | 0.67 |