Applied Sciences
  • Article
  • Open Access

7 August 2025

Robust Human Pose Estimation Method for Body-to-Body Occlusion Using RGB-D Fusion Neural Network

Department of Computer Software Engineering, Dong-eui University, Busan 47340, Republic of Korea
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Advanced Pattern Recognition & Computer Vision

Abstract

In this study, we propose a novel approach for human pose estimation (HPE) in occluded scenes by progressively fusing features extracted from RGB-D images, which comprise paired RGB and depth images. Conventional bottom-up HPE models that rely solely on RGB inputs often produce erroneous skeletons when parts of a person’s body are obscured by another individual, because they struggle to accurately infer body connectivity without 3D topological information. To address this limitation, we modify OpenPose, a traditional bottom-up HPE model, to take a depth image as an additional input, thereby providing explicit 3D spatial cues. Each input modality is processed by a dedicated feature extractor. In addition to the two existing modules in each stage, which estimate joint connectivity and joint confidence maps for the color image, we integrate a module that estimates joint confidence maps for the depth image into the initial few stages. The confidence maps derived from the depth and RGB modalities are then fused at each stage and forwarded to the next, ensuring that the 3D topological information from the depth image is effectively utilized for both joint localization and body part association. Experimental results on the NTU RGB+D 120 dataset verify that the proposed approach achieves a 13.3% improvement in average recall compared to the original OpenPose model. The proposed method can thus enhance the performance of bottom-up HPE models in occlusion scenes.

1. Introduction

Human pose estimation (HPE) detects body joints from images and estimates the human skeleton. HPE is widely used in HAR (Human Action Recognition), which recognizes body posture or movements [1,2,3], because it is not affected by lighting changes or background complexity [4]. Vision-based HAR utilizing HPE has diverse applications in the field of computer vision, including surveillance video analysis and human–computer interaction (HCI) [5,6,7,8,9]. Recently, a fine-grained pose estimation method using multi-modal information has also been actively studied for HCI [10,11].
HPE methods can be categorized into top-down and bottom-up approaches based on the sequence of body detection and joint localization [12,13,14,15]. The top-down approach first detects human regions in an image and then applies a joint detector to each region. This method is less sensitive to variations in human scale but suffers from a linear increase in computational cost as the number of people in an image increases. In contrast, the bottom-up approach directly detects body joints in an image and estimates their associations based on spatial relationships to reconstruct human poses. This approach is advantageous for real-time processing since the joint detector operates only once [16].
Although recent HPE models [17,18,19,20,21] have achieved outstanding average accuracy and inference speed, the occlusion problem, that is, a drop in accuracy when some body parts are obstructed by other parts or objects, remains unsolved. Occlusion can be classified into occlusion by objects, body-to-body occlusion, and self-occlusion (e.g., when a person crosses his or her legs). Improvements in HPE for images with occlusion by objects have been studied, whereas the other types of occlusion have not been adequately addressed [22]. These occlusion problems are mainly caused by the lack of topological information in RGB images [23,24]. In this study, we focus on improving HPE for images with body-to-body occlusion. Some efforts estimate multiple people’s poses from images captured from multiple views to reduce the influence of occlusion [25]. Such setups are largely free from occlusion and even allow 3D pose estimation, but they are limited to controlled environments where multiple cameras are precisely installed. RGB-D images, which contain both color and depth images, can capture topological relationships in three-dimensional space. Each pixel of a depth image stores the distance from the camera along the z-axis, i.e., the camera’s optical axis. By additionally utilizing these depth values, occluded body parts can be separated more effectively. Previous studies [26,27,28] have estimated object poses by fusing features from color and depth images. Nevertheless, studies on HPE have typically employed depth images only for auxiliary tasks, such as body region segmentation or post-processing corrections of detected skeletal data, rather than integrating depth information directly into the network itself. In particular, few studies have focused on leveraging depth images to address body-to-body occlusion in multi-person images. We aim to improve HPE for occluded bodies by fusing features from both color and depth images.
In this study, we propose an HPE method robust to body-to-body occlusion using a novel network architecture that integrates color and depth information. We modify OpenPose, a traditional bottom-up HPE model, to take a depth image as an additional input, thereby providing explicit 3D spatial cues, and each input modality is processed by a dedicated feature extractor. The proposed network consists of a multi-stage cascade structure and acts as a joint detection method that effectively considers both color and depth information by iteratively fusing features extracted from the two inputs. We demonstrate that 3D spatial features captured from depth images provide useful cues about occlusion, enabling accurate prediction even in body-to-body occlusion scenarios. Additionally, the proposed HPE method adopts a bottom-up approach to maintain stable, low computational complexity regardless of the number of people in the image. As a result, the proposed method mitigates the ambiguity caused by overlapping inter-personal body parts by exploiting the correlation between color and depth information and shows improved accuracy compared to previous bottom-up methods in body-to-body occlusion scenarios.
The main contributions of this study are as follows:
  • We propose a novel architecture that integrates depth image features into a bottom-up HPE model. In previous HPE models based on RGB-D images [26,27,28], depth images were used primarily to augment the base features extracted by the feature extractor. In contrast, our approach progressively fuses depth features with those from color images at each stage of the HPE process, thereby substantially improving HPE performance in occlusion scenes.
  • We specifically selected images containing occlusion scenarios and experimentally demonstrated that leveraging features extracted from depth images leads to clear improvements in pose estimation performance for these cases. Furthermore, we conducted an analysis of how RGB and depth images contribute to pose estimation at each stage of feature fusion.
The remainder of this paper is organized as follows. Section 2 reviews related works. Section 3 presents the proposed HPE method using RGB-D images. Section 4 describes the experimental results for the proposed method. Finally, Section 5 provides a conclusion.

3. Human Pose Estimation Method by Progressive Feature Fusion

Figure 1 presents a flowchart of the proposed HPE method. The core process involves progressively fusing features from color and depth images to enhance joint estimation. This approach enhances joint detection by incorporating edge, feature, and texture information from color images together with three-dimensional characteristics from depth images. The proposed method utilizes RGB and depth images as input data. Feature maps for each modality are extracted through separate feature extractors. Subsequently, the network progressively refines the representations of joints by combining features from both depth and color images. The enhanced features from both modalities are fused and passed as inputs to the next stage. This iterative process continues for several early stages, leading to a gradual strengthening of the joint representations.
Figure 1. Flowchart of the proposed method.

3.1. Network Architecture

The proposed method effectively considers the topological relationships between detected candidate joints using three-dimensional information. This allows for accurate joint detection even in occluded scenes. In this work, we adopt OpenPose [30] as our HPE baseline model due to its widespread use and proven effectiveness.
Figure 2 illustrates the structure of the proposed network. The network consists of three parallel branches, each containing independent inference layers. The first branch extracts part affinity fields (PAFs), while the second and third branches generate confidence maps for body joints from RGB and depth images, respectively. The network employs a multi-stage cascaded architecture, where the outputs of each branch are fused and used as input for the subsequent stage. This iterative process progressively expands the receptive field and refines feature representations, ultimately improving the final output.
Figure 2. Structure of the proposed network.
The proposed network is based on the OpenPose model and adopts the first ten layers of VGG-19 [40] as the backbone to extract feature maps Frgb and Fd from RGB and depth images. In the first stage, three independent inference layers, ϕ1, ρ1, and γ1, predict pose-related information. The inference layers ϕ1 and ρ1 take Frgb as input, while γ1 processes Fd. The inference layer ϕ1 extracts a two-dimensional vector field L1, encoding the directional information of joint connections at limb locations. The inference layer ρ1 generates a confidence map S1, predicting body joint locations from the color feature maps. The inference layer γ1 produces a confidence map D1, estimating body joint positions using depth feature maps. All three inference layers, ϕ1, ρ1, and γ1, have an identical structure consisting of three convolutional layers with a 3 × 3 kernel and two convolutional layers with a 1 × 1 kernel. The detailed structure of these layers is shown in Figure 3. After extracting L1, S1, and D1 from each branch in Stage 1, the confidence maps S1 and D1 are fused. A convolutional layer with a 1 × 1 kernel is then applied to reduce the number of channels by half, resulting in the fused confidence map M1, which integrates both color and depth information. Similarly, the feature maps Frgb and Fd are combined, and a 1 × 1 convolutional layer is used to reduce the number of channels by half, producing Frgbd, which encapsulates both color and depth features. The fused outputs M1, L1, and Frgbd are then combined and used as input for each branch in Stage 2. For Stage n (n ≥ 2), the inference layers in each branch consist of five sequential convolutional layers with a 7 × 7 kernel, followed by two convolutional layers with a 1 × 1 kernel. Figure 4 illustrates the structure of the inference layers ϕn, ρn, and γn for each branch in Stage n.
Figure 3. Structure of inference layer in Stage 1.
Figure 4. Structure of inference layer in Stage n (n ≥ 2).
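As an illustration of this stage structure, the following PyTorch-style sketch shows how the three Stage-1 branches and the two 1 × 1 fusion convolutions described above could be wired together. It is our reconstruction, not the authors’ released code; the channel counts, joint/limb counts, and module names (e.g., `StageOne`, `stage1_branch`) are illustrative assumptions.

```python
import torch
import torch.nn as nn

def stage1_branch(in_ch, out_ch, mid_ch=128):
    # Stage-1 inference layer: three 3x3 convolutions followed by two 1x1 convolutions.
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, 512, 1), nn.ReLU(inplace=True),
        nn.Conv2d(512, out_ch, 1),
    )

class StageOne(nn.Module):
    """Stage 1 of the RGB-D network: phi_1 (PAFs), rho_1 (RGB confidence maps),
    gamma_1 (depth confidence maps), plus the two 1x1 fusion convolutions."""
    def __init__(self, feat_ch=128, n_limbs=19, n_joints=19):
        super().__init__()
        self.phi = stage1_branch(feat_ch, 2 * n_limbs)    # L1: 2D vector field per limb
        self.rho = stage1_branch(feat_ch, n_joints)       # S1: joint confidence maps (RGB)
        self.gamma = stage1_branch(feat_ch, n_joints)     # D1: joint confidence maps (depth)
        # 1x1 convolutions that halve the channel count of the concatenated maps.
        self.fuse_conf = nn.Conv2d(2 * n_joints, n_joints, 1)  # concat(S1, D1) -> M1
        self.fuse_feat = nn.Conv2d(2 * feat_ch, feat_ch, 1)    # concat(F_rgb, F_d) -> F_rgbd

    def forward(self, f_rgb, f_d):
        L1 = self.phi(f_rgb)
        S1 = self.rho(f_rgb)
        D1 = self.gamma(f_d)
        M1 = self.fuse_conf(torch.cat([S1, D1], dim=1))
        F_rgbd = self.fuse_feat(torch.cat([f_rgb, f_d], dim=1))
        # The concatenation of M1, L1, and F_rgbd becomes the input to Stage 2.
        stage2_input = torch.cat([M1, L1, F_rgbd], dim=1)
        return L1, S1, D1, M1, F_rgbd, stage2_input
```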
The proposed RGB-D network fuses color and depth information only up to Stage R. In these stages, the fused confidence map Mr is obtained by combining Sr and Dr, while Lr and Frgbd are also incorporated as inputs for the next stage. After Stage R, the network no longer includes Branch 3, which processes depth information. Instead, the subsequent stages use only Lt, St, and Frgb as inputs. In the final Stage T, the network outputs LT and ST, marking the completion of the proposed RGB-D network’s operation. Based on empirical findings from the OpenPose model, T is set to 6, while R is optimized to 3, ensuring the highest performance when color and depth information is fused up to Stage 3. The final outputs L and S from the proposed RGB-D network are used in the PAF grouping strategy of OpenPose to construct the skeletal structure for each individual.
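The overall cascade can then be sketched as a loop over stages, again as an assumption-laden illustration: depth confidence maps are produced and fused only for stages r ≤ R = 3, and from Stage R + 1 to T = 6 only the PAF and RGB confidence branches remain. `stage1`, `stages_rgbd`, and `stages_rgb` are hypothetical modules following the interface of the previous sketch.

```python
import torch

def forward_cascade(f_rgb, f_d, stage1, stages_rgbd, stages_rgb):
    """Progressive-fusion cascade (illustrative).

    stage1      : Stage-1 module (3x3 inference layers, RGB-D fusion)
    stages_rgbd : modules for Stages 2..R (7x7 inference layers, RGB-D fusion)
    stages_rgb  : modules for Stages R+1..T (7x7 inference layers, RGB only)
    Returns the final PAFs L_T and confidence maps S_T.
    """
    L, S, D, M, F_rgbd, x = stage1(f_rgb, f_d)
    # Stages 2..R: Branch 3 is still present; fused maps feed the next stage.
    for stage in stages_rgbd:
        L, S, D, M, F_rgbd, x = stage(x)
    # Stages R+1..T: Branch 3 is dropped; only L, S, and the RGB features are reused.
    x = torch.cat([S, L, f_rgb], dim=1)
    for stage in stages_rgb:
        L, S = stage(x)
        x = torch.cat([S, L, f_rgb], dim=1)
    return L, S
```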
Figure 5 compares the confidence maps of the right shoulder at Stages 3 and 4 between the standard OpenPose and our proposed modification. The confidence maps extracted from Stage 3 and Stage 4 in both models are analyzed to determine the impact of incorporating depth information on joint detection accuracy. The traditional model struggles to distinguish the shoulder positions of two closely positioned individuals when significant occlusion occurs. In Stage 3, the output confidence map S3 exhibits a merged peak, making it difficult to differentiate between the right shoulders of the two individuals. This problem worsens in Stage 4, where the peaks remain indistinguishable, making it even more challenging to separate the shoulder locations compared to Stage 3. In contrast, the proposed RGB-D network effectively leverages depth information from Stage 1, allowing it to incorporate both topological information and color features throughout the process. As a result, in Stage 3, the confidence map S3 successfully separates the peaks for the leftmost person’s shoulder and the more distant individual’s shoulder, making them easily distinguishable. Furthermore, the output confidence map D3 from the third branch is combined with S3, refining the Stage 4 output S4. This results in a more distinct separation of peaks, ultimately improving the accuracy of the right shoulder localization for each individual compared to previous stages.
Figure 5. Comparison of confidence maps of right shoulder at Stages 3 and 4: (a) OpenPose; (b) the proposed model.

3.2. Loss Function

In the proposed RGB-D network, designed for robust pose estimation under occlusions, a loss function is defined for each branch at every stage during training. At Stage r, Branch 1 utilizes the loss function $f_L^r$, Branch 2 employs $f_S^r$, and Branch 3 applies $f_D^r$. The loss functions $f_L^r$, $f_S^r$, and $f_D^r$ are formulated using the L2 norm as follows:
$$f_L^r = \sum_{c=1}^{C} \sum_{p} W(p) \cdot \left\| L_c^r(p) - L_c^*(p) \right\|_2^2,$$
$$f_S^r = \sum_{j=1}^{J} \sum_{p} W(p) \cdot \left\| S_j^r(p) - Q_j^*(p) \right\|_2^2,$$
$$f_D^r = \sum_{j=1}^{J} \sum_{p} W(p) \cdot \left\| D_j^r(p) - Q_j^*(p) \right\|_2^2,$$
where $W(p)$ is a binary mask for a pixel $p \in \mathbb{R}^2$, which prevents loss contributions from false-positive predictions at joint locations that are not annotated in the ground truth (GT); $L_c^r$ is the vector field predicted for limb $c$ in Branch 1 at Stage $r$; $S_j^r$ and $D_j^r$ are the confidence maps predicted for joint $j$ in Branch 2 and Branch 3 at Stage $r$, respectively; and $L_c^*(p)$ and $Q_j^*(p)$ are the ground truths of the vector field and the confidence map, respectively.
The total loss function $f$ for training the proposed RGB-D network is formulated as follows:
$$f = \sum_{r=1}^{R} \left( f_L^r + f_S^r + f_D^r \right) + \sum_{r=R+1}^{T} \left( f_L^r + f_S^r \right).$$
The total loss function $f$ sums the vector field loss ($f_L$) and the confidence map losses ($f_S$ and $f_D$) over all stages. Since Branch 3 is not present after Stage R, the $f_D$ term is excluded from the loss from that point onward.
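For concreteness, a minimal sketch of this loss, assuming tensors in (batch, channel, height, width) layout and a binary mask `W_mask` that broadcasts over channels (the function and variable names are ours, not from the paper):

```python
import torch

def masked_l2(pred, gt, W_mask):
    # W_mask is a binary map that zeroes out unannotated joint locations,
    # so false positives there do not contribute to the loss.
    return (W_mask * (pred - gt) ** 2).sum()

def total_loss(L_preds, S_preds, D_preds, L_gt, Q_gt, W_mask, R):
    """L_preds, S_preds: lists of length T; D_preds: list of length R."""
    f = 0.0
    for r, (L_r, S_r) in enumerate(zip(L_preds, S_preds), start=1):
        f = f + masked_l2(L_r, L_gt, W_mask) + masked_l2(S_r, Q_gt, W_mask)
        if r <= R:  # Branch 3 (depth confidence maps) exists only up to Stage R
            f = f + masked_l2(D_preds[r - 1], Q_gt, W_mask)
    return f
```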

3.3. Ground Truth for Heatmap Representation

In the proposed network, the GT for network training consists of confidence maps $Q^*$ that represent joint locations as heatmaps and a set of 2D vector fields $L^*$ that encode the directional information between connected joints. The generation of GT from the annotated body joint coordinates follows the procedures described in OpenPose. When training body joint localization, the confidence maps have a width $w$ and height $h$, and the ground truth $Q^* \in \mathbb{R}^{J \times w \times h}$ is defined as a set of heatmaps $Q_j^* \in \mathbb{R}^{w \times h}$ for each joint $j$. To generate $Q_j^*$, the per-person function $Q_{j,k}^* \in \mathbb{R}^{w \times h}$ must first be computed. The function $Q_{j,k}^*$ represents a heatmap modeled as a Gaussian with variance $\sigma^2$ of the distance between a pixel $p$ and the annotated joint coordinate $x_{j,k}$ of person $k$ in the dataset. The function $Q_{j,k}^*$ is computed as follows:
$$Q_{j,k}^*(p) = \exp\left( -\frac{\left\| p - x_{j,k} \right\|_2^2}{2\sigma^2} \right),$$
where σ controls the spread of the peak in the confidence map represented as a heatmap. In the proposed method, σ is set to 3.
The GT confidence map $Q_j^*$ is generated by applying a maximum operation over all $Q_{j,k}^*$ at each pixel $p$ as follows:
$$Q_j^*(p) = \max_k Q_{j,k}^*(p).$$
Finally, the confidence maps corresponding to the joints are assigned the GT values.
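A NumPy sketch of this ground-truth generation, with σ = 3 as stated above; the array shapes and function name are our own choices:

```python
import numpy as np

def joint_confidence_maps(joint_xy, w, h, sigma=3.0):
    """joint_xy: array of shape (K, J, 2) with the (x, y) coordinate of joint j
    for person k. Returns Q* of shape (J, h, w), the per-joint max over people."""
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))          # pixel grid p
    K, J, _ = joint_xy.shape
    Q = np.zeros((J, h, w), dtype=np.float32)
    for j in range(J):
        for k in range(K):
            x, y = joint_xy[k, j]
            g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
            Q[j] = np.maximum(Q[j], g)                        # max over people k
    return Q
```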
The generation of the GT $L^*$ requires one to obtain $L_{c,k}^* \in \mathbb{R}^{w \times h \times 2}$, which represents the connection direction of limb type $c$ at the pixels where that limb of person $k$ is present. The vector field $L_{c,k}^*$ is computed as follows:
$$L_{c,k}^*(p) = \begin{cases} v & \text{if } p \text{ lies on limb } c \text{ of person } k, \\ 0 & \text{otherwise,} \end{cases}$$
where v represents the direction of the limb. If pixel p corresponds to the location of the c-th limb of the k-th person, v is assigned as a unit vector indicating the limb’s direction. Otherwise, for non-limb pixels, v is set to zero. The vector v is computed as follows:
$$v = \frac{x_{j_2,k} - x_{j_1,k}}{\left\| x_{j_2,k} - x_{j_1,k} \right\|_2},$$
where $x_{j_1,k}, x_{j_2,k} \in \mathbb{R}^2$ represent the positions of the two joints forming the $c$-th limb of the $k$-th individual. This yields a unit vector with a magnitude of 1 pointing from $x_{j_1,k}$ to $x_{j_2,k}$.
To assign v at limb locations, it is necessary to define the pixel region where the limb exists. Since limbs have thickness, the reference region should include both the line segment connecting the two joints and the area perpendicular to this segment. The threshold for defining the region considering limb thickness is determined based on x j 1 , k and x j 2 , k and is computed as follows:
$$0 \le v \cdot \left( p - x_{j_1,k} \right) \le l_{c,k} \quad \text{and} \quad \left| v_{\perp} \cdot \left( p - x_{j_2,k} \right) \right| \le \sigma_l,$$
where $v_{\perp}$ is a unit vector perpendicular to the direction of $v$, $\sigma_l$ represents the limb width as a distance in pixels, and $l_{c,k}$ denotes the length of the $c$-th limb of the $k$-th person. The limb length $l_{c,k}$ is computed as follows:
$$l_{c,k} = \left\| x_{j_2,k} - x_{j_1,k} \right\|_2.$$
$L^* \in \mathbb{R}^{C \times w \times h \times 2}$ is composed of the set of $L_c^* \in \mathbb{R}^{w \times h \times 2}$ for all limb types $c$ across all individuals. $L_c^*$ is computed as follows:
$$L_c^*(p) = \frac{1}{n_c(p)} \sum_k L_{c,k}^*(p),$$
where $n_c(p)$ denotes the number of individuals for which the $v$ corresponding to the $c$-th limb is nonzero at $p$. A nonzero $v$ in $L_{c,k}^*$ indicates that the corresponding pixel lies on the $c$-th limb of the $k$-th person. Therefore, $n_c(p)$ is the number of individuals whose $c$-th limbs overlap at $p$. The final aggregated direction map $L_c^*$ is calculated by summing all overlapping $L_{c,k}^*$ and dividing by $n_c(p)$, thereby averaging the directional vectors for limb connections.
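The PAF ground truth for a single limb type can be sketched in the same spirit, combining the unit direction, the thickness test, and the averaging over overlapping people (a hedged illustration; `sigma_l` and the array layout are assumptions):

```python
import numpy as np

def limb_vector_field(limb_pts, w, h, sigma_l=4.0):
    """limb_pts: array of shape (K, 2, 2) holding, for each person k, the two joint
    positions (x_{j1,k}, x_{j2,k}) of this limb type. Returns L*_c of shape (h, w, 2)."""
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    acc = np.zeros((h, w, 2), dtype=np.float32)
    count = np.zeros((h, w), dtype=np.float32)
    for x1, x2 in limb_pts:
        d = x2 - x1
        length = np.linalg.norm(d)
        if length == 0:
            continue
        v = d / length                                    # unit direction of the limb
        v_perp = np.array([-v[1], v[0]])                  # unit vector perpendicular to v
        px, py = xs - x1[0], ys - x1[1]
        along = v[0] * px + v[1] * py                     # projection on the limb axis
        across = np.abs(v_perp[0] * px + v_perp[1] * py)  # distance from the axis
        on_limb = (along >= 0) & (along <= length) & (across <= sigma_l)
        acc[on_limb] += v
        count[on_limb] += 1.0
    nz = count > 0
    acc[nz] /= count[nz][:, None]                         # average overlapping directions
    return acc
```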

3.4. Generation of Body-to-Body Occlusion Samples

To evaluate the pose estimation performance of the proposed network, we generate a body-to-body occlusion test set by selecting images with inter-body occlusion from the existing multi-person pose estimation test set. RGB-D images in the dataset are identified as body-to-body occlusion if two or more people exist and their bounding boxes are overlapped. This occlusion test set enables the evaluation of pose estimation performance under controlled body-to-body occlusion scenes.
The presence of body-to-body occlusion in a sample is determined from the intersection areas $\beta_z$ of bounding box pairs $z$: occlusion is identified if $\sum_z \beta_z > 0$. The intersection area is calculated as $\beta_z = w_{inter} \cdot h_{inter}$, where $w_{inter}$ and $h_{inter}$ represent the width and height of the intersection area, respectively. These values are derived from the bounding box coordinates of individuals $i$ and $j$, where each bounding box is defined by its top-left coordinates $(x_{min}, y_{min})$ and bottom-right coordinates $(x_{max}, y_{max})$. Figure 6 provides an example of bounding boxes, intersection areas, and the corresponding $w_{inter}$ and $h_{inter}$ values for multi-person samples.
Figure 6. Body-to-body occlusion identification through bounding box intersection.
The values $w_{inter}$ and $h_{inter}$, which are used to compute the intersection area, may be negative when the bounding boxes do not overlap. To prevent negative values from contributing to the occlusion determination, $w_{inter}$ and $h_{inter}$ are taken only when positive; otherwise, they are assigned a value of zero to exclude them from the calculation. Therefore, $w_{inter}$ and $h_{inter}$ are computed as follows:
$$w_{inter} = \begin{cases} w_{i,j} & (w_{i,j} > 0) \\ 0 & (w_{i,j} \le 0) \end{cases}, \qquad h_{inter} = \begin{cases} h_{i,j} & (h_{i,j} > 0) \\ 0 & (h_{i,j} \le 0) \end{cases},$$
where $w_{i,j}$ and $h_{i,j}$ represent the width and height of the intersection area between the bounding boxes of individuals $i$ and $j$, respectively. They are computed as follows:
$$w_{i,j} = \min\left( x_{max,i}, x_{max,j} \right) - \max\left( x_{min,i}, x_{min,j} \right), \qquad h_{i,j} = \min\left( y_{max,i}, y_{max,j} \right) - \max\left( y_{min,i}, y_{min,j} \right).$$
If there is no overlap between the two bounding boxes, $w_{i,j}$ or $h_{i,j}$ is negative; in that case, the piecewise definition above assigns zero so that the pair is excluded from the area calculation. In some cases, bounding boxes overlap even though actual occlusion does not occur. To address this, we manually verified these candidates and selected the occluded samples accordingly.
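The selection rule above reduces to a simple pairwise overlap test; the sketch below is our own helper, assuming each bounding box is given as an (xmin, ymin, xmax, ymax) tuple, and it only flags candidates that are then verified manually:

```python
from itertools import combinations

def intersection_area(box_i, box_j):
    """Each box is (xmin, ymin, xmax, ymax). Negative widths or heights mean
    no overlap and are clamped to zero, matching the piecewise definition above."""
    w_ij = min(box_i[2], box_j[2]) - max(box_i[0], box_j[0])
    h_ij = min(box_i[3], box_j[3]) - max(box_i[1], box_j[1])
    return max(w_ij, 0.0) * max(h_ij, 0.0)

def is_occlusion_candidate(boxes):
    """True if at least two people are present and some pair of boxes overlaps.
    Overlap does not guarantee occlusion, so candidates are checked manually."""
    return len(boxes) >= 2 and any(
        intersection_area(a, b) > 0 for a, b in combinations(boxes, 2)
    )
```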

4. Experimental Results

We evaluate the pose estimation performance of the proposed network. The experiments are conducted in an environment running Ubuntu 20.04, equipped with an NVIDIA GeForce RTX 4070 GPU, 16 GB of memory, and 1 TB of storage.
A self-comparison is performed by adjusting the fusion stage parameter R in the proposed RGB-D network using a test set consisting of approximately 9000 randomly selected samples from the NTU RGB+D 120 dataset [56,57]. The proposed method is compared with existing high-performance bottom-up pose estimation models. The proposed RGB-D network is compared with OpenPose, which uses only RGB images as input. Since the proposed network enhances OpenPose by incorporating depth images and fusing topological information into the confidence maps and PAFs, the stage-wise confidence map prediction performance is analyzed to assess the effectiveness of fusing color and depth information. In order to evaluate performance under inter-person occlusion, a dedicated occlusion test set is constructed by selecting 383 occluded samples from the dataset.

4.1. Dataset

The NTU RGB+D 120 dataset was adopted to evaluate the performance of the proposed method. Although the NTU RGB+D 120 dataset was primarily developed for benchmarking action recognition tasks, it contains a sufficient number of scenes with body-to-body occlusions with multiple individuals and provides body skeleton data. Therefore, the NTU RGB+D 120 dataset is also suitable for evaluating pose estimation based on RGB-D images. The NTU RGB+D 120 dataset contains RGB-D images, that is, pairs of RGB and depth images, along with corresponding skeleton annotations. This dataset contains approximately 110,000 RGB-D images, along with joint position annotations and 120 action class labels. The RGB-D images were captured using Microsoft Kinect v2. The resolutions of RGB and depth images are 1920 × 1080 and 512 × 424, respectively. The skeleton annotations include the index corresponding to each type of joint, as defined in Table 2, and the coordinates of each joint in both the RGB and depth images. If a particular joint is not observed due to occlusion or other reasons, the corresponding joint coordinates are filled with zeros. However, these joints that are not visible and, thus, have coordinates filled with zeros are excluded from the evaluation. Although the dataset does not directly provide information such as camera parameters that can be used to align the two images, it is possible to compute the alignment matrix using the pairs of joint coordinates provided in both images. We computed a transformation matrix for each scene based on this alignment method to align the RGB image with the depth image. The weights of the proposed network are initialized using the pretrained weights of the OpenPose model [58]. The OpenPose model was trained using the joint index numbering and types defined by the MS COCO format, as shown in Table 3, which differ from the index numbering and joint types defined in NTU RGB+D 120. To address this issue, we evaluated only the joints that are commonly defined in both formats, which correspond to joints 1 to 14 in the MS COCO format.
Table 2. Indices of joint types in NTU RGB+D 120 dataset.
Table 3. Indices of joint types in MS COCO format.
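Because camera parameters are not provided, the RGB-to-depth alignment mentioned above has to be estimated per scene from the paired joint annotations. A least-squares affine fit such as the following sketch is one way to do this; it is our illustration of the idea rather than the authors’ exact procedure:

```python
import numpy as np

def fit_affine(rgb_xy, depth_xy):
    """Estimate a 2x3 affine matrix A mapping RGB joint coordinates to depth-image
    coordinates from N paired annotations (N >= 3), by least squares.
    rgb_xy, depth_xy: arrays of shape (N, 2)."""
    N = rgb_xy.shape[0]
    X = np.hstack([rgb_xy, np.ones((N, 1))])          # homogeneous RGB coordinates (N, 3)
    A, *_ = np.linalg.lstsq(X, depth_xy, rcond=None)  # solves X @ A ~= depth_xy
    return A.T                                        # 2x3 affine transform

def apply_affine(A, xy):
    """Map (N, 2) points with the 2x3 affine matrix A."""
    return np.hstack([xy, np.ones((xy.shape[0], 1))]) @ A.T
```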

4.2. Performance Metrics in Human Pose Estimation

Object Keypoint Similarity (OKS) [58] is adopted to evaluate the pose estimations for scenes with multiple people. OKS is a concept similar to IoU in object detection and is used to compute Average Precision (AP) and Average Recall (AR). OKS generalizes body joint scale variations across different human sizes as follows:
$$OKS = \frac{\sum_{i \in [0, N-1]} \exp\left( -\frac{d_i^2}{2 s^2 k_i^2} \right) \delta(\upsilon_i)}{\sum_{i \in [0, N-1]} \delta(\upsilon_i)},$$
where $i$ represents the index of each joint; $N$ denotes the total number of joints; $d_i$ is the Euclidean distance between the ground truth and predicted locations for joint $i$; $s$ is the square root of the object area; $k_i$ is a constant assigned based on the standard deviation of joint localization for joint $i$; and $\delta(\upsilon_i)$ is an indicator function based on the visibility flag $\upsilon_i$ in the ground truth annotation, which is 1 if $\upsilon_i > 0$ and 0 otherwise. The COCO keypoint benchmark, which defines 17 joint types, provides joint-wise standard deviations computed across the entire dataset, and these values follow the relationship $k_i = 2\sigma_i$ for joint $i$. The standard deviations for the joint types are presented in Table 4.
Table 4. Standard deviations for joint types.
The standard deviation $\sigma_i$ tends to be larger for the shoulders, knees, and hips than for facial joints such as the nose, eyes, and ears. Since $s$ in OKS is derived from the object area, segmentation data are required. However, the NTU RGB+D dataset does not provide segmentation information for objects. The COCO benchmark uses an empirical approximation when segmentation data are unavailable, in which the object area $s^2$ is obtained by multiplying the bounding box area by 0.53. Therefore, $s$ is determined using this empirical approximation.
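Putting the formula and the area approximation together, a compact NumPy sketch of the OKS computation used here might look as follows (per-joint constants k_i would come from Table 4; names and shapes are our assumptions):

```python
import numpy as np

def oks(pred_xy, gt_xy, visibility, k, bbox):
    """pred_xy, gt_xy: (N, 2) joint coordinates; visibility: (N,) flags (v > 0 counts);
    k: (N,) per-joint constants (k_i = 2 * sigma_i); bbox: (xmin, ymin, xmax, ymax)."""
    # Object area approximated as 0.53 * bounding-box area (segmentation unavailable).
    s2 = 0.53 * (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
    d2 = np.sum((pred_xy - gt_xy) ** 2, axis=1)
    vis = visibility > 0
    e = np.exp(-d2 / (2.0 * s2 * k ** 2))
    return float(e[vis].sum() / max(vis.sum(), 1))
```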
To evaluate joint detection performance, a prediction is considered a true positive (TP) for a given threshold (set within the range of 0 to 1) if its OKS score exceeds the threshold. In cases where multiple predictions are associated with a single GT, only the prediction with the highest OKS score is regarded as a TP, while the others are counted as false positives (FPs). Furthermore, if there is no prediction corresponding to a particular GT, it is counted as a false negative (FN). The precision (P) and recall (R) are calculated as follows:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}.$$
The OKS scores were assessed using thresholds of 0.5 (P50) and 0.75 (P75), respectively. The overall AP is computed by averaging multiple P values over a range of OKS thresholds (0.50:0.05:0.95). To evaluate the impact of object scale variations, performance is also measured for medium-scale objects (APM) with $32^2 < s^2 < 96^2$ and large-scale objects (APL) with $s^2 > 96^2$.

4.3. Performance Evaluation of Human Pose Estimation

To determine the optimal network configuration, self-comparison experiments are conducted by varying the R parameter, which specifies how many early stages fuse features from the depth image. The experiments use a test set of 8921 randomly selected samples from the NTU RGB+D 120 dataset. The experimental results are presented in Table 5. The self-comparison experiments indicate that the RGB-D network with R = 3 achieves the highest performance across all evaluation metrics, including AP at OKS thresholds of 50 and 75, as well as AP and AR across different person sizes. As R increases from 1 to 3, the accuracy continuously improves. This indicates that, in the initial stages for extracting local features, the additional use of depth information enables more effective separation of overlapping joints and facilitates the extraction of more precise features. However, when R is 4 or above, a marked decrease in accuracy is observed. This suggests that, during the later stages for global feature extraction, fusing depth features can adversely affect the representational capabilities of features pertaining to both individual joints and their connectivity. In summary, the 3D topological cues derived from depth images primarily facilitate the representation of local structural features. In contrast, the 2D appearance information contained in color images is generally more influential in modeling global contextual characteristics.
Table 5. Self-comparison results according to the fusion stage adjustment parameter R.
We compared the modality fusion in our proposed method with the commonly used fusion strategies of early fusion and late fusion, as shown in Table 6. In early fusion and late fusion, the inputs from RGB and depth images are merged either before or after the initial feature extraction stage, respectively. Early fusion showed the lowest AP and AR scores, likely due to the fundamental differences in the data domains of depth and color images. While depth images primarily provide structural information such as distance and shape, color images offer visual details such as color and texture. When these heterogeneous features are simply combined in the early stages, the network may fail to adequately learn the distinct characteristics of each modality, resulting in degraded performance. On the other hand, late fusion yielded slightly better performance compared to using only RGB images but still demonstrated lower overall accuracy than the proposed method. This suggests that reinforcing feature representations solely through depth images is insufficient for significant improvements. In contrast, our method substantially enhances the feature representations of the joints by progressively fusing features from both modalities at each stage.
Table 6. Performance comparison of different feature fusion methods.
Table 7 presents the experimental results obtained from the test set. These results show that the pose estimation network with the fusion stage parameter R = 3 achieves an AP improvement of 11.7 and an AR improvement of 13.3 compared to the original OpenPose model. Additionally, as the OKS threshold increases, OpenPose exhibits a significant decline in accuracy and recall, whereas the proposed method shows a relatively lower rate of performance degradation. Despite enhancing pose estimation accuracy, the proposed method increases the computational cost in giga floating-point operations (GFLOPs) by only approximately 44% compared to OpenPose, which can process approximately 200 images per second in real time. Compared to HigherHRNet, which currently achieves SOTA performance in multi-person pose estimation and adopts an aggressive detection approach, the proposed network improves AP by 1.6 and AR by 1.6.
Table 7. Result on subset test set of NTU RGB+D 120 dataset.

4.4. Performance Evaluation for Body-to-Body Occlusion Subset

To evaluate the improvement in pose estimation achieved by the proposed method in body-to-body occlusion scenes, we selected a total of 383 samples from the RGB-D images in the NTU RGB+D 120 dataset. For performance comparison and analysis, joint-wise AP is measured in the occlusion experiments. Although the nose joint of the COCO skeleton format is annotated in the NTU RGB+D 120 dataset, its annotated location varies significantly when side profiles of individuals are captured. Therefore, the AP measurement for the nose joint is excluded from this experiment to ensure a fair evaluation.
The results in Table 8 demonstrate that the proposed method improves detection accuracy by achieving higher AP scores for most joints compared to existing SOTA multi-person pose estimation benchmarks. Additionally, in terms of mean Average Precision (mAP)—computed as the average AP across all joints—the proposed method outperforms OpenPose and HigherHRNet by 22.2 percentage points and 23.4 percentage points, respectively. However, for wrist joints, which are particularly challenging to detect due to their small region size, the RGB-D network, which integrates depth information, achieves better performance than OpenPose but lower AP scores than HigherHRNet, which benefits from high-resolution feature maps that enhance precise joint localization. These results suggest that incorporating depth information into the pose estimation network does not necessarily provide fine-grained feature details for joint localization. However, in inter-person occlusion scenarios, the proposed method effectively leverages depth variations across multiple stages to reinforce the heatmap representation of occluded joints, enabling more accurate joint predictions from the camera’s perspective.
Table 8. Results on evaluating the accuracy of pose estimation in the occlusion test set.

4.5. Qualitative Comparison of Human Pose Estimation

Figure 7 presents a qualitative evaluation of pose estimation, comparing skeleton extraction results obtained using ground truth, HigherHRNet with HRNet-W32 backbone, and the proposed method. In the skeleton extraction analysis of the proposed method, GT annotations for the nose position were excluded from evaluation, as previously explained in the occlusion experiments. This exclusion was necessary because the GT nose annotations were centered on the head rather than precisely located at the nose. Unlike HigherHRNet, which incorrectly produced extra skeletons, the proposed method successfully extracted exactly two skeletons when two individuals were present in the image. However, self-occlusion led to the left arm of the left person being undetected, as most of its joints were not visible. For the visible joints, detection accuracy was high, and no joint-switching problem occurred.
Figure 7. Comparison of pose estimation results of HigherHRNet and proposed method: (a) ground truth; (b) HigherHRNet [18]; (c) ours (R = 3).
Figure 8 visualizes the confidence maps for joints in occluded scenes. In Figure 8, the red and green circles mark the same joint type on different individuals. The traditional OpenPose often fails to accurately localize body parts within occluded areas. In contrast, the proposed method detects the corresponding joints more precisely by leveraging the additional 3D topological information from depth images.
Figure 8. Visualization of confidence maps for joints in occluded scenes.

5. Discussion and Future Research

In this study, we modify the structure of OpenPose, which is a widely used bottom-up HPE model, to additionally extract features from depth images and progressively fuse them with the features from color images. The additional use of depth images significantly reduces instances where joints of different individuals are mistakenly assigned to the same skeleton in occluded scenes. Furthermore, compared to conventional early fusion or late fusion approaches, in which features are integrated either at the feature extraction stage or at a later stage, the proposed progressive fusion further improves the accuracy of individual skeleton detection. Within the proposed HPE model, global features are derived from color images, while depth images serve as a valuable source of local 3D topological information. Despite the exclusion of occluded joints from the OKS calculation, we observed that the influence of other joints on the remaining joints is reduced. This indicates that the proposed method can maintain pose estimation consistency by suppressing body-to-body interference in occluded scenes. While this study presents performance results on a collective set of occlusion samples, future work examining the impact of different occlusion severities may provide further insight into the robustness of the proposed model.
Although NTU RGB+D 120 provides enough body-to-body occlusion scenes, it was optimized for action recognition. Consequently, the skeleton annotations may have lower precision. This can limit the accurate evaluation of the proposed method.
The proposed feature fusion method was applied to OpenPose, but it can also be extended to the latest SOTA bottom-up HPE models. Furthermore, the HPE performance can be improved by employing advanced feature extractors such as HRNet or CSPNeXt instead of VGG. While the proposed method uses 1 × 1 convolution operations for feature fusion, introducing more sophisticated fusion techniques such as attention-based fusion or cross-modal gating could further enhance the representation capacity.

6. Conclusions

We proposed a method for improving human pose estimation in scenes with occluded bodies by using RGB-D images. The proposed method iteratively fused body joint information extracted from both color and depth images, thereby enhancing the final joint representation. To evaluate performance in occlusion-prone environments, we also introduced a procedure for extracting occluded samples from conventional pose estimation test datasets, and the performance of the proposed method was quantitatively assessed using the constructed occlusion test set. The experimental results demonstrated that the proposed approach improved pose estimation accuracy in occluded environments by using both color information and topological cues to detect human joints effectively. Specifically, the proposed method improved most per-joint AP values in the occlusion test set, and the mAP increased by 22.2 and 23.4 percentage points compared to OpenPose and HigherHRNet, respectively. These results demonstrate the effectiveness of the proposed method in pose estimation for scenes with body-to-body occlusions. The proposed pose estimation method is expected to be particularly beneficial in real-time applications where occlusions frequently occur, such as surveillance video analysis, action recognition, and autonomous driving. Moreover, beyond human pose estimation, the method can be extended to structurally defined objects, offering a highly efficient and generalizable approach to object detection in occlusion-prone environments.

Author Contributions

Conceptualization, J.-h.Y. and S.-k.K.; software, J.-h.Y.; writing—original draft preparation, J.-h.Y. and S.-k.K.; supervision, S.-k.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partly supported by the Innovative Human Resource Development for Local Intellectualization program through the Institute of Information & Communications Technology Planning & Evaluation (IITP), grant funded by the Korean government (MSIT) (IITP-2025-RS-2020-II201791, 100%), by the BB21plus funded by Busan Metropolitan City and Busan Techno Park, and by a Dong-eui University Grant (202501170001).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yue, R.; Tian, Z.; Du, S. Action recognition based on RGB and skeleton data sets: A survey. Neurocomputing 2022, 512, 287–306. [Google Scholar] [CrossRef]
  2. Bux, A.; Angelov, P.; Habib, Z. Vision based human activity recognition: A review. In Proceedings of the UK Workshop on Computational Intelligence, Lancaster, UK, 7–9 September 2016; pp. 341–371. [Google Scholar]
  3. Vrigkas, M.; Nikou, C.; Kakadiaris, I.A. A review of human activity recognition methods. Front. Robot. AI 2015, 2, 28. [Google Scholar] [CrossRef]
  4. Duan, H.; Zhao, Y.; Chen, K.; Lin, D.; Dai, B. Revisiting skeleton-based action recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 2969–2978. [Google Scholar]
  5. Karim, M.; Khalid, S.; Aleryani, A.; Khan, J.; Ullah, I.; Ali, Z. Human Action Recognition Systems: A Review of the Trends and State-of-the-Art. IEEE Access 2024, 12, 36372–36390. [Google Scholar] [CrossRef]
  6. Liu, Z.; Zhu, J.; Bu, J.; Chen, C. A survey of human pose estimation: The body parts parsing based methods. J. Vis. Commun. Image Represent. 2015, 32, 10–19. [Google Scholar] [CrossRef]
  7. Wang, P.; Li, W.; Ogunbona, P.; Wan, J.; Escalera, S. RGB-D-based human motion recognition with deep learning: A survey. Comput. Vis. Image Underst. 2018, 171, 118–139. [Google Scholar] [CrossRef]
  8. Presti, L.L.; La Cascia, M. 3D skeleton-based human action classification: A survey. Pattern Recognit. 2016, 53, 130–147. [Google Scholar] [CrossRef]
  9. Tome, D.; Russell, C.; Agapito, L. Lifting from the deep: Convolutional 3D pose estimation from a single image. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2500–2509. [Google Scholar]
  10. Huang, H.; Wang, Y.; Linghu, K.; Xia, Z. Multi-modal micro-gesture classification via multi-scale heterogeneous ensemble network. In Proceedings of the Workshop & Challenge on Micro-Gesture Analysis for Hidden Emotion Understanding, Jeju, Republic of Korea, 3–9 August 2024. [Google Scholar]
  11. Wang, Y.; Rui, K.; Huang, H.; Xia, Z. Micro-gesture online recognition with dual-stream multi-scale transformer in long videos. In Proceedings of the Workshop & Challenge on Micro-Gesture Analysis for Hidden Emotion Understanding, Jeju, Republic of Korea, 3–9 August 2024. [Google Scholar]
  12. Dang, Q.; Yin, J.; Wang, B.; Zheng, W. Deep learning based 2D human pose estimation: A survey. Tsinghua Sci. Technol. 2019, 24, 663–676. [Google Scholar] [CrossRef]
  13. Lan, G.; Wu, Y.; Hu, F.; Hao, Q. Vision-based human pose estimation via deep learning: A survey. IEEE Trans. Hum.-Mach. Syst. 2022, 53, 253–268. [Google Scholar] [CrossRef]
  14. Wang, C.; Zhang, F.; Ge, S.S. A comprehensive survey on 2D multi-person pose estimation methods. Eng. Appl. Artif. Intell. 2021, 102, 104260. [Google Scholar] [CrossRef]
  15. Gamra, M.B.; Akhloufi, M.A. A review of deep learning techniques for 2D and 3D human pose estimation. Image Vis. Comput. 2021, 114, 104282. [Google Scholar] [CrossRef]
  16. Chen, Y.; Tian, Y.; He, M. Monocular human pose estimation: A survey of deep learning-based methods. Comput. Vis. Image Underst. 2020, 192, 102897. [Google Scholar] [CrossRef]
  17. Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. ViTPose: Simple vision transformer baselines for human pose estimation. In Proceedings of the Conference on Neural Information Processing Systems, Virtual, 28 November–9 December 2022; pp. 38571–38584. [Google Scholar]
  18. Jiang, T.; Lu, P.; Zhang, L.; Ma, N.; Han, R.; Lyu, C.; Li, Y.; Chen, K. RTMPose: Real-time multi-person pose estimation based on MMpose. arXiv 2023, arXiv:2303.07399. [Google Scholar] [CrossRef]
  19. Yang, S.; Quan, Z.; Nie, M.; Yang, W. TransPose: Keypoint localization via transformer. In Proceedings of the International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11802–11812. [Google Scholar]
  20. Cheng, B.; Xiao, B.; Wang, J.; Shi, H.; Huang, T.S.; Zhang, L. HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5386–5395. [Google Scholar]
  21. Zhao, M.; Li, T.; Abu Alsheikh, M.; Tian, Y.; Zhao, H.; Torralba, A.; Katabi, D. Through-wall human pose estimation using radio signals. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7356–7365. [Google Scholar]
  22. Ghafoor, M.; Mahmood, A. Quantification of occlusion handling capability of a 3D human pose estimation framework. IEEE Trans. Multimed. 2022, 25, 3311–3318. [Google Scholar] [CrossRef]
  23. Chen, B.; Chin, T.J.; Klimavicius, M. Occlusion-robust object pose estimation with holistic representation. In Proceedings of the Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 2929–2939. [Google Scholar]
  24. Munea, T.L.; Jembre, Y.Z.; Weldegebriel, H.T.; Chen, L.; Huang, C.; Yang, C. The progress of human pose estimation: A survey and taxonomy of models applied in 2D human pose estimation. IEEE Access 2020, 8, 133330–133348. [Google Scholar] [CrossRef]
  25. Bragagnolo, L.; Terreran, M.; Allegro, D.; Ghidoni, S. Multi-view Pose Fusion for Occlusion-Aware 3D Human Pose Estimation. arXiv 2024, arXiv:2408.15810. [Google Scholar] [CrossRef]
  26. Zhou, G.; Yan, Y.; Wang, D.; Chen, Q. A novel depth and color feature fusion framework for 6D object pose estimation. IEEE Trans. Multimed. 2020, 23, 1630–1639. [Google Scholar] [CrossRef]
  27. Kazakos, E.; Nikou, C.; Kakadiaris, I.A. On the fusion of RGB and depth information for hand pose estimation. In Proceedings of the International Conference on Image Processing, Athens, Greece, 7–10 October 2018; pp. 868–872. [Google Scholar]
  28. Wang, Z.; Lu, Y.; Ni, W.; Song, L. An RGB-D based approach for human pose estimation. In Proceedings of the International Conference on Networking Systems of AI, Shanghai, China, 19–20 November 2021; pp. 166–170. [Google Scholar]
  29. Wei, S.E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional Pose Machines. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4724–4732. [Google Scholar]
  30. Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299. [Google Scholar]
  31. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar]
  32. Newell, A.; Huang, Z.; Deng, J. Associative embedding: End-to-end learning for joint detection and grouping. In Proceedings of the Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  33. Chen, X.; Yang, C.; Mo, J.; Sun, Y.; Karmouni, H.; Jiang, Y.; Zheng, Z. CSPNeXt: A new efficient token hybrid backbone. Eng. Appl. Artif. Intell. 2024, 132, 107886. [Google Scholar] [CrossRef]
  34. Li, Y.; Yang, S.; Liu, P.; Zhang, S.; Wang, Y.; Wang, Z.; Yang, W.; Xia, S. SimCC: A simple coordinate classification perspective for human pose estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 89–106. [Google Scholar]
  35. Yang, C.H.; Kong, K.B.; Min, S.J.; Wee, D.Y.; Jang, H.D.; Cha, G.H.; Kang, S.J. SEFD: Learning to distill complex pose and occlusion. In Proceedings of the International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 14895–14906. [Google Scholar]
  36. Purkrabek, M.; Matas, J. Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle. arXiv 2024, arXiv:2412.01562. [Google Scholar] [CrossRef]
  37. Artacho, B.; Savakis, A. Full-BAPose: Bottom Up Framework for Full Body Pose Estimation. Sensors 2023, 23, 3725. [Google Scholar] [CrossRef]
  38. Qu, H.; Cai, Y.; Foo, L.G.; Kumar, A.; Liu, J. A characteristic function-based method for bottom-up human pose estimation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  39. Bai, X.; Wei, X.; Wang, Z.; Zhang, M. CONet: Crowd and occlusion-aware network for occluded human pose estimation. Neural Netw. 2024, 172, 106109. [Google Scholar] [CrossRef]
  40. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar] [CrossRef]
  41. Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  42. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  43. Wang, C.Y.; Liao, H.Y.M.; Yeh, I.H.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  44. Xie, S.; Girshick, R.; Dollar, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  45. Amin, A.; Tamajo, A.; Klugman, I.; Stoev, E.; Fisho, T.; Lim, H.; Kim, H. Real-time 3D multi-person pose estimation using an omnidirectional camera and mmWave radars. In Proceedings of the International Conference on Engineering and Emerging Technologies, Seoul, Republic of Korea, 20–22 October 2023; pp. 1–6. [Google Scholar]
  46. Knap, P.; Hardy, P.; Tamajo, A.; Lim, H.; Kim, H. Real-time omnidirectional 3D multi-person human pose estimation with occlusion handling. In Proceedings of the ACM SIGGRAPH European Conference on Visual Media Production, London, UK, 6–8 November 2023. [Google Scholar]
  47. Knap, P.; Hardy, P.; Tamajo, A.; Lim, H.; Kim, H. Improving real-time omnidirectional 3D multi-person human pose estimation with people matching and unsupervised 2D–3D lifting. In Proceedings of the International Conference on Electronics, Information, and Communication, Jeju Island, Republic of Korea, 10–13 January 2024; pp. 1–4. [Google Scholar]
  48. Sengupta, A.; Jin, F.; Cao, S. NLP based skeletal pose estimation using mmWave radar point-cloud: A simulation approach. In Proceedings of the IEEE Radar Conference, Atlantic City, NJ, USA, 21–24 September 2020; pp. 1–6. [Google Scholar]
  49. An, S.; Ogras, U.Y. Fast and scalable human pose estimation using mmWave point cloud. In Proceedings of the Design Automation Conference, San Francisco, CA, USA, 10–14 July 2022; pp. 889–894. [Google Scholar]
  50. Li, G.; Zhang, Z.; Yang, H.; Pan, J.; Chen, D.; Zhang, J. Capturing human pose using mmWave radar. In Proceedings of the International Conference on Pervasive Computing and Communications Workshops, Austin, TX, USA, 23–27 March 2020; pp. 1–6. [Google Scholar]
  51. Fürst, M.; Gupta, S.T.; Schuster, R.; Wasenmüller, O.; Stricker, D. HPERL: 3D human pose estimation from RGB and LiDAR. In Proceedings of the International Conference on Pattern Recognition, Milano, Italy, 10–15 January 2021; pp. 7321–7327. [Google Scholar]
  52. Ye, D.; Xie, Y.; Chen, W.; Zhou, Z.; Ge, L.; Foroosh, H. LPFormer: LiDAR pose estimation transformer with multi-task network. In Proceedings of the International Conference on Robotics and Automation, Yokohama, Japan, 18–22 May 2024; pp. 16432–16438. [Google Scholar]
  53. Knap, P. Human modelling and pose estimation overview. arXiv 2024, arXiv:2406.19290. [Google Scholar] [CrossRef]
  54. Park, S.; Ji, M.; Chun, J. 2D human pose estimation based on object detection using RGB-D information. KSII Trans. Internet Inf. Syst. 2018, 12, 800–816. [Google Scholar] [CrossRef]
  55. Hirschmuller, H. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 30, 328–341. [Google Scholar] [CrossRef] [PubMed]
  56. Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019. [Google Scholar]
  57. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701. [Google Scholar] [CrossRef] [PubMed]
  58. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
