Applied Sciences
  • Article
  • Open Access

7 August 2025

Robust Human Pose Estimation Method for Body-to-Body Occlusion Using RGB-D Fusion Neural Network

Department of Computer Software Engineering, Dong-eui University, Busan 47340, Republic of Korea
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Advanced Pattern Recognition & Computer Vision

Abstract

In this study, we propose a novel approach for human pose estimation (HPE) in occluded scenes by progressively fusing features extracted from RGB-D images, which comprise paired RGB and depth images. Conventional bottom-up HPE models that rely solely on RGB inputs often produce erroneous skeletons when parts of a person’s body are obscured by another individual, because they struggle to accurately infer body connectivity without 3D topological information. To address this limitation, we modify OpenPose, a traditional bottom-up HPE model, to take a depth image as an additional input, thereby providing explicit 3D spatial cues. Each input modality is processed by a dedicated feature extractor. In addition to the two existing modules in each stage, which estimate joint connectivity and joint confidence maps for the color image, we integrate a module that estimates joint confidence maps for the depth image into the initial few stages. The confidence maps derived from the depth and RGB modalities are then fused at each stage and forwarded to the next, ensuring that the 3D topological information from the depth image is effectively utilized for both joint localization and body part association. Experimental results on the NTU RGB+D 120 dataset verify that the proposed approach achieves a 13.3% improvement in average recall compared to the original OpenPose model. The proposed method can thus enhance the performance of bottom-up HPE models in occlusion scenes.

1. Introduction

Human pose estimation (HPE) detects body joints from images and estimates the human skeleton. HPE is widely used in HAR (Human Action Recognition), which recognizes body posture or movements [1,2,3], because it is not affected by lighting changes or background complexity [4]. Vision-based HAR utilizing HPE has diverse applications in the field of computer vision, including surveillance video analysis and human–computer interaction (HCI) [5,6,7,8,9]. Recently, a fine-grained pose estimation method using multi-modal information has also been actively studied for HCI [10,11].
HPE methods can be categorized into top-down and bottom-up approaches based on the sequence of body detection and joint localization [12,13,14,15]. The top-down approach first detects human regions in an image and then applies a joint detector to each region. This method is less sensitive to variations in human scale but suffers from a linear increase in computational cost as the number of people in an image increases. In contrast, the bottom-up approach directly detects body joints in an image and estimates their associations based on spatial relationships to reconstruct human poses. This approach is advantageous for real-time processing since the joint detector operates only once [16].
Although recent HPE models [17,18,19,20,21] have achieved outstanding average accuracy and inference speed, the occlusion problem, that is, a drop in accuracy when some body parts are obstructed by other parts or objects, remains unsolved. Occlusion can be classified into occlusion by objects, body-to-body occlusion, and self-occlusion (e.g., when a person crosses his or her legs). Improvements in HPE for images with occlusion by objects have been studied, whereas the other types of occlusion have not been adequately addressed [22]. These occlusion problems are mainly caused by the lack of topological information in RGB images [23,24]. In this study, we focus on improving HPE for images with body-to-body occlusion. Some efforts estimate multiple people’s poses from images captured from multiple views to reduce the influence of occlusion [25]. Such setups are largely free from occlusion and even allow 3D pose estimation, but they are limited to controlled environments where multiple cameras are precisely installed. RGB-D images, which contain both color and depth images, can capture topological relationships in three-dimensional space. Each pixel of a depth image stores the distance from the camera along the z-axis, i.e., the camera’s optical axis. By additionally utilizing these depth values, occluded body parts can be separated more effectively. Previous studies [26,27,28] have estimated object poses by fusing features from color and depth images. Nevertheless, studies on HPE have typically employed depth images only for auxiliary tasks, such as body region segmentation or post-processing corrections of detected skeletal data, rather than integrating depth information directly into the network itself. In particular, few studies have focused on leveraging depth images to address body-to-body occlusion in multi-person images. We aim to improve HPE for occluded bodies by fusing features from both color and depth images.
In this study, we propose an HPE method robust to body-to-body occlusion using a novel network architecture that integrates color and depth information. We modify OpenPose, a traditional bottom-up HPE model, to take a depth image as an additional input, thereby providing explicit 3D spatial cues, and each input modality is processed by a dedicated feature extractor. The proposed network consists of a multi-stage cascade structure and acts as a joint detection method that effectively considers both color and depth information by iteratively fusing features extracted from the two inputs. We demonstrate that 3D spatial features captured from depth images provide useful cues about occlusion, enabling accurate prediction even in body-to-body occlusion scenarios. Additionally, the proposed HPE method adopts a bottom-up approach to maintain stable, low computational complexity regardless of the number of people in the image. As a result, the proposed method mitigates the ambiguity caused by overlapping inter-personal body parts by exploiting the correlation between color and depth information and shows improved accuracy compared to previous bottom-up methods in body-to-body occlusion scenarios.
The main contributions of this study are as follows:
  • We propose a novel architecture that integrates depth image features into a bottom-up HPE model. In previous HPE models based on RGB-D images [26,27,28], depth images were used primarily to augment the base features extracted by the feature extractor. In contrast, our approach progressively fuses depth features with those from color images at each stage of the HPE process, thereby substantially improving HPE performance in occlusion scenes.
  • We specifically selected images containing occlusion scenarios and experimentally demonstrated that leveraging features extracted from depth images leads to clear improvements in pose estimation performance for these cases. Furthermore, we conducted an analysis of how RGB and depth images contribute to pose estimation at each stage of feature fusion.
The remainder of this paper is organized as follows. Section 2 reviews related works. Section 3 presents the proposed HPE method using RGB-D images. Section 4 describes the experimental results for the proposed method. Finally, Section 5 provides a conclusion.

3. Human Pose Estimation Method by Progressive Feature Fusion

Figure 1 presents a flowchart of the proposed HPE method. The core process involves progressively fusing features from color and depth images to enhance joint estimation. This approach enhances joint detection by incorporating edge, feature, and texture information from color images together with three-dimensional characteristics from depth images. The proposed method utilizes RGB and depth images as input data. Feature maps for each modality are extracted through separate feature extractors. Subsequently, the network progressively refines the representations of joints by combining features from both depth and color images. The enhanced features from both modalities are fused and passed as inputs to the next stage. This iterative process continues for several early stages, leading to a gradual strengthening of the joint representations.
Figure 1. Flowchart of the proposed method.

3.1. Network Architecture

The proposed method effectively considers the topological relationships between detected candidate joints using three-dimensional information. This allows for accurate joint detection even in occluded scenes. In this work, we adopt OpenPose [30] as our HPE baseline model due to its widespread use and proven effectiveness.
Figure 2 illustrates the structure of the proposed network. The network consists of three parallel branches, each containing independent inference layers. The first branch extracts part affinity fields (PAFs), while the second and third branches generate confidence maps for body joints from RGB and depth images, respectively. The network employs a multi-stage cascaded architecture, where the outputs of each branch are fused and used as input for the subsequent stage. This iterative process progressively expands the receptive field and refines feature representations, ultimately improving the final output.
Figure 2. Structure of the proposed network.
The proposed network is based on the OpenPose model and adopts the first ten layers of VGG-19 [40] as the backbone to extract feature maps Frgb and Fd from RGB and depth images. In the first stage, three independent inference layers, ϕ1, ρ1, and γ1, predict pose-related information. The inference layers ϕ1 and ρ1 take Frgb as input, while γ1 processes Fd. The inference layer ϕ1 extracts a two-dimensional vector field L1, encoding the directional information of joint connections at limb locations. The inference layer ρ1 generates a confidence map S1, predicting body joint locations from the color feature maps. The inference layer γ1 produces a confidence map D1, estimating body joint positions using depth feature maps. All three inference layers, ϕ1, ρ1, and γ1, have an identical structure consisting of three convolutional layers with a 3 × 3 kernel and two convolutional layers with a 1 × 1 kernel. The detailed structure of these layers is shown in Figure 3. After extracting L1, S1, and D1 from each branch in Stage 1, the confidence maps S1 and D1 are fused. A convolutional layer with a 1 × 1 kernel is then applied to reduce the number of channels by half, resulting in the fused confidence map M1, which integrates both color and depth information. Similarly, the feature maps Frgb and Fd are combined, and a 1 × 1 convolutional layer is used to reduce the number of channels by half, producing Frgbd, which encapsulates both color and depth features. The fused outputs M1, L1, and Frgbd are then combined and used as input for each branch in Stage 2. For Stage n (n ≥ 2), the inference layers in each branch consist of five sequential convolutional layers with a 7 × 7 kernel, followed by two convolutional layers with a 1 × 1 kernel. Figure 4 illustrates the structure of the inference layers ϕn, ρn, and γn for each branch in Stage n.
Figure 3. Structure of inference layer in Stage 1.
Figure 4. Structure of inference layer in Stage n (n ≥ 2).
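As an illustration of this stage structure, the following PyTorch-style sketch shows how the three Stage-1 branches and the two 1 × 1 fusion convolutions described above could be wired together. It is our reconstruction, not the authors’ released code; the channel counts, joint/limb counts, and module names (e.g., `StageOne`, `stage1_branch`) are illustrative assumptions.

```python
import torch
import torch.nn as nn

def stage1_branch(in_ch, out_ch, mid_ch=128):
    # Stage-1 inference layer: three 3x3 convolutions followed by two 1x1 convolutions.
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, 512, 1), nn.ReLU(inplace=True),
        nn.Conv2d(512, out_ch, 1),
    )

class StageOne(nn.Module):
    """Stage 1 of the RGB-D network: phi_1 (PAFs), rho_1 (RGB confidence maps),
    gamma_1 (depth confidence maps), plus the two 1x1 fusion convolutions."""
    def __init__(self, feat_ch=128, n_limbs=19, n_joints=19):
        super().__init__()
        self.phi = stage1_branch(feat_ch, 2 * n_limbs)    # L1: 2D vector field per limb
        self.rho = stage1_branch(feat_ch, n_joints)       # S1: joint confidence maps (RGB)
        self.gamma = stage1_branch(feat_ch, n_joints)     # D1: joint confidence maps (depth)
        # 1x1 convolutions that halve the channel count of the concatenated maps.
        self.fuse_conf = nn.Conv2d(2 * n_joints, n_joints, 1)  # concat(S1, D1) -> M1
        self.fuse_feat = nn.Conv2d(2 * feat_ch, feat_ch, 1)    # concat(F_rgb, F_d) -> F_rgbd

    def forward(self, f_rgb, f_d):
        L1 = self.phi(f_rgb)
        S1 = self.rho(f_rgb)
        D1 = self.gamma(f_d)
        M1 = self.fuse_conf(torch.cat([S1, D1], dim=1))
        F_rgbd = self.fuse_feat(torch.cat([f_rgb, f_d], dim=1))
        # The concatenation of M1, L1, and F_rgbd becomes the input to Stage 2.
        stage2_input = torch.cat([M1, L1, F_rgbd], dim=1)
        return L1, S1, D1, M1, F_rgbd, stage2_input
```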
The proposed RGB-D network fuses color and depth information only up to Stage R. In these stages, the fused confidence map Mr is obtained by combining Sr and Dr, while Lr and Frgbd are also incorporated as inputs for the next stage. After Stage R, the network no longer includes Branch 3, which processes depth information. Instead, the subsequent stages use only Lt, St, and Frgb as inputs. In the final Stage T, the network outputs LT and ST, marking the completion of the proposed RGB-D network’s operation. Based on empirical findings from the OpenPose model, T is set to 6, while R is optimized to 3, ensuring the highest performance when color and depth information is fused up to Stage 3. The final outputs L and S from the proposed RGB-D network are used in the PAF grouping strategy of OpenPose to construct the skeletal structure for each individual.
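The overall cascade can then be sketched as a loop over stages, again as an assumption-laden illustration: depth confidence maps are produced and fused only for stages r ≤ R = 3, and from Stage R + 1 to T = 6 only the PAF and RGB confidence branches remain. `stage1`, `stages_rgbd`, and `stages_rgb` are hypothetical modules following the interface of the previous sketch.

```python
import torch

def forward_cascade(f_rgb, f_d, stage1, stages_rgbd, stages_rgb):
    """Progressive-fusion cascade (illustrative).

    stage1      : Stage-1 module (3x3 inference layers, RGB-D fusion)
    stages_rgbd : modules for Stages 2..R (7x7 inference layers, RGB-D fusion)
    stages_rgb  : modules for Stages R+1..T (7x7 inference layers, RGB only)
    Returns the final PAFs L_T and confidence maps S_T.
    """
    L, S, D, M, F_rgbd, x = stage1(f_rgb, f_d)
    # Stages 2..R: Branch 3 is still present; fused maps feed the next stage.
    for stage in stages_rgbd:
        L, S, D, M, F_rgbd, x = stage(x)
    # Stages R+1..T: Branch 3 is dropped; only L, S, and the RGB features are reused.
    x = torch.cat([S, L, f_rgb], dim=1)
    for stage in stages_rgb:
        L, S = stage(x)
        x = torch.cat([S, L, f_rgb], dim=1)
    return L, S
```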
Figure 5 compares the confidence maps of the right shoulder at Stages 3 and 4 between the standard OpenPose and our proposed modification. The confidence maps extracted from Stage 3 and Stage 4 in both models are analyzed to determine the impact of incorporating depth information on joint detection accuracy. The traditional model struggles to distinguish the shoulder positions of two closely positioned individuals when significant occlusion occurs. In Stage 3, the output confidence map S3 exhibits a merged peak, making it difficult to differentiate between the right shoulders of the two individuals. This problem worsens in Stage 4, where the peaks remain indistinguishable, making it even more challenging to separate the shoulder locations compared to Stage 3. In contrast, the proposed RGB-D network effectively leverages depth information from Stage 1, allowing it to incorporate both topological information and color features throughout the process. As a result, in Stage 3, the confidence map S3 successfully separates the peaks for the leftmost person’s shoulder and the more distant individual’s shoulder, making them easily distinguishable. Furthermore, the output confidence map D3 from the third branch is combined with S3, refining the Stage 4 output S4. This results in a more distinct separation of peaks, ultimately improving the accuracy of the right shoulder localization for each individual compared to previous stages.
Figure 5. Comparison of confidence maps of right shoulder at Stages 3 and 4: (a) OpenPose; (b) the proposed model.

3.2. Loss Function

In the proposed RGB-D network, designed for robust pose estimation under occlusions, a loss function is defined for each branch at every stage during training. At Stage r, Branch 1 utilizes the loss function $f_L^r$, Branch 2 employs $f_S^r$, and Branch 3 applies $f_D^r$. The loss functions $f_L^r$, $f_S^r$, and $f_D^r$ are formulated using the L2 norm as follows:
$$f_L^r = \sum_{c=1}^{C} \sum_{p} W(p) \cdot \left\| L_c^r(p) - L_c^*(p) \right\|_2^2,$$
$$f_S^r = \sum_{j=1}^{J} \sum_{p} W(p) \cdot \left\| S_j^r(p) - Q_j^*(p) \right\|_2^2,$$
$$f_D^r = \sum_{j=1}^{J} \sum_{p} W(p) \cdot \left\| D_j^r(p) - Q_j^*(p) \right\|_2^2,$$
where $W(p)$ is a binary mask for a pixel $p \in \mathbb{R}^2$, which prevents loss contributions from false-positive predictions at joint locations that are not annotated in the ground truth (GT); $L_c^r$ is the vector field predicted for limb $c$ in Branch 1 at Stage $r$; $S_j^r$ and $D_j^r$ are the confidence maps predicted for joint $j$ in Branch 2 and Branch 3 at Stage $r$, respectively; and $L_c^*(p)$ and $Q_j^*(p)$ are the ground truths of the vector field and the confidence map, respectively.
The total loss function $f$ for training the proposed RGB-D network is formulated as follows:
$$f = \sum_{r=1}^{R} \left( f_L^r + f_S^r + f_D^r \right) + \sum_{r=R+1}^{T} \left( f_L^r + f_S^r \right).$$
The total loss function $f$ sums the vector field loss ($f_L$) and the confidence map losses ($f_S$ and $f_D$) over all stages. Since Branch 3 is not present after Stage R, the $f_D$ term is excluded from the loss from that point onward.
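For concreteness, a minimal sketch of this loss, assuming tensors in (batch, channel, height, width) layout and a binary mask `W_mask` that broadcasts over channels (the function and variable names are ours, not from the paper):

```python
import torch

def masked_l2(pred, gt, W_mask):
    # W_mask is a binary map that zeroes out unannotated joint locations,
    # so false positives there do not contribute to the loss.
    return (W_mask * (pred - gt) ** 2).sum()

def total_loss(L_preds, S_preds, D_preds, L_gt, Q_gt, W_mask, R):
    """L_preds, S_preds: lists of length T; D_preds: list of length R."""
    f = 0.0
    for r, (L_r, S_r) in enumerate(zip(L_preds, S_preds), start=1):
        f = f + masked_l2(L_r, L_gt, W_mask) + masked_l2(S_r, Q_gt, W_mask)
        if r <= R:  # Branch 3 (depth confidence maps) exists only up to Stage R
            f = f + masked_l2(D_preds[r - 1], Q_gt, W_mask)
    return f
```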

3.3. Ground Truth for Heatmap Representation

In the proposed network, the GT for network training consists of confidence maps $Q^*$ that represent joint locations as heatmaps and a set of 2D vector fields $L^*$ that encode the directional information between connected joints. The generation of GT from the annotated body joint coordinates follows the procedures described in OpenPose. When training body joint localization, the confidence maps have a width $w$ and height $h$, and the ground truth $Q^* \in \mathbb{R}^{J \times w \times h}$ is defined as a set of heatmaps $Q_j^* \in \mathbb{R}^{w \times h}$ for each joint $j$. To generate $Q_j^*$, the per-person function $Q_{j,k}^* \in \mathbb{R}^{w \times h}$ must first be computed. The function $Q_{j,k}^*$ represents a heatmap modeled as a Gaussian with variance $\sigma^2$ of the distance between a pixel $p$ and the annotated joint coordinate $x_{j,k}$ of person $k$ in the dataset. The function $Q_{j,k}^*$ is computed as follows:
$$Q_{j,k}^*(p) = \exp\left( -\frac{\left\| p - x_{j,k} \right\|_2^2}{2\sigma^2} \right),$$
where σ controls the spread of the peak in the confidence map represented as a heatmap. In the proposed method, σ is set to 3.
The GT confidence map $Q_j^*$ is generated by applying a maximum operation over all $Q_{j,k}^*$ at each pixel $p$ as follows:
$$Q_j^*(p) = \max_k Q_{j,k}^*(p).$$
Finally, the confidence maps corresponding to the joints are assigned the GT values.
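A NumPy sketch of this ground-truth generation, with σ = 3 as stated above; the array shapes and function name are our own choices:

```python
import numpy as np

def joint_confidence_maps(joint_xy, w, h, sigma=3.0):
    """joint_xy: array of shape (K, J, 2) with the (x, y) coordinate of joint j
    for person k. Returns Q* of shape (J, h, w), the per-joint max over people."""
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))          # pixel grid p
    K, J, _ = joint_xy.shape
    Q = np.zeros((J, h, w), dtype=np.float32)
    for j in range(J):
        for k in range(K):
            x, y = joint_xy[k, j]
            g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
            Q[j] = np.maximum(Q[j], g)                        # max over people k
    return Q
```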
The generation of the GT $L^*$ requires one to obtain $L_{c,k}^* \in \mathbb{R}^{w \times h \times 2}$, which represents the connection direction of limb type $c$ at the pixels where that limb of person $k$ is present. The vector field $L_{c,k}^*$ is computed as follows:
$$L_{c,k}^*(p) = \begin{cases} v & \text{if } p \text{ lies on limb } c \text{ of person } k, \\ 0 & \text{otherwise,} \end{cases}$$
where v represents the direction of the limb. If pixel p corresponds to the location of the c-th limb of the k-th person, v is assigned as a unit vector indicating the limb’s direction. Otherwise, for non-limb pixels, v is set to zero. The vector v is computed as follows:
$$v = \frac{x_{j_2,k} - x_{j_1,k}}{\left\| x_{j_2,k} - x_{j_1,k} \right\|_2},$$
where $x_{j_1,k}, x_{j_2,k} \in \mathbb{R}^2$ represent the positions of the two joints forming the $c$-th limb of the $k$-th individual. This yields a unit vector with a magnitude of 1 pointing from $x_{j_1,k}$ to $x_{j_2,k}$.
To assign v at limb locations, it is necessary to define the pixel region where the limb exists. Since limbs have thickness, the reference region should include both the line segment connecting the two joints and the area perpendicular to this segment. The threshold for defining the region considering limb thickness is determined based on x j 1 , k and x j 2 , k and is computed as follows:
$$0 \le v \cdot \left( p - x_{j_1,k} \right) \le l_{c,k} \quad \text{and} \quad \left| v_{\perp} \cdot \left( p - x_{j_2,k} \right) \right| \le \sigma_l,$$
where $v_{\perp}$ is a unit vector perpendicular to the direction of $v$, $\sigma_l$ represents the limb width as a distance in pixels, and $l_{c,k}$ denotes the length of the $c$-th limb of the $k$-th person. The limb length $l_{c,k}$ is computed as follows:
$$l_{c,k} = \left\| x_{j_2,k} - x_{j_1,k} \right\|_2.$$
$L^* \in \mathbb{R}^{C \times w \times h \times 2}$ is composed of the set of $L_c^* \in \mathbb{R}^{w \times h \times 2}$ for all limb types $c$ across all individuals. $L_c^*$ is computed as follows:
$$L_c^*(p) = \frac{1}{n_c(p)} \sum_k L_{c,k}^*(p),$$
where $n_c(p)$ denotes the number of individuals for which the $v$ corresponding to the $c$-th limb is nonzero at $p$. A nonzero $v$ in $L_{c,k}^*$ indicates that the corresponding pixel lies on the $c$-th limb of the $k$-th person. Therefore, $n_c(p)$ is the number of individuals whose $c$-th limbs overlap at $p$. The final aggregated direction map $L_c^*$ is calculated by summing all overlapping $L_{c,k}^*$ and dividing by $n_c(p)$, thereby averaging the directional vectors for limb connections.
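The PAF ground truth for a single limb type can be sketched in the same spirit, combining the unit direction, the thickness test, and the averaging over overlapping people (a hedged illustration; `sigma_l` and the array layout are assumptions):

```python
import numpy as np

def limb_vector_field(limb_pts, w, h, sigma_l=4.0):
    """limb_pts: array of shape (K, 2, 2) holding, for each person k, the two joint
    positions (x_{j1,k}, x_{j2,k}) of this limb type. Returns L*_c of shape (h, w, 2)."""
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    acc = np.zeros((h, w, 2), dtype=np.float32)
    count = np.zeros((h, w), dtype=np.float32)
    for x1, x2 in limb_pts:
        d = x2 - x1
        length = np.linalg.norm(d)
        if length == 0:
            continue
        v = d / length                                    # unit direction of the limb
        v_perp = np.array([-v[1], v[0]])                  # unit vector perpendicular to v
        px, py = xs - x1[0], ys - x1[1]
        along = v[0] * px + v[1] * py                     # projection on the limb axis
        across = np.abs(v_perp[0] * px + v_perp[1] * py)  # distance from the axis
        on_limb = (along >= 0) & (along <= length) & (across <= sigma_l)
        acc[on_limb] += v
        count[on_limb] += 1.0
    nz = count > 0
    acc[nz] /= count[nz][:, None]                         # average overlapping directions
    return acc
```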

3.4. Generation of Body-to-Body Occlusion Samples

To evaluate the pose estimation performance of the proposed network, we generate a body-to-body occlusion test set by selecting images with inter-body occlusion from the existing multi-person pose estimation test set. RGB-D images in the dataset are identified as body-to-body occlusion if two or more people exist and their bounding boxes are overlapped. This occlusion test set enables the evaluation of pose estimation performance under controlled body-to-body occlusion scenes.
The presence of body-to-body occlusion in a sample is determined from the intersection areas $\beta_z$ of bounding box pairs $z$: occlusion is identified if $\sum_z \beta_z > 0$. The intersection area is calculated as $\beta_z = w_{inter} \cdot h_{inter}$, where $w_{inter}$ and $h_{inter}$ represent the width and height of the intersection area, respectively. These values are derived from the bounding box coordinates of individuals $i$ and $j$, where each bounding box is defined by its top-left coordinates $(x_{min}, y_{min})$ and bottom-right coordinates $(x_{max}, y_{max})$. Figure 6 provides an example of bounding boxes, intersection areas, and the corresponding $w_{inter}$ and $h_{inter}$ values for multi-person samples.
Figure 6. Body-to-body occlusion identification through bounding box intersection.
The values $w_{inter}$ and $h_{inter}$, which are used to compute the intersection area, may be negative when the bounding boxes do not overlap. To prevent negative values from contributing to the occlusion determination, $w_{inter}$ and $h_{inter}$ are taken only when positive; otherwise, they are assigned a value of zero to exclude them from the calculation. Therefore, $w_{inter}$ and $h_{inter}$ are computed as follows:
$$w_{inter} = \begin{cases} w_{i,j} & (w_{i,j} > 0) \\ 0 & (w_{i,j} \le 0) \end{cases}, \qquad h_{inter} = \begin{cases} h_{i,j} & (h_{i,j} > 0) \\ 0 & (h_{i,j} \le 0) \end{cases},$$
where $w_{i,j}$ and $h_{i,j}$ represent the width and height of the intersection area between the bounding boxes of individuals $i$ and $j$, respectively. They are computed as follows:
$$w_{i,j} = \min\left( x_{max,i}, x_{max,j} \right) - \max\left( x_{min,i}, x_{min,j} \right), \qquad h_{i,j} = \min\left( y_{max,i}, y_{max,j} \right) - \max\left( y_{min,i}, y_{min,j} \right).$$
If there is no overlap between the two bounding boxes, $w_{i,j}$ or $h_{i,j}$ is negative; in that case, the piecewise definition above assigns zero so that the pair is excluded from the area calculation. In some cases, bounding boxes overlap even though actual occlusion does not occur. To address this, we manually verified these candidates and selected the occluded samples accordingly.
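The selection rule above reduces to a simple pairwise overlap test; the sketch below is our own helper, assuming each bounding box is given as an (xmin, ymin, xmax, ymax) tuple, and it only flags candidates that are then verified manually:

```python
from itertools import combinations

def intersection_area(box_i, box_j):
    """Each box is (xmin, ymin, xmax, ymax). Negative widths or heights mean
    no overlap and are clamped to zero, matching the piecewise definition above."""
    w_ij = min(box_i[2], box_j[2]) - max(box_i[0], box_j[0])
    h_ij = min(box_i[3], box_j[3]) - max(box_i[1], box_j[1])
    return max(w_ij, 0.0) * max(h_ij, 0.0)

def is_occlusion_candidate(boxes):
    """True if at least two people are present and some pair of boxes overlaps.
    Overlap does not guarantee occlusion, so candidates are checked manually."""
    return len(boxes) >= 2 and any(
        intersection_area(a, b) > 0 for a, b in combinations(boxes, 2)
    )
```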

4. Experimental Results

We evaluate the pose estimation performance of the proposed network. The experiments are conducted in an environment running Ubuntu 20.04, equipped with an NVIDIA GeForce RTX 4070 GPU, 16 GB of memory, and 1 TB of storage.
A self-comparison is performed by adjusting the fusion stage parameter R in the proposed RGB-D network using a test set consisting of approximately 9000 randomly selected samples from the NTU RGB+D 120 dataset [56,57]. The proposed method is compared with existing high-performance bottom-up pose estimation models. The proposed RGB-D network is compared with OpenPose, which uses only RGB images as input. Since the proposed network enhances OpenPose by incorporating depth images and fusing topological information into the confidence maps and PAFs, the stage-wise confidence map prediction performance is analyzed to assess the effectiveness of fusing color and depth information. In order to evaluate performance under inter-person occlusion, a dedicated occlusion test set is constructed by selecting 383 occluded samples from the dataset.

4.1. Dataset

The NTU RGB+D 120 dataset was adopted to evaluate the performance of the proposed method. Although the NTU RGB+D 120 dataset was primarily developed for benchmarking action recognition tasks, it contains a sufficient number of scenes with body-to-body occlusions with multiple individuals and provides body skeleton data. Therefore, the NTU RGB+D 120 dataset is also suitable for evaluating pose estimation based on RGB-D images. The NTU RGB+D 120 dataset contains RGB-D images, that is, pairs of RGB and depth images, along with corresponding skeleton annotations. This dataset contains approximately 110,000 RGB-D images, along with joint position annotations and 120 action class labels. The RGB-D images were captured using Microsoft Kinect v2. The resolutions of RGB and depth images are 1920 × 1080 and 512 × 424, respectively. The skeleton annotations include the index corresponding to each type of joint, as defined in Table 2, and the coordinates of each joint in both the RGB and depth images. If a particular joint is not observed due to occlusion or other reasons, the corresponding joint coordinates are filled with zeros. However, these joints that are not visible and, thus, have coordinates filled with zeros are excluded from the evaluation. Although the dataset does not directly provide information such as camera parameters that can be used to align the two images, it is possible to compute the alignment matrix using the pairs of joint coordinates provided in both images. We computed a transformation matrix for each scene based on this alignment method to align the RGB image with the depth image. The weights of the proposed network are initialized using the pretrained weights of the OpenPose model [58]. The OpenPose model was trained using the joint index numbering and types defined by the MS COCO format, as shown in Table 3, which differ from the index numbering and joint types defined in NTU RGB+D 120. To address this issue, we evaluated only the joints that are commonly defined in both formats, which correspond to joints 1 to 14 in the MS COCO format.
Table 2. Indices of joint types in NTU RGB+D 120 dataset.
Table 3. Indices of joint types in MS COCO format.
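Because camera parameters are not provided, the RGB-to-depth alignment mentioned above has to be estimated per scene from the paired joint annotations. A least-squares affine fit such as the following sketch is one way to do this; it is our illustration of the idea rather than the authors’ exact procedure:

```python
import numpy as np

def fit_affine(rgb_xy, depth_xy):
    """Estimate a 2x3 affine matrix A mapping RGB joint coordinates to depth-image
    coordinates from N paired annotations (N >= 3), by least squares.
    rgb_xy, depth_xy: arrays of shape (N, 2)."""
    N = rgb_xy.shape[0]
    X = np.hstack([rgb_xy, np.ones((N, 1))])          # homogeneous RGB coordinates (N, 3)
    A, *_ = np.linalg.lstsq(X, depth_xy, rcond=None)  # solves X @ A ~= depth_xy
    return A.T                                        # 2x3 affine transform

def apply_affine(A, xy):
    """Map (N, 2) points with the 2x3 affine matrix A."""
    return np.hstack([xy, np.ones((xy.shape[0], 1))]) @ A.T
```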

4.2. Performance Metrics in Human Pose Estimation

Object Keypoint Similarity (OKS) [58] is adopted to evaluate the pose estimations for scenes with multiple people. OKS is a concept similar to IoU in object detection and is used to compute Average Precision (AP) and Average Recall (AR). OKS generalizes body joint scale variations across different human sizes as follows:
$$OKS = \frac{\sum_{i \in [0, N-1]} \exp\left( -\frac{d_i^2}{2 s^2 k_i^2} \right) \delta(\upsilon_i)}{\sum_{i \in [0, N-1]} \delta(\upsilon_i)},$$
where $i$ represents the index of each joint; $N$ denotes the total number of joints; $d_i$ is the Euclidean distance between the ground truth and predicted locations for joint $i$; $s$ is the square root of the object area; $k_i$ is a constant assigned based on the standard deviation of joint localization for joint $i$; and $\delta(\upsilon_i)$ is an indicator function based on the visibility flag $\upsilon_i$ in the ground truth annotation, which is 1 if $\upsilon_i > 0$ and 0 otherwise. The COCO keypoint benchmark, which defines 17 joint types, provides joint-wise standard deviations computed across the entire dataset, and these values follow the relationship $k_i = 2\sigma_i$ for joint $i$. The standard deviations for the joint types are presented in Table 4.
Table 4. Standard deviations for joint types.
The standard deviation $\sigma_i$ tends to be larger for the shoulders, knees, and hips than for facial joints such as the nose, eyes, and ears. Since $s$ in OKS is derived from the object area, segmentation data are required. However, the NTU RGB+D dataset does not provide segmentation information for objects. The COCO benchmark uses an empirical approximation when segmentation data are unavailable, in which the object area $s^2$ is obtained by multiplying the bounding box area by 0.53. Therefore, $s$ is determined using this empirical approximation.
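Putting the formula and the area approximation together, a compact NumPy sketch of the OKS computation used here might look as follows (per-joint constants k_i would come from Table 4; names and shapes are our assumptions):

```python
import numpy as np

def oks(pred_xy, gt_xy, visibility, k, bbox):
    """pred_xy, gt_xy: (N, 2) joint coordinates; visibility: (N,) flags (v > 0 counts);
    k: (N,) per-joint constants (k_i = 2 * sigma_i); bbox: (xmin, ymin, xmax, ymax)."""
    # Object area approximated as 0.53 * bounding-box area (segmentation unavailable).
    s2 = 0.53 * (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
    d2 = np.sum((pred_xy - gt_xy) ** 2, axis=1)
    vis = visibility > 0
    e = np.exp(-d2 / (2.0 * s2 * k ** 2))
    return float(e[vis].sum() / max(vis.sum(), 1))
```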
To evaluate joint detection performance, a prediction is considered a true positive (TP) for a given threshold (set within the range of 0 to 1) if its OKS score exceeds the threshold. In cases where multiple predictions are associated with a single GT, only the prediction with the highest OKS score is regarded as a TP, while the others are counted as false positives (FPs). Furthermore, if there is no prediction corresponding to a particular GT, it is counted as a false negative (FN). The precision (P) and recall (R) are calculated as follows:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}.$$
The OKS scores were assessed using thresholds of 0.5 (P50) and 0.75 (P75), respectively. The overall AP is computed by averaging multiple P values over a range of OKS thresholds (0.50:0.05:0.95). To evaluate the impact of object scale variations, performance is also measured for medium-scale objects (APM) with $32^2 < s^2 < 96^2$ and large-scale objects (APL) with $s^2 > 96^2$.

4.3. Performance Evaluation of Human Pose Estimation

To determine the optimal network configuration, self-comparison experiments are conducted by varying the R parameter, which specifies how many early stages fuse features from the depth image. The experiments use a test set of 8921 randomly selected samples from the NTU RGB+D 120 dataset. The experimental results are presented in Table 5. The self-comparison experiments indicate that the RGB-D network with R = 3 achieves the highest performance across all evaluation metrics, including AP at OKS thresholds of 50 and 75, as well as AP and AR across different person sizes. As R increases from 1 to 3, the accuracy continuously improves. This indicates that, in the initial stages for extracting local features, the additional use of depth information enables more effective separation of overlapping joints and facilitates the extraction of more precise features. However, when R is 4 or above, a marked decrease in accuracy is observed. This suggests that, during the later stages for global feature extraction, fusing depth features can adversely affect the representational capabilities of features pertaining to both individual joints and their connectivity. In summary, the 3D topological cues derived from depth images primarily facilitate the representation of local structural features. In contrast, the 2D appearance information contained in color images is generally more influential in modeling global contextual characteristics.
Table 5. Self-comparison results according to the fusion stage adjustment parameter R.
We compared the modality fusion in our proposed method with the commonly used fusion strategies of early fusion and late fusion, as shown in Table 6. In early fusion and late fusion, the inputs from RGB and depth images are merged either before or after the initial feature extraction stage, respectively. Early fusion showed the lowest AP and AR scores, likely due to the fundamental differences in the data domains of depth and color images. While depth images primarily provide structural information such as distance and shape, color images offer visual details such as color and texture. When these heterogeneous features are simply combined in the early stages, the network may fail to adequately learn the distinct characteristics of each modality, resulting in degraded performance. On the other hand, late fusion yielded slightly better performance compared to using only RGB images but still demonstrated lower overall accuracy than the proposed method. This suggests that reinforcing feature representations solely through depth images is insufficient for significant improvements. In contrast, our method substantially enhances the feature representations of the joints by progressively fusing features from both modalities at each stage.
Table 6. Performance comparison of different feature fusion methods.
Table 7 presents the experimental results obtained from the test set. These results show that the pose estimation network with the fusion stage parameter R = 3 achieves an AP improvement of 11.7 and an AR improvement of 13.3 compared to the original OpenPose model. Additionally, as the OKS threshold increases, OpenPose exhibits a significant decline in accuracy and recall, whereas the proposed method shows a relatively lower rate of performance degradation. Despite enhancing pose estimation accuracy, the proposed method increases the computational cost in giga floating-point operations (GFLOPs) by only approximately 44% compared to OpenPose, which can process approximately 200 images per second in real time. Compared to HigherHRNet, which currently achieves SOTA performance in multi-person pose estimation and adopts an aggressive detection approach, the proposed network improves AP by 1.6 and AR by 1.6.
Table 7. Result on subset test set of NTU RGB+D 120 dataset.

4.4. Performance Evaluation for Body-to-Body Occlusion Subset

To evaluate the improvement in pose estimation achieved by the proposed method in body-to-body occlusion scenes, we selected a total of 383 samples from the RGB-D images in the NTU RGB+D 120 dataset. For performance comparison and analysis, joint-wise AP is measured in the occlusion experiments. Although the nose joint of the COCO skeleton format is annotated in the NTU RGB+D 120 dataset, its annotated location varies significantly when side profiles of individuals are captured. Therefore, the AP measurement for the nose joint is excluded from this experiment to ensure a fair evaluation.
The results in Table 8 demonstrate that the proposed method improves detection accuracy by achieving higher AP scores for most joints compared to existing SOTA multi-person pose estimation benchmarks. Additionally, in terms of mean Average Precision (mAP)—computed as the average AP across all joints—the proposed method outperforms OpenPose and HigherHRNet by 22.2 percentage points and 23.4 percentage points, respectively. However, for wrist joints, which are particularly challenging to detect due to their small region size, the RGB-D network, which integrates depth information, achieves better performance than OpenPose but lower AP scores than HigherHRNet, which benefits from high-resolution feature maps that enhance precise joint localization. These results suggest that incorporating depth information into the pose estimation network does not necessarily provide fine-grained feature details for joint localization. However, in inter-person occlusion scenarios, the proposed method effectively leverages depth variations across multiple stages to reinforce the heatmap representation of occluded joints, enabling more accurate joint predictions from the camera’s perspective.
Table 8. Results on evaluating the accuracy of pose estimation in the occlusion test set.

4.5. Qualitative Comparison of Human Pose Estimation

Figure 7 presents a qualitative evaluation of pose estimation, comparing skeleton extraction results obtained using ground truth, HigherHRNet with HRNet-W32 backbone, and the proposed method. In the skeleton extraction analysis of the proposed method, GT annotations for the nose position were excluded from evaluation, as previously explained in the occlusion experiments. This exclusion was necessary because the GT nose annotations were centered on the head rather than precisely located at the nose. Unlike HigherHRNet, which incorrectly produced extra skeletons, the proposed method successfully extracted exactly two skeletons when two individuals were present in the image. However, self-occlusion led to the left arm of the left person being undetected, as most of its joints were not visible. For the visible joints, detection accuracy was high, and no joint-switching problem occurred.
Figure 7. Comparison of pose estimation results of HigherHRNet and proposed method: (a) ground truth; (b) HigherHRNet [18]; (c) ours (R = 3).
Figure 8 visualizes the confidence maps for joints in occluded scenes. In Figure 8, the red and green circles mark the same joint type on different individuals. The traditional OpenPose often fails to accurately localize body parts within occluded areas. In contrast, the proposed method detects the corresponding joints more precisely by leveraging the additional 3D topological information from depth images.
Figure 8. Visualization of confidence maps for joints in occluded scenes.

5. Discussion and Future Research

In this study, we modify the structure of OpenPose, which is a widely used bottom-up HPE model, to additionally extract features from depth images and progressively fuse them with the features from color images. The additional use of depth images significantly reduces instances where joints of different individuals are mistakenly assigned to the same skeleton in occluded scenes. Furthermore, compared to conventional early fusion or late fusion approaches, in which features are integrated either at the feature extraction stage or at a later stage, the proposed progressive fusion further improves the accuracy of individual skeleton detection. Within the proposed HPE model, global features are derived from color images, while depth images serve as a valuable source of local 3D topological information. Despite the exclusion of occluded joints from the OKS calculation, we observed that the influence of other joints on the remaining joints is reduced. This indicates that the proposed method can maintain pose estimation consistency by suppressing body-to-body interference in occluded scenes. While this study presents performance results on a collective set of occlusion samples, future work examining the impact of different occlusion severities may provide further insight into the robustness of the proposed model.
Although NTU RGB+D 120 provides enough body-to-body occlusion scenes, it was optimized for action recognition. Consequently, the skeleton annotations may have lower precision. This can limit the accurate evaluation of the proposed method.
The proposed feature fusion method was applied to OpenPose, but it can also be extended to the latest SOTA bottom-up HPE models. Furthermore, the HPE performance can be improved by employing advanced feature extractors such as HRNet or CSPNeXt instead of VGG. While the proposed method uses 1 × 1 convolution operations for feature fusion, introducing more sophisticated fusion techniques such as attention-based fusion or cross-modal gating could further enhance the representation capacity.

6. Conclusions

We proposed a method for improving human pose estimation in scenes with occluded bodies by using RGB-D images. The proposed method iteratively fused body joint information extracted from both color and depth images, thereby enhancing the final joint representation. To evaluate performance in occlusion-prone environments, we also introduced a procedure for extracting occluded samples from conventional pose estimation test datasets, and the performance of the proposed method was quantitatively assessed using the constructed occlusion test set. The experimental results demonstrated that the proposed approach improved pose estimation accuracy in occluded environments by using both color information and topological cues to detect human joints effectively. Specifically, the proposed method improved most per-joint AP values in the occlusion test set, and the mAP increased by 22.2 and 23.4 percentage points compared to OpenPose and HigherHRNet, respectively. These results demonstrate the effectiveness of the proposed method in pose estimation for scenes with body-to-body occlusions. The proposed pose estimation method is expected to be particularly beneficial in real-time applications where occlusions frequently occur, such as surveillance video analysis, action recognition, and autonomous driving. Moreover, beyond human pose estimation, the method can be extended to structurally defined objects, offering a highly efficient and generalizable approach to object detection in occlusion-prone environments.

Author Contributions

Conceptualization, J.-h.Y. and S.-k.K.; software, J.-h.Y.; writing—original draft preparation, J.-h.Y. and S.-k.K.; supervision, S.-k.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partly supported by the Innovative Human Resource Development for Local Intellectualization program through the Institute of Information & Communications Technology Planning & Evaluation (IITP), grant funded by the Korean government (MSIT) (IITP-2025-RS-2020-II201791, 100%), by the BB21plus funded by Busan Metropolitan City and Busan Techno Park, and by a Dong-eui University Grant (202501170001).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yue, R.; Tian, Z.; Du, S. Action recognition based on RGB and skeleton data sets: A survey. Neurocomputing 2022, 512, 287–306. [Google Scholar] [CrossRef]
  2. Bux, A.; Angelov, P.; Habib, Z. Vision based human activity recognition: A review. In Proceedings of the UK Workshop on Computational Intelligence, Lancaster, UK, 7–9 September 2016; pp. 341–371. [Google Scholar]
  3. Vrigkas, M.; Nikou, C.; Kakadiaris, I.A. A review of human activity recognition methods. Front. Robot. AI 2015, 2, 28. [Google Scholar] [CrossRef]
  4. Duan, H.; Zhao, Y.; Chen, K.; Lin, D.; Dai, B. Revisiting skeleton-based action recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–24 June 2022; pp. 2969–2978. [Google Scholar]
  5. Karim, M.; Khalid, S.; Aleryani, A.; Khan, J.; Ullah, I.; Ali, Z. Human Action Recognition Systems: A Review of the Trends and State-of-the-Art. IEEE Access 2024, 12, 36372–36390. [Google Scholar] [CrossRef]
  6. Liu, Z.; Zhu, J.; Bu, J.; Chen, C. A survey of human pose estimation: The body parts parsing based methods. J. Vis. Commun. Image Represent. 2015, 32, 10–19. [Google Scholar] [CrossRef]
  7. Wang, P.; Li, W.; Ogunbona, P.; Wan, J.; Escalera, S. RGB-D-based human motion recognition with deep learning: A survey. Comput. Vis. Image Underst. 2018, 171, 118–139. [Google Scholar] [CrossRef]
  8. Presti, L.L.; La Cascia, M. 3D skeleton-based human action classification: A survey. Pattern Recognit. 2016, 53, 130–147. [Google Scholar] [CrossRef]
  9. Tome, D.; Russell, C.; Agapito, L. Lifting from the deep: Convolutional 3D pose estimation from a single image. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2500–2509. [Google Scholar]
  10. Huang, H.; Wang, Y.; Linghu, K.; Xia, Z. Multi-modal micro-gesture classification via multi-scale heterogeneous ensemble network. In Proceedings of the Workshop & Challenge on Micro-Gesture Analysis for Hidden Emotion Understanding, Jeju, Republic of Korea, 3–9 August 2024. [Google Scholar]
  11. Wang, Y.; Rui, K.; Huang, H.; Xia, Z. Micro-gesture online recognition with dual-stream multi-scale transformer in long videos. In Proceedings of the Workshop & Challenge on Micro-Gesture Analysis for Hidden Emotion Understanding, Jeju, Republic of Korea, 3–9 August 2024. [Google Scholar]
  12. Dang, Q.; Yin, J.; Wang, B.; Zheng, W. Deep learning based 2D human pose estimation: A survey. Tsinghua Sci. Technol. 2019, 24, 663–676. [Google Scholar] [CrossRef]
  13. Lan, G.; Wu, Y.; Hu, F.; Hao, Q. Vision-based human pose estimation via deep learning: A survey. IEEE Trans. Hum.-Mach. Syst. 2022, 53, 253–268. [Google Scholar] [CrossRef]
  14. Wang, C.; Zhang, F.; Ge, S.S. A comprehensive survey on 2D multi-person pose estimation methods. Eng. Appl. Artif. Intell. 2021, 102, 104260. [Google Scholar] [CrossRef]
  15. Gamra, M.B.; Akhloufi, M.A. A review of deep learning techniques for 2D and 3D human pose estimation. Image Vis. Comput. 2021, 114, 104282. [Google Scholar] [CrossRef]
  16. Chen, Y.; Tian, Y.; He, M. Monocular human pose estimation: A survey of deep learning-based methods. Comput. Vis. Image Underst. 2020, 192, 102897. [Google Scholar] [CrossRef]
  17. Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. ViTPose: Simple vision transformer baselines for human pose estimation. In Proceedings of the Conference on Neural Information Processing Systems, Virtual, 28 November–9 December 2022; pp. 38571–38584. [Google Scholar]
  18. Jiang, T.; Lu, P.; Zhang, L.; Ma, N.; Han, R.; Lyu, C.; Li, Y.; Chen, K. RTMPose: Real-time multi-person pose estimation based on MMpose. arXiv 2023, arXiv:2303.07399. [Google Scholar] [CrossRef]
  19. Yang, S.; Quan, Z.; Nie, M.; Yang, W. TransPose: Keypoint localization via transformer. In Proceedings of the International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 11802–11812. [Google Scholar]
  20. Cheng, B.; Xiao, B.; Wang, J.; Shi, H.; Huang, T.S.; Zhang, L. HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5386–5395. [Google Scholar]
  21. Zhao, M.; Li, T.; Abu Alsheikh, M.; Tian, Y.; Zhao, H.; Torralba, A.; Katabi, D. Through-wall human pose estimation using radio signals. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7356–7365. [Google Scholar]
  22. Ghafoor, M.; Mahmood, A. Quantification of occlusion handling capability of a 3D human pose estimation framework. IEEE Trans. Multimed. 2022, 25, 3311–3318. [Google Scholar] [CrossRef]
  23. Chen, B.; Chin, T.J.; Klimavicius, M. Occlusion-robust object pose estimation with holistic representation. In Proceedings of the Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 2929–2939. [Google Scholar]
  24. Munea, T.L.; Jembre, Y.Z.; Weldegebriel, H.T.; Chen, L.; Huang, C.; Yang, C. The progress of human pose estimation: A survey and taxonomy of models applied in 2D human pose estimation. IEEE Access 2020, 8, 133330–133348. [Google Scholar] [CrossRef]
  25. Bragagnolo, L.; Terreran, M.; Allegro, D.; Ghidoni, S. Multi-view Pose Fusion for Occlusion-Aware 3D Human Pose Estimation. arXiv 2024, arXiv:2408.15810. [Google Scholar] [CrossRef]
  26. Zhou, G.; Yan, Y.; Wang, D.; Chen, Q. A novel depth and color feature fusion framework for 6D object pose estimation. IEEE Trans. Multimed. 2020, 23, 1630–1639. [Google Scholar] [CrossRef]
  27. Kazakos, E.; Nikou, C.; Kakadiaris, I.A. On the fusion of RGB and depth information for hand pose estimation. In Proceedings of the International Conference on Image Processing, Athens, Greece, 7–10 October 2018; pp. 868–872. [Google Scholar]
  28. Wang, Z.; Lu, Y.; Ni, W.; Song, L. An RGB-D based approach for human pose estimation. In Proceedings of the International Conference on Networking Systems of AI, Shanghai, China, 19–20 November 2021; pp. 166–170. [Google Scholar]
  29. Wei, S.E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional Pose Machines. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4724–4732. [Google Scholar]
  30. Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299. [Google Scholar]
  31. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar]
  32. Newell, A.; Huang, Z.; Deng, J. Associative embedding: End-to-end learning for joint detection and grouping. In Proceedings of the Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  33. Chen, X.; Yang, C.; Mo, J.; Sun, Y.; Karmouni, H.; Jiang, Y.; Zheng, Z. CSPNeXt: A new efficient token hybrid backbone. Eng. Appl. Artif. Intell. 2024, 132, 107886. [Google Scholar] [CrossRef]
  34. Li, Y.; Yang, S.; Liu, P.; Zhang, S.; Wang, Y.; Wang, Z.; Yang, W.; Xia, S. SimCC: A simple coordinate classification perspective for human pose estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 89–106. [Google Scholar]
  35. Yang, C.H.; Kong, K.B.; Min, S.J.; Wee, D.Y.; Jang, H.D.; Cha, G.H.; Kang, S.J. SEFD: Learning to distill complex pose and occlusion. In Proceedings of the International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 14895–14906. [Google Scholar]
  36. Purkrabek, M.; Matas, J. Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle. arXiv 2024, arXiv:2412.01562. [Google Scholar] [CrossRef]
  37. Artacho, B.; Savakis, A. Full-BAPose: Bottom Up Framework for Full Body Pose Estimation. Sensors 2023, 23, 3725. [Google Scholar] [CrossRef]
  38. Qu, H.; Cai, Y.; Foo, L.G.; Kumar, A.; Liu, J. A characteristic function-based method for bottom-up human pose estimation. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023. [Google Scholar]
  39. Bai, X.; Wei, X.; Wang, Z.; Zhang, M. CONet: Crowd and occlusion-aware network for occluded human pose estimation. Neural Netw. 2024, 172, 106109. [Google Scholar] [CrossRef]
  40. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar] [CrossRef]
  41. Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018. [Google Scholar]
  42. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  43. Wang, C.Y.; Liao, H.Y.M.; Yeh, I.H.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020. [Google Scholar]
  44. Xie, S.; Girshick, R.; Dollar, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  45. Amin, A.; Tamajo, A.; Klugman, I.; Stoev, E.; Fisho, T.; Lim, H.; Kim, H. Real-time 3D multi-person pose estimation using an omnidirectional camera and mmWave radars. In Proceedings of the International Conference on Engineering and Emerging Technologies, Seoul, Republic of Korea, 20–22 October 2023; pp. 1–6. [Google Scholar]
  46. Knap, P.; Hardy, P.; Tamajo, A.; Lim, H.; Kim, H. Real-time omnidirectional 3D multi-person human pose estimation with occlusion handling. In Proceedings of the ACM SIGGRAPH European Conference on Visual Media Production, London, UK, 6–8 November 2023. [Google Scholar]
  47. Knap, P.; Hardy, P.; Tamajo, A.; Lim, H.; Kim, H. Improving real-time omnidirectional 3D multi-person human pose estimation with people matching and unsupervised 2D–3D lifting. In Proceedings of the International Conference on Electronics, Information, and Communication, Jeju Island, Republic of Korea, 10–13 January 2024; pp. 1–4. [Google Scholar]
  48. Sengupta, A.; Jin, F.; Cao, S. NLP based skeletal pose estimation using mmWave radar point-cloud: A simulation approach. In Proceedings of the IEEE Radar Conference, Atlantic City, NJ, USA, 21–24 September 2020; pp. 1–6. [Google Scholar]
  49. An, S.; Ogras, U.Y. Fast and scalable human pose estimation using mmWave point cloud. In Proceedings of the Design Automation Conference, San Francisco, CA, USA, 10–14 July 2022; pp. 889–894. [Google Scholar]
  50. Li, G.; Zhang, Z.; Yang, H.; Pan, J.; Chen, D.; Zhang, J. Capturing human pose using mmWave radar. In Proceedings of the International Conference on Pervasive Computing and Communications Workshops, Austin, TX, USA, 23–27 March 2020; pp. 1–6. [Google Scholar]
  51. Fürst, M.; Gupta, S.T.; Schuster, R.; Wasenmüller, O.; Stricker, D. HPERL: 3D human pose estimation from RGB and LiDAR. In Proceedings of the International Conference on Pattern Recognition, Milano, Italy, 10–15 January 2021; pp. 7321–7327. [Google Scholar]
  52. Ye, D.; Xie, Y.; Chen, W.; Zhou, Z.; Ge, L.; Foroosh, H. LPFormer: LiDAR pose estimation transformer with multi-task network. In Proceedings of the International Conference on Robotics and Automation, Yokohama, Japan, 18–22 May 2024; pp. 16432–16438. [Google Scholar]
  53. Knap, P. Human modelling and pose estimation overview. arXiv 2024, arXiv:2406.19290. [Google Scholar] [CrossRef]
  54. Park, S.; Ji, M.; Chun, J. 2D human pose estimation based on object detection using RGB-D information. KSII Trans. Internet Inf. Syst. 2018, 12, 800–816. [Google Scholar] [CrossRef]
  55. Hirschmuller, H. Stereo processing by semiglobal matching and mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 30, 328–341. [Google Scholar] [CrossRef] [PubMed]
  56. Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019. [Google Scholar]
  57. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701. [Google Scholar] [CrossRef] [PubMed]
  58. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
