1. Introduction
Human keypoint detection, which focuses on localizing human anatomical landmarks, is a core problem in computer vision, with wide-ranging applications, including vision intelligence [1,2,3,4,5], image processing [6,7,8,9,10], video retrieval, human–computer interaction, and sports and healthcare motion analysis. The success of these downstream tasks critically depends on accurate and robust detection of skeletal keypoints, which serve as the foundation for understanding human posture [11,12,13,14,15,16,17], behavior, and intent. An overview of the complete human keypoint detection pipeline is illustrated in Figure 1, which depicts the sequential process from input image acquisition through feature extraction, keypoint generation, and final coordinate regression, providing a concise visual summary of our proposed approach for both single- and multi-person scenarios.
Recent years have witnessed rapid progress in human keypoint detection, driven by advances in deep learning [18,19,20,21,22,23,24,25,26,27,28,29,30] and the availability of large-scale datasets such as MS COCO [31] and PoseTrack [32]. Numerous high-performance architectures have emerged, ranging from early regression-based methods like DeepPose [33] to heatmap-based approaches such as Stacked Hourglass Networks [34] and high-resolution frameworks like HRNet [35]. Multi-person pose estimation is typically categorized as top-down [35,36], which detects human bounding boxes before estimating keypoints, or bottom-up [37,38], which detects all keypoints and groups them into individuals. Top-down methods generally achieve higher accuracy at the cost of greater computation, while bottom-up methods offer higher efficiency but may lose accuracy in crowded scenes.
Despite notable advances, human keypoint detection remains challenging in unconstrained environments. Complex backgrounds, frequent occlusions, large pose variations, and scale diversity continue to hinder model performance. Early approaches based on hand-crafted features and shallow models exhibited limited generalization in such scenarios. The emergence of deep learning, particularly Convolutional Neural Networks (CNNs), has significantly improved accuracy by enabling end-to-end learning frameworks. In multi-person settings, two mainstream paradigms have emerged: bottom-up approaches [37,38], which first detect all keypoints and subsequently group them into individual skeletons, and top-down approaches [35,36], which detect person instances before applying single-person pose estimation within each bounding box.
Among modern architectures, the High-Resolution Network (HRNet) [35] stands out for its strong performance in spatial localization tasks. Unlike conventional backbones that progressively downsample feature maps, HRNet maintains high-resolution representations throughout the network via parallel multi-scale streams and iterative feature fusion. This design effectively balances spatial precision and semantic richness, making HRNet particularly suitable for keypoint heatmap generation and downstream pose estimation tasks.
In this work, we propose a comprehensive pose estimation pipeline based on HRNet. We develop both single-person and multi-person keypoint detection frameworks, training our models on the COCO dataset. Furthermore, we extend our framework to real-time multi-person scenarios by integrating a top-down detection strategy that combines Faster-RCNN and HRNet. Extensive experiments demonstrated that our approach achieved competitive accuracy, while maintaining practical inference efficiency, offering a flexible solution for real-world applications.
Our main contributions are summarized as follows:
We develop a human keypoint detection pipeline based on HRNet, which supports both single-person and multi-person pose estimation. Our approach achieves high accuracy and robustness in detecting keypoints, even in challenging environments with complex backgrounds and occlusions. By maintaining high-resolution feature maps throughout the network, our method ensures precise localization of keypoints across various poses and scales.
We integrate a top-down detection framework with HRNet, enabling real-time multi-person pose estimation. This integration utilizes a person detector, such as Faster-RCNN, to localize individual persons in the image, followed by keypoint detection for each person instance. The resulting system offers real-time performance, while maintaining high accuracy, making it suitable for deployment in real-world applications requiring both speed and precision.
The effectiveness of our approach was thoroughly validated on the COCO benchmark, where we achieved competitive performance in terms of Average Precision (AP) and inference efficiency. Our method demonstrated robust performance under diverse conditions, including crowded scenes and occlusions, while maintaining a fast inference speed, making it a highly efficient solution for real-time applications in fields like intelligent surveillance and human–computer interaction.
3. Methodology
In this section, we describe our proposed human keypoint detection framework in detail. Our method is based on a High-Resolution Network (HRNet) backbone and consists of two pipelines: single-person keypoint detection and multi-person keypoint detection. The overall architecture is illustrated in Figure 2.
3.1. Overall Pipeline
Our approach follows a top-down detection paradigm, which simplifies the task of multi-person keypoint detection by dividing it into two main phases: person localization and keypoint detection. For single-person scenarios, we assume that the target person has already been localized within a cropped image, and keypoint detection is directly performed on the input image. This approach leverages the precision of single-person keypoint detection models to achieve accurate localization in simpler settings.
In multi-person scenarios, we first utilize a person detector (such as Faster-RCNN [45]) to generate bounding boxes for each individual in the image. These bounding boxes define the regions of interest, which are then used to crop the image into smaller patches [46]. For each cropped region, we apply the HRNet-based keypoint detection network, which maintains high-resolution representations throughout the network and provides detailed spatial information. In our multi-person experiments, we integrated the top-down detection pipeline with MMPose [47], an open-source pose estimation framework. MMPose adopts a modular design, separating person detection and keypoint estimation into interchangeable components. This allows us to seamlessly combine a Faster R-CNN detector with our HRNet-based keypoint estimator, while benefiting from MMPose’s efficient data preprocessing, bounding-box cropping, and inference scheduling.
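The sketch below illustrates this style of integration using MMPose’s high-level inference API; note that the `MMPoseInferencer` class, the `'human'` alias, and the result-dictionary keys follow recent MMPose 1.x releases and may differ in other versions, and the image path is a placeholder rather than a file from our experiments.
```python
# Hedged sketch: top-down multi-person inference via MMPose 1.x.
# The MMPoseInferencer API and the 'human' alias (a person detector
# followed by a keypoint estimator) are assumptions based on recent
# MMPose releases; check the installed version's docs before use.
from mmpose.apis import MMPoseInferencer

# 'human' selects a default top-down human pose pipeline.
inferencer = MMPoseInferencer('human')

# The inferencer yields one result dict per input image.
result_generator = inferencer('example.jpg', show=False)
result = next(result_generator)

# Each prediction holds per-person keypoints and confidence scores.
for person in result['predictions'][0]:
    print(person['keypoints'], person['keypoint_scores'])
```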
Given an input image \( I \in \mathbb{R}^{H \times W \times 3} \), the network outputs a set of \( K \) keypoints \( P = \{(x_k, y_k)\}_{k=1}^{K} \), where each \( (x_k, y_k) \) denotes the predicted coordinates of the k-th keypoint. This process is illustrated in Algorithm 1:
Algorithm 1 Keypoint Detection Pipeline
Require: Input image \( I \), pretrained HRNet model \( \mathcal{M} \), (optional) person detector \( \mathcal{D} \)
Ensure: Predicted keypoints \( P \)
1: if multi-person setting then
2:   Use \( \mathcal{D} \) to detect person bounding boxes \( \{B_i\} \)
3: else
4:   Use full image \( I \) for single-person detection
5: end if
6: for each detected person instance \( B_i \) or full image do
7:   Crop and resize image patch according to \( B_i \) or \( I \)
8:   Feed into \( \mathcal{M} \) to obtain keypoint heatmaps \( \{\hat{H}_k\}_{k=1}^{K} \)
9:   for each keypoint \( k \) do
10:     Compute \( (x_k, y_k) = \arg\max_{(x,y)} \hat{H}_k(x, y) \)
11:   end for
12:   Collect keypoint set \( P \) for the current person
13: end for
14: return all predicted keypoints for all persons
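As a concrete illustration, the following Python sketch mirrors Algorithm 1, assuming `detector` and `pose_net` are pretrained PyTorch modules with the stated input/output shapes; all function and variable names here are hypothetical rather than taken from our released code.
```python
# Minimal sketch of Algorithm 1 in PyTorch. `detector` is assumed to
# return person boxes (x1, y1, x2, y2); `pose_net` maps a cropped
# patch to K keypoint heatmaps. Shapes and names are illustrative.
import torch
import torch.nn.functional as F

def detect_keypoints(image, pose_net, detector=None, patch_size=(256, 192)):
    """image: float tensor of shape (3, H, W), values in [0, 1]."""
    if detector is not None:                      # multi-person setting
        boxes = detector(image.unsqueeze(0))      # (N, 4) person boxes
    else:                                         # single-person setting
        _, H, W = image.shape
        boxes = torch.tensor([[0.0, 0.0, W, H]])  # whole image as one box

    all_keypoints = []
    for x1, y1, x2, y2 in boxes.round().long():
        patch = image[:, y1:y2, x1:x2].unsqueeze(0)
        patch = F.interpolate(patch, size=patch_size, mode='bilinear',
                              align_corners=False)
        heatmaps = pose_net(patch)[0]             # (K, h, w)

        kpts = []
        for hm in heatmaps:                       # argmax decoding per keypoint
            idx = torch.argmax(hm)
            py, px = divmod(idx.item(), hm.shape[1])
            # Map heatmap coordinates back to the original image frame.
            kpts.append((x1.item() + px * (x2 - x1).item() / hm.shape[1],
                         y1.item() + py * (y2 - y1).item() / hm.shape[0]))
        all_keypoints.append(kpts)
    return all_keypoints
```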
3.2. HRNet Architecture
The HRNet backbone is designed to maintain high-resolution representations throughout the entire network, ensuring that both fine-grained spatial details and rich semantic information are captured at each stage. The architecture employs parallel multi-resolution branches, where each branch is responsible for processing features at different resolutions. This design enables the network to achieve a balance between spatial precision and semantic richness, making it particularly effective for tasks like human keypoint detection. The feature extraction process in HRNet consists of four main stages, each contributing to the network’s ability to maintain high-resolution representations and effectively fuse multi-scale features.
Stage 1: The first stage begins by downsampling the input image to \( \tfrac{1}{4} \) resolution using standard convolutional layers. This stage reduces the spatial resolution while capturing broad, low-level features such as edges and textures, which provide the foundation for subsequent feature extraction. The downsampling lets the network process larger regions of the image at once, which is essential for capturing global patterns and semantic information.
Stage 2: A second branch is introduced at \( \tfrac{1}{8} \) resolution, in parallel with the existing branch. At the end of this stage, a fusion module performs bidirectional information exchange: high-resolution features are downsampled and added to lower-resolution features, while low-resolution features are upsampled and added to higher-resolution features. This allows the network to combine the detailed spatial information from the high-resolution branch with the semantic context from the low-resolution branch.
Stage 3: A third branch at \( \tfrac{1}{16} \) resolution is added. All three branches exchange information via an expanded fusion module that upsamples and downsamples feature maps as needed, so that every branch receives inputs from all other branches. This repeated multi-scale fusion ensures that features at each resolution are enriched with both fine detail and large-scale context.
Stage 4: A fourth branch at \( \tfrac{1}{32} \) resolution is introduced, and multi-scale fusion is again applied across all four branches. This final stage enables the network to capture the broadest possible semantic context, while retaining precise spatial localization in the high-resolution branch.
At each stage, the network produces multi-scale feature maps, denoted as \( \{F_s\}_{s=1}^{S} \), where \( S \) is the number of scales (resolutions) used in the network. These feature maps capture a range of spatial and semantic information, with each map corresponding to a different resolution. The fusion of these multi-scale feature maps is crucial for keypoint detection, as it allows the network to retain both the fine spatial details necessary for precise localization and the broader semantic context needed for accurate scene understanding.
The HRNet architecture’s ability to preserve high-resolution representations throughout the entire network, along with its repeated multi-scale feature fusion, makes it particularly effective for human keypoint detection, where accurate localization of fine details is crucial. This unique design enables HRNet to outperform traditional networks that progressively downsample feature maps, making it a powerful tool for tasks requiring high spatial precision.
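To make the fusion step concrete, the sketch below shows one possible two-branch cross-resolution exchange unit in PyTorch. The structure (a strided 3×3 convolution for downsampling, a 1×1 convolution plus bilinear upsampling for upsampling, followed by element-wise summation) follows the original HRNet design, but the class name, channel counts, and tensor sizes are our own illustrative choices.
```python
# Illustrative two-branch HRNet-style fusion unit: the high-resolution
# branch is downsampled with a stride-2 3x3 conv, the low-resolution
# branch is upsampled with a 1x1 conv plus bilinear interpolation, and
# the outputs are summed per branch. Channel counts are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchFusion(nn.Module):
    def __init__(self, c_high=32, c_low=64):
        super().__init__()
        # high -> low: stride-2 conv halves the spatial resolution
        self.down = nn.Conv2d(c_high, c_low, kernel_size=3, stride=2, padding=1)
        # low -> high: 1x1 conv matches channels before upsampling
        self.up = nn.Conv2d(c_low, c_high, kernel_size=1)

    def forward(self, x_high, x_low):
        # x_high: (B, c_high, H, W); x_low: (B, c_low, H/2, W/2)
        up = F.interpolate(self.up(x_low), size=x_high.shape[-2:],
                           mode='bilinear', align_corners=False)
        fused_high = x_high + up                 # high branch gains context
        fused_low = x_low + self.down(x_high)    # low branch gains detail
        return fused_high, fused_low

# Example: fuse 1/4- and 1/8-resolution feature maps.
f = TwoBranchFusion()
h, l = f(torch.randn(1, 32, 64, 48), torch.randn(1, 64, 32, 24))
```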
3.3. Keypoint Heatmap Prediction
The final output of HRNet is a set of keypoint heatmaps, denoted as \( \{\hat{H}_k\}_{k=1}^{K} \), where each heatmap \( \hat{H}_k \) represents the likelihood of the k-th keypoint appearing at each spatial location in the image. Each heatmap is essentially a probability map, where each pixel’s value indicates the likelihood of the keypoint being located at that position; the higher the value, the more likely the keypoint is present there.
To train the model, we generate a ground-truth heatmap for each keypoint using a 2D Gaussian distribution centered at the true keypoint location \( (x_k, y_k) \). The Gaussian provides a smooth and continuous representation of the keypoint, with the highest probability at the true location and probability decaying with distance from it. This allows the network to learn a distribution over the keypoint location, rather than a binary classification.
The ground-truth heatmap \( H_k^{*} \) for the k-th keypoint is defined as
\[
H_k^{*}(x, y) = \exp\!\left( -\frac{(x - x_k)^2 + (y - y_k)^2}{2\sigma^2} \right),
\]
where
- \( (x, y) \) represents the spatial coordinates of a pixel in the heatmap;
- \( (x_k, y_k) \) is the true location of the k-th keypoint in the image;
- \( \sigma \) is a hyperparameter that controls the spread of the Gaussian. A larger \( \sigma \) results in a broader spread, while a smaller \( \sigma \) creates a narrower peak around the keypoint.
The Gaussian distribution is used to create smooth heatmaps that allow the network to focus on local regions around each keypoint and learn to predict the most likely location of each keypoint. This process is crucial for handling slight inaccuracies in the predicted keypoints and enables the model to generalize better in real-world scenarios with occlusions or ambiguous poses.
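As a minimal sketch, the target heatmap defined above can be rendered as follows; the heatmap size and \( \sigma \) value here are illustrative defaults, not the exact settings used in our experiments.
```python
# Minimal sketch: render the 2D Gaussian ground-truth heatmap defined
# above. Heatmap size and sigma are illustrative assumptions.
import numpy as np

def gaussian_heatmap(xk, yk, height=64, width=48, sigma=2.0):
    """Return H*_k with a peak of 1.0 at the true keypoint (xk, yk)."""
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs - xk) ** 2 + (ys - yk) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

hm = gaussian_heatmap(xk=20, yk=30)
print(hm.shape, hm.max())  # (64, 48) 1.0
```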
3.4. Loss Function
The loss function plays a crucial role in training the model by guiding it to minimize the difference between the predicted and the ground-truth keypoint heatmaps. To achieve this, we adopt the Mean Squared Error (MSE) loss between the predicted heatmaps and the ground-truth heatmaps.
For each keypoint \( k \), the predicted heatmap \( \hat{H}_k \) is compared to the ground-truth heatmap \( H_k^{*} \), and the squared error between the two is computed at each pixel \( (x, y) \). The total loss is averaged over all \( K \) keypoints:
\[
\mathcal{L}_{\text{MSE}} = \frac{1}{K} \sum_{k=1}^{K} \sum_{x, y} \left( \hat{H}_k(x, y) - H_k^{*}(x, y) \right)^2.
\]
This MSE loss drives the predicted heatmaps toward the ground-truth heatmaps, and hence the predicted keypoint locations toward the ground-truth locations.
For multi-person keypoint detection, an additional loss term accounts for the person detection process. This term, the object detection loss \( \mathcal{L}_{\text{det}} \), comes from the person detector used to localize each individual in the image and encourages accurate bounding boxes around each person. The total loss for multi-person keypoint detection combines the MSE loss and the object detection loss:
\[
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MSE}} + \lambda \, \mathcal{L}_{\text{det}},
\]
where \( \lambda \) is a balancing hyperparameter that determines the relative importance of the keypoint detection loss and the person detection loss. In our implementation, we set \( \lambda = 0.5 \), following the sensitivity analysis in Section 4.4.1, so that the model optimizes person detection and keypoint localization simultaneously with a balanced trade-off.
By combining both losses, the network learns to detect and localize keypoints for both single-person and multi-person scenarios, effectively handling complex detection tasks with high accuracy.
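A minimal sketch of this combined objective in PyTorch is shown below, assuming the person detector exposes its own scalar loss; `lambda_det = 0.5` follows the setting above, and all names are illustrative.
```python
# Sketch of the training objective: per-keypoint heatmap MSE plus a
# weighted detector loss. `det_loss` is assumed to come from the
# person detector's own training step; names are illustrative.
import torch

def keypoint_mse(pred, target):
    """pred, target: (B, K, h, w) heatmaps. Squared error summed over
    pixels, then averaged over keypoints and the batch."""
    per_kpt = ((pred - target) ** 2).sum(dim=(2, 3))  # (B, K)
    return per_kpt.mean()

def total_loss(pred, target, det_loss=None, lambda_det=0.5):
    loss = keypoint_mse(pred, target)
    if det_loss is not None:          # multi-person setting
        loss = loss + lambda_det * det_loss
    return loss

pred = torch.rand(2, 17, 64, 48, requires_grad=True)
target = torch.rand(2, 17, 64, 48)
total_loss(pred, target, det_loss=torch.tensor(0.8)).backward()
```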
3.5. Training Details
Our models were trained on the COCO dataset, which is widely used for human keypoint detection tasks and provides a diverse set of images with annotated keypoints. To improve the robustness and generalization of the model, we employed standard data augmentation strategies during training, including random flipping, rotation, and scale jittering. These augmentations helped the model become invariant to common transformations, making it more effective in real-world scenarios with varying poses and viewpoints.
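One subtlety of these augmentations worth making explicit is horizontal flipping: mirroring a person image also swaps left/right keypoint labels. A minimal sketch, assuming the standard 17-keypoint COCO joint ordering, is given below.
```python
# Sketch of flip augmentation for keypoints. When an image is mirrored,
# left/right joint labels must be swapped. The pair indices below assume
# the standard 17-keypoint COCO ordering.
import numpy as np

# (left, right) index pairs in COCO order: eyes, ears, shoulders,
# elbows, wrists, hips, knees, ankles.
FLIP_PAIRS = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12),
              (13, 14), (15, 16)]

def flip_keypoints(kpts, img_width):
    """kpts: (17, 2) array of (x, y). Mirror x and swap paired joints."""
    flipped = kpts.copy()
    flipped[:, 0] = img_width - 1 - flipped[:, 0]  # mirror x-coordinates
    for l, r in FLIP_PAIRS:
        flipped[[l, r]] = flipped[[r, l]]          # swap left/right labels
    return flipped

kpts = np.random.rand(17, 2) * 192
print(flip_keypoints(kpts, img_width=192)[:3])
```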
The optimizer used for training was Adam [48], which is well suited to training deep models. The learning rate was decayed over time using a step scheduler, which helped the model converge efficiently and prevented overshooting the optimum in the later stages of training. A total of 210 epochs were used, allowing the model to refine its parameters over many iterations. Additionally, we applied a warm-up learning rate adjustment during the initial epochs, which stabilized training and prevented premature convergence.
Once the model has been trained, inference is performed using a simple post-processing strategy. For each predicted heatmap, the keypoint location is obtained as the pixel with the highest probability, i.e., the argmax of the heatmap:
\[
(\hat{x}_k, \hat{y}_k) = \arg\max_{(x, y)} \hat{H}_k(x, y).
\]
This ensures that the predicted keypoint location corresponds to the peak of the Gaussian-shaped response in the heatmap, representing the most likely position of the keypoint.
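A one-function sketch of this decoding step, assuming the heatmaps are stored in a NumPy array, is shown below.
```python
# Sketch of argmax heatmap decoding: each keypoint is placed at the
# pixel with the highest response. The array layout is an assumption.
import numpy as np

def decode_heatmaps(heatmaps):
    """heatmaps: (K, h, w) array -> (K, 2) array of (x, y) coordinates."""
    K, h, w = heatmaps.shape
    flat_idx = heatmaps.reshape(K, -1).argmax(axis=1)
    ys, xs = np.divmod(flat_idx, w)   # row index = y, column index = x
    return np.stack([xs, ys], axis=1)

coords = decode_heatmaps(np.random.rand(17, 64, 48))
print(coords.shape)  # (17, 2)
```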
Overall, this pipeline provides a robust and efficient solution for both single-person and multi-person keypoint detection tasks. By leveraging a combination of high-quality data, effective optimization techniques, and simple yet effective post-processing strategies, the model achieves competitive performance on benchmark datasets such as COCO.
4. Experiment
4.1. Dataset and Evaluation Metrics
We conducted comprehensive experiments on the widely used COCO 2017 dataset [31], which contains over 200,000 images and 250,000 person instances labeled with 17 keypoints. Following the standard protocol, we trained our models on the train2017 split and evaluated on the val2017 split.
For evaluation, we adopted the standard COCO metrics, Average Precision (AP) and Average Recall (AR), which are commonly used to assess object detection and keypoint detection performance. Below, we describe these metrics and the formulas used to calculate them. Following the COCO evaluation protocol, APM and APL are computed over the same OKS threshold range as AP0.5:0.95 (0.50 to 0.95 in steps of 0.05), but restricted to medium-scale and large-scale person instances, respectively.
4.1.1. Average Precision (AP)
Average Precision (AP) was used to measure the accuracy of the predicted keypoints by comparing the predicted keypoint locations to the ground-truth locations. AP was computed under different OKS (Object Keypoint Similarity) thresholds, which define the required similarity between predicted and ground-truth keypoints.
OKS is defined as
\[
\text{OKS} = \frac{\sum_i \exp\!\left( -\dfrac{d_i^2}{2 s^2 k_i^2} \right) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)},
\]
where \( d_i \) is the Euclidean distance between the predicted coordinate of the i-th keypoint and the corresponding ground-truth coordinate, \( s \) is the object scale computed as the square root of the ground-truth bounding box area, \( k_i \) is a per-keypoint constant defined by the COCO evaluation protocol to account for annotation uncertainty, and \( \delta(v_i > 0) \) is an indicator function that equals 1 if the keypoint is labeled and visible, and 0 otherwise.
The primary AP metric is AP0.5:0.95, the mean AP computed across OKS thresholds from 0.50 to 0.95 with a step size of 0.05:
\[
\text{AP}^{0.5:0.95} = \frac{1}{10} \sum_{t \in \{0.50,\, 0.55,\, \ldots,\, 0.95\}} \text{AP}_t,
\]
where \( \text{AP}_t \) is the Average Precision at OKS threshold \( t \).
In addition to AP0.5:0.95, we report AP at specific thresholds and scales:
- AP0.5: Average Precision at OKS threshold 0.5.
- AP0.75: Average Precision at OKS threshold 0.75.
- APM: Average Precision for medium-sized person instances (area between 32 × 32 and 96 × 96 pixels).
- APL: Average Precision for large person instances (area larger than 96 × 96 pixels).
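As a minimal sketch of the OKS computation defined above, assuming NumPy arrays: the per-keypoint constants \( k_i \) are those published with the official COCO evaluation code, so they are passed in as an argument here rather than hard-coded.
```python
# Sketch of the OKS formula above. `kappas` are the per-keypoint
# constants from the COCO evaluation protocol (available in the
# official cocoeval code); passing them in avoids hard-coding values.
import numpy as np

def oks(pred, gt, visible, area, kappas):
    """pred, gt: (K, 2) keypoint arrays; visible: (K,) bool mask;
    area: ground-truth box area; kappas: (K,) per-keypoint constants."""
    d2 = ((pred - gt) ** 2).sum(axis=1)          # squared distances d_i^2
    s2 = area                                    # s^2, since s = sqrt(area)
    e = np.exp(-d2 / (2.0 * s2 * kappas ** 2))   # per-keypoint similarity
    return e[visible].mean() if visible.any() else 0.0

K = 17
score = oks(pred=np.random.rand(K, 2) * 100, gt=np.random.rand(K, 2) * 100,
            visible=np.ones(K, dtype=bool), area=100.0 * 200.0,
            kappas=np.full(K, 0.1))
print(score)
```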
4.1.2. Average Recall (AR)
Average Recall (AR) measured the recall of the keypoint detection system, i.e., the proportion of ground-truth keypoints that were correctly detected. AR is computed similarly to AP but focuses on recall rather than precision; it is evaluated at different OKS thresholds, and the average over all thresholds is reported.
The AR metric at a given OKS threshold \( t \) was calculated as
\[
\text{AR}_t = \frac{\sum_{k} \mathrm{TP}_k(t)}{\sum_{k} G_k},
\]
where \( \mathrm{TP}_k(t) \) is the number of correctly detected instances of the k-th keypoint at threshold \( t \), and \( G_k \) is the total number of ground-truth instances of that keypoint.
The average recall over thresholds is reported as
\[
\text{AR}^{0.5:0.95} = \frac{1}{10} \sum_{t \in \{0.50,\, 0.55,\, \ldots,\, 0.95\}} \text{AR}_t,
\]
where \( \text{AR}_t \) is the recall at OKS threshold \( t \).
These metrics provided a comprehensive evaluation of the model’s performance in detecting human keypoints across different scenarios, considering both precision and recall.
4.2. Implementation Details
Our implementation was based on PyTorch, a widely used deep learning framework that provides an efficient and flexible environment for developing and training deep learning models. We used HRNet-W32 as the backbone for keypoint detection, taking advantage of its high-resolution representation and multi-scale feature fusion capabilities. HRNet-W32 has proven effective in human keypoint detection tasks, delivering strong performance in both single-person and multi-person scenarios.
For multi-person detection, we integrated Faster-RCNN [45], a well-established object detection framework with a ResNet-50 [49] backbone, as the person detector. Faster-RCNN generates bounding boxes around each detected person in the image, which are then passed to the HRNet model for keypoint detection. This combination of Faster-RCNN for person detection and HRNet for keypoint detection forms a powerful pipeline for multi-person keypoint localization.
The total training duration consisted of 210 epochs. During the first 40 epochs, we trained the model on NVIDIA RTX 4090 GPUs, which provided high computational power and significantly accelerated training. For the remaining epochs, we switched to Tesla P100 GPUs to continue the training process. This strategy ensured efficient use of resources, while achieving stable convergence throughout the training process.
To improve the generalization capability of the model, we applied standard data augmentation techniques, including random flipping, rotation, and scale adjustment. These augmentations helped the model become invariant to common transformations such as changes in orientation, size, and position.
The input resolution for the multi-person detection experiments was set higher than that for the single-person experiments. This choice strikes a balance between computational efficiency and the detailed spatial information needed for accurate keypoint localization: a higher resolution allows finer localization, which is particularly beneficial in multi-person detection, where accurate bounding boxes and keypoint detection are essential.
4.3. Results and Analysis
Table 1 summarizes our keypoint detection performance on the COCO validation set. Note that the HRNet-W32 (official) entry in Table 1 refers to the COCO benchmark reported in [35], which followed the same multi-stage, multi-resolution architecture as our implementation but used slightly different training schedules and data augmentation strategies. Our reproduced HRNet-W32 model maintained the original structural design. Our single-person keypoint detection model achieved an AP0.5:0.95 of 72.5%, roughly 4 points below the official HRNet benchmark [35]. This shows that our model is highly competitive with the state-of-the-art HRNet-W32, despite being trained under a different setup. Notably, our AP0.5, which measures performance at the lower OKS threshold of 0.5, was 90.2%, reflecting high keypoint localization accuracy at this threshold.
For multi-person detection, we integrated the top-down pipeline with MMPose [47] and achieved an AP0.5:0.95 of 74.6% on val2017. This demonstrates that our model performed well in multi-person scenarios, with a higher AP than in single-person detection. Additionally, we observed improvements in AP0.75 (82.7%), indicating that the model was also effective at higher OKS thresholds, where keypoints must be localized more precisely. This suggests that the multi-person setup benefited from both the HRNet backbone and the MMPose integration, yielding more accurate multi-person pose estimation.
The AR (Average Recall) score for the multi-person setup was 80.3%, higher than that of the single-person model (78.4%). This reflects the model’s ability to correctly detect keypoints in complex multi-person scenarios, with fewer false negatives.
For comparison, the official HRNet-W32 [35] model achieved an AP0.5:0.95 of 76.6%, roughly 4 points higher than our single-person model. Although our method did not surpass the official HRNet-W32 benchmark, it still provides a competitive solution with noticeably lower computational overhead at inference time. The higher AP0.75 and AR for multi-person detection also highlight the potential of our model in more complex real-world applications.
Overall, our model provides a robust and efficient solution for both single-person and multi-person keypoint detection tasks, demonstrating competitive accuracy and recall performance, while maintaining high inference speed.
4.4. Additional Experiments
4.4.1. Ablation Study
To further evaluate the robustness and flexibility of our proposed HRNet-based keypoint detection framework, we conducted an ablation study to investigate the contribution of key components in our model. Specifically, we compared the full HRNet model with two degraded variants, to understand the impact of the different architectural components on overall performance. The variants were as follows:
Baseline model with ResNet-50 backbone: In this variant, we replaced HRNet with a standard ResNet-50 backbone (see Table 2), a commonly used architecture for feature extraction. This comparison assesses the effect of HRNet’s high-resolution representations versus a more traditional backbone.
HRNet without multi-scale fusion: In this variant, we removed the multi-scale fusion modules from HRNet, which are responsible for exchanging information between high- and low-resolution feature maps. This comparison analyzes the importance of multi-resolution feature fusion for localization accuracy.
Table 2. Comparison between ResNet and HRNet architectures.

| Characteristic | ResNet [49] | HRNet [35] |
|---|---|---|
| Backbone type | Sequential CNN | Parallel multi-resolution CNN |
| Feature resolution | Decreases with depth | Maintained at high resolution |
| Parameters (typical) | ∼25M (ResNet-50) | ∼28M (HRNet-W32) |
| Strengths | Strong global semantics | Strong spatial precision |
In Table 3, the variant HRNet w/o Fusion denotes an ablation in which all cross-resolution fusion modules were removed. In this setting, each resolution branch processed its own feature maps independently, without exchanging information with the other branches, and the highest-resolution branch was used directly for keypoint prediction. This variant isolates the contribution of multi-scale feature fusion to the overall performance.
Table 3 presents the results of the ablation study on the COCO val2017 dataset. The models were compared in terms of AP0.5:0.95, AP0.5, and AP0.75, which measure the precision of keypoint localization at different levels of accuracy.
As shown in Table 3, removing high-resolution maintenance (i.e., using ResNet-50 as the backbone) led to a noticeable performance degradation, particularly in AP0.5:0.95 and AP0.75. The ResNet-50 backbone achieved an AP0.5:0.95 of 66.1%, significantly lower than the full HRNet model’s 72.5%. This suggests that the high-resolution representations provided by HRNet played a critical role in localizing keypoints across varying scales and poses.
Additionally, when we removed the multi-scale fusion modules from HRNet, the AP0.5:0.95 decreased to 69.8%. This indicates that multi-scale feature fusion is essential for preserving spatial details and achieving accurate keypoint localization. Without fusion, the model is less able to capture important high-resolution spatial details, leading to lower performance.
The full HRNet model, which combines high-resolution representations with multi-scale fusion, achieved the best performance across all metrics, confirming the importance of these architectural components for precise keypoint localization.
The weighting parameter \( \lambda \) controls the relative contribution of the auxiliary detection loss to the overall optimization objective. We evaluated several values of \( \lambda \) and observed that performance remained stable for \( \lambda \) between 0.3 and 0.7, with the best accuracy achieved at \( \lambda = 0.5 \). A very small \( \lambda \) weakens the influence of the auxiliary constraint, slightly reducing precision, while a very large \( \lambda \) overemphasizes the auxiliary loss and can harm convergence. Therefore, \( \lambda \) was set to 0.5 in all main experiments, ensuring a balanced trade-off between the two objectives.
Overall, the ablation study highlighted the crucial role of both high-resolution representations and multi-scale feature fusion in achieving accurate keypoint detection, reinforcing the effectiveness of the HRNet-based framework.
4.4.2. Input Resolution Sensitivity
We evaluated the effect of different input resolutions on both detection accuracy and inference speed. As shown in Table 4, increasing the input resolution improved AP at the cost of slower inference, while reducing the resolution accelerated inference but compromised accuracy. This trade-off is crucial when deploying models on devices with limited computational resources, where the balance between accuracy and speed must be chosen according to the application’s requirements.
As the table shows, increasing the input resolution improved the model’s AP0.5:0.95 from 68.7% to 74.9%, reflecting better performance at higher resolution. However, this came at the cost of inference speed: on the RTX 4090, throughput decreased from 55 to 21 frames per second (FPS), and on the Tesla P100 from 22 to 9 FPS. This demonstrates the trade-off between computational cost and performance, and it provides the flexibility to adjust resolution based on deployment constraints.
4.4.3. Robustness Under Challenging Conditions
To validate the robustness of our method in challenging scenarios, we conducted separate evaluations on two subsets of the COCO val2017 split: (1) an occlusion subset (occlusion ratio > 0.5), and (2) a small-scale subset (person instances below a fixed area threshold). These subsets represent two common real-world challenges, occlusion and small-scale persons, both of which can significantly impact keypoint detection performance.
As shown in Table 5, our method demonstrated significantly improved robustness compared to the ResNet-50 baseline in both scenarios. On the occlusion subset, our method achieved an AP0.5:0.95 of 62.4%, outperforming ResNet-50 by 4.4 points. Similarly, on the small-scale subset, our model achieved 58.9%, 5.8 points better than the ResNet-50 model. This indicates that our framework handles challenging conditions such as occlusions and small persons better than standard models.
Failure Case Analysis
Despite the overall robustness demonstrated in the above sections, our method occasionally fails in challenging scenarios. Typical failure modes include severe occlusion (e.g., multiple people blocking each other), extreme or uncommon poses, and crowded scenes with overlapping body parts. In such cases, keypoints may be incorrectly localized or assigned to the wrong individual. These patterns are consistent with known limitations of top-down approaches and indicate potential directions for future improvement, such as incorporating occlusion-aware modules or leveraging temporal cues in videos.
4.4.4. Inference Speed Benchmarking
We further measured the inference speed of our model on two representative GPU platforms: NVIDIA RTX 4090 and Tesla P100. The results are summarized in Table 6.
These findings demonstrate that our method provides a good balance between accuracy and efficiency. On the RTX 4090, HRNet-W32 achieved an AP0.5:0.95 of 72.5% at 38 FPS, while the larger HRNet-W48 model achieved a slightly higher AP0.5:0.95 of 74.8%, at the cost of lower throughput (26 FPS on the RTX 4090). On the Tesla P100, the FPS for HRNet-W32 dropped to 14, and HRNet-W48 further reduced it to 9. These results show that while higher accuracy came at the cost of inference speed, the model still maintained acceptable performance even on resource-constrained devices such as the Tesla P100.
These benchmarks highlighted the flexibility of our framework, providing a good trade-off between performance and computational efficiency, making it suitable for real-time applications across different hardware platforms.
5. Discussion
Our experiments demonstrated that the proposed approach achieved competitive accuracy on large-scale benchmarks, while maintaining a favorable trade-off between precision and computational efficiency. The integration of high-resolution feature extraction with an efficient top-down pipeline proved particularly effective in crowded or occluded scenarios, where fine spatial details are critical. Despite these advantages, several limitations remain. First, the reliance on an external person detector in the top-down pipeline may lead to performance degradation if the detector fails to localize individuals accurately. Second, while our approach performs well under standard benchmark settings, its robustness to domain shifts (e.g., varying lighting, unusual poses) warrants further investigation. Third, real-time performance could be further improved by exploring model compression or hardware-specific optimizations. In future work, we plan to investigate hybrid strategies that combine the strengths of top-down and bottom-up paradigms, as well as adaptive fusion mechanisms for multi-scale features. We also aim to evaluate our approach on more diverse datasets, to better assess its generalization ability.
6. Conclusions
In this paper, we proposed a unified human keypoint detection framework based on the High-Resolution Network (HRNet), supporting both single-person and multi-person keypoint detection scenarios. By leveraging the high-resolution representation capability of HRNet and integrating a top-down detection strategy, our method achieved competitive accuracy on the COCO benchmark, while maintaining efficient inference speed.
Through extensive experiments, we demonstrated the robustness of our method under diverse conditions, including complex backgrounds and occlusions. Furthermore, the real-time capability of our multi-person detection pipeline indicated promising potential for deployment in real-world applications such as intelligent surveillance, sports analysis, and human–computer interaction.
In the future, we plan to explore lightweight model architectures to further improve real-time performance on edge devices, and investigate hybrid strategies to better balance accuracy and inference speed across diverse deployment scenarios. To further validate the generalizability of our framework, we intend to extend our evaluation to other challenging benchmarks, such as the MPII and PoseTrack datasets. This will allow us to more rigorously assess the model’s performance under different conditions, including varied lighting, environments, and human morphologies, thereby providing a more comprehensive understanding of its real-world capabilities.