1. Introduction
Human keypoint detection, which focuses on localizing human anatomical landmarks, is a core problem in computer vision, with wide-ranging applications, including vision intelligence [1,2,3,4,5], image processing [6,7,8,9,10], video retrieval, human–computer interaction, and sports and healthcare motion analysis. The success of these downstream tasks critically depends on accurate and robust detection of skeletal keypoints, which serve as the foundation for understanding human posture [11,12,13,14,15,16,17], behavior, and intent. An overview of the complete human keypoint detection pipeline is illustrated in Figure 1, which depicts the sequential process from input image acquisition through feature extraction, keypoint generation, and final coordinate regression, providing a concise visual summary of our proposed approach for both single- and multi-person scenarios.
Recent years have witnessed rapid progress in human keypoint detection, driven by advances in deep learning [18,19,20,21,22,23,24,25,26,27,28,29,30] and the availability of large-scale datasets such as MS COCO [31] and PoseTrack [32]. Numerous high-performance architectures have emerged, ranging from early regression-based methods like DeepPose [33] to heatmap-based approaches such as Stacked Hourglass Networks [34] and high-resolution frameworks like HRNet [35]. Multi-person pose estimation is typically categorized as top-down [35,36], which detects human bounding boxes before estimating keypoints, or bottom-up [37,38], which detects all keypoints and groups them into individuals. Top-down methods generally achieve higher accuracy at the cost of greater computation, while bottom-up methods offer higher efficiency but may lose accuracy in crowded scenes.
Despite notable advances, human keypoint detection remains challenging in unconstrained environments. Complex backgrounds, frequent occlusions, large pose variations, and scale diversity continue to hinder model performance. Early approaches based on hand-crafted features and shallow models exhibited limited generalization in such scenarios. The emergence of deep learning, particularly Convolutional Neural Networks (CNNs), has significantly improved accuracy by enabling end-to-end learning frameworks. In multi-person settings, two mainstream paradigms have emerged: bottom-up approaches [37,38], which first detect all keypoints and subsequently group them into individual skeletons, and top-down approaches [35,36], which detect person instances before applying single-person pose estimation within each bounding box.
Among modern architectures, the High-Resolution Network (HRNet) [35] stands out for its strong performance in spatial localization tasks. Unlike conventional backbones that progressively downsample feature maps, HRNet maintains high-resolution representations throughout the network via parallel multi-scale streams and iterative feature fusion. This design effectively balances spatial precision and semantic richness, making HRNet particularly suitable for keypoint heatmap generation and downstream pose estimation tasks.
In this work, we propose a comprehensive pose estimation pipeline based on HRNet. We develop both single-person and multi-person keypoint detection frameworks, training our models on the COCO dataset. Furthermore, we extend our framework to real-time multi-person scenarios by integrating a top-down detection strategy that combines Faster-RCNN and HRNet. Extensive experiments demonstrated that our approach achieved competitive accuracy, while maintaining practical inference efficiency, offering a flexible solution for real-world applications.
Our main contributions are summarized as follows:
We develop a human keypoint detection pipeline based on HRNet, which supports both single-person and multi-person pose estimation. Our approach achieves high accuracy and robustness in detecting keypoints, even in challenging environments with complex backgrounds and occlusions. By maintaining high-resolution feature maps throughout the network, our method ensures precise localization of keypoints across various poses and scales.
We integrate a top-down detection framework with HRNet, enabling real-time multi-person pose estimation. This integration utilizes a person detector, such as Faster-RCNN, to localize individual persons in the image, followed by keypoint detection for each person instance. The resulting system offers real-time performance, while maintaining high accuracy, making it suitable for deployment in real-world applications requiring both speed and precision.
The effectiveness of our approach was thoroughly validated on the COCO benchmark, where we achieved competitive performance in terms of Average Precision (AP) and inference efficiency. Our method demonstrated robust performance under diverse conditions, including crowded scenes and occlusions, while maintaining a fast inference speed, making it a highly efficient solution for real-time applications in fields like intelligent surveillance and human–computer interaction.
3. Methodology
In this section, we describe our proposed human keypoint detection framework in detail. Our method is based on a High-Resolution Network (HRNet) backbone and consists of two pipelines: single-person keypoint detection and multi-person keypoint detection. The overall architecture is illustrated in Figure 2.
3.1. Overall Pipeline
Our approach follows a top-down detection paradigm, which simplifies the task of multi-person keypoint detection by dividing it into two main phases: person localization and keypoint detection. For single-person scenarios, we assume that the target person has already been localized within a cropped image, and keypoint detection is directly performed on the input image. This approach leverages the precision of single-person keypoint detection models to achieve accurate localization in simpler settings.
In multi-person scenarios, we first utilize a person detector (such as Faster-RCNN [45]) to generate bounding boxes for each individual in the image. These bounding boxes define the regions of interest, which are then used to crop the image into smaller patches [46]. For each cropped region, we apply the HRNet-based keypoint detection network, which maintains high-resolution representations throughout the network and provides detailed spatial information. In our multi-person experiments, we integrated the top-down detection pipeline with MMPose [47], an open-source pose estimation framework. MMPose adopts a modular design, separating person detection and keypoint estimation into interchangeable components. This allows us to seamlessly combine a Faster R-CNN detector with our HRNet-based keypoint estimator, while benefiting from MMPose’s efficient data preprocessing, bounding-box cropping, and inference scheduling.
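The sketch below illustrates this style of integration using MMPose’s high-level inference API; note that the `MMPoseInferencer` class, the `'human'` alias, and the result-dictionary keys follow recent MMPose 1.x releases and may differ in other versions, and the image path is a placeholder rather than a file from our experiments.
```python
# Hedged sketch: top-down multi-person inference via MMPose 1.x.
# The MMPoseInferencer API and the 'human' alias (a person detector
# followed by a keypoint estimator) are assumptions based on recent
# MMPose releases; check the installed version's docs before use.
from mmpose.apis import MMPoseInferencer

# 'human' selects a default top-down human pose pipeline.
inferencer = MMPoseInferencer('human')

# The inferencer yields one result dict per input image.
result_generator = inferencer('example.jpg', show=False)
result = next(result_generator)

# Each prediction holds per-person keypoints and confidence scores.
for person in result['predictions'][0]:
    print(person['keypoints'], person['keypoint_scores'])
```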
Given an input image \( I \in \mathbb{R}^{H \times W \times 3} \), the network outputs a set of \( K \) keypoints \( P = \{(x_k, y_k)\}_{k=1}^{K} \), where each \( (x_k, y_k) \) denotes the predicted coordinates of the k-th keypoint. This process is illustrated in Algorithm 1:
Algorithm 1 Keypoint Detection Pipeline
Require: Input image \( I \), pretrained HRNet model \( \mathcal{M} \), (optional) person detector \( \mathcal{D} \)
Ensure: Predicted keypoints \( P \)
1: if multi-person setting then
2:   Use \( \mathcal{D} \) to detect person bounding boxes \( \{B_i\} \)
3: else
4:   Use full image \( I \) for single-person detection
5: end if
6: for each detected person instance \( B_i \) or full image do
7:   Crop and resize image patch according to \( B_i \) or \( I \)
8:   Feed into \( \mathcal{M} \) to obtain keypoint heatmaps \( \{\hat{H}_k\}_{k=1}^{K} \)
9:   for each keypoint \( k \) do
10:     Compute \( (x_k, y_k) = \arg\max_{(x,y)} \hat{H}_k(x, y) \)
11:   end for
12:   Collect keypoint set \( P \) for the current person
13: end for
14: return all predicted keypoints for all persons
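As a concrete illustration, the following Python sketch mirrors Algorithm 1, assuming `detector` and `pose_net` are pretrained PyTorch modules with the stated input/output shapes; all function and variable names here are hypothetical rather than taken from our released code.
```python
# Minimal sketch of Algorithm 1 in PyTorch. `detector` is assumed to
# return person boxes (x1, y1, x2, y2); `pose_net` maps a cropped
# patch to K keypoint heatmaps. Shapes and names are illustrative.
import torch
import torch.nn.functional as F

def detect_keypoints(image, pose_net, detector=None, patch_size=(256, 192)):
    """image: float tensor of shape (3, H, W), values in [0, 1]."""
    if detector is not None:                      # multi-person setting
        boxes = detector(image.unsqueeze(0))      # (N, 4) person boxes
    else:                                         # single-person setting
        _, H, W = image.shape
        boxes = torch.tensor([[0.0, 0.0, W, H]])  # whole image as one box

    all_keypoints = []
    for x1, y1, x2, y2 in boxes.round().long():
        patch = image[:, y1:y2, x1:x2].unsqueeze(0)
        patch = F.interpolate(patch, size=patch_size, mode='bilinear',
                              align_corners=False)
        heatmaps = pose_net(patch)[0]             # (K, h, w)

        kpts = []
        for hm in heatmaps:                       # argmax decoding per keypoint
            idx = torch.argmax(hm)
            py, px = divmod(idx.item(), hm.shape[1])
            # Map heatmap coordinates back to the original image frame.
            kpts.append((x1.item() + px * (x2 - x1).item() / hm.shape[1],
                         y1.item() + py * (y2 - y1).item() / hm.shape[0]))
        all_keypoints.append(kpts)
    return all_keypoints
```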
3.2. HRNet Architecture
The HRNet backbone is designed to maintain high-resolution representations throughout the entire network, ensuring that both fine-grained spatial details and rich semantic information are captured at each stage. The architecture employs parallel multi-resolution branches, where each branch is responsible for processing features at different resolutions. This design enables the network to achieve a balance between spatial precision and semantic richness, making it particularly effective for tasks like human keypoint detection. The feature extraction process in HRNet consists of four main stages, each contributing to the network’s ability to maintain high-resolution representations and effectively fuse multi-scale features.
Stage 1: The first stage begins by downsampling the input image to \( \tfrac{1}{4} \) resolution using standard convolutional layers. This stage reduces the spatial resolution while capturing broad, low-level features such as edges and textures, which provide the foundation for subsequent feature extraction. The downsampling lets the network process larger regions of the image at once, which is essential for capturing global patterns and semantic information.
Stage 2: A second branch is introduced at \( \tfrac{1}{8} \) resolution, in parallel with the existing branch. At the end of this stage, a fusion module performs bidirectional information exchange: high-resolution features are downsampled and added to lower-resolution features, while low-resolution features are upsampled and added to higher-resolution features. This allows the network to combine the detailed spatial information from the high-resolution branch with the semantic context from the low-resolution branch.
Stage 3: A third branch at \( \tfrac{1}{16} \) resolution is added. All three branches exchange information via an expanded fusion module that upsamples and downsamples feature maps as needed, so that every branch receives inputs from all other branches. This repeated multi-scale fusion ensures that features at each resolution are enriched with both fine detail and large-scale context.
Stage 4: A fourth branch at \( \tfrac{1}{32} \) resolution is introduced, and multi-scale fusion is again applied across all four branches. This final stage enables the network to capture the broadest possible semantic context, while retaining precise spatial localization in the high-resolution branch.
At each stage, the network produces multi-scale feature maps, denoted as \( \{F_s\}_{s=1}^{S} \), where \( S \) is the number of scales (resolutions) used in the network. These feature maps capture a range of spatial and semantic information, with each map corresponding to a different resolution. The fusion of these multi-scale feature maps is crucial for keypoint detection, as it allows the network to retain both the fine spatial details necessary for precise localization and the broader semantic context needed for accurate scene understanding.
The HRNet architecture’s ability to preserve high-resolution representations throughout the entire network, along with its repeated multi-scale feature fusion, makes it particularly effective for human keypoint detection, where accurate localization of fine details is crucial. This unique design enables HRNet to outperform traditional networks that progressively downsample feature maps, making it a powerful tool for tasks requiring high spatial precision.
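To make the fusion step concrete, the sketch below shows one possible two-branch cross-resolution exchange unit in PyTorch. The structure (a strided 3×3 convolution for downsampling, a 1×1 convolution plus bilinear upsampling for upsampling, followed by element-wise summation) follows the original HRNet design, but the class name, channel counts, and tensor sizes are our own illustrative choices.
```python
# Illustrative two-branch HRNet-style fusion unit: the high-resolution
# branch is downsampled with a stride-2 3x3 conv, the low-resolution
# branch is upsampled with a 1x1 conv plus bilinear interpolation, and
# the outputs are summed per branch. Channel counts are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchFusion(nn.Module):
    def __init__(self, c_high=32, c_low=64):
        super().__init__()
        # high -> low: stride-2 conv halves the spatial resolution
        self.down = nn.Conv2d(c_high, c_low, kernel_size=3, stride=2, padding=1)
        # low -> high: 1x1 conv matches channels before upsampling
        self.up = nn.Conv2d(c_low, c_high, kernel_size=1)

    def forward(self, x_high, x_low):
        # x_high: (B, c_high, H, W); x_low: (B, c_low, H/2, W/2)
        up = F.interpolate(self.up(x_low), size=x_high.shape[-2:],
                           mode='bilinear', align_corners=False)
        fused_high = x_high + up                 # high branch gains context
        fused_low = x_low + self.down(x_high)    # low branch gains detail
        return fused_high, fused_low

# Example: fuse 1/4- and 1/8-resolution feature maps.
f = TwoBranchFusion()
h, l = f(torch.randn(1, 32, 64, 48), torch.randn(1, 64, 32, 24))
```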
3.3. Keypoint Heatmap Prediction
The final output of HRNet is a set of keypoint heatmaps, denoted as \( \{\hat{H}_k\}_{k=1}^{K} \), where each heatmap \( \hat{H}_k \) represents the likelihood of the k-th keypoint appearing at each spatial location in the image. Each heatmap is essentially a probability map, where each pixel’s value indicates the likelihood of the keypoint being located at that position; the higher the value, the more likely the keypoint is present there.
To train the model, we generate a ground-truth heatmap for each keypoint using a 2D Gaussian distribution centered at the true keypoint location \( (x_k, y_k) \). The Gaussian provides a smooth and continuous representation of the keypoint, with the highest probability at the true location and probability decaying with distance from it. This allows the network to learn a distribution over the keypoint location, rather than a binary classification.
The ground-truth heatmap \( H_k^{*} \) for the k-th keypoint is defined as
\[
H_k^{*}(x, y) = \exp\!\left( -\frac{(x - x_k)^2 + (y - y_k)^2}{2\sigma^2} \right),
\]
where
- \( (x, y) \) represents the spatial coordinates of a pixel in the heatmap;
- \( (x_k, y_k) \) is the true location of the k-th keypoint in the image;
- \( \sigma \) is a hyperparameter that controls the spread of the Gaussian. A larger \( \sigma \) results in a broader spread, while a smaller \( \sigma \) creates a narrower peak around the keypoint.
The Gaussian distribution is used to create smooth heatmaps that allow the network to focus on local regions around each keypoint and learn to predict the most likely location of each keypoint. This process is crucial for handling slight inaccuracies in the predicted keypoints and enables the model to generalize better in real-world scenarios with occlusions or ambiguous poses.
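As a minimal sketch, the target heatmap defined above can be rendered as follows; the heatmap size and \( \sigma \) value here are illustrative defaults, not the exact settings used in our experiments.
```python
# Minimal sketch: render the 2D Gaussian ground-truth heatmap defined
# above. Heatmap size and sigma are illustrative assumptions.
import numpy as np

def gaussian_heatmap(xk, yk, height=64, width=48, sigma=2.0):
    """Return H*_k with a peak of 1.0 at the true keypoint (xk, yk)."""
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs - xk) ** 2 + (ys - yk) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

hm = gaussian_heatmap(xk=20, yk=30)
print(hm.shape, hm.max())  # (64, 48) 1.0
```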
3.4. Loss Function
The loss function plays a crucial role in training the model by guiding it to minimize the difference between the predicted and the ground-truth keypoint heatmaps. To achieve this, we adopt the Mean Squared Error (MSE) loss between the predicted heatmaps and the ground-truth heatmaps.
For each keypoint \( k \), the predicted heatmap \( \hat{H}_k \) is compared to the ground-truth heatmap \( H_k^{*} \), and the squared error between the two is computed at each pixel \( (x, y) \). The total loss is averaged over all \( K \) keypoints:
\[
\mathcal{L}_{\text{MSE}} = \frac{1}{K} \sum_{k=1}^{K} \sum_{x, y} \left( \hat{H}_k(x, y) - H_k^{*}(x, y) \right)^2.
\]
This MSE loss drives the predicted heatmaps toward the ground-truth heatmaps, and hence the predicted keypoint locations toward the ground-truth locations.
For multi-person keypoint detection, an additional loss term accounts for the person detection process. This term, the object detection loss \( \mathcal{L}_{\text{det}} \), comes from the person detector used to localize each individual in the image and encourages accurate bounding boxes around each person. The total loss for multi-person keypoint detection combines the MSE loss and the object detection loss:
\[
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MSE}} + \lambda \, \mathcal{L}_{\text{det}},
\]
where \( \lambda \) is a balancing hyperparameter that determines the relative importance of the keypoint detection loss and the person detection loss. In our implementation, we set \( \lambda = 0.5 \), following the sensitivity analysis in Section 4.4.1, so that the model optimizes person detection and keypoint localization simultaneously with a balanced trade-off.
By combining both losses, the network learns to detect and localize keypoints for both single-person and multi-person scenarios, effectively handling complex detection tasks with high accuracy.
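A minimal sketch of this combined objective in PyTorch is shown below, assuming the person detector exposes its own scalar loss; `lambda_det = 0.5` follows the setting above, and all names are illustrative.
```python
# Sketch of the training objective: per-keypoint heatmap MSE plus a
# weighted detector loss. `det_loss` is assumed to come from the
# person detector's own training step; names are illustrative.
import torch

def keypoint_mse(pred, target):
    """pred, target: (B, K, h, w) heatmaps. Squared error summed over
    pixels, then averaged over keypoints and the batch."""
    per_kpt = ((pred - target) ** 2).sum(dim=(2, 3))  # (B, K)
    return per_kpt.mean()

def total_loss(pred, target, det_loss=None, lambda_det=0.5):
    loss = keypoint_mse(pred, target)
    if det_loss is not None:          # multi-person setting
        loss = loss + lambda_det * det_loss
    return loss

pred = torch.rand(2, 17, 64, 48, requires_grad=True)
target = torch.rand(2, 17, 64, 48)
total_loss(pred, target, det_loss=torch.tensor(0.8)).backward()
```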
3.5. Training Details
Our models were trained on the COCO dataset, which is widely used for human keypoint detection tasks and provides a diverse set of images with annotated keypoints. To improve the robustness and generalization of the model, we employed standard data augmentation strategies during training, including random flipping, rotation, and scale jittering. These augmentations helped the model become invariant to common transformations, making it more effective in real-world scenarios with varying poses and viewpoints.
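One subtlety of these augmentations worth making explicit is horizontal flipping: mirroring a person image also swaps left/right keypoint labels. A minimal sketch, assuming the standard 17-keypoint COCO joint ordering, is given below.
```python
# Sketch of flip augmentation for keypoints. When an image is mirrored,
# left/right joint labels must be swapped. The pair indices below assume
# the standard 17-keypoint COCO ordering.
import numpy as np

# (left, right) index pairs in COCO order: eyes, ears, shoulders,
# elbows, wrists, hips, knees, ankles.
FLIP_PAIRS = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12),
              (13, 14), (15, 16)]

def flip_keypoints(kpts, img_width):
    """kpts: (17, 2) array of (x, y). Mirror x and swap paired joints."""
    flipped = kpts.copy()
    flipped[:, 0] = img_width - 1 - flipped[:, 0]  # mirror x-coordinates
    for l, r in FLIP_PAIRS:
        flipped[[l, r]] = flipped[[r, l]]          # swap left/right labels
    return flipped

kpts = np.random.rand(17, 2) * 192
print(flip_keypoints(kpts, img_width=192)[:3])
```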
The optimizer used for training was Adam [48], which is well suited to training deep models. The learning rate was decayed over time using a step scheduler, which helped the model converge efficiently and prevented overshooting the optimum in the later stages of training. A total of 210 epochs were used, allowing the model to refine its parameters over many iterations. Additionally, we applied a warm-up learning rate adjustment during the initial epochs, which stabilized training and prevented premature convergence.
Once the model has been trained, inference is performed using a simple post-processing strategy. For each predicted heatmap, the keypoint location is obtained as the pixel with the highest probability, i.e., the argmax of the heatmap:
\[
(\hat{x}_k, \hat{y}_k) = \arg\max_{(x, y)} \hat{H}_k(x, y).
\]
This ensures that the predicted keypoint location corresponds to the peak of the Gaussian-shaped response in the heatmap, representing the most likely position of the keypoint.
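A one-function sketch of this decoding step, assuming the heatmaps are stored in a NumPy array, is shown below.
```python
# Sketch of argmax heatmap decoding: each keypoint is placed at the
# pixel with the highest response. The array layout is an assumption.
import numpy as np

def decode_heatmaps(heatmaps):
    """heatmaps: (K, h, w) array -> (K, 2) array of (x, y) coordinates."""
    K, h, w = heatmaps.shape
    flat_idx = heatmaps.reshape(K, -1).argmax(axis=1)
    ys, xs = np.divmod(flat_idx, w)   # row index = y, column index = x
    return np.stack([xs, ys], axis=1)

coords = decode_heatmaps(np.random.rand(17, 64, 48))
print(coords.shape)  # (17, 2)
```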
Overall, this pipeline provides a robust and efficient solution for both single-person and multi-person keypoint detection tasks. By leveraging a combination of high-quality data, effective optimization techniques, and simple yet effective post-processing strategies, the model achieves competitive performance on benchmark datasets such as COCO.
4. Experiment
4.1. Dataset and Evaluation Metrics
We conducted comprehensive experiments on the widely used COCO 2017 dataset [31], which contains over 200,000 images and 250,000 person instances labeled with 17 keypoints. Following the standard protocol, we trained our models on the train2017 split and evaluated on the val2017 split.
For evaluation, we adopted the standard COCO metrics, Average Precision (AP) and Average Recall (AR), which are commonly used to assess object detection and keypoint detection performance. Below, we describe these metrics and the formulas used to calculate them. Following the COCO evaluation protocol, APM and APL are computed over the same OKS threshold range as AP0.5:0.95 (0.50 to 0.95 in steps of 0.05), but restricted to medium-scale and large-scale person instances, respectively.
4.1.1. Average Precision (AP)
Average Precision (AP) was used to measure the accuracy of the predicted keypoints by comparing the predicted keypoint locations to the ground-truth locations. AP was computed under different OKS (Object Keypoint Similarity) thresholds, which define the required similarity between predicted and ground-truth keypoints.
OKS is defined as
\[
\text{OKS} = \frac{\sum_i \exp\!\left( -\dfrac{d_i^2}{2 s^2 k_i^2} \right) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)},
\]
where \( d_i \) is the Euclidean distance between the predicted coordinate of the i-th keypoint and the corresponding ground-truth coordinate, \( s \) is the object scale computed as the square root of the ground-truth bounding box area, \( k_i \) is a per-keypoint constant defined by the COCO evaluation protocol to account for annotation uncertainty, and \( \delta(v_i > 0) \) is an indicator function that equals 1 if the keypoint is labeled and visible, and 0 otherwise.
The primary AP metric is AP0.5:0.95, the mean AP computed across OKS thresholds from 0.50 to 0.95 with a step size of 0.05:
\[
\text{AP}^{0.5:0.95} = \frac{1}{10} \sum_{t \in \{0.50,\, 0.55,\, \ldots,\, 0.95\}} \text{AP}_t,
\]
where \( \text{AP}_t \) is the Average Precision at OKS threshold \( t \).
In addition to AP0.5:0.95, we report AP at specific thresholds and scales:
- AP0.5: Average Precision at OKS threshold 0.5.
- AP0.75: Average Precision at OKS threshold 0.75.
- APM: Average Precision for medium-sized person instances (area between 32 × 32 and 96 × 96 pixels).
- APL: Average Precision for large person instances (area larger than 96 × 96 pixels).
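As a minimal sketch of the OKS computation defined above, assuming NumPy arrays: the per-keypoint constants \( k_i \) are those published with the official COCO evaluation code, so they are passed in as an argument here rather than hard-coded.
```python
# Sketch of the OKS formula above. `kappas` are the per-keypoint
# constants from the COCO evaluation protocol (available in the
# official cocoeval code); passing them in avoids hard-coding values.
import numpy as np

def oks(pred, gt, visible, area, kappas):
    """pred, gt: (K, 2) keypoint arrays; visible: (K,) bool mask;
    area: ground-truth box area; kappas: (K,) per-keypoint constants."""
    d2 = ((pred - gt) ** 2).sum(axis=1)          # squared distances d_i^2
    s2 = area                                    # s^2, since s = sqrt(area)
    e = np.exp(-d2 / (2.0 * s2 * kappas ** 2))   # per-keypoint similarity
    return e[visible].mean() if visible.any() else 0.0

K = 17
score = oks(pred=np.random.rand(K, 2) * 100, gt=np.random.rand(K, 2) * 100,
            visible=np.ones(K, dtype=bool), area=100.0 * 200.0,
            kappas=np.full(K, 0.1))
print(score)
```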
4.1.2. Average Recall (AR)
Average Recall (AR) measured the recall of the keypoint detection system, i.e., the proportion of ground-truth keypoints that were correctly detected. AR is computed similarly to AP but focuses on recall rather than precision; it is evaluated at different OKS thresholds, and the average over all thresholds is reported.
The AR metric at a given OKS threshold \( t \) was calculated as
\[
\text{AR}_t = \frac{\sum_{k} \mathrm{TP}_k(t)}{\sum_{k} G_k},
\]
where \( \mathrm{TP}_k(t) \) is the number of correctly detected instances of the k-th keypoint at threshold \( t \), and \( G_k \) is the total number of ground-truth instances of that keypoint.
The average recall over thresholds is reported as
\[
\text{AR}^{0.5:0.95} = \frac{1}{10} \sum_{t \in \{0.50,\, 0.55,\, \ldots,\, 0.95\}} \text{AR}_t,
\]
where \( \text{AR}_t \) is the recall at OKS threshold \( t \).
These metrics provided a comprehensive evaluation of the model’s performance in detecting human keypoints across different scenarios, considering both precision and recall.
4.2. Implementation Details
Our implementation was based on PyTorch, a widely used deep learning framework that provides an efficient and flexible environment for developing and training deep learning models. We used HRNet-W32 as the backbone for keypoint detection, taking advantage of its high-resolution representation and multi-scale feature fusion capabilities. HRNet-W32 has proven effective in human keypoint detection tasks, delivering strong performance in both single-person and multi-person scenarios.
For multi-person detection, we integrated Faster-RCNN [45], a well-established object detection framework with a ResNet-50 [49] backbone, as the person detector. Faster-RCNN generates bounding boxes around each detected person in the image, which are then passed to the HRNet model for keypoint detection. This combination of Faster-RCNN for person detection and HRNet for keypoint detection forms a powerful pipeline for multi-person keypoint localization.
The total training duration consisted of 210 epochs. During the first 40 epochs, we trained the model on NVIDIA RTX 4090 GPUs, which provided high computational power and significantly accelerated training. For the remaining epochs, we switched to Tesla P100 GPUs to continue the training process. This strategy ensured efficient use of resources, while achieving stable convergence throughout the training process.
To improve the generalization capability of the model, we applied standard data augmentation techniques, including random flipping, rotation, and scale adjustment. These augmentations helped the model become invariant to common transformations such as changes in orientation, size, and position.
The input resolution for the multi-person detection experiments was set higher than that for the single-person experiments. This choice strikes a balance between computational efficiency and the detailed spatial information needed for accurate keypoint localization: a higher resolution allows finer localization, which is particularly beneficial in multi-person detection, where accurate bounding boxes and keypoint detection are essential.
4.3. Results and Analysis
Table 1 summarizes our keypoint detection performance on the COCO validation set. Note that the HRNet-W32 (official) entry in Table 1 refers to the COCO benchmark reported in [35], which followed the same multi-stage, multi-resolution architecture as our implementation but used slightly different training schedules and data augmentation strategies. Our reproduced HRNet-W32 model maintained the original structural design. Our single-person keypoint detection model achieved an AP0.5:0.95 of 72.5%, roughly 4 points below the official HRNet benchmark [35]. This shows that our model is highly competitive with the state-of-the-art HRNet-W32, despite being trained under a different setup. Notably, our AP0.5, which measures performance at the lower OKS threshold of 0.5, was 90.2%, reflecting high keypoint localization accuracy at this threshold.
For multi-person detection, we integrated the top-down pipeline with MMPose [47] and achieved an AP0.5:0.95 of 74.6% on val2017. This demonstrates that our model performed well in multi-person scenarios, with a higher AP than in single-person detection. Additionally, we observed improvements in AP0.75 (82.7%), indicating that the model was also effective at higher OKS thresholds, where keypoints must be localized more precisely. This suggests that the multi-person setup benefited from both the HRNet backbone and the MMPose integration, yielding more accurate multi-person pose estimation.
The AR (Average Recall) score for the multi-person setup was 80.3%, higher than that of the single-person model (78.4%). This reflects the model’s ability to correctly detect keypoints in complex multi-person scenarios, with fewer false negatives.
For comparison, the official HRNet-W32 [35] model achieved an AP0.5:0.95 of 76.6%, roughly 4 points higher than our single-person model. Although our method did not surpass the official HRNet-W32 benchmark, it still provides a competitive solution with noticeably lower computational overhead at inference time. The higher AP0.75 and AR for multi-person detection also highlight the potential of our model in more complex real-world applications.
Overall, our model provides a robust and efficient solution for both single-person and multi-person keypoint detection tasks, demonstrating competitive accuracy and recall performance, while maintaining high inference speed.
4.4. Additional Experiments
4.4.1. Ablation Study
To further evaluate the robustness and flexibility of our proposed HRNet-based keypoint detection framework, we conducted an ablation study to investigate the contribution of key components in our model. Specifically, we compared the full HRNet model with two degraded variants, to understand the impact of the different architectural components on overall performance. The variants were as follows:
Baseline model with ResNet-50 backbone: In this variant, we replaced HRNet with a standard ResNet-50 backbone (see Table 2), a commonly used architecture for feature extraction. This comparison assesses the effect of HRNet’s high-resolution representations versus a more traditional backbone.
HRNet without multi-scale fusion: In this variant, we removed the multi-scale fusion modules from HRNet, which are responsible for exchanging information between high- and low-resolution feature maps. This comparison analyzes the importance of multi-resolution feature fusion for localization accuracy.
Table 2. Comparison between ResNet and HRNet architectures.

| Characteristic | ResNet [49] | HRNet [35] |
|---|---|---|
| Backbone type | Sequential CNN | Parallel multi-resolution CNN |
| Feature resolution | Decreases with depth | Maintained at high resolution |
| Parameters (typical) | ∼25M (ResNet-50) | ∼28M (HRNet-W32) |
| Strengths | Strong global semantics | Strong spatial precision |
In Table 3, the variant HRNet w/o Fusion denotes an ablation in which all cross-resolution fusion modules were removed. In this setting, each resolution branch processed its own feature maps independently, without exchanging information with the other branches, and the highest-resolution branch was used directly for keypoint prediction. This variant isolates the contribution of multi-scale feature fusion to the overall performance.
Table 3 presents the results of the ablation study on the COCO val2017 dataset. The models were compared in terms of AP0.5:0.95, AP0.5, and AP0.75, which measure the precision of keypoint localization at different levels of accuracy.
As shown in Table 3, removing high-resolution maintenance (i.e., using ResNet-50 as the backbone) led to a noticeable performance degradation, particularly in AP0.5:0.95 and AP0.75. The ResNet-50 backbone achieved an AP0.5:0.95 of 66.1%, significantly lower than the full HRNet model’s 72.5%. This suggests that the high-resolution representations provided by HRNet played a critical role in localizing keypoints across varying scales and poses.
Additionally, when we removed the multi-scale fusion modules from HRNet, the AP0.5:0.95 decreased to 69.8%. This indicates that multi-scale feature fusion is essential for preserving spatial details and achieving accurate keypoint localization. Without fusion, the model is less able to capture important high-resolution spatial details, leading to lower performance.
The full HRNet model, which combines high-resolution representations with multi-scale fusion, achieved the best performance across all metrics, confirming the importance of these architectural components for precise keypoint localization.
The weighting parameter \( \lambda \) controls the relative contribution of the auxiliary detection loss to the overall optimization objective. We evaluated several values of \( \lambda \) and observed that performance remained stable for \( \lambda \) between 0.3 and 0.7, with the best accuracy achieved at \( \lambda = 0.5 \). A very small \( \lambda \) weakens the influence of the auxiliary constraint, slightly reducing precision, while a very large \( \lambda \) overemphasizes the auxiliary loss and can harm convergence. Therefore, \( \lambda \) was set to 0.5 in all main experiments, ensuring a balanced trade-off between the two objectives.
Overall, the ablation study highlighted the crucial role of both high-resolution representations and multi-scale feature fusion in achieving accurate keypoint detection, reinforcing the effectiveness of the HRNet-based framework.
4.4.2. Input Resolution Sensitivity
We evaluated the effect of different input resolutions on both detection accuracy and inference speed. As shown in Table 4, increasing the input resolution improved AP at the cost of slower inference, while reducing the resolution accelerated inference but compromised accuracy. This trade-off is crucial when deploying models on devices with limited computational resources, where the balance between accuracy and speed must be chosen according to the application’s requirements.
As the table shows, increasing the input resolution improved the model’s AP0.5:0.95 from 68.7% to 74.9%, reflecting better performance at higher resolution. However, this came at the cost of inference speed: on the RTX 4090, throughput decreased from 55 to 21 frames per second (FPS), and on the Tesla P100 from 22 to 9 FPS. This demonstrates the trade-off between computational cost and performance, and it provides the flexibility to adjust resolution based on deployment constraints.
4.4.3. Robustness Under Challenging Conditions
To validate the robustness of our method in challenging scenarios, we conducted separate evaluations on two subsets of the COCO val2017 split: (1) an occlusion subset (occlusion ratio > 0.5), and (2) a small-scale subset (person instances below a fixed area threshold). These subsets represent two common real-world challenges, occlusion and small-scale persons, both of which can significantly impact keypoint detection performance.
As shown in Table 5, our method demonstrated significantly improved robustness compared to the ResNet-50 baseline in both scenarios. On the occlusion subset, our method achieved an AP0.5:0.95 of 62.4%, outperforming ResNet-50 by 4.4 points. Similarly, on the small-scale subset, our model achieved 58.9%, 5.8 points better than the ResNet-50 model. This indicates that our framework handles challenging conditions such as occlusions and small persons better than standard models.
Failure Case Analysis
Despite the overall robustness demonstrated in the above sections, our method occasionally fails in challenging scenarios. Typical failure modes include severe occlusion (e.g., multiple people blocking each other), extreme or uncommon poses, and crowded scenes with overlapping body parts. In such cases, keypoints may be incorrectly localized or assigned to the wrong individual. These patterns are consistent with known limitations of top-down approaches and indicate potential directions for future improvement, such as incorporating occlusion-aware modules or leveraging temporal cues in videos.
4.4.4. Inference Speed Benchmarking
We further measured the inference speed of our model on two representative GPU platforms: NVIDIA RTX 4090 and Tesla P100. The results are summarized in Table 6.
These findings demonstrate that our method provides a good balance between accuracy and efficiency. On the RTX 4090, HRNet-W32 achieved an AP0.5:0.95 of 72.5% at 38 FPS, while the larger HRNet-W48 model achieved a slightly higher AP0.5:0.95 of 74.8%, at the cost of lower throughput (26 FPS on the RTX 4090). On the Tesla P100, the FPS for HRNet-W32 dropped to 14, and HRNet-W48 further reduced it to 9. These results show that while higher accuracy came at the cost of inference speed, the model still maintained acceptable performance even on resource-constrained devices such as the Tesla P100.
These benchmarks highlighted the flexibility of our framework, providing a good trade-off between performance and computational efficiency, making it suitable for real-time applications across different hardware platforms.
5. Discussion
Our experiments demonstrated that the proposed approach achieved competitive accuracy on large-scale benchmarks, while maintaining a favorable trade-off between precision and computational efficiency. The integration of high-resolution feature extraction with an efficient top-down pipeline proved particularly effective in crowded or occluded scenarios, where fine spatial details are critical. Despite these advantages, several limitations remain. First, the reliance on an external person detector in the top-down pipeline may lead to performance degradation if the detector fails to localize individuals accurately. Second, while our approach performs well under standard benchmark settings, its robustness to domain shifts (e.g., varying lighting, unusual poses) warrants further investigation. Third, real-time performance could be further improved by exploring model compression or hardware-specific optimizations. In future work, we plan to investigate hybrid strategies that combine the strengths of top-down and bottom-up paradigms, as well as adaptive fusion mechanisms for multi-scale features. We also aim to evaluate our approach on more diverse datasets, to better assess its generalization ability.
6. Conclusions
In this paper, we proposed a unified human keypoint detection framework based on the High-Resolution Network (HRNet), supporting both single-person and multi-person keypoint detection scenarios. By leveraging the high-resolution representation capability of HRNet and integrating a top-down detection strategy, our method achieved competitive accuracy on the COCO benchmark, while maintaining efficient inference speed.
Through extensive experiments, we demonstrated the robustness of our method under diverse conditions, including complex backgrounds and occlusions. Furthermore, the real-time capability of our multi-person detection pipeline indicated promising potential for deployment in real-world applications such as intelligent surveillance, sports analysis, and human–computer interaction.
In the future, we plan to explore lightweight model architectures to further improve real-time performance on edge devices, and investigate hybrid strategies to better balance accuracy and inference speed across diverse deployment scenarios. To further validate the generalizability of our framework, we intend to extend our evaluation to other challenging benchmarks, such as the MPII and PoseTrack datasets. This will allow us to more rigorously assess the model’s performance under different conditions, including varied lighting, environments, and human morphologies, thereby providing a more comprehensive understanding of its real-world capabilities.