A Compact and Powerful Single-Stage Network for Multi-Person Pose Estimation

Abstract: Multi-person pose estimation generally follows top-down and bottom-up paradigms. The top-down paradigm detects all human boxes and then performs single-person pose estimation on each ROI. The bottom-up paradigm locates identity-free keypoints and then groups them into individuals. Both of them use an extra stage to build the relationship between human instance and corresponding keypoints (e.g., human detection in a top-down manner or a grouping process in a bottom-up manner). The extra stage leads to a high computation cost and a redundant two-stage pipeline. To address the above issue, we introduce a fine-grained body representation method. Concretely, the human body is divided into several local parts and each part is represented by an adaptive point. The novel body representation is able to sufficiently encode the diverse pose information and effectively model the relationship between human instance and corresponding keypoints in a single-forward pass. With the proposed body representation, we further introduce a compact single-stage multi-person pose regression network, called AdaptivePose++, which is the extended version of AAAI-22 paper AdaptivePose. During inference, our proposed network only needs a single-step decode operation to estimate the multi-person pose without complex post-processes and refinements. Without any bells and whistles, we achieve the most competitive performance on representative 2D pose estimation benchmarks MS COCO and CrowdPose in terms of accuracy and speed. In particular, AdaptivePose++ outperforms the state-of-the-art SWAHR-W48 and CenterGroup-W48 by 3.2 AP and 1.4 AP on COCO mini-val with faster inference speed. Furthermore, the outstanding performance on 3D pose estimation datasets MuCo-3DHP and MuPoTS-3D further demonstrates its effectiveness and generalizability on 3D scenes.


Introduction
Human pose estimation (HPE) [1][2][3][4][5][6][7] is a classical yet challenging task in the vision community [8][9][10]. It aims to locate a person's keypoints in a natural image. HPE often serves as a necessary step for high-level vision tasks such as action recognition [11][12][13][14][15][16] and pose tracking [17]. Existing 2D/3D multi-person pose estimation methods can be categorized into top-down [18][19][20][21][22][23][24] and bottom-up [25][26][27][28][29][30] paradigms. The top-down strategy divides the problem into human detection and single-person pose estimation: each detected human region is cropped and normalized to locate the single-person keypoints. It achieves superior performance but suffers from a large computation cost and low efficiency due to the additional human detector. The bottom-up strategy formulates the task as keypoint localization followed by a grouping process. It first detects all person keypoints simultaneously on the full image, instead of on cropped single-person regions, and then groups them into individuals.
Both top-down and bottom-up methods generally use the conventional keypoint heatmap representation, which models the human pose via absolute keypoint positions, as shown in Figure 2a. This representation separates the position of the human instance from the positions of its keypoints. Consequently, an extra stage is required to build up the connections. Recent research has tried to model the connections between the human body and corresponding keypoints in a single-forward process, but it has run into obstacles that lead to compromised performance. As shown in Figure 2b, CenterNet [32] represents the instance as a center point and encodes the relationship between the instance and its keypoints via center-to-joint offsets. Nevertheless, it achieves inferior performance since the limited center feature cannot encode the various poses effectively.
As shown in Figure 2c, SPM [33] also represents the human instance via the limited feature of root joint and further employs a fixed hierarchical structure along the skeleton path to build the relationship between the human instance and keypoints. Due to the intermediate nodes being pre-defined and the supervision acting on the offsets between the adjacent joints, the fixed hierarchical path will lead to accumulated errors along the hierarchical path.
To address the aforementioned problems, in this work, we propose a novel body representation which is able to sufficiently encode various human poses and effectively build the relations between the instance and keypoints in a single-forward pass. Specifically, the human body is divided into several parts and each human part is represented as an adaptive part related point. In this manner, we leverage the human center feature together with the features at several human-part related points to represent diverse human poses. Connections can be built along the path from the center to the adaptive points and then to the keypoints, as shown in Figure 2d. Compared with previous representations, our representation brings two-fold benefits as follows: (1) The proposed point set representation introduces additional features at adaptive part related points, which are able to encode more informative features for flexible poses compared with limited center representation; (2) The adaptive part related points serving as relay nodes can more effectively model the associations between human instances and corresponding keypoints in a single-forward pass.
Figure 2. (a) … [22], as well as bottom-up methods such as CMU-pose [6]. (b) Center-to-joint body representation proposed by CenterNet [32]. (c) Hierarchical body representation introduced by SPM [33]. (d) Our adaptive point set representation. In contrast to (b,c), which only use center or root features, the features of adaptive points are introduced to encode the keypoint information in each part.
With the adaptive point set representation, we propose an effective and efficient single-stage differentiable regression network, termed AdaptivePose++, which mainly consists of three novel components. First, we introduce the Part Perception Module to regress seven adaptive human-part related points for perceiving the corresponding seven human parts. Second, in contrast to using the limited feature with a fixed receptive field to predict the human center, we propose the Enhanced Center-aware Branch, which adapts the receptive field by aggregating the features of adaptive human-part related points to perceive the center of various poses more precisely. Third, we propose the Two-hop Regression Branch together with the Skeleton-Aware Regression Loss for regressing keypoints. The adaptive human-part related points act as one-hop nodes to factorize the center-to-joint offsets dynamically. AdaptivePose++ eliminates the time-consuming post-processes and achieves the best speed-accuracy trade-offs.
A preliminary version of this work [1] was accepted at the AAAI Conference on Artificial Intelligence (AAAI), 2022. We extend it in five aspects: (1) We augment the content of the Abstract, Introduction, Related Work, Methodology and Experiments to cover sufficient details for a clearer and more comprehensive presentation; (2) We improve the regression loss and add an additional loss term to learn the skeleton connections, which is helpful for crowd scenes; (3) We tune several hyper-parameters and improve the performance in a single forward pass, and add more ablation experiments with analyses to verify the superior positioning capacity of our framework. We further report more comprehensive comparisons with competitive bottom-up counterparts and list more qualitative results; (4) We report the state-of-the-art results on CrowdPose [34], which contains an enormous number of crowd scenes; (5) We keep the 2D framework, add depth estimation components, and further extend our method to the 3D multi-person pose estimation task. The promising results on MuPoTS-3D [35] verify the effectiveness and generalizability of our method for 3D scenes.
We summarize our main contributions as follows:
• We propose representing human parts as points; thus, the human body can be represented via an adaptive point set including the center and several human-part related points. To the best of our knowledge, we are the first to present a fine-grained and adaptive body representation to sufficiently encode the pose information and effectively build up the relation between the human instance and keypoints in a single-forward pass.
• Based on the novel representation, we exploit a compact single-stage differentiable network, called AdaptivePose++. Specifically, we introduce a novel Part Perception Module to perceive the human parts by regressing seven human-part related points. By manipulating human-part related points, we further propose the Enhanced Center-aware Branch to more precisely perceive the human center and the Two-hop Regression Branch together with the Skeleton-Aware Regression Loss to precisely regress the keypoints.
• Our method significantly simplifies the pipeline of existing multi-person pose estimation methods. The effectiveness is demonstrated on both 2D and 3D pose estimation benchmarks. We achieve the best speed-accuracy trade-offs without complex refinements and post-processes. Furthermore, extended experiments on CrowdPose and MuPoTS-3D clearly verify the generalizability on crowd and 3D scenes.

Related Work
In this section, we review three parts related to our method including top-down methods, bottom-up methods and point-based methods.
Top-down Methods. Given an arbitrary RGB image, the top-down methods [4,19-23] first crop and resize the region of each detected person and then locate the single-person keypoints in each cropped area. Because the detected human areas are cropped and resized to a unified size, these methods achieve superior performance. For convolution-based methods, HRNet [21] maintains high-resolution features and repeatedly fuses multi-resolution features throughout the whole process to generate reliable high-resolution representations. Su et al. [23] proposed a Channel Shuffle Module and Spatial, Channel-wise Attention Residual Bottleneck (SCARB) to drive the cross-channel information flow. For transformer-based networks, TokenPose [36] embeds each keypoint as a token to simultaneously learn constraint relationships across keypoints and visual representation from images. Other researchers [34,37] have tried to handle quantization errors and occlusion issues. However, the detection-first paradigm always brings additional computational cost and forward time, so top-down methods are often not feasible for real-time systems with strict latency constraints.
Bottom-up Methods. In contrast to top-down methods, bottom-up methods [6,25-30] first localize keypoints of all human instances in the input image and then group them to the corresponding person. Bottom-up methods mainly concentrate on an effective grouping process or on tackling scale variation. For example, CMU-pose [6] proposes a nonparametric representation, named Part Affinity Fields (PAFs), which encodes the location and orientation of limbs, to group the keypoints into individuals. AE [27] simultaneously outputs a keypoint heatmap and a tag map for each body joint, then assigns the keypoints with similar tags to individuals. HigherHRNet [26] generates a high-resolution feature pyramid with multi-resolution supervision and multi-resolution heatmap aggregation for learning scale-aware representations. However, it is worth noting that the grouping process, serving as a post-process, is still computationally complex and redundant.
Point-based methods. In the deep learning era, point-based methods [32,38-42] represent instances by grid points and have been applied to many tasks. They have drawn much attention as they are generally simpler and more efficient than anchor-based representations [20,43-45]. CenterNet [32] leverages the bounding box center to encode the object information and regresses the other object properties, such as size, to predict a bounding box in parallel. SPM [33] represents the person via the root joint and further presents a fixed hierarchical body representation to estimate human poses. Point-Set Anchors [41] proposes to leverage a set of pre-defined points as a pose anchor to provide more informative features for regression. In contrast to previous methods that use the center or pre-defined pose anchors to model human instances, we propose to represent human instances via an adaptive point set including the center and seven human-part related points as shown in Figure 3a. The novel representation is able to capture the diverse pose information and effectively model the connections between human instances and keypoints.

Methodology
First, we elaborate on the proposed body representation in Section 3.1. Then, Section 3.2 provides a detailed description of network architecture including the Part Perception Module and the Enhanced Center-aware Branch, as well as the Two-hop Regression Branch. Finally, we report the training and inference details in Section 3.3.

Body Representation
We present an adaptive point set representation that uses the center point together with several human-part related points to represent the human instance. The proposed representation introduces the adaptive human-part related points, whose features are used to encode the per-part information and can thus sufficiently capture the structural pose information. Meanwhile, they serve as the intermediate nodes to effectively model the relationship between the human instance and keypoints. In contrast to the fixed hierarchical representation in SPM [33], the adaptive part related points are predicted dynamically from the center feature rather than placed at pre-defined locations, which avoids the accumulated error propagated along a fixed hierarchical path. Furthermore, instead of using the root feature to encode all keypoints, our method also leverages the features of the adaptive points to encode the keypoints of the different parts respectively.
Our body representation is built upon the pixel-wise keypoint regression framework, which estimates the candidate pose at each pixel. For a human instance, we manually divide the human body into seven parts (i.e., head, shoulder, left arm, right arm, hip, left leg and right leg) according to the inherent structure of the human body, as shown in Figure 3b. Each divided human part is a rigid structure; we represent it via an adaptive human-part related point, which is dynamically regressed from the human center. The process can be formulated as:
$$C_{inst} \rightarrow \{P_{head}, P_{sho}, P_{la}, P_{ra}, P_{hip}, P_{ll}, P_{rl}\} \tag{1}$$
where $C_{inst}$ refers to the instance center and the others indicate the seven adaptive human-part related points corresponding to head, shoulder, left arm, right arm, hip, left leg and right leg. The human pose is thus represented in a fine-grained manner by the point set $\{C_{inst}, P_{head}, P_{sho}, P_{la}, P_{ra}, P_{hip}, P_{ll}, P_{rl}\}$. By introducing the adaptive human-part related points, the semantic and position information of different keypoints can be encoded by the feature of the specific human-part related point, instead of only using the limited center feature to encode the information of all keypoints. For convenience, $P_{part}$ is used to indicate the seven human-part related points $\{P_{head}, P_{sho}, P_{la}, P_{ra}, P_{hip}, P_{ll}, P_{rl}\}$. Then, the feature at each human-part related point is responsible for regressing the keypoints belonging to the corresponding part.
The novel representation starts from the human center, passes through the adaptive human-part related points, and ends at the body keypoints. It builds up the connection between the instance position and the corresponding keypoint positions in a single-forward pass without any non-differentiable process.
Based on the proposed representation, we deliver a single-stage differentiable solution to estimate multi-person pose. Concretely, the Part Perception Module is proposed to predict seven human-part related points. By using the adaptive human-part related points, the Enhanced Center-aware Branch is introduced to perceive the center of humans with various deformations and scales. In parallel, the Two-hop Regression Branch is presented to regress keypoints via the adaptive part-related points.

Single-Stage Network
Overall Architecture. As shown in Figure 4, given an input image, we first extracted the semantic feature via the backbone, followed by three well-designed components that predict specific information. We leveraged the Part Perception Module to regress seven adaptive human-part related points from the assumed center for each human instance. Then, we conducted the receptive field adaptation in the Enhanced Center-aware Branch by aggregating the features of the adaptive points to predict the center heatmap. In addition, the Two-hop Regression Branch adopts the adaptive human-part related points as one-hop nodes to indirectly regress the offsets from the center to each keypoint. Our network followed the pixel-wise keypoint regression paradigm, which estimates the candidate pose at each pixel (called center pixel) by predicting a 2K-dimensional offset vector from the center pixel to the K keypoints. In the following, we take a single pixel position as an example to describe the single-stage network.

Part Perception Module.
With the proposed body representation, we artificially divided each human instance into seven local parts (i.e., head, shoulder, left arm, right arm, hip, left leg, right leg) according to the inherent structure of the human body. The Part Perception Module is proposed to perceive the human parts by predicting seven adaptive human-part related points. For each part, we automatically regressed an adaptive point from center pixel $c$ without explicit supervision. Each adaptive part related point was considered as encoding the informative features for the keypoints belonging to this part. As shown in Figure 5, we fed the regression-branch-specific feature $F_r$ into a 3 × 3 convolutional layer to regress the 14-channel x-y offsets $\overline{off}_1$ from the center $c$ to the seven adaptive human-part related points on each pixel. These adaptive points acted as intermediate nodes, which were used for subsequent center positioning and keypoint regression.
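As a minimal sketch of how the 14 offset channels at one center pixel map to seven adaptive points, the snippet below adds each (dx, dy) pair to the center position. The interleaved channel order, the part ordering, and the names `PART_NAMES` and `adaptive_points` are assumptions made for illustration; the text does not specify them.

```python
# Part order follows the list in the text; the channel layout is an assumption.
PART_NAMES = ["head", "shoulder", "left_arm", "right_arm",
              "hip", "left_leg", "right_leg"]


def adaptive_points(center, off1):
    """Map 14 offset values at a center pixel to seven part-related points.

    center: (x, y) position of the assumed center pixel.
    off1:   14 floats, assumed to be laid out as [dx_0, dy_0, dx_1, dy_1, ...].
    Returns a dict from part name to its (x, y) position.
    """
    if len(off1) != 2 * len(PART_NAMES):
        raise ValueError("expected 14 offset values (7 parts x 2 coordinates)")
    cx, cy = center
    return {
        name: (cx + off1[2 * i], cy + off1[2 * i + 1])
        for i, name in enumerate(PART_NAMES)
    }
```

Because the points are obtained by simple addition, they stay differentiable with respect to the regressed offsets, which is consistent with the single-stage differentiable design described above.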
Enhanced Center-aware Branch. In previous works [32,33,46], the centers of human instances with various scales and deformations were predicted via features with a fixed receptive field for each position. However, a pixel position that predicts the center of a larger human body ought to have a larger receptive field than a position that predicts the center of a smaller human body. Thus, we propose a novel Enhanced Center-aware Branch, which consists of a receptive field adaptation process that extracts and aggregates the features of seven adaptive human-part related points for precise center localization.

As shown in Figure 5, we used the structure of 3 × 3 conv-relu to generate the branch-specific features. In the Enhanced Center-aware Branch, $F_c$ is a branch-specific feature with a fixed receptive field for each pixel position. We first used a 1 × 1 convolution to compress the 256-channel feature $F_c$ and obtain the 64-channel feature $F_c^0$. Then, we extracted the feature vectors of the adaptive points via bilinear interpolation (named 'Warp' in Figure 5) on $F_c^0$. Taking the head part as an example, the bilinear interpolation can be formulated as $F_c^{head} = F_c^0(c + \overline{off}_1^{head})$. Since the predicted adaptive points located on the seven divided parts are relatively evenly distributed over the human body region, the process above can be regarded as a receptive field adaptation according to the human scale that also captures the various pose information sufficiently. Finally, we used the aggregated feature $F_c^{adapt}$, which has an adaptive receptive field, to predict the 1-channel probability map for center localization.
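The 'Warp' step samples a feature map at a non-integer adaptive point. The sketch below shows plain bilinear interpolation on a single-channel map stored as nested lists; clamping to the map border and the name `bilinear_sample` are assumptions for illustration, and a real implementation would operate on 64-channel tensors.

```python
import math


def bilinear_sample(feature, x, y):
    """Sample a 2D feature map (feature[row][col]) at a continuous (x, y).

    Coordinates outside the map are clamped to the border (an assumption).
    """
    height, width = len(feature), len(feature[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)

    # Integer corners surrounding the sampling location.
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)

    # Fractional distances act as interpolation weights.
    wx, wy = x - x0, y - y0
    top = (1.0 - wx) * feature[y0][x0] + wx * feature[y0][x1]
    bottom = (1.0 - wx) * feature[y1][x0] + wx * feature[y1][x1]
    return (1.0 - wy) * top + wy * bottom
```

Sampling at `c + offset` for each of the seven parts yields seven feature values (vectors in the multi-channel case) whose spatial spread follows the size of the person, which is what makes the receptive field adaptive.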
We used the normalized Gaussian kernel with mean $(C_x, C_y)$ and adaptive variance $\delta$ calculated from the human scale to generate the ground-truth center heatmap. Concretely, we calculated the Gaussian kernel radius from the size of the instance, ensuring that a pair of points within the radius would generate a bounding box with an IoU of at least 0.7 with the ground-truth annotation. The adaptive variance is 1/3 of the radius. For the loss function of the Enhanced Center-aware Branch, we employed the pixel-wise focal loss in a penalty-reduced manner as follows:
$$loss_{ct} = -\frac{1}{N} \sum \begin{cases} (1 - \overline{P}_c)^{\alpha} \log(\overline{P}_c), & \text{if } P_c = 1 \\ (1 - P_c)^{\beta} (\overline{P}_c)^{\alpha} \log(1 - \overline{P}_c), & \text{otherwise} \end{cases} \tag{3}$$
where the sum runs over all pixel positions, $N$ refers to the number of positive samples, and $\overline{P}_c$ and $P_c$ indicate the predicted per-pixel confidence and the corresponding ground truth. $\alpha$ and $\beta$ are hyper-parameters and are set to 2 and 4, following CenterNet [32] and CornerNet [42]. In the above loss, only center pixels with peak 1.0 are positive samples and all others are negative samples.
Two-hop Regression Branch. We leveraged a two-hop regression method to predict the displacements instead of directly regressing the center-to-joint offsets. In this manner, the adaptive human-part related points predicted by the Part Perception Module act as one-hop nodes to build up the connection between human instance and keypoints more effectively.
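The center supervision described above can be sketched in a few lines. The snippet builds a Gaussian target with a peak of 1.0 and a standard deviation of one third of the radius, then evaluates the penalty-reduced focal loss in the CenterNet form with α = 2 and β = 4. The radius is taken as an input here (the IoU-based radius computation is not reproduced), and the function names and the clipping constant `eps` are assumptions for illustration.

```python
import math


def gaussian_center_heatmap(height, width, center, radius):
    """Ground-truth heatmap with peak 1.0 at `center` and sigma = radius / 3."""
    cx, cy = center
    sigma = radius / 3.0
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def penalty_reduced_focal_loss(pred, target, alpha=2.0, beta=4.0, eps=1e-6):
    """Pixel-wise focal loss; only pixels with target == 1.0 are positives."""
    loss, num_pos = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # keep log() finite
            if t == 1.0:
                loss -= (1.0 - p) ** alpha * math.log(p)
                num_pos += 1
            else:
                # (1 - t)^beta reduces the penalty near a ground-truth center.
                loss -= (1.0 - t) ** beta * p ** alpha * math.log(1.0 - p)
    return loss / max(num_pos, 1)
```

The `(1 - t)^beta` factor is what makes the loss penalty-reduced: negatives that lie close to a true center receive a much smaller weight than negatives far away from any person.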
We first leveraged the structure of 3 × 3 conv-relu to generate a branch-specific feature, named $F_r$, in the Two-hop Regression Branch. Then, we fed the 256-channel $F_r$ into a deformable convolutional layer [47,48] to generate the 64-channel feature $F_p$. Next, we extracted the features at the adaptive part related points via the bilinear interpolation operation (called 'Warp' in Figure 5) on $F_p$ for the corresponding keypoint regression. The Two-hop Regression Branch outputs a 34-channel tensor corresponding to the x-y offsets $\overline{off}$ from the center to the 17 keypoints, which is predicted in a two-hop manner as follows:
$$\overline{off} = \overline{off}_1 + \overline{off}_2$$
where $\overline{off}_1$ and $\overline{off}_2$ respectively indicate the offset from the center to the adaptive human-part related point (one-hop offset in Figure 5) and the offset from the human-part related point to the specific keypoints (second-hop offset in Figure 5). The predicted offsets $\overline{off}$ are supervised by a vanilla L1 loss, and the supervision only acts at positive keypoint locations; the other background locations are ignored. Furthermore, we added an additional loss term to learn the rigid bone connection between adjacent keypoints, termed the Skeleton-Aware Regression Loss. In particular, as shown in Figure 3c, we denoted a bone connection set as $B = \{B_i\}_{i=1}^{I}$, where $I$ is the number of bone connections in the pre-defined set $B$. Each bone is formulated as $B = P_{adjacent(joint)} - P_{joint}$, in which $P$ is the joint position and the function $adjacent(\cdot)$ returns the adjacent joint of the input joint. The total regression loss combines the L1 offset term with this bone term. In it, $off^{n}_{gt}$ and $B^{i}_{gt}$ are the ground-truth center-to-keypoint offset and bone connection, $N$ indicates the number of human instances, and $K$ is the number of valid keypoint locations. We find that employing the supervision on the bone connections brings a 0.3 AP improvement on CrowdPose [34].
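The two-hop composition and the idea behind the bone term can be illustrated as follows. This is a sketch under stated assumptions, not the exact loss of the paper: the keypoint-to-part assignment `part_of_keypoint`, the averaging over keypoints and bones, and the weight `bone_weight` are hypothetical choices, since the normalization and weighting are not recoverable from the text.

```python
def two_hop_offsets(off1, off2, part_of_keypoint):
    """Compose center-to-keypoint offsets as off1 (center -> part) + off2.

    off1: mapping part index -> (dx, dy) from the center to that part point.
    off2: list of (dx, dy) per keypoint, from its part point to the keypoint.
    part_of_keypoint: list giving the part index of each keypoint (hypothetical).
    """
    return [
        (off1[part_of_keypoint[k]][0] + dx, off1[part_of_keypoint[k]][1] + dy)
        for k, (dx, dy) in enumerate(off2)
    ]


def _l1(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def _bone(offsets, i, j):
    # Keypoint = center + offset, so a bone vector is a difference of offsets.
    return (offsets[j][0] - offsets[i][0], offsets[j][1] - offsets[i][1])


def skeleton_aware_loss(pred, gt, bones, bone_weight=1.0):
    """L1 on center-to-keypoint offsets plus L1 on bone vectors (a sketch)."""
    offset_term = sum(_l1(p, g) for p, g in zip(pred, gt)) / max(len(pred), 1)
    bone_term = sum(
        _l1(_bone(pred, i, j), _bone(gt, i, j)) for i, j in bones
    ) / max(len(bones), 1)
    return offset_term + bone_weight * bone_term
```

Note that a rigid shift of all keypoints changes the offset term but leaves the bone term at zero, so the bone term specifically penalizes errors in the relative geometry between adjacent joints.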

Training and Inference Details
During training, we employed an auxiliary training objective to learn the keypoint heatmap representation, which enabled the feature to maintain more human structural geometric information. In particular, we added a parallel branch to output a 17-channel heatmap corresponding to the 17 keypoints and applied a Gaussian kernel with adaptive variance to generate the ground-truth keypoint heatmap. We denote this training objective as $loss_{hm}$, which is similar to Equation (3). The only difference is that $N$ refers to the number of positive keypoints. The auxiliary branch was only used for the training process and was removed in the inference process.
Our total training loss for the multi-task training procedure combines the center loss $loss_{ct}$, the regression loss $loss_{off}$ and the auxiliary heatmap loss $loss_{hm}$.
During inference, the Enhanced Center-aware Branch outputs the center heatmap that indicates whether a pixel position is a center or not. The Two-hop Regression Branch outputs the offsets from the center to each keypoint. We first picked the human centers by using a 5 × 5 max-pooling kernel on the center heatmap and keeping 20 candidates, and then retrieved the corresponding offsets $(\delta_x^i, \delta_y^i)$ to form a human pose without any extra tricks. Specifically, we denoted the predicted center as $(C_x, C_y)$. The above decode process is formulated as follows:
$$(K_x^i, K_y^i) = (C_x, C_y) + (\delta_x^i, \delta_y^i)$$
where $(K_x^i, K_y^i)$ is the coordinate of the $i$-th keypoint. In contrast to DEKR [49], which further uses the average of the heat values extracted at the regressed keypoints to modulate the center heat values, we only leveraged the center heat values as the final pose score for fast inference.
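The single-step decode described above can be sketched as follows: a pixel is kept as a center if it equals the maximum of its 5 × 5 neighborhood (the effect of max-pooling followed by an equality check), the 20 highest-scoring centers are retained, and each keypoint is the center plus its regressed offset. The nested-list data layout and the name `decode_poses` are assumptions for illustration; coordinates are in heatmap space (1/4 of the input resolution).

```python
def decode_poses(center_heatmap, offsets, kernel=5, max_people=20):
    """Decode poses from a center heatmap and per-pixel keypoint offsets.

    center_heatmap[y][x]: center confidence at pixel (x, y).
    offsets[y][x]:        list of (dx, dy), one pair per keypoint.
    """
    height, width = len(center_heatmap), len(center_heatmap[0])
    r = kernel // 2
    peaks = []
    for y in range(height):
        for x in range(width):
            score = center_heatmap[y][x]
            # Local maximum test over the (border-clipped) kernel window.
            window_max = max(
                center_heatmap[j][i]
                for j in range(max(0, y - r), min(height, y + r + 1))
                for i in range(max(0, x - r), min(width, x + r + 1))
            )
            if score == window_max:
                peaks.append((score, x, y))

    peaks.sort(reverse=True)  # highest confidence first
    poses = []
    for score, cx, cy in peaks[:max_people]:
        keypoints = [(cx + dx, cy + dy) for dx, dy in offsets[cy][cx]]
        poses.append({"score": score, "center": (cx, cy), "keypoints": keypoints})
    return poses
```

The pose score is simply the center confidence, matching the choice stated above of not modulating it with keypoint heat values.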

Experiments and Analysis
In this section, we first briefly introduce the 2D pose estimation datasets, evaluation metric, data augmentation and implementation details. Next, we conduct comprehensive ablation studies to reveal the effectiveness of each component in Section 4.2. Then, we compare our proposed method with the previous methods on MS COCO [31] in Section 4.3 and CrowdPose [34] in Section 4.4. Finally, we extend our network to 3D multi-person pose estimation and verify the generalizability on 3D MuCo-3DHP [35] and MuPoTS-3D [35] datasets.

Experimental Setup
Dataset. We evaluated our method on two 2D multi-person pose estimation benchmarks, MS COCO [31] and CrowdPose [34]. The MS COCO dataset [31] is a large-scale pose estimation benchmark consisting of over 200,000 images and more than 250,000 human instances annotated with 17 keypoints. It is divided into train, validation, and test sets. We trained our model on the COCO train2017 dataset. The comprehensive experimental results are reported on the COCO mini-val set with 5000 images and on the test-dev2017 set with 20,000 images. The CrowdPose [34] dataset consists of 20,000 images with 80,000 labelled persons. The training, validation and test sets are partitioned in the proportions of 5:1:4. It contains more challenging images, which are used to verify the robustness in crowded scenes. Following previous works [2,3,26,49], we trained our models on the train and validation sets and report the results on the test set.
Evaluation Metric. We leveraged average precision and average recall based on different Object Keypoint Similarity (OKS) [31] thresholds to evaluate our keypoint detection performance on both MS COCO and CrowdPose datasets. OKS is formulated as follows:
$$OKS = \frac{\sum_i \exp\left(-d_i^2 / 2 s^2 k_i^2\right) \delta(\upsilon_i > 0)}{\sum_i \delta(\upsilon_i > 0)}$$
where $d_i$ is the Euclidean distance between the predicted keypoint and the corresponding ground truth, $\upsilon_i$ represents the visibility tag of the keypoint, $\delta$ is an indicator function that equals 1 when $\upsilon_i > 0$ and 0 otherwise, $s$ refers to the instance scale, and $k_i$ is a constant that controls the falloff for each specific keypoint. In addition, for the COCO dataset, we report $AP^M$ and $AP^L$, which correspond to AP over medium and large-sized instances, respectively. For CrowdPose, we report $AP^E$, $AP^M$ and $AP^H$, which indicate AP scores over easy, medium and hard instances, according to the dataset annotations.
Data Augmentation. During training, we used random flip, random rotation, random scaling and color jitter to augment training samples. The flip probability was set to 0.5, the rotation range was (−30, 30) and the scale range was (0.6, 1.3). During the training process, each input image was cropped according to a random center and random scale and then resized to 512 × 512 / 640 × 640 / 800 × 800 pixels for different backbones.
Implementation Details. We trained our proposed model via the Adam [50] optimizer with an initial learning rate of $2.5 \times 10^{-4}$ on a workstation with eight Tesla V100 GPUs. The learning rate was dropped to $2.5 \times 10^{-5}$ and $2.5 \times 10^{-6}$ at the 230th and 260th epochs, respectively. The training procedure was terminated at the 280th epoch (2× training scheme). All code was implemented in PyTorch. DLA-34 (19.7M) [51] and HRNet [21] were adopted to achieve trade-offs between accuracy and efficiency. The batch size was set to 128 for DLA-34 and HRNet-W32, and to 64 for HRNet-W48 due to the limited GPU memory.
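For readers who want to check the OKS metric used in this subsection numerically, the sketch below computes it for a single predicted pose. The function name `oks` and the argument layout are assumptions for illustration; the per-keypoint constants and the instance scale must come from the dataset's official evaluation settings and are not reproduced here.

```python
import math


def oks(pred, gt, visibility, scale, k):
    """Object Keypoint Similarity between one predicted and one ground-truth pose.

    pred, gt:    lists of (x, y) keypoint coordinates.
    visibility:  list of visibility tags; only keypoints with tag > 0 count.
    scale:       instance scale s.
    k:           list of per-keypoint falloff constants k_i.
    """
    similarity_sum, num_labelled = 0.0, 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, visibility, k):
        if v > 0:
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            similarity_sum += math.exp(-d2 / (2.0 * scale ** 2 * ki ** 2))
            num_labelled += 1
    # Poses without any labelled keypoint get a similarity of 0 by convention here.
    return similarity_sum / num_labelled if num_labelled else 0.0
```

A perfect prediction yields an OKS of 1.0, and the score decays with distance at a rate set by the instance scale, so the same pixel error costs more on a small person than on a large one.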
During inference, we kept the aspect ratio of the raw image and resized the short side of the images to 512/640/800 pixels accordingly. The output size was 1/4 of the input resolution. We further used flip and multi-scale image pyramids to boost the performance. It is worth highlighting that the flip was only applied to the center heatmap predicted by the Enhanced Center-aware Branch. All training and inference setups were shared between MS COCO [31] and CrowdPose [34] datasets.

Ablation Experiments
In this subsection, we conducted comprehensive ablation experiments to analyze each component individually, as well as our whole regression model. All ablation studies adopted DLA-34 as the backbone and used the 1× training schedule (140 epochs) with single-scale testing without horizontal flip on the COCO mini-val set.
Analysis of Part Perception Module. Based on the adaptive point set representation, the Part Perception Module was proposed to regress seven adaptive human-part related points, which were used for the subsequent prediction of the Enhanced Center-aware Branch and the Two-hop Regression Branch. Figure 6 shows the predicted human center and seven adaptive human-part related points on human instances with various scales and poses. As reported in Table 1, we studied several designs for the structure of the Part Perception Module: (1) a 1 × 1 convolutional layer; (2) a 3 × 3 convolutional layer with 7 groups, in which each group is responsible for a human part; (3) a vanilla 3 × 3 convolutional layer. The vanilla 3 × 3 convolution achieves a slightly better result, so we selected it for the follow-up experiments.
Analysis of Enhanced Center-aware Branch. In the Enhanced Center-aware Branch, we conducted the receptive field adaptation operation by aggregating the feature vectors of seven adaptive human-part related points to more precisely position the human center.
We conducted controlled experiments to explore the effect of the receptive field adaptation (RFA) process in the Enhanced Center-aware Branch. Compared with using the feature with a fixed receptive field to position the human center, the receptive field adaptation process obtained a 1.4% AP improvement (Expt. 3 versus Expt. 4 in Table 2). We consider that receptive field adaptation is capable of enhancing the center feature representation and dynamically adjusting its receptive field accordingly.

Analysis of Two-hop Regression Branch.
In the Two-hop Regression Branch, we adopted the adaptive human-part related points as intermediate nodes to localize the keypoints along the center-to-adaptive points-to-keypoints path.
As reported in Table 2 (Expt. 2 versus Expt. 4), two-hop regression achieves a 4.5% AP improvement compared with directly regressing the displacements from the center to each joint. The results indicate that the feature embedding of the adaptive point encodes the content and position information of the corresponding keypoints more sufficiently than the limited center feature embedding. Thus, these adaptive points serving as intermediate nodes can factorize center-to-joint offsets effectively to improve the regression performance.
Furthermore, we analyzed the localization error of the direct center-to-joint regression, the hierarchical regression in SPM [33] and our adaptive two-hop regression (with the auxiliary keypoint heatmap loss applied to all three regression schemes) via the coco-analyze tool [52]. The localization error consists of four error types: (1) Jitter is a small localization error; (2) Miss refers to a large localization error; (3) Inversion denotes confusion between keypoints within a human instance; (4) Swap indicates confusion between keypoints across different human bodies. The results are shown in Table 3. Compared with direct center-to-joint regression and hierarchical regression, our adaptive two-hop regression reduces Jitter error by 4.5 and 1.4, respectively, and reduces Miss error by 1.9 and 0.5, which shows that our regression method clearly improves on the localization quality of the other regression methods.
Analysis of Auxiliary Loss. We added a parallel branch to learn the keypoint heatmap representation, which was only used for auxiliary loss computation in the training stage. To study its effect, we trained the network with and without the auxiliary loss: employing the auxiliary heatmap loss to assist coordinate regression yields a 1.6% AP improvement. This indicates experimentally that learning the keypoint heatmap helps retain more structural geometric information, which improves the regression performance.
Analysis of Overall Architecture. We studied the inherent relationship between the Enhanced Center-aware Branch and the Two-hop Regression Branch, which are correlated by the adaptive human-part related points. As shown in Expt. 1 and 2 of Table 2, without two-hop regression, receptive field adaptation achieves a 1.0% AP improvement. As reported in Expt. 3 and 4 of Table 2, with two-hop regression, receptive field adaptation achieves a larger 1.4% AP improvement. We consider that $loss_{off}$ enables the adaptive points to scatter over the divided human parts, so that the receptive field adaptation is able to perceive the human center more precisely. Meanwhile, as reported in Expt. 1 and 3 of Table 2, without receptive field adaptation, two-hop regression brings a 4.1% AP improvement. With receptive field adaptation, two-hop regression brings a 4.5% AP improvement, as shown in Expt. 2 and 4 of Table 2. This indicates experimentally that $loss_{ct}$ drives the adaptive points to lie on semantically significant regions, so that two-hop regression is better able to locate the keypoints.
Analysis of Heatmap Refinement. CenterNet [32] performs a post-processing step that searches for the closest peak (confidence > 0.1) on the keypoint heatmap and uses it to replace the initially regressed result. Since the positions of the confidence peaks on the keypoint heatmap are integers, sub-pixel offsets are predicted in parallel to recover the discretization error. In this manner, the regressed predictions serve as grouping cues for assigning the keypoints detected from the heatmap to individuals. We refer to this process as heatmap refinement.
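The refinement step described above can be sketched as follows for a single keypoint type. This is an illustrative NumPy approximation, not the CenterNet implementation: the function names, the 3 × 3 local-maximum peak test, and the `(2, H, W)` layout of the sub-pixel offset map are assumptions made for this example, and any additional constraints applied in the original code are omitted.

```python
import numpy as np

def local_peaks(heat, thresh=0.1):
    """Return (ys, xs) of 3x3 local maxima whose confidence exceeds thresh."""
    heat = np.asarray(heat, dtype=float)
    h, w = heat.shape
    padded = np.pad(heat, 1, mode="constant", constant_values=-np.inf)
    # Nine shifted views of the padded map cover every 3x3 neighbourhood.
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(3) for dx in range(3)])
    is_peak = (heat >= windows.max(axis=0)) & (heat > thresh)
    return np.nonzero(is_peak)

def refine_keypoint(regressed_xy, heat, subpixel_offset, thresh=0.1):
    """Replace a regressed keypoint by the closest heatmap peak plus its offset.

    regressed_xy:    (2,) regressed (x, y) location.
    heat:            (H, W) heatmap of this keypoint type.
    subpixel_offset: (2, H, W) predicted (dx, dy) at every integer location
                     (layout assumed for this sketch).
    """
    regressed_xy = np.asarray(regressed_xy, dtype=float)
    ys, xs = local_peaks(heat, thresh)
    if len(xs) == 0:
        # No confident peak: keep the regressed result.
        return regressed_xy
    peaks = np.stack([xs, ys], axis=1).astype(float)
    nearest = np.linalg.norm(peaks - regressed_xy, axis=1).argmin()
    px, py = xs[nearest], ys[nearest]
    return peaks[nearest] + np.asarray(subpixel_offset, dtype=float)[:, py, px]

# Toy example: two peaks; the regressed point is nearest to the one at (x=3, y=2).
heat = np.zeros((8, 8))
heat[2, 3], heat[6, 6] = 0.9, 0.5
offsets = np.zeros((2, 8, 8))
offsets[:, 2, 3] = (0.25, -0.5)
refined = refine_keypoint((3.4, 2.6), heat, offsets)
```

Here the regressed location only selects which peak belongs to the person, while the final coordinate comes from the heatmap peak and its sub-pixel offset.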
As reported in CenterNet [32], heatmap refinement brings a large improvement of 6.2% AP over the initial regression result (from 51.7% AP to 57.9% AP). To validate the regression performance of our method, we further applied heatmap refinement to our regression result. For convenience, the two-hop regression result and the heatmap refinement result are denoted as Ours-reg and Ours-heat, respectively. As shown in Table 4, Ours-reg obtains a slightly better performance than Ours-heat (64.6% AP versus 64.4% AP), which indicates that our regression already localizes the keypoints accurately enough that heatmap refinement brings no further gain.

Results on MS COCO Dataset
We report the comparisons with the previous state-of-the-art methods on the COCO mini-val and test-dev sets. All experimental results were obtained via a 2x training schedule.
Mini-val Results. Table 5 compares our method with the most competitive recent bottom-up methods without any test-time augmentation (single-scale testing without flip), in order to reveal the keypoint positioning capability. With the smaller DLA-34 backbone and the same input resolution of 512 pixels, our method achieves 65.8 AP, outperforming the competitive bottom-up HrHRNet-W32 [26], SWAHR-W32 [2] and DEKR-W32 [49] by 2.2 AP, 1.1 AP and 2.4 AP, respectively, with a much faster inference speed. With HRNet-W32, we outperform the state-of-the-art CenterGroup-W32 [3] by 1.1% AP. With DLA-34 and an input resolution of 640 pixels, our network performs on par with HrHRNet-W48, SWAHR-W48 and DEKR-W48 using only 1/3 of the parameters. Furthermore, we obtain 70.5 AP using HRNet-W48 with an input resolution of 640 pixels, a 3.4 AP gain over the state-of-the-art regression-based method DEKR-W48. It is noteworthy that DEKR leverages the keypoint heat-value to modulate the center heat-value and further employs an extra rescoring network in post-processing, whereas we directly adopt the center heat-value as the final pose score for fast inference. The state-of-the-art bottom-up method CenterGroup-W48 (adopting HigherHRNet-W48 as a backbone) introduces a transformer encoder to conduct the grouping process; we surpass it by 1.4 AP using HRNet-W48 without extra deconvolution layers. These results indicate that our proposed body representation models the relationship between human instances and keypoints more effectively than previous heuristic or learnable grouping methods, in terms of both accuracy and speed. Figure 7 shows the predicted skeletons on the COCO mini-val set.
Test-dev Results. We further list comprehensive comparisons with the existing bottom-up and single-stage regression-based methods on the COCO test-dev set. In detail, as reported in Table 6, our method achieves a state-of-the-art 71.4 AP, which outperforms the widely used bottom-up methods CMU-pose [6] and AE [27] by a large margin with a faster inference speed. Finally, compared with previous single-stage regression-based methods, our method surpasses SPM [33] (refined by a well-trained single-person pose estimation model) by 4.5 AP without any refinement, and also outperforms DirectPose [46] by a large margin of 5.1 AP.

Results on CrowdPose Dataset
We compared our method with the previous state-of-the-art methods on the CrowdPose dataset, which consists of more challenging crowd scenes. Following existing methods [2,26,49], we trained our models on the train and val sets and evaluated the performance on the test set.
The comparisons with previous state-of-the-art methods are shown in Table 7. Generally, the top-down paradigm achieves a better performance than the bottom-up and single-stage paradigms because each person is cropped before single-person pose estimation is performed. Nevertheless, our single-stage method achieves a better performance than most widely used top-down methods on CrowdPose. We consider that this is because, in crowd scenes where persons usually overlap heavily, the detected single-person region often contains the bodies of other persons.
Furthermore, among the bottom-up methods, we outperform CMU-pose [6] by a large margin. Compared with HigherHRNet [26], which uses HRNet-W48 and a higher output resolution, our method achieves an equal performance using only the smaller HRNet-W32 without multi-scale heatmap aggregation. Our method with HRNet-W48 improves on HigherHRNet-W48 by 2.2 AP (1.6 AP) for single-scale (multi-scale) testing. Compared with the competitive DEKR [49], we achieve a 1.2 AP gain without any bells and whistles in the inference stage (e.g., using only the center score as the final pose score). Figure 8 shows the predicted skeletons on CrowdPose. These results indicate that the positioning capability of our network remains strong in crowded scenes.

AdaptivePose for 3D Pose Estimation
We further extended AdaptivePose to 3D multi-person pose estimation [54] to verify its generality.
Methodology. To demonstrate the effectiveness of our proposed body representation and single-stage network in 3D scenes in a simple way, we used pixel-wise depth maps to predict the depth information of all human bodies. On top of the 2D network, we added two parallel branches. The first estimates a 1-channel absolute root-depth map in a camera-centered coordinate system. For its target, the map values within a region of radius 4 centered on the root joint are set to the absolute depth of that root joint. The second branch outputs a 14-channel map of the depths of the other keypoints relative to their root joint (MuCo-3DHP and MuPoTS-3D only provide annotations for 15 keypoints). For its target, the map values within a region of radius 4 centered on the root joint are set to the depths of the keypoints relative to that root joint. Because the visual perception of object scale and depth depends on the size of the field of view (FoV), following SMAP [55], we normalized the depth by the size of the FoV for all training samples as $Z = z \cdot w / f$, where $Z$ is the normalized depth, $z$ is the original depth, and $f$ and $w$ are the focal length and the image width, respectively.
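As an illustration of the depth normalization and the root-depth target described above, the following minimal sketch implements $Z = z \cdot w / f$, its inverse, and a target map filled around each root joint. The function names are hypothetical, and two details are assumptions made only for this example: the "region with radius 4" is taken to be a disk, and overlapping regions are resolved by letting later persons overwrite earlier ones.

```python
import numpy as np

def normalize_depth(z, focal_length, image_width):
    """FoV normalization: Z = z * w / f."""
    return z * image_width / focal_length

def denormalize_depth(z_norm, focal_length, image_width):
    """Inverse mapping, converting a normalized depth back to metric depth."""
    return z_norm * focal_length / image_width

def root_depth_target(height, width, roots_xy, depths, radius=4):
    """Build a 1-channel target map holding each root's depth around its location.

    roots_xy: iterable of (x, y) root-joint positions in map coordinates.
    depths:   normalized absolute depth of each root.
    The disk shape and overwrite order are assumptions of this sketch.
    """
    target = np.zeros((height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for (cx, cy), depth in zip(roots_xy, depths):
        mask = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
        target[mask] = depth
    return target

# Toy example: a root joint 3000 mm away, focal length 1500 px, image width 832 px.
z_norm = normalize_depth(3000.0, focal_length=1500.0, image_width=832.0)
target = root_depth_target(20, 20, roots_xy=[(10, 10)], depths=[1.5])
```

The relative-depth targets for the other 14 keypoints could be built the same way, with one channel per keypoint and the relative depth written around the root location.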
During inference, we first formed the 2D human pose as described in Section 3.3. Second, we extracted the absolute root depth and the corresponding relative depths of the other keypoints via bilinear interpolation at the root position of each pose candidate. The predicted depth values were then converted back to metric values. Finally, according to the 2D keypoint coordinates and the corresponding depths, the 3D pose was reconstructed through the perspective camera model $[X, Y, Z]^{T} = Z K^{-1} [x, y, 1]^{T}$, where $[X, Y, Z]$ refers to the 3D coordinates in the camera-centered coordinate system (with $Z$ the depth converted back to metric units), $[x, y]$ is the 2D coordinate of a keypoint in the pixel coordinate system, and $K$ is the camera intrinsic matrix.
Dataset. MuCo-3DHP is the training dataset, which is generated by compositing the 3D single-person pose estimation dataset MPI-INF-3DHP [56]. The MuPoTS-3D dataset is the test set. It contains 8700 challenging images, was captured outdoors, and consists of 20 real-world scenes annotated with 3D keypoint positions. The annotations were obtained from a multi-view marker-less motion capture system.
Evaluation Metrics. We used the 3D percentage of correct keypoints with root alignment (3D PCK$_{rel}$) and the area under the 3D PCK curve across different thresholds (AUC$_{rel}$) to evaluate the relative root-centered prediction. Following SMAP [55], a prediction was considered correct if it lay within 15 cm of the annotated keypoint position. We further used 3D PCK$_{abs}$, the 3D PCK without root alignment, to evaluate the absolute camera-centered prediction.
Implementation Details. We adopted the Adam optimizer to train our 3D network with a batch size of 64 on a workstation with eight 32GB Tesla V100 GPUs. We employed a warmup training strategy, and the initial learning rate was set to $1.0 \times 10^{-3}$. Training was terminated at the 20th epoch. Following previous work, we used MuCo-3DHP mixed with MS COCO to train the 3D network. All images were shuffled and each mini-batch was randomly sampled from the shuffled dataset. All images were resized to a fixed resolution of 832 × 512 as model input for both training and testing.
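For reference, the 3D reconstruction step used at inference, in which each 2D keypoint is lifted through the perspective camera model by scaling $K^{-1}[x, y, 1]^{T}$ with the keypoint's metric depth, can be sketched as below. The function name `back_project`, the toy intrinsic matrix, and the example values are hypothetical; the absolute depth of a non-root keypoint is taken here as the root depth plus its predicted relative depth.

```python
import numpy as np

def back_project(keypoints_xy, depths, K):
    """Lift 2D keypoints to camera-centered 3D points.

    keypoints_xy: (J, 2) pixel coordinates (x, y).
    depths:       (J,) metric depth of each keypoint along the camera axis.
    K:            (3, 3) camera intrinsic matrix.
    Returns a (J, 3) array of [X, Y, Z] coordinates.
    """
    xy = np.asarray(keypoints_xy, dtype=float)
    z = np.asarray(depths, dtype=float)
    homogeneous = np.concatenate([xy, np.ones((len(xy), 1))], axis=1)  # (J, 3)
    rays = homogeneous @ np.linalg.inv(np.asarray(K, dtype=float)).T   # K^-1 [x, y, 1]^T per row
    return rays * z[:, None]

# Toy intrinsics: focal length 1000 px, principal point at the image center.
K = np.array([[1000.0, 0.0, 416.0],
              [0.0, 1000.0, 256.0],
              [0.0, 0.0, 1.0]])
root_depth = 3000.0                      # absolute root depth (mm)
relative_depths = np.array([0.0, -1000.0])  # root itself, then one other keypoint
keypoints_xy = np.array([[416.0, 256.0], [516.0, 256.0]])
points_3d = back_project(keypoints_xy, root_depth + relative_depths, K)
```

In this toy case the root sits on the optical axis, so it maps to X = Y = 0 at its own depth, while the second keypoint, 100 px to the right at a depth of 2000 mm, maps to X = 200 mm.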
Results. Table 8 reports the results of our method and previous top-down and bottom-up methods on the MuPoTS-3D [35] dataset. Our AdaptivePose-3D achieves 83.9 3D PCK$_{rel}$ and an AUC$_{rel}$ of 44.6 with HRNet-W32, which outperforms the top-down 3DMPPE [57] and Hdnet [58] by 1.4 and 0.2 3D PCK$_{rel}$, respectively. Compared with the bottom-up methods, our method outperforms Xnect [59] by a large margin, and surpasses SMAP [55] and Shen et al. [60] by 3.5 and 0.7 3D PCK$_{rel}$, respectively. The results indicate that, in 3D scenes, our body representation builds the relationship between a human instance and its corresponding keypoints more effectively than the heuristic grouping process. Although we only used a particularly simple method to regress the absolute root depth, we also achieved a promising performance on 3D PCK$_{abs}$. Figure 9 shows the predicted skeletons on MuPoTS-3D [35]. We believe that AdaptivePose-3D has great potential when combined with a more effective depth estimation approach.

Conclusions
In this paper, we introduced a fine-grained body representation that represents human parts as adaptive points. Based on the proposed body representation, we built a compact single-stage network, named AdaptivePose++. The proposed network eliminates the time-consuming grouping and refinement processes, thus obtaining the best speed-accuracy trade-off. Concretely, our method exceeds the state-of-the-art bottom-up DEKR [49] and CenterGroup [3] methods by 3.4 AP and 1.4 AP, respectively, and outperforms other existing bottom-up as well as single-stage approaches on MS COCO with a faster inference speed. Comprehensive experiments demonstrate its generality to crowd and 3D scenes.
AdaptivePose++ eliminates complex post-processing during inference but still requires NMS to remove duplicates. We will explore designing a more efficient framework without any post-processing in future work. We also believe that the proposed body representation can inspire other human-centered vision tasks such as action recognition and human reconstruction.