Sensors
  • Article
  • Open Access

22 August 2023

SSA Net: Small Scale-Aware Enhancement Network for Human Pose Estimation

School of Computer Science and Technology, Zhejiang Sci-Tech University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Artificial Intelligence in Computer Vision: Methods and Applications

Abstract

In the field of human pose estimation, heatmap-based methods have emerged as the dominant approach, and numerous studies have achieved remarkable performance with this technique. However, the inherent drawbacks of the heatmap representation cause serious performance degradation for smaller-scale persons. Some researchers have attempted to improve performance on small-scale persons, but their efforts have been hampered by a continued reliance on heatmap-based methods. To address this issue, this paper proposes SSA Net, which aims to enhance detection accuracy for small-scale persons as much as possible while maintaining a balanced perception of persons at other scales. SSA Net uses HRNetW48 as a feature extractor and leverages the TDAA module to enhance small-scale perception. Furthermore, it abandons the heatmap representation and instead adopts coordinate vector regression to represent keypoints. Notably, SSA Net achieved an AP of 77.4% on the COCO Validation dataset, which is superior to other heatmap-based methods. It also achieved highly competitive results on the Tiny Validation and MPII datasets.

1. Introduction

Human pose estimation is a crucial task in the field of computer vision that has garnered significant attention from researchers. Its specific approach involves localizing the keypoints of the human body (such as knees and elbows) from an image. Human pose estimation has a wide range of applications in daily life, such as action recognition [1,2,3], motion tracking [4,5,6], and augmented reality [7,8,9].
Heatmap-based methods are widely employed in the field of human pose estimation due to their high performance. However, these methods suffer from several issues, particularly in scenarios with low resolution or small-scale persons, where performance degrades significantly. In these cases keypoints become blurred and densely packed, and the heatmap representation cannot resolve them effectively. In recent years, researchers have explored new coordinate representations to replace the heatmap-based approach. One such method is the 1D vector representation, which has been validated in SimCC [10] and has shown superior localization accuracy for small-scale persons compared to heatmap-based methods. Building on this, this article optimizes the method by significantly reducing the parameter count with negligible loss of performance. The optimized approach is referred to as CVR (coordinate vector regression).
Based on the above, to tackle the issue of predicting small-scale persons, this article proposes SSA Net. The network uses the HRNetW48 architecture as the feature extractor, the TDAA module to reinforce small-scale perception, and the CVR method to predict keypoints. In detail, this article first crops each person from the image according to the predicted bounding box. The single-person image is then fed into the backbone for feature extraction, yielding a feature map at 1/4 of the original image size. Considering that high-resolution feature maps are more favorable for predicting small-scale persons, the output feature map is upsampled to 1/2 of the original image size using transpose convolution. Dilated convolution is then used to restrict the receptive field of the feature map to a relatively limited range. Subsequently, this article introduces coordinate attention [11] to generate position-sensitive feature maps. The TDAA module can effectively resolve the blurring of small-person features and the concentration of their keypoints. Finally, a residual mechanism is introduced to fuse features, and the CVR method is used to predict keypoints. Overall, the entire network is designed to focus on features of small-scale persons as much as possible. The main contributions of this paper are:
  • This article proposes a new network structure, SSA Net. Its most important feature is that it focuses on the performance of small-scale persons and addresses the unbalanced scale perception of mainstream models.
  • This article proposes the TDAA module in SSA Net, which can effectively improve the expression ability of small-scale person features and thus improve the prediction accuracy of small-scale persons.
  • This article proposes a coordinate vector regression method, which is better than the heatmap method in terms of both prediction accuracy and speed for small-scale persons.
  • SSA Net achieves significant performance improvements over mainstream heatmap methods on the COCO Validation and COCO test dev datasets, as well as competitive results on the MPII Validation dataset.

3. Proposed Method

3.1. Feature Extractor

The structure of SSA Net is shown in Figure 3. In this paper, we use the popular HRNetW48 network as the feature extractor, taking images of size H × W × 3 as input. After several convolutional layers for feature extraction, it outputs a feature map with a size of 1/4 of the original image.
Figure 3. Structure of SSA Net. The network consists of three parts: the Feature Extractor module, the TDAA module (highlighted in green), and the 1D Vector Generator (CVR) module, where $O_x^i$ and $O_y^i$ represent the coordinates of the predicted keypoints.

3.2. TDAA Module

Next, this paper feeds the output of the feature extractor into the TDAA module, as illustrated in Figure 4. In this module, through the addition operation of the residual mechanism, we merge the initial high-resolution feature map with the high-resolution feature map that has been enhanced for small-scale perception. This balances the perceptual capabilities across persons of different scales: while enhancing perception at the small scale, it does not excessively impact the perception of medium-to-large-scale persons. The module comprises a transpose convolution operation (T), a dilated convolution operation (D), an attention mechanism (A), and a residual addition (A).
Figure 4. Structure of TDAA module.
Specifically, considering that high-resolution feature maps are friendlier to small persons, this paper uses transpose convolution to increase the size of the feature maps to 1/2 of the original image. The feature maps output by the feature extractor can be represented as $(N, C, H_{in}, W_{in})$, and after transpose convolution the feature maps can be represented as $(N, C, H_{out}, W_{out})$, calculated as follows:
$$H_{out} = (H_{in} - 1) \times stride[0] - 2 \times padding[0] + ks[0]$$
$$W_{out} = (W_{in} - 1) \times stride[1] - 2 \times padding[1] + ks[1]$$
where $H$ represents the height of the feature maps, $W$ represents their width, $stride$ is the step size of the convolution kernel, $ks$ is the size of the convolution kernel, and $padding$ is the zero-padding applied to the feature maps. Subsequently, dilated convolutions are employed to control the receptive field. The approach uses a kernel size of 3, a padding of 2, and a dilation rate of 3 to limit the receptive field of the feature map to a small range, enabling the model to better perceive small-scale persons. The effectiveness of this module is validated through ablation experiments.
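For concreteness, the following is a minimal PyTorch sketch of the transpose and dilated convolution steps described above; it is illustrative rather than the authors' implementation. The transpose-convolution kernel and stride (both 2) and the channel count are assumptions, while the dilated convolution uses the stated kernel size 3, padding 2, and dilation 3.

```python
import torch
import torch.nn as nn

class TransposeDilatedFrontEnd(nn.Module):
    """Illustrative sketch of the T and D steps of the TDAA module (not the authors' code)."""

    def __init__(self, channels: int = 48):
        super().__init__()
        # Transpose convolution doubling the spatial resolution (1/4 -> 1/2 of the input image);
        # the output size follows H_out = (H_in - 1) * stride - 2 * padding + ks.
        self.upsample = nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2, padding=0)
        # Dilated convolution with the stated kernel size 3, padding 2, and dilation 3;
        # the effective kernel is 7x7, which keeps the receptive field small.
        self.dilated = nn.Conv2d(channels, channels, kernel_size=3, padding=2, dilation=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.upsample(x)  # (N, C, H/4, W/4) -> (N, C, H/2, W/2)
        x = self.dilated(x)   # with padding 2 the map shrinks by 1 pixel per side (padding 3 would preserve it)
        return x

# Example: a 1/4-resolution feature map from a 256 x 192 person crop.
feat = torch.randn(1, 48, 64, 48)
print(TransposeDilatedFrontEnd(48)(feat).shape)  # torch.Size([1, 48, 126, 94])
```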
After transpose convolution, the number of channels is doubled through a 1 × 1 convolution. Then, the feature map is fed into a coordinate attention block [11], as shown in Figure 5.
Figure 5. Structure of coordinate attention block.
To obtain attention on the image width and height and encode accurate position information, the coordinate attention block divides the input feature map into two directions, width and height, and performs global average pooling on each separately. The feature maps in the width and height directions are obtained as shown in the following formulas:
$$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$$
$$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$$
where W is the width of the feature maps and H is their height.
Next, the feature maps obtained from the width and height directions are concatenated and fed into a shared 1 × 1 convolutional module $F_1$, which reduces the channel dimensionality to $C/r$, where $r$ is a reduction ratio. After batch normalization, a non-linear activation $\delta$ is applied to obtain a feature map $f$ of size $1 \times (W + H) \times C/r$, as shown in the following formula:
$$f = \delta(F_1([z^h, z^w]))$$
Then, the feature map $f$ is split along the spatial dimension into $f^h$ and $f^w$, each of which is processed by a 1 × 1 convolution ($F_h$ and $F_w$, respectively) to restore the original number of channels. After applying the sigmoid activation function, we obtain the attention weights $g^h$ and $g^w$ for the height and width directions, respectively. The formulas are as follows:
$$g^h = \sigma(F_h(f^h))$$
$$g^w = \sigma(F_w(f^w))$$
After the aforementioned computations, the attention weights $g^h$ and $g^w$ for the input feature map's height and width are obtained. Finally, the original feature maps are weighted by multiplication to obtain feature maps with attention weights in both the height and width directions. The formula is as follows:
$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
In summary, coordinate attention can be viewed as a process of decomposing channel attention into two 1D feature encoding processes that aggregate features along different directions. This has the benefit of capturing long-range dependencies along one spatial direction while maintaining accurate position information along the other spatial direction. Subsequently, the resulting feature maps are encoded separately to generate a set of direction-sensitive and position-aware feature maps, which can be highly advantageous for dense human pose estimation tasks that involve numerous keypoints.
Finally, to make each module work better, this paper introduces a residual mechanism to fuse the output of the transpose convolution with the output of the coordinate attention mechanism.
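The coordinate attention computation described by the formulas above can be sketched in PyTorch roughly as follows, following the design of Hou et al. [11]. The reduction ratio, activation choice, and layer names are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of coordinate attention in the spirit of Hou et al. [11]; not the authors' code."""

    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # average over width  -> (N, C, H, 1), i.e., z^h
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # average over height -> (N, C, 1, W), i.e., z^w
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)   # shared 1x1 transform F_1
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)                       # non-linear activation delta
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)  # F_h
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)  # F_w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        z_h = self.pool_h(x)                      # (N, C, H, 1)
        z_w = self.pool_w(x).permute(0, 1, 3, 2)  # (N, C, W, 1)
        # Concatenate the two directional descriptors and apply the shared 1x1 conv + BN + activation.
        f = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))  # (N, C/r, H + W, 1)
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(f_h))                      # (N, C, H, 1)
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # (N, C, 1, W)
        return x * g_h * g_w  # y_c(i, j) = x_c(i, j) * g_h(i) * g_w(j)
```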

3.3. CVR Module

The principle of the CVR (coordinate vector regression) method is shown in Figure 6. In this method, the feature map output by the TDAA module is first flattened into M vectors (one per keypoint), each of length H/2 × W/2. These are then fed separately into the X and Y vector generators to produce the corresponding X and Y vectors. Finally, the coordinates are predicted by decoding the X and Y vectors. The X and Y vector generators are improved from SimCC [10], in which the authors used two fully connected layers for prediction. In the coordinate vector regression method, this paper instead uses one-dimensional convolutional blocks to replace the expensive fully connected layers, achieving good results. The effectiveness of this method is validated through ablation experiments in the next section.
Figure 6. Overview of the coordinate vector regression module, where K is the scaling factor, H and W are the height and width of the original image, and M is the number of keypoints marked for each human instance.
Coordinate Encoding: In this method, the x and y coordinates of the keypoints are represented by two independent 1D vectors. A scaling factor k is used (following the SimCC setting [10], k = 2), so the length of each 1D vector is greater than or equal to the corresponding image edge length. For the p-th keypoint, its encoded coordinates are represented as follows:
$$p' = (x', y') = (\mathrm{round}(x_p \times k),\ \mathrm{round}(y_p \times k))$$
The scaling factor k divides each pixel into k equally-sized bins. Its purpose is to increase the localization accuracy to a level smaller than that of a single pixel.
Coordinate Decoding: For the output X and Y vectors of the model, this paper naturally uses the argmax function to predict the final keypoints. The calculation method for the predicted point coordinates is shown below:
$$o_x = \frac{\mathrm{argmax}_i\,(o_x(i))}{k}$$
$$o_y = \frac{\mathrm{argmax}_j\,(o_y(j))}{k}$$
In other words, the location of the maximum value point on the 1D vector is divided by the scaling factor to restore it to the image scale.
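As a small illustration of the encoding and decoding steps, the sketch below assumes k = 2 (the SimCC setting adopted here) and toy vector lengths; the function names and image sizes are illustrative, not the authors' code.

```python
import numpy as np

K = 2  # scaling factor, following the SimCC setting used in the paper

def encode_coordinate(x: float, y: float, k: int = K):
    """Encode a keypoint (x, y) as bin indices into the X and Y 1D vectors."""
    return int(round(x * k)), int(round(y * k))

def decode_vectors(x_vec: np.ndarray, y_vec: np.ndarray, k: int = K):
    """Decode predicted 1D vectors back to image-scale coordinates via argmax / k."""
    return np.argmax(x_vec) / k, np.argmax(y_vec) / k

# Toy example: a keypoint at (37.3, 12.8) on the original image scale.
xi, yi = encode_coordinate(37.3, 12.8)       # bin indices 75 and 26
x_vec = np.zeros(192 * K); x_vec[xi] = 1.0   # assume a 192-wide input image
y_vec = np.zeros(256 * K); y_vec[yi] = 1.0   # assume a 256-high input image
print(decode_vectors(x_vec, y_vec))          # -> (37.5, 13.0); localization resolution is 1/k pixel
```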

4. Experiments

4.1. Experimental Details

The objective of this study is to enhance overall performance by improving the detection of small-scale persons. In the domain of human pose estimation, the widely used datasets include COCO, MPII, and Human3.6M. The authors' investigation found that, when the bounding box area is used as a threshold to differentiate among large, medium, and small persons, only the COCO Validation dataset contains a meaningful number of small-scale persons; the other datasets consist of images that are too idealized. To validate the model's performance, this paper selects the subset of images containing small-scale persons from the COCO Validation dataset and names it the Tiny Validation dataset.
Additionally, this paper evaluates on the COCO and MPII datasets to analyze the contribution of small-scale persons to the overall accuracy. To strengthen the evidence, this paper conducts several ablation experiments. First, we verify the effectiveness of each part of the TDAA module. Then, we demonstrate the superiority of the improved coordinate vector regression method over the SimCC baseline. Finally, we compare the heatmap method with and without the TDAA module, and the heatmap method against the coordinate vector regression method.

4.1.1. COCO Dataset

The COCO dataset is a large and versatile dataset proposed by Microsoft for image classification, object detection, semantic segmentation, and pose estimation tasks. It mainly contains images from Google and Bing, with content mostly consisting of daily scenes. The COCO dataset contains over 200 k images, with 250 k annotated instances of human body keypoints. The COCO training set has 118 k images, and the test set includes two subsets: COCO Validation, which contains 5 k images for simple testing and ablation experiments, and COCO test-dev, which contains 20 k images for online testing and fair comparison with mainstream models. The evaluation metrics used in the COCO dataset are the average precision (AP) and average recall (AR), which are both calculated based on the object keypoint similarity (OKS) between the ground truth and predicted keypoints. The OKS formula is shown below:
$$OKS = \frac{\sum_i \exp\left(-\frac{d_i^2}{2 s^2 k_i^2}\right) \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$
where $i$ indexes the annotated keypoints, $d_i^2$ is the squared Euclidean distance between the predicted and ground-truth keypoint coordinates, $s^2$ is the area of the person in the image, $k_i$ is a per-keypoint normalization constant representing the standard deviation of keypoint displacement, and $v_i$ indicates whether the keypoint is visible.
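The OKS computation can be sketched in NumPy as follows; the per-keypoint constants $k_i$ and the toy values are assumptions for illustration only, not taken from the paper.

```python
import numpy as np

def compute_oks(pred, gt, vis, area, k):
    """Object keypoint similarity between predicted and ground-truth keypoints.

    pred, gt : (M, 2) arrays of keypoint coordinates
    vis      : (M,) visibility flags (v_i > 0 means the keypoint is annotated)
    area     : person area s^2 in pixels
    k        : (M,) per-keypoint constants k_i
    """
    d2 = np.sum((pred - gt) ** 2, axis=1)    # squared distances d_i^2
    e = np.exp(-d2 / (2.0 * area * k ** 2))  # exp(-d_i^2 / (2 s^2 k_i^2))
    mask = vis > 0
    return e[mask].sum() / max(mask.sum(), 1)

# Toy example with 3 keypoints, one of them unannotated.
pred = np.array([[10.0, 12.0], [40.0, 41.0], [0.0, 0.0]])
gt   = np.array([[11.0, 12.0], [42.0, 40.0], [0.0, 0.0]])
vis  = np.array([2, 1, 0])
print(compute_oks(pred, gt, vis, area=80.0 ** 2, k=np.full(3, 0.05)))  # ~0.91
```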

4.1.2. Tiny Validation Dataset

The COCO Validation dataset contains 5 k images that cover most of the common scenes in daily life. In our preliminary research, we found that when the area threshold is set below 80² pixels, both the quantity and quality of the qualifying images drop sharply, leaving too little data of research value; when it is set above 80² pixels, the persons included are no longer small. Therefore, this paper defines images with a person area of less than 80² pixels as photos containing small-scale persons. After screening, we obtain 361 images that meet this criterion, and we name this subset the Tiny Validation dataset. It is a subset of COCO Validation and is used to evaluate the performance of mainstream models on small-scale persons.

4.1.3. MPII Dataset

The MPII dataset is a commonly used dataset for human pose estimation. It consists of approximately 40 k annotated person instances, each annotated with 16 keypoints. The images are extracted from YouTube videos. Generally, 28 k instances are used for training and 11 k for testing. Additionally, the validation dataset includes annotations for occluded body parts, 3D torso, and head orientation. The evaluation metric used for the MPII dataset is the percentage of correct keypoints (PCK). Specifically, a prediction is considered correct if the distance between the predicted and ground-truth keypoint coordinates is within a certain threshold. The calculation formula is as follows:
$$PCK_{\sigma}^{p}(d_0) = \frac{1}{|\tau|} \sum_{\tau} \delta\left( \left\| x_p^f - y_p^f \right\|_2 < \sigma \right)$$
where $d_0$ represents a detector and $\sigma$ is the threshold that determines whether a predicted keypoint matches the ground truth.
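Likewise, a brief sketch of the PCK metric defined above, using hypothetical head-size-normalized coordinates and a threshold of 0.5 (as in the common PCKh@0.5 variant); the values are illustrative only.

```python
import numpy as np

def compute_pck(pred, gt, sigma):
    """Percentage of correct keypoints: fraction of predictions within distance sigma of the ground truth.

    pred, gt : (T, 2) arrays of predicted and ground-truth keypoint coordinates over the test set
    sigma    : matching threshold, in the same (typically normalized) units as the coordinates
    """
    dist = np.linalg.norm(pred - gt, axis=1)  # ||x_p - y_p||_2 for each test sample
    return np.mean(dist < sigma)

# Toy example with 4 normalized predictions.
pred = np.array([[0.10, 0.20], [0.90, 0.40], [0.55, 0.55], [0.00, 1.00]])
gt   = np.array([[0.12, 0.21], [0.30, 0.35], [0.50, 0.50], [0.05, 0.98]])
print(compute_pck(pred, gt, sigma=0.5))  # -> 0.75 (three of four predictions are within the threshold)
```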

4.1.4. Experimental Environment

The hardware and software environment of the experiment are shown in Table 1.
Table 1. The software and hardware environment for all experiments in this article.

4.2. Experimental Results

4.2.1. Results on Tiny Validation Dataset

This paper first conducted tests on the Tiny Validation dataset to verify SSA Net's accuracy in detecting small-scale persons. As shown in Table 2, although mainstream models perform well on the COCO Validation dataset, their performance on the Tiny Validation dataset declines across the board. The AP of HigherHRNet drops from 66.5% to 46.8%, while SWAHR drops from 68.9% to 49.7%. This result confirms the preceding analysis that heatmap-based models are ill-suited to predicting small-scale persons because of the inherent limitations of the heatmap representation.
Table 2. Comparison with mainstream models on the Tiny Validation dataset, where ↓ denotes the drop in accuracy from the COCO Validation dataset to the Tiny Validation dataset.
SimCC achieves a higher AP on the Tiny Validation dataset than the other models, but it still suffers a large accuracy drop. This indicates that SimCC's performance on small-scale persons is indeed better than that of heatmap-based models, but it has not achieved scale-aware balance and lacks optimization for small-scale persons. In contrast, our SSA Net is specifically optimized for this issue: its accuracy drop is significantly smaller and its AP reaches 69.8%, far better than the performance of the other models on the Tiny Validation dataset.
Therefore, it can be seen that performance on small-scale persons may be an important factor limiting overall AP improvement, one that was not well addressed by previous mainstream models.

4.2.2. Results on COCO Validation Dataset

This paper conducted tests on the COCO Validation dataset to preliminarily validate the contribution of SSA Net to overall accuracy after small-scale-aware enhancement. As shown in Table 3, on the COCO Validation dataset SSA Net outperforms mainstream heatmap-based models and keypoint regression models in the major metrics, with a particularly significant improvement in AP^M.
Table 3. Comparison with mainstream models on COCO Validation dataset, bold is the best result in each column.
In particular, compared with PRTR-W32, SSA Net achieves a significant improvement of 5.5% in AP^M and 4.1% in overall AP, at the cost of only 2.6 M additional parameters. Compared with SimCC, SSA Net reduces the parameter count by 6.5 M while increasing AP by 1.5%, with most of the improvement contributed by small- and medium-scale persons. SSA Net outperforms the other mainstream models on AP^M and shows a significant improvement over the baseline network. Overall, SSA Net is highly effective for small-scale persons and enhances small-scale perception compared to other mainstream models.

4.2.3. Results on COCO Test Dev Dataset

This paper further conducted testing on the COCO test-dev dataset to compare our method with state-of-the-art mainstream models. According to Table 4, SSA Net achieves the best performance on most indicators. Compared with TransPose, which is based on the heatmap method, SSA Net improves AP by 0.8% and AP^M by 2.2%, the most significant improvement among all indicators. Among regression-based methods, compared with PRTR-W48, SSA Net improves AP by 3.7% while requiring only 38% of PRTR-W48's GFLOPs, indicating that SSA Net is superior to mainstream heatmap-based and regression-based models in both speed and accuracy. Compared with the SimCC baseline, SSA Net improves AP by 3.1% and AP^M by 4.3%, while GFLOPs decrease by 5.5. This shows that SSA Net is also very competitive against networks of the same type.
Table 4. Comparison with mainstream models on COCO test dev dataset, bold is the best result in each column.

4.2.4. Results on MPII Dataset

This paper also conducted testing on the mainstream MPII dataset to evaluate the model's performance more comprehensively. The results are shown in Table 5. Among heatmap-based methods, SSA Net outperforms HRNetW48 on all body parts except the elbow (Elb) and knee (Kne). Moreover, among regression-based methods, SSA Net surpasses PRTR and other networks. This indicates that SSA Net can also achieve relatively good performance on datasets with ideal human image quality, such as MPII.
Table 5. Comparison with mainstream models on MPII dataset, bold is the best result in each column.

4.2.5. Qualitative Experimental Results

To provide a more intuitive illustration of the effectiveness of SSA Net, this paper visualizes the model's testing results on COCO Validation in Figure 7. The results demonstrate that SSA Net accurately predicts keypoints in various challenging scenarios, such as when the person is small or the environment is crowded. As shown in Figure 8, the left image is the original image, the middle image is the result of the heatmap-based method (taking HigherHRNet-W48 as a representative example), and the right image is the result of SSA Net. In the first image, the heatmap-based method is not very accurate in predicting the keypoints of the legs, while SSA Net predicts them accurately. In the second-to-last image, where the person is small and partially occluded by the background, SSA Net predicts the upper body more accurately and the occluded parts of the lower body more reasonably. The other images also show that the heatmap-based method misses detections in distant scenes with small-scale persons, while SSA Net handles this problem well.
Figure 7. Illustration of human pose estimation results of SSA Net in different scenes on COCO Validation dataset.
Figure 8. Comparison results between SSA Net and other mainstream models on COCO Validation dataset.

4.3. Ablation Experiments

4.3.1. Ablation Experiment of TDAA Module

The TDAA module is a crucial component of SSA Net, enhancing the network’s ability to perceive small-scale persons. To verify the effectiveness of each part of the TDAA module, this paper conducts ablation experiments.
It is worth noting that all five methods use the proposed coordinate vector regression method to predict the keypoints. The results, shown in Table 6, indicate that the coordinate attention mechanism contributes the most to SSA Net's average precision (AP), with an improvement of 1.1%. The transpose convolution module follows closely with a contribution of 0.7%, while the dilated convolution contributes 0.3% and the residual mechanism 0.2%. The modules thus work together to improve the overall performance. This ablation experiment also verifies that the TDAA module is specifically designed to enhance the perception of small-scale persons.
Table 6. Ablation experiment of the TDAA module, where A1 is the attention module, A2 is the residual mechanism, and ↓ denotes how much the model's accuracy changes relative to the baseline.

4.3.2. Ablation Experiment of TDAA and CVR Module

Having verified the effectiveness of the TDAA module, this paper further combines it with heatmap methods, as shown in Table 7. Comparing method 1 and method 3, the TDAA module contributes a significant 2.1% improvement to AP, demonstrating that it can also enhance performance when used in conjunction with heatmap methods.
Table 7. Ablation experiment of TDAA and CVR modules.
Furthermore, to evaluate the proposed coordinate vector regression method against heatmap methods, method 3 and method 4 are compared. The results indicate that the coordinate vector regression method improves AP by 2.2% over the heatmap method. When the TDAA module is not used, method 1 and method 2 show that the coordinate vector regression method improves AP by 2.7%, an even more significant improvement. These findings verify the superior performance of the coordinate vector regression method compared to heatmap methods and indirectly demonstrate the role of the TDAA module, which contributes a 0.5% AP gain to the heatmap methods.

4.3.3. Ablation Experiment of CVR Module

Table 8 further verifies the effectiveness of the coordinate vector regression method, which improves on the method proposed in SimCC. This paper replaces the expensive fully connected layers in SimCC with a 1D convolution block, which reduces dimensionality through the sparse connections of convolution. As shown by method 1 and method 2, the number of parameters decreases by 6.5 M at a sacrifice of only 0.1% in AP, a worthwhile trade-off given the significant reduction in parameters.
Table 8. Ablation experiment of CVR module.

5. Conclusions

SSA Net addresses the deficiencies of previous models and makes specific optimizations for small-scale person pose estimation. It uses a more accurate top-down structure and replaces the heatmap representation with the coordinate vector regression method to locate the keypoints of small persons more accurately. Additionally, SSA Net proposes the TDAA module and verifies its effectiveness through ablation experiments.
While SSA Net has achieved impressive results, it still faces some challenges. Despite the improvement in perceiving small-scale persons compared to other models, there is still a 7.6% accuracy loss observed in the Tiny Validation dataset. This is an issue that requires further in-depth research in our future work.

Author Contributions

Conceptualization, S.L.; methodology, S.L. and H.Z.; software, S.L.; validation, H.Z., H.M., J.F. and M.J.; formal analysis, S.L.; investigation, S.L.; resources, H.Z.; data curation, S.L.; writing—original draft preparation, S.L.; writing—review and editing, S.L. and H.Z.; visualization, S.L.; supervision, H.Z., H.M., J.F. and M.J.; project administration, H.Z.; funding acquisition, M.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the National Natural Science Foundation of China (61672466, 62011530130) and the Joint Fund of the Zhejiang Provincial Natural Science Foundation (LSZ19F010001).

Institutional Review Board Statement

The institution has reviewed this study and agreed to its submission.

Data Availability Statement

The datasets used in the experiments are the public COCO and MPII datasets, which are freely available. Because the algorithm is still under active development and follow-up work is ongoing, the source code can be obtained from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Pham, H.H.; Salmane, H.; Khoudour, L.; Crouzil, A.; Velastin, S.A.; Zegers, P. A Unified Deep Framework for Joint 3D Pose Estimation and Action Recognition from a Single RGB Camera. Sensors 2020, 20, 1825. [Google Scholar] [CrossRef] [PubMed]
  2. Neili Boualia, S.; Essoukri Ben Amara, N. Deep Full-Body HPE for Activity Recognition from RGB Frames Only. Informatics 2021, 8, 2. [Google Scholar] [CrossRef]
  3. Lin, F.-C.; Ngo, H.-H.; Dow, C.-R.; Lam, K.-H.; Le, H.L. Student Behavior Recognition System for the Classroom Environment Based on Skeleton Pose Estimation and Person Detection. Sensors 2021, 21, 5314. [Google Scholar] [CrossRef] [PubMed]
  4. Patil, A.K.; Balasubramanyam, A.; Ryu, J.Y.; Chakravarthi, B.; Chai, Y.H. An Open-Source Platform for Human Pose Estimation and Tracking Using a Heterogeneous Multi-Sensor System. Sensors 2021, 21, 2340. [Google Scholar] [CrossRef] [PubMed]
  5. Kim, M.; Lee, S. Fusion Poser: 3D Human Pose Estimation Using Sparse IMUs and Head Trackers in Real Time. Sensors 2022, 22, 4846. [Google Scholar] [CrossRef] [PubMed]
  6. Guidolin, M.; Menegatti, E.; Reggiani, M. UNIPD-BPE: Synchronized RGB-D and Inertial Data for Multimodal Body Pose Estimation and Tracking. Data 2022, 7, 79. [Google Scholar] [CrossRef]
  7. Shao, M.Y.; Vagg, T.; Seibold, M.; Doughty, M. Towards a Low-Cost Monitor-Based Augmented Reality Training Platform for At-Home Ultrasound Skill Development. J. Imaging 2022, 8, 305. [Google Scholar] [CrossRef] [PubMed]
  8. Basiratzadeh, S.; Lemaire, E.D.; Baddour, N. A Novel Augmented Reality Mobile-Based Application for Biomechanical Measurement. BioMed 2022, 2, 255–269. [Google Scholar] [CrossRef]
  9. Park, Y.J.; Ro, H.; Lee, N.K.; Han, T.-D. Deep-cARe: Projection-Based Home Care Augmented Reality System with Deep Learning for Elderly. Appl. Sci. 2019, 9, 3897. [Google Scholar] [CrossRef]
  10. Li, Y.; Yang, S.; Liu, P.; Zhang, S.; Wang, Y.; Wang, Z. SimCC: A Simple Coordinate Classification Perspective for Human Pose Estimation. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; pp. 89–106. [Google Scholar]
  11. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 13713–13722. [Google Scholar]
  12. Toshev, A.; Szegedy, C. Deeppose: Human pose estimation via deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1653–1660. [Google Scholar]
  13. Li, S.; Zhang, H.; Ma, H.; Feng, J.; Jiang, M. CSIT: Channel Spatial Integrated Transformer for human pose estimation. IET Image Process. 2023, 17, 3002–3011. [Google Scholar] [CrossRef]
  14. Tian, Z.; Chen, H.; Shen, C. Directpose: Direct end-to-end multi-person pose estimation. arXiv 2019, arXiv:1911.07451. [Google Scholar]
  15. Sun, X.; Xiao, B.; Wei, F.; Liang, S.; Wei, Y. Integral human pose regression. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 529–545. [Google Scholar]
  16. Sun, X.; Shang, J.; Liang, S.; Wei, Y. Compositional human pose regression. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2602–2611. [Google Scholar]
  17. Nie, X.; Feng, J.; Zhang, J.; Yan, S. Single-stage multi-person pose machines. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6951–6960. [Google Scholar]
  18. Li, J.; Bian, S.; Zeng, A.; Wang, C.; Pang, B.; Liu, W.; Lu, C. Human pose regression with residual log-likelihood estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, ON, Canada, 10–17 October 2021; pp. 11025–11034. [Google Scholar]
  19. Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3686–3693. [Google Scholar]
  20. Zhang, F.; Zhu, X.; Ye, M. Fast human pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 3517–3526. [Google Scholar]
  21. Zhang, F.; Zhu, X.; Dai, H.; Ye, M.; Zhu, C. Distribution-aware coordinate representation for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 7093–7102. [Google Scholar]
  22. Li, W.; Wang, Z.; Yin, B.; Peng, Q.; Du, Y.; Xiao, T.; Sun, J. Rethinking on multi-stage networks for human pose estimation. arXiv 2019, arXiv:1901.00148. [Google Scholar]
  23. Cai, Y.; Wang, Z.; Luo, Z.; Yin, B.; Du, A.; Wang, H.; Sun, J. Learning delicate local representations for multi-person pose estimation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 455–472. [Google Scholar]
  24. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 483–499. [Google Scholar]
  25. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5693–5703. [Google Scholar]
  26. Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 466–481. [Google Scholar]
  27. Yang, S.; Quan, Z.; Nie, M.; Yang, W. Transpose: Keypoint localization via transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, ON, Canada, 10–17 October 2021; pp. 11802–11812. [Google Scholar]
  28. Li, J.; Chen, T.; Shi, R.; Lou, Y.; Li, Y.L.; Lu, C. Localization with sampling-argmax. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; Volume 34, pp. 27236–27248. [Google Scholar]
  29. Nibali, A.; He, Z.; Morgan, S.; Prendergast, L. Numerical coordinate regression with convolutional neural networks. arXiv 2018, arXiv:1801.07372. [Google Scholar]
  30. Tompson, J.J.; Jain, A.; LeCun, Y.; Bregler, C. Joint training of a convolutional network and a graphical model for human pose estimation. arXiv 2014. [Google Scholar] [CrossRef]
  31. Cheng, B.; Xiao, B.; Wang, J.; Shi, H.; Huang, T.S.; Zhang, L. Higherhrnet: Scale-aware representation learning for bottom-up human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 5386–5395. [Google Scholar]
  32. Luo, Z.; Wang, Z.; Huang, Y.; Wang, L.; Tan, T.; Zhou, E. Rethinking the heatmap regression for bottom-up human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 13264–13273. [Google Scholar]
  33. Geng, Z.; Sun, K.; Xiao, B.; Zhang, Z.; Wang, J. Bottom-up human pose estimation via disentangled keypoint regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 14676–14686. [Google Scholar]
  34. Yin, S.; Wang, S.; Chen, X.; Chen, E.; Liang, C. Attentive one-dimensional heatmap regression for facial landmark detection and tracking. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 538–546. [Google Scholar]
  35. Xiong, Y.; Zhou, Z.; Dou, Y.; Su, Z. Gaussian vector: An efficient solution for facial landmark detection. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2020. [Google Scholar]
  36. Mao, W.; Ge, Y.; Shen, C.; Tian, Z.; Wang, X.; Wang, Z. Tfpose: Direct human pose estimation with transformers. arXiv 2021, arXiv:2103.15320. [Google Scholar]
  37. Li, K.; Wang, S.; Zhang, X.; Xu, Y.; Xu, W.; Tu, Z. Pose recognition with cascade transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Online, 19–25 June 2021; pp. 1944–1953. [Google Scholar]
  38. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  39. Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299. [Google Scholar]
  40. Papandreou, G.; Zhu, T.; Kanazawa, N.; Toshev, A.; Tompson, J.; Bregler, C.; Murphy, K. Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4903–4911. [Google Scholar]
  41. Newell, A.; Huang, Z.; Deng, J. Associative embedding: End-to-end learning for joint detection and grouping. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  42. Kocabas, M.; Karagoz, S.; Akbas, E. Multiposenet: Fast multi-person pose estimation using pose residual network. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 417–433. [Google Scholar]
  43. Fang, H.S.; Xie, S.; Tai, Y.W.; Lu, C. Rmpe: Regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2334–2343. [Google Scholar]
  44. Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7103–7112. [Google Scholar]
  45. Huang, S.; Gong, M.; Tao, D. A coarse-fine network for keypoint localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3028–3037. [Google Scholar]
  46. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  47. Wei, F.; Sun, X.; Li, H.; Wang, J.; Lin, S. Point-set anchors for object detection, instance segmentation and pose estimation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 527–544. [Google Scholar]
  48. Wei, S.E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional pose machines. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4724–4732. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
