Article

Improved Multi-Person 2D Human Pose Estimation Using Attention Mechanisms and Hard Example Mining

1 School of Intelligence Science and Technology, University of Science and Technology Beijing, Beijing 100083, China
2 School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing 100083, China
* Author to whom correspondence should be addressed.
Sustainability 2023, 15(18), 13363; https://doi.org/10.3390/su151813363
Submission received: 15 July 2023 / Revised: 17 August 2023 / Accepted: 31 August 2023 / Published: 6 September 2023

Abstract

In recent years, human pose estimation, a subfield of computer vision and artificial intelligence, has achieved significant performance improvements owing to its wide applications in human-computer interaction, virtual reality, and smart security. However, most existing methods are designed for single-person scenes and suffer from low accuracy and long inference times in multi-person scenes. To address this issue, increasing attention has been paid to developing methods for multi-person pose estimation, such as bottom-up methods based on Part Affinity Fields (PAFs) that estimate the 2D poses of multiple people. In this study, we propose a method that addresses the problems of low network accuracy and poor estimation of flexible joints. The method introduces an attention mechanism into the network and adopts a joint point extraction strategy based on hard example mining. Integrating the attention mechanism improves the overall performance of the network, while the hard-example-based joint point extraction improves the localization accuracy of flexible joints without increasing network complexity. Experimental results demonstrate that our proposed method significantly improves the accuracy of 2D human pose estimation. Our network achieved an Average Precision (AP) score of 60.0 and outperformed competing methods on the standard COCO benchmark, signifying its strong performance.

1. Introduction

Human pose estimation is the task of estimating the positions of human body joints in a 2D image. It is a crucial aspect of computer vision and has been widely applied in various fields, including intelligent video surveillance, human-computer interaction [1], virtual reality, human body animation, motion capture, gaming, and security. The task of 2D human pose estimation can be divided into two categories: single-person pose estimation [2,3,4,5,6,7,8,9,10,11,12] and multi-person pose estimation. The former is relatively simple and can be achieved by using deep neural networks and heatmaps; popular methods that have demonstrated robust results include UniPose [13] and EfficientPose [14]. For multi-person pose estimation, because multiple individuals are present in the image, the algorithm must also determine which joint points belong to which individual in order to recover the pose of each person.
Multi-person pose estimation can be achieved through two approaches: top-down and bottom-up methods.
Top-down methods: A top-down method first detects the positions of all persons in the image using an object detection algorithm and then estimates the pose of each person. A representative example of this approach is MIPNet [15], which uses a Multi-Instance Modulation Block (MIMB) to detect people and overcome failures in crowded scenes with occlusions.
Bottom-up methods: A bottom-up method first estimates the positions of the joint points in the image, then groups the joint points belonging to the same person, and finally estimates the pose of each person.
OmniPose [16] presents a single-pass, end-to-end framework for multi-person pose estimation. Through its waterfall module, it boosts backbone feature extraction across scales and eliminates the need for post-processing. Its efficient multi-scale representations, powered by the enhanced waterfall module, offer both the progressive filtering efficiency and the broad field-of-view of spatial pyramids. The Residual Steps Network (RSN) [17] introduced a method for precise keypoint detection: RSN efficiently aggregates intra-level features, capturing rich low-level spatial details, and employs a Pose Refine Machine (PRM) to balance local and global representations, leading to refined keypoints. HigherHRNet [18] is a bottom-up human pose estimation method that addresses the scale-variation challenge; it uses high-resolution feature pyramids and multi-resolution supervision during training and inference to improve keypoint localization, especially for small individuals. Kreiss et al. [19] extended the use of scalar and vector fields in pose estimation to composite fields: their model combines a Part Intensity Field (PIF) and a Part Association Field (PAF), and together these two fields realize the estimation of the human pose.
Other research has focused on improving the efficiency and speed of 2D human pose estimation. EfficientHRNet [20] presents a family of lightweight multi-person human pose network structures that prove to be more computationally efficient than other bottom-up 2D human pose estimation approaches at every level while achieving highly competitive accuracy. Simple and Effective Network (SEN) [21] addresses the problem of multi-person pose estimation in low-power embedded systems. Kushwaha et al. [22] proposed an innovative approach for evaluating human 3D pose from a single 2D image using depth prediction with pose alignment. OpenPose [23] is a real-time approach to multi-person 2D pose estimation that uses Part Affinity Fields (PAFs) to associate body parts with individuals in the image. The proposed bottom-up method achieves high accuracy and real-time performance regardless of the number of people in the image.
The top-down human pose estimation method is known for its high accuracy, but its performance depends on the detector performance, and its inference time increases proportionally with the number of people in the image [24]. On the other hand, the accuracy of the bottom-up human pose method is lower than that of the top-down method, but its inference time remains relatively constant even as the number of people in the image increases [25].
In light of practical application scenarios and performance, in this study we introduce a multi-person 2D human pose estimation method and its improved algorithm based on PAFs. First, we review top-down and bottom-up methods and briefly analyze their advantages and disadvantages. Second, we describe the structure of our SE-ResNet-OKHM-CMU-Pose network and the specific implementation of the algorithm. Finally, we introduce an attention mechanism into the network, propose an optimization algorithm based on hard example mining of joint points, and test the method on the COCO dataset. Our method effectively improves the accuracy of the network and further improves the accuracy of predicting difficult, overlapping, and complex joint points. To summarize, the contributions of this work include the following aspects. Code is available at: https://github.com/cljhwt/SE-ResNet-OKHM-CMU-Pose (accessed on 14 July 2023).
  • The use of a bottom-up method based on PAF for multi-person 2D pose estimation, and the incorporation of an attention mechanism to enhance the overall performance of the network.
  • Addressing the issue of poor accuracy in extracting the more flexible joints. The proposed method incorporates a hard example mining mechanism, which effectively improves the accuracy of the pose estimation.

2. Background

The network architecture for multi-person 2D human pose estimation based on Part Affinity Fields (PAFs), known as CMU-Pose, was proposed by Cao et al. [23] and applies convolutional neural networks to bottom-up multi-person 2D human pose estimation. Our implementation of multi-person 2D human pose estimation is likewise based on PAFs and deep learning. Therefore, this section first introduces the network architecture and then provides background on the loss function.

2.1. CMU-Pose Network Structure

The diagram in Figure 1 shows the architecture of the CMU-Pose network, which consists of three main components: the feature extraction network (F), the basic layer (Stage 1), and the optimization layers (Stage t). The feature extraction network (F) consists of the first ten convolutional layers of VGG19 [26] and is used to extract the basic features for the joint heatmaps and PAF maps. The basic layer (Stage 1) consists of five convolutional layers, with the first three layers using 3 × 3 convolution kernels and the last two layers using 1 × 1 convolution kernels. The input to this layer is the underlying feature map generated by the network (F), and the output is two branches representing the joint heatmaps and the PAF maps. The primary function of this layer is to provide a rough estimate of the human pose in the image and to serve as the foundation for the network's capabilities. Each optimization layer (Stage t) comprises seven convolutional layers, with the first five layers using 7 × 7 convolution kernels and the last two layers using 1 × 1 convolution kernels. The purpose of the 7 × 7 convolution kernels is to increase the receptive field of the network, allowing it to "see" more information and facilitating the network's inference. The optimization layer receives input from two sources: the output of the previous stage and the underlying features generated by the network (F). Using such inputs enables precise inference by integrating low-level features with the coarse results from previous stages. The output of this layer is the same as that of the basic layer: the heatmaps and PAF maps of the joint points. Both the basic stage and the refinement stages are equipped with a loss function, which enables the network to learn more effectively and obtain relatively accurate results.
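To make the stage layout described above concrete, the following is a minimal PyTorch sketch of the two-branch base and refinement stages. The intermediate channel widths (128 and 512) and the ReLU after every convolution are illustrative assumptions; only the kernel sizes, the number of convolutions per stage, and the concatenation of the features F with the previous stage's outputs follow the description above.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel):
    # Convolution + ReLU with padding that preserves the spatial size.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, padding=kernel // 2),
        nn.ReLU(inplace=True),
    )

class StageBranch(nn.Module):
    """One branch (joint heatmaps or PAF maps) of a CMU-Pose stage."""
    def __init__(self, in_ch, out_ch, kernel, n_convs, mid_ch=128):
        super().__init__()
        layers = [conv_block(in_ch, mid_ch, kernel)]
        layers += [conv_block(mid_ch, mid_ch, kernel) for _ in range(n_convs - 1)]
        layers += [conv_block(mid_ch, 512, 1), nn.Conv2d(512, out_ch, 1)]  # two 1x1 layers
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class CMUPoseStages(nn.Module):
    """Base stage (3x3 kernels) followed by refinement stages (7x7 kernels)."""
    def __init__(self, feat_ch=128, n_joints=19, n_pafs=38, n_refine=5):
        super().__init__()
        self.stage1_S = StageBranch(feat_ch, n_joints, kernel=3, n_convs=3)
        self.stage1_L = StageBranch(feat_ch, n_pafs, kernel=3, n_convs=3)
        in_t = feat_ch + n_joints + n_pafs   # F concatenated with the previous outputs
        self.refine_S = nn.ModuleList(
            StageBranch(in_t, n_joints, kernel=7, n_convs=5) for _ in range(n_refine))
        self.refine_L = nn.ModuleList(
            StageBranch(in_t, n_pafs, kernel=7, n_convs=5) for _ in range(n_refine))

    def forward(self, feats):
        S, L = self.stage1_S(feats), self.stage1_L(feats)
        outputs = [(S, L)]                       # keep per-stage outputs for the loss
        for s_branch, l_branch in zip(self.refine_S, self.refine_L):
            x = torch.cat([feats, S, L], dim=1)  # fuse low-level features with coarse results
            S, L = s_branch(x), l_branch(x)
            outputs.append((S, L))
        return outputs
```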

2.2. CMU-Pose Network Loss Function

Network labels:
Because the network is based on supervised learning, it requires the design of labels for the training set to enable network training. The network’s output consists of keypoint heatmaps and PAF maps. Therefore, it is necessary to design labels for keypoints and PAFs for the network.
Joint heatmap labels:
The output of the network includes a heatmap, with each position having a corresponding confidence level that represents the probability of it being a joint point [27]. The closer a position is to the true location of the joint point, the higher its confidence level is. To ensure that the confidence level is 1 at the exact location of the joint point, a Gaussian response is implemented around it. As the distance from the true location of the joint point increases, the confidence level decreases. The mathematical representation of the confidence level is defined as:
$S_{j,k}^{*}(P) = \exp\left( -\dfrac{\lVert P - x_{j,k} \rVert_2^2}{2\sigma^2} \right)$ (1)
Here, $x_{j,k}$ is the position of the $j$th joint point (18 joint points in total) of the $k$th person, and $P$ is a position in the network's output feature map. $S_{j,k}^{*}(P)$ is the confidence of the $j$th joint point of the $k$th person at position $P$, and $\sigma$ is the standard deviation, set to 7.
For a given position P in the image, if there are two or more Gaussian distribution curves, such as Gaussian1 and Gaussian2 shown in Figure 2, the confidence score for position P is determined by the maximum value along the Max curve.
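A minimal NumPy sketch of this heatmap label construction follows, assuming the labels are generated directly at the resolution of the output feature map; the per-pixel maximum over people implements the Max curve of Figure 2.

```python
import numpy as np

def joint_heatmap(shape, keypoints, sigma=7.0):
    """Ground-truth confidence map of one joint type (Equation (1)).
    shape: (H, W) of the output feature map; keypoints: (x, y) positions of
    this joint for every annotated person. Overlapping Gaussians are merged
    with a per-pixel maximum rather than a sum, so peaks stay at 1."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    heatmap = np.zeros((H, W), dtype=np.float32)
    for (px, py) in keypoints:
        d2 = (xs - px) ** 2 + (ys - py) ** 2
        heatmap = np.maximum(heatmap, np.exp(-d2 / (2.0 * sigma ** 2)))
    return heatmap

# Example: two people whose right shoulders are close together.
hm = joint_heatmap((46, 46), [(10.0, 12.0), (14.0, 12.0)])
```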
PAF labels:
The PAF (Part Affinity Fields) labels are defined by placing unit vectors at position P on each limb of the human body in the image. The direction of the unit vector corresponds to the unit vector of the line connecting two keypoints on the limb. The visual representation of PAF labels is shown in Figure 3; the specific vector calculation and determination of the point P are described in Equations (2)–(4). When the point P is located on multiple limbs, the direction is illustrated in Figure 4.
Figure 3 provides an intuitive representation of the vector represented by position P. In this representation, if P is located on the limb connecting two keypoints, the direction of the unit vector at point P is indicated by the direction shown by v.
$L_{c,k}^{*}(P) = \begin{cases} v & \text{if } P \text{ is on limb } c \text{ of person } k \\ 0 & \text{otherwise} \end{cases}$ (2)
$v = \dfrac{x_{j_2,k} - x_{j_1,k}}{\lVert x_{j_2,k} - x_{j_1,k} \rVert_2}$ (3)
$0 \le v \cdot (P - x_{j_1,k}) \le l_{c,k} \quad \text{and} \quad \lvert v_{\perp} \cdot (P - x_{j_1,k}) \rvert \le \sigma_l$ (4)
Equation (2) states that if point P lies on limb c of the kth person, the vector at P is the unit vector v; otherwise, it is the zero vector. The direction of v is given by Equation (3): when P lies on the limb of the kth person connecting joint points $j_1$ and $j_2$, the vector is the unit vector pointing from joint $j_1$ to joint $j_2$. Whether point P lies on limb c is determined by Equation (4), where $\sigma_l$ is the limb width in the image (in pixels) and $l_{c,k}$ is the length of limb c of the kth person, i.e., the distance between joint points $j_1$ and $j_2$ in the image (in pixels). $v_{\perp}$ is the unit vector perpendicular to v. The criterion for P being on the limb is therefore that its projection along the limb direction lies between the two joint points and that its perpendicular distance from the limb axis does not exceed the limb width.
As shown in Figure 4, one arm is holding the other. The red vector is the unit vector where point P is located in the first arm, and the purple vector is the unit vector where point P is located in the grabbed arm, so point P is the superposition of the red and purple unit vectors.
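The following is a minimal NumPy sketch of this PAF label construction for one limb type. The default limb width sigma_l of 1 pixel is an assumption, and where the limbs of different people overlap, the unit vectors are averaged over the number of overlapping limbs (the text above describes the overlap as a superposition of the unit vectors).

```python
import numpy as np

def paf_label(shape, limbs, sigma_l=1.0):
    """Ground-truth PAF of one limb type (Equations (2)-(4)).
    limbs: list of (x_j1, x_j2) joint-pair positions, one pair per person,
    each an (x, y) coordinate. Returns a (2, H, W) vector field."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    field = np.zeros((2, H, W), dtype=np.float32)
    count = np.zeros((H, W), dtype=np.float32)
    for x_j1, x_j2 in limbs:
        x_j1, x_j2 = np.asarray(x_j1, float), np.asarray(x_j2, float)
        limb = x_j2 - x_j1
        l_ck = np.linalg.norm(limb) + 1e-8          # limb length l_{c,k}
        v = limb / l_ck                             # Equation (3): unit vector along the limb
        v_perp = np.array([-v[1], v[0]])            # unit vector perpendicular to v
        dx, dy = xs - x_j1[0], ys - x_j1[1]
        along = v[0] * dx + v[1] * dy               # projection onto the limb axis
        across = np.abs(v_perp[0] * dx + v_perp[1] * dy)
        on_limb = (along >= 0) & (along <= l_ck) & (across <= sigma_l)  # Equation (4)
        field[0][on_limb] += v[0]
        field[1][on_limb] += v[1]
        count[on_limb] += 1
    nonzero = count > 0
    field[:, nonzero] /= count[nonzero]             # average where limbs overlap (Figure 4)
    return field
```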
Loss function: The loss function of the network is the Euclidean distance between the predicted value and the ground truth of the joint point. The specific calculations and formulations can be found in Equations (5)–(7).
$f_S^t = \sum_{j=1}^{J} \sum_{P} W(P) \cdot \lVert S_j^t(P) - S_j^{*}(P) \rVert_2^2$ (5)
$f_L^t = \sum_{c=1}^{C} \sum_{P} W(P) \cdot \lVert L_c^t(P) - L_c^{*}(P) \rVert_2^2$ (6)
$f = \sum_{t=1}^{T} \left( f_S^t + f_L^t \right)$ (7)
$W(P) = \begin{cases} 0 & \text{if } P \text{ is in a crowded or occluded area} \\ 1 & \text{otherwise} \end{cases}$ (8)
Here, t denotes the stage that generates the prediction (t = 1 corresponds to the basic stage). In Equation (5), $S_j^t(P)$ is the predicted value at position P in the heatmap of joint point j generated at stage t, and $S_j^{*}(P)$ is the ground truth at the corresponding position. In Equation (6), $L_c^t(P)$ is the predicted value at position P in the PAF map of limb c generated at stage t, and $L_c^{*}(P)$ is the corresponding ground truth. $W(P)$, defined in Equation (8), is a binary function taking the value 0 or 1: when point P lies in a crowded or occluded area, $W(P)$ is set to 0, and in all other cases it is set to 1. This is done because it is often difficult to accurately annotate joint point positions in crowded or occluded areas; setting $W(P)$ to 0 reduces the impact of annotation errors and increases the accuracy of the network.
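A minimal sketch of this masked loss, assuming the mask W(P) is supplied as a (B, 1, H, W) tensor and the per-stage outputs are collected in a list of (heatmap, PAF) pairs as in the stage sketch above:

```python
import torch

def stage_loss(S_pred, S_gt, L_pred, L_gt, mask):
    """Loss of one stage (Equations (5) and (6)): per-pixel squared error on
    the joint heatmaps and PAF maps, weighted by the binary mask W(P) of
    Equation (8) that zeros out crowded or occluded regions."""
    f_S = torch.sum(mask * (S_pred - S_gt) ** 2)
    f_L = torch.sum(mask * (L_pred - L_gt) ** 2)
    return f_S + f_L

def total_loss(stage_outputs, S_gt, L_gt, mask):
    """Equation (7): intermediate supervision, summing the loss over all stages."""
    return sum(stage_loss(S, S_gt, L, L_gt, mask) for S, L in stage_outputs)
```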

3. Method

In this paper, the CMU-Pose network is adopted for multi-person 2D human pose estimation. The network consists of a backbone network (F), a basic layer, and optimization layers. The backbone network is responsible for generating the underlying features required by the rest of the network, so its structure has a direct impact on the network's results. In this study, the backbone network is changed from VGG19 to ResNet, and an attention mechanism is introduced into the backbone to strengthen the extraction of useful features and suppress the extraction of useless ones [28]. Additionally, we found that the CMU-Pose network performs poorly when extracting flexible joint points such as wrists and ankles. To address this issue, this paper proposes an optimization method based on mining hard examples of joint points. In summary, the CMU-Pose network is optimized in two respects: the first is a network structure based on the attention mechanism, and the second is an optimization method based on hard example mining of joint points. The following sections provide a more detailed introduction.

3.1. Network Structure Based on Attention Mechanism

The introduction of the attention mechanism in the backbone network is primarily divided into three parts: the first is the introduction of the ResNet basic block, the second is the introduction of the attention block, and the third is the introduction of the overall network architecture for estimation. Below is a detailed description of each part.
ResNet basic block:
ResNet is a deep neural network architecture proposed by He et al. [29], which aims to solve the problems of vanishing gradients and accuracy degradation as network depth increases. Figure 5 illustrates its basic convolution block. The convolution process is divided into two parts: the main path, which carries out the standard convolution operations, and the bypass structure, also known as the shortcut, which forwards the input to the output. The output of the block is the sum of the main-path output and the bypass output. During training, if the main path contributes to the network's accuracy, its output will not be zero; conversely, when the main path is detrimental to accuracy, its output tends toward zero and the block output is approximately equal to the input, thus maintaining the network's accuracy. Therefore, the basic convolution block of ResNet safeguards the accuracy of the network architecture. To reduce the number of parameters and the computational cost, the main path of the basic convolutional block first uses a 1 × 1 convolutional kernel to decrease the number of channels, then a 3 × 3 convolutional kernel for convolution, and finally a 1 × 1 convolutional kernel to increase the number of channels again.
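A minimal PyTorch sketch of this bottleneck block follows. The batch normalization after each convolution and the projection shortcut used when the input and output shapes differ are common ResNet conventions assumed here rather than details stated above.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """ResNet bottleneck block (Figure 5): 1x1 reduce -> 3x3 -> 1x1 expand,
    summed with a shortcut that forwards the input to the output."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Projection shortcut when the input and output shapes differ.
        self.shortcut = (nn.Identity() if stride == 1 and in_ch == out_ch else
                         nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                                       nn.BatchNorm2d(out_ch)))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.main(x) + self.shortcut(x))
```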
Attention block:
The utilization of attention models has become prevalent across a spectrum of deep learning applications, including natural language processing, image recognition, and speech recognition. Due to the primary focus of this paper on visual tasks, we will now discuss attention mechanisms in the context of computer vision. The visual attention mechanism refers to the brain’s signal-processing capability that allows humans to selectively attend to certain areas of an image while disregarding others. By quickly scanning the entire image, the visual attention mechanism directs focus to the relevant information and filters out irrelevant data, thereby improving the efficiency and accuracy of visual information processing. The literature on the implementation of attention mechanisms primarily encompasses references [30,31,32].
Figure 6 illustrates the attention mechanism module, known as the Squeeze-and-Excitation (SE) block, which was introduced by Hu et al. [33]. The SE block is intended to increase, at the channel level, the weights of channels that are beneficial to the outcome while decreasing the weights of channels that are detrimental to it. X is the input of the block, with size $C' \times H' \times W'$ ($C'$ is the number of input channels, $H'$ the input height, and $W'$ the input width). After a series of convolution operations $F_{tr}$, the output U is obtained, with size $C \times H \times W$. The output U is then used to determine the significance of each channel through the Squeeze and Excitation processes, thereby incorporating an attention mechanism at the channel level [34]; the specific steps are given in Equations (9) and (10). Finally, the output is obtained through the $F_{scale}(\cdot,\cdot)$ process, as shown in Equation (11).
$z_c = F_{sq}(u_c) = \dfrac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j)$ (9)
Equation (9) is the specific calculation formula for the Squeeze process. In this formula, H and W are the height and width of U, and $u_c(i, j)$ is the value at position $(i, j)$ on channel c. The Squeeze operation therefore averages the values over all spatial positions of each channel. This allows each channel to be represented by its mean value, which gives the network a global receptive field.
$s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma\left( W_2 \, \delta(W_1 z) \right)$ (10)
Equation (10) is the specific calculation formula of the Excitation process, where z is the input of the Excitation process, i.e., the output of the Squeeze process, and W denotes the weights that the network needs to learn. $g(z, W)$ is the computation function, implemented with fully connected layers in deep learning. $\sigma$ is the sigmoid function, which expresses the importance of each channel as a weight between 0 and 1. To reduce the computational complexity of the network, a single fully connected layer of size $C \times C$ is replaced by two fully connected layers with lower computational complexity, whose weights are $W_1$ (of size $\frac{C}{r} \times C$) and $W_2$ (of size $C \times \frac{C}{r}$), with the reduction ratio r set to 8. $\delta$ is the ReLU function, which ensures the non-linearity of the network.
$\tilde{X}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c$ (11)
Equation (11) is the final processing process, where s c is the weight of channel c after the extraction and activation processes, and u c is the input of channel c. The attention mechanism can be introduced at the channel level by multiplying the corresponding channel weight by the corresponding channel input. Figure 7 shows the final SE block structure, and the simplified SE block structure is shown at right.
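A minimal PyTorch sketch of the SE block described by Equations (9)-(11). Using nn.AdaptiveAvgPool2d for the Squeeze step and nn.Linear layers for the Excitation step are standard implementation choices assumed here, with the reduction ratio r = 8 as stated above.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block (Equations (9)-(11)) with reduction ratio r."""
    def __init__(self, channels, r=8):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)       # Equation (9): global average pooling
        self.excite = nn.Sequential(                 # Equation (10): W_1, ReLU, W_2, sigmoid
            nn.Linear(channels, channels // r), nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels), nn.Sigmoid(),
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)               # z_c: one scalar per channel
        s = self.excite(z).view(b, c, 1, 1)          # s_c: channel weight in (0, 1)
        return u * s                                 # Equation (11): rescale each channel
```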
Network structure of human pose estimation:
The ResNet network with the attention mechanism introduced is called SE-ResNet. The basic structure of the SE-ResNet block is shown in the left part of Figure 8, and the right part shows its simplified representation, where Res_1 is the name of the basic block. The first "64" is the number of channels in the first 1 × 1 convolutional layer, the second "64" is the number of channels in the 3 × 3 convolutional layer, and "256" is the number of channels in the final 1 × 1 convolutional layer. "1" is the stride of the first 1 × 1 convolutional layer, i.e., a stride of 1 is used.
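Building on the Bottleneck and SEBlock sketches above, the following illustrates how the SE-ResNet basic block of Figure 8 can be composed. Rescaling the main-path output with the SE block before adding the shortcut follows the standard SE-ResNet design and is an assumption of this sketch.

```python
import torch.nn as nn

class SEBottleneck(nn.Module):
    """SE-ResNet basic block (Figure 8): a bottleneck whose main-path output is
    rescaled by an SE block before being added to the shortcut. For example,
    Res_1(64, 64, 256, 1) corresponds to SEBottleneck(in_ch, 64, 256, stride=1)."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1, r=8):
        super().__init__()
        self.block = Bottleneck(in_ch, mid_ch, out_ch, stride)  # sketched earlier
        self.se = SEBlock(out_ch, r)                            # sketched earlier

    def forward(self, x):
        # Reuse the bottleneck's main path and shortcut, applying SE to the main path.
        main = self.se(self.block.main(x))
        return self.block.relu(main + self.block.shortcut(x))
```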
Figure 9 illustrates the specific structure of the backbone network. The blocks labeled "res" are basic blocks of the SE-ResNet network. Conv1 is a convolutional layer with a kernel size of 7 × 7, 64 channels, and a stride of 2. We adopt the Convolutional Pose Machine (CPM) layers from [12]. CPM1 is a convolutional layer with a kernel size of 3 × 3, 512 channels, and a stride of 1. CPM2 is a convolutional layer with a kernel size of 3 × 3, 128 channels, and a stride of 1. Dconv1 is a deconvolutional layer with a kernel size of 4 × 4, 256 channels, and a stride of 2. The purpose of this layer is to increase the resolution of its input by upsampling, doubling the original input size and enhancing the network's resolution [35].

3.2. Optimization Method Based on Mining Hard Examples of Joint Points

Multi-person 2D human pose estimation based on Part Affinity Fields (PAFs) first generates low-level features with a backbone network. These features then pass through the base stage (Stage 1) and five refinement stages (Stage t) to produce keypoint heatmaps and PAF maps for the human body joints [36]. The results show that the network performs poorly when estimating the more flexible joint points (such as ankles and wrists). Therefore, hard example mining of joint points is used to improve the network's effectiveness on these more challenging joints. Specifically, the loss functions of the base layer (Stage 1) and the first three optimization layers are kept unchanged and are computed over the heatmaps of all joint points. For the last two optimization layers, the loss is instead computed per joint point, and the final loss is the sum of the losses of the eight joint points with the largest loss values, which drives the network to optimize its predictions for the more challenging keypoints. The specific calculation is given in Equations (12)-(14).
$f_{s,j}^{t} = \sum_{p} W(p) \cdot \lVert S_j^t(p) - S_j^{*}(p) \rVert_2^2 \quad (s = 5, 6)$ (12)
As shown in Equation (12), (s = 5, 6) gives the range of values of the stage index s. $f_{s,j}^{t}$ is the loss computed over the heatmap S of keypoint j at optimization stage s. In this way, the loss corresponding to each joint point is obtained.
$J = \mathrm{desc}\left( f_{s,j_0}^{t}, f_{s,j_1}^{t}, \dots \right)$ (13)
Equation (13) takes the loss of each joint point, sorts the losses in descending order ("desc"), and stores the IDs of the corresponding joint points in J.
$f_{s}^{t} = \sum_{n=1}^{8} f_{s, J_n}^{t} \quad (s = 5, 6)$ (14)
Equation (14) represents the keypoint loss for the corresponding optimization stage t, which is the sum of the losses of the top N keypoints with the highest loss. It has been empirically determined that setting N = 8 yields the best performance. By following the aforementioned steps, the optimization method based on keypoint hard example mining can be implemented, thereby improving the network’s performance on more flexible and challenging keypoints estimation tasks.
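A minimal PyTorch sketch of this hard-example-mining loss for one of the last two refinement stages; aggregating the per-joint loss over the batch before selecting the hardest joints is an implementation choice assumed here.

```python
import torch

def okhm_stage_loss(S_pred, S_gt, mask, top_k=8):
    """Keypoint hard-example-mining loss (Equations (12)-(14)) for the last two
    refinement stages: compute a per-joint loss, then keep only the top_k joints
    with the largest loss so the network focuses on the hard keypoints."""
    # S_pred, S_gt: (B, J, H, W) heatmaps; mask: (B, 1, H, W) as in Equation (8).
    per_joint = torch.sum(mask * (S_pred - S_gt) ** 2, dim=(0, 2, 3))  # Equation (12)
    hardest, _ = torch.topk(per_joint, k=top_k)                        # Equation (13)
    return hardest.sum()                                               # Equation (14), N = 8
```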

3.3. Estimation Processes

The multi-person 2D human pose estimation based on PAFs [31] consists of two main processes:
  • Keypoint and PAF detection.
  • 2D Pose Estimation using PAFs.
Keypoint and PAF detection: Fixed-size 19-channel heatmaps for keypoint detection and 38-channel PAF maps are generated by the deep neural network. In the generated joint heatmaps, each channel is the heatmap of a certain body part over all people in the image (such as the neck or right shoulder; the 18 joint points are shown as blue dots in Figure 10), giving 18 part channels, while the 19th channel carries the background information. In the generated 38-channel PAF maps, each pair of channels represents the unit vector information (X direction and Y direction) of a particular limb of the human body (the 19 limbs are shown as black lines in Figure 10). See Figure 11 and Figure 12 for details.
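The joint-point candidates are later extracted from these heatmaps by non-maximum suppression, as described in the next paragraph. A minimal sketch of that peak extraction, in which the max-pooling window size and the confidence threshold are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def heatmap_peaks(heatmaps, threshold=0.1):
    """Extract joint-point candidates from the 18 part channels of the
    19-channel heatmap with a simple max-pooling non-maximum suppression.
    Returns, for each part, a list of (x, y, score) candidates."""
    # heatmaps: (19, H, W); the last channel is the background and is skipped.
    parts = heatmaps[:18].unsqueeze(0)
    pooled = F.max_pool2d(parts, kernel_size=3, stride=1, padding=1)
    is_peak = (parts == pooled) & (parts > threshold)
    peaks = []
    for j in range(18):
        ys, xs = torch.nonzero(is_peak[0, j], as_tuple=True)
        peaks.append([(int(x), int(y), float(parts[0, j, y, x]))
                      for y, x in zip(ys, xs)])
    return peaks
```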
Two-dimensional pose estimation using PAFs: The position of the joint points in the image can be determined using the Non-Maximum Suppression (NMS) method based on the joint point information obtained by deep learning. However, since there are multiple people in the image, it is not possible to classify them based on joint information alone. Therefore, it is necessary to use PAFs to infer the relationship between different joint points in order to classify the joint points that belong to the same person. This process is divided into three steps:
  • Determining the possibility of two keypoints belonging to the same limb.
  • Identifying the corresponding keypoints on the same limb and grouping them together.
  • Grouping all keypoints belonging to the same individual into a single class.
Determining the possibility of two joint points on the same limb: As shown in the left figure of Figure 13, the red and blue dots are all of the detected joint points j1 and j2. Because there are many people in the image, the joint point has multiple positions. As shown in the right figure of Figure 13, joint point j1 has three positions: m1, m2, and m3, and joint point j2 has three positions: n1, n2, and n3. Therefore, when determining the possibility of two joint points on the same limb, it is necessary to determine the possibility of connecting all positions of joint point j1 and joint point j2, as shown by the green dotted line in the right figure of Figure 13. The calculation of probability is shown in Figure 14, Equations (15) and (16).
Figure 14 shows the possibility of connecting two joint points d j 1 and d j 2 . P ( u ) is a point position on the line between joint point d j 1 and d j 2 . The calculation of P ( u ) is shown in Equation (16), and the value of u ranges between 0 and 1. L c ( P ( u ) ) is the vector prediction value at P ( u ) in the PAF map output by the network. The possibility of connecting two joint points is shown in Equation (15), which represents the integral of the dot product of the predicted value and the true value of each point between the two joint points.
$E = \int_{u=0}^{u=1} L_c\left( P(u) \right) \cdot \dfrac{d_{j_2} - d_{j_1}}{\lVert d_{j_2} - d_{j_1} \rVert} \, du$ (15)
$P(u) = (1 - u)\, d_{j_1} + u\, d_{j_2}$ (16)
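In practice, the integral of Equation (15) can be approximated by sampling a fixed number of points along the segment. A minimal NumPy sketch, in which the number of samples (10) and the nearest-pixel lookup into the two PAF channels are assumptions:

```python
import numpy as np

def connection_score(paf_x, paf_y, d_j1, d_j2, n_samples=10):
    """Approximate the line integral E of Equation (15) by sampling n_samples
    points P(u) (Equation (16)) on the segment between the candidate positions
    d_j1 and d_j2 and averaging the dot product of the predicted PAF vector
    with the unit vector of the segment."""
    d_j1, d_j2 = np.asarray(d_j1, float), np.asarray(d_j2, float)
    v = d_j2 - d_j1
    v = v / (np.linalg.norm(v) + 1e-8)
    score = 0.0
    for u in np.linspace(0.0, 1.0, n_samples):
        px, py = (1.0 - u) * d_j1 + u * d_j2        # Equation (16)
        x, y = int(round(px)), int(round(py))       # nearest pixel in the PAF map
        score += paf_x[y, x] * v[0] + paf_y[y, x] * v[1]
    return score / n_samples
```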
Identifying the corresponding keypoints on the same limb and grouping them together: The left part of Figure 15 shows the visualization of the final connections on the same limb, and the right part shows the process of finding two joint points on the same limb. The specific implementation is given in Equation (17), where $D_{j_1}$ is the set of all candidate positions of joint point $j_1$ and $D_{j_2}$ is the set of all candidate positions of joint point $j_2$. $E_{mn}$ is the connection possibility between positions of $j_1$ and $j_2$ (for example, between $m_1$ and $n_1$). $z_{j_1 j_2}^{mn}$ is a binary variable taking the value 0 or 1: it is 1 when that pair of joint point positions is selected as a connection and 0 otherwise. The problem of finding the two joint points on the same limb thus becomes a constrained optimization problem: under the constraint that each joint point position is used at most once, the assignment that maximizes the sum of connection possibilities is taken as the final result.
$\max_{Z_c} E_c = \max_{Z_c} \sum_{m \in D_{j_1}} \sum_{n \in D_{j_2}} E_{mn} \cdot z_{j_1 j_2}^{mn}$
$\text{s.t.} \quad \forall m \in D_{j_1}: \ \sum_{n \in D_{j_2}} z_{j_1 j_2}^{mn} \le 1, \qquad \forall n \in D_{j_2}: \ \sum_{m \in D_{j_1}} z_{j_1 j_2}^{mn} \le 1$ (17)
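The assignment problem of Equation (17) is commonly approximated greedily: candidate pairs are sorted by their connection score and accepted only if neither endpoint has already been used. A minimal sketch under that assumption, where score_fn can be the connection_score function sketched above applied to a pair of candidate positions:

```python
def match_limb(candidates_j1, candidates_j2, score_fn):
    """Greedy approximation to Equation (17): sort all candidate pairs by
    connection score and accept a pair only if neither endpoint has been used,
    so each joint point position is used at most once."""
    pairs = [(score_fn(p1, p2), m, n)
             for m, p1 in enumerate(candidates_j1)
             for n, p2 in enumerate(candidates_j2)]
    pairs.sort(key=lambda t: t[0], reverse=True)
    used_m, used_n, connections = set(), set(), []
    for score, m, n in pairs:
        if score > 0 and m not in used_m and n not in used_n:
            connections.append((m, n, score))
            used_m.add(m)
            used_n.add(n)
    return connections
```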
Grouping all keypoints belonging to the same individual into a single class: As depicted in Figure 16, the connection between the red point and the green point can be determined by the aforementioned steps, and the connection between the blue point and the green point can be obtained by repeating the first and second steps. Similarly, by repeating the first and second steps several times, the connection relationships between all joint points can be determined. Finally, all joint points of each person in the image are obtained and visualized, as shown in Figure 17.

4. Results and Discussion

4.1. Experimental Platform and Dataset

Datasets: The COCO dataset is one of the most popular publicly available large-scale labeled image datasets. It contains image annotations in 80 categories, with over 1.5 million object instances, and is used for multiple computer vision tasks such as object detection, instance segmentation, image captioning, keypoint detection, panoptic segmentation, dense pose, and stuff segmentation. The COCO keypoints dataset provides over 200,000 images and 250,000 person instances labeled with keypoints. It includes various hard examples such as varied human poses, overlapping people, occlusion, and complex poses.
Training: The presented network is implemented in Python with the PyTorch open-source framework. It is trained on two GTX 1080 graphics cards with 8 GB of graphics memory, under the Ubuntu 16.04 LTS operating system. To evaluate the performance of our networks, we train them on the COCO train2017 dataset and evaluate them on the COCO val2017 dataset. The training process for the network is conducted as follows. The images are first resized to 368 × 368, and the model parameters are initialized. We set the base learning rate [37] to 1 and the batch size to 23. Gaussian initialization is used for the convolutional and deconvolutional layers. Hyperparameters follow VGG19 [26] pretrained on ImageNet [38]. All parameters are updated with the SGD [39] optimization method. Finally, after training for approximately 200 epochs, the set of parameters that yields satisfactory performance is selected as the optimal network model.
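The following is a minimal sketch of this training loop under stated assumptions: the dataset is assumed to yield (image, heatmap labels, PAF labels, mask) tuples at 368 × 368 resolution, the model is assumed to return the per-stage (heatmap, PAF) list as in the stage sketch above, total_loss is the masked loss sketched earlier, and the SGD momentum value is an assumption not given in the text.

```python
import torch
from torch.utils.data import DataLoader

def train(model, dataset, epochs=200, base_lr=1.0, batch_size=23, device="cuda"):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=4)
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
    model.to(device).train()
    for epoch in range(epochs):
        for images, S_gt, L_gt, mask in loader:     # 368x368 inputs and their labels
            images, S_gt = images.to(device), S_gt.to(device)
            L_gt, mask = L_gt.to(device), mask.to(device)
            outputs = model(images)                 # list of (heatmaps, PAFs) per stage
            loss = total_loss(outputs, S_gt, L_gt, mask)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```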

4.2. Analysis of Experimental Results of COCO Dataset

In this paper, the optimization of multi-person 2D pose estimation based on PAFs (the CMU-Pose method) includes two aspects. The first is the use of a backbone network that introduces the attention mechanism, defined as the SE-ResNet-CMU-Pose method; the second is the optimization method based on hard example mining of joint points, defined as the SE-ResNet-OKHM-CMU-Pose method. To verify the effectiveness of each part, comparative experiments are conducted on the two parts, ensuring that the hyperparameters and training procedure are consistent with those used to train the CMU-Pose network.
We use Average Precision (AP) under the Object Keypoint Similarity (OKS) measure as the evaluation metric. $AP_{50}$ is the AP of the network at OKS = 0.5, $AP_{75}$ is the AP at OKS = 0.75, and AP is the average over OKS thresholds from 0.5 to 0.95 in steps of 0.05. For $AP_M$, M stands for medium: the AP for human instances whose area in the image is between 32 × 32 and 96 × 96 pixels. For $AP_L$, L stands for large: the AP for human instances whose area in the image is greater than 96 × 96 pixels. As shown in Table 1, we compared the results of our SE-ResNet-CMU-Pose and SE-ResNet-OKHM-CMU-Pose with other popular pose estimation methods. Our network achieves better accuracy than the other methods; in particular, SE-ResNet-OKHM-CMU-Pose achieves the highest AP score of 60.0. Benefiting from hard example mining, SE-ResNet-OKHM-CMU-Pose also achieves comparable or even higher accuracy than the ShuffleNetV2 1× network. $AP_{50}$ decreased slightly while $AP_{75}$ increased, indicating that the network failed to detect some joint points but improved the accuracy of the detected joint point positions. The improved $AP_M$ and the decreased $AP_L$ indicate that the network performs well in estimating medium-sized human poses in the images, while there is still room for improvement in estimating larger-scale human poses.
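The OKS used above follows the standard COCO definition. A minimal sketch, assuming the COCO per-keypoint constants kappa and the annotated segment area as the scale term:

```python
import numpy as np

def oks(pred, gt, visibility, area, kappa):
    """Object Keypoint Similarity between a predicted and a ground-truth pose.
    pred, gt: (K, 2) keypoint coordinates; visibility: (K,) flags (>0 if labeled);
    area: object segment area; kappa: (K,) per-keypoint constants from COCO."""
    d2 = np.sum((pred - gt) ** 2, axis=1)          # squared distances per keypoint
    labeled = visibility > 0
    if not labeled.any():
        return 0.0
    e = d2[labeled] / (2.0 * area * kappa[labeled] ** 2)
    return float(np.mean(np.exp(-e)))
```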
As indicated in Table 1, compared to HRNet-W16, SE-ResNet-OKHM-CMU-Pose improves AP by 4.0 points with only 50% of the GFLOPs. Compared to ShuffleNetV2, our SE-ResNet-OKHM-CMU-Pose achieves a close AP while requiring fewer GFLOPs. The complexity of our network is much lower than that of the compared methods; compared to Hourglass and CMU-Pose, our network achieves a comparable AP score with far lower complexity.
In general, after introducing the attention mechanism into the network, the accuracy of predicting joint points is improved, but the performance on hard joint points remains poor. To address this limitation, we further introduce the optimization method based on mining hard examples of joint points. The overall AP of the network improves by 1.6, and both $AP_{50}$ and $AP_{75}$ increase, which shows that the network improves both the detection and the localization accuracy of the relevant joint points. It also enhances the prediction performance for large human instances, i.e., cases where the person occupies a significant portion of the image. A visual illustration of the performance improvement is shown in Figure 18.
Figure 18 shows the results of multi-person 2D human pose estimation. The first row shows the recognition results of the CMU-Pose network, and the second row shows the recognition results of the SE-ResNet-OKHM-CMU-Pose network. In the first images of the two rows, CMU-Pose detected only 17 joint points for the two individuals on the right, while SE-ResNet-OKHM-CMU-Pose recognized 18; the joint points in the red circles were not recognized by CMU-Pose. In the second images of the two rows, only SE-ResNet-OKHM-CMU-Pose recognized all 18 joints of the third person. In the third images of the two rows, the left ankle joint of the first person was correctly recognized only by SE-ResNet-OKHM-CMU-Pose. These results demonstrate that SE-ResNet-OKHM-CMU-Pose is more accurate in predicting the more difficult and flexible joint points. Our method can detect multiple overlapping and complex human joint points while maintaining the accuracy of keypoint detection. By mining hard examples, we effectively handle the difficult cases in the COCO dataset. These results validate the effectiveness of our method.

5. Conclusions

This paper introduces a method for multi-person 2D human pose estimation based on PAFs, along with its improved algorithms. First, we reviewed the two implementation paradigms (top-down and bottom-up) for multi-person 2D human pose estimation and briefly analyzed their advantages and disadvantages. Second, the design of the SE-ResNet-OKHM-CMU-Pose network model and the specific implementation process were explained in detail. Finally, improvements were made to address the problems found in the network structure and in the joint point extraction process: we introduced an attention mechanism into the network and proposed an optimization method based on mining hard examples of joint points. The COCO dataset was used to validate our network. Through a comparative analysis of the results, our SE-ResNet-OKHM-CMU-Pose achieved an AP score of 60.0 and outperformed competing methods on the COCO test dataset. We conclude that our method effectively improves the accuracy of pose estimation, in particular the accuracy of predicting hard and flexible joint points. The method performs well in extracting medium-sized human bodies from images, but there is still room for improvement in effectively capturing larger human bodies. In future work, we will focus on using video inter-frame information to further improve the accuracy of the estimation results.

Author Contributions

Conceptualization, L.Z.; Methodology, W.H.; Software, C.W.; Validation, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant Nos. 61973029, 62273034, and 62273315).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Fan, Z.; Zhao, X.; Lin, T.; Su, H. Attention-based multiview re-observation fusion network for skeletal action recognition. IEEE Trans. Multimed. 2018, 21, 363–374. [Google Scholar] [CrossRef]
  2. Ouyang, W.; Chu, X.; Wang, X. Multi-source deep learning for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2329–2336. [Google Scholar]
  3. Newell, A.; Huang, Z.; Deng, J. Associative embedding: End-to-end learning for joint detection and grouping. Adv. Neural Inf. Process. Syst. 2017, 30, 2274–2284. [Google Scholar]
  4. Tompson, J.; Goroshin, R.; Jain, A.; LeCun, Y.; Bregler, C. Efficient object localization using convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 648–656. [Google Scholar]
  5. Hua, G.; Li, L.; Liu, S. Multipath affinage stacked—Hourglass networks for human pose estimation. Front. Comput. Sci. 2020, 14, 1–12. [Google Scholar] [CrossRef]
  6. Chu, X.; Yang, W.; Ouyang, W.; Ma, C.; Yuille, A.L.; Wang, X. Multi-context attention for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1831–1840. [Google Scholar]
  7. Chen, Y.; Tian, Y.; He, M. Monocular human pose estimation: A survey of deep learning-based methods. Comput. Vis. Image Underst. 2020, 192, 102897. [Google Scholar] [CrossRef]
  8. Belagiannis, V.; Zisserman, A. Recurrent human pose estimation. In Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, 30 May–3 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 468–475. [Google Scholar]
  9. Bulat, A.; Tzimiropoulos, G. Human pose estimation via convolutional part heatmap regression. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VII 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 717–732. [Google Scholar]
  10. Pfister, T.; Charles, J.; Zisserman, A. Flowing convnets for human pose estimation in videos. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1913–1921. [Google Scholar]
  11. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VIII 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 483–499. [Google Scholar]
  12. Wei, S.E.; Ramakrishna, V.; Kanade, T.; Sheikh, Y. Convolutional pose machines. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4724–4732. [Google Scholar]
  13. Artacho, B.; Savakis, A. Unipose: Unified human pose estimation in single images and videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7035–7044. [Google Scholar]
  14. Groos, D.; Ramampiaro, H.; Ihlen, E.A. EfficientPose: Scalable single-person pose estimation. Appl. Intell. 2021, 51, 2518–2533. [Google Scholar] [CrossRef]
  15. Khirodkar, R.; Chari, V.; Agrawal, A.; Tyagi, A. Multi-instance pose networks: Rethinking top-down pose estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 3122–3131. [Google Scholar]
  16. Artacho, B.; Savakis, A. Omnipose: A multi-scale framework for multi-person pose estimation. arXiv 2021, arXiv:2103.10180. [Google Scholar]
  17. Cai, Y.; Wang, Z.; Luo, Z.; Yin, B.; Du, A.; Wang, H.; Zhang, X.; Zhou, X.; Zhou, E.; Sun, J. Learning delicate local representations for multi-person pose estimation. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part III 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 455–472. [Google Scholar]
  18. Cheng, B.; Xiao, B.; Wang, J.; Shi, H.; Huang, T.S.; Zhang, L. Higherhrnet: Scale-aware representation learning for bottom-up human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5386–5395. [Google Scholar]
  19. Kreiss, S.; Bertoni, L.; Alahi, A. Pifpaf: Composite fields for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11977–11986. [Google Scholar]
  20. Neff, C.; Sheth, A.; Furgurson, S.; Middleton, J.; Tabkhi, H. EfficientHRNet: Efficient and scalable high-resolution networks for real-time multi-person 2D human pose estimation. J. Real Time Image Process. 2021, 18, 1037–1049. [Google Scholar] [CrossRef]
  21. Li, H.; Wen, S.; Shi, K. A simple and effective multi-person pose estimation model for low power embedded system. Microprocess. Microsyst. 2023, 96, 104739. [Google Scholar] [CrossRef]
  22. Kushwaha, M.; Choudhary, J.; Singh, D.P. Enhancement of human 3D pose estimation using a novel concept of depth prediction with pose alignment from a single 2D image. Comput. Graph. (Pergamon) 2022, 107, 172–185. [Google Scholar] [CrossRef]
  23. Cao, Z.; Simon, T.; Wei, S.E.; Sheikh, Y. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299. [Google Scholar]
  24. Silva, L.J.S.; da Silva, D.L.S.; Raposo, A.B.; Velho, L.; Lopes, H.C.V. Tensorpose: Real-time pose estimation for interactive applications. Comput. Graph. (Pergamon) 2019, 85, 1–14. [Google Scholar] [CrossRef]
  25. Su, K.; Yu, D.; Xu, Z.; Geng, X.; Wang, C. Multi-person pose estimation with enhanced channel-wise and spatial information. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5674–5682. [Google Scholar]
  26. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015—Conference Track Proceedings, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  27. Zhang, F.; Zhu, X.; Dai, H.; Ye, M.; Zhu, C. Distribution-aware coordinate representation for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7093–7102. [Google Scholar]
  28. Li, J.; Liu, X.; Zhang, M.; Wang, D. Spatio-temporal deformable 3d convnets with attention for action recognition. Pattern Recognit. 2020, 98, 107037. [Google Scholar] [CrossRef]
  29. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  30. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164. [Google Scholar]
  31. Bagherinezhad, H.; Horton, M.; Rastegari, M.; Farhadi, A. Label refinery: Improving imagenet classification through label progression. arXiv 2018, arXiv:1805.02641. [Google Scholar]
  32. Gu, Z.; Su, X.; Liu, Y.; Zhang, Q. Local stereo matching with adaptive support-weight, rank transform and disparity calibration. Pattern Recognit. Lett. 2008, 29, 1230–1235. [Google Scholar] [CrossRef]
  33. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  34. Wang, X.; Tong, J.; Wang, R. Attention refined network for human pose estimation. Neural Process. Lett. 2021, 53, 2853–2872. [Google Scholar] [CrossRef]
  35. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar]
  36. Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7103–7112. [Google Scholar]
  37. Smith, L.N. Cyclical learning rates for training neural networks. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 464–472. [Google Scholar]
  38. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  39. Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; He, K. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv 2017, arXiv:1706.02677. [Google Scholar]
  40. Papandreou, G.; Zhu, T.; Kanazawa, N.; Toshev, A.; Tompson, J.; Bregler, C.; Murphy, K. Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4903–4911. [Google Scholar]
  41. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  42. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  43. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  44. Yu, C.; Xiao, B.; Gao, C.; Yuan, L.; Zhang, L.; Sang, N.; Wang, J. Lite-hrnet: A lightweight high-resolution network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10440–10450. [Google Scholar]
Figure 1. The structure of the CMU-Pose network.
Figure 2. Multiple Gaussian distributions at the same location.
Figure 3. PAF on a single limb. P is the point at which the unit vector is calculated; v and $v_{\perp}$ are the unit vectors along and perpendicular to the arm direction; $x_{j_2,k}$ and $x_{j_1,k}$ are two joint points.
Figure 4. PAF on multiple overlapped limbs. P is the point at which the unit vector is calculated; v and $v_{\perp}$ are the unit vectors along and perpendicular to the arm direction; $x_{j_2,k}$ and $x_{j_1,k}$ are two joint points.
Figure 5. ResNet basic block. The numbers in each convolutional module represent the size of the convolutional kernel and the number of channels, respectively.
Figure 6. SE block structure. X is the input of the network; $F_{tr}$ denotes the convolution process, $F_{scale}$ the scale process, and $F_{ex}$ the Excitation process.
Figure 7. Overall structure of the SE block. The number to the right of each block represents its size.
Figure 8. SE-ResNet basic block and its simplified representation.
Figure 9. Structure of the SE-ResNet-OKHM-CMU-Pose backbone. The res blocks in the figure represent SE-ResNet blocks. Conv and Dconv represent convolution and deconvolution layers, respectively. The numbers within each Res module represent the channels of the first 1 × 1 convolutional layer, the channels of the 3 × 3 convolutional layer, the channels of the last 1 × 1 convolutional layer, and the stride of the first 1 × 1 convolutional layer within the SE-ResNet block, respectively. The numbers in each convolutional module represent the size of the convolutional kernel, the number of channels, and the stride, respectively.
Figure 10. Joint points of the human body.
Figure 11. Joint points on the same limb of different people; the colored areas in the figure represent the same joint points.
Figure 12. PAF map with unit vectors; the arrows represent the directions of the unit vectors.
Figure 13. Possible connections of joint points of different people.
Figure 14. Possibility of connecting two joint points.
Figure 15. Obtaining the connection relationship between two joint points.
Figure 16. Obtaining joint connections.
Figure 17. Connection relationships of the final 18 joint points.
Figure 18. Results of CMU-Pose (a–c) and SE-ResNet-OKHM-CMU-Pose (Ours) (d–f). The joint points in the red circles were not recognized by CMU-Pose.
Table 1. Comparison with bottom-up methods on the COCO val2017 set. Our SE-ResNet-OKHM-CMU-Pose model outperforms CMU-Pose [23] by 1.6 AP. As shown in the table, our framework achieves competitive or better performance than previous strong baselines (e.g., Mask R-CNN and CMU-Pose).
Method | GFLOPs | AP | AP50 | AP75 | APM | APL
DL-61 [40] | - | 53.3 | 75.1 | 48.5 | 55.5 | 54.8
CMU-Pose [23] | 0.24 | 58.4 | 81.5 | 62.6 | 54.4 | 65.1
Hourglass [11] | 14.30 | 56.6 | 81.8 | 61.8 | 67.0 | -
SSD [41] + CPM [12] | - | 52.7 | 71.1 | 57.2 | 47.0 | 64.2
Mask R-CNN [42] | - | 57.2 | 83.5 | 60.3 | 69.4 | 57.9
ShuffleNetV2 1× [43] | 1.28 | 59.9 | 85.4 | 66.3 | 56.6 | 66.2
HRNet-W16 [44] | 0.54 | 56.0 | 83.8 | 63.0 | 52.4 | 62.6
SE-ResNet-CMU-Pose (Ours) | - | 58.6 | 80.7 | 63.0 | 57.9 | 60.6
SE-ResNet-OKHM-CMU-Pose (Ours) | 0.118 | 60.0 | 81.7 | 65.2 | 59.3 | 62.4
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
