Lightweight 2D Human Pose Estimation Based on Joint Channel Coordinate Attention Mechanism

1 School of Computer and Communication Engineering, Zhengzhou University of Light Industry, Zhengzhou 450002, China
2 Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(1), 143; https://doi.org/10.3390/electronics13010143
Submission received: 23 October 2023 / Revised: 6 December 2023 / Accepted: 26 December 2023 / Published: 28 December 2023

Abstract:
Traditional human pose estimation methods typically rely on complex models and algorithms. Lite-HRNet achieves excellent performance while reducing model complexity; however, it extracts features at a relatively single scale, which can lower keypoint localization accuracy in crowded and complex scenes. To address this issue, we propose a lightweight human pose estimation model based on a joint channel coordinate attention mechanism. The model provides a powerful information interaction channel, enabling features of different resolutions to interact more effectively. This interaction addresses human pose estimation in complex scenes and improves the robustness and accuracy of the pose estimation model. The joint channel coordinate attention mechanism enables the model to retain key information more effectively, thereby enhancing keypoint localization accuracy. We also redesign the lightweight basic module, using the shuffle module and the joint channel coordinate attention mechanism to replace the spatial weight calculation module of the original Lite-HRNet. This new module not only improves the network's computation speed and reduces the number of parameters of the entire model, but also maintains accuracy, thereby achieving a balance between performance and efficiency. We compare this model with current mainstream methods on the COCO and MPII datasets. The experimental results show that the model effectively reduces the number of parameters and the computational complexity while maintaining high accuracy.

1. Introduction

Human pose estimation is a method of inferring human body postures from pose-related information in images or videos, such as joint angles and the skeleton of the human body. As an important research topic in the field of computer vision, it is widely used in behavior recognition, motion tracking, clinical medicine, human–computer interaction, and other fields [1,2,3,4,5,6]. Nevertheless, in complex application scenarios, the captured human posture can be affected by several factors, such as fluctuations in lighting, occlusions, variations in pose, and unstable backgrounds, which may compromise the accuracy and resilience of pose estimation. For example, in outdoor sporting events, athletes may be captured by cameras while moving at high speed, with backgrounds containing various obstacles and other athletes; this increases the difficulty of accurately estimating their poses. In the medical field, human pose estimation can be employed to monitor a patient's physical condition, yet patients may exhibit unstable movements or have parts of their bodies obscured due to illness, posing challenges for accurate pose estimation. Consequently, human pose estimation in complex scenarios remains a formidable challenge, and enhancing the performance of pose estimation algorithms in complex environments, so that they better cater to our daily needs, remains a crucial research focus within the field of computer vision.
At present, research on human pose estimation methods is primarily centered on deepening networks, expanding feature map resolution, and designing networks of different resolutions to accomplish multi-scale feature fusion and extraction. Such networks require support from high-performance computing devices, and it is difficult to apply them in real scenarios owing to their numerous parameters, long training times, and difficulty of deployment on low-performance computing devices. This huge computing requirement limits the adoption of deep learning models in practical applications. Additionally, researchers in this domain encounter substantial computational and resource limitations, which further impede the widespread integration of deep learning models into everyday life. Therefore, in current human pose estimation tasks, there is an urgent need to further reduce the computational complexity and the number of parameters of the model while ensuring that keypoint detection accuracy is not greatly affected. There are two main design ideas for existing lightweight networks. One is to take modular depth-wise separable convolutions from efficient networks such as MobileNet and ShuffleNet [7,8] and combine them with high-resolution models to create the network backbone (a minimal illustration follows). The second is to utilize efficient convolution modules to replace conventional 3 × 3 and 1 × 1 convolutions for feature extraction so that the model achieves a lightweight effect.
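To make the first idea concrete, the sketch below (our own PyTorch illustration, not code from any of the cited networks) contrasts a standard 3 × 3 convolution with its depthwise separable factorization and prints the parameter counts:

```python
import torch.nn as nn

# A depthwise separable convolution: a 3x3 convolution applied per channel
# (groups=in_ch) followed by a 1x1 point-wise convolution that mixes channels.
def depthwise_separable(in_ch: int, out_ch: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
    )

# Parameter comparison against a standard 3x3 convolution.
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)  # 64*128*9 = 73,728
separable = depthwise_separable(64, 128)                             # 64*9 + 64*128 = 8,768
print(sum(p.numel() for p in standard.parameters()),
      sum(p.numel() for p in separable.parameters()))
```

For 64 input and 128 output channels, the factorization cuts the parameter count from 73,728 to 8,768, which is the kind of saving these lightweight backbones exploit.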
Currently, the demand for efficient human pose estimation models is increasing, and there is an urgent need for models to be lightweight while ensuring a certain degree of accuracy. Therefore, when designing pose estimation models, it is necessary not only to improve detection accuracy but also to consider making the model lightweight. To this end, we propose an innovative lightweight human pose estimation model that aims to reduce the computational complexity and the number of model parameters while maintaining pose detection accuracy. In this study, the joint channel coordinate attention mechanism is incorporated into the human pose estimation task, employing Lite-HRNet [9] as the backbone network. This approach ensures a low number of parameters while maintaining high channel resolution and high spatial resolution. Additionally, we introduce a non-linear model into the attention mechanism to bring the fitted output closer to the actual output. Extensive experiments on the COCO and MPII datasets reveal the effectiveness of our model. This work provides strong support for the application of deep learning-based human pose estimation models in complex real-life scenes. The following is a summary of our contributions:
  • We propose a lightweight human pose estimation model with a joint channel coordinate attention mechanism, enhancing the model’s sensitivity to pose direction and position perception while simultaneously reducing model complexity.
  • We present a joint channel coordinate attention mechanism that efficiently focuses on the positional information of posture, improving the precision of human posture estimation in challenging circumstances.
  • We propose a novel shuffle module that efficiently reduces the network’s computational cost while preserving model accuracy.
The rest of this paper is divided into the following sections. In Section 2, we examine relevant research in the field of lightweight models and human pose estimation. We present the overall design of our proposed model as well as the fundamental ideas behind each component in Section 3. We introduce the dataset, assessment measures, and experimental findings in Section 4. We review our findings and discuss their potential value in Section 5.

2. Related Work

Human pose estimation technology falls into two primary categories: deep learning-based methods and conventional computer vision methods. Traditional computer vision methods utilize image processing and machine learning techniques to extract human pose features. Deep learning-based algorithms primarily use convolutional networks to process input images and produce human pose estimates. In addition, there are methods that combine deep learning and traditional computer vision approaches. For example, Sun et al. [10] proposed a high-resolution network which connects subnetworks of different resolutions in parallel to maintain high-resolution feature maps throughout the entire network model, effectively enhancing the accuracy of predicting human keypoints. Cheng et al. [11] presented a high-resolution network based on the Hourglass [12] structure, utilizing larger receptive fields and more feature maps to enhance accuracy. Additionally, Papaioannidis et al. [13] proposed a novel neural module to enhance the accuracy of existing lightweight 2D human pose estimation methods. Wang et al. [14] proposed a part-relation-aware human parser (PRHP), which accurately describes the three human part relationships of decomposition, composition, and dependence through three different relationship networks. By imposing parameters in the relationship networks, expressive relationship information can be captured to satisfy the specific geometric characteristics of different relationships. Fang et al. [15] proposed a unified real-time multi-person whole-body pose estimation and tracking framework, which accurately locates whole-body keypoints while tracking human bodies in the presence of inaccurate bounding boxes and redundant detections. Yang et al. [16] introduced a method to unify a person's overall keypoints and local keypoints into the same box representation and optimize them in an end-to-end manner through a consistent regression loss. Zhou et al. [17] proposed two new algorithms based on projected gradient descent and unbalanced optimal transport to solve matching problems in a differentiable manner.
As the depth and computational complexity of neural networks surge, lightweight models [18] are increasingly important in the field of deep learning. The MobileNet [19] series employs linear bottlenecks and inverted residuals to enhance network representation capability, and a network architecture search algorithm was developed to improve accuracy and reduce inference latency. The ShuffleNet models introduce a channel shuffling method to reduce the computational load: for a given computational budget, channel shuffling enhances inter-group connections and preserves more channel information. ShuffleNetV2 [20] employs a channel transformation that conducts convolution operations on only half of the channels following channel separation, enhancing cross-channel information interaction; this model has provided significant inspiration for research on lightweight networks. By modifying ShuffleNetV2, ThunderNet [21] created a backbone network called SNet, with a region proposal network adopted as the detection component. MixNet [22] explores sets of convolution kernels of different sizes, laying the foundation for the basic idea of split convolution. ConvNeXt [23] combines large convolution kernels with inverted bottleneck layers, increasing accuracy while reducing computational complexity. Accordingly, lightweight models have also been proposed in the field of pose estimation. Woo et al. [24] presented ConvNeXt V2, which extends a fully convolutional masked autoencoder framework based on the ConvNeXt architecture to improve the performance of pure ConvNeXt on various recognition benchmarks. Lv et al. [25] introduced a lightweight end-to-end trainable single-stage framework for pose estimation, obtaining a lightweight model with high accuracy. Zhang et al. [26] introduced contextual information modeling to enhance information exchange among high-resolution modules, considering both accuracy and speed; this approach maximizes the utilization of channel and spatial representation capabilities.
Attention mechanisms [27,28,29,30] dynamically learn and adjust the weights of input features in deep learning models. In the task of human pose estimation, an attention mechanism pays more attention to the keypoints and reduces the influence of irrelevant points on the detection results. As a result, lightweight attention-based models find extensive application in human pose estimation, effectively reducing model complexity while maintaining performance. Multi-scale feature fusion has become a significant problem in the design of convolutional neural networks; however, many current approaches simply stack corresponding layers without taking into account the semantic distinctions between layers, which can result in inadequate feature fusion. Chen et al. [31] proposed a multi-attention network, including a dual-path dual-attention module and a query-based cross-modal Transformer; the dual-attention module multiplies channel-gated attention and spatial attention in a cascaded manner. Yi et al. [32] proposed a Multi-Kernel Temporal Block (MKTB) and an attention module called the Global Refinement Block (GRB); this architecture can effectively explore spatiotemporal features at a tolerable computational cost. For scenarios with low-texture indoor environments or limited outdoor training data, Wang et al. [33] introduced an innovative transformer-based hierarchical extraction and matching method, enabling synchronized feature extraction and feature matching. In each stage of the hierarchical structure, cross-attention modules and self-attention modules alternate, providing an optimal combination path to enhance multi-scale features; this matching-aware encoder alleviates the burden on the decoder, making the model more efficient. To address the high complexity and suboptimal accuracy of existing human pose estimation methods, Wang et al. [34] introduced a lightweight end-to-end human pose estimation network built upon the YOLO-Pose architecture [35]. This model can obtain feature information that is cross-channel, orientation-aware, and position-sensitive, enhancing the network's robustness in densely occluded scenarios.
In complex scenes, the accuracy of human pose prediction and the sensitivity of direction and position perception are often low, while high-complexity network models demand a large amount of computation. Existing human pose estimation methods often cannot balance accuracy against computational cost. Therefore, we integrate the advantages of existing methods and introduce a novel human pose estimation method. Our approach employs Lite-HRNet as the backbone network, incorporates a joint channel coordinate attention module to address the challenge of human pose detection in intricate scenes, and utilizes lightweight modules to tackle the issue of computational cost. This model can facilitate the application of human pose estimation on mobile devices or embedded intelligent terminals.

3. Method

We will introduce our lightweight human pose estimation model based on joint channel coordinate attention in this section. The basic architecture of our approach is presented in Figure 1. We will provide a comprehensive overview of our proposed model from various aspects.
We take HRNet as our backbone network, which employs a large number of residual blocks for feature extraction. The entire network mainly consists of bottleneck and basic modules. The bottleneck module is one of the core components of HRNet. Each bottleneck module has multiple branches, and each branch processes feature maps of different resolutions. This multi-resolution scheme enhances the network’s feature fusion capability, thereby improving its performance. The basic module takes the feature map from the previous layer as input and generates an output feature map with the same resolution. Furthermore, it incorporates a shortcut connection to directly add input feature maps to the output feature maps for improved gradient propagation and accelerated training. Based on the high-resolution network, we employ channel coordinate fusion and lightweight techniques to construct the network model in order to improve accuracy and reduce model parameters and computational complexity.
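For concreteness, a minimal sketch of the basic module described above is given below. It assumes the standard convolution/BatchNorm/ReLU ordering; this is our reading of HRNet's basic module, not the authors' released code.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions plus a shortcut that adds the input to the
    output, keeping the resolution unchanged (as in HRNet's basic module)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # shortcut connection aids gradient flow
```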
The coordinate attention mechanism [36] weights the spatial positions of the input data. It adds positional encoding to feature maps, enhancing the model's perception of various areas of an image by emphasizing the importance of different locations. In contrast to channel attention, which compresses the input into a single feature vector through two-dimensional global pooling, the coordinate attention mechanism decomposes spatial attention into two one-dimensional feature encoding processes that aggregate features along the two spatial directions. Following their independent encoding, the resulting feature maps are combined to create two direction-aware and position-sensitive feature maps that complement the input feature maps and enhance the representation of the target of interest. We encode each channel using two pooling kernels that span the spatial extent along the horizontal and vertical coordinates, respectively. Specifically, for a given feature map X, we compute the averages along each row and column of every channel using the following two formulas.
$z_c^h(h) = \frac{1}{W} \sum_{0 \le i < W} x_c(h, i)$  (1)

$z_c^w(w) = \frac{1}{H} \sum_{0 \le j < H} x_c(j, w)$  (2)
Each channel is encoded along the horizontal and vertical coordinates by the two transformations above, producing two feature maps with direction-aware features. The coordinate attention mechanism is more fine-grained than the channel attention mechanism: instead of compressing each channel into a single value, it produces h + w directional responses per channel, preserving the positional information within each channel of the feature maps. These two equations allow the attention mechanism to preserve both the long-range dependencies of features and precise positional information, assisting the network in locating critical features more accurately.
Next, to generate aggregated feature maps using the positional information, we first concatenate them and then pass them through the shared 1 × 1 convolutional transformation function F1, resulting in:
$f = \delta\left(F_1\left([z^h, z^w]\right)\right)$  (3)
Following feature aggregation, a new feature map is obtained that integrates coordinate attention information; it can be passed to subsequent network layers for further processing. To ensure the values of the extracted f^w and f^h maps fall within the [0, 1] range, they are normalized as shown in the following equations:
$g^w = \sigma\left(F_w(f^w)\right)$  (4)

$g^h = \sigma\left(F_h(f^h)\right)$  (5)
where σ is the sigmoid function.
Ultimately, the output of our coordinate attention block can be expressed as
$Z_{\mathrm{coord},c}(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$  (6)
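The full computation in Equations (1)-(6) can be sketched in PyTorch as follows. This follows the design of Hou et al. [36] in spirit; the reduction ratio and layer names are our own illustrative choices.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Pools along H and W separately (Eqs. 1-2), mixes the two encodings
    with a shared 1x1 convolution (Eq. 3), splits them back, gates each with
    a sigmoid (Eqs. 4-5), and reweights the input (Eq. 6)."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)                        # delta in Eq. (3)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)   # F_h
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)   # F_w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        z_h = x.mean(dim=3, keepdim=True)                       # Eq. (1): N x C x H x 1
        z_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # Eq. (2): N x C x W x 1
        f = self.act(self.bn(self.conv1(torch.cat([z_h, z_w], dim=2))))  # Eq. (3)
        f_h, f_w = torch.split(f, [h, w], dim=2)
        g_h = torch.sigmoid(self.conv_h(f_h))                   # Eq. (5): N x C x H x 1
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))  # Eq. (4): N x C x 1 x W
        return x * g_h * g_w                                    # Eq. (6)
```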

3.1. Joint Channel Coordinate Attention Function

The coordinate attention block functions as a computing unit designed to enhance the feature representation capacity of the mobile network. In the channel attention module, the input features are first aggregated by pooling, and the pooled descriptors are processed by a two-layer perceptron, resulting in two processed channel features. After these two features are added, weights are obtained using the sigmoid activation function; these weights are then multiplied with the original input features to generate the channel attention feature. The spatial attention module uses cascaded average pooling and max pooling layers to process the input features of dimensions H × W × C, and spatial weights are then derived by applying a convolutional layer with a sigmoid activation function. Finally, the obtained weights are multiplied with the input features to yield coordinate attention features. We combine channel attention with the coordinate attention mechanism to create a new joint channel coordinate attention module (CCM). As shown in Figure 2, we integrate the coordinate attention technique on top of channel attention to guarantee the acquisition of crucial channel information while acquiring location information.
For the channel attention mechanism, channel weights Zc are computed through global pooling and fully connected layers; the obtained channel weights are applied to the feature map to obtain Zg. For coordinate attention, position encoding is added to the feature map and attention weights Zcoord are computed. The channel attention weights and coordinate attention weights are then each applied to the feature map to obtain the final attention weight Zfinal, as shown in Algorithm 1 (a PyTorch sketch follows the algorithm). The formula is as follows:
$Z_{\mathrm{final}} = Z_g \times Z_{\mathrm{coord}}$  (7)
Algorithm 1 Channel coordinate association module
Input: Input images
Output: The attention weight of the fused channel attention feature and coordinate attention feature (Zf)
  1: The channel attention weight (Zc) is obtained using the channel attention method
  2: The weights are multiplied with the inputs to obtain new weights, which serve as inputs to the coordinate attention module
  3: Use the following method to calculate the coordinate attention weight
  4: for x do
  5:   Calculate horizontal weights using Equation (1)
  6:   for y do
  7:     Calculate vertical weights using Equation (2)
  8:   end for
  9: end for
  10: Aggregate the resulting features using Equation (3)
  11: for x do
  12:   The extracted (f^w) is normalized using Equation (4)
  13:   for y do
  14:     The extracted (f^h) is normalized using Equation (5)
  15:   end for
  16: end for
  17: Calculate the weight of the coordinate attention module using Equation (6)
  18: Calculate the coordinate channel association weight (Zf) using Equation (7)
  19: Return (Zf)
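Algorithm 1 amounts to composing the two attention stages. The sketch below is our reading of the CCM: a channel attention stage built from global average pooling and a two-layer perceptron, whose reweighted output feeds the CoordinateAttention sketch from the previous section. All class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Global average pooling + two-layer perceptron + sigmoid, producing
    the per-channel weights Zc (step 1 of Algorithm 1)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.mlp(x)  # step 2: reweight the input features

class CCM(nn.Module):
    """Joint channel coordinate attention: the channel-weighted features are
    passed to coordinate attention (steps 3-18), i.e. Z_final = Z_g x Z_coord."""
    def __init__(self, channels: int):
        super().__init__()
        self.channel = ChannelAttention(channels)
        self.coord = CoordinateAttention(channels)  # sketch from Section 3

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.coord(self.channel(x))
```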

3.2. The Cross-Resolution Weighting Function

The Cross-Resolution Weighting function is employed to fuse feature maps from various resolutions. Its objective is to process feature maps extracted from multiple resolution branches and unify them into a coherent feature representation. In its implementation, the function initially employs adaptive average pooling on feature maps from each resolution, ensuring their sizes are harmonized to the smallest resolution. Subsequently, these average-pooled feature maps are concatenated to create a feature map with a channel dimension equal to the total number of channels. Next, a small convolution operation is applied to reduce the channel dimension of the concatenated feature map to reduce the parameter count. Then, another small convolution operation is used to increase the channel dimension of the reduced feature map back to the original number of channels. Finally, channel-wise multiplication is employed to multiply the upsampled feature map with the original resolution feature map, achieving cross-resolution weighting. Through this method of cross-resolution weighting, it becomes possible to efficiently utilize feature maps from different resolutions, merge their information, and thereby enhance the performance and expressive capacity of the network. This weighting mechanism assists in capturing feature information at diverse scales and resolutions, enhancing the model’s receptive field and semantic representation ability.
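Read literally, these steps translate into the following sketch (our interpretation; Lite-HRNet's released implementation may differ in details such as the reduction ratio or interpolation mode):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossResolutionWeighting(nn.Module):
    """Pools every branch to the smallest resolution, concatenates along the
    channel axis, squeezes and re-expands the channels with 1x1 convolutions,
    then multiplies each branch by its upsampled weight map."""
    def __init__(self, channels_per_branch, reduction=8):
        super().__init__()
        total = sum(channels_per_branch)
        self.channels = list(channels_per_branch)
        self.squeeze = nn.Conv2d(total, total // reduction, 1)
        self.expand = nn.Conv2d(total // reduction, total, 1)

    def forward(self, branches):
        # Assume branches are ordered high -> low resolution.
        smallest = branches[-1].shape[-2:]
        pooled = [F.adaptive_avg_pool2d(b, smallest) for b in branches]
        w = torch.cat(pooled, dim=1)
        w = torch.sigmoid(self.expand(F.relu(self.squeeze(w))))
        weights = torch.split(w, self.channels, dim=1)
        # Upsample each weight map back to its branch resolution and reweight.
        return [b * F.interpolate(wi, size=b.shape[-2:], mode='nearest')
                for b, wi in zip(branches, weights)]
```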

3.3. Lightweight Block

To incorporate the joint channel coordinate attention mechanism module into Lite-HRNet, we introduce the lightweight block. Initially, we construct the foundational network structure of Lite-HRNet in accordance with its architecture, encompassing convolutional layers, pooling layers, and other elements; the parameters of these layers can be fine-tuned and optimized to match the particular application. We then substitute the spatial weighting module with the channel coordinate fusion function. Specifically, we add position encoding to the feature map of this module; the position encoding can be simple spatial coordinates or more advanced position information encoding, such as attention-based position encoding. In the channel coordinate fusion module, we calculate the channel coordinate weights, which can be achieved by applying global average pooling and fully connected layers, and then multiply the obtained weights with the feature map to weight it. In this manner, the network focuses more on the target region in the image during human pose estimation tasks. After constructing the network architecture and introducing the channel coordinate fusion module, we train the network using standard human pose estimation datasets. By defining suitable loss functions and optimization algorithms, we update and optimize the network parameters to enhance performance in human pose estimation tasks.
As can be seen from Figure 3, the shuffle module divides the channels into two parts through a channel split operation and outputs features after point-wise convolution and other operations. The Lite-HRNet module introduces conditional channel weighting as an efficient unit for exchanging information across channels, replacing the expensive point-wise convolutions in shuffle blocks. Different from these two blocks, we introduce a new module called the CCM. The CCM stores rich channel information, and the position information of each coordinate of the input channel is preserved alongside the channel information. By weighting the channels, the network can allocate more resources to the channels that are most helpful for the task, thereby improving overall performance and achieving higher efficiency with limited resources. By learning channel weights, the network can automatically select the most discriminative channels: some channels may contain task-irrelevant noise or redundant information, while others may be crucial for the task. Through weighting, the network can automatically disregard task-irrelevant channels, thereby improving performance. It should be noted that the module "S" in Figure 3 is more lightweight than the CCM, because the CCM contains an additional coordinate attention block; however, the impact of this coordinate attention block on model complexity is almost negligible, as verified in the experiments section. A block sketch combining these pieces follows.
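The sketch below is our approximation of the block in Figure 3c, reusing the CCM sketch from Section 3.1; the exact layer ordering inside the released model may differ.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int = 2) -> torch.Tensor:
    """Interleave channels across groups so information flows between the
    two halves created by the channel split."""
    n, c, h, w = x.shape
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

class LightweightCCMBlock(nn.Module):
    """Shuffle-style block: split the channels, transform one half with a
    depthwise 3x3 convolution and the CCM weighting (in place of the spatial
    weighting / point-wise convolutions), then concatenate and shuffle."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.dw = nn.Conv2d(half, half, 3, padding=1, groups=half, bias=False)
        self.bn = nn.BatchNorm2d(half)
        self.ccm = CCM(half)  # joint channel coordinate attention (Section 3.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)           # channel split
        x2 = self.ccm(self.bn(self.dw(x2)))  # cheap transform + CCM weighting
        return channel_shuffle(torch.cat([x1, x2], dim=1))
```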
We present a lightweight human pose estimation network based on joint channel coordinate attention. It preserves high resolution and high sensitivity and includes abundant scale information and fine-grained image details, all with fewer parameters. We also propose a new lightweight module, an improved shuffle block designed for accurate keypoint extraction in complex scenes. It incorporates channel weighting units and depthwise convolution to address the problem of excessive computational complexity. The channel weighting units allow adaptive weighting of the features in each channel, improving the accuracy of keypoint localization, while depthwise convolution enlarges the receptive field, enabling better capture of global context and enhancing the robustness of pose estimation. The overall design also takes into account the lightweight and efficient nature of the network: it can achieve rapid pose estimation while preserving high-resolution image details, which is crucial for real-time applications.

4. Experiments

The PyTorch deep learning framework serves as the foundation for our proposed network structure. Random flipping, rotation, resizing, and cropping are used to enhance image data. All models are trained on an NVIDIA TITAN RTX GPU. For the COCO dataset, the aspect ratio of input images is set to 4:3, and the input sizes are configured as 256 × 192 and 384 × 288. Mean squared error loss and the Adam optimizer are used for training the network. The total number of training iterations is 210, with an initial learning rate of 2 × 10−3. Learning rates are adjusted to 2 × 10−4 and 2 × 10−5 at iterations 170 and 210, respectively.
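For reference, this training configuration maps onto PyTorch roughly as follows. This is a sketch: the toy `model` and dummy data loader are stand-ins for the full network and the augmented COCO pipeline.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.optim.lr_scheduler import MultiStepLR

# Stand-ins: a toy head predicting 17 keypoint heatmaps from RGB input, and a
# dummy loader; in the paper these are the full network and augmented COCO data.
model = torch.nn.Conv2d(3, 17, kernel_size=3, padding=1)
train_loader = DataLoader(TensorDataset(torch.randn(8, 3, 64, 48),
                                        torch.randn(8, 17, 64, 48)), batch_size=4)

criterion = torch.nn.MSELoss()                              # heatmap regression loss
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)   # initial LR 2e-3
scheduler = MultiStepLR(optimizer, milestones=[170, 210], gamma=0.1)  # -> 2e-4, 2e-5

for epoch in range(210):
    for images, target_heatmaps in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), target_heatmaps)
        loss.backward()
        optimizer.step()
    scheduler.step()
```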

4.1. Dataset

The COCO dataset [37], provided by the Microsoft team, is an extensive dataset designed for human pose estimation. It encompasses 200,000 images that capture diverse human subjects with varying body types and poses in different environmental settings. Annotations for the COCO dataset comprise 250,000 person instances labeled with keypoints and are stored in JSON files. The MPII dataset [38] currently stands as one of the most widely used benchmarks for human pose estimation. It encompasses approximately 25,000 images with over 40,000 annotated human poses, categorizing it as a multi-person pose dataset. Images are stored in RGB format and capture around 410 everyday human activities, covering a wide range of body postures.

4.2. Evaluation Metric

OKS is used in the field of human pose estimation to measure the similarity between predicted keypoints and ground-truth keypoints, and it is a popular evaluation index for current human keypoint detection algorithms. A higher OKS indicates a closer match between predicted and ground-truth keypoints. Formula (8) shows the calculation of OKS:

$\mathrm{OKS} = \frac{\sum_i e^{-d_i^2 / (2 s^2 k_i^2)} \, \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$  (8)

where d_i is the Euclidean distance between the i-th predicted keypoint and its ground truth, s is the object scale, k_i is a per-keypoint normalization constant, and v_i is the visibility flag of keypoint i.
PCKh quantifies the proportion of predicted keypoints whose normalized distance to the corresponding ground-truth keypoints falls below a specified threshold. PCKh is often chosen as the evaluation metric for the MPII dataset. The calculation is illustrated in Formula (9):

$\mathrm{PCKh} = \frac{\sum_{n} \delta\left(\lVert P_n - G_n \rVert_2 \le \alpha \cdot L_{\mathrm{head}}\right) \cdot V_n}{N}$  (9)

where P_n and G_n denote the predicted and ground-truth positions of the n-th keypoint, L_head is the length of the head segment, α is the matching threshold, V_n is the visibility flag of keypoint n, and N is the total number of evaluated keypoints.
AP (Average Precision) measures keypoint detection accuracy over the test set. AP first calculates the OKS between predicted keypoints and ground-truth keypoints; when the OKS is greater than a given threshold T, the current person is considered to be correctly detected. The final AP is determined from the OKS calculations as follows:

$\mathrm{AP} = \frac{\sum_p \delta(\mathrm{OKS}_p > T)}{\sum_p 1}$  (10)
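As a concrete illustration of Formulas (8) and (10), the two metrics can be computed as follows. This is a sketch with illustrative variable names; in practice the standard COCO evaluation API is used.

```python
import numpy as np

def oks(pred, gt, visibility, scale, k):
    """Formula (8): Gaussian-weighted similarity between predicted and
    ground-truth keypoints, averaged over labeled (visible) keypoints.
    pred, gt: (K, 2) arrays; visibility, k: (K,) arrays; scale: object scale s."""
    d2 = np.sum((pred - gt) ** 2, axis=1)
    e = np.exp(-d2 / (2 * scale ** 2 * k ** 2))
    labeled = visibility > 0
    return e[labeled].sum() / max(labeled.sum(), 1)

def average_precision(oks_per_person, threshold=0.5):
    """Formula (10): fraction of detected persons whose OKS exceeds T."""
    return float(np.mean(np.asarray(oks_per_person) > threshold))
```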

4.3. Experimental Results

We introduce a lightweight human pose estimation model based on a joint channel coordinate attention mechanism module and test its performance on the COCO and MPII datasets. The COCO test set contains 5000 images and 104,125 samples; the test takes 5647 s, an average of 18.4 samples per second. The MPII test set contains 2729 images and 2958 samples; the test takes 195 s, an average of 15.2 samples per second. In comparison to other methods, our method shows clear advantages in the number of parameters and accuracy. Table 1 and Table 2 demonstrate that, while our model's overall prediction accuracy is lower than that of HRNet, it greatly reduces the number of parameters and the computational complexity compared to HRNet. Compared with Hourglass and SimpleBaseline, our method improves accuracy while reducing the number of parameters and the computational complexity. Compared with the lightweight model Lite-HRNet, accuracy is also improved while the number of parameters is kept low. Therefore, the proposed lightweight human pose estimation model with joint channel coordinate attention can effectively reduce the number of network parameters and the computational complexity while ensuring good prediction accuracy. Figure 4 shows the results on the COCO validation set.
An important hallmark of a lightweight model is the reduction in its number of parameters. Diminishing the parameter count alleviates the computational burden during both training and inference, rendering the model more suitable for resource-constrained environments. The computational complexity of a model refers to the amount of computation required for inference. Lightweight models usually reduce computational complexity by designing simpler network structures or by using lightweight convolution operations, such as depthwise separable convolution and channel-wise convolution. Our proposed lightweight 2D human pose estimation model with joint channel coordinate attention reduces the computational burden while maintaining performance. In our work, model complexity is measured through two indicators, Params and GFLOPs, which represent the number of parameters and the amount of computation, respectively. As shown in Table 1 and Table 2, both Params and GFLOPs are significantly reduced.
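In practice, the Params figure can be read directly off the model, whereas GFLOPs are typically measured with a profiling tool. A minimal sketch of the parameter count:

```python
def count_params(model) -> float:
    """Number of trainable parameters, in millions (the 'Params' column)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```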
We conducted visual validation on both the COCO 2017 and MPII validation datasets, randomly selecting images from these sets with various scenarios: single-person keypoints without occlusion, single-person keypoints with occlusion, multi-person keypoints without occlusion, and multi-person keypoints with occlusion. The detection results of our model under different conditions are presented in Figure 5 and Figure 6.
During testing, the COCO dataset did not use manual bounding boxes and instead used a two-stage top-down architecture for identification and prediction. In contrast, the MPII dataset followed a standard testing strategy with predefined manual bounding boxes. As shown in Table 3, we provide a comparison of our results on the COCO test set.
We showed in the previous section how well our method works for full-body pose estimation. As shown in Figure 7, our method performs consistently across video frames. By running our trained pose estimation model, we successfully detected keypoints of the human body in each video frame. The keypoint location information includes the head, shoulders, arms, legs, and other body parts. Our model consistently identifies keypoint locations under different actions and perspectives, thereby depicting the motion trajectory of the human body in the video.
However, as shown in Figure 8, we have also encountered some failure cases in the experiment, in which some keypoints of the human body were not accurately detected. For example, when the human body posture is affected by various factors such as light changes, occlusion, posture diversity, and background interference, the attention module focuses on part of the human body but incorrectly detects other keypoints that do not exist. This shows that the robustness of our model to complex interference is still insufficient, and we plan to continue to pay attention to this issue in subsequent research.

4.4. Ablation Experiments

We conducted ablation experiments on the COCO and MPII datasets to examine the contributions of different model components. First, we incorporated the joint channel coordinate attention weighting function into the HRNet model to confirm its efficacy. With the input image size set at 256 × 192, as shown in Table 4, our model exhibited an accuracy improvement of approximately 2.6%, notably outperforming methods employing other attention mechanisms.
Furthermore, ablation experiments were performed on the CCM. In the base HRNet network, the channel attention, coordinate attention, and CCM modules were added one after the other. As shown in Table 4, when the coordinate attention module or the channel attention module is included alone, the accuracy gain is clearly smaller than with the CCM; nevertheless, each enhances the model's accuracy without a substantial increase in parameters. The final experimental results underscore that including the CCM yields the best pose estimation performance.
We then adopted the Lite-HRNet network architecture with 32 input channels and an input image size of 256 × 192 and conducted a series of ablation experiments on the COCO dataset. In these experiments, the basic feature extraction modules in the model were progressively replaced with the lightweight CCM in four stages. As the fundamental modules in the high-resolution network were successively substituted with lightweight modules relying on the joint channel coordinate attention mechanism, there was a substantial reduction in the number of parameters, accompanied by only a slight decrease in average precision; with the support of our model, the average precision remained stable at around 74.0. Subsequently, further ablation experiments were conducted on the joint channel coordinate module and the overall model, incrementally introducing the channel attention module, the coordinate attention module, and the CCM. According to the results in Table 5, introducing only the channel attention module or the coordinate attention module keeps the number of parameters and computations low but yields noticeably lower accuracy than the CCM, whereas introducing the CCM further improves accuracy with almost no increase in parameters. The final experimental results clearly demonstrate that introducing the CCM achieves the best pose estimation performance in terms of both parameter count and keypoint detection accuracy.
Although our work achieves good results, the proposed model still faces significant challenges in dynamic scenes. For example, in video surveillance the human body may move rapidly, and the lightweight model proposed in our work may not be able to capture and accurately estimate human posture in a short time. This is a notable limitation for applications that require immediate feedback, such as pose tracking.

5. Conclusions

We aim to design an efficient lightweight model to address the large parameter counts of existing human pose estimation networks and their reduced detection accuracy, particularly in complex scenarios. Our model lowers the number of parameters through lightweight modifications, while improving accuracy over the fundamental residual-network modules by including the joint channel coordinate attention mechanism module. In the experimental design, we incorporate the joint channel coordinate module throughout the entire model, replacing residual blocks in various stages of the Lite-HRNet model with lightweight modules. Both the quantitative and qualitative results show that our model successfully lowers the number of parameters and the computational complexity of human pose estimation models while preserving keypoint detection accuracy comparable to Lite-HRNet. Future research directions include applying lightweight human pose estimation models in practical scenarios such as the diagnosis of motor disorders and pose tracking, further promoting the use and development of lightweight models in real-world applications.

Author Contributions

Conceptualization, Z.L.; Data curation, B.L., R.F. and F.J.; Formal analysis, Y.C. and B.L.; Investigation, Z.L. and Y.C.; Methodology, Z.L., M.X. and F.J.; Project administration, Z.L. and H.C.; Resources, Y.C. and B.L.; Software, Y.C. and H.C.; Supervision, M.X., H.C. and F.J.; Validation, B.L., R.F. and H.C.; Visualization, R.F.; Writing—original draft, M.X.; Writing—review and editing, M.X., R.F. and F.J. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was supported by the Henan Provincial Science and Technology Research Project under Grant 232102211006 and 232102210044, the Songshan Laboratory Pre-research Project under Grant YYJC012022023, the Research and Practice Project of Higher Education Teaching Reform in Henan Province under Grant 2019SJGLX320 and 2019SJGLX020, the Undergraduate Universities Smart Teaching Special Research Project of Henan Province under Grant Jiao Gao [2021] No. 489-29, and the Academic Degrees & Graduate Education Reform Project of Henan Province under Grant 2021SJGLX115Y.

Data Availability Statement

The data presented in this study are openly available in [COCO] at [arXiv:1405.0312], reference number [37], and [MPII] at [10.1109/CVPR.2014.471], reference number [38].

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, C.; Wang, T.; Li, D.; Hong, J. Repetitive assembly action recognition based on object detection and pose estimation. J. Manuf. Syst. 2020, 55, 325–333. [Google Scholar] [CrossRef]
  2. da Silva, M.V.; Marana, A.N. Human action recognition in videos based on spatiotemporal features and bag-of-poses. Appl. Soft Comput. 2020, 95, 106513. [Google Scholar] [CrossRef]
  3. Casado, F.; Losada, D.P.; Santana-Alonso, A. Pose estimation and object tracking using 2D images. Procedia Manuf. 2017, 11, 63–71. [Google Scholar] [CrossRef]
  4. Chen, K.; Gabriel, P.; Alasfour, A.; Gong, C.; Doyle, W.K.; Devinsky, O.; Friedman, D.; Dugan, P.; Melloni, L.; Thesen, T.; et al. Patient-specific pose estimation in clinical environments. IEEE J. Transl. Eng. Health Med. 2018, 6, 2101111. [Google Scholar] [CrossRef] [PubMed]
  5. Zhou, T.; Wang, W.; Qi, S.; Ling, H.; Shen, J. Cascaded human-object interaction recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4263–4272. [Google Scholar]
  6. Jiang, S.; Wang, Q.; Cheng, F.; Qi, Y.; Liu, Q. A Unified Object Counting Network with Object Occupation Prior. IEEE Trans. Circuits Syst. Video Technol. 2023. [Google Scholar] [CrossRef]
  7. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  8. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar]
  9. Yu, C.; Xiao, B.; Gao, C.; Yuan, L.; Zhang, L.; Sang, N.; Wang, J. Lite-hrnet: A lightweight high-resolution network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10440–10450. [Google Scholar]
  10. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar]
  11. Cheng, B.; Xiao, B.; Wang, J.; Shi, H.; Huang, T.S.; Zhang, L. Higherhrnet: Scale-aware representation learning for bottom-up human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5386–5395. [Google Scholar]
  12. Newell, A.; Yang, K.; Deng, J. Stacked hourglass networks for human pose estimation. In Proceedings of the Computer Vision–ECCV 2016 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 483–499. [Google Scholar]
  13. Papaioannidis, C.; Mademlis, I.; Pitas, I. Fast single-person 2D human pose estimation using multi-task Convolutional Neural Networks. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  14. Wang, W.; Zhou, T.; Qi, S.; Shen, J.; Zhu, S.-C. Hierarchical human semantic parsing with comprehensive part-relation modeling. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3508–3522. [Google Scholar] [CrossRef]
  15. Fang, H.-S.; Li, J.; Tang, H.; Xu, C.; Zhu, H.; Xiu, Y.; Li, Y.-L.; Lu, C. Alphapose: Whole-body regional multi-person pose estimation and tracking in real-time. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 7157–7173. [Google Scholar] [CrossRef] [PubMed]
  16. Yang, J.; Zeng, A.; Liu, S.; Li, F.; Zhang, R.; Zhang, L. Explicit box detection unifies end-to-end multi-person pose estimation. arXiv 2023, arXiv:2302.01593. [Google Scholar]
  17. Zhou, T.; Yang, Y.; Wang, W. Differentiable Multi-Granularity Human Parsing. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 8296–8310. [Google Scholar] [CrossRef] [PubMed]
  18. Jiang, S.; Qi, Y.; Cai, S.; Lu, X. Light fixed-time control for cluster synchronization of complex networks. Neurocomputing 2021, 424, 63–70. [Google Scholar] [CrossRef]
  19. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  20. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. Shufflenet v2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV) 2018, Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  21. Qin, Z.; Li, Z.; Zhang, Z.; Bao, Y.; Yu, G.; Peng, Y.; Sun, J. ThunderNet: Towards real-time generic object detection on mobile devices. In Proceedings of the IEEE/CVF International Conference on Computer Vision 2019, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6718–6727. [Google Scholar]
  22. Tan, M.; Le, Q.V. Mixconv: Mixed depthwise convolutional kernels. arXiv 2019, arXiv:1907.09595. [Google Scholar]
  23. Li, J.; Wang, C.; Huang, B.; Zhou, Z. ConvNext-backbone HoVerNet for nuclei segmentation and classification. arXiv 2022, arXiv:2202.13560. [Google Scholar]
  24. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. Convnext v2: Co-designing and scaling ConvNets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023, Vancouver, BC, Canada, 17–24 June 2023; pp. 16133–16142. [Google Scholar]
  25. Lv, X.; Hao, W.; Tian, L.; Han, J.; Chen, Y.; Cai, Z. LiteDEKR: End-to-end lite 2D human pose estimation network. IET Image Process. 2023, 17, 3392–3400. [Google Scholar] [CrossRef]
  26. Zhang, L.; Zheng, J.C.; Zhao, S.J. An improved lightweight high-resolution network based on multi-dimensional weighting for human pose estimation. Sci. Rep. 2023, 13, 7284. [Google Scholar] [CrossRef]
  27. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 14–19 June 2020; pp. 11534–11542. [Google Scholar]
  28. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV) 2018, Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  29. Rui, L.; Gao, Y.; Ren, H. EDite-HRNet: Enhanced Dynamic Lightweight High-Resolution Network for Human Pose Estimation. IEEE Access 2023, 11, 95948–95957. [Google Scholar] [CrossRef]
  30. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  31. Chen, W.; Hong, D.; Qi, Y.; Han, Z.; Wang, S.; Qing, L.; Huang, Q.; Li, G. Multi-attention network for compressed video referring object segmentation. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 4416–4425. [Google Scholar]
  32. Yi, Y.; Ni, F.; Ma, Y.; Zhu, X.; Qi, Y.; Qiu, R.; Zhao, S.; Li, F.; Wang, Y. High Performance Gesture Recognition via Effective and Efficient Temporal Modeling. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China, 10–16 August 2019; pp. 1003–1009. [Google Scholar]
  33. Wang, Q.; Zhang, J.; Yang, K.; Peng, K.; Stiefelhagen, R. Matchformer: Interleaving attention in transformers for feature matching. In Proceedings of the Asian Conference on Computer Vision 2022, Macao, China, 4–8 December 2022; pp. 2746–2762. [Google Scholar]
  34. Wang, X.; Tong, J.; Wang, R. Attention refined network for human pose estimation. Neural Process. Lett. 2021, 53, 2853–2872. [Google Scholar] [CrossRef]
  35. Maji, D.; Nagori, S.; Mathew, M.; Poddar, D. Yolo-pose: Enhancing YOLO for multi-person pose estimation using object keypoint similarity loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2022, New Orleans, LA, USA, 19–20 June 2022; pp. 2637–2646. [Google Scholar]
  36. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2021, Nashville, TN, USA, 19–25 June 2021; pp. 13713–13722. [Google Scholar]
  37. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference 2014, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  38. Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2D human pose estimation: New benchmark and state-of-the-art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2014, Columbus, OH, USA, 23–28 June 2014; pp. 3686–3693. [Google Scholar]
  39. Chen, Y.; Wang, Z.; Peng, Y.; Zhang, Z.; Yu, G.; Sun, J. Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2018, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7103–7112. [Google Scholar]
  40. Xiao, B.; Wu, H.; Wei, Y. Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV) 2018, Munich, Germany, 8–14 September 2018; pp. 466–481. [Google Scholar]
  41. Zhang, F.; Zhu, X.; Dai, H.; Ye, M.; Zhu, C. Distribution-aware coordinate representation for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020, Seattle, WA, USA, 14–19 June 2020; pp. 7093–7102. [Google Scholar]
Figure 1. Basic framework diagram.
Figure 2. Joint channel coordinate attention mechanism module (CCM).
Figure 3. Joint channel coordinate attention mechanism module, where (a) (left) represents the shuffle module, in which a large number of point-wise convolutions are used, (b) (middle) represents the lightweight module in Lite-HRNet, using conditional channel weighting to replace expensive point-wise convolution, and (c) (right) represents the channel coordinate fusion module; through weighting, the model can focus on more important information, thereby improving performance. The yellow shapes h, S and CCM denote the cross-resolution weighting function, the spatial weighting function, and the joint channel coordinate attention mechanism weighting function, respectively.
Figure 4. Accuracy and FLOPs on the COCO dataset with an input size of 384 × 288. The bar chart represents accuracy on the COCO dataset, while the line chart represents FLOPs.
Figure 5. The detection performance on the COCO dataset.
Figure 6. The detection outcomes on the MPII dataset.
Figure 7. Experimental results using video as an example. The red numbers indicate the frame sequence numbers.
Figure 8. Failure cases presented in the experiment.
Table 1. Comparison of various network results on the validation set of the COCO dataset. The bold numbers are the best results.

| Model | Backbone | Input size | Params | GFLOPs | AP | AP50 | AP75 | APM | APL | AR |
|---|---|---|---|---|---|---|---|---|---|---|
| *Large networks* | | | | | | | | | | |
| Hourglass [12] | Hourglass | 256 × 192 | 25.1 M | 14.3 | 66.9 | - | - | - | - | - |
| CPN [39] | ResNet-50 | 256 × 192 | 27.0 M | 6.20 | 68.6 | - | - | - | - | - |
| SimpleBaseline [40] | ResNet-50 | 256 × 192 | 34.0 M | 8.90 | 70.4 | 88.6 | 78.3 | 67.1 | 77.2 | 76.3 |
| HRNet | HRNet-W32 | 256 × 192 | 28.5 M | 7.10 | 73.4 | 89.5 | 80.7 | 70.2 | 80.1 | 78.9 |
| DARK [41] | HRNet-W32 | 128 × 96 | 63.6 M | 3.6 | 71.9 | 89.1 | 79.6 | 69.2 | 78.0 | 77.9 |
| *Small networks* | | | | | | | | | | |
| MobileNetV2 | MobileNetV2 | 256 × 192 | 9.6 M | 1.48 | 64.6 | 87.4 | 72.3 | 61.1 | 71.2 | 70.7 |
| MobileNetV2 | MobileNetV2 | 384 × 288 | 9.6 M | 3.33 | 67.3 | 87.9 | 74.3 | 62.8 | 74.7 | 72.9 |
| ShuffleNetV2 | ShuffleNetV2 | 256 × 192 | 7.6 M | 1.28 | 59.9 | 85.4 | 66.3 | 56.6 | 66.2 | 66.4 |
| ShuffleNetV2 | ShuffleNetV2 | 384 × 288 | 7.6 M | 2.87 | 63.6 | 86.5 | 70.5 | 59.5 | 70.7 | 69.7 |
| Lite-HRNet | Lite-HRNet | 256 × 192 | 1.8 M | 0.31 | 67.2 | 88.0 | 75.0 | 64.3 | 73.1 | 73.3 |
| Lite-HRNet | Lite-HRNet | 384 × 288 | 1.8 M | 0.70 | 70.4 | 88.7 | 77.7 | 67.5 | 76.3 | 76.2 |
| Ours | Lite-HRNet | 256 × 192 | 1.8 M | 0.31 | 68.8 | 89.3 | 74.9 | 64.5 | 72.9 | 69.6 |
| Ours | Lite-HRNet | 384 × 288 | 1.8 M | 0.72 | 71.5 | 91.5 | 79.2 | 69.1 | 75.2 | 74.6 |
Table 2. Comparison of various network results on the MPII validation set (PCKh@0.5). The bold numbers are the best results.

| Model | Params | GFLOPs | Head | Shoulder | Elbow | Wrist | Hip | Knee | Ankle | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| Hourglass | - | - | 96.5 | 95.3 | 88.4 | 82.5 | 87.1 | 83.5 | 78.3 | 87.5 |
| Simple Baseline | - | - | 96.7 | 95.4 | 88.6 | 82.9 | 87.5 | 83.8 | 79.0 | 87.9 |
| HRNet | - | - | 97.1 | 95.9 | 90.3 | 86.4 | 89.1 | 87.2 | 83.3 | 90.3 |
| Lite-HRNet | 1.1 M | 0.27 | 96.0 | 94.2 | 85.9 | 80.1 | 85.9 | 82.2 | 76.4 | 86.1 |
| Ours | 1.1 M | 0.28 | 96.3 | 94.3 | 86.6 | 80.3 | 86.7 | 82.0 | 77.0 | 86.6 |
Table 3. Comparative analysis of results on the test set of the COCO dataset in comparison to various network results.

| Method | Backbone | Pretrain | AP | AP50 | AP75 | APM | APL |
|---|---|---|---|---|---|---|---|
| Hourglass | Hourglass | N | 66.9 | - | - | - | - |
| CPN | ResNet-50 | Y | 68.6 | - | - | - | - |
| CPN + OHKM | ResNet-50 | Y | 69.4 | - | - | - | - |
| HRNet | HRNet-W32 | N | 73.4 | 89.5 | 80.7 | 70.2 | 80.1 |
| HRNet | HRNet-W32 | Y | 75.8 | 90.6 | 82.7 | 71.9 | 81.0 |
| Ours | HRNet-W32 | Y | 69.5 | 88.1 | 77.1 | 72.8 | 75.5 |
Table 4. Ablation analysis of attention mechanisms within the HRNet model.

| Method | AP | AP50 | AP75 | APM | APL | AR |
|---|---|---|---|---|---|---|
| HRNet | 73.4 | 89.5 | 80.7 | 70.2 | 80.1 | 78.9 |
| HRNet + Channel Attention | 73.7 | 90.5 | 78.9 | 69.3 | 80.0 | 79.2 |
| HRNet + CBAM | 74.9 | 91.3 | 80.9 | 69.8 | 79.3 | 79.6 |
| HRNet + Coordinate Attention | 74.7 | 91.5 | 79.2 | 70.4 | 80.3 | 79.3 |
| Ours | 76.0 | 93.5 | 83.6 | 72.8 | 80.4 | 80.2 |
Table 5. Ablation experiments on the CCM in the Lite-HRNet model. '-' indicates the corresponding module is not used, and '✓' indicates the corresponding module is used.

| Order | Channel Attention | Coordinate Attention | CCM | AP | AR |
|---|---|---|---|---|---|
| 1 | - | - | - | 67.2 | 73.3 |
| 2 | ✓ | - | - | 67.0 | 74.1 |
| 3 | - | ✓ | - | 67.9 | 74.5 |
| 4 | - | - | ✓ | 68.8 | 74.6 |
