A Multi-Channel Parallel Keypoint Fusion Framework for Human Pose Estimation

Abstract: Although modeling self-attention can significantly reduce computational complexity, human pose estimation performance is still affected by occlusion and background noise, and undifferentiated feature fusion leads to significant information loss. To address these issues, we propose a novel human pose estimation framework called DatPose (deformable convolution and attention for human pose estimation), which combines deformable convolution and self-attention. Considering that the keypoints of the human body are mostly distributed at the edge of the human body, we adopt a deformable convolution strategy to obtain the low-level feature information of the image. Our proposed method leverages visual cues to capture detailed keypoint information, which we embed into the Transformer encoder to learn the keypoint constraints. More importantly, we design a multi-channel two-way parallel module with self-attention and convolution fusion to enhance the weight of the keypoints in visual cues. To strengthen the implicit relationship of fusion, we generate keypoint tokens for the visual cues of the fusion module and for the transformers, respectively. Our experimental results on the COCO and MPII datasets show that the keypoint fusion module improves keypoint information. Extensive experiments and visual analysis demonstrate the robustness of our model in complex scenes, and our framework outperforms popular lightweight networks in human pose estimation.


Introduction
Estimating the 2D coordinates of human keypoints from images is a fundamental research topic in the field of computer vision. It has broad application prospects, including human activity recognition [1], action quality evaluation [2], and autonomous driving [3]. The task requires consideration of both the position information and the constraint relationships between the keypoints.
Recent studies have achieved remarkable success in human pose estimation by spatially locating keypoints alone [4][5][6]. However, these methods rely on scale information to achieve high-resolution data, which requires significant computational resources. Additionally, feature extraction based on convolution and pooling kernels of a fixed size cannot effectively capture the constraint relationships between keypoints. These constraints represent the interdependencies and geometric relationships between different body parts. For example, the position of the elbow determines the position of the wrist, and the alignment of the neck affects the tilt of the head. The human pose is a complex system, with fixed relationships between its components. These relationships can be based on anatomical geometric constraints or on dynamic constraints related to movement and actions, and they influence the positions, angles, and relative positions of various body parts. Fixed-size convolution and pooling operations are not suitable for capturing them, because such operations cannot adapt to variations in poses, angles, and relative positions. They treat each keypoint as an independent entity and overlook the interdependencies and geometric constraints between keypoints. This can lead to suboptimal accuracy in pose estimation, as the relationships between keypoints are not fully utilized. Therefore, developing a robust model that can effectively recognize and establish relationships among keypoints is crucial for accurate human pose estimation. To achieve this, researchers must focus on improving the model's ability to emphasize essential keypoint information.
Researchers have introduced the transformer model, originally used in natural language processing (NLP) [7], to advance research in this direction. Applying the vision transformer to constrain visual cues is an innovative and effective method for pose estimation [8][9][10]. The transformer model utilizes a self-attention mechanism in its encoder and decoder modules, enabling it to calculate the response as a weighted sum over the features at all locations in the feature map. This inherent global modeling capability has led to significant advancements in various pose estimation tasks, as evidenced by the numerous transformer-based models. Yang et al. [8] introduced a method that leverages image tokens to capture visual cues, akin to the way word2vec captures similarity between words and characters in a vector space. However, although their embedded attention mechanism is capable of computing global attention, it overlooks the crucial constraint relationship between keypoints and visual cues. Therefore, Li et al. [9] proposed a new method called TokenPose to solve this problem. Specifically, TokenPose introduces the utilization of tokens to represent individual keypoints. This approach facilitates the acquisition of both visual cues and constraint relations through interactions with visual and other keypoint tokens. While the constraint strategy effectively addresses the limitations of fusing visual cues and keypoint information, it does introduce some background noise. Additionally, the keypoint tokens are treated together with visual cues, without strengthening keypoint information.
In this work, we propose a novel convolution and self-attention parallel multi-channel keypoint fusion method, which aims to emphasize keypoint features. Some works, such as TransPose and HRFormer [8,11], use a convolutional neural network (CNN) as a backbone, utilizing early layers to capture low-level visual information and deeper layers for richer feature expression. In DatPose, the situation is quite distinct. Our primary objective in designing the deformable convolution is to selectively capture edge keypoint features specific to the human body in an adaptive way. In the first stage, rather than simply extracting visual cues, we extract two streams of features in parallel using convolution and attention mechanisms to strengthen the keypoint information. Finally, we divide the feature map into patches and keypoints as tokens, which are fed into the Transformer encoder to learn the constraint relationship between visual cues and keypoints, thus improving the network's performance.
The main contributions of this paper can be summarized as follows:
(1) We introduce a deformable convolution that selectively adapts to the target in a human body image, reducing information redundancy by filtering out irrelevant information and sampling at the appropriate locations.
(2) We propose a keypoint fusion module that combines convolution and self-attention to enhance keypoint information and minimize background noise.
(3) Experimental results on COCO demonstrate that our proposed method, DatPose, efficiently incorporates information from visual cues and keypoint information at multiple levels, achieving state-of-the-art performance on 2D metrics.
The present research is organized as follows: Section 2 provides a comprehensive overview of existing literature in the field, Section 3 elaborates on the architecture of DatPose, Section 4 presents the experimental validation and in-depth analysis, and finally, the paper concludes with pertinent findings and conclusions.

Related Work
The subsequent passage presents a concise overview of pertinent literature on vision transformers, 2D pose estimation, and convolution-enhanced attention.

Vision Transformer
The transformer architecture was initially introduced in the natural language processing domain to overcome the issue of long-distance dependencies and has resulted in significant advancements in classification, segmentation, detection, and virtual reality. Recently, the Vision Transformer [12] has been adapted to computer vision by splitting images into patches and processing them as tokens, akin to NLP inputs. Liu et al. [13] introduced a hierarchical architecture that incorporates the fusion of image patches in deeper layers. This design enables the model to effectively process images with diverse dimensions. It also introduced a shifted-window mechanism that computes self-attention locally in non-overlapping windows. Various transformer-based models have been enhanced through widely used model compression techniques, such as DeiT [14], which employed knowledge distillation to acquire the inductive biases inherent in CNNs. Nevertheless, these approaches primarily concentrate on particular classification tokens and are not directly applicable to pose estimation tasks. In contrast, Rao et al. [15] employed a dynamic token sparsification framework to progressively and dynamically remove redundant tokens.

2D Human Pose Estimation
Two-dimensional pose estimation has witnessed significant progress in recent years, with CNN architectures being the typical solution for human pose estimation [4]. Unlike 3D human pose estimation [16], these architectures use a multi-scale approach to capture keypoint information by changing the resolution through the use of hourglass structures. However, this approach may not fully exploit information from various scales. In this regard, Sun et al. [17] and Wu et al. [18] achieved high accuracy by extracting features from different resolutions with parallel convolutions while maintaining a high resolution. Nonetheless, the method is computationally expensive and does not consider the constraints between keypoints. Xu et al. [10] leveraged transformer-based methods to deal with these spatial constraints. As an extension, Yang et al. [8] combined convolutional and transformer-based methods to further improve performance. Nonetheless, such methods may be vulnerable once keypoints are partially obscured, as their constraints may be insufficiently strong. To mitigate this issue, Li et al. [9] proposed a separate keypoint extraction mechanism, later integrated with visual information to enhance the inter-keypoint constraints. However, this approach treats visual cues and keypoint information equally, without considering the greater importance of keypoint information in visual cues. In response, we propose a novel method that combines deformable convolution and transformer-based approaches to better capture the significance of keypoints in visual cues.

Convolution Enhanced Attention
In computer vision tasks, especially in vision transformers, the self-attention network's inductive bias is weak. To address this issue, several methods have introduced convolution operations to strengthen the inductive bias. Wu et al. [19] employed convolution in the tokenization process and integrated strided convolution to reduce the computational complexity of self-attention. ViT [12] with a convolutional stem achieved better performance by adding convolutions at the early stage. Dong et al. [20] introduced positional coding based on convolution and showcased advancements in downstream tasks. Additionally, Peng et al. [21] merged a transformer with a separate CNN model to incorporate both features. However, existing approaches often integrate features from cascade hierarchies, whereas our method strives to eliminate such cascade dependencies and process features in a parallel way, aligning better with the transformer's objective of reducing computational cost. Furthermore, in contrast to the conventional approach of augmenting the high-level features generated by deep convolutional neural networks with fine-grained low-level features, our proposed fusion attention module specifically targets keypoint feature information. This emphasis on keypoint feature integration distinguishes our method from others. We integrate the keypoint information into the convolutional stream, allowing for joint learning and increasing the weight of keypoint information relative to visual cues.

Materials and Methods
Figure 1 depicts the overall architecture of our proposed DatPose, which employs convolution and self-attention blocks to extract keypoints at the human body edges. First, image I is fed to the stem CNN, which reduces the complexity of subsequent feature extraction and yields a feature map F with dimensions H × W × C, where H, W, and C represent height, width, and channel, respectively. To enhance the keypoint information, we introduce a fusion block that increases the ratio of keypoints to visual cues; we refer to it as the fusion of convolution and self-attention. Specifically, we divide the feature map into two streams: the convolution stream and the attention stream. The convolution layer multiplies the keypoints to acquire local keypoint information, while the self-attention layer learns the global visual cues and the constraints between keypoints. Finally, the two streams are combined into a feature map. We divide the fused feature map and input it to the Transformer encoder to learn global dependencies. This multi-stage approach reinforces the keypoint information.

Deformable Convolution
Deformable convolution is well known for feature extraction and offset learning [22,23]. The 2D deformable convolution can be formulated as:
$$y(p_0) = \sum_{p_n \in R} w(p_n)\, x(p_0 + p_n + \Delta p_n) \qquad (1)$$
where $w(p_n)$ is the weight applied to the feature map value $x(p_0 + p_n + \Delta p_n)$ and $p_n + \Delta p_n$ represents the offset location. The regular grid $R$ is augmented with offsets $\{\Delta p_n \mid n = 1, \ldots, N\}$, where $p_n$ enumerates the locations in $R$ and $N = |R|$.
Because the offset location is generally fractional, the sampled value in Formula (1) is computed as:
$$x(p) = \sum_{q} G(q, p)\, x(q) \qquad (2)$$
where $p$ denotes an arbitrary fractional location ($p = p_0 + p_n + \Delta p_n$) and the sum runs over the source pixel positions $q$. Each term is the weight $G(q, p)$ multiplied by the value $x(q)$ of the corresponding source pixel position $q$. By summing all terms, the value $x(p)$ at the target position $p$ is obtained.
The weight $G(q, p)$ determines how much each source pixel position $q$ contributes to the target position $p$. Here $q$ enumerates all integral spatial locations in the feature map $x$, and $G(\cdot, \cdot)$ is the bilinear interpolation kernel, so the position offset is realized by bilinear interpolation:
$$G(q, p) = g(q_x, p_x)\, g(q_y, p_y) \qquad (3)$$
where $g(a, b) = \max(0, 1 - |a - b|)$. By utilizing this deformable convolution operation, the feature map can dynamically adapt to the specific shape of the target, which is beneficial for capturing the keypoints at the human body edge.
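The bilinear sampling step described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names `g` and `bilinear_sample`, the list-of-rows layout, and the zero contribution of out-of-range locations are assumptions made only for illustration.

```python
import math

def g(a, b):
    """1D bilinear kernel: g(a, b) = max(0, 1 - |a - b|)."""
    return max(0.0, 1.0 - abs(a - b))

def bilinear_sample(x, px, py):
    """Value of feature map x (a list of rows) at the fractional location (px, py).

    Evaluates x(p) = sum_q G(q, p) * x(q) with G(q, p) = g(q_x, p_x) * g(q_y, p_y).
    Only the (up to) four integer neighbours of p have non-zero weight, so the
    sum is restricted to them. Locations outside the map contribute zero
    (an assumption of this sketch).
    """
    height, width = len(x), len(x[0])
    x0, y0 = math.floor(px), math.floor(py)
    value = 0.0
    for qy in (y0, y0 + 1):
        for qx in (x0, x0 + 1):
            if 0 <= qx < width and 0 <= qy < height:
                value += g(qx, px) * g(qy, py) * x[qy][qx]
    return value

feature_map = [[0.0, 1.0],
               [2.0, 3.0]]
center = bilinear_sample(feature_map, 0.5, 0.5)  # equal weight on all four pixels
corner = bilinear_sample(feature_map, 1.0, 0.0)  # integer location: the pixel itself
```

At an integer location the kernel reduces to a single pixel lookup, so regular convolution is recovered when all learned offsets are zero.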

Fusion of Convolution and Self-Attention
The essence of pose estimation is effectively aggregating relevant keypoint information while filtering out irrelevant visual information. Treating keypoint information and visual cues equally by using linear layers is not a prudent approach. We propose a fusion module that enhances keypoint information in the presence of visual cues. This module, which is the core of pose estimation, consists of two streams: the keypoint with convolution stream and the attention stream.

Keypoint with Convolution
To overcome the interference of irrelevant visual information and enhance the keypoint information, we propose a fusion block. The fusion block consists of two essential components: keypoint elements and visual cues. We regard the convolution operation as a summation of shifted feature maps and realize it with three 1 × 1 convolutions, that is, convolutions with 1 × 1 kernels, which change the number of channels in a feature map and thereby transform the representation of information. Consider a standard convolution with a kernel $K \in \mathbb{R}^{C_{out} \times C_{in} \times k \times k}$, where $k$ represents the kernel size and $C_{in}$ and $C_{out}$ denote the input and output channel sizes, respectively. The operation is:
$$g_{ij} = \sum_{p,q} K_{p,q}\, f_{i+p-\lfloor k/2 \rfloor,\; j+q-\lfloor k/2 \rfloor} \qquad (4)$$
where $K_{p,q} \in \mathbb{R}^{C_{out} \times C_{in}}$, with the indices $p$ and $q$ ranging from 0 to $k - 1$, represents the kernel weights associated with the kernel position $(p, q)$, and $f_{ij}$ and $g_{ij}$ denote the input and output features at position $(i, j)$. For convenience, we can rewrite Formula (4) as the summation of the feature maps from different kernel positions:
$$g_{ij} = \sum_{p,q} g^{(p,q)}_{ij}, \qquad g^{(p,q)}_{ij} = K_{p,q}\, f_{i+p-\lfloor k/2 \rfloor,\; j+q-\lfloor k/2 \rfloor} \qquad (5)$$
To simplify the formulation further, we introduce the Shift operation $\tilde{f} \triangleq \mathrm{Shift}(f, \Delta x, \Delta y)$, which shifts the feature map $f$ by $\Delta x$ units in the horizontal direction and $\Delta y$ units in the vertical direction:
$$\tilde{f}_{i,j} = f_{i+\Delta x,\; j+\Delta y} \qquad (6)$$
where $\Delta x$ and $\Delta y$ correspond to the horizontal and vertical displacements. Then, the formulation can be rewritten as:
$$g^{(p,q)}_{ij} = \mathrm{Shift}\left(K_{p,q} f_{ij},\; p - \lfloor k/2 \rfloor,\; q - \lfloor k/2 \rfloor\right) \qquad (7)$$
That is, the projection $K_{p,q} f_{ij}$ is computed first and then shifted by $(p - \lfloor k/2 \rfloor,\, q - \lfloor k/2 \rfloor)$ to obtain its contribution to the output $g_{ij}$. In order to enhance the representation and importance of keypoint information in the convolution flow, the keypoint information $X_k$ is introduced, and it is integrated into the convolution flow by multiplying $X_k$ with $K_{p,q} f_{ij}$ element by element using the '*' operation. The keypoint with convolution can be formulated as:
$$g^{(p,q)}_{ij} = \mathrm{Shift}\left(X_k * K_{p,q} f_{ij},\; p - \lfloor k/2 \rfloor,\; q - \lfloor k/2 \rfloor\right) \qquad (8)$$
where the subscript $k$ of $X_k$ ranges over the N keypoints (it is distinct from the kernel size $k$), which add output channels. Specifically, each keypoint information $X_k$ is multiplied by the elements at the corresponding position of $K_{p,q} f_{ij}$. In this way, the elements at the positions corresponding to the keypoint information are amplified or weakened, so the keypoints obtain higher weights throughout the convolution process and their information becomes more prominent than the visual cues, as shown in Figure 2.
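The decomposition of a convolution into projected and shifted feature maps can be checked numerically. The sketch below is a simplified single-channel illustration under stated assumptions, not the authors' code: the names `shift`, `conv_direct`, and `conv_shift_sum`, the zero padding at the borders, and the optional `keypoint_weight` map (standing in for the element-wise keypoint multiplication) are all hypothetical.

```python
def shift(f, dx, dy):
    """Shift(f, dx, dy): out[i][j] = f[i + dx][j + dy], with zeros outside the map."""
    h, w = len(f), len(f[0])
    return [[f[i + dx][j + dy] if 0 <= i + dx < h and 0 <= j + dy < w else 0.0
             for j in range(w)] for i in range(h)]

def conv_direct(f, kernel):
    """g[i][j] = sum_{p,q} K[p][q] * f[i + p - k//2][j + q - k//2], zero padded."""
    h, w, k = len(f), len(f[0]), len(kernel)
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for p in range(k):
                for q in range(k):
                    ii, jj = i + p - k // 2, j + q - k // 2
                    if 0 <= ii < h and 0 <= jj < w:
                        out[i][j] += kernel[p][q] * f[ii][jj]
    return out

def conv_shift_sum(f, kernel, keypoint_weight=None):
    """Same convolution written as a sum of projected, then shifted, feature maps."""
    h, w, k = len(f), len(f[0]), len(kernel)
    out = [[0.0] * w for _ in range(h)]
    for p in range(k):
        for q in range(k):
            # Stage 1: 1x1 projection (a scalar multiply in the single-channel case).
            projected = [[kernel[p][q] * v for v in row] for row in f]
            # Hypothetical keypoint weighting: element-wise product with a weight map.
            if keypoint_weight is not None:
                projected = [[xk * v for xk, v in zip(w_row, row)]
                             for w_row, row in zip(keypoint_weight, projected)]
            # Stage 2: shift by the kernel offset and accumulate.
            shifted = shift(projected, p - k // 2, q - k // 2)
            for i in range(h):
                for j in range(w):
                    out[i][j] += shifted[i][j]
    return out

f = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
K = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
direct = conv_direct(f, K)
decomposed = conv_shift_sum(f, K)
doubled = conv_shift_sum(f, K, [[2.0] * 3 for _ in range(3)])
```

With no weight map the two functions agree exactly, and a uniform weight of 2 simply doubles the response; a non-uniform map would amplify the positions it marks relative to the rest.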

Figure 2. s(x, y) corresponds to the shift operation defined in Formula (7). ⊗ denotes the element-wise multiplication operation.

Fusion of Self-Attention Mechanism
The input of the self-attention stream is the same as that of the keypoint with convolution stream, produced by the three 1 × 1 convolutions. As shown in Figure 3, given three inputs of the same dimension, Query Q, Key K, and Value V, the output is computed as a weighted sum of the values:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \qquad (9)$$
where the argument of the activation function Softmax(·) reflects the similarity of Q and K. Here $d_k$ is the dimension of the tokens; $QK^{T}$ is scaled by $\sqrt{d_k}$ to avoid the small gradients that would otherwise affect training. The self-attention mechanism can reflect the contribution of different image positions through gradients [24][25][26].
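As a small worked illustration of the weighted sum described above, the following plain-Python sketch computes scaled dot-product attention for a handful of tokens. It is a generic single-head version written for clarity; the function names and the list-of-lists layout are assumptions, and the learned 1 × 1 projections that produce Q, K, and V are omitted.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Softmax(Q K^T / sqrt(d_k)) V for Q, K, V given as lists of token vectors."""
    d_k = len(K[0])
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
              for q in Q]
    weights = [softmax(row) for row in scores]
    out = [[sum(w * v[c] for w, v in zip(w_row, V)) for c in range(len(V[0]))]
           for w_row in weights]
    return out, weights

# Two identical keys: every query attends to both values equally.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, weights = attention(Q, K, V)
```

Because each row of `weights` sums to one, every output token is a convex combination of the value vectors, which is what lets a token aggregate evidence from all image positions.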
Figure 3. Illustration of the proposed fusion module. Given the feature map f_in, Q, K, and V are embedded by three convolutions with kernel size 1 × 1. ⊕ denotes element-wise addition. ⊗ denotes matrix multiplication.
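In our study, we employ the ACmix [27] approach, where the outputs of the two paths are added and fused to obtain the final result. Furthermore, we use a learned scalar to regulate the intensity of the convolution path, with the aim of enhancing keypoint information:
$$F_{out} = F_{att} + \alpha F_{conv} \qquad (10)$$
where $F_{att}$ and $F_{conv}$ denote the outputs of the attention path and the convolution path, respectively, and $\alpha$ is the learned scalar.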

Transformer Module
To accurately predict the locations of human keypoints, we propose a joint approach that integrates visual information with keypoint information, allowing for mutual interaction to improve the performance of human target detection, even under low resolution. We use the Transformer model, known for its ability to capture dependencies between elements, to facilitate the robust detection and tracking of keypoints. Specifically, we segment the feature map into several patches, which are then encoded using the Transformer model. Finally, a multi-layer perceptron (MLP) is employed to predict the keypoints. This joint approach offers a promising solution for enhancing the effectiveness of keypoint detection in human targets.

Construction of Token
After constructing feature maps by combining convolution and self-attention layers, the feature maps are split into visual and keypoint tokens, as shown in Figure 4. The visual token, denoted by x, captures constraints among the visual tokens, while the keypoint token is designed to learn the constraints between keypoints, which helps to address low-resolution and occluded keypoints. These tokens are concatenated and fed into the Transformer encoder to learn the dependencies between tokens.
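The token construction can be sketched as follows. This is a minimal plain-Python illustration under simplifying assumptions: a single-channel feature map, non-overlapping patches, and keypoint tokens that already have the patch dimension. The learned linear projection and the sine position encoding are deliberately left out, and the names `split_into_patches` and `build_tokens` are hypothetical.

```python
def split_into_patches(feature_map, patch_h, patch_w):
    """Split an H x W map (list of rows) into non-overlapping patches,
    each flattened row by row into a 1D vector."""
    h, w = len(feature_map), len(feature_map[0])
    if h % patch_h or w % patch_w:
        raise ValueError("feature map size must be divisible by the patch size")
    patches = []
    for top in range(0, h, patch_h):
        for left in range(0, w, patch_w):
            patches.append([feature_map[top + i][left + j]
                            for i in range(patch_h) for j in range(patch_w)])
    return patches

def build_tokens(feature_map, keypoint_tokens, patch_h, patch_w):
    """Concatenate the visual tokens (flattened patches) with the keypoint tokens."""
    visual_tokens = split_into_patches(feature_map, patch_h, patch_w)
    dim = patch_h * patch_w
    if any(len(t) != dim for t in keypoint_tokens):
        raise ValueError("keypoint tokens must match the visual token dimension")
    return visual_tokens + keypoint_tokens

# A 4 x 4 map with values 0..15, 2 x 2 patches, and two keypoint tokens.
fmap = [[float(4 * r + c) for c in range(4)] for r in range(4)]
kp_tokens = [[0.0] * 4, [0.0] * 4]
tokens = build_tokens(fmap, kp_tokens, 2, 2)
```

The resulting sequence has one token per patch followed by one token per keypoint, so self-attention in the encoder can relate every keypoint token to every image patch and to the other keypoints.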

Transformer Encoder
The input of the Transformer is a sequence of 1D tokens. The Transformer consists of N Transformer modules, each of which contains a multi-head self-attention module and a multi-class prediction module. Layer Norm [28] is applied to each module. The core formula of the Transformer is as follows:
$$\mathrm{SA}(T^{l-1}) = \mathrm{Softmax}\left(\frac{T^{l-1} W_Q \left(T^{l-1} W_K\right)^{T}}{\sqrt{d_h}}\right) \left(T^{l-1} W_V\right)$$
where $W_K$, $W_V$, and $W_Q \in \mathbb{R}^{d \times d}$ are the learnable parameters of the three linear projection layers. SA represents the self-attention operation. $T^{l-1}$ represents the output of the (l − 1)-th layer, and $T^{l}$ represents the output of the l-th layer. $d_h$ represents the dimension of the tokens, which is also equal to d. It should be noted that the location of keypoints is typically predicted using heatmaps [29][30][31].

Figure 4. Construction of token. The feature map x is divided into N patches, which are then transformed into 1D vectors through the linear projection of the flattened patches layer. Each 1D vector is used as a visual token, followed by position encoding that uses a sine strategy. The result is then combined with the keypoints through concatenation.

Dataset
We evaluated DatPose on the COCO and MPII datasets [32]. The publicly available COCO dataset consists of more than 330k images, 1.5 million object instances, 80 object categories, and 91 stuff categories. It contains more than 250,000 person instances annotated with keypoints and is commonly used as a benchmark for human pose estimation. MPII is a large-scale multi-person pose estimation dataset [21], which contains about 25,000 image samples. These images show people in different scenes and provide 16 keypoints per person, including keypoints on the head, torso, and limbs.

Evaluation Metrics
Following the metrics in [9], the standard average precision and recall rate are calculated to evaluate performance. On the COCO dataset, the performance of keypoint detection models is evaluated using metrics such as average precision (AP) and average recall (AR). These metrics are calculated based on the object keypoint similarity (OKS), which measures the similarity between predicted and ground truth keypoint locations:
$$\mathrm{OKS} = \frac{\sum_i \exp\left(-d_i^2 / (2 s^2 k_i^2)\right)\, \delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$
where $d_i$ represents the Euclidean distance between the i-th predicted keypoint and its corresponding ground truth, $v_i$ represents the visibility flag of the ground truth keypoint, $s$ denotes the object scale, and $k_i$ is a per-keypoint constant that governs the falloff rate. The evaluation criterion on the MPII dataset is the head-normalized probability of correct keypoint (PCKh), which is expressed as:
$$\mathrm{PCKh}@\alpha = \frac{1}{X} \sum_{i=1}^{X} f(p_i)$$
where PCKh@α is the proportion of keypoints correctly predicted when the head threshold is α, X is the number of keypoints, and $f(p_i)$ is the similarity of the i-th keypoint.
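Both metrics are straightforward to compute once the keypoints are available as coordinate pairs. The sketch below is an illustration rather than the official evaluation code: the function names are hypothetical, and the per-keypoint term $f(p_i)$ in PCKh is assumed here to be an indicator that the prediction lies within α times the head size of the ground truth, which is the usual convention but is not spelled out in the text.

```python
import math

def oks(pred, gt, visibility, scale, k):
    """Object keypoint similarity for one person.

    pred, gt: lists of (x, y); visibility: list of flags v_i;
    scale: object scale s; k: list of per-keypoint constants k_i.
    Keypoints with v_i == 0 are ignored.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, k_i in zip(pred, gt, visibility, k):
        if v > 0:
            d2 = (px - gx) ** 2 + (py - gy) ** 2
            total += math.exp(-d2 / (2 * scale ** 2 * k_i ** 2))
            count += 1
    return total / count if count else 0.0

def pckh(pred, gt, head_size, alpha=0.5):
    """Fraction of keypoints whose error is at most alpha * head_size
    (assumed indicator form of f(p_i))."""
    correct = sum(1 for (px, py), (gx, gy) in zip(pred, gt)
                  if math.hypot(px - gx, py - gy) <= alpha * head_size)
    return correct / len(gt)

gt = [(0.0, 0.0), (10.0, 0.0)]
perfect = oks(gt, gt, [2, 2], scale=1.0, k=[1.0, 1.0])
one_off = oks([(1.0, 0.0)], [(0.0, 0.0)], [1], scale=1.0, k=[1.0])
ignored = oks([(1.0, 0.0), (99.0, 99.0)], gt, [1, 0], scale=1.0, k=[1.0, 1.0])
half = pckh([(1.0, 0.0), (20.0, 0.0)], gt, head_size=10.0, alpha=0.5)
```

A perfect prediction gives an OKS of 1, and the score decays smoothly with distance at a rate set by the object scale and the per-keypoint constant, whereas PCKh applies a hard threshold relative to the head size.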

Implementation Details
The experimental operating system is Ubuntu 18.04, the programming environment is PyTorch 1.10.1 + cu113 with Python 3.8.12, and the GPU is an NVIDIA Tesla T4. We extend the height or width of the human detection box to a fixed aspect ratio of 4:3 and subsequently crop the box from the image, which is resized to a fixed dimension of either 256 × 192 or 384 × 288. The data augmentation techniques comprise random rotation (within the range of −45° to 45°), random scaling (between 0.65 and 1.35), and flipping. In this work, we follow the two-stage top-down human pose estimation paradigm, which has been utilized in several prior works such as [5,17,33,34]. The approach involves initially detecting the individual person instance using a person detector and subsequently predicting the keypoints. To accomplish this, we adopt the popular person detectors furnished by SimpleBaseline [5] for both the validation set and the test-dev set. The input image size is set to 256 × 192. The mean square error loss is used for learning. The Adam optimizer [35] was used to train our model for a total of 300 epochs. Throughout the training process, a small batch size of 16 and a dropout rate of 0.5 were employed. The initial learning rate is 1 × 10^−3. The predicted heatmaps carry two-dimensional spatial information, and we use the two-dimensional sine strategy to embed the position. Figure 5 shows visual results obtained by the proposed DatPose model on MS COCO across diverse scenarios. Our model makes precise predictions in various challenging scenarios, such as variations in viewpoint and appearance, as well as instances of occlusion.
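The box adjustment that precedes cropping can be sketched in a few lines. This is an illustrative helper, not the authors' preprocessing code: the name `fit_box_to_aspect`, the (x, y, width, height) box convention with a top-left origin, and growing the box about its center are assumptions. The 4:3 ratio is interpreted as height to width, matching the 256 × 192 (height × width) input.

```python
def fit_box_to_aspect(x, y, w, h, aspect_w_over_h=192 / 256):
    """Grow a box (top-left x, y, width w, height h) about its center until
    width / height equals the target ratio. Only one side is enlarged."""
    cx, cy = x + w / 2, y + h / 2
    if w > aspect_w_over_h * h:
        h = w / aspect_w_over_h      # box is too wide: extend the height
    else:
        w = h * aspect_w_over_h      # box is too tall: extend the width
    return cx - w / 2, cy - h / 2, w, h

square = fit_box_to_aspect(0.0, 0.0, 100.0, 100.0)
tall = fit_box_to_aspect(10.0, 20.0, 30.0, 200.0)
```

Enlarging rather than shrinking keeps the whole detected person inside the crop, so the subsequent resize to the network input does not distort the body proportions.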

Comparison with State-of-the-Art Methods
Table 1 shows the comparison of DatPose with state-of-the-art models, including the CNN-based methods [5,17]. Table 2 compares the performance of our algorithm with other methods on the COCO test-dev set. Compared with HRNet, the AP is improved by 0.5%. Moreover, the number of parameters (Params) and the GFLOPs of our method are significantly lower than those of HRNet, which keeps the model lightweight. Furthermore, DatPose achieves the same AP as TransPose [4] while using only 32% of its GFLOPs. Compared with TokenPose [5], the AP is slightly lower, but the model has fewer parameters and lower complexity. The reason is that the fusion module efficiently fuses high-level semantic information with spatial location detail, and thus requires less model capacity. Overall, the proposed method has fewer parameters and lower complexity than large networks. Compared with lightweight networks, it improves the accuracy of human pose estimation at the cost of a small number of additional parameters, and it is competitive with the advanced models.
Table 3 presents the experimental results of our algorithm compared to other state-of-the-art methods for human pose estimation on the MPII validation set. The input image size for all methods is set to 256 × 256 pixels. Our algorithm improves PCKh@0.5 by 2.8% and 1.8% over the traditional convolutional networks SHN and SimpleBase-Res50, respectively. Compared with the Transformer-based baseline TokenPose, our algorithm achieves a modest improvement of 0.1%. Figure 6 visualizes DatPose predictions on the COCO dataset, where each column depicts the 17 keypoints and each row displays the prediction of the keypoints from varied viewpoints. This representation gives insight into the accuracy of the keypoint predictions. TokenPose is the model most relevant to DatPose, as it strengthens the keypoint information to jointly assess all the patches in the self-attention. However, it introduces the keypoint features and image clues equally to all the Transformer blocks without giving greater weight to the keypoint information. By collecting the keypoint information of human body edges via the Fusion of Convolution and Self-Attention Block, our model improves on this baseline.

Ablation Study
Table 4 shows ablation results that verify the contribution of each component in our model. Model '1' is a Transformer-based human pose estimation method built on the standard residual network ResNet. Models '2' and '3' extend model '1' by adding the deformable convolution module and the fusion module, respectively, so that the resulting AP and AR can be compared.

Conclusions
In this paper, we propose a framework for human pose estimation named DatPose,

Figure 1. An overview of our model. The model contains three modules: the deformable convolution block aims to capture keypoints along the human body edge, and the fusion of convolution and self-attention block supports the weight distribution between keypoint information and visual cues. Furthermore, the Transformer encoder conducts token construction and constraint relationship learning.

Figure 2. An illustration of the proposed shift operation. The feature map is projected with three 1 × 1 convolutions and the intermediate features are multiplied by the keypoints. s(x, y) corresponds to the shift operation defined in Formula (7). ⊗ denotes the element-wise multiplication operation.

Figure 4. Construction of tokens. The feature map x is divided into N patches, which are flattened and transformed into 1D vectors through a linear projection layer. Each 1D vector is used as a visual token and then receives a position encoding based on a sine strategy. The result is concatenated with the keypoints.

Electronics 2023, 12, x FOR PEER REVIEW

Figure 5. Qualitative results on example images from the COCO dataset containing viewpoint and appearance changes and occlusion.

Among the models compared in Table 1, the CNN-based methods exploit multi-scale spatial features. The CNN-Transformer-based methods [8,9,36] capture the constraints between spatial locations. The pure Transformer models learn the relationships between features directly from the original image [9,10]. Our model performs strongly across the metrics, reaching an AP of 74.8% and an AR of 80.3%. Although the ViTPose-B model improves the AP by 1% compared with DatPose, it is worth noting that DatPose has fewer parameters and lower model complexity.

Figure 6. Visualization of DatPose on the COCO dataset. Each column represents the visualization of 17 keypoints, and each row represents the prediction of keypoints from different viewpoints.

Figure 7. The visualization of attention maps based on the dependency relationship between keypoints and visual cues.

Table 1. State-of-the-art comparison on the COCO validation set.

Table 2. State-of-the-art comparison on the COCO test-dev set.

Table 3. State-of-the-art comparison on the MPII dataset.

Table 5. Ablation study of the fusion module on the COCO dataset.
