Article

Enhancing Human Pose Transfer with Convolutional Block Attention Module and Facial Loss Optimization

1 Department of Computer Science and Information Engineering, National Central University, Taoyuan City 320317, Taiwan
2 Department of Information and Computer Engineering, Chung Yuan Christian University, Taoyuan City 320314, Taiwan
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(9), 1855; https://doi.org/10.3390/electronics14091855
Submission received: 31 March 2025 / Revised: 28 April 2025 / Accepted: 30 April 2025 / Published: 1 May 2025
(This article belongs to the Special Issue Machine Learning Techniques for Image Processing)

Abstract

Pose transfer methods often struggle to simultaneously preserve fine-grained clothing textures and facial details, especially under large pose variations. To address these limitations, we propose a model based on the Multi-scale Attention Guided Pose Transfer (MAGPT) model, modifying its residual blocks by integrating the Convolutional Block Attention Module and replacing the ReLU activation function with Mish to capture more features related to clothing and skin color. Additionally, because the generated images exhibited facial features differing from the original image, we propose two facial feature loss functions that help the model learn more precise image features. According to the experimental results, the proposed method demonstrates superior performance compared to MAGPT on the DeepFashion dataset, achieving a 3.41% reduction in FID, a 0.65% improvement in SSIM, a 2% decrease in LPIPS (VGG), and a 2.7% decrease in LPIPS (SqueezeNet). Ultimately, the proposed system architecture requires only one reference image to let users transform into different action videos.

1. Introduction

Pose transfer research aims to synthesize a realistic image of a person in a new pose while preserving their identity and clothing details. It is motivated by several key factors, including advancements in computer vision and practical applications in various industries, such as fashion design and marketing, animation and game development, medicine and rehabilitation, and data augmentation for machine learning.
Since the introduction of Generative Adversarial Networks (GANs) [1], various types of GANs [2,3,4,5,6] have made significant progress in the field of realistic image synthesis over the past few years. The initial idea of GANs was to use a generator and a discriminator to train the generative model through adversarial competition. Later, to generate images for specific domains, Conditional Generative Adversarial Networks (CGAN) [7,8,9,10,11,12] were developed. CGANs can be applied to various fields, including image translation [13,14,15], image inpainting [16], image super-resolution [17,18], and pose transfer tasks [7,19]. In the field of human image generation, generative models often incorporate the concept of CGANs, using images of people with target poses as input to generate realistic human images that match the target pose, as shown in Figure 1.
The research proposed by Lassner et al. [3] combines a variational autoencoder [20] and GANs to generate images of people wearing different clothes based on 3D human models. The work proposed by Zhao et al. [19] uses a method that progresses from low frequency to high frequency to generate multi-view images of people or clothes from a single viewpoint observation. This method starts with a simplified model and gradually refines the details to achieve high-precision image reconstruction. It is effective for both human and clothing images, enabling detailed multi-view image reconstruction from a single image. Balakrishnan et al. [7] introduce a pose transfer scheme that decomposes the task of generating human images into separate foreground and background generation using GANs. This method first segments and generates the foreground and background independently and finally merges them to achieve high-quality image reconstruction. Ma et al. [9] propose a two-stage approach for pose transfer, generating human images that match the target pose from low resolution to high resolution. The first stage extracts rough features, preserving the texture details of the clothes. The second stage processes the results of the first stage, enhancing the texture features obtained in the first stage. In their subsequent work, Ma et al. [10] improve their previous work by decoupling and encoding the foreground, background, and pose of the input image into embedded features, then decoding these features back into an image. Siarohin et al. [31] use a U-Net-based [21] generator in their research on deformable GANs, with deformable skip connections to alleviate pixel-to-pixel misalignment issues caused by pose differences. Esser et al. [22] utilize a variational autoencoder combined with a conditional U-Net to model inherent shapes and appearances. Zhu et al. [23] introduce multiple connected pose attention transfer blocks into the generator, using attention mechanisms to gradually guide the deformable pose transfer process. In this method, pose transfer is handled in small local regions at each intermediate stage, helping to overcome various difficulties posed by the complex structures on the global manifold. Similarly, the work in [24] addresses the pose transfer task using an attention mechanism, encoding the source image and human pose separately at different resolutions and then combining them with an attention mechanism before decoding to obtain the pose-transferred image.
Recently, diffusion-based methods have gained attention in the field of pose transfer for their ability to generate high-fidelity images. For example, Lu et al. [24] propose a coarse-to-fine latent diffusion framework that integrates multi-scale attention to preserve appearance and pose consistency. Similarly, Shen et al. [25] introduce a progressive conditional diffusion model to gradually refine structure and detail throughout the generation process. Although these models achieve impressive results, they typically suffer from high computational and inference costs, making them less practical in resource-constrained settings.
Based on the above analysis, existing pose transfer methods generally face the following challenges:
(1) The inability to simultaneously preserve both clothing textures and facial details.
(2) Structural distortions or pixel misalignment under large pose variations.
(3) Difficulty in maintaining identity consistency of the subject.
To address these issues, this paper improves the generative model architecture of [26]. We improve image detail generation by incorporating dual attention mechanisms within the residual blocks and leveraging the Mish activation function to enhance feature representation. Furthermore, we introduce two specialized facial loss functions to guide the network in accurately capturing the facial features of the original subjects.

2. Proposed Method

2.1. Network Architecture

We use Multi-scale Attention Guided Pose Transfer (MAGPT) [26] as our base model. In the experimental results of the MAGPT model, some generated images exhibit significant differences in clothing details compared to the original images, while others produce facial features that do not resemble the actual faces. This paper addresses these two issues to improve the results.
To address the significant differences in clothing details, we introduce the Convolutional Block Attention Module (CBAM) [27]. By incorporating dual attention mechanisms, we enhance the model’s ability to recognize and generate detailed clothing features. CBAM allows the model to focus more on clothing details by strengthening spatial and channel attention, helping the model to capture textures, colors, and shapes more accurately. Additionally, we propose a second method for extracting detailed textures from images. We replace the ReLU activation function with Mish [28], which enables the model to capture more features during training through nonlinear activation. The complete model architecture is shown in Figure 2.
In the proposed model, $I_A^j$ represents the source image, and $P_A^j$ is the keypoint map of the source image. The index $j$ indicates the specific entry in the dataset. The pose $P_B^j$ is the keypoint map of the desired pose for the person in the image, and $\hat{I}_B^j$ is the image generated by the model. Both $P_A^j$ and $P_B^j$ are produced using the pre-trained Human Pose Estimator (HPE) based on OpenPose [29] and are converted into heatmaps represented as $H_A^j$ and $H_B^j$, respectively. In addition, $\sigma$ represents the sigmoid function, and $\odot$ represents element-wise multiplication.
The encoder consists of two parallel down-sampling branches. One branch specifically processes the conditional person image $I_A^j$, and the other handles the combined pose heatmaps $H_A^j$ and $H_B^j$. Initially, both branches convert the input data into a space with $N_f$ feature maps of dimensions $h \times w$ through a two-dimensional 3 × 3 convolutional layer, followed by batch normalization and the Mish activation function, to obtain the mapped input features. These feature maps are then passed to $N$ consecutive Down2xBlocks, where each block halves the input dimensions while doubling the number of feature maps.
The decoder consists of an up-sampling branch that generates the target pose image $\hat{I}_B^j$. Starting from the bottleneck feature space, it processes the feature maps through $N$ consecutive 2× up-sampling blocks; each block doubles the input size and halves the number of feature maps. After these up-sampling steps, we obtain a feature space of dimensions $h \times w \times N_f$. These feature maps then pass through four residual blocks enhanced with CBAM, followed by a pointwise two-dimensional convolution that projects them into a three-channel output space. Finally, a tanh activation produces the generated image $\hat{I}_B^j$; it yields a more uniform distribution across each output channel, which helps reduce vanishing gradients during training and thus improves training stability and convergence speed.
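To make the encoder/decoder description above concrete, the following PyTorch sketch shows one way the 2× down- and up-sampling blocks could be realized. The class name Down2xBlock comes from the text, while Up2xBlock, the transposed-convolution upsampling, and the exact kernel sizes are our assumptions; only the halve-resolution/double-channels behavior and the Conv–BatchNorm–Mish ordering are taken from the description.

```python
import torch
import torch.nn as nn

class Down2xBlock(nn.Module):
    """Halves the spatial resolution and doubles the channel count
    (a sketch of the encoder block described in Section 2.1)."""
    def __init__(self, in_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch * 2, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(in_ch * 2),
            nn.Mish(),
        )

    def forward(self, x):
        return self.block(x)

class Up2xBlock(nn.Module):
    """Doubles the spatial resolution and halves the channel count."""
    def __init__(self, in_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(in_ch // 2),
            nn.Mish(),
        )

    def forward(self, x):
        return self.block(x)

# Example: N_f = 64 feature maps at 256x256, downsampled once and restored.
x = torch.randn(1, 64, 256, 256)
print(Up2xBlock(128)(Down2xBlock(64)(x)).shape)  # torch.Size([1, 64, 256, 256])
```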
The attention mechanism establishes attention connections between layers of different resolutions in the encoder and decoder. In each layer, an attention mask $M_k$, where $k$ denotes the corresponding resolution level, is calculated using the sigmoid function to determine how much of the original image's features should be retained in each pixel of the final generated image. This attention mask updates the image feature maps $I_k$ by element-wise multiplication. The updated feature maps, enriched by $M_k$, directly influence the final output image $\hat{I}_B^j$ by contributing weighted feature information at various decoding stages. These feature maps are then fed into the decoder's up-sampling module, where they are further processed to reconstruct the image. As these operations are repeated across all resolution levels, starting from the lowest level and connecting upwards progressively, the model builds a series of attention connections. This allows the model to focus on important features more effectively during the decoding process and reconstruct the target pose image more accurately.
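The attention connection described above can be sketched as follows. The 1 × 1 convolution used to produce the mask, and the choice of which branch's features drive it, are assumptions on our part; the text only specifies that a sigmoid mask $M_k$ gates the image features by element-wise multiplication.

```python
import torch
import torch.nn as nn

class AttentionSkip(nn.Module):
    """Sketch of one attention connection at resolution level k: a sigmoid
    mask M_k decides how much of the image-branch features is kept before
    they are passed on to the decoder's up-sampling module."""
    def __init__(self, channels):
        super().__init__()
        self.mask_conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, image_feat, decoder_feat):
        m_k = torch.sigmoid(self.mask_conv(decoder_feat))  # attention mask M_k
        return image_feat * m_k                            # element-wise update of I_k
```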
To address the issue of generated facial features not resembling the actual faces, we introduce the head region loss to help the generative model learn facial features. The head region loss utilizes the DeepFashion segmentation labels provided by ADGAN [30]. However, since this method is only applicable to the DeepFashion dataset, we propose another approach, the face-focused loss, which uses OpenPose [29] to segment the face of the generated image. This demonstrates that extracting facial features can enhance the quality of the generated faces.
We adopt the same discriminator model as MAGPT, which is derived from the PatchGAN discriminator [13]. In this architecture, the discriminator evaluates the transformation of images, not just the generated images. By concatenating the conditional image $I_A^j$ with either the target image $I_B^j$ or the generated image $\hat{I}_B^j$ along the channel dimension, and labeling $(I_A^j, I_B^j)$ as real and $(I_A^j, \hat{I}_B^j)$ as fake, the discriminator can make judgments based on the differences between the conditional and target or generated images. This design allows the discriminator not only to identify the authenticity of individual images but also to evaluate whether the transformation from the conditional image to the target pose is reasonable. The architecture of the discriminator model is shown in Figure 3. It takes as input the channel-wise combination of the conditional image $I_A^j$ and either the target image $I_B^j$ or the generated image $\hat{I}_B^j$, and processes it through a series of convolutional layers. Following the PatchGAN design, the discriminator operates with a 70 × 70 receptive field, which enables localized evaluation. This structure allows the model to effectively determine whether an image is real or fake and also to assess whether the transformation from the conditional image to the target pose is plausible.
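A minimal sketch of a conditional PatchGAN-style discriminator matching the description above: the conditional image is concatenated channel-wise with the target or generated image, and the output is a grid of patch-level real/fake logits. The layer widths follow the common 70 × 70 PatchGAN configuration, which matches the stated receptive field, but the exact MAGPT discriminator may differ.

```python
import torch
import torch.nn as nn

class ConditionalPatchDiscriminator(nn.Module):
    """Conditional PatchGAN sketch: each spatial output corresponds to a
    local patch judgement on the (conditional, target/generated) pair."""
    def __init__(self, in_ch=6, base_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base_ch, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(base_ch, base_ch * 2, 4, stride=2, padding=1),
            nn.BatchNorm2d(base_ch * 2), nn.LeakyReLU(0.2),
            nn.Conv2d(base_ch * 2, base_ch * 4, 4, stride=2, padding=1),
            nn.BatchNorm2d(base_ch * 4), nn.LeakyReLU(0.2),
            nn.Conv2d(base_ch * 4, base_ch * 8, 4, stride=1, padding=1),
            nn.BatchNorm2d(base_ch * 8), nn.LeakyReLU(0.2),
            nn.Conv2d(base_ch * 8, 1, 4, stride=1, padding=1),  # one logit per patch
        )

    def forward(self, cond_img, other_img):
        # (I_A, I_B) pairs are labelled real, (I_A, generated I_B) pairs fake.
        return self.net(torch.cat([cond_img, other_img], dim=1))
```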

2.2. Residual Block with CBAM

In pose transfer tasks, we aim to extract more features from the source images. To achieve this, the MAGPT architecture already employs residual blocks to enhance the model’s feature extraction capabilities and maintain feature transmission in deep networks, thereby avoiding gradient vanishing issues. The introduction of residual blocks is crucial for preserving and enhancing the performance of deep learning models, especially when dealing with complex images. However, despite the use of residual blocks in MAGPT to maintain overall network stability, defects still occur in some clothing or human features. To more effectively capture the detailed features from the source images, we incorporated the Convolutional Block Attention Module (CBAM) into the residual blocks of MAGPT, as shown in Figure 4. The structure consists of a standard residual block enhanced with two attention modules: channel attention and spatial attention. These modules are applied sequentially to refine the intermediate features, allowing the network to focus on important textures and spatial regions. This design helps improve the generation of fine details, especially in clothing and human contours.
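A minimal sketch of the CBAM-enhanced residual block, assuming the standard CBAM ordering (channel attention followed by spatial attention) applied to the residual branch; the reduction ratio, kernel sizes, and normalization choices are assumptions and may differ from the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention as in CBAM: pooled descriptors pass through a
    shared MLP and gate each channel."""
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(),
            nn.Conv2d(ch // reduction, ch, 1))

    def forward(self, x):
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):
    """Spatial attention as in CBAM: channel-wise avg/max maps are fused
    by a convolution into a per-pixel gate."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)
        mx, _ = torch.max(x, dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class ResidualBlockCBAM(nn.Module):
    """Residual block whose output is refined sequentially by channel and
    spatial attention, as described in Section 2.2 (a sketch only)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.Mish(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
        self.ca = ChannelAttention(ch)
        self.sa = SpatialAttention()

    def forward(self, x):
        out = self.body(x)
        out = out * self.ca(out)  # refine channels first...
        out = out * self.sa(out)  # ...then spatial locations
        return x + out            # residual connection
```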

2.3. Mish Activation Function

During the training process of pose transfer, MAGPT initially used the ReLU activation function. ReLU is widely used in various deep learning architectures due to its ability to effectively introduce non-linearity and maintain consistent gradients when processing positive inputs. However, in handling complex pose transfer tasks, ReLU may cause the loss of important human features because it outputs all negative values as zero. Based on this observation, we chose to introduce the Mish activation function to optimize the MAGPT architecture. The Mish activation curve is shown in Figure 5.
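For reference, Mish is defined as x · tanh(softplus(x)); unlike ReLU, it keeps a small, smooth response for negative inputs, which is the property motivating its use here. A one-line sketch (PyTorch ≥ 1.9 also ships this as torch.nn.Mish):

```python
import torch
import torch.nn.functional as F

def mish(x: torch.Tensor) -> torch.Tensor:
    """Mish activation: x * tanh(softplus(x))."""
    return x * torch.tanh(F.softplus(x))

print(mish(torch.tensor([-2.0, 0.0, 2.0])))  # negative inputs are not zeroed out
```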

2.4. Head Region Loss

To address the issue of facial features not matching the source images, we propose a loss function specifically for the head region, called Head Region Loss. We extract feature masks representing the head region from the segmentation dataset designed specifically for the DeepFashion dataset from ADGAN. This mask corresponds to labels 2 and 4 in the dataset, which represent the hair and face regions, respectively. By performing an element-wise multiplication of the target image $I_B^j$ and the generated image $\hat{I}_B^j$ with the mask, we can isolate the head region's features.
In our implementation, we first resize the head region features to 64 × 64 using bilinear interpolation to facilitate computation within the model. We then combine this with the perceptual loss: the cropped images are passed through the intermediate layers of VGG19 to obtain the feature differences between the target image and the generated image. The head region loss $L_{HG}$ is defined as:

$$L_{HG} = \frac{1}{h_H w_H c_H} \sum_{x=1}^{w_H} \sum_{y=1}^{h_H} \sum_{c=1}^{c_H} \left\| \phi_\rho \big( C_{head,x,y,c}(\hat{I}_B^j) \big) - \phi_\rho \big( C_{head,x,y,c}(I_B^j) \big) \right\|_1 \qquad (1)$$

where $C_{head}$ represents the head region features extracted using the mask. In this paper, we use the 4th and 9th layers of VGG19 to calculate the perceptual loss, denoted as $\phi_\rho$ in Equation (1). The terms $h_H$, $w_H$, and $c_H$ denote the height, width, and number of channels of the head region features, respectively.
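The following sketch illustrates one way Equation (1) could be implemented in PyTorch, assuming a per-pixel segmentation map whose labels 2 and 4 mark hair and face (as in the ADGAN split of DeepFashion). The interpretation of the "4th and 9th layers" as indices into torchvision's vgg19().features, and the mean-normalized L1 reduction, are our assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Frozen VGG19 feature extractor for the perceptual comparison.
_vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def vgg_feats(x, layers=(4, 9)):
    """Collect activations after the chosen (1-based) layers of VGG19."""
    feats, out = [], x
    for i, layer in enumerate(_vgg):
        out = layer(out)
        if i + 1 in layers:
            feats.append(out)
    return feats

def head_region_loss(generated, target, seg):
    """generated/target: (B,3,H,W) tensors; seg: (B,1,H,W) integer label map."""
    mask = ((seg == 2) | (seg == 4)).float()            # hair + face regions
    gen_head = F.interpolate(generated * mask, size=(64, 64),
                             mode="bilinear", align_corners=False)
    tgt_head = F.interpolate(target * mask, size=(64, 64),
                             mode="bilinear", align_corners=False)
    loss = 0.0
    for fg, ft in zip(vgg_feats(gen_head), vgg_feats(tgt_head)):
        loss = loss + F.l1_loss(fg, ft)  # mean L1 over h_H * w_H * c_H elements
    return loss
```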

2.5. Face-Focused Loss

Since the method in Section 2.4 can only be applied to the DeepFashion dataset, we propose an adaptable method that can be used with other datasets. Our alternative approach utilizes OpenPose. We use the nose, right eye, and left eye keypoints, denoted as $p_0$, $p_1$, and $p_2$, respectively. Then, we reflect the positions of the right and left eyes through the nose as the center point, obtaining $p_1'$ and $p_2'$. These points form a square that segments the face portion. The face-focused loss $L_{FG}$ can be expressed as:

$$L_{FG} = \frac{1}{h_F w_F c_F} \sum_{x=1}^{w_F} \sum_{y=1}^{h_F} \sum_{c=1}^{c_F} \left\| \phi_\rho \big( C_{face,x,y,c}(\hat{I}_B^j) \big) - \phi_\rho \big( C_{face,x,y,c}(I_B^j) \big) \right\|_1 \qquad (2)$$

where $C_{face}$ represents the facial features extracted using the keypoints. In Equation (2), the terms $h_F$, $w_F$, and $c_F$ denote the height, width, and number of channels of the face region features, respectively.
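A sketch of how the face crop and loss in Equation (2) could be assembled from OpenPose keypoints. Reflecting each eye through the nose and taking the bounding box of the four points is our reading of the text, and the 64 × 64 resize is carried over from the Section 2.4 sketch; the `perceptual` callable stands for the VGG19-based comparison used there.

```python
import torch
import torch.nn.functional as F

def face_crop_box(nose, r_eye, l_eye):
    """Builds the face region described in Section 2.5: the two eye keypoints
    are reflected through the nose and the four points span the crop.
    Keypoints are (x, y) tensors in pixel coordinates."""
    r_ref = 2 * nose - r_eye          # right eye mirrored about the nose
    l_ref = 2 * nose - l_eye          # left eye mirrored about the nose
    pts = torch.stack([r_eye, l_eye, r_ref, l_ref])
    x0, y0 = pts.min(dim=0).values
    x1, y1 = pts.max(dim=0).values
    return int(x0), int(y0), int(x1), int(y1)

def face_focused_loss(generated, target, nose, r_eye, l_eye, perceptual):
    """`perceptual` compares two crops with VGG19 features, e.g. the
    vgg_feats-based helper sketched for the head region loss."""
    x0, y0, x1, y1 = face_crop_box(nose, r_eye, l_eye)
    gen_face = F.interpolate(generated[:, :, y0:y1, x0:x1], size=(64, 64),
                             mode="bilinear", align_corners=False)
    tgt_face = F.interpolate(target[:, :, y0:y1, x0:x1], size=(64, 64),
                             mode="bilinear", align_corners=False)
    return perceptual(gen_face, tgt_face)
```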

2.6. Total Loss

To train the network, we define the total generator loss $L_G$ as a weighted combination of multiple components:

$$L_G = \lambda_1 L_1 + \lambda_2 L_{BCE}\big( D(I_A^j, \hat{I}_B^j), 1 \big) + \lambda_3 L_{perceptual} + \lambda_4 L_{region},$$

where $L_{BCE}$ denotes the binary cross-entropy loss, computed using a PatchGAN discriminator, $L_{perceptual}$ is the perceptual loss computed from the 4th and 9th layers of VGG19, and $L_{region}$ denotes either the head region loss or the face-focused loss, depending on the loss variant used. In our experiments, the weights $\lambda_1$ to $\lambda_4$ are set to 5, 1, 5, and 3, respectively.
The discriminator loss $L_D$ is defined as:

$$L_D = L_{BCE}\big( D(I_A^j, I_B^j), 1 \big) + L_{BCE}\big( D(I_A^j, \hat{I}_B^j), 0 \big),$$

where $I_A^j$ and $I_B^j$ represent the source and target images, respectively, and $\hat{I}_B^j$ is the image generated by the network.
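Putting the pieces together, a sketch of the generator and discriminator objectives defined above. We use binary cross-entropy with logits because the discriminator sketched in Section 2.1 outputs raw patch logits; `perceptual` and `region_loss` stand for callables such as the Section 2.4/2.5 sketches, and the λ weights follow the values stated in the text.

```python
import torch
import torch.nn.functional as F

def generator_loss(D, src, generated, target, perceptual, region_loss,
                   weights=(5.0, 1.0, 5.0, 3.0)):
    """Weighted combination of L1, adversarial, perceptual, and region terms
    with lambda_1..lambda_4 = 5, 1, 5, 3."""
    l1, l2, l3, l4 = weights
    pred_fake = D(src, generated)
    adv = F.binary_cross_entropy_with_logits(pred_fake, torch.ones_like(pred_fake))
    return (l1 * F.l1_loss(generated, target)
            + l2 * adv
            + l3 * perceptual(generated, target)
            + l4 * region_loss(generated, target))

def discriminator_loss(D, src, generated, target):
    """Real pairs (src, target) labelled 1, fake pairs (src, generated) labelled 0."""
    pred_real = D(src, target)
    pred_fake = D(src, generated.detach())
    return (F.binary_cross_entropy_with_logits(pred_real, torch.ones_like(pred_real))
            + F.binary_cross_entropy_with_logits(pred_fake, torch.zeros_like(pred_fake)))
```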

3. Experiments

To evaluate our proposed architecture, we conducted comprehensive qualitative and quantitative analyses by comparing it with several established pose transfer techniques [9,22,23,26,31] using a consistent dataset [32]. Our approach surpasses previous methods in most evaluation metrics and also outperforms MAGPT.

3.1. Dataset

This paper utilizes the DeepFashion dataset [32] to perform experiments. We employ the In-shop Clothes Retrieval Benchmark subset of this dataset, which consists of 7982 items. Each item is displayed through real-person fitting photos, including front, side, back, and full-body shots. This subset exhibits a richer variety of human poses and clothing changes than other subsets. The images in this dataset have a resolution of 176 × 256 pixels. For standardization, we use bilinear interpolation to adjust these images to a resolution of 256 × 256 pixels, ensuring that the adjusted images are centered within a 256 × 256 square grid. For a fair comparison, we adopt the same training and testing split as [26], where each data point represents a person before and after pose transformation. Additionally, we observed a scarcity of male data in DeepFashion, amounting to only 4149 entries. Hence, we match the male images into pairs and randomly select 20,000 of these pairs to add to the original training set, bringing the total to 121,966 entries in the training set and 8570 in the testing set. We also ensure that there is no overlap between the training and testing sets.
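A small sketch of the preprocessing step described above, assuming each 176 × 256 image is placed, centered, on a 256 × 256 canvas after bilinear resizing; whether the authors pad with white or resample differently is not stated, so this is only one plausible reading.

```python
from PIL import Image

def to_square_256(path):
    """Resize a DeepFashion image (176x256) with bilinear interpolation and
    center it on a 256x256 canvas (white padding is an assumption)."""
    img = Image.open(path).convert("RGB")
    new_w = round(img.width * 256 / img.height)
    img = img.resize((new_w, 256), Image.BILINEAR)
    canvas = Image.new("RGB", (256, 256), (255, 255, 255))
    canvas.paste(img, ((256 - img.width) // 2, 0))
    return canvas
```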

3.2. Evaluation Metrics

We utilize evaluation metrics including Structural Similarity Index (SSIM) [33], Inception Score (IS) [34], PCKh [35], and LPIPS [36] to evaluate the quality of pose-transferred images.
SSIM measures the similarity between a generated image and a reference image in terms of luminance, contrast, and structure. It is defined as:

$$\text{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

where $\mu_x$, $\mu_y$ are the means, $\sigma_x^2$, $\sigma_y^2$ are the variances, and $\sigma_{xy}$ is the covariance of the two images $x$ and $y$. The Inception Score evaluates image quality and diversity by passing generated images through a pre-trained Inception network and is defined as:

$$\text{IS} = \exp\Big( \mathbb{E}_x \big[ D_{\text{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \big] \Big)$$

where $p(y \mid x)$ is the conditional label distribution given image $x$, and $p(y)$ is the marginal distribution. PCKh (Percentage of Correct Keypoints, head-normalized) measures keypoint localization accuracy, calculated as the proportion of predicted keypoints within a normalized distance from the ground truth keypoints:

$$\text{PCKh} = \frac{\text{Number of correct keypoints}}{\text{Total number of keypoints}}$$

LPIPS (Learned Perceptual Image Patch Similarity) evaluates the perceptual similarity between generated and real images using deep features extracted from pre-trained networks. It is computed as:

$$\text{LPIPS}(x, y) = \sum_l \frac{1}{H_l W_l} \sum_{h, w} \left\| w_l \odot \big( f^l_{h,w}(\hat{x}) - f^l_{h,w}(\hat{y}) \big) \right\|_2^2$$

where $f^l(\hat{x})$ and $f^l(\hat{y})$ are normalized feature activations from layer $l$ for images $x$ and $y$, $w_l$ are learned weights, and $H_l$ and $W_l$ are the spatial dimensions. We also evaluate the Fréchet Inception Distance (FID) [37], which measures the distance between the feature distributions of real and generated images in the embedding space of a pre-trained Inception model. It is computed as:

$$\text{FID} = \| \mu_r - \mu_g \|_2^2 + \text{Tr}\big( \Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2} \big)$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the means and covariances of the real and generated image features, respectively.
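For reproducibility, the sketch below shows how SSIM, LPIPS, and PCKh can be computed with off-the-shelf tools (scikit-image and the `lpips` package); the metric settings (data range, VGG backbone, PCKh threshold α = 0.5) are assumptions and may differ from the evaluation scripts used by the authors.

```python
import numpy as np
import torch
from skimage.metrics import structural_similarity
import lpips  # pip install lpips

# LPIPS with a VGG backbone, matching the LPIPS (VGG) column in Table 1.
lpips_vgg = lpips.LPIPS(net="vgg")

def evaluate_pair(generated, target):
    """generated/target: HxWx3 uint8 arrays. Returns (SSIM, LPIPS)."""
    ssim = structural_similarity(generated, target, channel_axis=-1)
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1).float()[None] / 127.5 - 1.0
    lp = lpips_vgg(to_t(generated), to_t(target)).item()
    return ssim, lp

def pckh(pred_kpts, gt_kpts, head_size, alpha=0.5):
    """Head-normalized PCK: fraction of keypoints whose prediction falls
    within alpha * head_size of the ground truth. Keypoints are (K, 2) arrays."""
    dists = np.linalg.norm(pred_kpts - gt_kpts, axis=1)
    return float(np.mean(dists < alpha * head_size))
```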
For all these metrics, we clarify the direction of desirable performance: higher SSIM, IS, and PCKh scores indicate better structural similarity, image quality and diversity, and keypoint alignment, respectively, while lower values of FID and LPIPS (for both VGG and SqueezeNet variants) indicate smaller perceptual and distributional gaps between generated and real images. Therefore, improvements in these metrics, such as increased SSIM and reduced FID and LPIPS, directly reflect the model's effectiveness in generating high-quality pose transfer images with realism and structural consistency.

3.3. Qualitative Comparison

This section compares the original architecture with the method proposed in this paper. We refer to the original architecture as MAGPT and to our modified architecture as CMF. Since we propose two facial loss functions, we distinguish the variants as CMF with Head Region Loss and CMF with Face-Focused Loss.
The qualitative comparison between our model and MAGPT is shown in Figure 6. We can observe that although MAGPT can successfully perform pose transfer with clothing and facial features, it still exhibits issues with fragmented parts in clothing details and mismatched skin colors for some individuals. In particular, when dealing with clothing that contains complex textures and patterns, our model produces significantly more coherent and realistic results. Compared to MAGPT, which tends to introduce noticeable artifacts and distortions in these scenarios, our method demonstrates superior preservation of fine-grained visual details and clothing consistency. As highlighted by the red box in Figure 6, MAGPT generates distorted textures and incomplete patterns in clothing with complex designs. In contrast, our method avoids such distortions by leveraging residual blocks with CBAM and the Mish activation function, which enhance spatial attention and feature expressiveness to better preserve the original garment structure. Additionally, in the third row, MAGPT fails to preserve the original sleeve length of the clothing, erroneously generating short sleeves instead of the long sleeves present in the target image. In contrast, both variants of our method correctly retain the garment’s structure, demonstrating improved spatial awareness and contour preservation. Furthermore, the images produced by MAGPT have issues with insufficient facial features, such as unequal eye sizes and irregular gaze directions, and even instances of facial disappearance. By incorporating head region loss or face-focused loss, the proposed model also achieves more detailed feature representation, and the aforementioned issues are addressed in the performance of our qualitative experiments.
To better understand the individual contributions of CBAM and Mish, we conduct an ablation study based on the MAGPT architecture. As shown in Figure 7, we compare the original MAGPT with its variants integrating CBAM or replacing ReLU with Mish. The CBAM-enhanced version demonstrates improved structural retention and spatial attention, particularly evident in clothing contours and folds. Meanwhile, the version with Mish activation produces cleaner local textures and fewer visual distortions. These findings confirm that both modules independently improve different aspects of the pose transfer process.

3.4. Quantitative Comparison

We analyze the performance of the proposed model in quantitative experiments across various metrics, compared to different methods including MAGPT [26], $PG^2$ [9], Deform [31], VUNet [22], and PATN [23]. The results of the quantitative comparisons are shown in Table 1. The arrows in Table 1 indicate the desired direction for each metric: ↑ means higher values are better (e.g., SSIM, IS, and PCKh), while ↓ means lower values are better (e.g., FID and LPIPS).
The proposed model with head region loss and with face-focused loss both outperform the MAGPT model in terms of SSIM and LPIPS scores. This demonstrates that the modifications with the CBAM attention module and Mish activation function can result in greater detail and structural closeness to the target image. The integration of CBAM helps the model focus more on key areas of the image, thereby enhancing the authenticity and detail of the generated image structure. In terms of LPIPS performance, the facial loss functions specifically optimize the facial details of the generated images. They improve the authenticity and naturalness of facial expressions and features, which is crucial for enhancing the overall perceptual image quality. Additionally, the Mish activation function improves the model's learning efficiency and stability across different input values due to its non-monotonicity and smoothness, directly contributing to an improvement in LPIPS scores. For the PCKh metric, both the proposed method with head region loss and the MAGPT model reach 0.98. Since the ideal value of PCKh computed using the ground truth is 1, the room for improving this metric is quite limited. The proposed model also performs better than MAGPT in terms of FID. The FID metric measures the statistical similarity between generated and real images, reflecting the model's strong ability to capture the overall image distribution and detail authenticity. For the IS metric, the proposed model with face-focused loss slightly outperforms MAGPT. The IS metric evaluates the diversity of generated images and their recognizability by a pre-trained classification model, focusing more on the categorical attributes of images than on the faithful reproduction of details. Since the proposed model focuses more on enhancing the authenticity of image details, its improvement on the IS score is not as pronounced as that on the FID score.

3.5. Experiment on Handling Diverse Actions

In this subsection, we apply the proposed model to data with more diverse actions to demonstrate that the proposed model is not limited to the DeepFashion dataset. From the results shown in Figure 8, we can validate that the proposed model can handle diverse action transformations. For the jumping jack action, the model captures the details of the hand movements effectively. As for the stylized and rhythmically intense hip-hop dancing, the model demonstrates its ability to capture the movement details and maintain consistency in the style of the action.

4. Conclusions

This paper introduces a pose transformation model based on a dual attention mechanism and improved head or facial loss functions. We enhance clothing details and overall image edge features using CBAM and Mish and enhance personal feature learning through the head region loss and the face-focused loss. The head region loss is tailored for the DeepFashion dataset to yield better results, while the face-focused loss uses OpenPose for facial feature extraction and is applicable to any dataset. Despite some differences, both loss functions perform well in qualitative and quantitative experiments compared to the original MAGPT structure.
Based on our findings, future research could focus on the following improvements.
  • Enhancing hand detail generation, as current models lack training on images with open hands and struggle with hand and finger details.
  • Improving handling of complex backgrounds, since current experiments are trained on all-white backgrounds.
  • Diversifying the training dataset to include non-Western features, as the current model is predominantly trained on Western faces. Adding a variety of ethnicities could improve performance.

Author Contributions

Conceptualization, H.-Y.C.; Methodology, C.-C.C.; Software, C.-C.C.; Validation, C.-L.J. and C.-C.Y.; Investigation, H.-Y.C. and C.-C.Y.; Writing—original draft, C.-C.C.; Writing—review & editing, H.-Y.C. and C.-L.J.; Supervision, C.-C.Y.; Project administration, H.-Y.C.; Funding acquisition, H.-Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Science and Technology Council, Taiwan, grant number 112-2221-E-008-069-MY3.

Data Availability Statement

The original data presented in the study are openly available at https://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html (accessed on 8 January 2024).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
| Abbreviation | Full Form |
| --- | --- |
| GAN | Generative Adversarial Network |
| CGAN | Conditional Generative Adversarial Network |
| MAGPT | Multi-scale Attention Guided Pose Transfer |
| CBAM | Convolutional Block Attention Module |
| SSIM | Structural Similarity Index Measure |
| IS | Inception Score |
| PCKh | Percentage of Correct Keypoints (head-normalized) |
| LPIPS | Learned Perceptual Image Patch Similarity |
| FID | Fréchet Inception Distance |
| HPE | Human Pose Estimator |
| ADGAN | Attribute-Decomposed GAN |

References

  1. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Annual Conference on Neural Information Processing Systems 2014, Montreal, QC, Canada, 8–13 December 2014; Volume 27. [Google Scholar]
  2. Johnson, J.; Alahi, A.; Li, F.-F. Perceptual losses for real-time style transfer and super-resolution. In Computer Vision–ECCV 2016: 14th European Conference; Springer International Publishing: Cham, Switzerland, 2016; pp. 694–711. [Google Scholar]
  3. Lassner, C.; Pons-Moll, G.; Gehler, P.V. A generative model of people in clothing. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 853–862. [Google Scholar]
  4. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
  5. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784. [Google Scholar]
  6. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
  7. Balakrishnan, G.; Zhao, A.; Dalca, A.V.; Durand, F.; Guttag, J. Synthesizing images of humans in unseen poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8340–8348. [Google Scholar]
  8. Chan, C.; Ginosar, S.; Zhou, T.; Efros, A.A. Everybody dance now. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 5933–5942. [Google Scholar]
  9. Ma, L.; Jia, X.; Sun, Q.; Schiele, B.; Tuytelaars, T.; Van Gool, L. Pose guided person image generation. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  10. Ma, L.; Sun, Q.; Georgoulis, S.; Van Gool, L.; Schiele, B.; Fritz, M. Disentangled person image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 99–108. [Google Scholar]
  11. Neverova, N.; Guler, R.A.; Kokkinos, I. Dense pose transfer. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 123–138. [Google Scholar]
  12. Si, C.; Wang, W.; Wang, L.; Tan, T. Multistage adversarial losses for pose-based human image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 118–126. [Google Scholar]
  13. Isola, P.; Zhu, J.-Y.; Zhou, T.; Efros, A.A. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1125–1134. [Google Scholar]
  14. Sangkloy, P.; Lu, J.; Fang, C.; Yu, F.; Hays, J. Scribbler: Controlling deep image synthesis with sketch and color. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5400–5409. [Google Scholar]
  15. Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  16. Yeh, R.A.; Chen, C.; Lim, T.Y.; Alexander, G.S.; Hasegawa-Johnson, M.; Do, M.N. Semantic image inpainting with deep generative models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  17. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef] [PubMed]
  18. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  19. Zhao, B.; Wu, X.; Cheng, Z.-Q.; Liu, H.; Jie, Z.; Feng, J. Multi-view image generation from a single-view. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 383–391. [Google Scholar]
  20. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  21. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference; Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  22. Esser, P.; Sutter, E.; Ommer, B. A variational u-net for conditional appearance and shape generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8857–8866. [Google Scholar]
  23. Zhu, Z.; Huang, T.; Shi, B.; Yu, M.; Wang, B.; Bai, X. Progressive pose attention transfer for person image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2347–2356. [Google Scholar]
  24. Lu, Y.; Zhang, M.; Ma, A.J.; Xie, X.; Lai, J. Coarse-to-fine latent diffusion for pose-guided person image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 6420–6429. [Google Scholar]
  25. Shen, F.; Ye, H.; Zhang, J.; Wang, C.; Han, X.; Yang, W. Advancing pose-guided image synthesis with progressive conditional diffusion models. arXiv 2023, arXiv:2310.06313. [Google Scholar]
  26. Roy, P.; Bhattacharya, S.; Ghosh, S.; Pal, U. Multi-scale attention guided pose transfer. Pattern Recognit. 2023, 137, 109315. [Google Scholar] [CrossRef]
  27. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  28. Misra, D. Mish: A self regularized non-monotonic activation function. arXiv 2019, arXiv:1908.08681. [Google Scholar]
  29. Cao, Z.; Simon, T.; Wei, S.-E.; Sheikh, Y. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7291–7299. [Google Scholar]
  30. Men, Y.; Mao, Y.; Jiang, Y.; Ma, W.-Y.; Lian, Z. Controllable person image synthesis with attribute-decomposed gan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 5084–5093. [Google Scholar]
  31. Siarohin, A.; Sangineto, E.; Lathuiliere, S.; Sebe, N. Deformable gans for pose-based human image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3408–3416. [Google Scholar]
  32. Liu, Z.; Luo, P.; Qiu, S.; Wang, X.; Tang, X. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1096–1104. [Google Scholar]
  33. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  34. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved Techniques for Training Gans; Neural Information Processing Systems Foundation: San Diego, CA, USA, 2016; Volume 29. [Google Scholar]
  35. Andriluka, M.; Pishchulin, L.; Gehler, P.; Schiele, B. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3686–3693. [Google Scholar]
  36. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 586–595. [Google Scholar]
  37. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
Figure 1. An example of the pose transfer task.
Figure 2. Generator model architecture. The top half of the figure illustrates the overall structure of the proposed model. The bottom half provides detailed representations of key internal components, including the residual block with CBAM, the input and output convolution modules (inconv and outconv), and the encoder and decoder blocks.
Figure 3. The architecture of the discriminator model.
Figure 4. CBAM integrated with a residual block.
Figure 5. Mish activation function.
Figure 6. Qualitative comparison between our model and MAGPT. $I_A^j$ denotes the source image, $P_A^j$ denotes the pose of the source image, $I_B^j$ denotes the target image, $P_B^j$ denotes the pose of the target image, and the subsequent columns show the images generated by MAGPT and our methods.
Figure 7. Qualitative comparison among model variants. $I_A^j$ denotes the source image, $P_A^j$ denotes the pose of the source image, $I_B^j$ denotes the target image, $P_B^j$ denotes the pose of the target image, and the subsequent columns show the images generated by MAGPT, MAGPT with CBAM, and MAGPT with Mish.
Figure 8. Results of applying the proposed method to action generation.
Table 1. Quantitative comparison among different pose transfer methods.
| Model | FID ↓ | SSIM ↑ | IS ↑ | PCKh ↑ | LPIPS (VGG) ↓ | LPIPS (SqzNet) ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| $PG^2$ [9] | - | 0.773 | 3.163 | 0.89 | 0.523 | 0.416 |
| Deform [31] | - | 0.760 | 3.362 | 0.94 | - | - |
| VUNet [22] | - | 0.763 | 3.440 | 0.93 | - | - |
| PATN [23] | 20.73 | 0.773 | 3.209 | 0.96 | 0.299 | 0.170 |
| MAGPT [26] | 10.56 | 0.769 | 3.379 | 0.98 | 0.200 | 0.111 |
| Proposed with Head Region Loss | 10.40 | 0.775 | 3.070 | 0.98 | 0.196 | 0.108 |
| Proposed with Face-Focused Loss | 10.20 | 0.774 | 3.411 | 0.96 | 0.196 | 0.108 |
| Real data | 7.68 | 1.000 | 3.864 | 1.00 | 0.000 | 0.000 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
