Article

A Deep Learning Framework with Multi-Scale Texture Enhancement and Heatmap Fusion for Face Super Resolution

Bing Xu, Lei Wang, Yanxia Wu, Xiaoming Liu and Lu Gan

1 College of Computer Science and Technology, Harbin Engineering University, Harbin 150001, China
2 China Communication Information Technology Group Co., Ltd., Beijing 101300, China
3 School of Physics and Electronic Information, Anhui Normal University, Wuhu 241002, China
* Author to whom correspondence should be addressed.
Submission received: 1 December 2025 / Revised: 2 January 2026 / Accepted: 7 January 2026 / Published: 9 January 2026

Abstract

Face super-resolution (FSR) has made great progress thanks to deep learning and facial priors. However, many existing methods do not fully exploit landmark heatmaps and lack effective multi-scale texture modeling, which often leads to texture loss and artifacts under large upscaling factors. To address these problems, we propose a Multi-Scale Residual Stacking Network (MRSNet), which integrates multi-scale texture enhancement with multi-stage heatmap fusion. The MRSNet is built upon Residual Attention-Guided Units (RAGUs) and incorporates a Face Detail Enhancer (FDE), which applies edge, texture, and region branches to achieve differentiated enhancement across facial components. Furthermore, we design a Multi-Scale Texture Enhancement Module (MTEM) that employs progressive average pooling to construct hierarchical receptive fields and applies heatmap-guided attention for adaptive texture refinement. In addition, we introduce a multi-stage heatmap fusion strategy that injects landmark priors into multiple phases of the network, including feature extraction, texture enhancement, and detail reconstruction, enabling deep sharing and progressive integration of prior knowledge. Extensive experiments on the CelebA and Helen datasets demonstrate that the proposed method achieves superior detail recovery and generates perceptually realistic high-resolution face images. Both quantitative and qualitative evaluations confirm that our approach outperforms state-of-the-art methods.

1. Introduction

In recent years, Face Super-Resolution (FSR) has emerged as an important research direction in computer vision [1,2,3]. FSR aims to reconstruct high-resolution (HR) face images with rich details from low-resolution (LR) inputs, playing a critical role in applications such as video surveillance, image enhancement, and digital forensics. Moreover, advances in FSR can significantly improve the performance of downstream tasks including face recognition, parsing, and alignment.
Compared to general single-image super-resolution (SISR) methods, FSR can exploit the relatively consistent structure of human faces as a strong prior, which helps recover both global geometry and local details, especially at large scaling factors (e.g., ×8). With the rapid development of deep convolutional neural networks, deep learning-based FSR approaches have achieved remarkable progress in reconstruction quality. Existing works can be broadly categorized into three groups: attribute-constrained methods, identity-preserving methods, and structure prior-guided methods. Among them, structure prior-guided methods, which leverage geometric information such as facial landmarks, heatmaps, or parsing maps to guide reconstruction, have become a major research focus.
Despite these advances, structure prior-guided FSR methods still face several limitations. First, prior information often depends on low-quality LR inputs or intermediate SR results, the inaccuracy of which may compromise final reconstruction. Second, simple feature concatenation or shallow attention mechanisms are insufficient to fully capture structural variations across different facial components. Third, inadequate multi-scale texture modeling frequently results in texture loss or severe artifacts, especially under large upscaling factors.
To address these challenges, we propose a Face Super-Resolution method with Multi-Scale Texture Enhancement and Deep Heatmap Fusion, the core innovations of which include:
  • Face Detail Enhancer (FDE): This enhances input features through three branches—edge, texture, and region—enabling differentiated refinement across facial components (e.g., eyes, nose, mouth) to improve local detail recovery.
  • Multi-Scale Texture Enhancement Module (MTEM): This employs progressive pooling to construct hierarchical receptive fields, performs adaptive enhancement of multi-scale facial textures, and incorporates FDE with heatmap-guided spatial attention for fine-grained texture reconstruction.
  • Multi-stage Heatmap Fusion Strategy: This extends landmark heatmap information to multiple stages, including feature extraction, texture enhancement, and detail reconstruction, enabling deep sharing and progressive integration of structural priors to ensure facial consistency and realism.
  • Efficient Collaborative Optimization Framework: This integrates FDE, MTEM, Residual Attention-Guided Units (RAGUs), deep heatmap fusion, and upsampling reconstruction modules in a collaborative manner, significantly improving reconstruction quality while maintaining computational efficiency, reducing artifacts, and enhancing the naturalness of local textures.
The remainder of this paper is organized as follows: Section 2 reviews related work; Section 3 introduces the proposed method in detail; Section 4 presents the experiments and results analysis; Section 5 concludes this work.

2. Related Work

2.1. Single Image Super-Resolution

Single Image Super-Resolution (SISR) [3,4,5,6] has long been a central research topic in the field of image processing and has attracted increasing attention in recent years. Early approaches were primarily based on interpolation algorithms and statistical learning techniques, but these methods exhibited clear limitations in handling complex textures and fine details. The advent of deep learning brought revolutionary progress to super-resolution research.
Dong et al. first introduced deep convolutional neural networks into the SR domain and proposed SRCNN [1], which learns a nonlinear mapping between low-resolution (LR) and high-resolution (HR) image pairs, significantly improving reconstruction quality. Subsequently, Kim et al. presented VDSR [2], a 20-layer deep network that learns image residuals and incorporates skip connections to ease the training of very deep networks. The DRCN [7] method further explored the use of recursive networks for SR, leveraging parameter sharing to enhance representational capacity.
To further improve performance, researchers began focusing on feature propagation and information reuse. Li et al. proposed SRFBN [8], which introduces a feedback block structure to fully exploit high-level semantic information, achieving strong performance particularly under large upscaling factors. Zhang et al. developed RCAN [9], which integrates a channel attention mechanism to adaptively recalibrate features, thereby enhancing the expressive power of the network. While these methods achieved significant gains in objective metrics such as PSNR and SSIM, their reconstructed images often appear overly smooth and lack realistic high-frequency details. Chen et al. [10] proposed a hierarchical transfer and progressive learning framework with block-based convolution and key-point-sensitive loss to improve fine-grained feature extraction and long-tailed recognition performance in low-resolution sonar images, demonstrating the effectiveness of multi-stage training and structure-aware supervision for challenging visual reconstruction tasks.
More recently, improving perceptual quality has become a key focus in SR research. Ledig et al. pioneered the use of Generative Adversarial Networks (GANs) for SR and proposed SRGAN [11], which employs adversarial training and perceptual loss to produce more realistic textures. Building on this, Wang et al. refined the design of perceptual loss [12] and introduced more effective feature matching strategies. These perceptually optimized methods [11,12,13,14,15] have achieved remarkable improvements in visual quality, yet balancing objective metrics with perceptual fidelity remains a challenging issue.

2.2. Face Super-Resolution

Face Super-Resolution (FSR), as an important branch of SR techniques, presents unique characteristics and challenges. Owing to the relatively fixed structural configuration and rich semantic information of human faces, researchers have explored how to leverage facial prior knowledge to improve reconstruction performance.
Early FSR methods were primarily based on subspace learning and component reconstruction strategies. Subspace-based approaches often relied on dimensionality reduction techniques such as Principal Component Analysis (PCA), which required precise alignment of facial images—an unrealistic assumption under low-resolution conditions. Component-based methods attempted to detect facial landmarks for localized reconstruction, but the accuracy of landmark detection on LR inputs was limited. Consequently, these early approaches struggled to generate satisfactory results, particularly under large scaling factors.
The advent of deep learning opened new opportunities for FSR. Zhu et al. proposed a cascaded dual-branch [16] network that jointly optimizes face hallucination and dense correspondence field estimation within a unified framework. Yu et al. introduced the use of Generative Adversarial Networks (GANs) for FSR [17], directly super-resolving LR inputs and later extending the model to handle unaligned, noisy, and attribute-diverse faces. Chen et al. presented FSRNet [18], which incorporates both facial parsing maps and landmark heatmaps as geometric priors to guide the reconstruction process.
More recent studies have investigated richer forms of facial priors. Kim et al. proposed a facial attention loss [19] to encourage networks to focus more on landmark regions. Ma et al. designed an iterative framework where SR prediction and landmark estimation [20] are alternately refined to mutually enhance performance. Although these methods improved reconstruction by introducing additional supervision, they suffer from two main drawbacks: (1) reliance on extra data annotation, and (2) limited effectiveness in guiding networks to focus on critical facial structures through indirect supervision.

2.3. Applications of Attention Mechanisms in Image Processing

As an important feature enhancement technique, attention mechanisms have been widely applied in computer vision. The core idea of attention is to learn importance weights to reallocate feature responses, emphasizing informative signals while suppressing redundant ones.
In image classification, Hu et al. introduced SENet [21], which achieves significant improvements through channel attention. Wang et al. proposed a trunk-mask attention mechanism, embedding attention into residual networks [22]. Woo et al. further developed CBAM [23], which infers attention maps along both channel and spatial dimensions, enabling more fine-grained feature modulation.
Attention mechanisms have also demonstrated great potential in image generation tasks. Zhang et al. integrated channel attention with deep residual networks for image super-resolution [9]. Cao et al. proposed a reinforcement learning-based attention-aware framework [24,25,26,27] for face super-resolution, which sequentially applies attention, patch cropping, and SR reconstruction. However, existing methods exhibit certain limitations: (1) attention mechanisms in high-level vision tasks often rely on pooling operations to extract semantic information, which may lead to the loss of critical low- and mid-level features such as edges and shapes; (2) most methods focus on channel attention, while spatial attention could offer greater advantages in region-specific processing of facial components.

3. Method

This section provides a detailed description of the proposed face super-resolution approach. As illustrated in Figure 1, the overall network adopts a progressive stage-wise structure composed of stacked Residual Attention-Guided Units (RAGUs). Each RAGU consists of two collaborative submodules: the Multi-Scale Texture Enhancement Module (MTEM) and the Prior Attention Guidance Module (PAGM). From a mathematical perspective, the proposed MRSNet can be viewed as a progressive feature refinement framework guided by structural priors. At each stage, multi-scale texture features are recovered while facial landmark heatmaps provide spatial guidance through attention modulation. This joint optimization enables both local detail enhancement and global structural consistency to be gradually strengthened along the network depth.
The low-resolution (LR) input face is first pre-upsampled and then sequentially refined by the stacked RAGUs in a progressive manner. The output of each RAGU not only serves as the input for the subsequent stage but also feeds back intermediate high-resolution (HR) features into the attention-guided branch. This feedback enhances the texture recovery of key facial regions.
Specifically, the MTEM is responsible for multi-scale texture modeling and enhancement, while the PAGM transforms facial priors (e.g., landmark heatmaps) into channel attention via a Squeeze-and-Excitation (SE) module, thereby guiding the MTEM to perform region-specific enhancement.
As shown in Figure 1, the network mainly consists of three components: a shallow feature extraction module, a deep feature refinement module composed of stacked layers, and an image reconstruction (upsampling) module. $I_{LR}$, $I_{SR}$, and $I_{HR}$ denote the LR face image (i.e., input), the super-resolved image (i.e., output), and the ground-truth HR image, respectively. $I_{LR}$ is upsampled to the target resolution $I_{LR}^{UP}$ by using bicubic interpolation. The upsampled image $I_{LR}^{UP}$ is then fed into the shallow feature extraction module to obtain features $F_0 = \phi_{\mathrm{ext}}(I_{LR}^{UP})$. Subsequently, $F_0$ is fed into the deep refinement module, which is composed of $T$ cascaded RAGUs:
$F_t = \phi_{\mathrm{RAGU}}^{(t)}(F_{t-1}, H_{t-1}), \quad t = 1, 2, \ldots, T$
Here, $H_{t-1}$ denotes the facial heatmap prior available at the $(t-1)$-th layer (replaced with zeros or pseudo-priors when no prior is available). Each RAGU consists of an MTEM and a PAGM, which work collaboratively to generate refined features. After passing through $T$ layers, the final feature representation $F_T$ is obtained, and the reconstruction module generates the SR image $I_{SR} = \phi_{\mathrm{rec}}(F_T)$, where $\phi_{\mathrm{rec}}(\cdot)$ performs upsampling followed by convolution operations to map the features into the RGB space.
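To make this stage-wise data flow concrete, the following PyTorch-style sketch mirrors the pipeline above (bicubic pre-upsampling, shallow extraction $\phi_{\mathrm{ext}}$, $T$ cascaded RAGUs, and reconstruction $\phi_{\mathrm{rec}}$). It is a minimal illustration: the class names are placeholders, and the RAGU stand-in below is only a residual block, not the full MTEM/PAGM stage detailed in Section 3.1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RAGUPlaceholder(nn.Module):
    """Stand-in for one RAGU stage (MTEM + PAGM, Section 3.1): a residual
    convolution block that optionally receives a heatmap prior."""

    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, feat, heatmap=None):
        # The prior is ignored in this placeholder; the real stage uses it
        # for attention guidance.
        return feat + self.body(feat)


class MRSNetSketch(nn.Module):
    """Minimal sketch of the overall pipeline: pre-upsampling, shallow
    extraction (phi_ext), T cascaded RAGUs (phi_RAGU), reconstruction (phi_rec)."""

    def __init__(self, num_stages=4, channels=64, scale=8):
        super().__init__()
        self.scale = scale
        self.extract = nn.Conv2d(3, channels, 3, padding=1)        # phi_ext
        self.stages = nn.ModuleList(
            RAGUPlaceholder(channels) for _ in range(num_stages))  # phi_RAGU^(t)
        self.reconstruct = nn.Conv2d(channels, 3, 3, padding=1)    # phi_rec

    def forward(self, lr_image, heatmaps=None):
        # I_LR -> I_LR^UP: bicubic pre-upsampling to the target resolution.
        up = F.interpolate(lr_image, scale_factor=self.scale,
                           mode="bicubic", align_corners=False)
        feat = self.extract(up)                                    # F_0
        for t, stage in enumerate(self.stages):
            prior = None if heatmaps is None else heatmaps[t]      # H_{t-1}
            feat = stage(feat, prior)                              # F_t
        return self.reconstruct(feat)                              # I_SR


# Example: an x8 forward pass on a 16x16 LR face.
sr = MRSNetSketch()(torch.rand(1, 3, 16, 16))
print(sr.shape)  # torch.Size([1, 3, 128, 128])
```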

3.1. Residual Attention-Guided Unit (RAGU)

The internal schematic of the Residual Attention-Guided Unit (RAGU) is shown in Figure 2. The input to each RAGU consists of the features from the previous stage $F_{t-1}$ and the available heatmap prior $H_{t-1}$, and the output is $F_t$. The design of the RAGU explicitly decouples multi-scale texture enhancement from prior-guided attention while allowing effective interaction between them. The residual formulation encourages the network to focus on learning high-frequency facial details, whereas the attention guidance derived from landmark priors emphasizes semantically important facial regions. By stacking multiple RAGUs, structural priors are repeatedly injected and progressively refined, leading to increasingly accurate facial reconstructions.
$F_t = \Psi\big(\mathrm{MTEM}(F_{t-1}; \theta_{\mathrm{trm}}),\ \mathrm{PAGM}(H_{t-1}; \theta_{\mathrm{pgm}})\big)$
Here, $\Psi(\cdot, \cdot)$ denotes the fusion of the MTEM output and the attention-guided information generated by the PAGM. In our implementation, the MTEM output $F_{\mathrm{tr}}$ is first weighted by the spatial attention $A$ generated by the PAGM, i.e., $\tilde{F} = A \odot F_{\mathrm{tr}}$, and a residual connection is then applied to obtain $F_t$.
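Under the assumption stated above (attention weighting of the MTEM output followed by a residual connection from the stage input), the fusion $\Psi$ can be sketched as:

```python
import torch


def ragu_fusion(f_prev, f_mtem, attention):
    """Psi: weight the MTEM output by the PAGM-derived attention map and add
    a residual connection from the stage input F_{t-1}.
    f_prev, f_mtem: (B, C, H, W); attention broadcastable to the same shape."""
    return f_prev + attention * f_mtem


# Toy example with random tensors (channel-wise attention broadcast spatially).
f_t = ragu_fusion(torch.rand(1, 64, 128, 128),
                  torch.rand(1, 64, 128, 128),
                  torch.rand(1, 64, 1, 1))
print(f_t.shape)  # torch.Size([1, 64, 128, 128])
```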

3.2. Multi-Scale Texture Enhancement Module

The Multi-Scale Texture Enhancement Module (MTEM) is designed to enhance facial textures in a multi-scale manner. Specifically, MTEM constructs hierarchical receptive fields via progressive pooling and incorporates a Face Detail Enhancer (FDE) at each scale for fine-grained detail refinement. Unlike conventional multi-scale feature pyramids that process different scales independently, the proposed MTEM adopts a progressive pooling strategy, where features at coarser scales are hierarchically derived from finer ones. This design facilitates information flow across scales and enables the network to capture both fine-grained textures and broader contextual patterns. Such hierarchical receptive fields are particularly beneficial for face super-resolution, where different facial components exhibit distinct texture characteristics at different spatial scales.
The internal schematic of the Multi-Scale Texture Enhancement Module is shown in Figure 3. The MTEM takes an input $X \in \mathbb{R}^{B \times C_{in} \times H \times W}$, where $B$ denotes the batch size, $C_{in}$ the number of channels, and $H$ and $W$ the spatial dimensions. First, the input features are mapped to the hidden channel dimension via a $1 \times 1$ convolutional layer, followed by batch normalization and activation, resulting in $m_0$. If the landmark heatmap $H$ is provided, it is first interpolated to the target spatial size $H \times W$ and then channel-adapted to obtain $\tilde{H}$. Subsequently, the initial attention $A_0$ is generated using a $1 \times 1$ convolution kernel $W_a$ and used to weight $m_0$:
$A_0 = \sigma(W_a * \tilde{H}) \in \mathbb{R}^{B \times C_{in} \times H \times W}, \qquad m_0 \leftarrow m_0 \odot A_0$
Next, each scale undergoes pooling and convolution sequentially. For each scale, the feature $m_i$ is processed by the Face Detail Enhancer (FDE), which enhances edges, textures, and regions through three separate branches. The internal structure of the FDE can be understood from Figure 4. The enhanced features are then fused and combined with a residual connection to produce the final refined features. The edge branch emphasizes local edge information by subtracting the pooled feature from the input, followed by a convolution, batch normalization, and nonlinear activation:
$e_f = \sigma\big(\mathrm{BN}(W_e * (x - \mathrm{Pool}(x)))\big)$
Here, $W_e$ is a learnable convolution kernel, $\mathrm{Pool}(\cdot)$ denotes a spatial pooling operation (such as average pooling), $\mathrm{BN}(\cdot)$ denotes batch normalization, and $\sigma(\cdot)$ is the activation function (e.g., ReLU). The resulting $e_f$ captures enhanced edge features of the input. The texture branch captures multi-level texture patterns using a two-layer convolution chain:
$t = \sigma(W_{t2} * \sigma(W_{t1} * x))$
$W_{t1}$ and $W_{t2}$ are learnable convolution kernels for the first and second layers, respectively. This branch extracts richer texture details, helping the network to reconstruct fine-grained facial features.
The region branch incorporates prior information from facial landmark heatmaps. If heatmaps are provided, they are first interpolated to match the spatial resolution of the input feature and then adapted to the channel dimension using a learnable 1 × 1 convolution:
$r = \sigma(W_r * \mathrm{ChannelAdapt}(H))$
If no heatmaps are available, a pseudo-region feature is generated via adaptive average pooling of the texture branch. Here, $W_r$ denotes a learnable $1 \times 1$ convolution kernel, and $\mathrm{ChannelAdapt}(\cdot)$ represents the channel alignment operation. The region branch provides context-aware information to guide the enhancement of important facial areas.
Finally, the outputs of the three branches are concatenated along the channel dimension and mapped back to the original channel size through a convolution, then added to the input as a residual connection:
$\mathrm{FDE}(x, H) = x + \phi_f\big(\mathrm{Concat}(e_f, t, r)\big)$
The three-branch design of the FDE is motivated by the heterogeneous nature of facial details. The edge branch enhances sharp structural boundaries, the texture branch recovers mid- and high-frequency patterns critical for perceptual realism, and the region branch leverages landmark priors to spatially constrain enhancement. Their fusion enables differentiated and region-aware detail refinement.
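The following PyTorch-style sketch summarizes the three-branch FDE and the progressive-pooling MTEM described above. It follows the formulas for $e_f$, $t$, $r$, and $\mathrm{FDE}(x, H)$, but kernel sizes, the number of scales, the 68-channel landmark heatmaps, and the way per-scale outputs are aggregated are illustrative assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FDE(nn.Module):
    """Face Detail Enhancer: edge, texture, and region branches fused with a
    residual connection, i.e., FDE(x, H) = x + phi_f(Concat(e_f, t, r))."""

    def __init__(self, channels, heatmap_channels=68):
        super().__init__()
        self.edge = nn.Sequential(                      # sigma(BN(W_e * (x - Pool(x))))
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.texture = nn.Sequential(                   # sigma(W_t2 * sigma(W_t1 * x))
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.channel_adapt = nn.Conv2d(heatmap_channels, channels, 1)
        self.region = nn.Sequential(                    # sigma(W_r * ChannelAdapt(H))
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True))
        self.fuse = nn.Conv2d(3 * channels, channels, 3, padding=1)   # phi_f

    def forward(self, x, heatmap=None):
        e = self.edge(x - F.avg_pool2d(x, 3, stride=1, padding=1))    # edge branch
        t = self.texture(x)                                           # texture branch
        if heatmap is not None:                                       # region branch
            h = F.interpolate(heatmap, size=x.shape[-2:],
                              mode="bilinear", align_corners=False)
            r = self.region(self.channel_adapt(h))
        else:
            # Pseudo-region feature from adaptive average pooling of textures.
            r = F.adaptive_avg_pool2d(t, 1).expand_as(t)
        return x + self.fuse(torch.cat([e, t, r], dim=1))


class MTEM(nn.Module):
    """Multi-Scale Texture Enhancement Module: progressive average pooling
    derives coarser scales from finer ones; an FDE refines each scale and
    the refined scales are aggregated back at the input resolution."""

    def __init__(self, channels, num_scales=3, heatmap_channels=68):
        super().__init__()
        self.entry = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                   nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.heat_gate = nn.Conv2d(heatmap_channels, channels, 1)     # W_a
        self.enhancers = nn.ModuleList(
            FDE(channels, heatmap_channels) for _ in range(num_scales))
        self.exit = nn.Conv2d(channels, channels, 1)

    def forward(self, x, heatmap=None):
        m = self.entry(x)                                             # m_0
        if heatmap is not None:
            h = F.interpolate(heatmap, size=m.shape[-2:],
                              mode="bilinear", align_corners=False)
            m = m * torch.sigmoid(self.heat_gate(h))                  # A_0 gating
        out, cur, cur_h = torch.zeros_like(m), m, heatmap
        for fde in self.enhancers:
            refined = fde(cur, cur_h)
            out = out + F.interpolate(refined, size=m.shape[-2:],
                                      mode="bilinear", align_corners=False)
            cur = F.avg_pool2d(refined, 2)                            # next, coarser scale
            cur_h = None if cur_h is None else F.avg_pool2d(cur_h, 2)
        return x + self.exit(out / len(self.enhancers))
```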

3.3. Prior Attention Guidance Module (PAGM) and Heatmap Adaptation

Facial landmark heatmaps provide strong structural priors that indicate where enhancement should be emphasized in face super-resolution. To incorporate such priors into the feature refinement process in a consistent and reproducible manner, we design the Prior Attention Guidance Module (PAGM), which jointly exploits spatial priors from heatmaps and channel-wise importance learned from feature statistics.
Let the original facial landmark heatmap at stage $s$ be denoted as $H^{s}$. We first rescale it to the target spatial size $(H, W)$ using the interpolation operator $\mathcal{I}(\cdot)$, where $\mathcal{I}$ denotes bilinear interpolation. Then, a channel adaptation operator $\mathcal{C}(\cdot)$, implemented by a learnable $1 \times 1$ convolution, is applied to align the channel dimension with the feature map, resulting in the adapted heatmap:
$H_{\mathrm{adp}}^{s} = \mathcal{C}\big(\mathcal{I}(H^{s})\big)$
The adapted heatmap serves as a spatial prior that highlights semantically important facial regions, such as eyes, nose, and mouth.
In parallel, channel-wise attention is learned from the MTEM output features using a Squeeze-and-Excitation (SE) mechanism. Let $A_{ch}^{s}$ denote the channel-wise attention weights at stage $s$. Specifically, given the MTEM output feature map $F^{(s)}$, global average pooling (GAP) is first applied to aggregate spatial information, followed by two successive $1 \times 1$ convolution layers and a sigmoid activation to generate channel attention weights:
$A_{ch}^{s} = \sigma\big(\mathrm{Conv}(\mathrm{Conv}(\mathrm{GAP}(F^{(s)})))\big)$
where $\sigma$ denotes the sigmoid function. The resulting channel attention emphasizes which semantic feature channels should be enhanced.
The spatial prior derived from the adapted heatmap and the channel attention learned from features are jointly applied to guide feature refinement. By combining spatial emphasis and channel-wise reweighting, the PAGM enables structure-aware and region-consistent enhancement of facial details. The resulting attention guidance is applied to the texture recovery features through element-wise multiplication.
When facial landmark priors are unavailable at stage $s$, the heatmap $H^{s}$ is set to zeros, such that the adapted heatmap $H_{\mathrm{adp}}^{s}$ provides no spatial bias and the PAGM degenerates to purely feature-driven attention.
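A compact PyTorch-style rendering of the PAGM is given below. The sigmoid gating of the adapted heatmap and the SE reduction ratio are illustrative choices; when no prior is available, only the feature-driven channel attention is applied, as described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PAGM(nn.Module):
    """Prior Attention Guidance Module: SE-style channel attention learned
    from the MTEM output, combined with a spatial prior obtained by
    interpolating (I) and channel-adapting (C) the landmark heatmap."""

    def __init__(self, channels, heatmap_channels=68, reduction=16):
        super().__init__()
        self.adapt = nn.Conv2d(heatmap_channels, channels, 1)   # C: 1x1 channel adaptation
        self.se = nn.Sequential(                                # A_ch = sigma(Conv(Conv(GAP(F))))
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, feat, heatmap=None):
        out = feat * self.se(feat)                              # channel-wise reweighting
        if heatmap is not None:
            h = F.interpolate(heatmap, size=feat.shape[-2:],
                              mode="bilinear", align_corners=False)  # I: rescale to (H, W)
            out = out * torch.sigmoid(self.adapt(h))            # spatial prior H_adp
        # When no prior is passed, the module reduces to purely
        # feature-driven channel attention.
        return out
```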
Finally, the reconstruction module receives the refined feature map, restores the target spatial resolution via the upsampling operator $\mathcal{U}(\cdot)$, and maps the features to the RGB space using a convolution layer to generate the super-resolved image. During training, supervision can be applied either to the output of each intermediate stage or only to the final output.

3.4. Loss Function Design

To train the proposed MRSNet effectively, we adopt a joint optimization strategy that combines a pixel-level reconstruction loss with a heatmap-guided attention loss. The former enforces overall fidelity between the reconstructed image and the ground truth, while the latter explicitly encourages structural consistency by aligning attention-enhanced features with facial landmark priors.

3.4.1. Pixel-Level Reconstruction Loss

The pixel-level loss constrains the difference between the super-resolved (SR) image and the corresponding high-resolution (HR) ground truth. Specifically, we employ the L1 loss, which has been widely adopted in super-resolution tasks due to its robustness and its ability to preserve sharp details:
$\mathcal{L}_{\mathrm{pix}} = \frac{1}{N} \sum_{i=1}^{N} \left\| I_{SR}^{(i)} - I_{HR}^{(i)} \right\|_{1}$
where $N$ is the number of training samples, $I_{SR}^{(i)}$ is the predicted super-resolved image, and $I_{HR}^{(i)}$ is the corresponding high-resolution ground-truth image.
To explicitly incorporate facial structural priors into the feature refinement process, we introduce a heatmap-guided attention loss. This loss enforces consistency between the attention-enhanced feature representations and the adapted facial landmark heatmaps at multiple stages of the network.
$\mathcal{L}_{\mathrm{att}} = \frac{1}{S} \sum_{s=1}^{S} \left\| \mathrm{Avg}_{c}\!\left( W_{\mathrm{att}}^{s} \odot F^{(s)} \right) - H_{\mathrm{adp}}^{s} \right\|_{F}^{2}$
Here, $S$ denotes the number of stages in the network, and $\mathrm{Avg}_{c}$ denotes channel-wise average pooling. $W_{\mathrm{att}}^{s}$ is the channel attention weight at stage $s$, $F^{(s)}$ is the MTEM output feature at stage $s$, and $H_{\mathrm{adp}}^{s}$ is the adapted heatmap at the corresponding stage. The symbol $\odot$ represents element-wise multiplication.
The channel attention $W_{\mathrm{att}}^{s}$ is multiplied with the MTEM feature $F^{(s)}$ channel-wise with broadcasting, and $\|\cdot\|_{F}^{2}$ denotes the squared Frobenius norm. This loss encourages the spatial distribution of attention-enhanced features to be consistent with facial landmark priors.

3.4.2. Overall Loss

The total loss is a weighted sum of the above two components:
$\mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{pix}} \mathcal{L}_{\mathrm{pix}} + \lambda_{\mathrm{att}} \mathcal{L}_{\mathrm{att}}$
where $\lambda_{\mathrm{pix}}$ and $\lambda_{\mathrm{att}}$ are hyperparameters that control the contributions of pixel-level reconstruction and heatmap-guided attention, respectively. Minimizing $\mathcal{L}_{\mathrm{total}}$ encourages the network to generate high-quality super-resolved images while preserving key-region textures and structural consistency. In all experiments, we fix $\lambda_{\mathrm{pix}} = 1.0$ and $\lambda_{\mathrm{att}} = 0.1$. This setting balances global reconstruction fidelity and structural consistency, leading to visually realistic and geometrically coherent super-resolved face images.
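For reference, the joint objective can be assembled as in the following sketch. The per-stage tensors (stage features, attention weights, and adapted heatmaps) are assumed to be collected during the forward pass, and collapsing the adapted heatmap by channel averaging is an illustrative choice to make the two sides of the attention term comparable.

```python
import torch
import torch.nn.functional as F


def mrsnet_loss(sr, hr, stage_feats, stage_atts, stage_heatmaps,
                lambda_pix=1.0, lambda_att=0.1):
    """Joint objective: pixel-level reconstruction plus heatmap-guided
    attention consistency averaged over the S stages."""
    # Pixel-level term between I_SR and I_HR.
    l_pix = F.l1_loss(sr, hr)
    # Attention term: Avg_c(W_att * F) should match the adapted heatmap.
    l_att = sr.new_zeros(())
    for feat, att, heat in zip(stage_feats, stage_atts, stage_heatmaps):
        weighted = (att * feat).mean(dim=1, keepdim=True)   # channel-wise average
        target = heat.mean(dim=1, keepdim=True)             # collapse heatmap channels
        l_att = l_att + F.mse_loss(weighted, target)        # squared-error (Frobenius) penalty
    l_att = l_att / max(len(stage_feats), 1)
    return lambda_pix * l_pix + lambda_att * l_att
```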

4. Experiments and Results Analysis

4.1. Dataset

In the experimental part, we evaluate our method on two widely used face datasets: CelebA (Liu et al., 2015) [28] and Helen (Le et al., 2012) [29]. For both datasets, we employ OpenFace to detect 68 facial landmarks for each image, which are used as structural priors. Based on the detected landmarks, background regions are removed and each image is cropped and aligned to a face region of size 128 × 128. The corresponding low-resolution (LR) inputs are generated by bicubic downsampling to 16 × 16, resulting in an upscaling factor of ×8. For data partitioning, we use 168,800 images from CelebA for training and 1000 images for testing, while 2000 images from Helen are used for training and 50 images for testing. To ensure fair comparison and reproducibility, all experiments are conducted under a unified training protocol. No additional data augmentation strategies are applied during training.
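The LR inputs can be generated along the lines of the following sketch; note that it uses PyTorch's bicubic resampling, which may differ slightly from other bicubic implementations.

```python
import torch
import torch.nn.functional as F


def make_lr_hr_pair(aligned_face):
    """From an aligned 128x128 face crop (B, 3, 128, 128) in [0, 1],
    produce the 16x16 LR input for the x8 setting by bicubic downsampling."""
    lr = F.interpolate(aligned_face, size=(16, 16),
                       mode="bicubic", align_corners=False).clamp(0.0, 1.0)
    return lr, aligned_face


lr, hr = make_lr_hr_pair(torch.rand(4, 3, 128, 128))
print(lr.shape, hr.shape)  # torch.Size([4, 3, 16, 16]) torch.Size([4, 3, 128, 128])
```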

4.2. Implementation Details

The proposed MRSNet is adopted as the base architecture for all experiments. The network consists of a shallow feature extraction module, multiple cascaded Residual Attention-Guided Units (RAGUs), and a reconstruction module. Each RAGU integrates a Multi-Scale Texture Enhancement Module (MTEM) and a Prior Attention Guidance Module (PAGM) to collaboratively model multi-scale facial textures and incorporate facial structural priors. The MTEM constructs hierarchical receptive fields through progressive pooling and embeds a Face Detail Enhancer (FDE) at each scale to achieve differentiated enhancement of edges, textures, and facial regions.
During training, the network is optimized end-to-end using the Adam optimizer with an initial learning rate of $1 \times 10^{-4}$. A multi-step learning rate decay strategy is employed, where the learning rate is reduced by a factor of 0.5 at 10K, 20K, 40K, and 80K iterations. The total number of training iterations is set to 150K, and the batch size is fixed to 8.
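This schedule corresponds to the following PyTorch sketch, in which a single convolution stands in for the full MRSNet and random tensors stand in for a real data loader:

```python
import torch
import torch.nn.functional as F

model = torch.nn.Conv2d(3, 3, 3, padding=1)          # stand-in for MRSNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10_000, 20_000, 40_000, 80_000], gamma=0.5)

batch_size, total_iters = 8, 150_000
for it in range(total_iters):
    # Placeholder batch; in practice, sample (LR face, HR face, heatmaps)
    # from the training set.
    lr_face = torch.rand(batch_size, 3, 16, 16)
    hr_face = torch.rand(batch_size, 3, 128, 128)
    sr = F.interpolate(model(lr_face), size=(128, 128),
                       mode="bicubic", align_corners=False)
    loss = F.l1_loss(sr, hr_face)                     # pixel term only, for brevity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                  # halve the LR at 10K/20K/40K/80K
```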
The overall loss function consists of a pixel-level reconstruction loss and a heatmap-guided attention loss. The pixel-level loss adopts the L1 formulation to constrain the difference between the super-resolved output and the HR ground truth. The heatmap-guided attention loss is implemented using an L2 (Frobenius norm) formulation, enforcing consistency between attention-weighted features and adapted facial landmark heatmaps. The corresponding loss weights are fixed to 1.0 and 0.1, respectively.
All experiments are conducted on a multi-GPU platform equipped with two NVIDIA RTX 4090D GPUs (NVIDIA Corporation, Santa Clara, CA, USA; 12 GB memory each). During training, logs are recorded every 250 iterations, and quantitative evaluation is performed every 2000 iterations on the validation set, with representative visual results saved for qualitative analysis.

4.3. Comparisons with the State-of-the-Art

4.3.1. Quantitative Evaluations

To quantitatively evaluate the performance of the proposed method, we compare MRSNet with existing face super-resolution (FSR) approaches under a ×8 upscaling factor. The evaluation is conducted on the CelebA and Helen datasets, focusing on both reconstruction fidelity and perceptual similarity.
We adopt Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and LPIPS as quantitative evaluation metrics. PSNR and SSIM are widely used to assess pixel-level reconstruction accuracy and structural consistency, while LPIPS evaluates perceptual similarity based on deep feature representations and has been shown to correlate well with human visual perception.
All metrics are computed on the luminance (Y) channel for fair comparison with existing works. The quantitative results on the CelebA and Helen test sets are summarized in Table 1.
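As an illustration, PSNR on the Y channel can be computed as follows; BT.601 conversion coefficients are assumed here, and SSIM and LPIPS are computed with their standard implementations.

```python
import torch


def rgb_to_y(img):
    """BT.601 luminance from RGB in [0, 1], shape (B, 3, H, W)."""
    r, g, b = img[:, 0], img[:, 1], img[:, 2]
    return 0.257 * r + 0.504 * g + 0.098 * b + 16.0 / 255.0


def psnr_y(sr, hr):
    """PSNR evaluated on the luminance (Y) channel only."""
    mse = torch.mean((rgb_to_y(sr) - rgb_to_y(hr)) ** 2)
    return 10.0 * torch.log10(1.0 / mse)


print(float(psnr_y(torch.rand(1, 3, 128, 128), torch.rand(1, 3, 128, 128))))
```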
As shown in Table 1, the proposed MRSNet consistently achieves the highest PSNR and SSIM values on both datasets, demonstrating superior reconstruction fidelity and structural consistency. At the same time, MRSNet also obtains the lowest LPIPS scores among the compared methods, indicating improved perceptual similarity between the reconstructed images and the ground-truth images.
These results indicate that the proposed method effectively enhances perceptual quality without sacrificing pixel-level reconstruction accuracy. The performance gains can be attributed to the progressive multi-scale texture enhancement and attention-guided refinement strategy, which enables more accurate recovery of facial structures and fine-grained details.

4.3.2. Qualitative Evaluations

To further analyze the visual quality of the reconstructed face images, we provide qualitative comparisons between the proposed method and existing approaches. Representative visual results on the CelebA and Helen datasets are shown in Figure 5 and Figure 6, respectively. Compared with other methods, MRSNet produces visually cleaner face images with more coherent facial structures and fewer artifacts. Fine details around key facial components—such as the eyes, nose, mouth, and facial contours—are better preserved.
In challenging regions with rich high-frequency information, competing methods often exhibit blurred boundaries or inconsistent textures. In contrast, MRSNet maintains clearer local structures and more stable texture patterns. These qualitative improvements are mainly attributed to the proposed Multi-Scale Texture Enhancement Module (MTEM) and Face Detail Enhancer (FDE), which enable scale-adaptive and region-aware refinement of facial details.

4.3.3. Model Complexity and Efficiency

In addition to reconstruction performance, we further evaluate the model complexity and computational efficiency of different methods. Specifically, we compare the number of parameters and inference time on the CelebA dataset under a ×8 upscaling factor. The results are summarized in Table 2.
As shown in Table 2, MRSNet achieves a moderate model size and competitive inference speed compared with existing face super-resolution methods. This balance indicates that the improvements in reconstruction fidelity and perceptual quality are obtained without introducing prohibitively high computational overhead, making MRSNet suitable for practical face super-resolution applications.
Overall, the quantitative results presented in Table 1 and Table 2 demonstrate that MRSNet consistently outperforms existing methods across multiple datasets and evaluation metrics, including PSNR, SSIM, and LPIPS. These improvements validate the effectiveness of the proposed multi-scale texture enhancement and attention-guided architecture, which enables more effective exploitation of facial priors and hierarchical contextual information. The moderate increase in computational complexity of MRSNet mainly arises from its multi-stage design and attention fusion strategy, which are intended to enhance reconstruction quality.

4.4. Ablation Studies

To validate the effectiveness of the key components in our framework, we conduct a series of ablation experiments focusing on three core designs: the Face Detail Enhancer (FDE), the Multi-Scale Texture Enhancement Module (MTEM), and the multi-stage fusion attention mechanism. Quantitative results are reported in Table 3, while qualitative evaluations are provided in Figure 7 and Figure 8, where Figure 7 presents enlarged comparisons of local facial details and Figure 8 shows overall visual results under different ablation settings. All experiments are conducted under identical training settings to ensure fair comparison. G1–G6 represent progressively enhanced configurations, where FDE, MTEM, and fusion strategies are incrementally introduced.

4.4.1. Effectiveness of Face Detail Enhancer

The Face Detail Enhancer is designed to explicitly enhance fine-grained facial details, such as edges, high-frequency textures, and local structural patterns that are critical for perceptually realistic face super-resolution. In the ablation setting, this comparison corresponds to G1 (baseline without FDE) and G2 (with FDE) in Table 3, where no fusion attention mechanism is introduced, allowing the effect of FDE to be examined in isolation. To examine its effectiveness, we keep all other architectural components and training settings unchanged and only introduce FDE into the baseline model. This FDE-equipped model serves as a consistent reference configuration in all subsequent ablation comparisons involving FDE. The introduction of FDE leads to consistent improvements across both distortion-based metrics and perceptual metrics, indicating that explicit facial detail enhancement enables the network to better recover sharp contours and subtle textures. Qualitative results further show clearer boundaries around key facial structures, including the eyes, nose, and mouth, as well as reduced overall blurring, demonstrating that FDE alone provides meaningful gains in fine-detail restoration.

4.4.2. Effectiveness of Multi-Scale Texture Enhancement Module

We further investigate the contribution of MTEM, which inherently integrates FDE while extending detail enhancement to a multi-scale setting. To ensure a fair and consistent comparison, MTEM is evaluated as a strict extension of the same FDE-only configuration described above, rather than by removing FDE from its internal structure. Specifically, this comparison corresponds to G2 (FDE only, without fusion) and G3 (FDE + MTEM, with single-stage fusion FUS1) in Table 3, ensuring that the effect of MTEM is assessed under a consistent FDE-based backbone. The experimental results show that incorporating MTEM yields additional improvements beyond those achieved by the FDE-only model, particularly in perceptual quality. This suggests that while FDE effectively enhances local facial details, modeling textures across multiple spatial scales is essential for maintaining consistency among different facial regions. Visual comparisons confirm that MTEM improves the coherence of fine textures such as hair strands, eyebrows, and skin details, reducing local inconsistencies and visually unnatural artifacts. These results verify that MTEM provides complementary benefits rather than redundant enhancement.
As shown in Figure 7, the introduction of FDE and MTEM leads to clearer reconstruction of fine facial details, particularly in challenging regions such as the eyes and teeth.

4.4.3. Effectiveness of Multi-Stage Fusion Attention

Finally, we assess the necessity of the multi-stage fusion attention mechanism by comparing single-stage and multi-stage fusion strategies under otherwise identical architectures. Specifically, the single-stage fusion setting is denoted as FUS1 (e.g., G2 or G3), while the proposed multi-stage fusion corresponds to FUS2, as instantiated in G4 and G6 in Table 3. In both cases, the underlying feature extraction backbone—including the same FDE or MTEM configurations—is kept unchanged, ensuring that the observed differences can be solely attributed to the fusion strategy. The multi-stage design progressively injects attention guidance at multiple depths, allowing facial priors and attention cues to continuously influence the restoration process. Compared with single-stage fusion, multi-stage fusion consistently yields better perceptual performance and more stable visual results, indicating that progressive attention injection effectively mitigates the attenuation of guidance in deep networks.
These ablation experiments demonstrate that each component plays a distinct and necessary role. The Face Detail Enhancer improves fine-detail reconstruction, the Multi-Scale Texture Enhancement Module further enhances texture consistency through multi-scale context modeling as a strict extension of FDE, and the multi-stage fusion attention mechanism ensures sustained guidance across network depth. Their combined effect, as demonstrated by the G6 configuration, leads to more stable and perceptually convincing face super-resolution results.
To provide a more intuitive understanding of the role of different components in the proposed framework, attention heatmaps are visualized for face images under several ablation configurations. Attention heatmaps reflect the spatial distribution of feature responses and indicate which regions are emphasized by the network during the super-resolution process. In face super-resolution, such visualizations are particularly useful for analyzing whether the model effectively focuses on structurally important facial regions, thereby offering interpretability beyond quantitative metrics.
Figure 9 presents the attention heatmaps obtained under five different processing configurations, corresponding to the base model and progressively enhanced variants involving FDE, MTEM, and two fusion strategies. Rather than depicting pixel-level reconstruction results, these heatmaps illustrate how the network’s attention distribution evolves across different architectural settings. The comparison demonstrates that the incorporation of dedicated detail modeling, multi-scale texture enhancement, and hierarchical fusion mechanisms leads to more structured and coherent attention patterns, suggesting that the network is increasingly guided toward semantically meaningful facial regions as the framework becomes more complete.
Overall, the attention heatmap visualization qualitatively supports the effectiveness of the proposed design choices. It shows that our strategy contributes to a more informed and consistent allocation of attention during feature reconstruction.
Specifically, G1–G6 are designed as progressively enhanced configurations, where components are incrementally introduced to isolate their individual and combined effects.
The qualitative results in Figure 8 further confirm that multi-stage fusion produces more stable and visually coherent face super-resolution results. In addition, Figure 10 illustrates the PSNR and SSIM curves of different ablation configurations during training, providing further insight into the impact of each component on convergence behavior.

5. Conclusions

In this paper, we propose a novel face super-resolution network, MRSNet. The overall framework adopts a modular stacking strategy, where facial heatmap priors are effectively integrated into the feature extraction and recovery process under the guidance of structural prior attention, enabling progressive detail reconstruction. Unlike traditional methods that directly utilize heatmaps, we transform facial landmark heatmaps into attention-guided information to emphasize key regions such as the eyes, nose, and mouth, thereby enhancing structural consistency and detail realism. In the feature enhancement stage, we design a Multi-Scale Texture Enhancement Module (MTEM), within which a Face Detail Enhancer (FDE) is embedded to refine high-frequency facial details at each scale. Through multi-stage stacking and iterative refinement, MRSNet is capable of generating face super-resolution results with clear structure, natural textures, and realistic details.
Although the proposed MRSNet demonstrates strong performance in both reconstruction fidelity and perceptual quality, it still has several limitations. First, the multi-stage stacking strategy and the integration of multi-scale texture enhancement increase the overall training complexity. Compared with simpler single-stage architectures, MRSNet requires longer training time and more careful hyper-parameter tuning to ensure stable convergence.
Second, while the proposed model achieves a reasonable balance between performance and efficiency, its parameter count and inference time remain higher than those of lightweight face super-resolution networks. As a result, direct deployment in real-time or resource-constrained environments, such as mobile or embedded platforms, may be challenging.
In addition, due to differences in implementation frameworks and hardware environments, we primarily report parameter count and inference time as practical indicators of efficiency, while a comprehensive FLOPs comparison is not fully explored in the current study. Further investigation into fine-grained computational cost analysis is left for future work.
Finally, the current implementation focuses on maximizing reconstruction quality rather than minimizing computational cost. Future work will explore more efficient architectures, such as parameter sharing across stages, network pruning, or lightweight variants, to preserve the advantages of multi-stage heatmap fusion and multi-scale texture enhancement while improving inference efficiency and practical deployability in real-world applications.

Author Contributions

Conceptualization, L.G.; methodology, B.X.; data curation, L.W.; writing—original draft preparation, L.W.; writing—review and editing, X.L. and Y.W.; funding acquisition, B.X. and X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the Natural Science Foundation of Anhui Province (2308085Y02) and the National Natural Science Foundation of China (61871003).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Bing Xu was employed by China Communication Information Technology Group Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014; pp. 184–199. [Google Scholar]
  2. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  3. Wang, X.; Liu, Y.; Zhang, Z.; Tang, Y. Spatial-Frequency Mutual Learning for Face Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, 18–22 June 2023; pp. 8793–8802. [Google Scholar]
  4. Tsai, Y.-H.; Lin, C.-H.; Liu, Z. Arbitrary-Resolution and Arbitrary-Scale Face Super-Resolution with Implicit Representation Networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–7 January 2024; pp. 1225–1234. [Google Scholar]
  5. Yang, D.; Wei, Y.; Hu, C.; Yu, X.; Sun, C.; Wu, S.; Zhang, J. Multi-Scale Feature Fusion and Structure-Preserving Network for Face Super-Resolution. Appl. Sci. 2023, 13, 8928. [Google Scholar] [CrossRef]
  6. Liu, X.; Li, Y.; Gu, M.; Zhang, H.; Zhang, X.; Wang, J.; Lv, X.; Deng, H. SwinDPSR: Dual-Path Face Super-Resolution Network Integrating Swin Transformer. Symmetry 2024, 16, 511. [Google Scholar] [CrossRef]
  7. Kim, J.; Lee, J.K.; Lee, K.M. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1637–1645. [Google Scholar]
  8. Li, H.; Xiong, J.; Liu, J.; Wang, X. Feedback network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3862–3871. [Google Scholar]
  9. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 294–310. [Google Scholar]
  10. Chen, X.; Tao, H.; Zhou, H.; Zhou, P.; Deng, Y. Hierarchical and Progressive Learning with Key Point Sensitive Loss for Sonar Image Classification. Multimed. Syst. 2024, 30, 380. [Google Scholar] [CrossRef]
  11. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
  12. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Loy, C.C. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision Workshops (ECCVW), Munich, Germany, 8–14 September 2018; pp. 63–79. [Google Scholar]
  13. Ren, X.; Hui, Q.; Zhao, X.; Xiong, J.; Yin, J. BESRGAN: Boundary equilibrium face super-resolution generative adversarial networks. IET Image Process. 2023, 17, 1784–1796. [Google Scholar] [CrossRef]
  14. Cheng, X.; Siu, W.-C. Edge Fusion Back Projection GAN for Large-Scale Face Super-Resolution. J. Vis. Commun. Image Represent. 2024, 100, 104143. [Google Scholar] [CrossRef]
  15. Varanka, T.; Toivonen, T.; Tripathy, S.; Zhao, G.; Acar, E. PFStorer: Personalized Face Restoration and Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 2372–2381. [Google Scholar]
  16. Zhu, S.; Liu, S.; Loy, C.C.; Tang, X. Deep cascaded bi-network for face hallucination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 614–622. [Google Scholar]
  17. Yu, X.; Porikli, F. Ultra-resolving face images by discriminative generative networks. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 318–333. [Google Scholar]
  18. Chen, Y.; Tai, Y.; Liu, X.; Shen, C.; Yang, J. FSRNet: End-to-end learning face super-resolution with facial priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 2492–2501. [Google Scholar]
  19. Kim, D.; Cho, M.; Kwak, S. Progressive face super-resolution via attention to facial landmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 476–485. [Google Scholar]
  20. Ma, C.; Jiang, X.; Rao, Y.; Lu, J.; Zhou, J. Deep face super-resolution with iterative collaboration between attentive recovery and landmark estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 5569–5578. [Google Scholar]
  21. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  22. Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; Tang, X. Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3156–3164. [Google Scholar]
  23. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  24. Cao, Y.; Ma, X.; Zhang, Z.; Ding, Y.; Xie, H.; Zeng, X. Attention-aware face hallucination via deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 690–699. [Google Scholar]
  25. Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent models of visual attention. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Montreal, Canada, 8–13 December 2014; pp. 2204–2212. [Google Scholar]
  26. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning (ICML), Lille, France, 7–9 July 2015; pp. 2048–2057. [Google Scholar]
  27. Chen, L.; Zhang, H.; Xiao, J.; Nie, L.; Shao, J.; Liu, W.; Chua, T.-S. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6298–6306. [Google Scholar]
  28. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 3730–3738. [Google Scholar]
  29. Le, V.; Brandt, J.; Lin, Z.; Bourdev, L.; Huang, T.S. Interactive facial feature localization. In Proceedings of the European Conference on Computer Vision (ECCV), Florence, Italy, 7–13 October 2012; pp. 679–692. [Google Scholar]
Figure 1. Overview of the proposed MRSNet architecture.
Figure 2. Overview of the proposed RAGU architecture.
Figure 3. Overview of the proposed MTEM architecture.
Figure 4. Overview of the proposed FDE architecture.
Figure 5. Qualitative comparison on the CelebA dataset. Our proposed MRSNet demonstrates significant advantages in handling artifacts and details around facial features. The qualitative results indicate that the proposed method outperforms other FSR approaches. Please zoom in to observe the differences.
Figure 6. Qualitative comparisons on the Helen dataset.
Figure 7. Visual comparison of local facial details across different ablation settings. Red boxes indicate the regions selected for enlargement, and the corresponding zoomed-in patches highlight differences in eye and teeth reconstruction among different G configurations.
Figure 8. Qualitative comparison of face super-resolution results under different ablation settings.
Figure 9. Comparison of attention heatmaps obtained with different processing stages.
Figure 10. Ablation study of MRSNet. (a) PSNR curves of the different ablation configurations during training. (b) SSIM curves of the different ablation configurations during training.
Table 1. Quantitative Comparisons of Ours and Existing FSR Methods for ×8 FSR on CelebA and Helen Test Sets.

Method      CelebA (PSNR/SSIM/LPIPS)    Helen (PSNR/SSIM/LPIPS)
Bicubic     23.58/0.6285/0.2669         23.89/0.6751/0.2760
SRResNet    25.82/0.7369/0.2489         25.30/0.7269/0.2501
ESRGAN      26.52/0.7535/0.1953         26.78/0.7866/0.1984
RCAN        26.40/0.7248/0.2205         26.37/0.7362/0.3437
DICNet      27.35/0.7867/0.2029         26.71/0.7856/0.2158
MLGNet      27.34/0.7915/0.2351         26.78/0.7753/0.2482
MRSNet      27.41/0.7983/0.1807         26.94/0.7994/0.1918
Table 2. Comparisons of Model Complexity on ×8 CelebA.

Methods     #Params (M)    Inference Time (s)
SRResNet    1.5            0.015
ESRGAN      16.7           0.085
RCAN        15.85          0.093
DICNet      21.8           0.054
DICGAN      26.3           0.079
MLG         22.5           0.047
MRSNet      23.5           0.061
Table 3. Ablation study on the effectiveness of different components in MRSNet. In the table, “√” indicates the module/setting is used, and “×” indicates it is not used.

Methods    FDE    MTEM    FUS1    FUS2    PSNR     SSIM      LPIPS
G1         ×      ×       ×       ×       24.52    0.6828    0.5351
G2         √      ×       ×       ×       25.11    0.7084    0.3703
G3         √      ×       √       ×       26.12    0.7522    0.3465
G4         √      ×       ×       √       26.92    0.7855    0.2273
G5         √      √       √       ×       26.99    0.7891    0.1964
G6         √      √       √       √       27.41    0.7983    0.1918
