Lightweight CNN–Transformer Hybrid Network for Efficient Face Super-Resolution

Liu, Ao-Lin; Xu, Yi-Han; Zhou, Wen

doi:10.3390/app16126221


by Ao-Lin Liu ¹,*, Yi-Han Xu ¹ and Wen Zhou ²

¹ College of Information Science and Technology & Artificial Intelligence, Nanjing Forestry University, Nanjing 210037, China
² College of Low Altitude Equipment and Intelligent Control, Guangzhou Maritime University, Guangzhou 510725, China
* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(12), 6221; https://doi.org/10.3390/app16126221 (registering DOI)

Submission received: 17 May 2026 / Revised: 8 June 2026 / Accepted: 15 June 2026 / Published: 20 June 2026


Abstract

Face super-resolution (FSR) aims to reconstruct high-quality high-resolution face images from low-resolution inputs. Although CNN–Transformer hybrid models have shown promising performance by jointly modeling local textures and global dependencies, their large parameter sizes and high computational costs hinder practical deployment in resource-constrained scenarios such as mobile devices and embedded systems. Meanwhile, existing lightweight SR models usually reduce complexity by simplifying network depth, channel dimensions, or convolutional operations, which may weaken feature representation capability and lead to insufficient recovery of fine facial structures. To address these issues, this paper proposes HCTIUNet, a lightweight CNN–Transformer hybrid network based on an inverted U-shaped architecture. Specifically, the proposed network integrates lightweight CNN branches for local facial texture extraction and Transformer branches for global dependency modeling, while introducing a multi-scale feature interaction strategy and a global feature refinement module to enhance facial structural details. Experimental results on the FFHQ, CelebA, and Helen datasets demonstrate that HCTIUNet achieves competitive performance under the ×8 face super-resolution setting, obtaining PSNR/SSIM/LPIPS values of 27.55 dB/0.765/0.225, 27.63 dB/0.761/0.212, and 27.53 dB/0.777/0.213, respectively. Moreover, HCTIUNet contains 10.5 M parameters, requires 9.9 G FLOPs, and achieves an inference time of 0.021 s. These results indicate that the proposed method achieves a favorable trade-off between reconstruction accuracy, perceptual quality, and computational efficiency, making it suitable for efficient face super-resolution applications.

Keywords: hybrid models; lightweight; face super-resolution; HCTIUNet

1. Introduction

Face super-resolution (FSR) is a specialized subfield of the broader super-resolution (SR) task, first introduced by Baker et al. [1]. It focuses on recovering high-resolution (HR) face images from their low-resolution (LR) counterparts, with a particular emphasis on crucial facial information and fine details. Compared to generic SR tasks, FSR places greater emphasis on the reconstruction of facial structures, which demands not only the restoration of high-fidelity details (such as skin textures and facial contours) but also the preservation of identity features. Traditional interpolation methods (e.g., bicubic interpolation) and early CNN-based models (e.g., SRCNN [2]) tend to produce blurry images, falling short of the high-fidelity demands of modern applications. The advent of deep learning has brought a major breakthrough to FSR, enabling models to learn the complex mapping from low-resolution to high-resolution spaces.

The core objective of model lightweighting is to effectively reduce computational complexity and resource consumption while preserving model performance. This specifically entails reducing the number of parameters, decreasing computational workloads, accelerating inference efficiency, and optimizing underlying operators alongside network structure designs. Centering around these goals, existing research has primarily advanced along two technical paths: lightweight neural network design and model compression methods. Characterized by their small footprints, rapid inference speeds, and deployment-friendly nature, lightweight neural networks focus on enhancing model efficiency from a structural perspective, mitigating redundant computation through the introduction of efficient operators and modules. For instance, techniques such as Depthwise Separable Convolution, Group Convolution, and Channel Shuffle are widely employed to substantially reduce parameter counts and computational overhead. Furthermore, by designing lightweight feature extraction modules and attention mechanisms, these approaches enhance information utilization efficiency without sacrificing representation capability.
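The saving from depthwise separable convolution mentioned above can be checked with simple arithmetic. The following is a minimal sketch, assuming bias-free layers and an illustrative 3 × 3 layer with 64 input and 64 output channels; the function names are hypothetical and are not taken from the HCTIUNet implementation.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weight count of a depthwise k x k conv followed by a pointwise 1 x 1 conv."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 conv that mixes channels
    return depthwise + pointwise


standard = conv_params(64, 64, 3)
separable = depthwise_separable_params(64, 64, 3)
print(standard, separable, round(standard / separable, 2))  # 36864 4672 7.89
```

Under these assumptions the separable layer needs roughly one eighth of the weights of the standard layer, which is the kind of reduction such operators are used for.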

Although existing CNN-, Transformer-, and CNN–Transformer-based SR methods have achieved remarkable progress, their designs still present several limitations for lightweight face super-resolution. Attention-based CNN methods such as RCAN [3] mainly enhance channel-wise feature representation but have limited ability to explicitly model long-range facial dependencies. CNN–Transformer aggregation methods such as SCTANet [4] improve local–global feature interaction, but their feature aggregation structures may introduce considerable computational burden. Meanwhile, many lightweight SR models reduce parameters by simplifying network depth or channel dimensions, which may weaken the recovery of fine facial structures. To address these issues, this paper proposes HCTIUNet, a lightweight CNN–Transformer hybrid network that integrates local texture extraction, global dependency modeling, multi-scale feature interaction, and feature refinement in a unified framework.

Regarding the network design in this paper, the core architecture is constructed by stacking multiple Lightweight CNN–Transformer Blocks (LPCTBs) to form an inverted U-shaped structure. Distinct from the traditional symmetric U-Net, the designed inverted U-Net structure (IUNet) prioritizes progressive feature enhancement and the reconstruction process, ensuring elevated feature representation efficiency within a lightweight design. Within this IUNet framework, the LPCTB serves as the fundamental building block. Its CNN branch is primarily responsible for local feature modeling and detail extraction, whereas the Transformer branch is dedicated to capturing global contextual information and long-range dependencies. To further mitigate computational overhead, the LPCTB adopts a lightweight design strategy that optimizes both convolutional operations and attention mechanisms, enabling the model to substantially reduce parameter counts and computational complexity while preserving its representation capability. Finally, the outputs from both branches are integrated via a feature fusion operation, thereby achieving collaborative modeling of local details and global structural information.

Following the backbone network, a refinement module is introduced to further polish the fused features, leveraging global information to refine edges and textural details while enhancing the discriminative capacity and representation quality of the reconstructed features. Subsequently, an upsampling module progressively restores the spatial resolution of the image to generate the final high-resolution output. While preserving the core concepts of the baseline model, HCTIUNet reduces parameter counts and computational complexity by introducing lightweight convolutional structures, optimizing attention mechanisms, and simplifying feature fusion strategies. Meanwhile, its structural layout limits the performance degradation caused by lightweighting, yielding a favorable balance between model efficiency and reconstruction quality.

In general, the main contributions of this paper are summarized as follows:

To address the high computational cost and large parameter size of existing CNN–Transformer-based face super-resolution methods, we propose HCTIUNet, a lightweight hybrid network that integrates CNN-based local feature extraction and Transformer-based global dependency modeling within a unified framework.
To improve local–global feature interaction under a lightweight design, we construct an inverted U-shaped architecture composed of lightweight CNN–Transformer interaction blocks. This structure enables progressive multi-scale feature exchange and enhances the representation of facial textures and structural information.
To alleviate the limited feature representation capability of existing lightweight SR models, we design a lightweight CNN–Transformer interaction block, in which the CNN branch extracts local facial details while the Transformer branch captures long-range contextual dependencies, thereby achieving complementary local and global feature modeling.
To enhance the reconstruction of fine facial structures and reduce the loss of detail caused by lightweight processing, we introduce multi-scale feature fusion and global feature refinement mechanisms. These modules further improve the representation of key facial regions and enhance reconstruction quality.

It should be emphasized that the methodological contribution of HCTIUNet does not lie in treating each elementary operation as completely new. Operations such as depthwise convolution, residual connection, channel/spatial attention, MDTA-like attention, and multi-scale fusion have been widely used in previous image restoration and super-resolution studies. The main contribution of this work lies in how these operations are reorganized and adapted within an inverted U-shaped CNN–Transformer framework for efficient face super-resolution.

2. Related Work

2.1. CNN-Based Face Super-Resolution Methods

Early studies investigated face hallucination from different perspectives; with the development of deep learning, convolutional neural networks (CNNs) have become the dominant framework for image and face super-resolution. Representative CNN-based SR models, such as VDSR [5], demonstrated that deeper convolutional networks can effectively learn the nonlinear mapping between low-resolution and high-resolution image spaces. In the field of FSR, several methods further introduced facial structure modeling and attention mechanisms to improve the reconstruction of key facial regions. For example, deep cascaded bi-network methods [6], region-based CNN methods [7], and facial-prior-guided networks such as FSRNet [8] exploit facial structures or region information to enhance reconstruction quality. Although CNN-based methods are effective in extracting local textures and restoring fine image details, their receptive fields are usually limited by convolutional operations. As a result, they may have difficulty modeling long-range dependencies and global facial structures.

2.2. Transformer-Based and CNN–Transformer Hybrid Methods

Transformer architectures have shown strong capability in modeling long-range dependencies. The original Transformer was proposed for sequence modeling tasks [9], and later vision Transformer variants, such as Swin Transformer [10], introduced hierarchical and window-based self-attention mechanisms for visual representation learning. Inspired by these advances, Transformer-based and CNN–Transformer hybrid models have been increasingly applied to image restoration and face super-resolution tasks. Compared with pure CNN models, Transformer-based methods can capture broader contextual information and global structural relationships, which is beneficial for recovering coherent facial structures.

Recent CNN–Transformer hybrid methods attempt to combine the local texture extraction capability of CNNs with the global dependency modeling ability of Transformers. For example, Yoo et al. [11] proposed an Enriched CNN-Transformer Feature Aggregation Network (ECT), which effectively integrates convolutional and transformer features through a dedicated aggregation strategy, achieving competitive performance in image super-resolution tasks. Zhao and Zhang [12] proposed the Semantic Attention Adaptation Network (SAAN), which leverages semantic attention to adaptively emphasize important facial regions and improve the reconstruction of facial structures in face super-resolution tasks. Nevertheless, Transformer-based and CNN–Transformer hybrid models often introduce considerable computational cost due to attention operations, complex feature aggregation, or stacked network blocks. This limits their practical application in resource-constrained scenarios, such as mobile devices and embedded systems. Therefore, it is necessary to design a more efficient CNN–Transformer interaction mechanism that preserves global modeling capability while reducing model complexity.

2.3. Prior-Guided FSR Methods

Generative adversarial networks (GANs) have been widely explored in image super-resolution and face restoration because of their ability to generate visually realistic and perceptually pleasing details. For example, adversarial residual learning has been introduced into image super-resolution to improve perceptual quality and high-frequency detail reconstruction [13]. In the broader field of face restoration, GAN-based and perceptual optimization strategies have also been widely discussed as important approaches for improving visual realism [14]. Prior-guided methods use facial landmarks, parsing maps, identity information, or multi-scale facial priors to constrain the reconstruction process and improve the fidelity of facial components [15]. These methods are particularly effective in recovering semantically important regions, such as eyes, mouths, and facial contours.

However, GAN-based methods also have several limitations. First, adversarial training is often unstable and sensitive to hyperparameter settings. Second, GAN-based models may hallucinate unrealistic or identity-inconsistent details, which is undesirable for face-related applications that require faithful reconstruction. Third, the introduction of discriminators, perceptual networks, or additional identity constraints increases training complexity and computational cost. Moreover, prior-guided methods often require additional annotations, auxiliary networks, or carefully designed supervision signals, which may increase training complexity and reduce deployment flexibility. Therefore, designing a lightweight model that can enhance both local facial details and global structural consistency without relying heavily on external priors remains an important problem.

2.4. Lightweight Super-Resolution Methods

Lightweight super-resolution aims to reduce parameter size, computational cost, and inference latency while maintaining acceptable reconstruction performance. Several efficient SR models achieve lightweight design by using compact convolutional structures, residual feature distillation, channel reduction, or efficient feature reuse. For example, RFDN [16] uses residual feature distillation to improve feature utilization under a compact network design. Laplacian pyramid-based SR methods [17] improve reconstruction efficiency through progressive upsampling. In addition, lightweight convolutional strategies inspired by efficient neural network designs, such as depthwise separable convolution and compact feature extraction, have been widely used to reduce computational overhead.

Although lightweight SR models are efficient, excessive simplification of network depth, channel dimensions, or feature extraction modules may weaken feature representation capability. In face super-resolution, this limitation is more significant because facial images contain fine-grained structures and identity-related details. Lightweight models may produce over-smoothed results or fail to accurately reconstruct key facial components. Therefore, the key challenge is to achieve a better trade-off between reconstruction quality and computational efficiency.

To address these limitations, this paper proposes HCTIUNet, a lightweight CNN–Transformer hybrid network for efficient face super-resolution. Different from existing methods, HCTIUNet integrates lightweight CNN-based local feature extraction, Transformer-based global dependency modeling, inverted U-shaped multi-scale feature interaction, and global feature refinement into a unified framework. The proposed design aims to improve facial detail reconstruction and structural consistency while maintaining computational efficiency.

3. Materials and Methods

3.1. Overall Architecture of HCTIUNet

As shown in Figure 1, the proposed lightweight CNN–Transformer hybrid network for face super-resolution, HCTIUNet, consists of three components: a shallow feature extraction module, an inverted U-shaped backbone, and an image reconstruction module. The input low-resolution image is first processed by a 3 × 3 convolutional layer for shallow feature extraction. The extracted shallow features are then fed into the inverted U-shaped backbone, which consists of multiple lightweight CNN–Transformer interaction blocks (LPCTBs). Each LPCTB contains a CNN branch for local facial texture extraction and a Transformer branch for global dependency modeling. The CNN branch adopts convolutional layers with a kernel size of 3 × 3 and uses a channel reduction–expansion strategy to reduce computational cost. The Transformer branch adopts the proposed MCT module, where the embedding dimension, number of attention heads, and depthwise convolution settings are fixed throughout the network.

The MFEU module receives multi-scale features from different stages of the backbone. Before feature fusion, adaptive pooling is used to align the spatial resolution, and 1 × 1 convolution is used to align the channel dimension. The aligned features are then fused by a dynamic weighting mechanism and further refined by spatial attention. The GFRB module consists of interleaved local feature extraction units and Transformer units, which are used to refine the deep features before image reconstruction. Finally, the upsampling module restores the spatial resolution through sub-pixel convolution, followed by a 3 × 3 convolutional layer to generate the final super-resolved image.

We assume that the input low-resolution face image is $I_{LR}$ and that the model's output super-resolved image is $I_{SR}$. The shallow feature extraction module performs preliminary encoding of the input low-resolution image, using a 3 × 3 convolution to extract basic texture information, thereby providing input for subsequent deep feature modeling:

$$F_{sf} = H_{conv}^{3 \times 3}(I_{LR}) \tag{1}$$

The proposed IUNet architecture is a multi-level encoder–decoder comprising multiple efficient CNN–Transformer interaction blocks. It extracts global features along the channel dimension and local features at different network stages, while balancing performance and speed. The process can be represented as:

$$F_{0} = F_{sf} \tag{2}$$

$$F_{i} = H_{LPCTB}^{(i)}(F_{i-1}), \quad i = 1, 2, \dots, N \tag{3}$$

$$F_{N} = H_{IU}(F_{sf}) + F_{sf} \tag{4}$$

where $F_{sf}$ denotes the shallow features produced by the shallow feature extraction module, $F_{i-1}$ denotes the output features of the (i − 1)-th LPCTB module, $F_{i}$ denotes the output features of the i-th LPCTB module, and $H_{LPCTB}^{(i)}$ denotes the nonlinear mapping function corresponding to the i-th LPCTB module. Each LPCTB consists of a CNN branch and a Transformer branch. $H_{IU}$ represents the IUNet module, and $F_{N}$ denotes the final deep feature representation.

Following the backbone network, a feature refinement module is introduced to further enhance the deep features. The extracted deep feature representation $F_{N}$ is fed into the refinement module to model long-range dependencies between pixels in the feature maps. The process can be represented as:

$$F_{R} = H_{ref}(F_{N}) \tag{5}$$

where $F_{R}$ denotes the output of the refinement module, and $H_{ref}$ denotes the nonlinear mapping function corresponding to the feature refinement module. Finally, the image spatial resolution is gradually restored through the upsampling module, and the final high-resolution image is generated using a convolutional layer:

$$F_{u} = F_{up}(F_{R}) \tag{6}$$

$$I_{SR} = H_{conv}^{3 \times 3}(F_{u}) \tag{7}$$

Here, $F_{u}$ represents the output of the upsampling stage, and $F_{up}$ denotes the sub-pixel convolution layer.

3.2. IUNet

In recent years, some studies have attempted to incorporate Transformers into the UNet [18] architecture to enhance the network's ability to model global information and thereby improve overall performance. However, because the number of feature channels keeps increasing, integrating Transformers with UNet often results in high parameter counts and computational overhead. Furthermore, most existing lightweight SR models only utilize global features at the end of the model or even disregard the importance of global features, thereby limiting SR performance. To address this issue and achieve effective integration of local features and global information at different network stages, this paper designs a lightweight inverted UNet (IUNet) architecture and proposes an efficient CNN–Transformer interaction block (LPCTB) to alternately extract local and global features, thereby maintaining feature stability from multiple perspectives.

IUNet adopts a multi-stage encoder–decoder framework that integrates local details and global context from different stages across various scales through layer-by-layer feature interaction and fusion, thereby constructing more discriminative hybrid feature representations. Compared to traditional architectures (such as EDSR), the proposed IUNet achieves a better balance between performance and computational efficiency. IUNet utilizes sub-pixel convolution and inverse sub-pixel convolution for upsampling and downsampling. During the encoding phase, it progressively reduces the number of feature channels while expanding the spatial resolution of the feature maps; during the decoding phase, it progressively restores the spatial scale and increases the channel dimension. Through this "channel reduction, spatial expansion, recovery" structural design, combined with a cross-stage feature fusion strategy, the model maintains strong feature representation capabilities while effectively reducing the parameter size. This improves computational efficiency without compromising reconstruction performance. Additionally, the encoder of IUNet employs Cross-MDTA to achieve adaptive selection and fusion of cross-layer features, enabling the network to dynamically focus on key regions. Simultaneously, the proposed MFEU module is introduced to perform multi-scale enhancement on the fused features, further improving detail reconstruction capabilities. The combination of these two approaches achieves synergistic optimization through "cross-layer alignment + multi-scale refinement."
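The rearrangement behind sub-pixel and inverse sub-pixel convolution can be sketched in a few lines. The example below is an illustrative pure-Python version that operates on nested lists and shows only the channel-to-space rearrangement, not the convolution that normally precedes it; the channel ordering follows the common pixel-shuffle convention, which is an assumption here rather than a detail taken from the paper.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r)."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    assert c_in % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for y in range(h):
                    for z in range(w):
                        out[c][y * r + i][z * r + j] = src[y][z]
    return out


def pixel_unshuffle(x, r):
    """Inverse rearrangement: (C, H*r, W*r) back to (C*r*r, H, W)."""
    c_in, hr, wr = len(x), len(x[0]), len(x[0][0])
    assert hr % r == 0 and wr % r == 0, "spatial size must be divisible by r"
    h, w = hr // r, wr // r
    out = []
    for c in range(c_in):
        for i in range(r):
            for j in range(r):
                out.append([[x[c][y * r + i][z * r + j] for z in range(w)]
                            for y in range(h)])
    return out


# Four 1 x 1 channels become one 2 x 2 channel, and the inverse restores them.
x = [[[0]], [[1]], [[2]], [[3]]]
up = pixel_shuffle(x, 2)
print(up)                       # [[[0, 1], [2, 3]]]
print(pixel_unshuffle(up, 2))   # [[[0]], [[1]], [[2]], [[3]]]
```

Because the operation only moves values, the number of elements is unchanged: doubling the spatial size divides the channel count by four, and the inverse does the opposite.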

In the proposed IUNet architecture, the encoder is primarily responsible for hierarchical feature learning and multi-scale feature construction, providing rich structural information and contextual prior knowledge for the subsequent decoding stage. The encoding and decoding stages have the same number of LPCTBs (2 LPCTBs). As shown in Figure 2, each encoding stage includes a specially designed LPCTB module and an upsampling block. The LPCTB module achieves a joint representation of local texture details and global structural information through the collaborative modeling of CNN and Transformer branches. Assuming the feature map input to IUNet is $X \in \mathbb{R}^{H \times W \times C}$, the input features first undergo feature extraction via the LPCTB module, as follows:

$$F_{1} = H_{LPCTB}^{(1)}(X) \tag{8}$$

Building on this, the Cross-MDTA module is introduced to further enhance the features. In the encoder stage, this module degenerates into a self-attention form, modeling only the long-range dependencies within the input features to enhance the discriminative power of the feature representations:

$$F_{1}' = A_{se}(F_{1}) \tag{9}$$

where $A_{se}$ represents the Cross-MDTA block. Subsequently, upsampling is performed to spatially rearrange the features, achieving a resolution transformation:

$$F_{1} = H_{up}(F_{1}') \tag{10}$$

In subsequent stages, each encoder stage doubles the spatial size of the feature map and, consistent with the sub-pixel rearrangement, reduces the number of feature channels to one quarter, yielding multi-scale feature representations of dimensions 2H × 2W × C/4 and 4H × 4W × C/16. Here, H and W denote the spatial height and width of the shallow feature map extracted from the LR input image. In our ×8 face super-resolution setting, all HR images are resized to 128 × 128, and the corresponding LR inputs are 16 × 16. Therefore, H and W are set to 16 in the main experimental setting, and the base channel number C is set to 64. Meanwhile, the LPCTB module between the encoding and decoding stages further refines and enhances all encoded features aggregated there. This allows the model to attend to more facial structures and to strengthen feature responses in different facial regions, which improves the discriminative power of the feature representations and provides richer and more stable semantic information for subsequent high-resolution reconstruction.
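The stated sizes can be verified with a short shape calculation. This is a minimal sketch, assuming each encoder stage changes resolution purely by a ×2 sub-pixel rearrangement with no additional channel-changing layer; `encoder_shapes` is a hypothetical helper, not part of the authors' code.

```python
def encoder_shapes(h, w, c, stages, r=2):
    """Track (H, W, C) through successive x r sub-pixel upsampling stages."""
    shapes = [(h, w, c)]
    for _ in range(stages):
        assert c % (r * r) == 0, "channels must be divisible by r*r"
        h, w, c = h * r, w * r, c // (r * r)
        shapes.append((h, w, c))
    return shapes


shapes = encoder_shapes(16, 16, 64, stages=2)
print(shapes)                               # [(16, 16, 64), (32, 32, 16), (64, 64, 4)]
print([h * w * c for h, w, c in shapes])    # [16384, 16384, 16384]
```

With H = W = 16 and C = 64, the two stages give 32 × 32 × 16 and 64 × 64 × 4, matching 2H × 2W × C/4 and 4H × 4W × C/16, and the total number of feature elements stays constant.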

The decoding stage primarily focuses on feature utilization. Through downsampling operations, it gradually restores the original spatial resolution, that is, it reduces the spatial dimensions and expands the channel dimensions. During the decoding process at each layer, features are first rearranged via a downsampling block to restore the spatial resolution and enhance channel expressiveness. Subsequently, a cross-layer attention module (Cross-MDTA) is introduced, using features from the corresponding layer in the encoding stage as K (Key) and V (Value), and decoded features as Q (Query), to achieve adaptive fusion of cross-layer features and guide the recovery of structural information. The fused features are further fed into the Multi-scale Fusion Enhancement Unit (MFEU). The MFEU integrates feature responses from different receptive fields to enhance the expression of fine-grained texture information and key facial regions. The decoder then feeds the intermediate latent features into the LPCTB module to further model the enhanced features, enabling the synergistic optimization of local detail and global structural information. In this way, all local and global features at different scales can be fully utilized to reconstruct high-quality facial images. At the end of the decoding stage, the network uses a 3 × 3 convolutional layer to transform the learned features into the final SR features.

In the CNN branch of the LPCTB, this paper designs a CNN-based Local Face Semantic Information Extraction Block (LFIEB) to extract local features, enabling efficient capture of local feature representations. As shown in Figure 3, the designed LFIEB adopts a phased feature extraction strategy, gradually modeling local texture information through two-level convolutional operations, and introduces residual connections to fuse shallow and deep features, thereby enhancing the completeness and stability of feature representations. To reduce model complexity, the LFIEB employs a channel compression strategy at the front end of the module to halve the number of feature channels (the channel reduction ratio is set to 2), reducing computational overhead, and then restores the channel dimension at the end of the module, thereby achieving a balance between lightweight design and expressive capability. After feature extraction is complete, channel attention and spatial attention [19] mechanisms are introduced at the end of the module to adaptively re-calibrate the features, highlighting key regional information while suppressing redundant information. Finally, the attention-weighted features are fused to serve as the output of LFIEB.
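The effect of the channel compression strategy on model size can be estimated with a rough weight count. The sketch below assumes, purely for illustration, that compression and restoration are done with bias-free 1 × 1 convolutions around two 3 × 3 convolutions; the text does not specify these layers, so the numbers indicate the trend rather than the actual LFIEB size.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def plain_two_conv(c, k=3):
    """Two k x k convolutions applied at full channel width."""
    return 2 * conv_params(c, c, k)


def compressed_two_conv(c, ratio=2, k=3):
    """1 x 1 squeeze, two k x k convolutions at reduced width, 1 x 1 expand."""
    mid = c // ratio
    squeeze = conv_params(c, mid, 1)
    body = 2 * conv_params(mid, mid, k)
    expand = conv_params(mid, c, 1)
    return squeeze + body + expand


full = plain_two_conv(64)
reduced = compressed_two_conv(64, ratio=2)
print(full, reduced, round(full / reduced, 2))  # 73728 22528 3.27
```

Under these assumptions, halving the channels inside the block cuts the weights by more than a factor of three, because the cost of a 3 × 3 convolution scales with the product of its input and output widths.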

Traditional Vision Transformers (ViTs) have been widely applied to image super-resolution (SR) tasks due to their excellent ability to model global features. However, because the computational cost of their self-attention grows quadratically with the number of spatial positions in the input, ViTs cannot be directly applied to the various feature extraction stages of lightweight SR models. Therefore, in the Transformer branch of the LPCTB, this paper designs a Multi-scale Channel Transformer (MCT) block based on a multi-head transposed attention mechanism to capture long-range dependencies and model global contextual information. Specifically, Multi-Dconv Head Transposed Attention (MDTA) is adopted instead of standard multi-head self-attention to reduce computational complexity and improve feature modeling efficiency. Additionally, depthwise convolutions are introduced to enhance the perception of local structural information and emphasize channel dependencies among important features. As shown in Figure 4, assume the input feature map is $X \in \mathbb{R}^{H \times W \times C}$, where C = 64. The input features first undergo layer normalization and a 1 × 1 convolution for channel mapping, which projects them into an embedding space whose dimension is set to the channel dimension of the input feature. The number of attention heads is set to 4. This step generates the Q (Query), K (Key), and V (Value) features:

$$Q, K, V = H_{conv}^{1 \times 1}(LN(X)) \tag{11}$$

Then, depthwise separable convolutions [20] are applied to Q, K, and V to enhance their local sensitivity, followed by reshaping to obtain $Q', K', V' \in \mathbb{R}^{C' \times (WH)}$, from which the attention output is computed:

$$Q' = \mathrm{Rs}(H_{dconv}^{3 \times 3}(Q)) \tag{12}$$

$$K' = \mathrm{Rs}(H_{dconv}^{3 \times 3}(K)) \tag{13}$$

$$V' = \mathrm{Rs}(H_{dconv}^{3 \times 3}(V)) \tag{14}$$

$$A = \mathrm{Softmax}\left(Q' \cdot (K')^{T} / \sqrt{d}\right) \cdot V' \tag{15}$$

Here, Rs represents the mapping function corresponding to the dimensionality adjustment. After obtaining the global attention features, a 1 × 1 convolution is applied to perform a linear projection, and these are fused with the input features via a residual connection to obtain the enhanced global feature representation.

$$F_{G} = X + H_{conv}^{1 \times 1}(\mathrm{Rs}(A)) \tag{16}$$
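The core of the transposed attention step can be illustrated with small matrices. The following pure-Python sketch covers a single head, omits the 1 × 1 and depthwise convolutions, and leaves the scaling constant d as a parameter because the text does not define it; all function names are hypothetical. The point it shows is that the attention map has size C × C, independent of the number of spatial positions N = H·W.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out


def transposed_attention(q, k, v, d):
    """q, k, v are C x N matrices (C channels, N spatial positions)."""
    scores = matmul(q, transpose(k))                           # C x C
    scores = [[s / math.sqrt(d) for s in row] for row in scores]
    attn = softmax_rows(scores)                                # rows sum to 1
    return attn, matmul(attn, v)                               # output is C x N


# C = 2 channels, N = 3 positions; zero queries give uniform channel weights.
q = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
k = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
v = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
attn, out = transposed_attention(q, k, v, d=3)
print(attn)  # [[0.5, 0.5], [0.5, 0.5]]
print(out)   # [[2.0, 3.0, 4.0], [2.0, 3.0, 4.0]]
```

A spatial self-attention map over the same input would be N × N, so its size grows with resolution, whereas the channel-wise map here stays at C × C as the feature map is enlarged.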

Here, $F_{G}$ denotes the enhanced global features. Building on this, to further enhance feature expressiveness, MCT introduces a convolution-based feature enhancement and recalibration mechanism, wherein the enhanced global features first undergo a depthwise convolution and a 1 × 1 convolution, followed by the GELU activation function $\sigma$:

$$F_{M} = \sigma(H_{conv}^{1 \times 1}(H_{dconv}^{3 \times 3}(F_{G}))) \tag{17}$$

where $F_{M}$ represents the further-enhanced intermediate features. Subsequently, channel attention is introduced to adaptively re-scale the features, thereby highlighting important channel information and suppressing redundant feature responses; finally, the feature dimensions are restored via 1 × 1 convolutions, and the module output is obtained through a residual connection. In this way, MCT achieves an effective combination of local feature enhancement and global dependency modeling while maintaining a lightweight structure.

$$F_{C} = H_{CA}(F_{M}) \tag{18}$$

$$F_{O} = F_{G} + H_{conv}^{1 \times 1}(F_{C}) \tag{19}$$

Here, $F_{C}$ denotes the intermediate features after adaptive re-scaling via channel attention, and $F_{O}$ denotes the final output features of the MCT block.
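The channel re-scaling step can be sketched as follows. This is a simplified illustration, assuming a gate computed from each channel's global average with a hypothetical per-channel weight and bias; practical channel attention modules usually place a small bottleneck network between pooling and gating, and the paper does not give the exact form of its channel attention.

```python
import math


def channel_attention(x, weights, biases):
    """Re-scale each channel by a sigmoid gate of its global average.

    x is a list of C channels, each an H x W nested list; weights and biases
    are hypothetical per-channel parameters used only for this sketch.
    """
    out = []
    for channel, w, b in zip(x, weights, biases):
        count = sum(len(row) for row in channel)
        mean = sum(sum(row) for row in channel) / count     # global average pooling
        gate = 1.0 / (1.0 + math.exp(-(w * mean + b)))      # sigmoid gate in (0, 1)
        out.append([[gate * value for value in row] for row in channel])
    return out


# With zero weights and biases every gate equals 0.5.
features = [[[1.0, 3.0]], [[2.0, 2.0]]]
scaled = channel_attention(features, weights=[0.0, 0.0], biases=[0.0, 0.0])
print(scaled)  # [[[0.5, 1.5]], [[1.0, 1.0]]]
```

Each channel is multiplied by a single scalar, so the spatial layout is preserved while informative channels can be emphasized and redundant ones suppressed.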

To further exploit the multi-scale information contained in the input image and enhance the model's capability to capture and represent features from different stages, a Multi-scale Fusion Enhancement Unit (MFEU) is introduced in this study. Its inputs are feature maps of different resolutions taken from different stages of the network, which contain abundant multi-scale spatial and semantic information. As illustrated in Figure 5, the MFEU is designed to achieve unified modeling and efficient fusion of cross-scale features, thereby improving the network's representation ability for complex structures and fine-grained details.

Specifically, adaptive pooling operations are first employed to process the input feature maps from different branches, unifying their spatial resolutions to 128 × 128. This operation reduces the spatial discrepancy among multi-scale features and enhances feature alignment capability. Subsequently, 1 × 1 convolutions are applied to adjust and project the channel dimensions of each feature map, achieving channel alignment while simultaneously performing preliminary feature recalibration to improve feature consistency across different scales.
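The spatial alignment step can be sketched as adaptive average pooling of a single channel. This is an illustrative assumption: the text does not state whether average or max pooling is used, and the window rule below is the commonly used adaptive-pooling index rule rather than a detail taken from the paper. The function names are hypothetical.

```python
def adaptive_bounds(in_size, out_size):
    """Window [start, end) for every output index.

    start = floor(i * in / out), end = ceil((i + 1) * in / out),
    computed with integer arithmetic to avoid rounding issues.
    """
    return [
        ((i * in_size) // out_size,
         ((i + 1) * in_size + out_size - 1) // out_size)
        for i in range(out_size)
    ]


def adaptive_avg_pool_2d(channel, out_h, out_w):
    """Average-pool one channel (a list of rows) to out_h x out_w."""
    row_bounds = adaptive_bounds(len(channel), out_h)
    col_bounds = adaptive_bounds(len(channel[0]), out_w)
    pooled = []
    for r0, r1 in row_bounds:
        out_row = []
        for c0, c1 in col_bounds:
            window = [channel[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            out_row.append(sum(window) / len(window))
        pooled.append(out_row)
    return pooled


if __name__ == "__main__":
    channel = [
        [1.0, 2.0, 3.0, 4.0],
        [5.0, 6.0, 7.0, 8.0],
        [9.0, 10.0, 11.0, 12.0],
        [13.0, 14.0, 15.0, 16.0],
    ]
    print(adaptive_avg_pool_2d(channel, 2, 2))
```

When the target size is an integer multiple of the input size, each window contains a single pixel, so the same routine replicates values and can bring a lower-resolution branch up to the common resolution.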

After spatial and channel alignment, the MFEU introduces a dynamic weight learning mechanism to adaptively fuse features from different branches. This mechanism dynamically assigns fusion weights according to the importance of input features, enabling the network to emphasize task-relevant information while suppressing redundant or less informative features, thereby improving the utilization efficiency of multi-scale representations. The fused feature maps are further refined through a spatial attention mechanism to enhance the responses of critical regions and strengthen the representation of important facial structural details.

Finally, a 1 × 1 convolution is employed to generate the output feature map. While maintaining the original spatial resolution, the proposed module effectively integrates and enhances multi-scale information, providing more discriminative and informative feature representations for subsequent image reconstruction stages. Here $D_{128}$, $D_{64}$ and $D_{32}$ indicate feature branches with different channel dimensions, while R denotes the reference reconstruction feature branch to be enhanced. For simplicity, we give the formula for the $D_{64}$ branch:

D_{64}^{AP} = AP(D_{64}),

(20)

D_{64}^{Conv} = Sigmoid(W_{conv} \cdot D_{64}^{AP} + b_{conv}),

(21)

F_{fused} = \sum_{i} w_{i} \cdot F_{i},

(22)

F_{SA} = SA(F_{fused}) \cdot F_{fused},

(23)

R_{128} = Conv_{1 \times 1}(SA(w_{64} \cdot Sigmoid(W_{conv} \cdot AP(D_{64}) + b_{conv}) + \sum_{i \neq 64} w_{i} \cdot F_{i})),

(24)

where $D_{64}$ denotes the input feature map with 64 channels, the dimension of $D_{64}^{AP}$ is [h, w, 128], and the dimension of $D_{64}^{Conv}$ is also [h, w, 128]. $F_{i}$ denotes the output feature map of each branch (e.g., $D_{64}^{Conv}$), and $w_{i}$ is the corresponding learned fusion weight. $F_{fused}$ represents the adaptively fused features, and SA denotes the spatial attention mechanism, whose spatial weights are applied to $F_{fused}$ to enhance the representation of features in critical regions. $R_{128}$ is the final output of the MFEU module, with dimension [128, 128, 128].
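A minimal sketch of the weighted fusion in Eq. (22) and the spatial re-weighting in Eq. (23) is given below. Several details are assumptions made only for illustration: the fusion weights are obtained by a softmax over hypothetical per-branch scores (the paper does not specify the normalization), and SA is reduced to a sigmoid of the channel-wise mean at each pixel instead of a learned attention layer. Feature maps are lists of channels holding flattened pixel values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of branch scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_branches(branches, scores):
    """Eq. (22): F_fused = sum_i w_i * F_i with w = softmax(scores).

    branches: aligned feature maps, each shaped [C][P].
    scores:   hypothetical learnable importance score per branch.
    """
    weights = softmax(scores)
    n_ch, n_pix = len(branches[0]), len(branches[0][0])
    return [
        [sum(w * b[c][p] for w, b in zip(weights, branches)) for p in range(n_pix)]
        for c in range(n_ch)
    ]


def spatial_attention(fused):
    """Eq. (23): F_SA = SA(F_fused) * F_fused.

    Assumption: SA is sigmoid(mean over channels) per pixel; a learned
    convolution would normally produce the attention map.
    """
    n_ch, n_pix = len(fused), len(fused[0])
    gates = [
        1.0 / (1.0 + math.exp(-sum(fused[c][p] for c in range(n_ch)) / n_ch))
        for p in range(n_pix)
    ]
    return [[gates[p] * fused[c][p] for p in range(n_pix)] for c in range(n_ch)]


if __name__ == "__main__":
    branch_a = [[2.0, 0.0]]   # one channel, two pixels
    branch_b = [[0.0, 0.0]]
    fused = fuse_branches([branch_a, branch_b], [0.0, 0.0])
    print(fused, spatial_attention(fused))
```

With equal scores the fusion reduces to a plain average of the branches; unequal scores shift the result toward the branch with the larger score.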

3.3. Feature Refinement

As mentioned above, the IUNet module effectively extracts both local and global features within the channels, generating more discriminative hybrid features. However, relying solely on one-dimensional global features makes it difficult to fully recover high-quality facial details and structural information. Therefore, this paper further introduces the proposed Global Feature Refinement Block (GFRB) to refine and enhance the features output at each layer during the encoding stage, thereby improving the overall reconstruction performance.

As shown in Figure 6, GFRB adopts a multi-level cascaded architecture consisting of five Local Feature Extraction Units (LFEUs, shown in Figure 7) and two Transformer modules stacked alternately. Each LFEU uses 1 × 1 convolutions for channel adjustment and 3 × 3 convolutions for local spatial refinement. Within the LFEU, input features first undergo channel reduction to lower the feature dimension, thereby reducing subsequent computational complexity; channel expansion then restores the expressive power of the features, striking a balance between lightweight processing and representational capability. Next, 3 × 3 convolutions perform local spatial modeling to capture fine-grained texture information, and channel attention adaptively re-scales the features, highlighting key channel responses and suppressing redundant information. Subsequently, 1 × 1 convolutions fuse the features and reorganize the channels, and further 3 × 3 convolutions refine them. Finally, a residual connection fuses the input features with the enhanced features, yielding the output of the LFEU.

Building on this, GFRB fuses the locally enhanced features extracted by the LFEUs with the global contextual features modeled by the Transformer modules, achieving collaborative optimization of multi-level features. At each stage, fusion is achieved through feature concatenation and 1 × 1 convolutions, effectively integrating semantic information from different levels to enhance the richness and stability of feature representations. Additionally, cross-layer skip connections are introduced within the module to facilitate the effective transmission and reuse of shallow and deep features, further mitigating information degradation in deep networks.

3.4. Loss Function

In the lightweight network framework proposed in this paper, complex perceptual losses (such as VGG-based perceptual loss) and GAN losses are not introduced, in order to avoid additional computational overhead and training instability. Instead, a pixel-level reconstruction loss is adopted as the primary optimization objective. Although the MSE loss possesses good mathematical properties and can stably optimize model parameters, its squared error penalty tends to cause reconstruction results to converge toward the mean, thereby leading to oversmoothing and impairing the recovery of fine details. In contrast, the MAE loss is more robust to outliers and better preserves high-frequency details in the image. Therefore, this paper adopts the MAE loss as the reconstruction loss. The pixel-level reconstruction loss is defined as:

L_{REC} = \frac{1}{N} \sum_{i=1}^{N} \left| I_{SR}^{i} - I_{HR}^{i} \right|

(25)

where $I_{SR}$ represents the super-resolved result, $I_{HR}$ represents the ground-truth HR image, and N denotes the number of images processed per batch.
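The behaviour of Eq. (25) relative to a squared-error loss can be checked with a small pure-Python sketch. Images are represented as flat lists of pixel values, and the per-image term is averaged over pixels; that reduction is an assumption, since Eq. (25) only specifies the average over the N images of the batch. The function names are hypothetical.

```python
def l1_reconstruction_loss(sr_batch, hr_batch):
    """Eq. (25): mean over the batch of |I_SR - I_HR|.

    Each image is a flat list of pixel values; the absolute difference
    is averaged over pixels (assumed reduction) and then over images.
    """
    per_image = [
        sum(abs(s - h) for s, h in zip(sr, hr)) / len(sr)
        for sr, hr in zip(sr_batch, hr_batch)
    ]
    return sum(per_image) / len(per_image)


def mse_loss(sr_batch, hr_batch):
    """Squared-error counterpart, included only for comparison."""
    per_image = [
        sum((s - h) ** 2 for s, h in zip(sr, hr)) / len(sr)
        for sr, hr in zip(sr_batch, hr_batch)
    ]
    return sum(per_image) / len(per_image)


if __name__ == "__main__":
    sr = [[0.0, 0.5], [1.0, 1.0]]   # two tiny "images"
    hr = [[0.0, 0.0], [1.0, 0.0]]
    print(l1_reconstruction_loss(sr, hr), mse_loss(sr, hr))
```

In this toy batch the pixel with error 1.0 accounts for two thirds of the total absolute error but four fifths of the total squared error, which illustrates why the squared penalty is dominated by large residuals.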

4. Experimental Results and Analysis

4.1. Datasets

To validate the effectiveness of the proposed face super-resolution method, experiments were conducted on three widely used facial image datasets, including FFHQ, CelebA, and Helen. Among them, the FFHQ dataset contains 70,000 high-resolution facial images with a resolution of 1024 × 1024. All images are automatically aligned and centered through face detection and alignment preprocessing to ensure consistent facial scale and position. In the experiments, 10,000 images were randomly selected for training and 1000 images were used for testing. The CelebA dataset consists of 202,599 facial images with an original resolution of 178 × 218, covering diverse facial poses, illumination conditions, and complex background variations. In this study, 20,000 images were selected for training and 1000 images were used for testing. The Helen dataset is primarily designed for fine-grained facial landmark localization tasks and contains 2330 facial images with a resolution of 400 × 400, accompanied by detailed landmark annotations. Among them, 1500 images were selected for training and 500 images were used for testing.

During the data preprocessing stage, all selected images were cropped and uniformly resized to 128 × 128 as HR images. Subsequently, the corresponding LR images were generated through downsampling operations. The same LR-HR pair generation protocol was used for both training and testing. All comparison methods were trained and evaluated under the same ×8 setting to ensure fairness and consistency, thereby enabling a comprehensive evaluation of the proposed model on the ×8 face super-resolution task.

In the preprocessing procedure, all HR images were first uniformly cropped and normalized, and the corresponding LR images were generated using bicubic interpolation downsampling to construct paired training samples. Data augmentation strategies such as random cropping and horizontal flipping were further employed during training to improve the generalization capability of the model. To ensure reproducibility and statistical reliability, the training and testing subsets were generated using a fixed random seed. The selected training and testing images were strictly non-overlapping. Through comparative experiments conducted on the same datasets, the advantages of the proposed lightweight network architecture in terms of reconstruction performance and computational efficiency can be more intuitively demonstrated.
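A fixed-seed, non-overlapping split of the kind described above can be sketched as follows. The subset sizes for FFHQ (70,000 images, 10,000 for training and 1000 for testing) come from the text; the seed value and the function name are hypothetical, since the paper does not report them.

```python
import random


def split_indices(n_total, n_train, n_test, seed):
    """Draw disjoint training and testing index lists reproducibly.

    A single sample without replacement is drawn and then cut in two,
    so the two subsets cannot share an index.
    """
    rng = random.Random(seed)   # local generator, global state untouched
    chosen = rng.sample(range(n_total), n_train + n_test)
    return chosen[:n_train], chosen[n_train:]


if __name__ == "__main__":
    # FFHQ sizes from the text; seed=0 is an illustrative assumption.
    train_ids, test_ids = split_indices(70000, 10000, 1000, seed=0)
    print(len(train_ids), len(test_ids), set(train_ids).isdisjoint(test_ids))
```

Reusing the same seed reproduces the same index lists, which is what allows every compared method to be trained and tested on identical subsets.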

To comprehensively evaluate the performance of the proposed method in image super-resolution tasks, Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS) were adopted as quantitative evaluation metrics. PSNR and SSIM were used to measure the similarity between reconstructed images and ground-truth images at the pixel and structural levels, respectively, while LPIPS was employed to evaluate the perceptual similarity between generated images and real images. The final results are reported as mean values with 95% confidence intervals.
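For reference, PSNR follows the standard definition PSNR = 10 log10(MAX² / MSE). The sketch below applies it to flat lists of pixel values; the peak value `max_val` and the choice of colour space (RGB or luminance) are not stated in the paper and are left as parameters or assumptions here.

```python
import math


def psnr(sr, hr, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists.

    max_val is the peak intensity (1.0 for normalized images, 255 for
    8-bit images); identical inputs give infinite PSNR.
    """
    mse = sum((s - h) ** 2 for s, h in zip(sr, hr)) / len(sr)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


if __name__ == "__main__":
    hr = [0.0, 0.0, 0.0, 0.0]
    sr = [0.1, 0.1, 0.1, 0.1]   # uniform error of 0.1, so MSE = 0.01
    print(round(psnr(sr, hr), 6))
```

A uniform error of 0.1 on normalized images corresponds to 20 dB, so each additional decibel reported in the comparisons reflects a reduction of roughly 21% in mean squared error.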

4.2. Parameter Settings

The key parameters for this experiment were determined through multiple rounds of tuning. The experiment begins by cropping the original high-resolution images to a resolution of 128 × 128 and highlighting the facial regions within the images. Gaussian blur (a Gaussian kernel with a size of 5 and a standard deviation of 1.0) was applied before bicubic downsampling to simulate realistic degradation. After blurring, the HR images were downsampled using bicubic interpolation with a scale factor of ×8 to generate the corresponding LR images. The same degradation protocol was applied to all compared methods to ensure a fair comparison. The initial learning rate was set to $2 \times 10^{-4}$ and gradually decayed using the cosine annealing method. The Adam optimizer was used for model training, with $β_{1} = 0.9$ and $β_{2} = 0.99$. The batch size was set to 16, and random rotation and horizontal flipping were used for data augmentation. All experiments were implemented using the PyTorch 2.0.1 deep learning framework on an NVIDIA GeForce RTX 4070 GPU with CUDA 11.8.
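Two of the settings above can be made concrete with a short sketch: the 5 × 5 Gaussian blur kernel with standard deviation 1.0, and the cosine annealing schedule starting from 2 × 10⁻⁴. The kernel size, standard deviation, and initial learning rate come from the text; the total number of steps and the minimum learning rate are hypothetical, since the paper does not report them.

```python
import math


def gaussian_kernel_2d(size=5, sigma=1.0):
    """Normalized 2-D Gaussian kernel built as an outer product of 1-D taps."""
    half = size // 2
    taps = [math.exp(-((i - half) ** 2) / (2.0 * sigma ** 2)) for i in range(size)]
    total = sum(taps) ** 2          # sum of the outer product
    return [[a * b / total for b in taps] for a in taps]


def cosine_annealing_lr(step, total_steps, lr_max=2e-4, lr_min=0.0):
    """Cosine annealing: decay from lr_max at step 0 to lr_min at total_steps.

    total_steps and lr_min are illustrative assumptions.
    """
    cosine = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


if __name__ == "__main__":
    kernel = gaussian_kernel_2d()
    print(sum(sum(row) for row in kernel))          # weights sum to ~1
    for step in (0, 50, 100):
        print(step, cosine_annealing_lr(step, total_steps=100))
```

The kernel weights sum to one, so blurring preserves the mean image intensity before the bicubic ×8 downsampling, and the schedule reaches half of the initial learning rate at the midpoint of training.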

To ensure the fairness and comparability of the experimental results, all comparison methods in this paper were implemented using their official open-source code and followed the default network configurations recommended by the original papers. The training settings were unified as much as possible. For each dataset, the same training and testing splits were used for all comparison methods, and the training and testing subsets were strictly non-overlapping. Thus, all models were trained and tested under the same LR-HR pair generation setting. Additionally, all experiments were conducted in the same software and hardware environment to minimize the impact of external factors on the results.

4.3. Experimental Results and Comparison

To validate the performance of the proposed lightweight HCTIUNet in super-resolution tasks, this paper selects current mainstream super-resolution reconstruction methods for comparison, such as RCAN [3], SCTANet [4], and SISN [13]. Additionally, lightweight super-resolution models such as VDSR [5], RFDN [16], XLSR [17], and MSFSR [21] were selected for comparative experiments on facial super-resolution. The experiments compared the performance of these methods on three benchmark datasets at a ×8 magnification factor, with the results shown in Table 1. As shown in Table 1, the proposed HCTIUNet achieved competitive performance across multiple datasets. On the FFHQ dataset, HCTIUNet achieved a PSNR of 27.55 dB, outperforming most comparison methods and falling only slightly short of the leading SCTANet model. On the Helen dataset, HCTIUNet achieved the best result on the SSIM metric, indicating its strong performance in restoring fine-grained structural details. Further analysis reveals that compared to traditional reconstruction models (such as VDSR and RFDN), HCTIUNet demonstrates clear improvements in both PSNR and SSIM while maintaining a low LPIPS value, suggesting that it not only enhances reconstruction accuracy but also offers advantages in perceptual quality. It should be noted that HCTIUNet does not obtain the best score on every dataset or metric. For example, SCTANet achieves higher PSNR values on FFHQ and CelebA, and RCAN obtains a slightly lower LPIPS value on CelebA. Therefore, the advantage of HCTIUNet should be interpreted from the perspective of the trade-off between reconstruction quality and computational efficiency rather than from a single quality metric.

The visual comparison results in Figure 8 further validate the effectiveness of the proposed method. Some lightweight models (such as XLSR, MSFSR, and RFDN) generally exhibit noticeable blurring in their reconstruction results, with severe loss of detail, particularly in key facial regions (such as the edges of the eyes, mouth, and ears), making it difficult to recover clear structural information. While traditional methods (such as VDSR and RCAN) can restore relatively clear contours to some extent, they still suffer from insufficient texture detail or excessive smoothing in local regions. In contrast, the proposed HCTIUNet produces visually clearer and more natural reconstruction results. Specifically, in the eye region, HCTIUNet effectively restores the edge structure and fine textures of the eyes, resulting in clearer contours; in the mouth region, the reconstruction results are richer in detail while maintaining good structural consistency; and in the facial contour and skin tone transition areas, the images generated by HCTIUNet are smoother and more natural, visually closer to real high-resolution images.

The magnified views of local facial areas in Figure 9 show that HCTIUNet produces clearer and more structurally consistent facial details than several comparison methods. In the eye region, the proposed method better preserves the eyelid boundary and eyebrow structure, whereas some lightweight methods generate blurred edges. In the nose region, HCTIUNet reconstructs a more coherent contour and avoids the structural distortion observed in some comparison results. In the mouth region, the proposed method better maintains lip shape and local texture continuity. In addition, along the facial contour, HCTIUNet produces a sharper and more natural boundary with fewer over-smoothed artifacts. These observations indicate that the proposed method is effective in preserving both local textures and global facial structures.

Furthermore, as shown in Table 2, HCTIUNet contains 10.5 M parameters and requires 9.9 G FLOPs, with an inference time of 0.021 s. Figure 10 presents a scatter-plot comparison of the parameter counts and computational costs of the different models. It should be noted that HCTIUNet is not the most lightweight model in terms of parameter size or FLOPs. For example, RFDN and XLSR have much lower parameter counts and computational costs. Therefore, the efficiency advantage of HCTIUNet should not be interpreted as having the lowest model complexity among all compared methods.

Instead, HCTIUNet is positioned as an efficient CNN–Transformer hybrid model that aims to balance reconstruction quality and inference speed. Compared with high-capacity models such as RCAN, SCTANet, SISN, and VDSR, HCTIUNet achieves competitive reconstruction performance with relatively fast inference. Compared with ultra-lightweight models such as RFDN and XLSR, HCTIUNet introduces higher computational complexity but provides stronger reconstruction quality, especially in preserving facial structures and local details. Overall, HCTIUNet achieves a favorable trade-off among reconstruction accuracy, perceptual quality, and inference efficiency, rather than being the smallest model in terms of parameters or FLOPs.

4.4. Ablation Study

To validate the effectiveness of each module in the proposed model, this paper conducts stepwise ablation experiments on the key components of HCTIUNet. By progressively introducing different modules under identical settings, the experiments compare and analyze the impact of each component on model performance, thereby evaluating their roles in feature modeling and in improving reconstruction quality. To make the ablation study clearer, the model variants are defined as follows. Baseline denotes the basic model using only the CNN-based local feature extraction branch. IUNet-G denotes the model that adds the Global Feature Refinement Block (GFRB) to the baseline structure. IUNet-M denotes the model that introduces the proposed Multi-scale Channel Transformer (MCT) module for global dependency modeling. HCTIUNet denotes the complete model that integrates the inverted U-shaped CNN–Transformer framework, MCT-based global modeling, multi-scale feature interaction, and GFRB-based feature refinement. We then conducted a quantitative analysis of each model’s PSNR, SSIM, and parameter count to validate the contribution of each module to reconstruction accuracy, perceptual quality, and lightweight design. All experiments were conducted under the same training parameter settings to ensure fairness and consistency in performance comparison, and model performance was evaluated on the Helen dataset. The results of the ablation experiments under different module configurations are shown in Table 3, and visual comparison results are shown in Figure 11.

As shown in Table 3, model performance varies across different module configurations. From the overall trend, as key modules are progressively introduced, the model exhibits a continuous improvement in both PSNR and SSIM metrics. In the baseline model, which uses only a basic convolutional structure, the model performance is relatively low, indicating that relying solely on local convolutional operations makes it difficult to effectively model complex facial structural information. Even when the GFRB module is introduced based on the baseline model, although the reconstruction quality improves to some extent, the improvement is relatively limited. Upon introducing the MCT module, PSNR increased from 26.80 dB to 27.44 dB, and SSIM rose from 0.761 to 0.770. This indicates that the designed Transformer branch effectively captures long-range dependencies, thereby enhancing global structural representation and improving perceptual quality to some extent. Upon further integration of the GFRB module, model performance improved further, with PSNR reaching 27.57 dB and SSIM reaching 0.775. This demonstrates that the GFRB module effectively enhances feature representation and refines critical structural information, thereby significantly improving the perceptual quality and detail representation of the reconstructed images. It can be seen that the designed MCT module is primarily responsible for global dependency modeling, ensuring structural consistency for the model, while the GFRB module further enhances local detail recovery capabilities through multi-stage feature refinement. The two modules are functionally complementary, enabling the model to achieve a more balanced performance between reconstruction accuracy and visual quality.

To further validate the effectiveness of the proposed Multi-scale Channel Transformer module, we conducted a comparative analysis with the standard Transformer architecture using a variant, IUNet-T, that replaces the proposed MCT with a standard Transformer block. The experimental results are shown in Table 4 and Figure 12. After introducing the standard Transformer, both the PSNR and SSIM metrics improved compared to the baseline model, indicating that the attention mechanism has certain advantages in modeling long-range dependencies and can improve the overall structural consistency of the image. After replacing the standard Transformer with the proposed MCT module, the model performance improved further, with both PSNR and SSIM exceeding those of the standard-Transformer variant, indicating that the proposed MCT possesses stronger capabilities in perceptual quality and detail recovery. Analysis reveals that while the standard Transformer primarily focuses on modeling global dependencies, it has certain shortcomings in expressing local structures, whereas the MCT module proposed in this paper introduces deep convolutional operations into the attention mechanism, enabling the model to effectively capture local spatial information while performing global modeling. By modeling the channel dimension, it enhances feature representation capabilities, thereby achieving a more refined feature representation. Furthermore, the branch-based feature processing approach enhances the flexibility of the attention mechanism, allowing the model to better adapt to the complex structural variations in facial images.

5. Discussion

This paper proposes HCTIUNet, an inverted U-shaped network that combines CNN and Transformer architectures, for the task of lightweight facial super-resolution reconstruction. To address the limitations of traditional lightweight models (namely, insufficient modeling of global dependencies and limited recovery of high-frequency details), this paper optimizes the network architecture, feature fusion methods, and global refinement mechanisms to achieve an effective balance between model performance and computational efficiency. In terms of overall architecture design, this paper constructs a core IUNet based on an inverted U-shaped structure, enabling information exchange between features at different scales through multi-level feature encoding and decoding. The proposed LPCTB combines CNN and Transformer branches to extract local texture information and global contextual features, respectively, thereby enhancing the model’s ability to represent complex facial structures. In the CNN branch, a local facial information extraction mechanism is introduced to enhance the recovery of local details; in the Transformer branch, an MCT module based on multi-head dynamic transposed attention is proposed. By combining deep convolutions with channel modeling mechanisms, it achieves the collaborative modeling of long-range dependencies and local structural information. Furthermore, a GFRB is designed to enhance and refine the key features extracted during the encoding phase, thereby further improving the perceptual quality and structural consistency of the reconstructed images.

In the experimental section, we evaluated the proposed model on the FFHQ, CelebA, and Helen datasets and compared it with various mainstream and lightweight super-resolution methods. The experimental results demonstrate that HCTIUNet achieves competitive reconstruction performance while maintaining relatively low inference latency and moderate computational complexity, indicating a favorable balance between reconstruction quality and efficiency. Although the ablation study demonstrates the cumulative effectiveness of the proposed architectural components, the current experimental design does not completely isolate the individual contributions of the inverted U-shaped architecture, the Multi-scale Fusion Enhancement Unit (MFEU), the hybrid attention mechanism, and the refinement block. The presented variants were constructed in a progressive manner to evaluate the overall performance gain obtained by incrementally introducing these components. Consequently, the reported results primarily reflect their combined effects rather than the independent contribution of each module. A more comprehensive component-wise ablation analysis would provide deeper insights into the role of each architectural element and remains an important direction for further investigation.

The proposed HCTIUNet effectively integrates local and global information under lightweight conditions, achieving high-quality restoration of facial image details and structural features, thereby providing a valuable reference for future research on lightweight facial super-resolution models.

Author Contributions

Conceptualization, A.-L.L. and Y.-H.X.; methodology, A.-L.L.; validation, A.-L.L. and Y.-H.X.; resources, Y.-H.X. and W.Z.; data curation, A.-L.L.; writing—original draft preparation, A.-L.L.; writing—review and editing, A.-L.L., Y.-H.X. and W.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable for studies not involving humans.

Data Availability Statement

The data used in this study are publicly available benchmark datasets. The FFHQ dataset is available at https://github.com/NVlabs/ffhq-dataset, the CelebA dataset is available at https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html, and the Helen dataset is available at http://www.ifp.illinois.edu/~vuongle2/helen/, all accessed on 14 June 2026. No new datasets were generated during this study.

Conflicts of Interest

The authors have no competing interests to declare that are relevant to the content of this article.

Abbreviations

The following abbreviations are used in this manuscript:

HCTIUNet	Hybrid CNN–Transformer Inverted U-Net Architecture
IUNet	Inverted UNet
MDTA	Multi-head Transpositional Self-Attention
LPCTB	Lightweight Processing of CNN–Transformer Block
LFIEB	Local Face semantic Information Extraction Block
MCT	Multi-scale Channel Transformer
MFEU	Multi-scale Fusion Enhancement Unit
GFRB	Global Feature Refinement Block
LFEU	Local Feature Extraction Unit

Appendix A

To further evaluate the statistical reliability of the reported results, this appendix provides the corresponding 95% confidence intervals (CI) for the quantitative comparisons presented in Table 1 of the main manuscript. The confidence intervals were computed based on the image-level evaluation results of each method on the FFHQ, CelebA, and Helen test datasets. The reported values are presented in the form of mean ± 95% CI, providing additional information regarding the variability and robustness of the performance metrics.

Table A1. The 95% confidence intervals (CI) of different FSR methods on the FFHQ, CelebA, and Helen Datasets.

(a) FFHQ Dataset
Methods	Scale	PSNR ↑	SSIM ↑	LPIPS ↓
RCAN	$\times 8$	26.93 ± 0.22	0.767 ± 0.015	0.236 ± 0.024
SCTANet		27.69 ± 0.14	0.782 ± 0.006	0.206 ± 0.010
SISN		26.12 ± 0.18	0.771 ± 0.008	0.228 ± 0.013
VDSR		26.45 ± 0.12	0.760 ± 0.008	0.237 ± 0.011
RFDN		25.72 ± 0.21	0.676 ± 0.017	0.251 ± 0.016
MSFSR		25.33 ± 0.24	0.660 ± 0.018	0.250 ± 0.021
XLSR		25.25 ± 0.25	0.622 ± 0.020	0.235 ± 0.023
HCTIUNet		27.55 ± 0.15	0.765 ± 0.009	0.225 ± 0.009
(b) CelebA Dataset
Methods	Scale	PSNR ↑	SSIM ↑	LPIPS ↓
RCAN	$\times 8$	27.08 ± 0.18	0.766 ± 0.017	0.209 ± 0.018
SCTANet		27.75 ± 0.12	0.780 ± 0.004	0.215 ± 0.013
SISN		26.24 ± 0.15	0.752 ± 0.010	0.233 ± 0.010
VDSR		26.80 ± 0.14	0.772 ± 0.009	0.244 ± 0.017
RFDN		25.70 ± 0.19	0.653 ± 0.014	0.265 ± 0.021
MSFSR		25.16 ± 0.22	0.660 ± 0.016	0.250 ± 0.018
XLSR		25.25 ± 0.20	0.624 ± 0.014	0.240 ± 0.017
HCTIUNet		27.63 ± 0.14	0.761 ± 0.006	0.212 ± 0.015
(c) Helen Dataset
Methods	Scale	PSNR ↑	SSIM ↑	LPIPS ↓
RCAN	$\times 8$	26.84 ± 0.22	0.735 ± 0.015	0.243 ± 0.031
SCTANet		27.76 ± 0.04	0.775 ± 0.006	0.186 ± 0.008
SISN		26.15 ± 0.13	0.753 ± 0.008	0.243 ± 0.013
VDSR		26.56 ± 0.16	0.766 ± 0.007	0.231 ± 0.015
RFDN		25.45 ± 0.23	0.682 ± 0.021	0.269 ± 0.019
MSFSR		25.88 ± 0.14	0.671 ± 0.023	0.238 ± 0.020
XLSR		25.51 ± 0.18	0.636 ± 0.025	0.267 ± 0.022
HCTIUNet		27.53 ± 0.09	0.777 ± 0.005	0.213 ± 0.010

Note: The best results are highlighted in bold.
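A mean ± 95% CI entry of the kind reported in Table A1 could be computed from image-level scores roughly as follows. The paper does not state which interval construction was used; the sketch assumes the normal approximation, mean ± 1.96 · s / √n, with s the sample standard deviation, and the input scores are made-up illustrative values rather than results from the paper.

```python
import math
import statistics


def mean_ci95(values):
    """Return (mean, half-width) of a normal-approximation 95% CI.

    half-width = 1.96 * sample standard deviation / sqrt(n).
    """
    mean = statistics.mean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width


if __name__ == "__main__":
    # Hypothetical per-image scores, for illustration only.
    scores = [1.0, 2.0, 3.0, 4.0, 5.0]
    mean, half = mean_ci95(scores)
    print(f"{mean:.2f} ± {half:.2f}")
```

Because the half-width shrinks with the square root of the number of test images, intervals computed on the 500-image Helen test set and on the 1000-image FFHQ and CelebA test sets are not directly comparable in width.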

References

1. Baker, S.; Kanade, T. Limits on super-resolution and how to break them. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 1167–1183.
2. Dong, C.; Loy, C.C.; He, K.; Tang, X. Learning a deep convolutional network for image super-resolution. In Computer Vision–ECCV; Springer International Publishing: Cham, Switzerland, 2014; pp. 184–199.
3. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2018; pp. 286–301.
4. Bao, Q.; Liu, Y.; Gang, B.; Yang, W.; Liao, Q. SCTANet: A spatial attention-guided CNN-transformer aggregation network for deep face image super-resolution. IEEE Trans. Multimed. 2023, 25, 8554–8565.
5. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2016; pp. 1646–1654.
6. Zhu, S.; Liu, S.; Loy, C.C.; Tang, X. Deep cascaded bi-network for face hallucination. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2016; pp. 614–630.
7. Lu, T.; Wang, H.; Xiong, Z.; Jiang, J.; Zhang, Y.; Zhou, H.; Wang, Z. Face hallucination using region-based deep convolutional networks. In 2017 IEEE International Conference on Image Processing (ICIP); IEEE: Piscataway, NJ, USA, 2017; pp. 1657–1661.
8. Chen, Y.; Tai, Y.; Liu, X.; Shen, C.; Yang, J. FSRNET: End-to-end learning face super-resolution with facial priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 2492–2501.
9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
10. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 10012–10022.
11. Yoo, J.; Kim, T.; Lee, S.; Kim, S.H.; Lee, H.; Kim, T.H. Enriched CNN-transformer feature aggregation networks for super resolution. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 4956–4965.
12. Zhao, T.; Zhang, C. SAAN: Semantic attention adaptation network for face super resolution. In 2020 IEEE International Conference on Multimedia and Expo (ICME); IEEE: Piscataway, NJ, USA, 2020; pp. 1–6.
13. Wang, Q.; Gao, Q.; Wu, L.; Sun, G.; Jiao, L. Adversarial Multi-Path Residual Network for image super-resolution. IEEE Trans. Image Process. 2021, 30, 6648–6658.
14. Li, W.; Wang, M.; Zhang, K.; Li, J.; Li, X.; Zhang, Y.; Gao, G.; Ma, Z. Survey on deep face restoration: From non-blind to blind and beyond. arXiv 2023, arXiv:2309.15490.
15. Zhang, C.; Liu, Z. Face super-resolution with progressive embedding of multi-scale face priors. In IEEE International Joint Conference on Biometrics; IEEE: Piscataway, NJ, USA, 2022; pp. 1–8.
16. Liu, J.; Tang, J.; Wu, G. Residual feature distillation network for lightweight image super-resolution. In ECCV Workshops; Springer International Publishing: Cham, Switzerland, 2020; pp. 41–55.
17. Ayazoglu, M. Extremely lightweight quantization robust real-time single-image super resolution for mobile devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2021; pp. 2472–2479.
18. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In MICCAI; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241.
19. Chen, C.; Gong, D.; Wang, H.; Li, Z.; Wong, K.Y.K. Learning spatial attention for face super-resolution. IEEE Trans. Image Process. 2020, 30, 1219–1231.
20. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
21. Zhang, Y.; Wu, Y.; Chen, L. MSFSR: A multi-stage face super-resolution with accurate facial representation via enhanced facial boundaries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops; IEEE: Piscataway, NJ, USA, 2020; pp. 504–505.

Figure 1. Network structure of the proposed HCTIUNet.

Figure 2. IUNet network structure.

Figure 3. Structure of the Local Face Semantic Information Extraction Block (LFIEB).

Figure 4. Multi-scale Channel Transformer block.

Figure 5. Network structure of the Multi-scale Fusion Enhancement Unit (MFEU).

Figure 6. Structure of the Global Feature Refinement Module (GFRB).

Figure 7. Structure of the Local Feature Extraction Unit (LFEU).

Figure 8. Visual comparison of different FSR methods at ×8 magnification on the FFHQ and Helen test sets.

Figure 9. Locally magnified visual comparison of different FSR methods on the FFHQ test set.

Figure 10. Scatter plot comparing the parameter counts and computational costs of different models.

Figure 11. Visual comparison of different structural models on the Helen test set at ×8 magnification.

Figure 12. Visual comparison of models with different Transformer structures on the Helen test set at ×8 magnification.

Table 1. Performance evaluation of different FSR methods on the FFHQ, CelebA, and Helen datasets.

Methods	Scale	FFHQ (PSNR ↑/SSIM ↑/LPIPS ↓)	CelebA (PSNR ↑/SSIM ↑/LPIPS ↓)	Helen (PSNR ↑/SSIM ↑/LPIPS ↓)
RCAN	×8	26.93/0.767/0.236	27.08/0.766/0.209	26.84/0.735/0.243
SCTANet	×8	27.69/0.782/0.206	27.75/0.780/0.215	27.76/0.775/0.186
SISN	×8	26.12/0.771/0.228	26.24/0.752/0.233	26.15/0.753/0.243
VDSR	×8	26.45/0.760/0.237	26.80/0.772/0.244	26.56/0.766/0.231
RFDN	×8	25.72/0.676/0.251	25.70/0.653/0.266	25.45/0.682/0.269
MSFSR	×8	25.33/0.660/0.250	25.16/0.624/0.240	25.88/0.671/0.238
XLSR	×8	25.25/0.622/0.235	25.48/0.627/0.259	25.51/0.636/0.267
HCTIUNet	×8	27.55/0.765/0.225	27.63/0.761/0.212	27.53/0.777/0.213

Note: The best results are highlighted in bold. The corresponding confidence intervals are provided in Appendix A Table A1.
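For readers who wish to check values such as the PSNR column, the standard definition can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' evaluation code: it assumes 8-bit images with a peak value of 255 and averages the squared error over all pixels and channels. The evaluation protocol actually used (color space, cropping) is not stated in this section, and the names `psnr` and `max_value` are hypothetical.

```python
import numpy as np


def psnr(reference: np.ndarray, estimate: np.ndarray, max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = reference.astype(np.float64)
    estimate = estimate.astype(np.float64)
    # Mean squared error over every pixel and channel (an assumption; some
    # protocols evaluate on the luminance channel only).
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))


# Toy example on synthetic data (not face images): a random "ground truth"
# and a noisy copy standing in for a super-resolved output.
rng = np.random.default_rng(0)
hr = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)
sr = np.clip(hr.astype(np.float64) + rng.normal(0.0, 10.0, size=hr.shape), 0, 255)
print(f"PSNR: {psnr(hr, sr):.2f} dB")
```

Higher PSNR indicates a smaller pixel-wise error, which is why the column is marked with ↑. SSIM and LPIPS require structural or learned comparisons and are not covered by this sketch.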

Table 2. Comparison of parameter counts and computational efficiency of different models.

Model	Params	FLOPs	Inference Time
RCAN	15.9 M	4.1 G	0.069 s
SCTANet	26.9 M	11.2 G	0.056 s
SISN	20.4 M	9.3 G	0.087 s
VDSR	17.53 M	9.9 G	0.071 s
RFDN	500 K	120.3 M	0.027 s
MSFSR	6.34 M	1.2 G	0.370 s
XLSR	701 K	180 M	0.010 s
HCTIUNet	10.5 M	9.9 G	0.021 s

Note: The best results are highlighted in bold.
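The relative costs implied by Table 2 can be computed directly from the listed values. The sketch below transcribes the table, converting the entries given in K parameters or M FLOPs to M and G, and reports HCTIUNet's cost as a fraction of each compared model. The dictionary `table2` and the helper `relative_to` are hypothetical names introduced for illustration only.

```python
# Values transcribed from Table 2 as
# (parameters in M, FLOPs in G, inference time in s).
# Unit conversions: 500 K -> 0.5 M, 701 K -> 0.701 M,
# 120.3 M FLOPs -> 0.1203 G, 180 M FLOPs -> 0.18 G.
table2 = {
    "RCAN": (15.9, 4.1, 0.069),
    "SCTANet": (26.9, 11.2, 0.056),
    "SISN": (20.4, 9.3, 0.087),
    "VDSR": (17.53, 9.9, 0.071),
    "RFDN": (0.5, 0.1203, 0.027),
    "MSFSR": (6.34, 1.2, 0.370),
    "XLSR": (0.701, 0.18, 0.010),
    "HCTIUNet": (10.5, 9.9, 0.021),
}


def relative_to(reference: str, target: str = "HCTIUNet") -> dict:
    """Return target/reference ratios for parameters, FLOPs, and inference time."""
    ref, tgt = table2[reference], table2[target]
    return {
        "params": tgt[0] / ref[0],
        "flops": tgt[1] / ref[1],
        "time": tgt[2] / ref[2],
    }


for name in table2:
    if name != "HCTIUNet":
        r = relative_to(name)
        print(
            f"{name:8s} params x{r['params']:.2f}  "
            f"FLOPs x{r['flops']:.2f}  time x{r['time']:.2f}"
        )
```

Relative to SCTANet, for example, these values give HCTIUNet roughly 39% of the parameters, 88% of the FLOPs, and 37.5% of the inference time.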

Table 3. Parameters and performance evaluation of different structural models on the Helen dataset.

Model	LFIEB	MCT	GFRB	PSNR/SSIM	Params/FLOPs
BaseLine	✓	×	×	26.78/0.761	6.2 M/6.5 G
IUNet-G	✓	×	✓	26.91/0.763	7.3 M/8.1 G
IUNet-M	✓	✓	×	27.44/0.770	9.2 M/9.6 G
HCTIUNet	✓	✓	✓	27.57/0.775	10.5 M/10.1 G

Note: The best results are highlighted in bold.
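The contribution of each module in Table 3 can be read off as the difference between a variant and the baseline. The following sketch performs that subtraction on the transcribed rows. The names `table3` and `gain_over_baseline` are hypothetical and used only for this illustration.

```python
# Rows transcribed from Table 3 as
# (PSNR in dB, SSIM, parameters in M, FLOPs in G).
table3 = {
    "BaseLine": (26.78, 0.761, 6.2, 6.5),
    "IUNet-G": (26.91, 0.763, 7.3, 8.1),
    "IUNet-M": (27.44, 0.770, 9.2, 9.6),
    "HCTIUNet": (27.57, 0.775, 10.5, 10.1),
}


def gain_over_baseline(name: str) -> tuple:
    """Difference between a variant and the baseline for each column."""
    base, row = table3["BaseLine"], table3[name]
    # Round to three decimals to suppress floating-point noise.
    return tuple(round(r - b, 3) for r, b in zip(row, base))


for name in ("IUNet-G", "IUNet-M", "HCTIUNet"):
    d_psnr, d_ssim, d_params, d_flops = gain_over_baseline(name)
    print(
        f"{name:8s} PSNR +{d_psnr:.2f} dB  SSIM +{d_ssim:.3f}  "
        f"params +{d_params:.1f} M  FLOPs +{d_flops:.1f} G"
    )
```

On these values, the full model gains 0.79 dB PSNR and 0.014 SSIM over the baseline at a cost of 4.3 M additional parameters and 3.6 G additional FLOPs.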

Table 4. Ablation experiments on different Transformer architectures.

Model	Scale	PSNR/SSIM	Params	FLOPs
BaseLine	×8	26.83/0.766	6.2 M	6.5 G
IUNet-T	×8	27.22/0.772	7.8 M	8.1 G
HCTIUNet	×8	27.54/0.775	10.5 M	10.0 G

Note: The best results are highlighted in bold.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

