Article

CGFTNet: Content-Guided Frequency Domain Transform Network for Face Super-Resolution

School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
* Author to whom correspondence should be addressed.
Information 2024, 15(12), 765; https://doi.org/10.3390/info15120765
Submission received: 16 October 2024 / Revised: 22 November 2024 / Accepted: 28 November 2024 / Published: 2 December 2024

Abstract

Recent advancements in face super-resolution (FSR) have been propelled by deep learning techniques based on convolutional neural networks (CNNs). However, existing methods still struggle to capture global facial structure information effectively, leading to reduced fidelity in reconstructed images, and they often require additional manual data annotation. To overcome these challenges, we introduce a content-guided frequency domain transform network (CGFTNet) for face super-resolution tasks. The network features a channel attention-linked encoder-decoder architecture with two key components: the Frequency Domain and Reparameterized Focus Convolution Feature Enhancement module (FDRFEM) and the Content-Guided Channel Attention Fusion (CGCAF) module. FDRFEM enhances feature representation through transform-domain techniques and reparameterized focus convolution (RefConv), capturing detailed facial features and improving image quality. CGCAF dynamically adjusts feature fusion based on image content, enhancing detail restoration. Extensive evaluations across multiple datasets demonstrate that the proposed CGFTNet consistently outperforms other state-of-the-art methods.

1. Introduction

Facial super resolution (FSR), commonly referred to as facial hallucination, is a pivotal technology designed to enhance low-resolution (LR) images into high-resolution (HR) counterparts. This enhancement is particularly important in various applications, notably in identity recognition, video surveillance, and facial analysis. In practical settings, the quality of captured facial images often suffers due to variations in hardware configurations, positioning, and camera angles. Such degraded image quality significantly hinders essential tasks like facial recognition and analysis.
FSR tasks are inherently complex, as the nuanced structures and textural details of facial images, which occupy only a minor fraction of the overall face, are vital for distinguishing individual faces. These details are frequently obscured in LR images, presenting a daunting challenge in accurately restoring these critical features.
Historically, FSR methods such as bicubic interpolation, local feature analysis, and sparse representation have been employed. These techniques are adept at improving resolution where image content is straightforward, with clear edges and repetitive patterns. However, they falter with the intricate and varied structures typical of facial images. Convolutional neural networks (CNNs), known for their robust local feature extraction capabilities, have enhanced the prediction of subtle facial features, leading to the development of numerous CNN-based FSR frameworks [1,2,3]. Despite their success, CNNs generally lack the ability to model spatial correlations over extended distances due to their static filters, resulting in suboptimal recovery of facial structures and often blurred reconstructions.
To address the limitations of CNN in capturing global facial information, researchers are exploring architectures with enhanced global sensing capabilities. Some approaches have integrated auxiliary priors [4,5], such as facial parsing maps, landmark heatmaps, and 3D models, to aid reconstruction. Although these methods provide some improvement, they rely heavily on labor-intensive manual annotations, and inaccuracies in these annotations can degrade the reconstruction quality. The introduction of transformer models, which have revolutionized natural language processing, into the visual domain offers promising new directions for FSR tasks. Transformers excel in managing long-distance dependencies, thereby improving the reconstruction of the global facial structure and spawning a new generation of transformer-based FSR models [6,7,8,9].
Nevertheless, despite their proficiency in global information processing, transformers are computationally intensive and may lack sensitivity to localized details. Thus, it is crucial to balance the enhancement of local information processing with maintaining an extensive receptive field and minimizing computational demands. To this end, we propose a novel content-guided frequency domain transform network (CGFTNet) for FSR. Our network incorporates the Frequency Domain and Reparameterized Focus Convolution Feature Enhancement module (FDRFEM) and the Content-Guided Channel Attention Fusion (CGCAF) module. FDRFEM enhances feature representation in the frequency domain to capture detailed facial structures, while CGCAF leverages a content-guided attention mechanism to optimally merge features from the encoder and decoder, ensuring effective feature preservation and enhancing super-resolution performance. Consistent with the architectural paradigms of the majority of preceding FSR models [10,11,12], we also utilize an encoder-decoder framework.
In summary, our contributions are threefold:
  • We introduce the FDRFEM, which includes a frequency domain transformation branch and reparameterized focus convolution. This module enhances feature detail while preserving global structures.
  • We present a novel module CGCAF that dynamically integrates rich features from both the encoder and decoder, facilitating the reconstruction of high-quality images.
  • We develop the CGFTNet, a synergistic CNN-transformer network that achieves competitive performance on various FSR metrics.

2. Related Work

This section offers a thorough review of the advancements in image super-resolution technology, placing special emphasis on the role of deep learning in enhancing the resolution of natural and facial images. We also delve into the super-resolution techniques that are particularly effective under challenging conditions, such as those encountered in low-light scenarios. Furthermore, the discussion extends to the cutting-edge applications of transformer models within this domain and examines strategies for leveraging the synergistic strengths of CNN and transformers to achieve superior quality in super-resolution reconstruction. The following discourse elaborates on these technologies and their implications for the field of image super-resolution.

2.1. Super Resolution

Deep convolutional neural networks have achieved transformative advancements in enhancing the resolution of natural images, leveraging their superior feature extraction capabilities [13,14,15]. Li et al. [16] developed a novel multi-scale residual network (MSRN), which optimizes image feature utilization for super-resolution reconstruction using multi-scale residual blocks (MSRBs). Wang et al. [17] developed a lightweight remote-sensing image super-resolution network based on attention-driven multilevel feature fusion. Guo et al. [18] introduced an innovative dual regression network approach, incorporating additional constraints on low-resolution data to significantly reduce the potential mapping function space, thereby enhancing the performance of single-image super-resolution models. Gao et al. [19] designed a lightweight, efficient feature distillation interaction weighted network (FDIWN) for single-image super resolution (SISR), which, through its specially designed modules, achieves enhanced super-resolution performance with reduced computational overhead.

2.2. Face Super Resolution

CNN-based super-resolution technology has also significantly propelled the development of FSR. For example, Zhang et al. [20] proposed the super-identity convolutional neural network (SICNN), which improves identity information recovery in low-resolution facial images by employing super-identity loss and domain ensemble training methods, significantly boosting the recognizability of ultra-low-resolution faces. Additionally, some researchers have tailored FSR models to exploit facial inherent features, such as facial contour maps and feature points. Chen et al. [21] presented FSRNet, an end-to-end trainable network that exploits facial geometric priors, like facial landmark heatmaps and parsing maps, to reconstruct extremely low-resolution facial images without the need for precise alignment. Jin et al. [22] used a stepwise training approach, facial attention loss functions, and a streamlined facial alignment network (FAN) to produce highly realistic 8× super-resolution images that retain intricate facial details. Chan et al. [23] developed a hybrid covariance attention and cross-layer fusion transformer network. Although these models demonstrate promising results, they depend on extensive dataset annotations, and the accuracy of prior information significantly influences the reconstruction outcomes.

2.3. Face Super Resolution Under Specific Conditions

In addition to natural-state facial reconstruction, researchers have also explored face super-resolution techniques under specific conditions. Deepak Rai et al. [24] proposed a novel neuro-fuzzy inference-based local constraint representation (NFILcR) method for robust facial image super-resolution under low-light conditions, which substantially enhances the super-resolution performance of low-resolution facial images captured in such environments by integrating an adaptive neuro-fuzzy inference system (ANFIS) with local constraint representation methods. Yin et al. [25] developed MetaF2N, a blind image super-resolution method that fine-tunes model parameters through a meta-learning framework to target specific degradations in natural images, incorporating MaskNet to predict variable loss weights across different positions, thereby minimizing the disparity between recovered and actual facial images, which significantly impacts super-resolution performance.

2.4. Transformer

Recent attention has been drawn to transformer models, noted for their exemplary performance in multiple domains, including face super-resolution reconstruction tasks, thus advancing the field considerably. Bao et al. [6] unveiled a spatial attention guided CNN-transformer aggregation network (SCTANet), which synergistically blends CNN and transformer strengths to enhance the reconstruction quality and efficiency of facial details through a hybrid attention aggregation (HAA) module and sub-pixel MLP upsampling (SMU) module. Qi et al. [7] introduced an efficient latent style guided transformer-CNN framework (ELSFace), utilizing latent style encoding to direct the reconstruction of basic facial features, enhanced by high-frequency enhancement blocks (HFEB) and a sharp loss optimization algorithm to improve the quality and perceptual clarity of reconstructed facial images. Zhang et al. [26] developed a multi-stage auxiliary learning network. Gao et al. [8] developed a CNN-transformer cooperative network (CTCNet) with a multi-scale connected encoder-decoder backbone, featuring the Local-Global Feature Cooperation module (LGCM) and a multi-scale feature fusion unit (MFFU) that effectively concentrate on both local facial details and global facial structures, significantly improving face super-resolution reconstruction quality. Shi et al. [9] proposed a novel dual-branch network architecture that integrates the benefits of transformers and CNN through multi-scale parallel self-attention mechanisms and local variant convolutions, capturing both local and non-local dependencies in facial images to substantially enhance facial detail reconstruction quality. Although transformer architectures excel in capturing global image features, an over-reliance on global self-attention mechanisms may neglect important subtle local features. Thus, effectively integrating both macrostructures and microdetails remains crucial for achieving high-quality image reconstruction, which is the central aim of this research.
In summary, the domain of image super-resolution has experienced significant advancements, largely attributable to the emergence of deep learning methodologies. The convergence of CNNs and transformers has paved the way for innovative strategies aimed at resolution enhancement, particularly under challenging circumstances such as low-light scenarios. The cutting-edge techniques reviewed herein not only underscore the efficacy of these architectures in the reconstruction of high-fidelity images but also accentuate the necessity of harmonizing global and local features to achieve superior outcomes. As ongoing research continues to expand the frontiers of what is feasible, the synergistic integration of CNNs and transformers is poised to transform our approach to image super-resolution.

3. Architectural Details

In this section, we begin with a comprehensive overview of the overall architecture of the proposed CGFTNet. Subsequently, we explore in detail the critical components that constitute the network.

3.1. Overview of CGFTNet

As depicted in Figure 1, our proposed CGFTNet, which is derived from the baseline model CTCNet [8], features a tripartite U-shaped architecture, consisting of encoding, bottleneck, and decoding stages. In the encoding stage, the network is tailored to capture multi-scale local and global features; in contrast, the decoding stage focuses on the integration of these features and the reconstruction of the image. Moreover, to facilitate effective integration of features, connections along with FDRFEM are implemented between the encoding and decoding stages. For a clearer exposition of our model’s architecture, we introduce several key terminologies: $I_{LR}$ denotes the input low-resolution image, $I_{SR}$ refers to the super-resolution image produced by the model, and $I_{HR}$ represents the actual high-resolution image.
(1) Encoding Stage: As outlined earlier, our objective in the feature extraction phase is to capture the inherent structure of the image. Beginning with a low-resolution input image $I_{LR}$, we initially employ a 3 × 3 convolutional kernel for basic feature extraction. These features are then processed through three consecutive stages of feature extraction. Each stage incorporates a Local-Global Coherence module (LGCM), which consists of a facial structure awareness unit (FSAU) and a transformer unit. For more details about LGCM, FSAU, and FRM, please refer to [8]. Following this, the features are subjected to a downsampling process, involving a 3 × 3 convolutional layer with a stride of 2 designed to reduce the dimensions of the feature maps while retaining essential information. This is coupled with a LeakyReLU activation function and another 3 × 3 convolutional layer. As a result, each feature extraction stage halves the size of the output feature maps and doubles the channel count. For example, given an initial feature map $I_{LR} \in \mathbb{R}^{C \times H \times W}$, the $i$-th feature extraction stage of the encoder produces a feature map $I_{en}^{i} \in \mathbb{R}^{2^{i}C \times \frac{H}{2^{i}} \times \frac{W}{2^{i}}}$.
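For concreteness, the downsampling step of an encoding stage can be sketched in PyTorch as follows; the padding values and LeakyReLU slope are assumptions, since the paper does not state them.

```python
import torch
import torch.nn as nn

class EncoderDownsample(nn.Module):
    """Sketch of one encoder downsampling step: a stride-2 3x3 convolution that
    halves the spatial size and doubles the channel count, followed by
    LeakyReLU and a second 3x3 convolution."""

    def __init__(self, in_channels: int):
        super().__init__()
        out_channels = in_channels * 2  # channel count doubles at each stage
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

# a C x H x W feature map becomes 2C x H/2 x W/2, e.g. 32x128x128 -> 64x64x64
print(EncoderDownsample(32)(torch.randn(1, 32, 128, 128)).shape)
```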
(2) Bottleneck: Situated between the feature extraction and reconstruction phases is a critical bottleneck stage. In this stage, all features accumulated during the encoding process are consolidated. To ensure these features are effectively leveraged in the subsequent reconstruction phase, a Feature Refinement module (FRM) is employed. The FRM is designed to optimize and amplify the encoded features, enhancing the delineation of facial structure details. The intervention of the FRM allows for a more focused processing of facial features, thereby enhancing information expression across different facial regions.
(3) Decoding Stage: The objective of the decoding stage is to restore high-quality facial images. We introduce an innovative module, CGCAF. The decoder receives deep features from the low-resolution image and integrates these features progressively via the CGCAF to construct a super-resolution image. As illustrated in Figure 1, the decoder comprises an upsampling module, the CGCAF, and the Local-Global Coherence module (LGCM). The upsampling module includes a 6 × 6 transposed convolution layer with a stride of 2, followed by a LeakyReLU activation function and a 3 × 3 convolution layer. This transposed convolution layer expands the feature map dimensions and facilitates information extraction. Consequently, the decoder reduces the number of output feature channels while doubling the feature map dimensions with each step. The CGCAF fuses features extracted during the encoding phase, ensuring comprehensive utilization of both local and global features from the encoding and decoding stages to produce high-quality facial images. The decoding phase culminates with a 3 × 3 convolution layer that maps the integrated features to the final super-resolution output $I_{Out}$.
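A corresponding sketch of the upsampling module is given below; the padding is chosen so that a 6 × 6 transposed convolution with stride 2 exactly doubles the spatial size, and the activation slope is an assumption.

```python
import torch
import torch.nn as nn

class DecoderUpsample(nn.Module):
    """Sketch of the decoder upsampling module: a 6x6 transposed convolution
    with stride 2 (padding 2, so the spatial size exactly doubles) that halves
    the channel count, followed by LeakyReLU and a 3x3 convolution."""

    def __init__(self, in_channels: int):
        super().__init__()
        out_channels = in_channels // 2  # channel count halves at each step
        self.body = nn.Sequential(
            nn.ConvTranspose2d(in_channels, out_channels, kernel_size=6, stride=2, padding=2),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.body(x)

# a 2C x H x W feature map becomes C x 2H x 2W, e.g. 128x32x32 -> 64x64x64
print(DecoderUpsample(128)(torch.randn(1, 128, 32, 32)).shape)
```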
Ultimately, an enhanced facial image is produced by merging the low-resolution image $I_{LR}$ with the super-resolution output $I_{Out}$. Throughout the training phase, provided with a dataset $\{I_{LR}^{i}, I_{HR}^{i}\}_{i=1}^{N}$, we refine the parameters of our CGFTNet model by minimizing a pixel-level loss function:

$$L(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| F_{CGFTNet}(I_{LR}^{i}, \Theta) - I_{HR}^{i} \right\|_{1},$$

where $N$ denotes the total number of training images. The terms $I_{LR}^{i}$ and $I_{HR}^{i}$ correspond to the low-resolution (LR) image and the associated true high-resolution (HR) image of the $i$-th sample, respectively. Concurrently, $F_{CGFTNet}(\cdot)$ and $\Theta$ refer to the CGFTNet network and its network parameters, and $\|\cdot\|_{1}$ represents the L1 norm.
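A minimal training-step sketch of this objective follows; `model` stands in for $F_{CGFTNet}$, the optimizer is supplied externally, and the names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               lr_imgs: torch.Tensor, hr_imgs: torch.Tensor) -> float:
    """One optimization step on the pixel-level L1 objective defined above."""
    optimizer.zero_grad()
    sr_imgs = model(lr_imgs)            # F_CGFTNet(I_LR; Theta)
    loss = F.l1_loss(sr_imgs, hr_imgs)  # mean absolute error, i.e. the L1 term
    loss.backward()
    optimizer.step()
    return loss.item()
```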

3.2. Content-Guided Channel Attention Fusion

To augment the capability for feature representation and to enhance the integration of features between the encoder and decoder, we have incorporated a content-guided channel attention fusion strategy in the decoding phase. This approach strengthens the convolutional neural network’s ability to represent distinct regions within the features, particularly those regions that are crucial for facial restoration tasks.
Specifically, our principal goal is to explore and exploit the features derived during the encoding stage in the decoding process. Traditional (vanilla) convolution layers lack mechanisms, such as spatial and channel attention, that constrain the solution space; moreover, these two attentions are typically computed independently, without exchanging information between the channel and spatial dimensions. This results in suboptimal utilization of facial priors such as structural, color, and texture consistency [17] and does not adequately account for the unique significance of each channel within the feature space or the contextual information across spatial dimensions. To address this challenge, we designed the CGCAF module, detailed in Figure 2. As outlined, CGCAF first computes channel and spatial attentions using global average pooling (GAP) and global max pooling (GMP), along with 1 × 1 and 7 × 7 convolution layers. It then merges the channel and spatial attentions through a straightforward addition to create a preliminary spatial importance map (PSIM). Following this, the content of the input features guides the production of the final channel-specific spatial importance map (SIM): the channels of the feature $X$ are rearranged through a channel shuffle operation, followed by a group convolution and a sigmoid activation. To effectively merge features from the encoding and decoding processes at corresponding levels, they are weighted and summed based on the SIM. This approach acknowledges that different layers have varying receptive fields, and simple additive or concatenative operations are insufficient to address mismatches in receptive fields before fusion. The specific formulas are provided below:
$$X = F_{en} + F_{de},$$
$$F_{C} = C_{1\times1}(\max(0, C_{1\times1}(X_{GAP}^{C}))),$$
$$F_{S} = C_{7\times7}([X_{GAP}^{S}, X_{GMP}^{S}]),$$
$$F_{COS} = F_{C} + F_{S},$$
$$F = \sigma(GC_{7\times7}(CS([X, F_{COS}]))),$$
where $\max(0, x)$ represents the ReLU activation function, $C_{k\times k}(\cdot)$ denotes a convolution with a kernel size of $k \times k$, and $[\cdot]$ signifies a channel-wise concatenation operation. $F_{C}$, $F_{S}$, $X_{GAP}^{C}$, $X_{GAP}^{S}$, and $X_{GMP}^{S}$ respectively refer to features processed by channel attention, features processed by spatial attention, global average pooling across the channel dimension, global average pooling across the spatial dimension, and global max pooling across the spatial dimension. $\sigma$ represents the Sigmoid activation function, $CS(\cdot)$ indicates the channel shuffle operation, and $GC_{k\times k}(\cdot)$ denotes a grouped convolution layer with a kernel size of $k \times k$. By incorporating skip connections, the input features are directly added to the fusion process, aiding in mitigating the problem of vanishing gradients and simplifying the learning trajectory. A 1 × 1 convolution then aligns the channels with the subsequent network layers. For example, the feature maps acquired during the encoding phase have channel counts of 32, 64, and 128. According to our overall network design requirements, the 128-channel feature maps from the encoding stage must also emerge as 128 channels after processing through our CGCAF module. The purpose of the post-concatenation 1 × 1 convolution is precisely this, whereas the pre-concatenation 1 × 1 convolution is intended to reduce computational demands. It is evident that the two feature maps sourced from the encoding and decoding phases, along with the feature map derived through weighted summation, ultimately undergo a concatenation operation. This aims to minimize the loss of prior knowledge during the complex computational process.
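To make the SIM computation concrete, the following PyTorch sketch follows the equations above under stated assumptions: the reduction ratio, the grouping of the 7 × 7 grouped convolution, and a simple channel interleaving in place of the channel shuffle are all our choices, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ContentGuidedSIM(nn.Module):
    """Sketch of the spatial importance map (SIM) computation in CGCAF."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # channel attention branch: GAP over space, then two 1x1 convolutions with ReLU
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )
        # spatial attention branch: 7x7 conv over concatenated avg- and max-pooled maps
        self.spatial_att = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        # content-guided refinement: shuffled concat -> grouped 7x7 conv -> sigmoid
        self.group_conv = nn.Conv2d(2 * channels, channels, kernel_size=7,
                                    padding=3, groups=channels)
        self.sigmoid = nn.Sigmoid()

    def forward(self, f_en: torch.Tensor, f_de: torch.Tensor) -> torch.Tensor:
        x = f_en + f_de                                         # X = F_en + F_de
        f_c = self.channel_att(x)                               # F_C, shape (B, C, 1, 1)
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True)[0]], dim=1)
        f_s = self.spatial_att(pooled)                          # F_S, shape (B, 1, H, W)
        f_cos = f_c + f_s                                       # preliminary SIM, broadcast to (B, C, H, W)
        b, c, h, w = x.shape
        # interleave each channel of X with its preliminary SIM (stand-in for channel shuffle)
        mixed = torch.stack([x, f_cos], dim=2).reshape(b, 2 * c, h, w)
        return self.sigmoid(self.group_conv(mixed))             # final SIM F, shape (B, C, H, W)

sim = ContentGuidedSIM(64)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(sim.shape)  # torch.Size([1, 64, 32, 32])
```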
Due to the extensive concatenation of information, some channel importance information is lost; we resolve this issue using a channel attention (CA) strategy. As illustrated in Figure 3, the specific architecture of CA comprises adaptive average pooling, two 1 × 1 convolutions, dual activation functions, and a residual connection. The complete mathematical formulation of the CGCAF module is as follows:
$$F_{out} = CA(C_{1\times1}(\mathrm{Concat}(F_{en}, F_{de}, C_{1\times1}(F_{p})))),$$
$$F_{p} = F_{en} + F_{de} + F_{en} \times F + F_{de} \times (1 - F),$$
where $CA$ represents the channel attention and $F$ refers to the spatial importance map (SIM) mentioned earlier.
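The remaining fusion step can be sketched as follows, taking as input the SIM produced by the previous sketch. The channel widths and the reduction ratio inside CA are assumptions, and the pre-concatenation 1 × 1 convolution keeps the channel width for simplicity, whereas the paper uses it to reduce computation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CA sketch (Figure 3): adaptive average pooling, two 1x1 convolutions,
    ReLU and Sigmoid activations, and a residual connection."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + x * self.att(x)  # residual connection around the reweighted features


class CGCAFusion(nn.Module):
    """Sketch of the final CGCAF fusion: SIM-weighted mixing of encoder and
    decoder features, concatenation with the projected mixed map, a 1x1
    convolution back to the working channel width, and channel attention."""

    def __init__(self, channels: int):
        super().__init__()
        self.pre = nn.Conv2d(channels, channels, kernel_size=1)       # pre-concatenation 1x1 conv
        self.post = nn.Conv2d(3 * channels, channels, kernel_size=1)  # post-concatenation 1x1 conv
        self.ca = ChannelAttention(channels)

    def forward(self, f_en: torch.Tensor, f_de: torch.Tensor, sim: torch.Tensor) -> torch.Tensor:
        # F_p = F_en + F_de + F_en * F + F_de * (1 - F), with F the SIM map
        f_p = f_en + f_de + f_en * sim + f_de * (1.0 - sim)
        fused = self.post(torch.cat([f_en, f_de, self.pre(f_p)], dim=1))
        return self.ca(fused)
```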

3.3. Frequency Domain and Reparameterized Focus Convolution Feature Enhancement Module

As a critical module in CGFTNet, the FDRFEM is designed to facilitate global spatial interactions. As illustrated in Figure 4, FDRFEM is composed of a frequency-domain branch and a reparameterized focused convolution, which are responsible for global and local feature extraction, respectively.
(1) Frequency-Domain Branch: In the context of FSR, a major challenge lies in effectively extracting key facial features (such as eyes, eyebrows, and mouth) while simultaneously guiding the network to focus on these features. To address this, we leverage Fourier transform to enable the model to extract critical information for improved detail recovery. As depicted in Figure 4, the frequency-domain branch first transforms the input features into the frequency domain via a 2D real-valued fast Fourier transform (FFT). In the frequency domain, the feature maps are modulated by a set of learnable complex filters, which selectively adjust the frequency components of the feature maps, enhancing or suppressing specific frequencies. This process allows the model to capture the global frequency characteristics of the input data. Subsequently, an inverse fast Fourier transform (iFFT) is applied to convert the features back into the spatial domain, allowing the model to exploit these frequency-enhanced feature maps for improved performance.
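A minimal sketch of this branch is given below, assuming a fixed feature-map size and one learnable complex weight per channel and rFFT frequency bin; the identity initialization of the filter is also an assumption.

```python
import torch
import torch.nn as nn

class FrequencyDomainBranch(nn.Module):
    """Sketch of the frequency-domain branch: rfft2 into the frequency domain,
    elementwise modulation by a learnable complex-valued filter, and irfft2
    back to the spatial domain."""

    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        freq_bins = width // 2 + 1  # width of the one-sided rFFT spectrum
        # learnable complex filter stored as (real, imag) pairs, initialized to identity
        init = torch.stack([torch.ones(channels, height, freq_bins),
                            torch.zeros(channels, height, freq_bins)], dim=-1)
        self.filter = nn.Parameter(init)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        freq = torch.fft.rfft2(x, norm="ortho")           # (B, C, H, W//2 + 1), complex
        freq = freq * torch.view_as_complex(self.filter)  # enhance / suppress frequency components
        return torch.fft.irfft2(freq, s=x.shape[-2:], norm="ortho")

print(FrequencyDomainBranch(64, 32, 32)(torch.randn(1, 64, 32, 32)).shape)
```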
(2) Reparameterized Focused Convolution: As mentioned earlier, the frequency-domain branch simulates a convolution operation in the spatial domain, utilizing kernels of global size and circular padding, thereby capturing spatial information across the entire feature map and facilitating the modeling of long-range dependencies. The reparameterized focused convolution is designed to capture local information and enhance the model’s representational capacity through reparameterization without incurring additional inference costs. For a pre-trained model, RefConv applies a learnable refocusing transformation to the base convolution kernels inherited from the pre-trained model, establishing parameter correlations. For instance, in depth-wise convolution, RefConv allows the convolution kernel parameters of one channel to be linked with those of other channels, enabling a broader refocus beyond the input features alone. Concretely, RefConv utilizes the pre-trained convolution kernel as a base weight and applies the refocusing transformation to generate new convolution kernels. This transformation incorporates trainable refocusing weights that are combined with the base weights to produce transformed weights for processing the input features. The objective of the refocusing transformation is to learn incremental adjustments to the base weights, analogous to residual learning, where residual blocks capture incremental changes to the base feature maps. In doing so, RefConv enables each convolution channel to establish connections with other channels, facilitating the learning of novel feature representations.
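The refocusing idea can be sketched as follows; the exact form of the transformation is our assumption based on the description above (a small trainable convolution over the frozen base kernels that produces residual kernel updates).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefConv2d(nn.Module):
    """Hedged sketch of reparameterized focused convolution: the pre-trained
    base kernels are frozen, and a trainable refocusing convolution over the
    kernel stack produces residual kernel updates, so each channel's kernel
    can draw on the kernels of other channels."""

    def __init__(self, base_weight: torch.Tensor, stride: int = 1, padding: int = 1):
        super().__init__()
        c_out, c_in, k, _ = base_weight.shape
        assert k % 2 == 1, "odd kernel size assumed so the refocusing conv preserves kernel shape"
        self.register_buffer("base_weight", base_weight)  # frozen pre-trained kernels
        # refocusing weights mix information across the kernel stack
        self.refocus = nn.Conv2d(c_in, c_in, kernel_size=k, padding=k // 2, bias=False)
        nn.init.zeros_(self.refocus.weight)  # start from the identity mapping (residual-style)
        self.stride, self.padding = stride, padding

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # transformed kernels = base kernels + learned increment
        weight = self.base_weight + self.refocus(self.base_weight)
        return F.conv2d(x, weight, stride=self.stride, padding=self.padding)

base = torch.randn(64, 64, 3, 3)  # e.g. the kernels of a pre-trained 3x3 convolution
print(RefConv2d(base)(torch.randn(1, 64, 32, 32)).shape)
```

After training, the transformed kernels can be computed once and used as an ordinary convolution, which is what keeps the inference cost unchanged.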
By integrating the frequency-domain branch and reparameterized focused convolution, FDRFEM effectively captures both local facial features and global relationships, thereby enabling high-quality image reconstruction.

4. Experiment

4.1. Datasets

In our experiments, we employed the CelebA [27] dataset for training and evaluated the model’s performance on the Helen [28] dataset. Since the face images in CelebA exhibit varying heights and widths, we first cropped the images around their central points and resized them to a fixed resolution of 128 × 128 pixels to serve as high-resolution (HR) images. Subsequently, we downsampled these HR images to 16 × 16 pixels and then upsampled them back to 128 × 128 pixels using bicubic interpolation, treating the results as the low-resolution (LR) inputs. For all subsequent experiments that required interpolation, we used bicubic interpolation, and all conclusions are based on this premise. The model was trained using 18,000 samples from the CelebA dataset, with an additional 1000 samples reserved for testing. Furthermore, we evaluated the model’s generalization ability by directly testing the CelebA-pretrained model on the Helen dataset.
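The degradation pipeline described above can be sketched as follows; the use of PIL/torchvision and the file handling are illustrative assumptions.

```python
from PIL import Image
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def make_lr_hr_pair(path: str):
    """Build one (LR, HR) pair: centre crop and resize to a 128x128 HR image,
    bicubically downsample to 16x16, then bicubically upsample back to 128x128
    as the LR input."""
    img = Image.open(path).convert("RGB")
    hr = TF.center_crop(img, min(img.size))  # square crop around the image centre
    hr = TF.resize(hr, [128, 128], interpolation=InterpolationMode.BICUBIC)
    lr = TF.resize(hr, [16, 16], interpolation=InterpolationMode.BICUBIC)
    lr = TF.resize(lr, [128, 128], interpolation=InterpolationMode.BICUBIC)
    return TF.to_tensor(lr), TF.to_tensor(hr)
```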

4.2. Implementation Details

We implemented our model using the PyTorch framework and optimized it with the Adam optimizer, setting $\beta_1$ to 0.9 and $\beta_2$ to 0.99. The initial learning rate was configured to $2 \times 10^{-4}$. To assess the quality of the super-resolution (SR) results, we utilized four objective image quality evaluation metrics: Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) [29], Learned Perceptual Image Patch Similarity (LPIPS) [30], and Visual Information Fidelity (VIF) [31].
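For reference, the corresponding optimizer setup is sketched below, assuming a constructed CGFTNet instance is passed in.

```python
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module) -> torch.optim.Adam:
    # Adam with the stated hyperparameters: beta1 = 0.9, beta2 = 0.99, initial lr = 2e-4
    return torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.99))
```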

4.3. Ablation Studies

In this section, we present a series of ablation studies to validate the effectiveness of our proposed model. All ablation studies were conducted on the CelebA test set.
(1) Effectiveness of FDRFEM: The FDRFEM is a key component of CGFTNet, designed to achieve global spatial interactions. To evaluate the effectiveness of FDRFEM and the feasibility of this integration, we conducted a set of ablation studies. FDRFEM comprises a frequency-domain branch and a reparameterized focused convolution. Thus, we designed three modified models for comparison. The first model removes all FDRFEM from the encoder and decoder, labeled as “w/o FDRFEM”. The second model removes all frequency-domain branches while retaining the reparameterized focused convolutions, labeled as “FDRFEM w/o Fourier”. The third model removes all reparameterized focused convolutions while keeping the frequency-domain branches in FDRFEM, labeled as “FDRFEM w/o Refconv”. The results of these modified networks are shown in Table 1. Based on these results, we can draw the following observations:
(a) By comparing the first and last rows of Table 1, we observe that the introduction of FDRFEM significantly improves model performance, confirming its effectiveness.
(b) Comparing the first three rows, it becomes clear that either the frequency-domain branch or the reparameterized focused convolution independently improves model performance, as both global relationships and local features contribute to image reconstruction.
(c) The comparison of the last three rows further demonstrates that both the frequency-domain branch and reparameterized focused convolution serve distinct roles in the FSR task. The frequency-domain branch helps the network capture long-range dependencies, while the reparameterized focused convolution captures local details, providing complementary information for the final SR image reconstruction. Relying on only one of these components leads to suboptimal performance, further validating the effectiveness of FDRFEM and the viability of combining CNN and transformer architectures.
(2) Effectiveness of CGCAF: The CGCAF module is specifically designed for feature fusion between the encoding and decoding stages. In this section, we conduct a series of experiments to demonstrate the efficacy of this connection. The first two experiments verify the necessity of content-guided attention (CGA), which refers to the part of the CGCAF module that precedes the acquisition of the $F_{p}$ feature map, as shown in Figure 2. The third and fourth experiments retain CGA but apply only concatenation or addition operations for feature fusion. The fifth and sixth experiments explore the necessity of the final channel attention mechanism. The last two experiments, where neither CGA nor channel attention is used, serve as baselines. Each set of experiments includes a comparison between the Concat and Add operations to explain our final choice of Concat. From the results in Table 2, we can observe the following:
(a) Employing feature fusion strategies between the encoding and decoding stages can significantly impact model performance, which underscores the importance of investigating strategies that leverage encoding-stage features during decoding for image reconstruction.
(b) By comparing the third row with the fifth, we can find that the channel attention (CA) mechanism also positively impacts model performance.
(c) By comparing the third row with the others, we can observe that the combination of the Concat operation and CA yields the best results.
(d) By comparing each pair of experiments that includes CGA with those that do not, such as the first row with the third row and the second row with the fourth row, we can observe that models incorporating CGA demonstrate superior performance metrics compared to those without.
(3) Effectiveness of two key modules: Based on the results presented in Table 3, several observations can be made. We adopt CTCNet [8] as our baseline model. The term “Threelines” refers to a pruned variant of CTCNet, wherein the number of feature lines interconnecting the encoder and decoder has been reduced, retaining only three lines. Comparing the first and second rows, it is evident that the pruned model performs better than the baseline. The comparison between the second and third rows indicates that introducing the FDRFEM module alone improves the model’s performance. Analyzing the last four rows shows that simultaneously incorporating both FDRFEM and CGCAF yields higher performance than introducing each module individually.

4.4. Comparison with Other Methods

In this section, we compare our CGFTNet with several methods, including classical FSR techniques such as SPARNet [32] and SISN [33], as well as more recent approaches like ELSFace [7], SCGAN [34], SCTANet [6], SFMNet [35], and the pioneering transformer-based image restoration model SwinIR [36]. For a fair comparison, all models were trained on the same CelebA dataset (see Figure 5).
(1) Comparison on the CelebA test set: Quantitative comparisons on the CelebA test set with other SOTA methods are presented in Table 4. The results clearly demonstrate that CGFTNet outperforms competing methods in terms of PSNR, VIF, LPIPS, and SSIM. These findings highlight the superior performance and effectiveness of CGFTNet. Additionally, as shown in the visual comparisons in Figure 5, most previous methods struggle to accurately restore facial features, such as the eyes and nose, whereas our CGFTNet reconstructs facial structures with greater precision, producing results that more closely resemble the original high-resolution (HR) images. This further substantiates the advantages of CGFTNet in facial restoration tasks.
(2) Comparison on the Helen dataset: To assess the generalization capability of CGFTNet, we directly tested the model, trained on CelebA, on the Helen test set. The quantitative results for ×8 SR experiments on the Helen test set are provided in Table 4. As shown in Table 4, CGFTNet consistently achieves the best performance on the Helen dataset as well. Furthermore, the visual comparisons in Figure 6 reveal that the performance of many competing methods significantly degrades on this dataset, failing to accurately restore facial details. In contrast, CGFTNet maintains its ability to recover realistic facial contours and details, further affirming its robustness and generalizability.
(3) Comparison of real-world facial images: Recovering facial images in real-world scenarios remains a formidable challenge. While the previously mentioned experiments were conducted in simulated environments, these settings do not fully capture the complexity of real-world conditions. To further validate the effectiveness of CGFTNet, we performed experiments on low-quality facial images from the WiderFace dataset [37], which were captured in natural scenes, representing real-world diversity and complexity. These images are inherently low-resolution, requiring no additional downsampling. In this context, we aim to recover facial images with finer texture details and well-preserved facial structures. Figure 7 illustrates a visual comparison of the reconstruction performance on real-world images. Thanks to the collaborative CNN-Transformer mechanism and the specifically designed modules of CGFTNet, our method effectively restores cleaner facial details and more accurate facial structures. Furthermore, we evaluated the performance of CGFTNet in downstream tasks, such as facial matching. For this experiment, high-resolution frontal facial images of the candidates were used as source samples, while the corresponding low-resolution facial images from real-world conditions served as target samples. To ensure the robustness of our results, we conducted 8 separate trials, randomly selecting five pairs of candidate samples in each trial and calculating the average similarity. The quantitative results, presented in Table 5, demonstrate that our method consistently achieves higher similarity scores across all trials. This further demonstrates that CGFTNet can generate more realistic high-resolution facial images in real-world surveillance applications, showcasing its high practicality and adaptability.

4.5. Model Complexity Analysis

The results demonstrate that our model surpasses most competing methods in both quantitative and qualitative evaluations. Furthermore, model size and execution time are crucial metrics for assessing model efficiency. In Figure 8, we present a comparison of parameter count, performance, and execution time across models. Notably, our CGFTNet achieves the best quantitative performance while maintaining comparable execution time and parameter count. Overall, CGFTNet achieves an optimal balance between model size, performance, and execution time.

5. Conclusions

In this study, we introduce a deep learning framework termed CGFTNet, meticulously engineered for the purpose of FSR. Our methodology is anchored in an encoder-decoder architecture, augmented with two pioneering modules: the FDRFEM and the CGCAF. These modules are pivotal in capturing both the minutiae of local facial features and the overarching structures of the face, which are essential for achieving high-fidelity image reconstruction.
The FDRFEM module incorporates a frequency-domain branch that harnesses the fast Fourier transform (FFT) to distill global features, complemented by a reparameterized focused convolution that refines the representation of local features. This synergistic approach enables our model to adeptly capture and synthesize both global and local facial attributes, thereby enhancing super-resolution performance.
The CGCAF module is crafted to dynamically amalgamate features from the encoder and decoder phases, ensuring that the reconstructed image is enriched with a comprehensive array of features that encapsulate the intricacies of facial morphology. This content-guided attention mechanism optimizes the feature fusion process, culminating in superior detail restoration and elevated image quality.
Through rigorous experimentation on both synthetic datasets, such as CelebA, and real-world datasets, we have demonstrated that CGFTNet surpasses several state-of-the-art methodologies across a spectrum of quantitative metrics, including peak signal-to-noise ratio (PSNR), Structural Similarity Index (SSIM), learned perceptual image patch similarity (LPIPS), and visual information fidelity (VIF). Qualitative assessments further reveal that our model adeptly reconstructs facial images with a higher degree of accuracy and detail, particularly in the depiction of intricate features such as eyes, noses, and mouths, which often elude other methods.
Additionally, we have conducted experiments to appraise the generalization prowess of CGFTNet by subjecting it to the Helen dataset and real-world facial imagery from the WiderFace dataset. The findings indicate that our model not only excels on standardized datasets but also demonstrates robustness and efficacy in real-world contexts, underscoring its practical applicability.
Regarding model complexity, CGFTNet achieves a judicious equilibrium between performance, parameter count, and execution time. Our analysis reveals that, while attaining superior quantitative performance, CGFTNet maintains a comparable level of efficiency, rendering it a feasible solution for real-world applications where computational resources may be constrained.
In conclusion, the CGFTNet signifies a substantial leap forward in the realm of facial super resolution. Its proficiency in restoring high-quality facial images from low-resolution inputs has far-reaching implications for applications such as identity recognition, video surveillance, and facial analysis. The model’s proficiency in downstream tasks, including facial matching, further accentuates its practical utility. Future endeavors will concentrate on expanding the model’s capabilities to address more complex and diverse real-world conditions, as well as exploring additional domains where the reconstruction of high-quality facial images is paramount.

Author Contributions

Conceptualization, Y.Y.; Methodology, S.C. and A.D.; Software, Y.Y.; Validation, S.C.; Writing—Original Draft Preparation, Y.Y.; Writing Review & Editing, S.C.; Supervision, A.D.; Project Administration, A.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Scientific and Technological Innovation 2030 Major Project under Grant 2022ZD0115800 and the Key Laboratory Open Projects in Xinjiang Uygur Autonomous Region under Grant 2023D04028.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available at https://mmlab.ie.cuhk.edu.hk/projects/CelebA.html, https://exposing.ai/helen/, http://shuoyang1213.me/WIDERFACE/ (accessed on 15 October 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Dogan, B.; Gu, S.; Timofte, R. Exemplar Guided Face Image Super-Resolution Without Facial Landmarks. In Proceedings of the IEEE CVPR, Long Beach, CA, USA, 16–20 June 2019; pp. 1814–1823. [Google Scholar]
  2. Jiang, J.; Yu, Y.; Hu, J.; Tang, S.; Ma, J. Deep CNN Denoiser and Multi-layer Neighbor Component Embedding for Face Hallucination. In Proceedings of the IJCAI 2018, Stockholm, Sweden, 13–19 July 2018; pp. 771–778. [Google Scholar]
  3. Grm, K.; Scheirer, W.J.; Struc, V. Face Hallucination Using Cascaded Super-Resolution and Identity Priors. IEEE Trans. Image Process. 2020, 29, 2150–2165. [Google Scholar] [CrossRef] [PubMed]
  4. Yu, X.; Fernando, B.; Ghanem, B.; Porikli, F.; Hartley, R. Face Super-Resolution Guided by Facial Component Heatmaps. In Proceedings of the Computer Vision—ECCV 2018—15th European Conference, Munich, Germany, 8–14 September 2018; Proceedings, Part IX; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2018; Volume 11213, pp. 219–235. [Google Scholar]
  5. Bulat, A.; Tzimiropoulos, G. Super-FAN: Integrated Facial Landmark Localization and Super-Resolution of Real-World Low Resolution Faces in Arbitrary Poses with GANs. In Proceedings of the IEEE CVPR, Salt Lake City, UT, USA, 18–22 June 2018; pp. 109–117. [Google Scholar]
  6. Bao, Q.; Liu, Y.; Gang, B.; Yang, W.; Liao, Q. SCTANet: A spatial attention-guided CNN-transformer aggregation network for deep face image super-resolution. IEEE Trans. Multimed. 2023, 25, 8554–8565. [Google Scholar] [CrossRef]
  7. Qi, H.; Qiu, Y.; Luo, X.; Jin, Z. An Efficient Latent Style Guided Transformer-CNN Framework for Face Super-Resolution. IEEE Trans. Multimed. 2024, 26, 1589–1599. [Google Scholar] [CrossRef]
  8. Gao, G.; Xu, Z.; Li, J.; Yang, J.; Zeng, T.; Qi, G. CTCNet: A CNN-Transformer Cooperation Network for Face Image Super-Resolution. IEEE Trans. Image Process. 2023, 32, 1978–1991. [Google Scholar] [CrossRef] [PubMed]
  9. Shi, J.; Wang, Y.; Yu, Z.; Li, G.; Hong, X.; Wang, F.; Gong, Y. Exploiting Multi-Scale Parallel Self-Attention and Local Variation via Dual-Branch Transformer-CNN Structure for Face Super-Resolution. IEEE Trans. Multimed. 2024, 26, 2608–2620. [Google Scholar] [CrossRef]
  10. Srivastava, A.; Chanda, S.; Pal, U. Aga-gan: Attribute guided attention generative adversarial network with u-net for face hallucination. Image Vis. Comput. 2022, 126, 104534. [Google Scholar] [CrossRef]
  11. Li, W.; Guo, H.; Liu, X.; Liang, K.; Hu, J.; Ma, Z.; Guo, J. Efficient Face Super-Resolution via Wavelet-based Feature Enhancement Network. arXiv 2024, arXiv:2407.19768. [Google Scholar]
  12. Peng, Q.; Jiang, Z.; Huang, Y.; Peng, J. A Unified Framework to Super-Resolve Face Images of Varied Low Resolutions. arXiv 2023, arXiv:2306.03380. [Google Scholar]
  13. Wang, Z.; Chen, J.; Hoi, S.C.H. Deep Learning for Image Super-Resolution: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3365–3387. [Google Scholar] [CrossRef] [PubMed]
  14. Li, J.; Pei, Z.; Zeng, T. From Beginner to Master: A Survey for Deep Learning-based Single-Image Super-Resolution. arXiv 2021, arXiv:2109.14335. [Google Scholar]
  15. Gao, G.; Wang, Z.; Li, J.; Li, W.; Yu, Y.; Zeng, T. Lightweight Bimodal Network for Single-Image Super-Resolution via Symmetric CNN and Recursive Transformer. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI, Vienna, Austria, 23–29 July 2022; pp. 913–919. [Google Scholar]
  16. Li, J.; Fang, F.; Mei, K.; Zhang, G. Multi-scale Residual Network for Image Super-Resolution. In Proceedings of the ECCV 15th European Conference, Munich, Germany, 8–14 September 2018; Lecture Notes in Computer Science. Springer: Berlin/Heidelberg, Germany, 2018; Volume 11212, pp. 527–542. [Google Scholar]
  17. Wang, H.; Cheng, S.; Li, Y.; Du, A. Lightweight Remote-Sensing Image Super-Resolution via Attention-Based Multilevel Feature Fusion Network. IEEE Trans. Geosci. Remote Sens. 2023, 61, 2005715. [Google Scholar]
  18. Guo, Y.; Chen, J.; Wang, J.; Chen, Q.; Cao, J.; Deng, Z.; Xu, Y.; Tan, M. Closed-Loop Matters: Dual Regression Networks for Single Image Super-Resolution. In Proceedings of the IEEE/CVF CVPR, Seattle, WA, USA, 13–19 June 2020; pp. 5406–5415. [Google Scholar]
  19. Zhang, X.; Gao, P.; Liu, S.; Zhao, K.; Li, G.; Yin, L.; Chen, C.W. Accurate and Efficient Image Super-Resolution via Global-Local Adjusting Dense Network. IEEE Trans. Multimed. 2021, 23, 1924–1937. [Google Scholar] [CrossRef]
  20. Zhang, K.; Zhang, Z.; Cheng, C.; Hsu, W.H.; Qiao, Y.; Liu, W.; Zhang, T. Super-Identity Convolutional Neural Network for Face Hallucination. In Proceedings of the ECCV 15th European Conference, Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; Volume 11215, pp. 196–211. [Google Scholar]
  21. Chen, Y.; Tai, Y.; Liu, X.; Shen, C.; Yang, J. FSRNet: End-to-End Learning Face Super-Resolution With Facial Priors. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2492–2501. [Google Scholar]
  22. Kim, D.; Kim, M.; Kwon, G.; Kim, D. Progressive Face Super-Resolution via Attention to Facial Landmark. In Proceedings of the 30th British Machine Vision Conference, BMVC, Cardiff, UK, 9–12 September 2019; p. 192. [Google Scholar]
  23. Cheng, S.; Chan, R.; Du, A. CACFTNet: A Hybrid Cov-Attention and Cross-Layer Fusion Transformer Network for Hyperspectral Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17. [Google Scholar] [CrossRef]
  24. Rai, D.; Rajput, S.S. Low-light robust face image super-resolution via neuro-fuzzy inferencing-based locality constrained representation. IEEE Trans. Instrum. Meas. 2023, 72, 5015911. [Google Scholar] [CrossRef]
  25. Yin, Z.; Liu, M.; Li, X.; Yang, H.; Xiao, L.; Zuo, W. MetaF2N: Blind Image Super-Resolution by Learning Efficient Model Adaptation from Faces. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 12987–12998. [Google Scholar]
  26. Zhang, H.; Cheng, S.; Du, A. Multi-Stage Auxiliary Learning for Visible-Infrared Person Re-identification. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 12032–12047. [Google Scholar] [CrossRef]
  27. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep Learning Face Attributes in the Wild. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 3730–3738. [Google Scholar]
  28. Le, V.; Brandt, J.; Lin, Z.; Bourdev, L.D.; Huang, T.S. Interactive Facial Feature Localization. In Proceedings of the ECCV 12th European Conference on Computer Vision, Florence, Italy, 7–13 October 2012; Proceedings, Part III. Springer: Berlin/Heidelberg, Germany, 2012; Volume 7574, pp. 679–692. [Google Scholar]
  29. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed]
  30. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595. [Google Scholar]
  31. Sheikh, H.R.; Bovik, A.C. Image information and visual quality. IEEE Trans. Image Process. 2006, 15, 430–444. [Google Scholar] [CrossRef] [PubMed]
  32. Chen, C.; Gong, D.; Wang, H.; Li, Z.; Wong, K.K. Learning Spatial Attention for Face Super-Resolution. IEEE Trans. Image Process. 2021, 30, 1219–1231. [Google Scholar] [CrossRef] [PubMed]
  33. Lu, T.; Wang, Y.; Zhang, Y.; Wang, Y.; Wei, L.; Wang, Z.; Jiang, J. Face hallucination via split-attention in split-attention network. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual, 20–24 October 2021; pp. 5501–5509. [Google Scholar]
  34. Hou, H.; Xu, J.; Hou, Y.; Hu, X.; Wei, B.; Shen, D. Semi-cycled generative adversarial networks for real-world face super-resolution. IEEE Trans. Image Process. 2023, 32, 1184–1199. [Google Scholar]
  35. Wang, C.; Jiang, J.; Zhong, Z.; Liu, X. Spatial-Frequency Mutual Learning for Face Super-Resolution. In Proceedings of the IEEE/CVF CVPR, Vancouver, BC, Canada, 17–24 June 2023; pp. 22356–22366. [Google Scholar]
  36. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Gool, L.V.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the IEEE/CVF ICCVW 2021, Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
  37. Yang, S.; Luo, P.; Loy, C.C.; Tang, X. WIDER FACE: A Face Detection Benchmark. In Proceedings of the CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 5525–5533. [Google Scholar]
Figure 1. An overview of the proposed CGFTNet. FDRFEM and CGCAF are two core modules that we have proposed.
Figure 2. The architecture of the proposed CGCAF.
Figure 3. The architecture of the proposed channel attention (CA), which is a component of CGCAF.
Figure 4. The architecture of the proposed FDRFEM.
Figure 5. Visual comparisons of multiple methods for ×8 super-resolution on the CelebA test set.
Figure 6. Visual comparisons of multiple methods for ×8 super-resolution on the Helen test set.
Figure 7. Visual comparisons of multiple methods for ×8 super-resolution on real-world images.
Figure 8. Model complexity studies for ×8 SR on the CelebA test set. Our CGFTNet achieves a better balance between model size, model performance, and execution time.
Table 1. Verify the effectiveness of FDRFEM in Figure 4.

Methods | PSNR | SSIM | VIF | LPIPS
w/o FDRFEM | 28.062 | 0.8049 | 0.5364 | 0.1514
FDRFEM w/o Fourier | 28.065 | 0.8067 | 0.5363 | 0.1487
FDRFEM w/o Refconv | 28.120 | 0.8071 | 0.5369 | 0.1513
FDRFEM | 28.125 | 0.8059 | 0.5374 | 0.1481
Table 2. Verify the effectiveness of CGCAF in Figure 2.

CGA | Concat | Add | CA | PSNR | SSIM
× | ✓ | × | ✓ | 28.125 | 0.8059
× | × | ✓ | ✓ | 27.991 | 0.8039
✓ | ✓ | × | ✓ | 28.161 | 0.8075
✓ | × | ✓ | ✓ | 28.159 | 0.8074
✓ | ✓ | × | × | 28.153 | 0.8069
✓ | × | ✓ | × | 28.047 | 0.8047
× | ✓ | × | × | 28.121 | 0.8050
× | × | ✓ | × | 27.977 | 0.8028
Table 3. Verify the effectiveness of CGFTNet in Figure 1.

Baseline | Threelines | FDRFEM | CGCAF | PSNR | SSIM
✓ | × | × | × | 27.990 | 0.8036
✓ | ✓ | × | × | 28.062 | 0.8049
✓ | ✓ | ✓ | × | 28.125 | 0.8059
✓ | ✓ | × | ✓ | 28.034 | 0.8039
✓ | ✓ | ✓ | ✓ | 28.161 | 0.8075
Table 4. Quantitative comparisons for ×8 SR on the CelebA and Helen test sets.

Methods | PSNR (CelebA) | SSIM (CelebA) | VIF (CelebA) | LPIPS (CelebA) | PSNR (Helen) | SSIM (Helen) | VIF (Helen) | LPIPS (Helen)
Bicubic | 24.05 | 0.6449 | 0.4019 | 0.5697 | 23.79 | 0.6739 | 0.4353 | 0.5436
SwinIR | 27.88 | 0.7967 | 0.4590 | 0.2001 | 26.53 | 0.7856 | 0.4398 | 0.2644
SISN | 27.91 | 0.7971 | 0.4785 | 0.2005 | 26.64 | 0.7908 | 0.4623 | 0.2571
ELSFace | 28.02 | 0.8018 | 0.5232 | 0.1526 | 26.90 | 0.8013 | 0.4931 | 0.1874
SCGAN | 28.03 | 0.8013 | 0.5236 | 0.1511 | 27.04 | 0.8115 | 0.5056 | 0.1809
SCTANet | 28.03 | 0.8017 | 0.5240 | 0.1520 | 26.85 | 0.8074 | 0.5041 | 0.1911
SFMNet | 28.04 | 0.8024 | 0.5245 | 0.1496 | 27.09 | 0.8094 | 0.4989 | 0.1788
SPARNet | 27.73 | 0.7949 | 0.4505 | 0.1995 | 26.43 | 0.7839 | 0.4262 | 0.2674
CTCNet | 27.99 | 0.8036 | 0.5348 | 0.1559 | 26.96 | 0.8048 | 0.5245 | 0.1797
Ours | 28.16 | 0.8075 | 0.5369 | 0.1455 | 27.18 | 0.8129 | 0.5294 | 0.1748
Table 5. Comparison results for average similarity of face images super-resolved by different methods.

Methods | Case1 | Case2 | Case3 | Case4 | Case5 | Case6 | Case7 | Case8
SwinIR | 0.8942 | 0.8980 | 0.9031 | 0.8958 | 0.8958 | 0.9096 | 0.9019 | 0.8827
SISN | 0.8930 | 0.9125 | 0.8960 | 0.9085 | 0.8888 | 0.9041 | 0.9075 | 0.8731
ELSFace | 0.9111 | 0.9196 | 0.8963 | 0.9056 | 0.9084 | 0.9047 | 0.9139 | 0.8940
SCGAN | 0.8984 | 0.8836 | 0.9038 | 0.8979 | 0.8917 | 0.9043 | 0.8911 | 0.8964
SCTANet | 0.8855 | 0.8786 | 0.8699 | 0.8777 | 0.8760 | 0.8775 | 0.8840 | 0.8741
SFMNet | 0.9032 | 0.8996 | 0.9055 | 0.9037 | 0.9051 | 0.8956 | 0.8966 | 0.8887
SPARNet | 0.9076 | 0.9186 | 0.9013 | 0.9007 | 0.9054 | 0.9004 | 0.9080 | 0.9043
CTCNet | 0.9138 | 0.9191 | 0.9108 | 0.9026 | 0.9066 | 0.9018 | 0.9145 | 0.9069
Ours | 0.9278 | 0.9237 | 0.9109 | 0.9184 | 0.9242 | 0.9216 | 0.9202 | 0.9138