1. Introduction
Image Super-Resolution (ISR) is a core technology in computer vision, and Face Image Super-Resolution (FISR), an important branch of ISR, aims to restore high-quality high-resolution (HR) face images from low-resolution (LR) inputs. It plays a key role in scenarios such as medical image analysis, security surveillance, and digital identity authentication [1]. However, the challenges of FISR stem from its ill-posed inverse nature: a single low-resolution input corresponds to infinitely many candidates in the high-resolution solution space, and the loss of high-frequency details caused by image degradation makes it difficult to balance pixel accuracy and semantic authenticity in the reconstruction results [2]. Although deep learning has driven rapid development in this field, existing methods still face bottlenecks such as insufficient realism, low computational efficiency, and weak controllability of generation, urgently requiring breakthroughs that combine new generative paradigms with domain prior knowledge [3,4,5].
Convolutional Neural Networks (CNNs) have achieved significant success in image super-resolution due to their powerful feature extraction and non-linear mapping capabilities. Early CNN-based methods such as SRCNN [6] pioneered the application of deep learning to image super-resolution, and subsequent improvements in network architecture and training strategies produced more advanced models such as VDSR [7], DRCN [8], and EDSR [9]. To better restore facial texture details, Chen proposed the Spatial Attention Residual Network (SPARNet) [10], which incorporates a Facial Attention Unit (FAU) built on spatial attention; its convolutional layers adaptively focus on important facial structural regions while suppressing areas lacking distinctive features, thereby improving the reconstruction of facial details. Chen further developed SPARNetHD by combining SPARNet with a multi-scale discriminator, generating higher-quality 512 × 512 facial images with good generalization to low-quality inputs. G. Gao introduced the CNN-Transformer Collaborative Network (CTCNet), which improves performance by jointly considering global and local facial features [11]. CTCNet features a multi-scale connected encoder–decoder structure and a Local–Global Feature Collaboration Module (LGCM) containing a Facial Structure Attention Unit (FSAU) and Transformer modules, enhancing the consistency of local facial details and global facial structure recovery. However, traditional CNN-based super-resolution methods have limitations in facial image super-resolution tasks: they are mostly generic models that do not fully exploit the unique structure and prior knowledge of facial images, and they typically optimize pixel-level Mean Squared Error (MSE) or Peak Signal-to-Noise Ratio (PSNR). While these metrics may improve objectively, the resulting facial images can appear overly smooth and lack realistic high-frequency details.
Generative Adversarial Networks (GANs) are widely used in facial super-resolution tasks to enhance the perceptual quality and realism of reconstructed facial images. Through adversarial training between a generator and a discriminator, GANs enable the generator to learn realistic facial image distributions, producing more authentic super-resolution images. Early GAN-based methods like SRGAN demonstrated GANs’ effectiveness in super-resolution tasks [12], and subsequent improved GAN models, such as ESRGAN [13] and RankSRGAN [14], further enhanced the perceptual quality of super-resolved images. GANs have also been applied extensively to facial super-resolution. For instance, a GAN combined with attention mechanisms was proposed for multi-scale facial image super-resolution [15]. This method used a deep residual network and a deep neural network as the generator and discriminator, respectively, and integrated attention modules into the residual blocks of the deep residual network to reconstruct super-resolution facial images that are highly similar to HR images and hard for the discriminator to distinguish. To address the over-smoothing of MSE-oriented SR methods and the potential artifacts of GAN-oriented SR methods, Zhang and Ling introduced the Supervised Pixel-level GAN (SPGAN) [16]. Unlike traditional unsupervised discriminators, SPGAN’s supervised pixel-level discriminator focuses on whether each pixel of the generated SR facial image is as realistic as the corresponding pixel in the ground-truth HR facial image. To boost facial recognition performance, SPGAN uses facial identity priors by feeding the input facial image and its features extracted from a pre-trained facial recognition model into the discriminator; this identity-based discriminator attends more to the intricate texture details essential for accurate facial recognition. Despite GANs’ significant progress in facial super-resolution, challenges remain: GAN training can be unstable and prone to issues like mode collapse, and GAN-generated images may lack pixel-level precision, sometimes producing unwanted artifacts.
Transformers, originally revolutionary in NLP, have been introduced to computer vision, where their self-attention mechanisms effectively capture long-range dependencies in images. Recent research combines Transformers with CNNs, leveraging the former’s global modeling and the latter’s local feature extraction. For example, Shi proposed a two-branch Transformer-CNN structure for face super-resolution [17]. The Transformer branch extracts multi-scale features and explores local and non-local self-attention, while the CNN branch uses locally variant attention blocks to enhance network capability by capturing variations among adjacent pixels; the two branches are fused via modulation blocks to combine global and local strengths. To further integrate the advantages of CNNs and Transformers, Qi introduced ELSFace, an efficient latent style-guided Transformer-CNN framework for face super-resolution [18]. It includes feature preparation and carving stages: the preparation stage generates basic facial contours and textures guided by latent styles for better facial detail representation, while in the carving stage CNN and Transformer streams recursively restore facial textures and contours in parallel. Considering that high-frequency features may be neglected during long-range dependency learning, a high-frequency enhancement block (HFEB) and a Sharp Loss were designed in the Transformer stream for improved perceptual quality. However, these hybrid models often face challenges: their global computations may overlook high-frequency details, the increased computational complexity limits efficiency in high-resolution image generation, and aligning CNN and Transformer features is difficult, requiring complex fusion strategies.
Some studies have tried to incorporate facial priors into deep learning models. For example, Grm proposed the Cascaded Super-Resolution and Identity Prior model for face hallucination (C-SRIP) [19], which consists of a cascaded super-resolution network and a face recognition model: the super-resolution network progressively upscales low-resolution face images, while the recognition model serves as an identity prior that guides the reconstruction of high-resolution images. Ma proposed a deep face super-resolution model that uses iterative collaboration between attentive restoration and landmark estimation [20]. It employs two recurrent networks for face image recovery and landmark estimation; in each iteration, the restoration branch leverages prior landmark knowledge to generate higher-quality images, which in turn aids more accurate landmark estimation, guiding better image recovery and progressively improving both tasks.
As novel generative models, diffusion models have achieved remarkable success in image generation. Compared to GANs, they offer more stable training, higher-quality samples, and greater diversity. Conditional diffusion models introduce conditioning information, such as low-resolution images, into the diffusion and reverse diffusion processes to guide image generation; in super-resolution tasks, these models typically take low-resolution images as conditional inputs to learn the low-to-high-resolution mapping. Saharia introduced SR3 [21], which achieves super-resolution through repeated refinement. SR3’s success highlights the potential of diffusion models in super-resolution, especially at high magnifications: their iterative nature allows gradual detail refinement, aiding the recovery of high-frequency information. Implicit diffusion models, which integrate implicit neural representations with diffusion models for continuous super-resolution, operate in a continuous space rather than the pixel or discrete latent space of traditional models. S. Gao proposed the Implicit Diffusion Model (IDM) for high-fidelity continuous image super-resolution [22]. IDM combines implicit neural representations and denoising diffusion models in an end-to-end framework, using implicit neural representations during decoding to learn continuous-resolution representations; this enables the generation of super-resolution images at any resolution without retraining, which is crucial for practical applications where different resolutions may be needed. Traditional diffusion models require hundreds of sampling steps, resulting in low generation efficiency. To speed up inference, researchers have developed various accelerated sampling techniques, such as DDIM and PNDM, which reduce the number of sampling steps and thus accelerate the inference process. For instance, Yue proposed ResShift, an efficient diffusion model for image super-resolution that significantly reduces the number of diffusion steps through residual shifting [23]. In addition to accelerated sampling, some research focuses on optimizing the network architecture and training of diffusion models to boost their efficiency and performance. For example, Xiao introduced EDiffSR, an efficient diffusion probabilistic model for remote sensing image super-resolution [24]; its EANet uses simplified channel attention and basic gating operations to achieve good noise prediction at a low computational footprint. These acceleration and optimization methods are crucial for the practical application of diffusion models. However, they fail to effectively integrate facial attribute priors, resulting in uncontrolled generation outcomes (such as the inability to adjust age or expression directionally).
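To make the conditioning and accelerated-sampling ideas above concrete, the following is a minimal, generic sketch of a DDIM-style conditional sampling loop in PyTorch, not the exact procedure of SR3, IDM, or ResShift; the interface eps_model, which takes the channel-wise concatenation of the current noisy image and the bicubically upsampled LR condition together with the timestep, is an assumption made for illustration.

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, lr_up, alphas_cumprod, num_steps=50):
    """Deterministic DDIM-style sampling conditioned on an upsampled LR image.

    eps_model(x_in, t) predicts the noise, where x_in is the channel-wise
    concatenation of the current noisy image and lr_up.
    """
    device = lr_up.device
    T = alphas_cumprod.shape[0]
    step_ids = torch.linspace(T - 1, 0, num_steps, device=device).long()
    x = torch.randn_like(lr_up)  # start from pure Gaussian noise
    for i, t in enumerate(step_ids):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[step_ids[i + 1]] if i + 1 < num_steps \
            else torch.tensor(1.0, device=device)
        t_batch = torch.full((x.shape[0],), int(t), device=device, dtype=torch.long)
        eps = eps_model(torch.cat([x, lr_up], dim=1), t_batch)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # predicted clean image
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps  # eta = 0 (deterministic) update
    return x0_pred
```

Shrinking num_steps trades a small amount of fidelity for a large reduction in inference time, which is the lever the accelerated samplers above exploit.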
Recent studies have demonstrated the effectiveness of attention mechanisms and semantic priors in enhancing visual perception tasks, particularly in time-sensitive applications. For instance, complementary techniques leveraging spatiotemporal attention and adaptive feature fusion have significantly improved real-time processing performance while maintaining semantic consistency [25]. These insights further motivate our design, which integrates facial attribute priors with lightweight attention modules to achieve both high efficiency and controllability.
Although existing technologies have advanced FISR along various dimensions, its core contradictions remain concentrated in the following aspects:
Trade-offs between Detail and Efficiency: Although the complex architectures of CNNs and Transformers can enhance reconstruction quality, their computational costs are excessively high; diffusion models produce excellent generation quality, yet their inference speed struggles to meet real-time requirements.
Insufficient utilization of prior knowledge: Most models do not fully integrate the structural attributes of the face (such as the distribution of facial features and expressions) with semantic information (such as age and gender), resulting in generated images lacking semantic consistency.
Limited controllability: Existing methods focus mainly on quality improvement and offer no direct control over the generated results, making it difficult to meet personalized needs in practical scenarios (such as adjusting lighting or repairing specific regions).
In response to the aforementioned challenges, this paper proposes a super-resolution framework (FACDIM) that integrates conditional diffusion implicit models with prior knowledge of facial attributes. The core innovations are as follows:
Hierarchical Feature Enhancement Architecture: A pre-super-resolution module (PSRM) performs preliminary feature enhancement on extremely low-resolution inputs (e.g., 16 × 16) through a lightweight residual network, alleviating the detail-recovery burden on subsequent modules.
Dual-stream feature extraction: A lightweight facial attribute extraction module (FAEM) and a global facial feature vector encoder (FGFVE) are designed to capture local attributes (such as eye shape and the curvature of the mouth corners) and the overall facial contour, respectively. Their outputs are fused through adaptive group normalization (AdaGN), injecting strong semantic guidance into the diffusion process. Unlike SR3, which extracts mixed features directly with a CNN and thereby couples local attributes with the global structure, the dual-stream design of FACDIM (FAEM + FGFVE) decouples local attributes from global features for the first time and, through the dynamic fusion of AdaGN, avoids information interference while improving data adaptability, flexibility, and scalability. It also enables targeted experiments on self-annotated custom attributes (such as those not provided by CelebA). A minimal sketch of this AdaGN conditioning is given after this list.
Efficient Conditional Diffusion Implicit Model: The Markov chain of traditional diffusion models is replaced with implicit neural representations; by modeling in continuous space, the number of sampling steps is reduced to 1/5 of the original, significantly enhancing generation efficiency. Property-Aware Residual Blocks (FPARB) are incorporated into the U-Net and combined with self-attention mechanisms to optimize the interaction between attribute vectors and noise features, ensuring that the generated images retain high-frequency details (such as skin texture) while strictly adhering to the prior constraints.
User-controllable generation mechanism: Generated results can be manipulated in a targeted way by manually adjusting attribute vectors (such as increasing the “smile” attribute value or decreasing the “age” attribute value), providing flexible tools for scenarios such as security repair and digital entertainment.
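As referenced in the dual-stream item above, the following is a minimal sketch of AdaGN-style conditioning, assuming the FAEM attribute vector and the FGFVE global feature vector are concatenated into a single condition vector; module names, dimensions, and the (1 + γ) scaling convention are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """Adaptive group normalization: a condition vector predicts per-channel
    scale and shift applied after GroupNorm (illustrative sketch)."""

    def __init__(self, num_channels: int, cond_dim: int, num_groups: int = 32):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups, num_channels, affine=False)
        # Shared MLP mapping the condition to (gamma, beta).
        self.to_scale_shift = nn.Sequential(
            nn.SiLU(),
            nn.Linear(cond_dim, num_channels * 2),
        )

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # cond: concatenation of the attribute vector (FAEM) and the
        # global face feature vector (FGFVE); shape (B, cond_dim).
        gamma, beta = self.to_scale_shift(cond).chunk(2, dim=1)
        gamma = gamma[:, :, None, None]
        beta = beta[:, :, None, None]
        return self.norm(x) * (1 + gamma) + beta


# Usage: fuse an 8-dim attribute vector and a 512-dim global feature vector
# (dimensions are illustrative) into a 64-channel feature map.
fused_cond = torch.cat([torch.rand(4, 8), torch.randn(4, 512)], dim=1)
feat = torch.randn(4, 64, 32, 32)
out = AdaGN(64, cond_dim=8 + 512)(feat, fused_cond)
```

Because the condition only scales and shifts normalized feature maps, this kind of fusion leaves spatial relationships intact, which is one reason it pairs well with iterative denoising.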
Figure 1 shows the overall architecture diagram of this model.
The contribution of this paper lies not only in technical innovation but also in establishing a new paradigm in which high fidelity and efficient controllability coexist: by deeply integrating attribute priors with diffusion models, it resolves the tension between the lack of detail and the rigidity of generation in traditional methods, breaks through the efficiency bottleneck of diffusion models, and provides reliable technical support for applications such as medical image restoration and virtual digital humans. The differences between this paper’s model and other existing methods, as well as the details and novelty of this model, are presented in
Table 1.
3. Discussion and Results
3.1. Experimental Configuration
3.1.1. Experimental Dataset
This study selected three datasets for systematic experimental validation: the FFHQ (Flickr-Faces-High-Quality) dataset [34], the CelebA (CelebFaces Attributes) dataset [35], and the MAFA (MAsked FAces) dataset [36]. The experimental design is divided into two parts, conventional scenario experiments and challenging scenario experiments, each employing adapted datasets tailored to distinct experimental objectives.
In the conventional scenario experiments, two widely recognized high-quality face datasets, FFHQ and CelebA, were utilized to validate the model’s fine-grained semantic controllability and detailed feature learning capability. The FFHQ dataset is a high-quality dataset commonly used in computer vision and machine learning for tasks such as face recognition, image generation, and image synthesis. The CelebA dataset contains diverse celebrity faces with varying attributes, expressions, poses, and backgrounds, making it suitable for training models that must generalize across diverse facial features.
In the challenging scenario experiments, additional experiments were conducted on the MAFA dataset to verify the robustness of our model under demanding real-world conditions. Compared to CelebA and FFHQ, the MAFA dataset comprises 33,811 real-world facial images with significant occlusion features (e.g., masks, sunglasses, hand occlusions) and extreme illumination variations. Its data characteristics better simulate partial occlusions and lighting changes commonly encountered in practical facial image scenarios.
For the conventional scenario experiments, the CelebA dataset was employed as the training set, while the FFHQ dataset served as the validation set; 160,000 images from CelebA were selected for training. In the challenging scenario experiments, 25,000 occluded images from the MAFA dataset were used for training, with the remaining 5811 images reserved for testing. The low-resolution (LR) images were obtained by center-cropping the original images to 178 × 178 pixels and downsampling to 16 × 16 pixels, while the high-resolution (HR) images were cropped from the corresponding regions and downsampled to 128 × 128 pixels. Online random rotation (90°, 180°, and 270°) and online random horizontal flipping were used for data augmentation. The low-resolution images were degraded by bicubic interpolation and then concatenated with the noisy high-resolution images before being fed into our diffusion model. The attribute annotations in the dataset were binarized (1/0); during conditioning, attribute vector values in [0, 1] represent increasing strength of the corresponding facial attribute. The eight facial attributes used in this paper are Bags_Under_Eyes, Young, Smiling, Chubby, Big_Lips, Blurry, Attractive, and Male, chosen for their semantic relevance to facial structure and perceptual quality in super-resolution tasks. Attributes like Smiling, Big_Lips, and Bags_Under_Eyes directly influence facial geometry and texture, which are critical for recovering high-frequency details, while Blurry, Attractive, Young, and Male act as global semantic constraints, guiding the model to prioritize clarity and aesthetic consistency. These attributes are widely annotated in the CelebA dataset and align with prior studies on face super-resolution (e.g., [16,19,35]), ensuring reproducibility and fair comparison with existing methods. Additionally, during preliminary experiments, we observed that these attributes provided sufficient fine-grained control over key facial regions while maintaining computational efficiency. Owing to constraints on computational resources and time, this study selects key attributes to verify the method’s effectiveness and avoid the added model complexity of too many attributes; expanding the attribute set further would require balancing model complexity and training stability, which we prioritized in this initial framework.
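The LR/HR pair construction and augmentation described above can be sketched with torchvision as follows; the helper name make_lr_hr_pair and the exact ordering of cropping, augmentation, and resizing are assumptions for illustration.

```python
import random
from PIL import Image
from torchvision import transforms
from torchvision.transforms import functional as TF

def make_lr_hr_pair(img: Image.Image):
    """Build one (LR, HR) training pair as described above (sketch)."""
    img = TF.center_crop(img, [178, 178])
    # Shared random augmentation: rotation in {0, 90, 180, 270} degrees and horizontal flip.
    angle = random.choice([0, 90, 180, 270])
    img = TF.rotate(img, angle)
    if random.random() < 0.5:
        img = TF.hflip(img)
    hr = TF.resize(img, [128, 128], interpolation=transforms.InterpolationMode.BICUBIC)
    lr = TF.resize(img, [16, 16], interpolation=transforms.InterpolationMode.BICUBIC)
    # Bicubic degradation: upsample the LR image back to HR size so it can be
    # concatenated with the noisy HR image inside the diffusion model.
    lr_up = TF.resize(lr, [128, 128], interpolation=transforms.InterpolationMode.BICUBIC)
    return TF.to_tensor(lr_up), TF.to_tensor(hr)
```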
3.1.2. Experimental Parameters
Diffusion Process: A linear noise schedule with 1000 steps is employed, and the model is trained with the AdamW optimizer, a variant of Adam whose primary distinction is the handling of weight decay: Adam combines the benefits of momentum and adaptive learning rates, whereas AdamW decouples weight decay to better prevent overfitting. The first-moment decay rate $\beta_1$ and second-moment decay rate $\beta_2$ are used with an initial learning rate of 0.0002. The learning rate follows a cosine annealing schedule, decreasing gradually over training according to the shape of a cosine function; this strategy keeps a larger learning rate early to accelerate convergence and reduces it later for fine-tuning, helping the model escape local optima and find better solutions. The experimental setup includes the Windows 11 operating system, Python 3.7.10, and PyTorch 1.11.0 with CUDA 11.7; experiments are conducted on a single NVIDIA A100 GPU with a batch size of 64, mixed-precision (FP16) training, 500 training epochs, and a gradient clipping threshold of 1.0.
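A minimal sketch of the stated training configuration (AdamW at an initial learning rate of 0.0002 with cosine annealing, FP16 mixed precision, and gradient clipping at 1.0) is given below; model.training_loss is a hypothetical placeholder for the diffusion objective defined in the next subsection.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

def train(model, dataloader, epochs: int = 500, device: str = "cuda"):
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    scaler = GradScaler()  # FP16 mixed-precision training
    model.to(device).train()
    for epoch in range(epochs):
        for batch in dataloader:
            optimizer.zero_grad(set_to_none=True)
            with autocast():
                loss = model.training_loss(batch)  # placeholder for the total loss (Section 3.1.3)
            scaler.scale(loss).backward()
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
            scaler.step(optimizer)
            scaler.update()
        scheduler.step()  # cosine annealing of the learning rate per epoch
```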
3.1.3. Loss Function
Considering that facial structures contain rich texture information while also requiring continuity between local areas, this experiment introduces an edge smoothing loss on top of the original reverse-diffusion objective. By constraining the gradient changes between pixels and leveraging the inherent smoothness of facial images, the edge smoothing loss effectively reduces noise and artifacts that may arise during generation, enhancing the structural consistency and the authenticity of texture details in the generated images, thus making the results clearer and more realistic. The loss function combines the mean squared error (MSE) of noise prediction with an edge smoothing loss (Edge Smoothness Loss). The MSE loss is
$$\mathcal{L}_{\mathrm{MSE}} = \mathbb{E}_{x_0,\, \epsilon,\, t}\big[\, \| \epsilon - \epsilon_\theta(x_t, t, c) \|_2^2 \,\big],$$
where $\epsilon$ is the injected Gaussian noise, $\epsilon_\theta$ is the noise predicted by the network, and $c$ denotes the conditional inputs.
The edge smoothing loss constrains the gradient consistency between the generated image $\hat{x}_0$ and the real high-resolution image $x_{\mathrm{HR}}$ in the edge regions. The gradient operator $\nabla$ can be implemented using the Sobel operator:
$$G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}, \qquad G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}.$$
In the actual computation, the gradient is calculated separately for each of the RGB channels and then averaged. During training of the diffusion model, the current estimate of the generated image $\hat{x}_0$ is recovered from the noisy image $x_t$ to compute the gradient loss:
$$\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(x_t, t, c)}{\sqrt{\bar{\alpha}_t}}.$$
The boundary (edge) smoothing loss can then be expressed as
$$\mathcal{L}_{\mathrm{edge}} = \big\| \nabla \hat{x}_0 - \nabla x_{\mathrm{HR}} \big\|_1.$$
Combining the noise-prediction MSE loss and the boundary smoothing loss, the total loss is
$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{MSE}} + \lambda\, \mathcal{L}_{\mathrm{edge}},$$
where $\lambda$ is the smoothing loss weight, with an initial value set to 0.3.
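A PyTorch sketch of the combined objective is shown below, assuming an L1 penalty on the Sobel gradient difference and channel-averaged gradients as described above; the function names and the (eps_pred, x0_pred) interface are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Sobel kernels for horizontal and vertical gradients.
_SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
_SOBEL_Y = _SOBEL_X.transpose(2, 3)

def sobel_gradients(img: torch.Tensor) -> torch.Tensor:
    """Per-channel Sobel gradient magnitude, averaged over the RGB channels."""
    b, c, h, w = img.shape
    flat = img.reshape(b * c, 1, h, w)
    gx = F.conv2d(flat, _SOBEL_X.to(img), padding=1)
    gy = F.conv2d(flat, _SOBEL_Y.to(img), padding=1)
    grad = torch.sqrt(gx ** 2 + gy ** 2 + 1e-8).reshape(b, c, h, w)
    return grad.mean(dim=1, keepdim=True)  # average over channels

def total_loss(eps_pred, eps_true, x0_pred, x0_hr, lam: float = 0.3):
    """L_total = L_MSE (noise prediction) + lam * L_edge (gradient consistency)."""
    l_mse = F.mse_loss(eps_pred, eps_true)
    l_edge = F.l1_loss(sobel_gradients(x0_pred), sobel_gradients(x0_hr))
    return l_mse + lam * l_edge
```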
3.1.4. Evaluation Indicators
Peak Signal-to-Noise Ratio (PSNR) is an objective standard for measuring image quality. It evaluates image quality by calculating the mean squared error between the original image and the processed image, expressed in decibels (dB). A higher PSNR value indicates better quality of the reconstructed image and a higher degree of similarity to the original image. The calculation formula is as follows:
$$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{\mathrm{MAX}^2}{\mathrm{MSE}}\right),$$
where $\mathrm{MAX}$ represents the maximum pixel value in the image, and $\mathrm{MSE}$ denotes the mean squared difference between corresponding pixels of the two images. For multi-channel images, $\mathrm{MSE}$ is calculated as
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(A_i - B_i\right)^2,$$
where $A_i$ represents the $i$-th pixel value of the real image, $B_i$ denotes the $i$-th pixel value of the generated image, and $N$ indicates the total number of pixels in the image.
The Structural Similarity Index (SSIM) is an image quality assessment metric based on structural information. It takes into account not only the luminance and contrast of the image but also the structural similarity between images. The SSIM value ranges within [−1, 1], with values closer to one indicating a higher similarity between the generated image and the reference image. The calculation formula for SSIM is as follows:
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},$$
where $\mu_x$ and $\mu_y$ represent the mean values of images $x$ and $y$, respectively; $\sigma_x$ and $\sigma_y$ are the standard deviations of images $x$ and $y$; $\sigma_{xy}$ is the covariance between images $x$ and $y$; and $C_1$, $C_2$ are constants used to stabilize the formula.
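For reference, PSNR as defined above can be computed directly, and SSIM via scikit-image; the use of scikit-image and the uint8/255 data range are assumptions for illustration.

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(hr: np.ndarray, sr: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR = 10 * log10(MAX^2 / MSE), computed over all pixels and channels."""
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

def ssim(hr: np.ndarray, sr: np.ndarray) -> float:
    """SSIM via scikit-image; channel_axis handles RGB (H, W, 3) inputs."""
    return float(structural_similarity(hr, sr, channel_axis=-1, data_range=255))
```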
3.2. Experimental Results Comparative Analysis
We conducted super-resolution experiments from 16 × 16 to 128 × 128 on the FFHQ validation set, with the experimental results shown in Figure 11.
In this paper’s experiments, we compare our method with other advanced face super-resolution methods (SR3, FSRGAN [37], SRFLOW [38], SRCNN, SRGAN [12], PULSE [39]). Analysis of
Figure 12 shows that our conditional diffusion implicit model, which integrates facial attributes as prior knowledge, better restores facial details and textures, offering superior visual quality and realism in super-resolution reconstruction. SR3, as a pure diffusion model, lacks explicit semantic guidance, leading to blurry facial features (e.g., eyes, lips) and reliance on random noise for high-frequency detail recovery, which can cause jagged edges or artifacts. It is also computationally expensive, requiring 2000 iterative denoising steps, which slows down inference. FSRGAN, based on GANs, struggles with limited image diversity, often producing similar hairstyles or skin tones. This is due to GANs’ tendency to fall into local optima and their sensitivity to hyperparameters, causing unstable training. SRFLOW, based on normalizing flows, has slow generation due to the need for complete forward and backward propagation. Flow models also tend to over-smooth textures, resulting in detail loss.
In terms of objective evaluation metrics, the experimental results show that our proposed conditional diffusion implicit model, which integrates facial attributes as prior knowledge, achieves the best PSNR and SSIM scores among the compared advanced models. Specifically, our model’s PSNR is 2.16 dB higher than SR3’s, 3.36 dB higher than SRFLOW’s, and 3.67 dB higher than FSRGAN’s, while its SSIM is 0.08 higher than SR3’s, 0.12 higher than SRFLOW’s, and 0.14 higher than FSRGAN’s. The detailed data are shown in
Table 2.
3.3. Robustness Analysis on MAFA Dataset
We retained the same model architecture and hyperparameters as in the previous experiments (Section 3.1.2) but fine-tuned the pre-trained FACDIM model on MAFA for 50 epochs to adapt it to occlusion patterns. To ensure fairness, all baseline methods (SR3, FSRGAN, SRFLOW) were also fine-tuned under identical settings. We evaluated both objective metrics (PSNR, SSIM) and perceptual quality, the latter measured by FID scores between generated images and ground-truth HR images. Additionally, we introduced the Occlusion Robustness Score (ORS), defined as the average SSIM difference between occluded and non-occluded regions, to quantify the model’s ability to recover masked areas.
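A possible per-image computation of the ORS is sketched below, assuming a binary occlusion mask and taking the SSIM over non-occluded pixels minus the SSIM over occluded pixels; the sign convention and the use of a per-pixel SSIM map are assumptions, since the exact region handling is not specified here.

```python
import numpy as np
from skimage.metrics import structural_similarity

def occlusion_robustness_score(hr, sr, occ_mask):
    """ORS for one image: mean per-pixel SSIM over non-occluded pixels minus
    mean per-pixel SSIM over occluded pixels (sketch; region handling assumed).

    hr, sr: uint8 RGB arrays of shape (H, W, 3); occ_mask: boolean (H, W) array
    marking occluded pixels.
    """
    _, ssim_map = structural_similarity(
        hr, sr, channel_axis=-1, data_range=255, full=True)
    ssim_pix = ssim_map.mean(axis=-1)  # per-pixel SSIM, averaged over channels
    return float(ssim_pix[~occ_mask].mean() - ssim_pix[occ_mask].mean())
```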
Figure 13 shows the experimental results of each model on the MAFA dataset and the enlarged comparison figure.
Table 3 compares the performance of FACDIM and baseline methods on MAFA. Our model achieves a PSNR of 24.31 dB and SSIM of 0.69, outperforming SR3 by 1.52 dB and 0.07, respectively.
Figure 13 visualizes the super-resolution results for a heavily occluded face, where SR3 generates blurred features and FSRGAN introduces artifacts around occluded regions.
3.4. Ablation Experiment
To validate the effectiveness of the proposed AdaGN and self-attention fusion strategy, we conducted an ablation study comparing it with three alternative fusion approaches (minimal sketches of these variants follow the list):
Concatenation + MLP: Concatenate the low-resolution features and attribute vectors, followed by an MLP for dimensionality reduction.
Cross-Attention: Use a cross-attention layer where attribute vectors serve as keys/values and image features as queries.
Channel Attention: Apply SENet-style channel attention to weight feature channels based on attribute vectors.
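Minimal sketches of the three alternative fusion variants are given below (the AdaGN variant is sketched in the Introduction); layer sizes, head counts, and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConcatMLPFusion(nn.Module):
    """Variant 1: broadcast the attribute vector spatially, concatenate with the
    feature map, then reduce dimensionality with a 1x1-conv MLP."""
    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels + cond_dim, channels, 1), nn.SiLU(),
            nn.Conv2d(channels, channels, 1))
    def forward(self, x, cond):
        cond_map = cond[:, :, None, None].expand(-1, -1, x.shape[2], x.shape[3])
        return self.mlp(torch.cat([x, cond_map], dim=1))

class CrossAttentionFusion(nn.Module):
    """Variant 2: image features as queries, attribute vector as key/value."""
    def __init__(self, channels: int, cond_dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)
    def forward(self, x, cond):
        b, c, h, w = x.shape
        q = x.flatten(2).transpose(1, 2)   # (B, HW, C)
        kv = cond.unsqueeze(1)             # (B, 1, cond_dim)
        out, _ = self.attn(q, kv, kv)
        return x + out.transpose(1, 2).reshape(b, c, h, w)

class ChannelAttentionFusion(nn.Module):
    """Variant 3: SENet-style channel weights predicted from the attribute vector."""
    def __init__(self, channels: int, cond_dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(cond_dim, channels), nn.Sigmoid())
    def forward(self, x, cond):
        return x * self.gate(cond)[:, :, None, None]
```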
The experiments were performed on the FFHQ validation set under 8× super-resolution settings. All variants shared identical backbone architectures (PSRM, FAEM, FGFVE) and training configurations. The detailed data are shown in
Table 4.
The experimental results show the following: AdaGN achieves the highest fidelity (PSNR/SSIM), demonstrating its superiority in preserving facial details and semantic consistency. AdaGN is 43% faster than cross-attention and requires fewer parameters due to lightweight affine transformations. Concatenation + MLP struggles with feature misalignment, while cross-attention introduces computational overhead from query-key interactions. AdaGN dynamically scales and shifts feature maps using condition vectors, enabling efficient and implicit fusion without disrupting spatial relationships. This aligns with diffusion models’ need for stable gradient flow during iterative denoising.
To verify the effectiveness of our module design in the task of facial image super-resolution, we set up four control experiments:
Without using the facial attribute vector extraction module;
Without employing the global facial feature vector encoder;
Removing the implicit diffusion model setup;
Removing AdaGN fusion of facial attribute vectors as prior knowledge.
The complete model is also tested. Results, evaluated by PSNR and SSIM, are shown in
Table 5.
In addition, to verify the control of facial attribute vectors on facial image super-resolution, this paper conducted an attribute-control sensitivity analysis, testing and analyzing eight attributes from the CelebA dataset. In the experiment, the output values of the facial attribute vector extraction module were modified while keeping other attribute vector values unchanged. The differences in super-resolution results were observed by changing single attribute vector values. The experimental results are shown in
Figure 14.
Experimental analysis shows that setting each of the eight attribute vectors to its maximum value of one produces distinct feature differences while maintaining visual semantic consistency in the super-resolved images. To verify the model’s ability to precisely control local features via facial attribute vectors, experiments with varying vector magnitudes were conducted. The results, shown in
Figure 15, demonstrate that the model can adjust the expression of specific attributes by altering corresponding vector values. This confirms that the model achieves fine-grained semantic editing through attribute-conditioned manipulation, ensuring reliable semantic controllability in face super-resolution tasks.
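The attribute-control experiments above amount to overriding entries of the attribute vector before conditioning the sampler; a minimal sketch is shown below, where the attribute ordering, the clamping to [0, 1], and the sampler call in the comment are illustrative assumptions.

```python
import torch

# Order of the eight CelebA attributes used in this paper (assumed indexing).
ATTRS = ["Bags_Under_Eyes", "Young", "Smiling", "Chubby",
         "Big_Lips", "Blurry", "Attractive", "Male"]

def edit_attributes(attr_vec: torch.Tensor, **changes: float) -> torch.Tensor:
    """Return a copy of the attribute vector with selected attributes overridden.

    Values are clamped to [0, 1], where larger values strengthen the attribute.
    Example: edit_attributes(v, Smiling=1.0, Young=0.2)
    """
    edited = attr_vec.clone()
    for name, value in changes.items():
        edited[..., ATTRS.index(name)] = value
    return edited.clamp(0.0, 1.0)

# The edited vector then replaces the predicted one when conditioning generation,
# e.g. sr = sampler(lr_up, cond=edit_attributes(attr_vec, Smiling=1.0))  # hypothetical call
```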
To fully evaluate the model’s strengths in terms of inference speed, resource usage, and complexity, this experiment compares the following key metrics: inference time, VRAM usage during training and inference, and parameter count.
Inference time is the duration taken to generate a 128 × 128 image on an NVIDIA A100 with a batch size of 1. It is the average of 100 trials, excluding I/O time. Training VRAM is the peak usage during training, while inference VRAM is the peak usage for a single-image inference. The parameter count is the total number of trainable parameters in the model. Experimental results are presented in
Table 6.
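The inference-time protocol described above (batch size 1, average of 100 trials, I/O excluded) can be reproduced with a simple GPU timing loop such as the sketch below; the warm-up count and the direct model(lr_up) call signature are assumptions.

```python
import time
import torch

@torch.no_grad()
def benchmark(model, lr_up, trials: int = 100) -> float:
    """Average per-image inference time (seconds) on GPU for batch size 1,
    excluding I/O, following the protocol described above."""
    model.eval()
    for _ in range(10):              # warm-up runs
        model(lr_up)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(trials):
        model(lr_up)
    torch.cuda.synchronize()         # wait for all CUDA kernels to finish
    return (time.perf_counter() - start) / trials

# Usage (illustrative): benchmark(facdim, torch.randn(1, 3, 128, 128, device="cuda"))
```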
Using implicit diffusion, our model reduces the number of sampling steps and thus shortens inference time compared to SR3. The single-step inference cost of diffusion models is dominated by the U-Net; the improved U-Net employs attribute-aware residual blocks (FPARB) and self-attention mechanisms, keeping the per-step complexity dependent mainly on the time-step embedding dimension and the network depth. Compared to traditional diffusion models such as SR3, FACDIM halves the number of sampling steps through implicit representation (from 2000 steps to 1000 steps in our experiments), reducing the overall time complexity thanks to feature sharing in the conditional branches. Diffusion models are nevertheless slower than FSRGAN, whose GAN-based single-pass generation yields fast inference (0.7 s) but insufficient generation quality and controllability; our experimental results show much higher fidelity. Thanks to the lightweight feature vector encoder and the shared conditional mapping MLP, our model’s training VRAM usage is lower than SRFLOW’s, and it needs only 50% of SRFLOW’s inference VRAM because intermediate variables do not have to be stored for backpropagation. In terms of parameter count, our model has 13% fewer parameters than SR3 owing to the shared AdaGN parameters in conditional fusion. SRFLOW requires complete forward and backward propagation with high complexity, resulting in an inference latency of 10.3 s and a parameter count (207M) significantly higher than FACDIM’s; compared to SRFLOW, our model reduces parameters by 38%, since flow models contain many dense invertible units. In a security surveillance scenario (640 × 480 input, 1080p target output), the single-frame inference time of FACDIM is around 2.3 s on an A100 GPU; with parallel optimization, it can achieve an end-to-end latency of about 4.6 s, meeting some near-real-time requirements (such as a response within 5 s). When deployed on edge devices (NVIDIA Jetson AGX Xavier), the inference time increases to 8.2 s, so further lightweighting is still needed. On the A100, the current FACDIM reaches 0.43 FPS, which is better than SR3 (0.18 FPS) and SRFLOW (0.10 FPS).