5.1. Baseline Models Results
The baseline results show that SwinIR achieved the highest PSNR and SSIM (0.5976), indicating the best reconstruction quality, while the LDM model achieved the lowest FID score (183.72), indicating the best perceptual quality. In contrast, the ESRGAN model recorded the lowest PSNR and SSIM scores and the highest FID score, although it was the fastest of the three models. These results highlight a trade-off between quality and efficiency across the three models shown in
Table 2.
The quantitative results for the ESRGAN, SwinIR, and LDM models on the PSNR, SSIM, and FID metrics show that SwinIR achieved the best values in both PSNR (24.82 dB) and SSIM (0.5976), which indicates more accurate reconstruction and better preservation of structural characteristics. In contrast, the LDM model recorded the lowest FID score (183.72), indicating higher perceptual realism and greater similarity to the real image distribution. The ESRGAN model ranked last, with the lowest PSNR and SSIM values and the highest FID (237.81), indicating relatively weaker performance in preserving structure.
In general, these results indicate that transformer-based models favor reconstruction accuracy, while diffusion-based models excel at perceptual quality. This comparative assessment of the three models is presented in
Figure 12.
The performance trends of the compared models in terms of PSNR, SSIM, and FID show that SwinIR is clearly superior on the reconstruction-based measures (PSNR and SSIM), while the LDM model achieves the best perceptual performance, with the lowest FID. This visual representation also supports the experimentally observed trade-off between structural accuracy and perceptual realism.
These results illustrate how the different architectures affect satellite image super-resolution, showing how each approach influences reconstruction quality and visual perception.
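For readers less familiar with the reconstruction metric reported above, the following is a minimal sketch of how PSNR is conventionally computed from the mean squared error. It assumes 8-bit images flattened to pixel sequences; the function name `psnr` and the sample pixel values are illustrative and are not taken from the evaluation code used in this study.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 4-pixel example: every pixel is off by one grey level (MSE = 1)
reference = [10, 20, 30, 40]
restored = [11, 21, 31, 41]
score = psnr(reference, restored)
print(f"PSNR = {score:.2f} dB")
```

Because PSNR depends only on pixel-wise error, a model can score well on it while still looking less realistic, which is consistent with the PSNR/FID trade-off reported here.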
5.2. Sequential Pipeline Results
To explore the potential of leveraging the complementary strengths of the models, several two- and three-stage hybrid pipelines were designed, connecting the models sequentially in multiple arrangements.
Table 3 reports the quantitative results for the two-stage and three-stage hybrid pipelines. Integrating models (e.g., SwinIR ← ESRGAN, LDM ← ESRGAN, and SwinIR ← LDM) led to minor quantitative improvements on some measures, so sequencing models yields only slight improvements over individual models. As illustrated in
Table 3, the pipeline combining SwinIR and LDM provides a small gain in FID (185.07) over SwinIR alone, while retaining a relatively high PSNR (24.19 dB). Likewise, LDM → ESRGAN yields one of the lowest FID values (183.87) but with no significant improvement in PSNR or SSIM. Three-stage pipelines fail to show any consistent advantage over two-stage configurations. In most cases, PSNR and SSIM values decrease marginally compared with the best standalone model (SwinIR), and FID gains are small. These results indicate that additive performance gains from sequentially stacking models are not guaranteed, presumably due to error propagation and learned redundancy.
Figure 13 shows that model order affects performance but does not provide any significant improvement.
As shown in
Figure 13, while some two-stage hybrid pipelines marginally improve perceptual quality (FID), reconstruction quality, in terms of PSNR and SSIM, does not improve significantly compared with the best-performing individual model.
Similarly, as shown in
Figure 14, moving to three-stage architectures does not result in proportional improvements; rather, there is an instance where one metric decreases slightly.
Overall, the hybrid pipelines showed only small quantitative improvements while increasing computational complexity.
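The sequential arrangement described in this section can be sketched as plain function composition. The example below is illustrative only: the stage functions are placeholders that tag their input rather than real super-resolution models, and it assumes that each model appears at most once per pipeline, which the text does not state explicitly.

```python
from itertools import permutations
import time

def run_pipeline(stages, image):
    """Feed the output of each stage into the next; return result and elapsed seconds."""
    start = time.perf_counter()
    for stage in stages:
        image = stage(image)
    return image, time.perf_counter() - start

def make_stage(name):
    """Placeholder stage (hypothetical): appends its name so the order stays visible."""
    return lambda image: image + [name]

models = {name: make_stage(name) for name in ("ESRGAN", "SwinIR", "LDM")}

# Ordered arrangements without repetition (an assumption about the experimental design)
two_stage = list(permutations(models, 2))
three_stage = list(permutations(models, 3))

output, seconds = run_pipeline([models[n] for n in ("SwinIR", "LDM")], [])
print(len(two_stage), len(three_stage), output)
```

Since the stages run one after another, the inference time of a pipeline is roughly the sum of its stages' times, which is why multi-stage configurations become expensive even when the quality gain is small.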
5.4. Fine-Tuning Using LoRA
This table illustrates that SwinIR achieved the highest PSNR and SSIM values, indicating superior reconstruction quality. LDM obtained the lowest FID score, reflecting better perceptual realism. In contrast, ESRGAN recorded the lowest PSNR and SSIM with the highest FID.
The results of the LoRA-fine-tuned models show significant gains across all assessment measures compared to the baseline, as shown in
Table 5. SwinIR with LoRA achieves the best reconstruction fidelity, with a PSNR of 36.20 dB and an SSIM of 0.8650, indicating high structural preservation and pixel-level accuracy. LDM with LoRA delivers the strongest perceptual realism, with the lowest FID (44.96).
Similarly, ESRGAN adapted with LoRA achieves significant improvements in PSNR and SSIM and a significantly lower FID than its baseline counterpart. Overall, the findings show that LoRA is an effective method for adapting models to satellite images, because it refines a small set of task-specific parameters rather than the entire parameter set. The scale of the enhancement across all three architectures points to the effectiveness of parameter-efficient fine-tuning for remote sensing super-resolution.
As shown in
Figure 16, the performance comparison among the LoRA-fine-tuned ESRGAN, SwinIR, and LDM models demonstrates the impact of fine-tuning on image reconstruction quality.
All three models improve after LoRA fine-tuning.
Figure 17 indicates a large increase in PSNR and SSIM, and a steep decline in FID, across all models compared to their baseline values. The consistent improvement across the GAN-based, transformer-based, and diffusion-based paradigms suggests that LoRA is a model-agnostic, robust adaptation strategy. The results also confirm that low-rank parameter updates have high potential to improve both reconstruction fidelity and perceptual quality in satellite image super-resolution.
Figure 17: Comparison of image restoration performance across ESRGAN, SwinIR, and LDM models before and after LoRA fine-tuning.
Figure 17 directly compares the baseline and LoRA fine-tuned models on the PSNR, SSIM, and FID metrics. Subplot (a) shows that all models achieve a significant gain in PSNR after fine-tuning, demonstrating a substantial enhancement in pixel-level reconstruction accuracy. The highest PSNR is obtained with SwinIR, followed by ESRGAN and LDM. Subplot (b) shows a steady increase in SSIM for all architectures, which indicates increased structural similarity and better preservation of spatial detail in the satellite images. The trend of improvement is also consistent, indicating LoRA's success across various model designs. Most notably, subplot (c) shows a sharp drop in FID scores after fine-tuning, indicating much better perceptual realism and closer proximity to real image distributions. The size of the FID reduction in all models indicates the effectiveness of LoRA in enhancing the generators without changing their structure. Overall, the visual patterns in
Figure 17 show that LoRA fine-tuning achieves consistent improvements in the GAN-based, transformer-based, and diffusion-based architectures.
As shown in
Table 6, SwinIR shows the best reconstruction quality, with the highest PSNR (36.20 ± 1.05 dB) and SSIM (0.8650 ± 0.0381) values, whereas LDM has the best perceptual quality, with the smallest FID (44.96 ± 3.18). ESRGAN, by contrast, performs worse in terms of SSIM and has a higher FID, indicating lower perceptual quality. The baseline models perform much worse across all metrics, which demonstrates that the fine-tuned models are effective.
Although
Table 6 describes the performance differences between the models, it does not show whether these differences are significant. Therefore,
Table 7 is added to confirm these observations with pairwise statistical significance tests (Wilcoxon signed-rank test and
t-test).
As shown in
Table 7, the results confirm that all performance differences are statistically significant (
p < 0.05), which shows that the improvements are not due to random variation. In addition to reconstruction accuracy and perceptual quality, inference time is evaluated to measure the computational efficiency of all model configurations.
The baseline models exhibit moderate processing times, with the ESRGAN model being the most computationally efficient, while the LDM model has the highest inference time due to its iterative diffusion process, as shown in
Table 8 and
Figure 18.
It is worth noting that models fine-tuned with LoRA achieve a noticeable improvement in reconstruction performance with only a slight increase in inference time, reflecting their high efficiency. In turn, sequential pipeline configurations lead to a significant increase in inference time due to the cumulative execution of multiple models, often without commensurate performance gains. This effect is amplified in multi-stage pipelines, which impose a high computational load, limiting their suitability for real-time or time-sensitive applications such as disaster monitoring and precision agriculture. Although the ensemble approach provides a balance between performance and efficiency, it still incurs additional computational cost compared to individual models. Overall, these results show that LoRA-based models offer the best balance between reconstruction quality and computational efficiency, making them more suitable for practical applications.
5.5. Discussion
Although LoRA reduces training complexity by limiting the number of learnable parameters, it does not decrease inference time, as the additional low-rank computations are performed during execution. As shown in
Table 8, this results in only a marginal increase in inference time—for example, from 26.85 to 28.5 s for ESRGAN, from 33.57 to 36.2 s for SwinIR, and from 81.35 to 89.5 s for LDM. Despite the relatively small added cost, LoRA delivers significant improvements in reconstruction performance, with marked increases in PSNR and SSIM and a decrease in FID. Finally, models using LoRA are still much more efficient than sequential processing, which incurs much higher costs. In this way, LoRA strikes a good compromise between improved performance and efficiency, making it a useful approach for fine-tuning super-resolution models.
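As a quick check of the overhead quoted above, the relative increase in inference time can be computed directly from the reported values. The sketch below uses only the three pairs of timings given in the text; the variable and function names are illustrative.

```python
# Inference times in seconds quoted in the text: (baseline, with LoRA)
times = {
    "ESRGAN": (26.85, 28.5),
    "SwinIR": (33.57, 36.2),
    "LDM": (81.35, 89.5),
}

def relative_increase(before, after):
    """Percentage increase of `after` over `before`."""
    return (after - before) / before * 100.0

overhead = {name: relative_increase(b, a) for name, (b, a) in times.items()}
for name, pct in overhead.items():
    print(f"{name}: +{pct:.1f}%")
```

The resulting overheads are roughly 6%, 8%, and 10% for ESRGAN, SwinIR, and LDM, respectively.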
One potential confounder in analyzing the fine-tuning results is that part of the performance gain may result from the additional training itself, not necessarily from LoRA's adaptation technique. Because the baseline models were trained in limited settings and had not fully converged, there is still room to improve them through further training, even without LoRA. An ablation study that trains the models for a similar number of iterations without LoRA modules, as a control group, would support a stronger causal claim. We acknowledge this limitation and regard it as one of the most promising directions for future work. Nevertheless, the significant performance improvements, obtained while training less than 3.1% of the overall model parameters, suggest that LoRA's low-rank constraint introduces an additional inductive bias.
To validate these observations, a comprehensive statistical analysis was conducted across the test dataset. Evaluation metrics are reported alongside standard deviation and confidence intervals to ensure the reliability of the results. In addition, pairwise statistical significance tests confirm that the observed improvements are statistically meaningful, indicating that LoRA’s performance gains are consistent rather than attributable to random variation.
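The kind of summary statistics described above can be sketched as follows. This is an illustration rather than the analysis script used in the study: the per-image scores are hypothetical, the confidence interval uses a normal approximation (z = 1.96), and only the paired t statistic is computed, not its p-value.

```python
import math
import statistics

def summarize(scores, z=1.96):
    """Return mean, sample standard deviation, and an approximate 95% confidence interval."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    half_width = z * std / math.sqrt(len(scores))
    return mean, std, (mean - half_width, mean + half_width)

def paired_t_statistic(a, b):
    """t statistic of the per-image differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical per-image PSNR values (dB) for the same five test images
baseline = [24.0, 25.0, 23.0, 26.0, 24.0]
finetuned = [35.0, 37.0, 34.0, 37.5, 36.0]

mean, std, ci = summarize(finetuned)
t_stat = paired_t_statistic(finetuned, baseline)
print(f"{mean:.2f} ± {std:.2f} dB, 95% CI {ci[0]:.2f} to {ci[1]:.2f}, t = {t_stat:.2f}")
```

In practice, p-values for both tests can be obtained with `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon`, applied to the same paired per-image scores.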
The experimental results yield several noteworthy observations. First, hybrid sequential pipelines provide only marginal improvements in performance while incurring considerable computational overhead in the context of satellite image super-resolution. Second, ensemble inference with output averaging tends to introduce unwanted over-smoothing artifacts. In contrast, LoRA-based parameter-efficient fine-tuning proves highly effective while demanding minimal computational resources. Fine-tuning with LoRA was consistently effective across all model architectures tested. In general, LoRA fine-tuning is superior to the model combination techniques, and it appears well suited to fine-tuning both restoration-based and generation-based super-resolution models.
The significant improvement in PSNR obtained with LoRA fine-tuning, compared to the baseline super-resolution models, may be due to several factors.
First, the baseline models were not originally adapted to the target dataset, which limits their capacity to reconstruct domain-specific images accurately. LoRA fine-tuning addresses this limitation by enabling the model to learn the statistical characteristics of the dataset effectively.
Second, the dataset comprises specialized images that could be quite different from those seen by the model during the pre-training phase. Using a pre-trained model on this new dataset tends to produce poor results, but through fine-tuning using LoRA, the model learns to adjust to the specialized nature of the data, which produces much better results for image reconstruction.
Third, LoRA allows the model to retain the knowledge acquired during pre-training while adjusting only a small number of parameters.
Regarding trainable parameters: in full fine-tuning, all model parameters are updated during training. By contrast, LoRA injects low-rank adaptation matrices into selected layers—such as attention layers—while keeping the original pre-trained weights frozen. This approach optimizes only a small fraction of the total parameters, greatly reducing training cost. With respect to training time, since LoRA modifies only a limited number of parameters, the backward pass is more efficient. Computing gradients and performing optimizer updates requires substantially less computation than full fine-tuning, thereby reducing total training time.
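To make the parameter saving concrete, the following is a minimal sketch of a LoRA-adapted linear layer in plain Python. It assumes the standard formulation y = Wx + s·B(Ax), with W frozen and only A and B trained; the layer size (768 × 768), the rank (8), and all names are hypothetical and are not the configurations used in this study.

```python
def lora_param_counts(d_in, d_out, rank):
    """Parameters of a full d_out x d_in weight versus its rank-r LoRA adapter."""
    full = d_in * d_out
    lora = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, lora, lora / full

def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

# Hypothetical layer size and rank
full, lora, ratio = lora_param_counts(768, 768, 8)
print(f"full: {full}, LoRA: {lora}, ratio: {ratio:.2%}")

# Tiny rank-1 example: with B initialized to zero, the adapted layer
# reproduces the frozen layer exactly before any training step.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]
B = [[0.0], [0.0]]
y = lora_forward(W, A, B, [2.0, 3.0])
```

For this hypothetical layer, the adapter holds about 2% of the parameters of the full weight matrix, which falls within the small trainable fraction discussed in this section.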
In terms of GPU memory usage, LoRA reduces memory consumption by storing gradients only for the low-rank matrices rather than for the entire model. This allows training with larger batch sizes or on hardware with limited memory capacity. As shown in
Table 9, a detailed comparison between full fine-tuning and LoRA-based fine-tuning is provided for each super-resolution model, covering total parameters, trainable parameters and their ratio, and GPU memory requirements under each training strategy.
As shown in
Table 9, full fine-tuning requires updating all model parameters, which incurs considerable computational and memory costs. In comparison, LoRA requires training only about 1–3% of the parameters. This strategy reduces both GPU memory requirements and training time while still improving the performance of the pre-trained model. The benefits of LoRA are most evident in larger models such as LDM, where the parameter and memory reductions are much more significant.