1. Introduction
Medical imaging [
1,
2,
3,
4], internet video delivery [
5,
6,
7], surveillance and security via person identification [
8,
9,
10], and remote sensing [
11,
12,
13,
14,
15,
16] are just some examples of real-world applications where image super-resolution (SR) has been used. In SR, we aim at recovering high-resolution (HR) images from low-resolution (LR) ones. This is a non-trivial and generally ill-posed problem since multiple HR images exist corresponding to a unique LR image [
17]. It is important to remark that, in this article, image resolution means precisely the dimensionality of the image. For instance, an image has a resolution of
pixels. This is not to be confused with other definitions of resolution existing in certain communities, such as remote sensing (spatial resolution, radiometric resolution, temporal resolution, …).
Several classical methods have been designed for image SR, such as bicubic interpolation and Lanczos resampling [
18], edge-based methods [
19], statistical methods [
20], among others. But, naturally, with all the developments related to deep learning (DL) and deep neural networks (DNNs) [
21], a significant number of strategies have been proposed addressing SR via DL/DNNs as reported in recent secondary studies [
11,
17]. Among the DL techniques, convolutional neural networks (CNNs) [
13,
22,
23,
24], generative adversarial networks (GANs) [
12,
14,
25,
26], and attention-based networks [
15,
27,
28] have been employed to solve image SR problems.
The majority of studies published up to now focus on supervised SR where models are trained with both LR images and the corresponding HR ones [
11,
17]. But, in reality, it is not easy to have images from the same scene but with distinct resolutions so that having the pairs LR-HR for training is not as direct as it is supposed to be. Hence, in unsupervised SR [
29], only unpaired LR-HR images are available for training, so that the models are more able to learn real-world LR-HR mappings.
But an even more attractive strategy for real-world scenarios is blind image SR [
30,
31] which is based on the idea that the degradation process/kernels is/are unknown, and, hence, techniques in this context rely basically on LR images, not requiring the high-resolution ones. There is an increasing interest in blind image SR.
Despite the huge number of proposed techniques and experiments boosted by DL techniques [
11,
17], they do not usually design evaluations with high scaling factors, capping it at 2× or 4×. One of the exceptions is the experiment presented in [
11] where authors considered 2×, 4×, and 8× scaling factors. However, the authors considered a multi-sensor remote sensing dataset consisting of mostly publicly available very HR satellite images. Even if the images are from different satellites and regions, eventually the diversity of the images and feature spaces are not enough to proper evaluate SR techniques.
Several other datasets have been used for training image SR approaches, such as BSDS300 [
32], BSDS500 [
33], DIV2K [
34], PIRM [
35], Set5 [
36], Set14 [
37], and Urban100 [
38]. These datasets comprise different categories but still lack of images of other domains. For instance, all the datasets above do not seem to present images obtained via satellite sensors or medical images. It is interesting to stress that images taken by sensors embedded in satellites or airplanes have considerable differences compared to natural images due to different shooting content and shooting methods [
39]. Moreover, as previously remarked, evaluations are usually limited to the 4× scaling factor. In [
40], results are presented for the 8× scaling factor but considering only the Set5 and Set14 datasets. We believe that performing experiments addressing not only high scaling factors (e.g., 8×) but also datasets of different domains is very important to better identify the most adequate image SR approaches.
Image quality assessment (IQA) metrics (methods) are roughly divided into two categories: full-reference (FR-IQA) and no-reference (NR-IQA) [
41]. In FR-IQA, we evaluate the similarity between a distorted image and a given reference image, and classical metrics which have been extensively used are peak signal-to-noise ratio (PSNR) [
11] and structural similarity index measure (SSIM) [
42]. On the other hand, NR-IQA metrics are proposed to assess image quality without a reference image, being more suitable for perceptual quality. Natural image quality evaluator (NIQE) [
43] and perception index (PI) [
35] are two traditional metrics. However, recent metrics can be devised based on DNNs, as presented in the
New Trends in Image Restoration and Enhancement (NTIRE) workshop held in 2022 at the Conference on Computer Vision and Pattern Recognition (CVPR 2022) [
41]. And, for the first time, NTIRE 2022 Challenge on Perceptual Image Quality Assessment had a track addressing NR-IQA. We believe that for evaluating approaches in a completely “blind” setting, NR-IQA metrics are more interesting since they do not demand reference images, only the LR ones.
In this article, we present a high-scale (8×) controlled experiment which evaluates five recent DL techniques tailored for blind image SR: Adaptive Pseudo Augmentation (APA) [
44], Blind Image SR with Spatially Variant Degradations (BlindSR) [
45], Deep Alternating Network (DAN) [
46], FastGAN [
47], and Mixture of Experts Super-Resolution (MoESR) [
48]. Relying basically on public sources, we adapt and create 14 LR image datasets (each one with 100 samples) from five different broader domains: Aerial, Fauna, Flora, Medical, and Satellite. Another distinctive characteristic of our evaluation is that some of the DL approaches were designed for single-image SR but others not.
Two NR-IQA metrics were selected, being the classical NIQE and the recent vision transformer(ViT)-based multi-dimension attention network for no-reference image quality assessment (MANIQA) score [
49], to assess the techniques. The MANIQA model was the winner of the NTIRE 2022’s NR-IQA track obtaining a performance considerably higher than classical strategies, such as PI and NIQE.
The contributions of this study are:
We design and execute a controlled experiment considering five recent blind image SR approaches and focusing on a specific large scaling factor (8×). Note that we decided to present a more detailed analysis of the results, trying to explain in more depth the behaviours of the approaches. We also perform a correlation analysis taking into account the HR images produced by the two best DL techniques. We believe that independent and unbiased evaluations are important to indicate the most suitable approaches to be selected by professionals who need to work with image SR in their practical settings;
We adapt and create 14 LR image datasets from five different broader domains: Aerial, Fauna, Flora, Medical, and Satellite. We believe that making available these datasets to the community, obtained from quite distinct domains, is interesting to provide other possibilities to evaluate the blind image SR techniques considering not only single-image but non-single-image techniques (we had indeed done this);
We consider a recent DNN-based NR-IQA metric (MANIQA score) in addition to a classical one (NIQE), and present some remarks by using such metrics.
This article is structured as follows.
Section 2 briefly presents the theoretical background and related work. In
Section 3, we show in detail the design of our experiment. Results are in
Section 4 while
Section 5 discusses some important points. In
Section 6, conclusions and feature directions are pointed out.
2. Background
This section presents an overview of the theoretical background related to image SR. More details can be seen elsewhere [
11,
17]. At the end, we also discuss about related work. The goal of image SR is to recover the corresponding HR images from the LR images. The LR image,
, is usually modelled as the output of the following degradation
where
means a degradation mapping function,
is the corresponding HR image, and
represents the parameters of the degradation process. When the degradation process (
and
) is unknown, which is very common to happen, and only LR images are available for the techniques, we have the so called blind image SR. In this case, the idea is to recover an HR approximation,
, of the ground truth HR image,
, from the LR image as presented below
where
is the SR model and
are the parameters of
.
Despite of the fact that the degradation process is usually unknown, several studies model the degradation as follows
where ↓ is a downsampling operation and
s is a scaling factor. But, some studies [
50] do such a degradation modelling as presented below
where
means the convolution between a blur kernel,
, and the HR image,
, and
s is a scaling factor. Moreover,
is some additive white Gaussian noise with standard deviation
.
2.1. Metrics
As previously mentioned, there are several metrics for IQA, being full-reference (FR-IQA) or without relying on a reference (NR-IQA). We present here a brief discussion about the two NR-IQA metrics we have considered in this research: NIQE [
43] and MANIQA score [
49].
The main motivation to develop NIQE is that, at the time it was proposed, NR-IQA models required knowledge about anticipated distortions in the form of training samples and corresponding human opinion scores. Thus, NIQE is a completely blind approach which only uses measurable deviations from statistical regularities observed in natural images, without training on human-rated distorted images and, hence, without any exposure to distorted images. The idea is to construct a “quality aware” collection of statistical features based on a successful space domain natural scene statistic (NSS) model.
NIQE is derived by computing 36 identical NSS features from patches of the same size,
, from the image to be analysed, fitting them with a multivariate Gaussian model (MVG) [
51], then comparing its MVG fit to the natural MVG model. They considered a patch of dimension (resolution)
, although other dimensions were also evaluated. The quality of the distorted image is expressed as the distance between the quality aware NSS feature model and the MVG fit to the features extracted from the distorted image, as shown below
where
and
are the mean vectors of the natural MVG model and the distorted image’s MVG model, respectively. Furthermore,
and
are the covariance matrices of the natural MVG model and the distorted image’s MVG model, respectively. Thus, the
lower the NIQE value, the better the perceptual quality of the generated HR image.
While NIQE is a very traditional metric, the MANIQA score is a very recent one. Top approach of the NTIRE 2022’s NR-IQA track [
41], the motivation for the development of the MANIQA model and its corresponding score is that NR-IQA metrics/methods are limited when assessing images created by GAN-based image restoration algorithms [
26,
52]. As shown in
Figure 1, the MANIQA model consists of four components: feature extractor using ViT [
53], transposed attention block, scale swin transformer block, and a dual-branch structure for patch-weighted quality prediction.
In more detail, a distorted image is cropped into patches of dimension 8 × 8. Then, the patches are inputted into the ViT for extracting the features. Transposed attention block and scale swin transformer block are used to strengthen the global and local interaction. A dual-branch structure is proposed for predicting the weight and score of each patch. Notice that the higher the MANIQA score, the better the perceptual quality of the generated HR image.
2.2. Related Work
This study is a controlled experiment within blind image SR via DL/DNNs. Basically every study within blind image SR presents experimental evaluations and, to a lesser or greater extent, they are related to ours. As mentioned in
Section 1, even if there are many techniques and experiments proposed boosted by DL techniques, as corroborated by secondary studies [
11,
17], in general the studies do not consider high scaling factors such as 8×. When there are exceptions [
11,
40], there is no significant diversity of images and feature spaces to proper evaluate the image SR approaches, taking into account quite distinct broader domains such as medicine images, images obtained by satellites via sensors with different characteristics, and images more “usual” like those of animal’s faces.
It is worth noting that dealing with this range of different domains is important to better classify the techniques in terms of their performances. In other words, the wider/general the assessments are, the better. Furthermore, independent evaluations, like the one presented in this article, are valuable to guide professionals in making the correct decisions in choosing the most appropriate blind image SR solution. Our study addresses all these previous points and these are the main differences between our effort and other already published articles.
3. Experiment Design
In this section, we describe the main design options of the controlled experiment to assess the performance of the five DL techniques for blind image SR: APA, BlindSR, DAN, FastGAN, and MoESR.
Figure 2 presents the workflow (activity diagram) related to this study as a whole. Such a figure emphasises the main steps, detailed below, related to the design of the experiment and ends with an additional discussion of relevant points observed when carrying out the experiment (also detailed below).
3.1. Research Questions and Variables
The research questions (RQs) related to this experiment are:
RQ_1—Which out of the five algorithms for blind image SR is the best regarding the metrics NIQE and MANIQA score? And which can be considered the best overall?
RQ_2—Does the two top approaches present similar behaviours when deriving HR images?
The motivation for RQ_1 is self-explained. Regarding RQ_2, the idea here is to perceive whether the two best approaches present similar behaviours. In other words, are the images detected as having the best, as well as the worst, quality perceptions, based on the MANIQA scores, somewhat “common” to both best DL techniques? Our goal here is to see whether the best algorithms “agree” in terms of the HR images they create based only on the LR ones. The independent variables are the DL models. The dependent variables are the values of the metrics, i.e., NIQE and MANIQA score.
3.2. Datasets
The 14 datasets comprise five different broader domains: Aerial, Fauna, Flora, Medical, and Satellite. Note that the Aerial broader domain means data/images obtained at altitudes lower than 100 km (62 miles; Kármán Line) above the mean sea level. For example, images obtained by sensors attached/embedded in airplanes or unmanned aerial vehicles (drones). The Satellite broader domain relates to the space and it implies data/images obtained from sensors, usually onboard satellites, which are at a minimum altitude of 100 km above the mean sea level.
Table 1 and
Table 2 present details about the datasets we created based on publicly released images. In
Table 2, note that the original resolution of the images in the datasets are diverse. Some source datasets, like
condoaerial and
massachbuildings, are formed by HR images where all samples have the same resolution. The
plantpat dataset has HR images too but there are several different resolutions. Others like
catsfaces and
isaid have HR, LR, and even medium resolution images. Three Satellite datasets,
amazonia1,
cbers4a, and
deepglobe, are composed only by LR images and all have the same resolution.
Given the different domains and image resolutions (in some cases, as shown in
Table 2, we just had at our disposal LR images), we therefore decided to default the input LR images to a resolution of 128 × 128 pixels. Thus, we downsampled the images which have higher resolution than that based on the bicubic interpolation method as many others have been doing [
11,
17].
Figure 3 presents some LR images from the datasets we created.
Note that some criticise doing such a resizing because real-world cameras actually accomplish a series of operations (demosaicing, denoising, …) to finally produce 8-bit Red-Green-Blue (RGB) images [
17]. Thus, RGB images have lost lots of original signals, becoming significantly different from the original images taken by the camera. Therefore, it would not be interesting to directly use the manually downsampled RGB images obtained via, for instance, a bicubic interpolation method for image SR.
However, we must emphasise that our main goal is to perform an experiment using DL techniques for blind image SR, based on different domains, in order to provide some sort of recommendation to professionals who need to choose a technique that best fits their needs. Although the issues cited earlier exist, they impact equally all the DL techniques and hence they do not compromise our analysis.
In addition to standardising the resolution of the LR input images, we also considered the same small number of images in the datasets: 100 samples. There are more details about this point in the next section.
3.3. Amount of Required Images
Taking into account the amount of required images, recent DL techniques for image SR are created assuming a single image, known as single-image SR, but they can also be devised assuming that there are a few (limited) number of samples, which we can call few-shot image SR. Note that, here, we consider that few-shot SR assumes few but more than one image. There are also other possibilities, like zero-shot SR presented by [
54], where authors coped with unsupervised SR by training image-specific SR networks at test time rather than training a model on huge datasets.
Traditionally, when researchers propose a new single-image SR technique, they naturally compare their approach to other single-image SR strategies. However, it would be nice to compare single-image SR techniques to non-single-image ones to perceive whether it is advantageous to consider approaches which rely on a unique image or, eventually, it is better to adopt solutions which require more than one image but not so many of them.
Thus, three of the selected techniques were designed addressing single-image SR, i.e., BlindSR, DAN, and MoESR. The other two, APA and FastGAN, are GAN-based approaches requiring a few samples indeed and were not conceived for single-image SR. Since the two latter did require more than one sample but not too many, we limited the size of the datasets to 100 LR images, as mentioned in the previous section. In
Section 3.4, we provide an overview of such DL strategies.
All runnings were performed using a Bull Sequana X1120 computing node of the SDumont supercomputer (
https://sdumont.lncc.br/machine.php?pg=machine (accessed on 23 July 2023)), where such a node has 4× NVIDIA Volta V100 graphics processing units (GPUs). Each run was limited to four days being considered the latest model when the execution exceeded this time.
3.4. DL Techniques
In this section, we briefly describe the selected DL techniques starting with the single-image approaches. In [
45], authors proposed a framework that can achieve blind image SR in a fully automated manner, while also meeting the practical scaling needs of video production. We name it here as BlindSR and such a framework is composed of three main components. The first one is a degradation-aware SR network used to generate an HR image based on a LR input image and the corresponding blur kernel. Secondly, a kernel discriminator is trained to analyse the output HR image and predict errors caused by incorrect blur kernel input. Finally, an optimisation procedure aims to recover both the degradation kernel and the HR image by minimising the predicted error using their kernel discriminator.
DAN solves the blind image SR problem via an alternating optimisation algorithm which restores an HR image and estimates the corresponding blur kernel alternately [
46]. Thus, these two subproblems are handled both via convolutional neural modules namely Restorer and Estimator, respectively. More specifically, the Restorer restores an HR image based on the predicted Estimator’s kernel, and the Estimator estimates a blur kernel with the help of the restored HR image.
Some blind SR techniques train a unique degradation-aware network (for multiple kernels) on external datasets like BlindSR described above [
45]. But there are performance problems with these approaches. There are also self-supervised techniques [
55] but they are usually costly and there is limited information to learn from a single image. Aiming at benefiting from the best characteristics of both types of solution, MoESR [
48] was proposed considering different experts for different degradation kernels. For every input image, the technique predicts the degradation kernel and super-resolve the LR image using the most adequate kernel-specific expert. To predict the degradation kernel, they use two components in combination. Image Sharpness Evaluator (ISE) assesses the sharpness of the images generated by the experts. These evaluations are used by the Kernel Estimation Network (KEN) to estimate the kernel and select the best pretrained expert network.
Regarding the non-single-image techniques, the greatest motivation to create APA [
44] is related to the main challenge in training GANs with limited (few) data: the risk of discriminator overfitting which can lead to unstable training dynamics [
56,
57]. To deal with this issue, the technique called Adaptive Pseudo Augmentation, or APA, regularises the discriminator without introducing any external augmentations or regularisation terms. Unlike previous approaches that relied on standard data augmentations [
56,
57], APA leverages the generator within the GAN itself to provide the augmentation, a more natural way of regularisation of the overfitting of the discriminator. In accordance with the authors, the APA approach is simple and more adaptable to different settings and training conditions than model regularisation, without requiring manual tuning.
Other few-shot image SR solution and based on GANs is FastGAN [
47]. Its authors were driven by the realisation that training GANs on high-fidelity (HR) images often necessitates a vast number of training images and large-scale GPU clusters. To address the challenge of few-shot image synthesis via GAN with minimal computing costs, FastGAN was proposed as a lightweight GAN architecture which produces high-quality results at a resolution of 1024 × 1024 pixels. Their technique involves incorporating two techniques: a skip-layer channel-wise excitation module and a self-supervised discriminator trained as a feature-encoder.
3.5. Metrics
As we have already mentioned, NIQE and the MANIQA score were the selected metrics. NIQE was calculated considering the HR images (1024 × 1024 pixels) generated by the techniques. However, according to its authors, the MANIQA model [
49] is best suited for 224 × 224 pixel images. Thus, we downsampled the created HR image (1024 × 1024 pixels) to a resolution of 224 × 224 pixels in order to calculate the MANIQA score.
Note that such a degradation approach accomplished to calculate the MANIQA score was the same for all generated HR images and, therefore, there is no favouring of a certain technique in relation to the obtained value. Furthermore, we got the MANIQA score considering a subset of the datasets and HR images (1024 × 1024) and we did not notice a great difference in the scores compared to when we used the reduced images (224 × 224). Thus, we followed the authors’ suggestions and considered the images in the lowest resolution (224 × 224) as input for calculating the MANIQA score.
As shown in
Table 2, and as we have already pointed out, the original images of some datasets are HR ones. But there are datasets which have only original LR images and not the corresponding HR samples. In addition to that, NR-IQA metrics are more in line with the reality when we deal with a blind setting. These are the reasons to select only NR-IQA metrics for our experiment.
5. Discussion
We start this section by stressing some points based on a visual analysis.
Figure 7 shows two LR images and pieces of the corresponding HR images generated by the three single-image blind SR techniques. The top HR image generated by DAN, based on an LR image of the
isaid Satellite dataset, is the one with the highest MANIQA score (0.850084) of all images. We clearly see that the roof divisions of the building appear more vivid in the image created by DAN. MoESR’s image appears blurred while BlindSR’s image seems to present something like a salt-and-pepper noise.
On the other hand, the bottom HR image created by BlindSR is the one with the lowest MANIQA score (0.240234) of all images. The LR image is from the
cbers4a Satellite dataset where the scene presents water and land. This is a cloudy scene obtained by the CBERS-4A satellite and it is interesting to realise that such bright images from
cbers4a, in general, produced the lowest MANIQA scores (see
Figure 5).
Other LR images and the corresponding pieces of HR images created by the three single-image blind SR models are presented in
Figure 8. MoESR produced the highest MANIQA score for the top images (
condoaerial,
flowers,
melanomaisic,
amazonia1) while DAN produced the highest score for the bottom one (
deepglobe).
Since APA and FastGAN are GAN-based non-single image techniques, some known issues of GANs appeared here. In order to use APA in a custom dataset, three phases are required: prepare the dataset, training, and inference for generating images. As usual, the training is the more demanded step and APA indeed exceeded the limit we set up (four days) not finishing the training. We considered its latest model for inference. We used 4× NVIDIA Volta V100 GPUs and the datasets are very small (100 samples). It is important to mention that other DL techniques demanded only one out of the four GPUs and the limit was enough for them. Thus, APA is a very “heavy” model. Furthermore, some HR images were flipped and others upside down compared to the source LR images. There are even cases where the images were not indeed generated (it is likely that the images are basically noise).
Figure 9 shows these cases where we have the LR image and the corresponding APA’s HR image. Note that we downsampled the APA images to 128 × 128 pixels for this figure to compare to the LR images. Moreover, the
deepglobe image is here just to show a sample of the dataset since APA’s output seems to be basically noise.
As for FastGAN, the required phases to use it are training and inference, and its training is considerably faster/lighter than APA. But both FastGAN and APA presented a very known problem related to GANs: mode collapse [
58]. It happens when the generator model produces a not very significant set of images that fail to capture the full diversity of the real data distribution. Thus, the fake created samples are quite similar or even identical. In
Figure 10, we see some images from FastGAN where mode collapse occurs. Again, we downsampled the HR images for better presentation.
Based on the MANIQA scores (
Section 4.1.2), APA and FastGAN were the worst techniques. Thus, we can conclude based on the results of our experiment that, for blind image SR, single-image and non-GAN-based approaches are the best way to go.
Considering the best two DL techniques in accordance with the MANIQA score, we may say that the HR images created by MoESR are sharper than the ones of DAN.
Figure 11 presents the measures of the maximum overall contrast of the images with the highest MANIQA scores taking into account both techniques and the 14 datasets. For instance, for the
amazonia1 dataset, the image derived by MoESR has higher MANIQA score than the one of DAN and, thus, we considered the former and the corresponding DAN’s image. As for
isaid, DAN’s image got higher MANIQA score and, thus, this image and the corresponding MoESR’s image were analysed.
Figure 12 follows the same reasoning but now the images are the ones with the lowest MANIQA scores.
Since the higher the maximum overall contrast the sharper the image, we can realise that the images of MoESR are sharper than the ones of DAN. But, note that perceptual quality is not necessarily the same thing as sharpness. An image with higher contrast does not imply that it has a better perceptual quality than one with lower contrast.
A possible explanation for the images created by MoESR being sharper than DAN’s images, and sometimes present oversharpening, is the combination of the perfomances of ISE and KEN. The ISE component is trained to detect blurry or oversharpened regions and predicts errors from the ground-truth image. Since KEN uses the sharpness measures from ISE to estimate the kernel and select the best pretrained model, misleading evaluations of sharpness by ISE may compromise the decision made by the KEN component.
On the other hand, HR images generated by DAN are generally blurry which might lead to a poor perceptual quality. In DAN, the kernel is initialised by Dirac function, and it is also reshaped and then reduced by principal component analysis (PCA) [
59]. The kernel is reduced by PCA and, thus, the estimator only needs to estimate the PCA result of the blur kernel. There is naturally loss of information when using PCA for dimensionality reduction, and recent evaluations show that PCA results are not as reliable and robust as it is usually assumed to be [
60]. This is a possible explanation for this issue related to DAN.
As mentioned in
Section 4, there is not an agreement between the ranking of best and worst techniques considering a classical metric (NIQE) and a DNN-based one (MANIQA score). While APA presented the best performance under NIQE, it is almost the worst solution according to the MANIQA score. Thus, we believe that relying on a more recent NR-IQA metric, like the MANIQA score and which presented quite superior performance [
41] than traditional metrics, is more advisable.
Notice that we do not have the mean opinion scores (MOSs) of the HR images as we do not submit them for observers to assign it. But, looking at the HR images generated by all DL techniques for all sets (see the datasets repository of this research), it is not difficult to see that the perception quality of the images as a whole needs to improve. Note that, in MoESR and DAN, the generation of the HR images had to be done in two stages, using the 2× and 4× scaling factors in sequence. In general, there are several blurred images and others with oversharpening. Despite the sophistication of the evaluated DNNs, we believe that new approaches, addressing larger scaling factors, are necessary for the future.
6. Conclusions
Given the significant number of DL techniques/DNNs that are being created at a rapid pace, it is relevant to perform independent and unbiased controlled experiments to suggest the most suitable approaches for professionals. This article is in line with this reasoning, where we presented a large scaling factor (8×) experiment which evaluated recent DL techniques tailored for blind image SR: APA, BlindSR, DAN, FastGAN, and MoESR. Some of the techniques were designed for single-image while others for non-single-image blind SR. In addition to selecting a larger scaling factor, another difference in this study is that we showed a more detailed analysis of the results, also focusing on explaining in depth the behaviours of the approaches. Mostly based on publicly released sources of images, we adapted and created 14 LR image datasets from five different broader domains to provide a significant range of distinct images to the techniques. We also considered two NR-IQA metrics, the classical NIQE and the DNN-based MANIQA score.
Results show that the MoESR model was the most outstanding approach followed by DAN. GAN-based approaches presented classical issues, such as mode collapse and noise, in some cases. The outcomes of our experiment suggest that single-image and non-GAN-based techniques are more promising for blind image SR, although such a conclusion needs more experimentation to be completely confirmed. By visually inspecting the HR images created by the techniques, we may state that, despite the remarkable solutions, new strategies are required for larger scaling factor requirements.
As future directions, we firstly intend to carry out even higher scaling factor experiments, such as 16×. We will also increase not only the amount of blind image SR approaches to assess but also the number of datasets to get more evidence about the conclusions we have made. Other recent NR-IQA metrics will also be considered in these new controlled experiments.