Online Learning for Reference-Based Super-Resolution

Online learning is a method for exploiting input data to update deep networks at test time to derive potential performance improvements. Existing online learning methods for single-image super-resolution (SISR) utilize the input low-resolution (LR) image for the online adaptation of deep networks. Unlike SISR approaches, reference-based super-resolution (RefSR) algorithms benefit from an additional high-resolution (HR) reference image that contains plenty of useful features for enhancing the input LR image. Therefore, we introduce a new online learning algorithm that uses several reference images and is applicable not only to RefSR but also to SISR networks. Experimental results show that our online learning method is seamlessly applicable to many existing RefSR and SISR models and improves their performance. We further demonstrate the robustness of our method to non-bicubic degradation kernels with in-depth analyses.


Introduction
Deep learning-based single-image super-resolution (SISR) algorithms [1][2][3][4][5][6][7][8][9] have shown remarkable progress in recent years. However, these algorithms still suffer from blurry output images because they are generally trained to minimize the mean squared error (MSE) or mean absolute error (MAE) between the network output and ground-truth images. This problem has led to various efforts to generate high-frequency details with a generative adversarial network (GAN) and/or perceptual losses [10][11][12]. However, these methods often lead to reduced reconstruction performance with unexpected visual artifacts. The reason for this is that GAN-based deep networks often generate visually pleasing images but fail to recover genuine information lost during the downsampling (degradation) process. In order to reconstruct the lost information, reference-based super-resolution (RefSR) methods have been proposed. RefSR algorithms aim to benefit from the rich high-frequency details of an external high-quality reference image, such as video frames [13,14] or similar web images [15], during the reconstruction, and many RefSR methods attempt to align and combine information from a low-resolution (LR) input image and a high-resolution (HR) reference image to synthesize a HR image. To this end, most studies so far have explored how to find and match similar features [16][17][18] of the LR image and the reference image. For instance, patch matching [19], deformable convolution [20], and attention [21] techniques have been utilized. The aforementioned methods have succeeded in transferring the high-frequency details of a reference image. However, these reference-based algorithms show performance degradation when irrelevant high-resolution images are given as references. To address this problem, we present an online learning technique inspired by zero-shot super-resolution (ZSSR) [22]. In ZSSR, a LR input image I_LR and its downsampled version I_LR↓ are used for supervision during the inference phase. However, ZSSR has difficulties when dealing with large scaling factors (e.g., ×3, ×4).
Therefore, in this paper, we propose a method to effectively exploit both LR and HR reference images for online learning to update not only RefSR but also SISR models. Furthermore, using a pre-trained SR model, we create a pseudo-HR image from the LR input image, and then use the pair of the pseudo-HR image Ī_HR and its downsampled version Ī_HR↓ as another datum for the online learning of the SR model. In summary, we perform online learning for both SISR and RefSR models by utilizing three types of supervision: I_LR, I_Ref, and Ī_HR. We use each type of supervision not only individually but also in combination. As a result, the proposed method can benefit from more images during the online adaptation than ZSSR [22].
Our key contributions can be summarized as follows:
• We propose an online learning method for reference-based super-resolution with various data pairs for supervision. To this end, we present three methods for SISR models and four methods for RefSR models;
• Our method is very simple yet effective, and can be seamlessly combined with both SISR and RefSR models;
• Our method shows consistent performance improvements without being significantly affected by the degree of similarity between the reference and input images.

Related Works
In this section, we review deep learning-based SISR and RefSR methods. Then, we introduce recent SR approaches using online adaptation.
Single-image super-resolution (SISR) restores the high-frequency details of a LR image using only the input LR image. Traditional SISR approaches [23,24] usually exploit the self-similarity or self-recurrence of the input LR image. With deep learning, Dong et al. [25] introduced a SISR model with just three convolutional layers and outperformed traditional SISR methods by large margins. Kim et al. [3,4] increased SR performance in terms of PSNR and SSIM by using very deep convolutional layers. Lai et al. [26] suggested LapSRN, which progressively restores high-frequency details with the Laplacian pyramid. Lim et al. [5] removed unnecessary modules in residual networks, such as batch normalization layers, and achieved improved performance. Zhang et al. [7] introduced a channel attention model to handle the inter-dependency across different channels. The aforementioned methods have significantly increased SR performance in terms of PSNR and SSIM. However, their results may be blurry or visually unpleasing to human eyes. To enhance visual quality, Johnson et al. [12] proposed a perceptual loss that minimizes errors on high-level features. Ledig et al. [11] adopted the GAN framework to generate photorealistic images. Furthermore, Wang et al. [27] introduced a relativistic adversarial loss based on a residual-in-residual dense block to produce more realistic images. Perception-based methods have succeeded in producing visually good results, but there is a limit to recovering information lost during the downsampling process.
Unlike SISR, reference-based image super-resolution (RefSR) uses an additional HR reference image as an input to restore high-frequency details. Therefore, information lost in the downsampling process can be obtained from the reference image. Zheng et al. [19] proposed RefSR-Net to combine information from both LR and reference images based on patch matching. Specifically, RefSR-Net extracts local patches from both LR and reference images and then searches for correspondences between them. After that, the resulting matches are used to synthesize a HR image. However, the patch-match-based approach has difficulty handling non-rigid deformations and, thus, suffers from blur or grid artifacts. Using optical flow [28,29] for pixel-wise matching, Zheng et al. [30] presented CrossNet, which combines a warping process and image synthesis. CrossNet can effectively handle non-rigid deformations between the input and reference images; however, it is vulnerable to large displacements. With the recent progress of neural style transfer [31,32], Zhang et al. [33] proposed SRNTT, which performs texture transfer from the reference image. It is particularly robust when an unrelated reference image is paired with an input image. Shim et al. [20] proposed SSEN, which aligns features of the input and reference images using non-local blocks and deformable convolutions. Intra-image global similarities extracted from the non-local blocks are utilized to estimate relative offsets to relevant reference features. After that, deformable convolution operations are used to align the reference features to those from the input low-resolution image. It is an end-to-end trainable network that requires neither optical flow estimation nor explicit patch matching. Yang et al. [21] introduced an attention-based RefSR method called TTSR and achieved significant performance improvements. Recently, Jiang et al. [34] presented a knowledge-distillation technique to solve the matching difficulties caused by the scale difference between the reference image and the LR input image. Lu et al. [35] introduced the MASA network to address the computational burden that may occur when matching the LR image and the reference image. However, these methods are sensitive to the similarity of the reference image.
Recently, Shocher et al. [22] proposed zero-shot super-resolution (ZSSR), which makes a CNN model flexibly adapt to the test image. In other words, the parameters of the CNN model are updated during the test phase, yielding a CNN model optimized for the input image. Furthermore, to reduce the number of updates, meta-learning techniques are applied in [36,37]. In this paper, as the first study to apply online learning to the RefSR problem, we achieve robust RefSR despite differences in similarity between the input and reference images.

Methods
In this section, we introduce our online learning methods for the RefSR problem using both SISR and RefSR models. We then describe the inference process of our method.

Online Learning
In online learning, the most important point is how to exploit the input data given at the test phase. Online learning exploits the self-similarity within an image to learn its internal features; building on this property, we additionally employ high-resolution reference images whose characteristics are similar to those of the test image and use them to improve restoration performance. For the RefSR problem, this is even more crucial, because two kinds of input data are available: a LR image I_LR and the reference image I_Ref. Therefore, we develop various methods to construct pairs of a train-input X and a train-target (supervision) Y from multiple images (i.e., I_LR, I_Ref). Although our main goal in this work is to solve the RefSR problem, we present methods to construct data pairs D which can be used to train not only RefSR but also SISR models at test time.

SISR Model
Existing SISR models require data pairs D_s consisting of an input X and supervision Y (i.e., D_s = {X, Y}) for training. To be specific, we present three strategies: D_s^LR, D_s^Pse, and D_s^Ref. First, D_s^LR consists of a downsampled LR image and the input LR image, and is denoted by D_s^LR = {X : I_LR↓, Y : I_LR}. Note that this is a commonly used configuration to exploit self-similarity in SISR, as in ZSSR [22]. Next, D_s^Pse is constructed with a pseudo-HR image Ī_HR = P_φ(I_LR) obtained from a pre-trained SR model P_φ(·), where φ is the pre-trained network parameter. We then downsample Ī_HR to construct a pair of training samples, and the set is defined as D_s^Pse = {X : Ī_HR↓, Y : Ī_HR}. Finally, we utilize a reference image for D_s^Ref: similar to D_s^LR and D_s^Pse, a downsampled reference image and the original reference image are paired as D_s^Ref = {X : I_Ref↓, Y : I_Ref}. Using these three data pairs acquired in the test phase, the pre-trained parameters θ_s of a SISR model SISR_θs(·) are updated by minimizing a reconstruction loss between SISR_θs(x) and y, where x and y are extracted patches from X and Y in D_s. Note that the aforementioned data pairs can be used individually or in combination.
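As a minimal sketch, the three SISR data pairs can be assembled as plain (input, target) image pairs. The code below uses a simple average-pooling stand-in for the bicubic downsampler and a nearest-neighbor upsampler as a stand-in for the pre-trained model P_φ; both operators, and all variable names, are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np

def downsample(img, scale):
    """Stand-in for bicubic downsampling: average-pool by `scale`."""
    h, w = img.shape
    return img.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

def pseudo_hr(lr, scale):
    """Stand-in for the pre-trained SR model P_phi: nearest-neighbor upsampling."""
    return np.kron(lr, np.ones((scale, scale)))

scale = 4
I_LR = np.random.rand(32, 32)     # input LR image
I_Ref = np.random.rand(128, 128)  # HR reference image

# D_s^LR: self-supervision from the LR input itself.
D_LR_s = {"X": downsample(I_LR, scale), "Y": I_LR}

# D_s^Pse: supervision from a pseudo-HR image produced by the pre-trained model.
I_HR_bar = pseudo_hr(I_LR, scale)
D_Pse_s = {"X": downsample(I_HR_bar, scale), "Y": I_HR_bar}

# D_s^Ref: supervision from the HR reference image.
D_Ref_s = {"X": downsample(I_Ref, scale), "Y": I_Ref}
```

In practice, patches x and y would be randomly cropped from each X/Y pair before every update step.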

RefSR Model
In order to train RefSR models, a pair of data D_r that comprises an input X, a reference R, and supervision Y (i.e., D_r = {X, R, Y}) is required. Thanks to the additional input R, more diverse data pairs can be constructed than for SISR models, and we propose four methods to construct D_r: D_r^LR, D_r^Pse, D_r^Ref1, and D_r^Ref2. First, D_r^LR is composed of a downsampled LR image, a reference image, and the input LR image, and is denoted by D_r^LR = {X : I_LR↓, R : I_Ref, Y : I_LR} (cf. D_s^LR). Similarly, we define the second data pair D_r^Pse = {X : Ī_HR↓, R : I_Ref, Y : Ī_HR} by including I_Ref in D_s^Pse as a reference R. In addition, we can utilize a downsampled reference image I_Ref↓ and the original one I_Ref as the input X and supervision Y, respectively, with the input LR image as the reference R, yielding the third data pair D_r^Ref1 = {X : I_Ref↓, R : I_LR, Y : I_Ref}. Finally, we replace I_LR in D_r^Ref1 with Ī_HR to obtain the last data pair D_r^Ref2 = {X : I_Ref↓, R : Ī_HR, Y : I_Ref}. Note that D_r^Ref1 and D_r^Ref2 extend D_s^Ref by adding I_LR and Ī_HR as the reference. With these data pairs, we can update the network parameters θ_r of the pre-trained RefSR model RefSR_θr(·) by minimizing a reconstruction loss between RefSR_θr(x, r) and y, where x, r, and y are extracted patches from X, R, and Y in D_r. Similar to SISR models, data pairs can be used individually or in combination for online learning. The data pairs for online learning of SISR and RefSR models are summarized in Table 1.
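The four RefSR triplets above differ only in which image plays the role of input, reference, and supervision. A minimal sketch, again with an average-pooling stand-in for bicubic downsampling and a nearest-neighbor stand-in for the pseudo-HR generator (both assumptions for illustration):

```python
import numpy as np

def downsample(img, scale):
    """Stand-in for bicubic downsampling: average-pool by `scale`."""
    h, w = img.shape
    return img.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

scale = 4
I_LR = np.random.rand(32, 32)                      # input LR image
I_Ref = np.random.rand(128, 128)                   # HR reference image
I_HR_bar = np.kron(I_LR, np.ones((scale, scale)))  # pseudo-HR stand-in

# The four RefSR training triplets {X: input, R: reference, Y: supervision}:
D_LR_r   = {"X": downsample(I_LR, scale),     "R": I_Ref,    "Y": I_LR}
D_Pse_r  = {"X": downsample(I_HR_bar, scale), "R": I_Ref,    "Y": I_HR_bar}
D_Ref1_r = {"X": downsample(I_Ref, scale),    "R": I_LR,     "Y": I_Ref}
D_Ref2_r = {"X": downsample(I_Ref, scale),    "R": I_HR_bar, "Y": I_Ref}
```

Note how D_Ref1_r and D_Ref2_r swap the usual roles: the reference image is super-resolved under supervision, while the LR input (or its pseudo-HR version) serves as the reference.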
Table 1. Online learning data pairs for SISR and RefSR models.

Inference
With the updated parameters of the SISR or RefSR models at the test stage, we estimate the final super-resolved output image as I_SR = SISR_θ̂s(I_LR) for SISR models and I_SR = RefSR_θ̂r(I_LR, I_Ref) for RefSR models, where θ̂_s and θ̂_r denote the updated parameters. Notably, unlike RefSR models, SISR models are updated using the reference image in the online learning phase, but the reference image is not used for the final inference.
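The overall procedure of test-time adaptation followed by inference can be sketched with a deliberately tiny stand-in model: a single learnable gain on a fixed nearest-neighbor upsampler replaces the deep SR network (an assumption purely for illustration), while the loop itself mirrors the method: update the parameters on a test-time pair such as D_s^LR, then run inference on I_LR with the adapted parameters.

```python
import numpy as np

def upsample(img, scale=4):
    return np.kron(img, np.ones((scale, scale)))

def downsample(img, scale=4):
    h, w = img.shape
    return img.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

def model(x, theta):
    """Toy stand-in for an SR network: a single learnable gain on a fixed upsampler."""
    return theta * upsample(x)

rng = np.random.default_rng(0)
I_LR = rng.random((32, 32))

# Online-learning pair D_s^LR = {X: I_LR downsampled, Y: I_LR}.
x, y = downsample(I_LR), I_LR

theta = 0.2  # deliberately poor initial "pre-trained" parameter
def loss(theta):
    return np.mean((model(x, theta) - y) ** 2)

loss_before = loss(theta)
step = 0.5
for _ in range(100):  # test-time gradient updates on the online-learning pair
    u = upsample(x)
    grad = 2.0 * np.mean((theta * u - y) * u)  # d/dtheta of the MSE
    theta -= step * grad
loss_after = loss(theta)

# Final inference with the adapted parameter.
I_SR = model(I_LR, theta)
```

The adaptation loop reduces the reconstruction loss on the test-time pair before the single final inference pass, which is exactly the structure of the proposed method (with a real network and optimizer in place of the toy gain).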

Experiments
In this section, we describe implementation details and demonstrate both quantitative and qualitative comparisons with existing methods. We also provide various empirical analyses, including experiments on the similarity between the reference image and the input LR image and experiments using LR images with non-bicubic degradation.

Implementation Details
For both SISR and RefSR models, we used the CUFED dataset [33], which consists of 11,871 pairs of input and reference images, to pre-train the models for ×4 upscaling. As baseline SISR models, we adopted light-weight versions of SimpleNet [22], RCAN [7], and EDSR [5] for fast execution; specifically, the number of residual blocks is reduced from 20 to 6 for RCAN and from 32 to 16 for EDSR, with 64 feature dimensions. Each model is trained for 100 epochs with a batch size of 32, and we use all training data, including both HR and reference images. For RefSR models, SSEN [20] and TTSR [21] are adopted; SSEN is trained for 200 epochs with a batch size of 32, and TTSR is trained for 200 epochs with a batch size of 9. In the online learning phase, the CUFED5 dataset [33] is used. It consists of 126 groups of images, and each group contains a HR image and 5 reference images with different levels of similarity. Images are augmented with random crops (128 × 128), rotations, and flips. The initial learning rate is set to 1 × 10⁻⁴ for the ADAM optimizer, and we multiply the learning rate by 0.1 when the loss stops decreasing [22]. Our method is implemented using PyTorch on Ubuntu 16.04 with a single RTX 2080 GPU.
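The "multiply the learning rate by 0.1 when the loss stops decreasing" rule corresponds to a standard plateau schedule (e.g., PyTorch's `ReduceLROnPlateau`). A minimal sketch of the logic, with hypothetical names and a hypothetical patience value chosen for illustration:

```python
class ReduceOnPlateau:
    """Multiply the learning rate by `factor` after `patience` steps without improvement."""
    def __init__(self, lr=1e-4, factor=0.1, patience=3):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")
        self.bad_steps = 0

    def step(self, loss):
        if loss < self.best:          # loss improved: reset the patience counter
            self.best, self.bad_steps = loss, 0
        else:                         # no improvement this step
            self.bad_steps += 1
            if self.bad_steps >= self.patience:
                self.lr *= self.factor
                self.bad_steps = 0
        return self.lr

sched = ReduceOnPlateau(lr=1e-4, patience=2)
losses = [0.9, 0.8, 0.8, 0.8, 0.7]  # improvement, then a plateau, then improvement
history = [sched.step(l) for l in losses]
```

After the two non-improving steps, the learning rate drops from 1e-4 to 1e-5 and training continues from there.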

Experimental Results
For all experiments, we trained the SISR and RefSR models by following their original configurations to obtain the baseline models. After that, the proposed online learning is applied to verify the effectiveness of our algorithm. Note that our method does not introduce any additional modules to the baseline models. All models are evaluated on the CUFED5 test set in terms of PSNR, SSIM, and LPIPS [12], where the LPIPS value is measured with the VGG model.
Table 2 shows SISR online learning results on the baseline models. Online learning with only D_s^LR degrades performance in all models because the size of D_s^LR is too small (30 × 20) to exploit abundant information. In contrast, D_s^Pse contains plenty of HR details useful for the inference and, thus, performance is consistently improved with D_s^Pse, as in [38]. RefSR online learning results on the SISR baseline models are shown in Table 3. In RefSR online learning, the baseline models always show performance improvements with D_s^Ref because it contains real high-frequency details not available in D_s^Pse, and we achieve the best results by using both D_s^LR and D_s^Ref rather than either one alone. Similar results are observed with RefSR online learning on the RefSR models, where the best performance is mostly achieved with D_r^Pse + D_r^Ref2, as shown in Table 4. Figure 1 shows qualitative comparisons between existing methods and ours; note that Figure 1f,k show superior performance over their baseline counterparts in Figure 1e,j.
Figure 1. (b-k) Results of Bicubic, SimpleNet [22], EDSR [5], RCAN [7], Ours+RCAN [7], SRNTT [33], SRNTT-

Reference Similarity
We first analyze the effect of the similarity of reference images on online learning. The CUFED5 dataset [33] provides five similarity levels, from the lowest (i.e., XL) to the highest (i.e., XH), depending on the content similarity between the reference and LR images. For both SISR and RefSR models, the performance improvement is proportional to the level of similarity, and the best performance is obtained with the reference of the highest similarity, XH. This result is expected, because XH reference images contain a large amount of real high-frequency details closely related to the lost details of the LR images. Therefore, online learning with XH reference images can train the baseline models with strong and relevant HR guidance. On the contrary, the amount of relevant information decreases with decreasing similarity of the reference images; therefore, the performance improvement also decreases, as shown in Tables 3 and 4.

Pseudo HR vs. LR for Supervision
For RefSR online learning, fine-tuning with the pseudo-HR image Ī_HR achieves better overall performance than fine-tuning with I_LR for all similarity levels, as reported in Table 4. The reason for this is that Ī_HR has a resolution more similar to that of I_Ref than I_LR does; thus, it is much easier to align and combine with the information in the reference image.

Non-Bicubic Degradation
We further validate our algorithm on LR images with non-bicubic degradation for both the RCAN (SISR) and TTSR (RefSR) models. For non-bicubic ×4 degradation, we utilized isotropic (g_w) and anisotropic (g_ani) Gaussian kernels of width w with the direct (g_d) and bicubic (g_b) subsampling methods presented in MZSR [36]. Table 5 shows that RCAN and TTSR pre-trained with bicubic degradation produce inferior SR results because they cannot handle non-bicubic degradation. However, the RCAN and TTSR models achieve substantial performance gains with the proposed online learning if the non-bicubic degradation model is given during the online learning (i.e., non-blind). Moreover, RCAN and TTSR can be improved during online learning in a blind manner, conducted using inputs obtained by downsampling with random kernels [36]. Figure 2 shows qualitative non-bicubic comparisons between existing methods and ours. Therefore, we conclude that the proposed method can handle any type of degradation (i.e., bicubic and non-bicubic), regardless of the awareness of the degradation kernel (i.e., blind and non-blind).
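A minimal sketch of the non-bicubic degradation pipeline: build an anisotropic Gaussian blur kernel (principal widths rotated by an angle) and apply blur-then-direct-subsampling. The kernel size, widths, and angle below are arbitrary illustrative choices, not the values used in MZSR [36] or in the paper's experiments.

```python
import numpy as np

def gaussian_kernel(size=15, sigmas=(3.0, 1.0), angle=0.6):
    """Anisotropic Gaussian blur kernel with principal widths `sigmas` rotated by `angle`."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    cov = R @ np.diag(np.square(sigmas)) @ R.T   # rotated covariance matrix
    inv = np.linalg.inv(cov)
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    pts = np.stack([xx, yy], axis=-1)
    k = np.exp(-0.5 * np.einsum("...i,ij,...j->...", pts, inv, pts))
    return k / k.sum()                           # normalize to unit mass

def degrade_direct(img, kernel, scale=4):
    """Blur with `kernel` (valid convolution) then directly subsample by `scale`."""
    win = np.lib.stride_tricks.sliding_window_view(img, kernel.shape)
    blurred = np.einsum("ijkl,kl->ij", win, kernel)
    return blurred[::scale, ::scale]

rng = np.random.default_rng(0)
hr = rng.random((128, 128))
k = gaussian_kernel()
lr_img = degrade_direct(hr, k, scale=4)
```

In the blind setting, a fresh kernel with random widths and angle would be drawn for each online-learning pair, so the model never sees the true test-time kernel.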

Conclusions
We have proposed an online learning algorithm for RefSR that exploits various types of data for network adaptation at the test stage. The proposed method brings significant performance improvements to both SISR and RefSR models without introducing any additional network parameters. Specifically, various types of data pairs are proposed using the input LR, pseudo-HR, and reference HR images, and the role of each data pair is verified with different similarity levels of the reference images. Extensive experimental results demonstrate the validity, efficiency, and versatility of the proposed algorithm.


Table 2. SISR online learning results on SISR models.

Table 3. RefSR online learning results on SISR models.

Table 4. RefSR online learning results on RefSR models.


Table 5. Online learning results with non-bicubic degradation.