A Dynamic Convolution Kernel Generation Method Based on Regularized Pattern for Image Super-Resolution

Image super-resolution aims to reconstruct a high-resolution image from its low-resolution counterpart. Conventional image super-resolution approaches share the same spatial convolution kernel across the whole image in their upscaling modules, which neglects the content-specific information at different positions of the image. In view of this, this paper proposes a regularized pattern (RP) to represent spatially variant structural features in an image and further exploits a dynamic convolution kernel generation method that matches the regularized pattern to improve image reconstruction performance. Specifically, the proposed approach first extracts features from low-resolution images using a self-organizing feature mapping network to construct regularized patterns, which describe the different contents at different locations. Second, a meta-learning mechanism based on the regularized pattern predicts the weights of the convolution kernels that match the regularized pattern at each location; it therefore generates different upscaling functions for images with different content. Extensive experiments on the benchmark datasets Set5, Set14, B100, Urban100, and Manga109 demonstrate that the proposed approach outperforms state-of-the-art super-resolution approaches in terms of both PSNR and SSIM.


Introduction
The goal of single image super-resolution (SISR) is to reconstruct high-quality high-resolution (HR) images from degraded low-resolution (LR) images. It has wide applications in video surveillance, remote sensing, and medical and military imaging. A related line of work is face hallucination, which enlarges input regions by an approximately linear mapping of SVD values among different resolutions [1]. Its hallucination capability was further expanded by applying the same mapping across different views [2]. The pioneering neural-network-based SISR work was done by Dong et al. [3]. Their proposed network, SRCNN, established an end-to-end mapping from an interpolated LR input image to the output HR image. VDSR [4], DRCN [5], DRRN [6], and MemNet [7] were then successively proposed and further improved the image reconstruction performance. These methods up-sample the LR input to the required output size at the very beginning of the network, rather than using an upscaling module to increase the spatial resolution at the end.
However, recent research has found that such early interpolation of the LR image inevitably results in detail loss and greatly increases the amount of computation. Extracting features from the original LR input and increasing the spatial resolution at the end of the network has therefore become a popular deep SISR structure. Shi et al. proposed an efficient sub-pixel convolution layer in ESPCN [8], which enlarges the LR feature map to the output size at the end of the network. With the efficient sub-pixel convolution layer, many methods, such as EDSR [9], RDN [10], RFANet [11], SAN [12], and DID [13], treat SR recovery with different scale factors as independent tasks and apply sub-pixel convolution layers for feature map expansion at the end. However, sub-pixel convolutional layers only support integer scale factors, and a specific network model must be designed for each scale factor, so each network model can magnify images only by a fixed integer scale factor.
To avoid designing different network models for different scale factors, the meta-learning technique [14] has been introduced to develop various SR approaches. The feed-forward model (FFM) in the meta feature representation [14] provides a feed-forward mapping that directly predicts the required parameters of a test instance. Similarly, in Hypernetworks [15], the weights of another neural network are generated in a feed-forward process. To perform image super-resolution at any scale with one model, Hu et al. proposed Meta-SR [16], which uses the Meta-Upscale Module to increase the spatial resolution at the end of the network. For different scale factors and position coordinate offsets, the weight prediction network in the Meta-Upscale Module generates different convolution kernels to produce the final SR image. However, the Meta-Upscale Module still shares the convolution kernel spatially and does not consider the content information of the current image. Chen et al. proposed LIIF [17], which uses a multi-layer perceptron at the end of the network to replace the traditional upscaling layer and predict the value of each pixel in the output SR image. However, since the input of the multi-layer perceptron is a one-dimensional vector, the original position information of the feature vector is lost when the multi-dimensional feature map is converted into a one-dimensional input vector.
The major challenge of single image SR is how to perform upscaling reconstruction adaptively to the spatially variant image content. According to the characteristics of involution [18], if the convolution kernel is shared spatially, the parameters of the convolution kernel cannot be flexibly adjusted to match different inputs. On the contrary, we can use space-specific kernels for more flexible modeling in the spatial dimension. Similar to the space-specific involution, introducing a regularized pattern to guide the generation of convolution kernel will be helpful in the upscaling module. Motivated by this, in this paper, we propose a specific regularized pattern extraction network to extract the regularized pattern from LR features and then generate a space-specific convolution kernel according to different regularized patterns.
The two contributions of this paper are summarized as follows.
(1) A regularized pattern extraction method is proposed to extract the regularized pattern from LR features. This will adaptively guide the image reconstruction in a spatially variant manner. Furthermore, both position information and scale information are used in the weight prediction network with the proposed regularized pattern. As a result, the convolution weight prediction network can accurately match the relationship between input parameters and output convolution kernel parameters.
(2) A dynamic convolution kernel generation method is proposed to generate the best-matching convolution kernel parameters according to the regularized pattern and the position and scale information of the current position. Consequently, the pixels at different positions in the SR image can be processed differently, which enhances the texture consistency with the HR image and improves the network performance.
The rest of this paper is organized as follows. In Section 2, the dynamic convolution kernel generation method is proposed and then exploited to develop a super-resolution approach. In Section 3, the proposed approach is compared with state-of-the-art approaches in extensive experiments. Finally, Section 4 concludes this paper.

Proposed Dynamic Convolution Kernel Generation Based on Regularized Pattern for Image Super-Resolution
Different pixels in the LR image carry different image content. As shown in Figure 1, the blue points lie in a flat color-block area, and the red points lie in an edge area. During the Meta-SR upsampling process [16], the convolution kernels used for these two positions are the same, so the difference in content between the two positions is not considered. We propose a dynamic convolution kernel generation method that adaptively generates convolution kernels according to local image content, which is represented by the proposed regularized pattern. For the blue and red points in Figure 1, the proposed method produces different convolution kernels matching their regularized content patterns, implementing space-specific reconstruction operations. Let the input LR image be $I^{LR}$; the LR feature $F^{LR}$ is extracted from $I^{LR}$ by the LR feature extraction network. We use the feature tensor $V \in \mathbb{R}^{H \times W \times inC}$ to represent $F^{LR}$, where $H$ is the height of $I^{LR}$, $W$ is the width of $I^{LR}$, and $inC$ is the number of channels of $V$. In the feature tensor $V$, the feature vector $V_{i',j'} \in \mathbb{R}^{inC}$ is the feature representation at pixel $(i', j')$ of the LR image.

Proposed Regularized Pattern Extraction Method
The regularized pattern extraction method is proposed in this section to guide the image upscaling reconstruction. Different pixel positions in the input LR image $I^{LR}$ contain different image content, such as relatively smooth background regions or object edges that change drastically. These differences are manifested in the features $F^{LR}$. We define the regularized pattern $P$ as
$$P = p(F^{LR}) = S\big(W_2 \otimes \sigma(W_1 \otimes F^{LR} + B_1) + B_2\big) \qquad (1)$$
where $p()$ is the regularized pattern extraction function, $\otimes$ is the convolution operator, $W_1$ and $B_1$ are the weight and bias of the first convolution layer of the regularized pattern extraction network, $W_2$ and $B_2$ are the weight and bias of its second convolution layer, $\sigma()$ is the ReLU activation function, and $S()$ is the Sigmoid activation function. The regularized pattern defined in (1) is an abstraction of the LR feature $F^{LR}$: features that can distinguish the content information of different positions are merged to obtain the regularized patterns. During model training, the weights of the convolution kernels in the regularized pattern extraction network defined in (1) are constantly updated under the constraint of the L1 loss function, focusing on the features that best distinguish the content information so as to seek the regularized pattern vector with the least structural risk.
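As an illustration of the two-layer structure described above (convolution, ReLU, convolution, Sigmoid), the following is a minimal NumPy sketch rather than the authors' implementation. It assumes 1 × 1 convolutions, since the kernel size is not stated here, and the names `conv1x1` and `extract_regularized_pattern` as well as the channel count `in_c = 8` are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, weight, bias):
    # A 1x1 convolution is a per-pixel linear map over channels (assumed kernel size).
    # x: (H, W, C_in), weight: (C_in, C_out), bias: (C_out,)
    return x @ weight + bias

def extract_regularized_pattern(f_lr, w1, b1, w2, b2):
    # Conv1 + ReLU, then Conv2 + Sigmoid, so every output value lies in [0, 1].
    hidden = relu(conv1x1(f_lr, w1, b1))
    return sigmoid(conv1x1(hidden, w2, b2))

rng = np.random.default_rng(0)
height, width, in_c, out_c = 4, 5, 8, 3  # illustrative sizes only
f_lr = rng.standard_normal((height, width, in_c))
w1, b1 = rng.standard_normal((in_c, in_c)) * 0.1, np.zeros(in_c)
w2, b2 = rng.standard_normal((in_c, out_c)) * 0.1, np.zeros(out_c)

pattern = extract_regularized_pattern(f_lr, w1, b1, w2, b2)
print(pattern.shape)  # one pattern vector per LR pixel
```

Each LR pixel thus receives its own pattern vector, which is what allows the later kernel generation to vary with position.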

Proposed Dynamic Convolution Kernel Generation Method
In this section, a dynamic convolution kernel generation method is proposed to adaptively generate convolution kernels according to local image content, which is represented using the proposed regularized pattern described in Section 2.1.
First, task-level samples and data samples need to be generated. Suppose that the scale factor range is $[r_{min}, r_{max}]$ when performing SR reconstruction on LR images, and that each scale factor in the range is used for super-resolution reconstruction with equal probability; that is, the distribution $p(r)$ of the scale factor $r$ is a discrete uniform distribution on $[r_{min}, r_{max}]$:
$$p(r) = U(r_{min}, r_{max}) \qquad (2)$$
We use all scale factors in the distribution (2) to downsample the training set HR images to obtain the training set LR images corresponding to the different scale factors $r$. Each time, a scale factor $r_s$ is randomly selected from the distribution $p(r)$ as the current task, and then a pair of LR-HR image patches is randomly selected from the training set corresponding to the scale factor $r_s$ as the training sample.
Suppose that the length and width of the LR image patch are both $L$; then there are $L^2$ pixels in the LR image patch and $(L \times r_s)^2$ pixels in the corresponding reconstructed SR image patch. The weight prediction network needs to generate, for each pixel in the SR image, a convolution kernel matching its RP, position, and scale information. The generated convolution kernels are then used to map the LR image to the HR image. Hence, the number of data samples in the current task with factor $r_s$ is $(L \times r_s)^2$.
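The task sampling described above can be sketched as follows. This is only an illustrative sketch: the helper names `scale_factors` and `sample_task` are hypothetical, and the range 1 to 4 with a step of 0.1 is taken from the training setup reported in the implementation details.

```python
import random

def scale_factors(r_min, r_max, step):
    # Enumerate the discrete support of p(r); rounding removes float drift.
    count = round((r_max - r_min) / step)
    return [round(r_min + k * step, 10) for k in range(count + 1)]

def sample_task(factors, rng):
    # Every scale factor is equally likely (discrete uniform distribution).
    return rng.choice(factors)

factors = scale_factors(1.0, 4.0, 0.1)
rng = random.Random(0)
r_s = sample_task(factors, rng)

patch_len = 50  # LR patch side length L used during training
print(len(factors), r_s, round(patch_len * r_s) ** 2)
```

The last printed value is the number of SR pixels, and hence the number of kernels to generate, for the sampled task.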
Second, given a scale factor $r_s$ and an input LR image $I^{LR}$ with height $L$ and width $W$, the LR feature obtained by passing $I^{LR}$ through the feature extraction network is $F^{LR}$. Then, $F^{LR}$ is further abstracted to extract the regularized pattern $P$, which represents the structure at each position and thus distinguishes image content at different locations:
$$P = p_\alpha(F^{LR}) \qquad (3)$$
where $p_\alpha$ is the regularized pattern extraction function and $\alpha$ is the parameter of the regularized pattern extraction network. Third, for a pixel $(i, j)$ in the SR image, the mapped pixel position in the LR image is $(i', j')$, and the position and scale information is $M_{i,j}$, which is obtained as follows. Suppose that for the pixel $(i, j)$ in $I^{SR}$, its mapping $(i', j')$ can always be found in $F^{LR}$, where $V_{i',j'}$ is most closely related to the RGB value of the pixel $(i, j)$ in $I^{SR}$. The mapping from $I^{SR}$ to $F^{LR}$ is [16]
$$(i', j') = m(i, j, r) = \left(\left\lfloor \frac{i}{r} \right\rfloor, \left\lfloor \frac{j}{r} \right\rfloor\right) \qquad (4)$$
where $m()$ is the position mapping function, $r$ is the scale factor, and $\lfloor \cdot \rfloor$ is the floor function. Then, for the feature vector $V_{i',j'}$ in the LR feature $F^{LR}$, the multiple corresponding pixels $(i, j)$ in $I^{SR}$ have different positions relative to $V_{i',j'}$. We define the relative offset function to express this difference [16]:
$$o(i, j, r) = \left(\frac{i}{r} - \left\lfloor \frac{i}{r} \right\rfloor, \frac{j}{r} - \left\lfloor \frac{j}{r} \right\rfloor\right) \qquad (5)$$
where $o()$ is the relative offset function. Then, the position and scale information $M_{i,j}$ at the pixel $(i, j)$ of the SR image $I^{SR}$ is obtained as [16]
$$M_{i,j} = \left(\frac{i}{r} - \left\lfloor \frac{i}{r} \right\rfloor, \frac{j}{r} - \left\lfloor \frac{j}{r} \right\rfloor, \frac{1}{r}\right) \qquad (6)$$
Fourth, the corresponding regularized pattern is the vector $P_{i',j'}$ at position $(i', j')$ of $P$. For different pixels, the regularized pattern, location, and scale information are different. That is, the relative offset from the mapped location and the structural information of that location are unique to each pixel.
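The position mapping, relative offset, and position and scale information used in the third step can be sketched as below. The sketch assumes the Meta-SR [16] convention of a floor-based mapping with $1/r$ as the scale entry; the function names are hypothetical.

```python
import math

def map_position(i, j, r):
    # SR pixel (i, j) is mapped to the LR position (floor(i / r), floor(j / r)).
    return math.floor(i / r), math.floor(j / r)

def relative_offset(i, j, r):
    # Fractional part of the projected coordinate, always in [0, 1).
    return i / r - math.floor(i / r), j / r - math.floor(j / r)

def position_scale_info(i, j, r):
    # Three-dimensional vector M: two offsets plus the inverse scale factor.
    offset_i, offset_j = relative_offset(i, j, r)
    return (offset_i, offset_j, 1.0 / r)

print(map_position(5, 2, 2.0))         # (2, 1)
print(position_scale_info(5, 2, 2.0))  # (0.5, 0.0, 0.5)
```

With $r = 2$, the four SR pixels (4, 2), (4, 3), (5, 2), and (5, 3) all map to the LR position (2, 1) but receive different offsets, so they can be assigned different kernels.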
We generate the best-matched convolution weights for each pixel as
$$W_{i,j} = F_\theta(P_{i',j'}, M_{i,j}) \qquad (7)$$
where $F_\theta$ is the convolution weight prediction function, $\theta$ is the parameter of the convolution weight prediction network, and $W_{i,j}$ is the convolution weight corresponding to the pixel $(i, j)$ in the SR image. The convolution weight prediction network generates a total of $(L \times r) \times (W \times r)$ convolution kernels, which form the convolution weight set
$$W_{set} = \{\, W_{i,j} : (i, j) \text{ ranges over the } (L \times r) \times (W \times r) \text{ pixels of } I^{SR} \,\} \qquad (8)$$
Fifth, the value of the pixel $(i, j)$ in the SR image is most closely related to the LR feature $F^{LR}_{i',j'}$ at the mapped position $(i', j')$ in the LR image. Taking the matrix product of the convolution weight $W_{i,j}$ and the LR feature $F^{LR}_{i',j'}$, we obtain the value $V_{i,j}$ of the pixel as
$$V_{i,j} = W_{i,j}\, F^{LR}_{i',j'} \qquad (9)$$
The entire SR image is obtained by upsampling the LR features as
$$I^{SR} = f_{W_{set}}(F^{LR})$$
where $f_{W_{set}}$ is the upsampling function and $W_{set}$ is the convolution kernel weight set.
Sixth, for the generated SR image patch, the L1 loss function is used to measure the error between the SR image patch and the HR image patch:
$$L_s = \lVert I^{SR} - I^{HR} \rVert_1$$
where $L_s$ is the error between $I^{SR}$ and $I^{HR}$ in the current task with the scale factor $r_s$. In each task, the regularized pattern extraction network parameters $\alpha$ and the convolution weight prediction network parameters $\theta$ are updated using gradient descent:
$$\alpha' = \alpha - \beta \nabla_\alpha L_s, \qquad \theta' = \theta - \beta \nabla_\theta L_s$$
where $\alpha$ and $\theta$ are the parameters before the update, $\alpha'$ and $\theta'$ are the parameters after the update, and $\beta$ is the learning rate. By continuously drawing different scale factors from the distribution as different tasks to train the model, the parameters $\alpha$ and $\theta$ are continuously updated. The purpose of meta-learning training is to obtain parameters $\alpha$ and $\theta$ that minimize the sum of the task losses over all scale factors sampled from the distribution $p(r)$. Finally, we use the trained network for inference. Suppose that the scale factor of the current task is $r$ and that the input LR image of the current task has length $L$ and width $W$, so that the SR image has length $L \times r$ and width $W \times r$. For each pixel in the SR image, the convolution weight prediction network generates a convolution kernel matching its regularized pattern according to Equation (7). The generated convolution kernel is then used to map the LR features at the corresponding position to RGB values according to Equation (9), and finally the SR image is formed. Figure 2 shows an example of SR images generated with scale factors of 1.6, 2.2, 2.8, 3.4, and 4.0.
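To make the fifth step concrete, the sketch below applies a separate weight matrix to the LR feature at each mapped position. It is a simplified illustration, not the authors' code: it assumes $k = 1$ (each kernel acts on a single feature vector), and `predict_weights` is a hypothetical stand-in for the trained weight prediction network.

```python
import math
import numpy as np

def upsample(f_lr, r, predict_weights, out_c=3):
    # f_lr: (H, W, in_c) LR feature; returns an (floor(H*r), floor(W*r), out_c) image.
    height, width, _ = f_lr.shape
    out_h, out_w = math.floor(height * r), math.floor(width * r)
    sr = np.zeros((out_h, out_w, out_c))
    for i in range(out_h):
        for j in range(out_w):
            # Mapped LR position, clamped to guard against float rounding.
            ip = min(math.floor(i / r), height - 1)
            jp = min(math.floor(j / r), width - 1)
            weights = predict_weights(i, j, ip, jp)  # (out_c, in_c), one per SR pixel
            sr[i, j] = weights @ f_lr[ip, jp]        # matrix product of kernel and feature
    return sr

def averaging_weights(i, j, ip, jp, out_c=3, in_c=4):
    # Hypothetical predictor: every output channel averages the input channels.
    return np.full((out_c, in_c), 1.0 / in_c)

rng = np.random.default_rng(0)
features = rng.standard_normal((2, 3, 4))
image = upsample(features, 2.0, averaging_weights)
print(image.shape)
```

With a content-independent predictor such as `averaging_weights`, the result reduces to nearest-neighbor upsampling of the channel mean; the proposed method differs precisely in letting the predicted weights depend on the regularized pattern and the offsets.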

Justification of the Proposed Dynamic Convolution Kernel Generation Method
To demonstrate the various convolution kernels generated according to different image content, an experiment is conducted as follows.
Assuming that the scale factor $r$ is 2, for the pixel $X(i', j')$ in the low-resolution image $I^{LR}$, we can generate four convolution kernels $W_{2i',2j'}$, $W_{2i'+1,2j'}$, $W_{2i',2j'+1}$, and $W_{2i'+1,2j'+1}$. We define these convolution kernels as a convolution kernel group $G_{i',j'}$ at this pixel location, whose members are denoted $G^1_{i',j'}$, $G^2_{i',j'}$, $G^3_{i',j'}$, and $G^4_{i',j'}$. The variation of the convolution kernel group $G_{i',j'}$ at the pixel $X(i', j')$ in $I^{LR}$ is denoted $C_{i',j'}$ and is given by Equation (10). It is computed with $D()$, a function that measures the variation between two convolution kernel groups in terms of abs(), the absolute value function, applied to the values $m$ and $n$ found at corresponding positions in two different convolution kernels.
In our experiment, we use the image 253027 from the B100 dataset [19], the img59 image from the Urban100 dataset [20], and the YumeiroCooking image from the Manga109 dataset [21] as the test images. We apply Equation (10) to these images to obtain the variation value of the convolution kernel group at each position and then normalize the values to the range [0, 255]. These values are visualized as color images using COLORMAP_JET in OpenCV.
As seen from Figure 3, in the grassland, sky, and large color-block areas, where the content changes slowly and the regularized pattern is relatively simple, the variation of the convolution kernel group is very small. The convolution kernel groups of these pixels are very similar to those of their neighboring pixels. In contrast, the zebra patterns, clothing patterns, and architectural textures change drastically; there, the regularized pattern carries rich information and the convolution kernel group varies considerably. The regularized pattern thus guides the generation of the convolution kernel and prompts the convolution weight prediction network to generate a kernel suited to the local content.
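The normalization step used for this visualization can be sketched as follows. Min-max scaling is an assumption here, since the exact normalization is not specified, and `normalize_to_uint8` is a hypothetical helper name. The OpenCV call is shown only as a comment so that the block runs without OpenCV installed.

```python
import numpy as np

def normalize_to_uint8(values):
    # Min-max scale an array of variation values to integers in [0, 255].
    values = np.asarray(values, dtype=np.float64)
    low, high = values.min(), values.max()
    if high == low:
        # A constant map carries no contrast; return all zeros.
        return np.zeros(values.shape, dtype=np.uint8)
    scaled = (values - low) / (high - low) * 255.0
    return np.round(scaled).astype(np.uint8)

variation = np.array([[0.0, 1.0], [3.0, 4.0]])  # toy variation map
gray = normalize_to_uint8(variation)
print(gray)

# With OpenCV available, the color rendering would then be:
#   import cv2
#   color = cv2.applyColorMap(gray, cv2.COLORMAP_JET)
```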

Proposed Image Super-Resolution Approach
An overview of the proposed network structure is shown in Figure 4. It contains three parts: (i) a feature extraction network, (ii) a regularized pattern extraction network, and (iii) a convolution weight prediction network. We name our network Regularized Pattern Based-RDN (RPB-RDN) since we choose RDN [10] as the feature extraction network, which has also been used in Meta-RDN [16] and LIIF-RDN [17]. The regularized pattern extraction network and the convolution weight prediction network are presented in turn below.
The regularized pattern extraction network consists of two convolutional layers, a ReLU activation function layer, and a Sigmoid activation function layer. Both the number of input channels and the number of output channels of the Conv1 layer are inC, and the ReLU activation function layer performs a nonlinear mapping on the LR feature $F^{LR}$. The number of input channels of the Conv2 layer is inC, and the number of output channels is outC, so that the final regularized pattern has a suitable number of channels. Finally, the Sigmoid activation function layer maps the regularized pattern to [0, 1] so that it has the same value range as the position and scale information.
Figure 3. Visualization of dynamic convolution weights that are generated by our proposed approach. The first column presents the original test images, and the second column presents the visualized convolution weight variation values calculated using Equation (10). (a) 253027 from B100 dataset [19]. (b) img59 from Urban100 dataset [20]. (c) YumeiroCooking from Manga109 dataset [21].
The convolution weight prediction network consists of two fully connected layers and a ReLU activation function layer. We concatenate the regularized pattern vector of the current position with the position and scale information to obtain the vector $V_{in}$, which is the input of the first fully connected layer. In our network, the dimensions of the regularized pattern vector $P_{i',j'}$ and of the position and scale information vector $M_{i,j}$ are both 3, so the dimension of the vector $V_{in}$ is 6. Considering that the output dimension of the entire convolution weight prediction network is outC × inC × k × k, we set the number of output units of the first fully connected layer to 256 to keep the output of the network sufficiently diverse while maintaining speed. Therefore, the number of input units of the second fully connected layer is 256, and its output is a vector $V_{out}$ whose dimension is outC × inC × k × k.
Then we transform $V_{out}$ into a group of convolution kernels. The number of convolution kernels equals the number of SR image channels outC, and each convolution kernel has inC × k × k parameters. This convolution weight prediction network is expressed as
$$V_{out} = W_2\, \sigma(W_1 V_{in} + b_1) + b_2$$
where $W_i$ is the weight of the $i$th fully connected layer, $b_i$ is the bias of the $i$th fully connected layer, and $\sigma()$ is the ReLU activation function.
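The shapes involved in the convolution weight prediction network can be checked with a short NumPy sketch. The layer sizes 6 and 256 follow the description above, whereas `in_c = 8`, `k = 3`, the random weights, and the function name `predict_kernels` are assumptions made only for illustration.

```python
import numpy as np

def predict_kernels(v_in, w1, b1, w2, b2, out_c, in_c, k):
    # Fully connected layer, ReLU, fully connected layer.
    hidden = np.maximum(w1 @ v_in + b1, 0.0)
    v_out = w2 @ hidden + b2
    # Reshape the flat output into out_c kernels of in_c * k * k parameters each.
    return v_out.reshape(out_c, in_c, k, k)

rng = np.random.default_rng(0)
in_c, out_c, k, hidden_units = 8, 3, 3, 256  # in_c and k are illustrative values
w1 = rng.standard_normal((hidden_units, 6)) * 0.1
b1 = np.zeros(hidden_units)
w2 = rng.standard_normal((out_c * in_c * k * k, hidden_units)) * 0.1
b2 = np.zeros(out_c * in_c * k * k)

pattern_vec = np.array([0.2, 0.7, 0.5])     # regularized pattern at the mapped LR position
pos_scale_vec = np.array([0.5, 0.0, 0.5])   # offsets and inverse scale for the SR pixel
v_in = np.concatenate([pattern_vec, pos_scale_vec])

kernels = predict_kernels(v_in, w1, b1, w2, b2, out_c, in_c, k)
print(v_in.shape, kernels.shape)
```

Because the input has only six dimensions, the cost of the network is dominated by the second layer, whose size grows with outC × inC × k × k.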

Experimental Results
To evaluate the performance of the proposed RPB-RDN network and its various proposed components, including the proposed regularized pattern extraction network and the convolution weight prediction method, extensive experimental results are provided in this section, including the comparison between RPB-RDN and other SOTA methods.

Experimental Setup
In this paper, the high-resolution image set DIV2K is used. DIV2K contains a total of 1000 images: 800 for training, 100 for validation, and 100 for testing. All experimental models are trained on the DIV2K training set. For testing, five standard benchmark datasets are used: Set5 [22], Set14 [23], B100 [19], Urban100 [20], and Manga109 [21]. The PSNR and SSIM performance metrics are used to evaluate the results of image super-resolution reconstruction. All performance metrics are calculated on the Y channel of the YCbCr color space of the image. Given two images $x$ and $y$, PSNR and SSIM [24] are defined as
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{MaxV^2}{MSE}\right)$$
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
where $MaxV$ is the maximum intensity value that image pixels can take, $MSE$ is the mean square error between the two images, $\mu_x$ is the average intensity value of image $x$, $\mu_y$ is the average intensity value of image $y$, $\sigma_x^2$ is the variance of image $x$, $\sigma_y^2$ is the variance of image $y$, $\sigma_{xy}$ is the covariance of image $x$ and image $y$, and $c_1$ and $c_2$ are constants used to maintain stability [24].
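For reference, PSNR as defined above can be computed with a few lines of NumPy. This is a generic sketch, not the evaluation script used in the experiments; it assumes 8-bit images (so `max_v = 255`) that have already been converted to the Y channel.

```python
import numpy as np

def psnr(x, y, max_v=255.0):
    # Peak signal-to-noise ratio in dB between two images of equal shape.
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    mse = np.mean((x - y) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_v ** 2 / mse))

reference = np.zeros((4, 4))
shifted = np.ones((4, 4))    # every pixel differs by one gray level, so MSE = 1
print(psnr(reference, shifted))
```

Since PSNR is logarithmic in the MSE, the gains of a few hundredths of a dB reported in the following tables correspond to MSE reductions on the order of one percent.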

Implementation Details
We use the L1 loss function to train the network. During training, 8 low-resolution image patches of size 50 × 50 are randomly selected as a batch input. We augment the patches by horizontal or vertical flipping and random 90° rotation. The optimizer is Adam, and the learning rate is initialized to 0.0001 and is decayed every 400 epochs. All experiments are run in parallel on 2 GPUs. The training scale factor varies from 1 to 4 with a step size of 0.1, and the distribution of the scale factors is uniform. All image patches in a batch share the same scale factor. The dimension of the regularized pattern vector $P_{i',j'}$ is set to 3, which speeds up matching and improves the reconstruction quality.
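The flip and rotation augmentation mentioned above can be sketched as follows. The key point is that the same transform must be applied to the LR patch and its HR counterpart so that the pair stays aligned. The probability of 0.5 for each transform and the function name `augment_pair` are assumptions for illustration.

```python
import numpy as np

def augment_pair(lr, hr, rng):
    # Apply identical random flips and a 90-degree rotation to an LR-HR patch pair.
    if rng.random() < 0.5:
        lr, hr = lr[:, ::-1], hr[:, ::-1]    # horizontal flip
    if rng.random() < 0.5:
        lr, hr = lr[::-1], hr[::-1]          # vertical flip
    if rng.random() < 0.5:
        lr, hr = np.rot90(lr), np.rot90(hr)  # rotate by 90 degrees in the image plane
    return lr, hr

rng = np.random.default_rng(0)
lr_patch = rng.random((50, 50, 3))
# Toy HR patch: each LR pixel repeated 2x2, so hr[::2, ::2] recovers the LR patch.
hr_patch = lr_patch.repeat(2, axis=0).repeat(2, axis=1)

lr_aug, hr_aug = augment_pair(lr_patch, hr_patch, rng)
print(lr_aug.shape, hr_aug.shape)
```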

Performance Evaluation on the Proposed Regularized Pattern Extraction Method
To study the impact of the regularized pattern extraction method, an experiment is conducted to compare two network structures. The first one (denoted as 'baseline model') is a single-layer convolutional network, which only performs a limited linear transformation on the LR features. The second one is our proposed network. Since the proposed model uses the Sigmoid activation function at the end of the network, the regularized pattern has the same value range as the position and scale information, which helps the network identify the relationship between input and output and speeds up convergence. Table 1 shows that the proposed model achieves better results in the ×2, ×3, and ×4 SR tasks on the three datasets B100, Urban100, and Manga109, with an average increase of 0.04 dB in PSNR and 0.0005 in SSIM compared with the baseline model.

Performance Evaluation on the Proposed Convolution Weight Prediction Method
To verify the effectiveness of the convolution weight prediction method based on the regularized pattern, an experiment is conducted on the benchmark dataset B100 [19] with scale factors ranging from 1.1 to 4.0 in steps of 0.1, using Meta-RDN [16] and our RPB-RDN model. As shown in Table 2, the proposed model, which integrates the convolution weight prediction method based on the regularized pattern, achieves better results than Meta-RDN [16] in all tasks with different scale factors. Over the thirty tasks, RPB-RDN improves PSNR by 0.06 dB on average over Meta-RDN [16].
Table 2. PSNR (dB) performance evaluation on the proposed convolution weight prediction method using the B100 dataset [19]. The best performance is highlighted in the bold format.

Performance Evaluation on the Inference Time
In this experiment, we compare the running time of RDN [10], Meta-RDN [16], LIIF-RDN [17], and our RPB-RDN using a Xeon4210 and an NVIDIA 2080Ti. We choose B100 [19] as the test dataset and exclude the image pre-processing time. The experimental results are shown in Table 3. The meta-upsampling module in RPB-RDN is more time-consuming than the sub-pixel convolutional layer in RDN [10], so the overall running time of RPB-RDN is longer than that of RDN [10]. Compared with Meta-RDN [16], RPB-RDN adds a regularized pattern extraction network, so its overall running time increases, but the difference is not large. LIIF-RDN [17] uses a multi-layer perceptron that is more time-consuming than convolutional layers, so the overall running time of LIIF-RDN [17] is longer than that of RPB-RDN.

Superiority of the Proposed Method in Texture Reconstruction
We propose a texture dataset, Texture, built by cropping the central part of the images from the five benchmark datasets Set5 [22], Set14 [23], B100 [19], Urban100 [20], and Manga109 [21]. The size of each cropped image is 1/16 of the original image. The foreground in the center of an image generally has richer textures than the background and is more difficult to restore. Comparing SR results on the Texture dataset therefore allows a closer examination of the texture restoration ability of the various methods.
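A central crop of this kind can be sketched as below. The sketch assumes that "1/16 of the original image" refers to area, that is, one quarter of the height and one quarter of the width; the function name `center_crop_sixteenth` is hypothetical.

```python
import numpy as np

def center_crop_sixteenth(image):
    # Keep the central region with 1/4 of the height and 1/4 of the width.
    height, width = image.shape[:2]
    crop_h, crop_w = height // 4, width // 4
    top, left = (height - crop_h) // 2, (width - crop_w) // 2
    return image[top:top + crop_h, left:left + crop_w]

# Toy image whose pixel values encode their own position for easy checking.
picture = np.arange(80 * 120 * 3).reshape(80, 120, 3)
crop = center_crop_sixteenth(picture)
print(crop.shape)
```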
We use Meta-RDN [16], LIIF-RDN [17], and RPB-RDN to perform the ×2, ×3, and ×4 super-resolution reconstruction tasks on the proposed Texture dataset. The experimental results are shown in Table 4. RPB-RDN achieves better results than Meta-RDN [16] and LIIF-RDN [17] at all scales, which supports the advantage of the proposed content-adaptive convolution kernel generation method for texture restoration. In terms of PSNR, RPB-RDN achieves average improvements of 0.13 dB and 0.14 dB over Meta-RDN [16] and LIIF-RDN [17], respectively. In terms of SSIM, RPB-RDN achieves average improvements of 0.0007 and 0.0008 over Meta-RDN [16] and LIIF-RDN [17], respectively.

Qualitative Results
Finally, we compare the SR images generated by our RPB-RDN with those generated by Bicubic, RDN [10], Meta-RDN [16], and LIIF-RDN [17]. As seen in Figure 5, our method recovers textures that the other methods reconstruct incorrectly, especially zebra patterns, patterns on clothes, and lines of buildings. Owing to our regularized pattern-based convolution kernel generation method, pixels with different regularized patterns are assigned different convolution kernels matching those patterns, so that the generated SR image has stronger texture consistency with the HR image.

Conclusions
In this paper, we propose a dynamic kernel generation method based on the regularized pattern for super-resolution image reconstruction. It can generate convolution kernels for different pixels that match their regularized pattern so that the generated SR images have stronger texture consistency with HR images. Experiments show that our proposed method achieves better performance than other state-of-the-art approaches.

Conflicts of Interest:
The authors declare no conflict of interest.