Lightweight Single Image Super-Resolution with Selective Channel Processing Network

With the development of deep learning, considerable progress has been made in image restoration, and many state-of-the-art single image super-resolution (SR) methods have been proposed. However, most of them contain many parameters, which leads to a heavy computational cost in the inference phase. To make current SR networks more lightweight and resource-friendly, we present a convolutional neural network with the proposed selective channel processing strategy (SCPN). Specifically, the selective channel processing module (SCPM) is first designed to dynamically learn the significance of each channel in the feature map using a channel selection matrix in the training phase. Correspondingly, in the inference phase, only the essential channels indicated by the channel selection matrices need to be further processed, which significantly reduces the number of parameters and the computational cost. Moreover, the differential channel attention (DCA) block is proposed, which takes the data distribution of the channels in the feature maps into consideration to restore more high-frequency information. Extensive experiments are performed on natural-image super-resolution benchmarks (i.e., Set5, Set14, B100, Urban100, and Manga109) and remote-sensing benchmarks (i.e., UCTest and RESISCTest), and our method achieves results superior to other state-of-the-art methods. Furthermore, our method keeps a slim size with fewer than 1 M parameters. Owing to the proposed SCPM and DCA block, our SCPN achieves a better trade-off between computational cost and performance in both general and remote-sensing SR applications, and the proposed method can be extended to other computer vision tasks for further research.


Introduction
Single image super-resolution (SR) aims at recovering a high-resolution (HR) image from its low-resolution (LR) counterpart [1]. There is a recognized need for SR techniques in many fields [2][3][4][5][6][7], such as remote sensing, medical imaging, security surveillance, and hyperspectral imaging, to name a few.
SR is an ill-posed problem because there is more than one plausible solution for a given LR input. Additionally, there is significant room for further improving SR performance. For these reasons, SR has been a subject of intense research for many years, and numerous SR methods have been proposed. These methods can in general be classified into three classes [8]: interpolation-based methods, reconstruction-based methods, and learning-based methods.
The main contributions of this work are summarized as follows:
(1) We propose a convolutional neural network with a selective channel processing strategy (SCPN). Extensive experiments show that our model outperforms previous SR works on remote-sensing images;
(2) We propose the selective channel processing module (SCPM), which contains a trainable channel selection matrix that decides whether each channel in the feature map is processed by the next convolution layer. This strategy markedly reduces the computational cost and the model size;
(3) We propose the differential channel attention (DCA) block, which is better suited to SR tasks, restores more high-frequency details, and further improves the representation ability of the network.

Related Work
SR is a classic low-level task in computer vision. We can coarsely divide the existing methods into two categories: traditional methods and deep-learning-based methods. The attention mechanism and adaptive inference are two typical strategies for improving model performance. Due to space limitations, we offer a brief overview of deep-learning SR methods, attention mechanisms, and adaptive inference strategies.

Deep-Learning SR Methods
With the rapid development of deep learning [34,35], recent years have witnessed a rapid rise in deep-learning-based SR methods. As a pioneering work, Dong et al. [23] first proposed SRCNN, which uses a three-layer convolutional neural network to learn the mapping function between LR and HR images and shows significant advantages over earlier methods. In 2016, Kim et al. [24] proposed the VDSR network, which is inspired by the ResNet [25] architecture. Owing to the residual learning mechanism, the VDSR model deepens the network to 16 convolutional layers to learn more high-frequency prior knowledge. The methods mentioned above use interpolated images as input, resulting in additional computational waste. To effectively solve this problem, Shi et al. [36] proposed the ESPCN model, whose merit is the sub-pixel convolution layer for upscaling feature maps in the model. Due to the efficiency of this strategy, most of the following works use sub-pixel convolution layers in their models to promote performance and reduce computation. To further tap the potential of CNNs, Lim et al. [26] proposed a deep and wide network named EDSR, which won the 2017 NTIRE SR competition. These previous works show that the width and depth of the network correlate with performance within a certain range. However, owing to limited computing resources, edge and mobile devices cannot support such large models; therefore, these methods lack practicality in real-world scenarios.

Attention Mechanisms for SR
Zhang et al. [28] first proposed RCAN, the first attention model in the SR field. RCAN proposed the channel attention mechanism, which uses a global average pooling strategy to measure the importance of the features in each layer and calculates the weight of each channel with a multi-layer perceptron. In addition, Dai et al. [37] proposed SAN, which modified the channel attention using covariance average pooling. RFANet, proposed by Liu et al. [38], takes advantage of the spatial attention mechanism to enhance the critical areas for better feature reconstruction. Zhang et al. [39] proposed a non-local residual network for the image restoration task, using a pixel-level non-local attention mechanism to capture long-distance spatial contextual information for SR reconstruction. Mei et al. [40] further explored non-local attention at the patch level with a cross-scale strategy. Recently, Liang et al. [41] proposed the SwinIR model, which is based on the prevalent Transformer [42,43] architecture and carries the self-attention mechanism into the SR field.

Adaptive Inference
Adaptive inference techniques have attracted increasing interest because of their ability to adapt the network structure to the input. One typical type of adaptive inference selects the path of inference at the level of layers. Notably, Srivastava et al. [44] proposed the dropout strategy to prevent neural networks from overfitting, a pioneering work in this field. Wu et al. [45] proposed the BlockDrop strategy and implemented it on ResNet to drop several residual blocks to improve efficiency. Mullapudi et al. [46] proposed HydraNet, which has multiple branches and can dynamically choose a set of branches for inferring the results. Another type of adaptive inference technique is the early stopping strategy, which skips computation wherever it is judged to be unnecessary. Specifically, Figurnov et al. [47] proposed a spatially adaptive computation time strategy to terminate calculation at spatial positions where the features are deemed good enough. Liu et al. [48] proposed the AdaSR model, which utilizes an adapter to adapt the number of convolutional layers implemented at different locations.

Figure 1. Ratio of the non-zero elements in the feature maps of EDSR and RCAN.
Figure 1 shows the ratio of the non-zero elements in the feature maps of EDSR and RCAN, and Figure 2 shows the channels in the feature maps in the head, middle, and tail of the RCAN network. It is observed that many of the feature maps are filled with zeros and contain little texture information. They clearly contribute less to the reconstruction process and can be overlooked for simplicity and fast inference. In addition, Figure 2 demonstrates that the first and last feature maps of the backbone network store more activated textures than the ones in the middle of the network. Inspired by these observations, we propose the following selective channel processing network.
Figure 2. Visualization of the feature maps in RCAN. (a,b), respectively, represent the feature maps in the first and last blocks of the first residual group. (c,d) represent those in the first and last blocks of the fifth residual group. (e,f) represent those in the first and last blocks of the last residual group.

Selective Channel Processing Network (SCPN)
In this section, we first introduce the architecture of our proposed SCPN model. Then, we give a detailed description of the selective channel processing module and the differential channel attention block. Finally, we introduce the implementation details of the proposed SCPN.

Network Architecture
As shown in Figure 3, our selective channel processing network (SCPN) consists of three parts: the shallow feature extraction, the deep feature extraction, and the up-sampling reconstruction. Given a low-resolution image I_LR and its counterpart high-resolution image I_HR, the output of our model is denoted as I_SR. The shallow feature extraction consists of one convolution layer with kernel size 3 × 3, following earlier research [26][27][28][37,38,40], and the extracted features F_0 are represented as:

F_0 = P_SFE(I_LR),

where P_SFE(·) denotes the convolution operation. Then, F_0 is sent to the deep feature extraction part to extract more effective features F_DF, which can be denoted as:

F_DF = P_DFE(F_0),

where P_DFE(·) is the deep feature extraction part, which consists of m selective channel processing modules. Finally, we utilize the up-sampling reconstruction part to convert the deep features into the output result, denoted as:

I_SR = P_UP(F_DF),

where P_UP(·) denotes the up-sampling reconstruction, which contains convolution layers and an up-sampler. Following [36], the up-sampler includes a convolution layer and a sub-pixel convolution, which also corresponds to our lightweight design principle.
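Read as code, the three-part pipeline above can be sketched in PyTorch roughly as follows. This is a minimal sketch, not the paper's implementation: the SCPM bodies are stood in for by plain conv blocks, and the names (`SCPNSketch`, `m`, `n_feats`) are ours.

```python
import torch
import torch.nn as nn

class SCPNSketch(nn.Module):
    """Minimal sketch of the three-part SCPN layout: shallow feature
    extraction, deep feature extraction, and up-sampling reconstruction."""
    def __init__(self, m=6, n_feats=64, scale=2):
        super().__init__()
        # Shallow feature extraction P_SFE: one 3x3 convolution
        self.sfe = nn.Conv2d(3, n_feats, 3, padding=1)
        # Deep feature extraction P_DFE: m modules (placeholder convs here,
        # standing in for the selective channel processing modules)
        self.dfe = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(n_feats, n_feats, 3, padding=1), nn.ReLU(True))
            for _ in range(m)])
        # Up-sampling reconstruction P_UP: conv + sub-pixel convolution
        self.up = nn.Sequential(
            nn.Conv2d(n_feats, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, i_lr):
        f0 = self.sfe(i_lr)   # F_0 = P_SFE(I_LR)
        f_df = self.dfe(f0)   # F_DF = P_DFE(F_0)
        return self.up(f_df)  # I_SR = P_UP(F_DF)
```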
To make the procedure of SCPN clearer, we present a flowchart in Figure 4.

Selective Channel Processing Module (SCPM)
As shown in Figure 5, the proposed selective channel processing module (SCPM) has different forms in the training phase and the inference phase, which will be explicitly introduced below.



SCPM in the Training Phase
Channel selection matrix. To set up the modules, channel selection matrices are needed to judge whether each channel in the feature maps generated by the convolution layers is important, and whether to transmit it to the next convolution layer. Ideally, we utilize binary code, i.e., 0 and 1, to represent the 'selection' manipulation of the corresponding channels. To keep the parameters of the channel selection matrix learnable, and because the softmax function cannot push its outputs close to binary code, we adopt the Gumbel softmax distribution [50] to approximate the one-hot distribution. To be specific, for the l-th layer in the m-th SCPM, the channel selection matrix CSM_l^m has two columns, and the number of rows in the matrix equals the number of channels C. We input the parameters of the channel selection matrix into a Gumbel softmax function and generate the parameters M_l^m to reweight the feature maps output by the convolution layers:

M_l^m[c, i] = exp((CSM_l^m[c, i] + G_l^m[c, i]) / τ) / Σ_{j=1}^{2} exp((CSM_l^m[c, j] + G_l^m[c, j]) / τ), i ∈ {1, 2},

where c denotes the channel index and G_l^m ∈ R^{C×2} represents the Gumbel noise tensor. In addition, τ denotes the temperature coefficient of the Gumbel softmax function. When τ tends to ∞, all results of the Gumbel softmax function tend to 0.5, which makes the generated elements uniformly distributed. Conversely, when τ tends to 0, the results of the function become one-hot, which makes the channel selection matrix binary, as our setting requires. When initializing the network before the training phase, we randomly generate the parameters of every channel selection matrix from the Gaussian distribution N(0, 1). We denote the first column of M_l^m as M_{l,1}^m and the second column as M_{l,2}^m. Architecture. Figure 5a illustrates the flow path of the SCPM in the training phase. Four convolution layers are set for deeply processing the input features.
Let us denote the input feature map as F_in, the output of the n-th convolution layer as F_n, and the output feature map as F_out. Then we can get: where Conv_n denotes the n-th convolution layer, ⊙ denotes the element-wise multiplication, and DCA denotes the differential channel attention block, which will be detailed below. Training strategy. During the training phase, we adjust the temperature coefficient τ as a function of the epoch number t: τ drops gradually from 1 to 0.4 by the 300th epoch and remains at 0.4 for the following epochs of the training phase.
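The training-phase selection above can be sketched as a couple of helpers. The helper names are ours, and the linear form of the τ schedule is an assumption: the source fixes only the endpoints (1 at the start, 0.4 by epoch 300).

```python
import torch
import torch.nn.functional as F

def select_channels(feat, csm, tau):
    """Reweight the channels of `feat` with a Gumbel softmax over the C x 2
    channel selection matrix `csm`. Column 1 ~ keep for the next layer,
    column 2 ~ route to the addition layer; rows become near one-hot as tau -> 0."""
    m = F.gumbel_softmax(csm, tau=tau, dim=1)
    keep = feat * m[:, 0].view(1, -1, 1, 1)  # channels sent onward
    skip = feat * m[:, 1].view(1, -1, 1, 1)  # channels sent to the addition layer
    return keep, skip

def temperature(t):
    """tau schedule: decays from 1 to 0.4 by epoch 300, then stays at 0.4
    (the linear decay law is an assumption; only the endpoints are stated)."""
    return max(1.0 - 0.002 * t, 0.4)
```

Because the two softmax columns sum to one per channel, the two routes always partition the feature map, which is what lets the inference phase replace the soft weighting with a hard split.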

SCPM in the Inference Phase
Channel selection matrix. The channel selection matrices are properly optimized in the training phase to indicate whether each channel is preserved for the next convolution layer or sent directly to the addition layer at the end of the fourth convolution layer for feature adding. In the inference phase, the channel selection matrices serve as the basis for the channel splitting process. To get the binary code of the channel selection matrices, for the two elements M_l^m[c, 1] and M_l^m[c, 2], we directly replace the larger one with 1 and the smaller one with 0. In the channel selection matrix CSM_l^m, the positions of the elements equal to 1 in the first column M_{l,1}^m give the indexes of the channels preserved and sent to the next layer, and the positions of the elements equal to 1 in the second column M_{l,2}^m give the indexes of the channels passed to the addition layer.
Architecture. As shown in Figure 5b, the architecture of the SCPM in the inference phase differs from that in the training phase. The significant difference is that we introduce the channel splitting strategy to extract channels from the output channels. For the l-th layer in the m-th SCPM, we first split out the channels indicated by M_{l,1}^m and at the same time extract the convolution kernels at the corresponding positions of the next convolution layer. Then, the two-dimensional convolutions are computed using the extracted kernels and feature maps.
To be explicit, the process of the inference phase can be denoted as: where Conv_1 means the first convolution layer, Conv2d means the 2-D convolution function, F_l[M_{l,1}^m = 1] means the feature map extracted from the original feature map F_l, whose indexes equal the positions of '1' in M_{l,1}^m, and w_l denotes the original weight of the l-th convolution layer. Other symbols have the same meanings as those in Section 4.2.1. With the combination of the 2-D convolution and the selective channel processing strategy, we avoid computing the channels that contribute little to SR reconstruction. We thus greatly reduce the number of channels to be computed and do not have to store the parameters of the redundant convolution kernels, saving considerable redundant consumption.
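The binarization and splitting described above can be sketched as follows. This is a minimal illustration with made-up helper names; the real SCPM also re-routes the skipped channels to the addition layer rather than discarding them.

```python
import torch
import torch.nn.functional as F

def binarize(csm):
    """Turn a trained C x 2 selection matrix into binary code: per row, the
    larger element becomes 1 and the smaller 0, as described above."""
    keep = csm[:, 0] >= csm[:, 1]  # '1' in the first column: keep for the next layer
    return keep, ~keep             # '1' in the second column: pass to the addition layer

def pruned_conv(feat, weight, keep):
    """Inference-time sketch: slice the kept channels out of `feat` and the
    matching input slices out of the next layer's 3x3 kernel, then run an
    ordinary 2-D convolution on the reduced tensors."""
    return F.conv2d(feat[:, keep], weight[:, keep], padding=1)
```

Because the dropped channels never enter the convolution, neither their activations nor the corresponding kernel slices need to be computed or stored, which is where the parameter and FLOP savings come from.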

Differential Channel Attention Block
The channel attention mechanism is a widely used strategy in both high-level and low-level computer vision tasks. As a common practice, either global average pooling or global maximum pooling is utilized to generate the channel descriptor of the feature maps, and the channel descriptor is then processed into the weight of each channel of the feature map. RCAN shows the advantage of this mechanism by achieving higher PSNR and SSIM scores. However, using only the average value of each channel, we cannot extract richer information from the feature map, e.g., the high-frequency details or the distribution and deviation of the data, which has a negative impact on SR performance. To solve this problem and further boost the performance of our model, we propose the differential channel attention (DCA) block, whose procedure is shown in Figure 6. We first calculate the mean value of each channel:

mv_c = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j),

where mv_c means the mean value of the c-th channel, x_c means the c-th channel of the input feature, (i, j) means the coordinates of an element in x_c, and H and W mean the height and width of the channel feature, respectively.


Meanwhile, the standard deviation value of the input feature map is calculated, formulated as:

sdv_c = sqrt((1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} (x_c(i, j) − mv_c)^2),

where sdv_c means the standard deviation value of the c-th channel, and the other symbols have the same meanings as in the formulas above. With the standard deviation value, we take the whole distribution of the data into the model; hence, our model has a better ability to reconstruct high-frequency information. The two descriptors are then added:

sv = mv + sdv,

where sv means the summed value. After the addition operation is completed, the summed values are sent to a multi-layer perceptron (MLP) for further processing. The MLP has three layers, where the first layer has 64 elements, the second 16, and the third 64. After this process, the weights of the channels are formed. This process can be denoted as:

y = MLP(sv),

where y denotes the generated weights of the channels. Finally, we multiply the weights and the input feature:

x' = y ⊙ x,

where ⊙ denotes the element-wise multiplication. With the plug-and-play DCA block, the proposed SCPN further upgrades its reconstruction performance.
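Putting the steps above together, a DCA block might look like this in PyTorch. The ReLU between the linear layers and the final sigmoid are our assumptions, borrowed from common channel-attention designs; the source fixes only the 64-16-64 layer sizes.

```python
import torch
import torch.nn as nn

class DCA(nn.Module):
    """Sketch of the differential channel attention block: per-channel mean
    and standard deviation are summed and fed to a 64-16-64 MLP to produce
    per-channel weights."""
    def __init__(self, channels=64, reduced=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, reduced), nn.ReLU(True),
            nn.Linear(reduced, channels), nn.Sigmoid())

    def forward(self, x):
        flat = x.flatten(2)                      # (B, C, H*W)
        mv = flat.mean(dim=2)                    # per-channel mean value mv_c
        sdv = flat.std(dim=2, unbiased=False)    # per-channel std deviation sdv_c
        y = self.mlp(mv + sdv)                   # channel weights y = MLP(sv)
        return x * y.view(x.size(0), -1, 1, 1)   # element-wise reweighting
```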

Implementation Details
As a supplement, we introduce the implementation details to explicitly explain our SCPN architecture. We set the number of SCPMs to 6. There are four convolution layers in each SCPM, each with a kernel size of 3 × 3, a zero-padding of one, and a stride of one. Another convolution layer in the SCPM has a kernel size of 1 × 1, a stride of 1, and no zero-padding. The number of feature maps in our SCPN is set to 64 for better SR reconstruction results. In the up-sampling reconstruction part, the 3 × 3 convolution layer transforms the number of channels to 3 × r^2, where r is the rate of SR. Then, the pixel-shuffle layer turns the number of channels into 3 (i.e., the red, green, and blue channels), and the height and width of the features become r times the original ones.
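The shape bookkeeping of the up-sampling part can be checked in a few lines (r = 4 here is just an example rate):

```python
import torch
import torch.nn as nn

r = 4  # SR rate; 3 * r^2 = 48 channels out of the last 3x3 convolution
up = nn.Sequential(
    nn.Conv2d(64, 3 * r ** 2, 3, padding=1),  # 64 feature maps -> 3 * r^2 channels
    nn.PixelShuffle(r))                       # rearrange into 3 channels, r x taller/wider

sr = up(torch.randn(1, 64, 48, 48))  # deep features -> super-resolved RGB image
```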

Pseudocode of the Proposed Network
To better explain the procedure of our SCPN, we present the PyTorch-like pseudocode of the SCPN in the two phases (Algorithm 1).

Datasets and Evaluation Metrics
During the training phase, we utilize the DIV2K [51] dataset to construct our training set, which is widely used in image restoration tasks, especially in the SR field. It contains 800 high-quality natural images with 2K resolution and three color channels, i.e., red, green, and blue. For evaluating the performance of our SCPN, five standard benchmark datasets, i.e., Set5 [49], Set14 [52], B100 [53], Urban100 [54], and Manga109 [55], are selected as test sets. To be exact, Set5 and Set14 have 5 and 14 images without complex patterns, respectively. The B100 dataset contains 100 images of natural and cultural scenery. The Urban100 dataset comprises 100 images whose semantics concern urban scenes. Manga109 contains 109 manga volumes drawn by professional manga artists in Japan. To build the low-resolution inputs of the datasets, we adopt the commonly used imresize function in MATLAB (www.mathworks.com, accessed on 23 July 2022), which utilizes the bicubic model for degradation.
In order to quantify the SR efficiency of our SCPN and its competitors, we adopt two universal metrics, i.e., the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [29], computed on the luminance channel of the YCbCr space converted from the RGB space. In simple terms, PSNR measures the pixel-wise differences between the super-resolved images and the ground truth, while SSIM indicates the structural similarity, e.g., luminance, contrast, and structure, between the two images. Higher scores on these metrics indicate better performance.
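For reference, PSNR on 8-bit images reduces to a one-liner over the pixel-wise MSE. This is a generic sketch, not the exact evaluation script used in the paper (which also converts to the Y channel first):

```python
import numpy as np

def psnr(sr, hr, peak=255.0):
    """Peak signal-to-noise ratio: 10 * log10(peak^2 / MSE).
    Higher is better; identical images give infinity."""
    mse = np.mean((np.asarray(sr, dtype=np.float64) - np.asarray(hr, dtype=np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```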

Training Details
A pretreatment was carried out before training: we subtracted the mean value of the training set from the images. During the training phase, we cropped the low-resolution images into patches whose height and width equal 192/r, where r is the rate of SR upscaling; the corresponding high-resolution images were cropped at the same time to serve as the labels for training. Data augmentation was conducted after the data loader read the images, namely random 90° rotations and horizontal flips. We trained our model with the L1 loss function and the ADAM optimizer [56], whose hyper-parameters are β_1 = 0.9, β_2 = 0.999, and ε = 10^−8. The initial learning rate was set to 2 × 10^−4, then halved after every 400 epochs for SR upscale rates of 2 and 3, and after every 500 epochs for an SR upscale rate of 4. The minibatch size was set to 16. We implemented all the experiments using the PyTorch framework on a workstation with an NVIDIA (www.nvidia.com, accessed on 23 July 2022) RTX2080Ti GPU.
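The paired augmentation can be sketched as below. The function name and the explicit `rot_k`/`flip` arguments are ours (in training they would be drawn at random per sample), and the arrays are treated as plain H × W patches for brevity:

```python
import numpy as np

def augment_pair(lr, hr, rot_k, flip):
    """Apply the same 90-degree rotation (rot_k quarter turns) and optional
    horizontal flip to an LR patch and its HR label, keeping them aligned."""
    if flip:
        lr, hr = lr[:, ::-1], hr[:, ::-1]
    return np.rot90(lr, rot_k), np.rot90(hr, rot_k)
```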


Effectiveness of Selective Channel Processing Strategy
To demonstrate the effect of our proposed selective channel processing strategy, we designed two new variant modules, i.e., Module-A and Module-B, to replace the original SCPM in our SCPN, and trained them with the same strategy.
As shown in Figure 7a, all the feature maps generated by the convolution layers are added by the addition layer without cooperating with the channel selection matrices. In Figure 7b, for feature maps with 64 channels, the front 16 channels are split off and passed to the concatenating layer, and the rest are preserved and sent into the next convolution layer for further processing. As a comparison, our SCPM selects which channels to preserve or to pass to the addition layer (skipping processing) with a learnable channel selection matrix. Comparative results are shown in Table 1. As illustrated in Table 1, the network with Module-A, which has no channel selection matrix to selectively pass the channels to the next layer, shows a significant performance drop. The main reason is that redundant features are passed to the following convolution layer in the Module-A architecture, which degrades the SR performance. In our SCPN, the channel selection matrix passes the feature pieces that are needed by the next convolution layer and sends the rest of the channels to the addition layer, which leads to fewer parameters, less computational cost, and higher PSNR and SSIM. An inspection of the table shows that the network with Module-B has fewer parameters than our SCPN.
Although the architecture of Module-B seems more lightweight, its strategy only passes a fixed quantity of channels to the next convolution layer and aggregates the remaining channels in the concatenation layer, which leaves a proportion of the features unprocessed for SR reconstruction. Owing to our selective channel processing strategy, our SCPN achieves a better trade-off between computational complexity and performance.
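To make the selective channel processing strategy concrete, the following is a minimal NumPy sketch of a hypothetical gating step (the function and variable names are our own, not the paper's implementation): a learnable vector softly scales every channel during training, and at inference it is hardened into a binary gate that routes each channel either to the next convolution layer or directly to the addition layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def scpm_gate(features, logits, training):
    """Hypothetical simplification of an SCPM channel selection step.

    features: (C, H, W) feature map
    logits:   (C,) learnable channel-selection parameters
    """
    if training:
        # Training: soften the selection so gradients can flow; every
        # channel is scaled by its (soft) importance.
        soft = sigmoid(logits)
        return features * soft[:, None, None], None
    # Inference: harden to a binary gate; selected channels go on to the
    # next convolution, the rest skip straight to the addition layer.
    keep = sigmoid(logits) > 0.5
    return features[keep], features[~keep]

feats = np.random.rand(8, 4, 4)
logits = np.array([2.0, -2.0, 1.5, -1.0, 3.0, -0.5, 0.7, -3.0])
to_conv, to_add = scpm_gate(feats, logits, training=False)
# Here 4 channels are processed further and 4 skip to the addition layer
```

Because the hard gate removes whole channels at inference, the following convolution operates on a smaller input tensor, which is where the parameter and computation savings come from.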

Visualization of Channel Selection Matrixes
We visualize the selective channel matrixes in Figure 8. It is observed that, in models of any scale factor, more channels in the feature map are preserved in the layers at the tail of the model than in those at the front. This indicates that more of the front features are of less significance and can be skipped over to avoid computational redundancy. Figure 8 also shows that more channels are preserved and sent into the next layers than are passed to the addition layer, which demonstrates that most of the channels in the feature maps are significant for reconstructing the final SR results. These observations also echo the phenomena shown in Section 3.


Quantitative Evaluation and Visual Comparison
In order to test the effectiveness of the proposed model, we compare the SCPN with the bicubic interpolation method and nine state-of-the-art models, including SRCNN [23], FSRCNN [57], VDSR [24], DRCN [58], LapSRN [59], SRFBN-S [60], CARN [30], IDN [31], and IMDN [32]. Since we mainly focus on lightweight network designs in this paper, several recent works with more than 2 M parameters (e.g., EDSR [26] (~40 M), RCAN [28] (~15 M), and SAN [37] (~15 M)) are not included for comparison. We report the quantitative comparison in Table 2. Table 2. Quantitative results of the compared methods in the format of PSNR/SSIM. #Params is short for the number of parameters. The results are either reproduced by ourselves with the official settings or copied directly from the original papers. Bold numbers indicate the best performance.

Quantitative Results
It can be seen from Table 2 that our SCPN outperforms the state-of-the-art methods with higher PSNR and SSIM values. Our method also keeps a slim model size, holding its parameters within one million.
Specifically, the bicubic interpolation method has no prior knowledge for SR reconstruction and therefore shows inferior performance. SRCNN refines the interpolated images with a shallow network architecture, achieving a 2–3 dB improvement in PSNR over the interpolation methods. FSRCNN and VDSR further increase the number of layers but do not achieve rapid gains due to the limitations of their network architectures. DRCN and SRFBN-S utilize the recursive mechanism, which recurrently reuses the modules in the networks with shared parameters. This mechanism saves parameters but limits the network's ability to learn more prior knowledge. IDN and IMDN propose and enhance the information distillation mechanism, respectively, which helps to reconstruct SR images without too many parameters. Our proposed SCPN utilizes the selective channel processing strategy, which empowers the network to save parameters and achieve better performance. Our proposed method surpasses all the methods above, achieving state-of-the-art performance while keeping a slim model size.
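For reference, the PSNR metric used throughout this comparison can be computed as follows (a standard definition, not code from the paper):

```python
import numpy as np

def psnr(sr, hr, max_val=255.0):
    """Peak signal-to-noise ratio between a super-resolved image and
    its ground-truth HR counterpart (pixel values in [0, max_val])."""
    mse = np.mean((sr.astype(np.float64) - hr.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

hr = np.zeros((8, 8))
sr = hr + 10.0  # a uniform error of 10 gray levels
# MSE = 100, so PSNR = 10 * log10(255^2 / 100) ≈ 28.13 dB
```

Because PSNR is logarithmic in the mean squared error, the 2–3 dB gaps discussed above correspond to a substantial reduction in reconstruction error.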

Qualitative Results
We provide a visual comparison of selected pictures (i.e., img047, img067, img076, and img087 in the Urban100 dataset) generated by our SCPN and other previous works, shown in Figure 9. First, take img047 and img087 as examples. The difficulty in reconstructing these images lies in recovering the edges of the windows on the buildings. Our SCPN can precisely recover the edges, making the SR images look sharper than the others. Regarding img067, our SCPN performs better than the other methods when facing complex textures. It should be noted that our method recognizes the two-line stripes and makes them clearer, while other methods ignore this detail. In img076, our SCPN restores the blocks on the wall with more regular textures, while other methods cannot reconstruct these rectangles. In sum, our proposed SCPN model generates clearer SR results than other methods, especially in detailed regions. Owing to the selective channel processing strategy and DCA, our method achieves the best performance with limited parameters.

Remote Sensing Image Super-Resolution
Remote sensing technology is now widely used in agriculture, forestry, military, and other fields. As enhancing the quality of remote sensing images is of great significance, we conducted experiments on remote sensing datasets in order to adapt our method to the remote sensing field. Because of the differences in shooting angles and the distribution bias between natural and remote sensing images, we utilized the model pretrained on the DIV2K dataset and fine-tuned it on the remote sensing dataset. Owing to transferring the external knowledge from the natural image domain to the remote sensing image domain, our model achieves faster convergence and better performance in remote sensing SR tasks.
We conducted experiments on the UC Merced Land-use [61] dataset, which is used by most remote sensing SR methods. The UC Merced Land-use dataset is one of the most famous datasets in the remote sensing research area. It contains 21 classes of land-use scenes, and each class includes 100 aerial images with a high spatial resolution (i.e., 0.3 m/pixel) and a size of 256 × 256. Following the settings of previous works [7,62], we randomly selected 40 images per class (i.e., 840 images in total) to construct the training set, and randomly chose 40 images from the training set as a validation set. Furthermore, we constructed the UCTest dataset with 120 randomly selected images from the remaining part of the dataset. The acquisition of the HR-LR pairs for training and testing is the same as that for the common images in Section 5.1. The training strategy for remote sensing images is the same as that for common images in Section 5.2; the only difference is that we load the weights of the model trained on the common datasets for the transfer strategy mentioned above. We also trained the IMDN model with the same strategy for comparison.
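The split described above can be sketched at the index level as follows (a hypothetical reconstruction; the actual file names and the authors' random seed are unknown):

```python
import random

def split_uc_merced(classes=21, per_class=100, n_train=40,
                    n_val=40, n_test=120, seed=0):
    """Index-level sketch of the UC Merced split: 40 training images per
    class, 40 of those reused for validation, and 120 held-out test
    images (UCTest). Images are identified as (class, index) pairs."""
    rng = random.Random(seed)
    train, remaining = [], []
    for c in range(classes):
        ids = [(c, i) for i in range(per_class)]
        rng.shuffle(ids)
        train += ids[:n_train]        # 40 per class -> 840 in total
        remaining += ids[n_train:]    # candidates for the test set
    val = rng.sample(train, n_val)    # validation drawn from training set
    test = rng.sample(remaining, n_test)  # UCTest: disjoint from training
    return train, val, test

train, val, test = split_uc_merced()
# len(train) == 840, len(val) == 40, len(test) == 120
```

Sampling the test images from the remainder per class guarantees that UCTest never overlaps the training set, which is what makes the reported generalization comparison meaningful.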
The NWPU-RESISC45 dataset [63] is a publicly available benchmark dataset, which covers 45 classes with 700 images in each class extracted from Google Earth. We randomly chose 180 images from NWPU-RESISC45 to make up a test dataset named RESISCTest to evaluate the performance and generalization ability of our model. Table 3 shows the mean PSNR and SSIM values of the test datasets for the compared methods. We can observe that our SCPN achieves a higher PSNR value (by approximately 0.1 dB) and a higher SSIM index (by approximately 0.004) than its main competitor, i.e., IMDN. It is noteworthy that IMDN-T achieved its best performance after more than 1000 epochs of fine-tuning, while our SCPN needs only 8 epochs of fine-tuning to reach its best performance, which illustrates that our method has better generalization ability and is easier to train.
To fully demonstrate the effectiveness of our method, we provide six visual results for the scale factor ×4 on the two test datasets, which are shown in Figure 10. The results illustrate that our SCPN-T restores more high-frequency information precisely and reconstructs remote sensing pictures with better visual effects. Application in real-world cases. To further test the performance of our method in real-world scenes, we captured three remote-sensing images from the Landsat-8 satellite [64][65][66][67], which show the landscapes around Xuanwu Lake, Xinjizhou National Wetland Park, and Lukou International Airport in Nanjing. The original size of these images is 900 × 619. Our method successfully super-resolved these images with good visual effects and abundant details, as shown in Figure 11. This demonstrates that our proposed method can be well-applied to real-world remote-sensing scenery.

Conclusions
In this paper, we propose a lightweight convolution neural network with the selective channel processing strategy (SCPN) for single image super-resolution. Specifically, we propose selective channel processing modules (SCPM) to execute our selective channel processing strategy, which utilizes channel selection matrixes with learnable parameters. In the training phase, the selective channel matrixes are softened and multiplied with the corresponding feature maps to guide the model to distinguish the importance of each channel. In the inference phase, the values in the selective channel matrixes are hardened to act as gates, which decide whether to process the corresponding channels in the next convolution layer or pass them to the addition layer directly for simplified calculation. Moreover, we propose the differential channel attention block in order to restore more high-frequency details. Extensive experiments demonstrate that our method achieves a better trade-off between model complexity and performance, keeping the number of parameters within 1 M while obtaining higher PSNR and SSIM values on the test datasets than its competitors. Sections 5 and 6 show that our method can generate natural images and remote-sensing images with higher quality and finer details, achieving better results than previous state-of-the-art methods in both quantitative and qualitative comparisons. Specifically, our SCPN achieves an approximately 0.1 dB higher PSNR value and a 0.004 higher SSIM value than IMDN, its main competitor. In the future, we will explore efficient ways to deploy our lightweight model on mobile devices. At the same time, we will explore other lightweight strategies in the SR field, such as introducing sparse convolution into the models to further reduce their size and calculation complexity.
