A Lightweight Feature Distillation and Enhancement Network for Super-Resolution Remote Sensing Images

Deep-network-based image super-resolution (SR) has achieved great success in recent years, but the large number of parameters such networks require makes them unsuitable for devices with limited capabilities in real life. Therefore, we propose a lightweight feature distillation and enhancement network (FDENet). Specifically, we propose a feature distillation and enhancement block (FDEB), which contains two parts: a feature-distillation part and a feature-enhancement part. Firstly, the feature-distillation part uses a stepwise distillation operation to extract layered features; here, we use the proposed stepwise fusion mechanism (SFM) to fuse the features retained after stepwise distillation to promote information flow, and use the shallow pixel attention block (SRAB) to extract information. Secondly, we use the feature-enhancement part to enhance the extracted features. The feature-enhancement part is composed of well-designed bilateral bands: the upper sideband is used to enhance the features, and the lower sideband is used to extract the complex background information of remote sensing images. Finally, we fuse the features of the upper and lower sidebands to enhance the expression ability of the features. Extensive experiments show that the proposed FDENet both has fewer parameters and performs better than most existing advanced models.


Introduction
Single-image super-resolution (SISR), which aims to recover high-resolution (HR) images from low-resolution (LR) images, is a hot topic in computer vision and is closely related to various other vision tasks, such as target detection [1,2] and scene labeling [3,4].
SISR is an ill-posed problem; that is, there are many possible HR images that could be restored from a given LR image. Current image SR reconstruction methods are mainly divided into three types: interpolation-based methods [5], reconstruction-based methods [6], and learning-based methods [7]. Interpolation-based methods, such as bicubic interpolation [8], are algorithmically simple, but because they pay little attention to edge information, the reconstructed images lose details. Reconstruction-based methods require prior information to constrain the reconstruction process, and when dealing with large magnification factors, their performance degrades because of the lack of such priors.
With the rapid development of computer vision in recent years, methods based on deep learning have gradually become the mainstream, and more and more deep networks with excellent performance are being created. Dong et al. [9] proposed the first model to reconstruct HR images using a convolutional neural network (CNN), achieving remarkable results compared to traditional methods, but its huge computational load limited its practicality. Deconvolution [10] and sub-pixel convolution [11] methods were then proposed to perform upsampling at the end of the network, greatly reducing the computational cost. Kim et al. designed a very deep network [12]; the introduction of residual learning [13] effectively alleviates the gradient problem and promotes information flow, accelerating the convergence of the network. Zhang et al. [14] considered the correlation between feature-map channels and proposed a channel attention mechanism combined with residual learning, which made the network more inclined to learn high-frequency information. Haris et al. [15] introduced an error feedback mechanism that obtains better reconstruction results by calculating up- and down-projection errors. Dai et al. [16] proposed a second-order channel attention module and a non-local augmented residual group structure, which realized more powerful feature representation and feature-correlation learning.
As a branch of image SR, remote sensing image SR has also developed rapidly. Lei et al. [17] proposed learning a multi-level representation of remote sensing images, concatenating the results obtained after different layers of convolution and then combining these groups with a convolution layer, which can represent both local details and global environmental priors. RDBPN [18] improved on DBPN [15] by replacing its downsampling unit with a simpler downscaling unit, which greatly simplified the network. Haut et al. [19] adopted residual connections, skip connections, and parallel convolution layers with a kernel size of 1 × 1 to extract more informative features and reduce the network's information loss. Zhang et al. [20] proposed a parallel multi-scale convolution method to extract multi-scale features and combined it with a channel attention mechanism to further exploit them. Zhang et al. [21] replaced element-wise addition with weighted channel concatenation in skip connections and performed feature optimization by modeling complex high-order statistics to further refine the extracted features. Dong et al. [22] proposed a second-order, multi-scale super-resolution network that subtly captures multi-scale feature information by aggregating features from different deep learning algorithms in a single path. Xu et al. [23] combined the details of remote sensing images with background information by connecting local and global memory to increase the receptive field; to speed up computation, they also reduced the spatial size of the feature maps by downsampling. Aiming at the problem that traditional supervised methods struggle to obtain paired HR and LR images, Zhang et al. [24] designed a cyclic convolutional neural network composed of two cyclic modules, which can be trained with unpaired data and is robust to image noise and blur. Li et al. [25] designed a recursive block that focuses on high-frequency information through an attention mechanism and combines low-resolution and high-resolution hierarchical local information to reconstruct the image. Dong et al. [26] proposed a dense-sampling network that enables the network to jointly consider multiple levels of priors when performing reconstruction, achieving good experimental results.
All the above methods achieved state-of-the-art performance at the time. However, their biggest problem is having too many parameters, which places a heavy computational burden on hardware and makes them difficult to deploy effectively in real life. In recent years, some scholars have therefore begun to focus on lightweight image SR networks that can be used in daily life, for example, networks based on feature distillation [27-29]. Although the channel-separation operation can gradually expand the receptive field and extract more comprehensive information, it leads to insufficient information flow between the separated channels [30], thereby hindering the expression of feature information. There is also MADNet, the multi-scale feature extraction network with an attention mechanism proposed by Lan et al. [31]; its use of repeated feature extraction blocks not only makes the network structure redundant but also increases the number of network parameters.
In the field of remote sensing, it is often impractical to obtain high-quality images simply by spending more on high-precision sensors, especially in certain fields of application, such as field surveying, individual reconnaissance, and vehicular satellite positioning and navigation. Most of the devices used there are portable, placing higher requirements on the weight of remote sensing image SR algorithms. Therefore, in order to build a network with fewer parameters and more competitive performance, we propose a lightweight feature distillation and enhancement network (FDENet). The network has only 501K parameters, almost half as many as the advanced MADNet, while its experimental performance is also better. Figure 1 gives the overall architecture of FDENet. We exploit the backward fusion module (BFM) [32] to fuse the features extracted by four cascaded FDEBs, and then use the Gaussian context transformer (GCT) [33] to improve the feature-expression ability. The FDEB contains two parts: the feature-distillation part and the feature-enhancement part. The feature-distillation part uses channel separation to extract layered features. To avoid the insufficient information expression caused by this operation, we use the proposed stepwise fusion mechanism (SFM) to fuse the features retained after stepwise distillation and promote information flow. The bilateral bands in the feature-enhancement part are used to enhance features and extract complex background information from remote sensing images. Finally, the features of the two bands are fused to enhance the feature expression. Overall, the main contributions are as follows:

1.
We propose a shallow pixel attention block (SRAB), which introduces the pixel attention mechanism so that the network can focus on repairing missing texture details at the cost of very few parameters.

2.
We propose the SFM, which fuses the features retained after stepwise distillation to make full use of them and promote information flow, making the feature expression more comprehensive.

3.
We propose a bilateral feature-enhancement module (BFEM), which extracts contextual information and enhances the extracted features separately by means of two sidebands.

Proposed Method
In this part, we will introduce the structure of our proposed network and then introduce the proposed feature-distillation part and feature-enhancement part in detail.

Network Architecture
FDENet's overall structure is shown in Figure 1. We first extract primary features from the LR image, then extract deep features through four cascaded FDEBs, and finally, pass the data through a 3 × 3 convolution layer and an upsampling layer to obtain the SR image.
(1) Primary feature extraction: Given an LR image I_LR ∈ R^(H×W×3), where H, W, and 3 are the height, width, and number of channels, respectively, to keep the network as lightweight as possible, we use only a 3 × 3 convolution layer for primary feature extraction. Let F_init(...) denote a convolution layer with a kernel size of 3 × 3 and C output channels. Then, the obtained primary feature F_0 is:

F_0 = F_init(I_LR). (1)

(2) Deep feature extraction: We use four lightweight cascaded FDEBs to extract deep features. Let F_FDEB_i(...) and F_GCT(...) denote the operations of the ith FDEB block and of the GCT module, respectively, where i ∈ [1, 4]. Then, the output deep feature F_d is:

F_d = F_GCT(F_FDEB_4(F_FDEB_3(F_FDEB_2(F_FDEB_1(F_0))))). (2)

(3) Reconstruction layer: Let F_up(...) denote a convolution layer with a kernel size of 3 × 3 followed by an upsampling layer. Then, the final reconstructed SR image I_SR is:

I_SR = F_up(F_d). (3)
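As a rough sketch of this three-stage pipeline, the following PyTorch code wires together a 3 × 3 head convolution, a stack of placeholder blocks standing in for the four FDEBs and the GCT, and a PixelShuffle-based reconstruction tail. All names here (`FDENetSketch`, `num_feats`, the block internals) are our own illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn

class FDENetSketch(nn.Module):
    """Minimal sketch of the FDENet pipeline: a 3x3 conv for primary
    feature extraction (F_init), placeholder blocks standing in for the
    four FDEBs and the GCT, and a conv + PixelShuffle tail (F_up)."""
    def __init__(self, num_feats=48, num_blocks=4, scale=2):
        super().__init__()
        self.head = nn.Conv2d(3, num_feats, 3, padding=1)        # F_init
        # Placeholder feature blocks; the real network uses FDEB + GCT.
        self.body = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(num_feats, num_feats, 3, padding=1),
                          nn.LeakyReLU(0.05, inplace=True))
            for _ in range(num_blocks)])
        self.tail = nn.Sequential(                               # F_up
            nn.Conv2d(num_feats, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, x):
        f0 = self.head(x)            # primary feature F_0
        fd = self.body(f0) + f0      # deep feature F_d (global skip)
        return self.tail(fd)         # reconstructed SR image I_SR

if __name__ == "__main__":
    lr = torch.randn(1, 3, 32, 32)
    sr = FDENetSketch(scale=2)(lr)
    print(tuple(sr.shape))
```

For a ×2 model, a 32 × 32 LR input yields a 64 × 64 SR output, matching the post-upsampling design described above.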

The Proposed FDEB
The proposed FDEB consists of a feature-distillation part and a feature-enhancement part. Next, we will give more details.

Feature-Distillation Part
The structure of the feature-distillation part is shown in the blue box on the right of Figure 2. First, we use a convolution layer with a kernel size of 3 × 3 to extract features coarsely, and then use stepwise distillation to increase the receptive field and further extract layered features. Specifically, we use a stepwise channel-separation operation to retain part of the features and extract information from the remaining part. However, the channel-separation operation of stepwise feature distillation inevitably leads to insufficient information flow between channels, hindering the expression of features. Thus, we propose the SFM, which fuses the features retained after each distillation and uses the SRAB to extract information. This not only makes full use of the retained features, but also effectively avoids the problem of insufficient information flow between channels. We take the proposed SRAB (shown in Figure 3) as the basic feature-extraction unit of the FDEB. On top of the SRB proposed in RFDN [29], we introduce the pixel attention mechanism, which enables the network to focus on repairing missing textural details when extracting features. Let the input feature of the nth FDEB be F^n_i; then, the output feature F^n_D of this process can be described as:

F^n_coarse_i, F^n_refined_i = split^n_i(F^n_coarse_(i-1)), (4)

F_i = concate(F^n_refined_1, . . . , F^n_refined_i), (5)

F^n_distil_i = SRAB_i(F_i), (6)

Equations (4)-(6) give the feature-distillation process used to extract the layered features, the SFM, and the general formula of the SRAB, respectively. F^n_i represents the input feature of the nth FDEB; F^n_coarse_i and F^n_refined_i represent the ith distillation feature and the ith retained feature in the nth FDEB, respectively; split^n_i represents the ith channel-separation operation in the nth FDEB; SRAB_i represents our shallow pixel attention block; F_i represents the input feature of the corresponding SRAB_i; F^n_distil_i represents the feature retained after fusion and extraction; and F^n_D represents the output feature of this whole process.

Figure 2. Comparison between RFDB and FDEB. Left: the structure of RFDB. Right: the structure of our FDEB. © represents feature fusion; ⊕ and ⊗ represent element-wise summation and element-wise multiplication, respectively. The green and brown boxes represent the basic feature-extraction unit and the feature-enhancement block, respectively.

Figure 3. Comparison between SRB and SRAB. Left: the structure of SRB. Right: the structure of SRAB. ⊕ and © represent element-wise summation and feature fusion, respectively.
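The split-and-retain pattern described above can be illustrated with the following sketch. The halving ratio, the number of steps, and the omission of the SRAB extractor are simplifications of ours, so this shows only the channel-separation idea and an SFM-style fusion of the retained features, not the paper's exact configuration.

```python
import torch

def stepwise_distill(x, steps=3):
    """Sketch of stepwise feature distillation: at each step, half of the
    channels are retained ("distilled") and the other half is passed on
    for further processing. The retained parts are then fused SFM-style
    by concatenation. The SRAB extractor applied at each step in the
    paper is omitted here for brevity."""
    kept = []
    for _ in range(steps):
        c = x.shape[1]
        distilled, x = torch.split(x, [c // 2, c - c // 2], dim=1)
        kept.append(distilled)            # retained feature of this step
    kept.append(x)                        # final remaining features
    # SFM-style fusion: concatenating all retained parts restores the
    # original channel count while mixing information across steps.
    return torch.cat(kept, dim=1)

if __name__ == "__main__":
    x = torch.randn(1, 48, 16, 16)
    print(tuple(stepwise_distill(x).shape))
```

With 48 input channels and three splits, the retained parts have 24, 12, 6, and 6 channels, which concatenate back to the original 48.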

Bilateral Feature Enhancement Module
Compared with natural images, remote sensing images have more complex structural and background information. Therefore, in order to make full use of this background information, we propose the BFEM (shown in Figure 4), which can focus on extracting the background information of remote sensing images while enhancing features. Let F^n_D denote the input feature of the BFEM; then, its output F_BFEM is:

F_BFEM = Conv1(concate(F_BFEM_up, F_BFEM_down)), (7)

where Conv1(. . .) represents a convolution layer with a kernel size of 1 × 1; concate(. . .) represents the feature-fusion operation; and F_BFEM_up and F_BFEM_down represent the output features of the upper and lower sidebands, respectively. In the upper sideband, we use enhanced spatial attention [34] (ESA) to expand the receptive field of the features extracted by the FDEBs and help obtain a clearer reconstructed image. This part is composed of a strided convolution layer with a stride of 2 and a kernel size of 3 × 3; a max-pooling layer with a stride of 3 and a kernel size of 7 × 7; and three convolution layers with a kernel size of 3 × 3. Let F̂ represent the features obtained through the above steps and F^n_D represent the input features of the upper sideband. Then, the feature F_BFEM_up obtained after passing through the upper sideband can be expressed as:

F_BFEM_up = F^n_D ⊗ σ(Conv1(F̂ ⊕ Conv1(F^n_D))), (8)

where ⊗ and ⊕ represent element-wise multiplication and element-wise summation, respectively; Conv1(. . .) represents the convolution layer with a kernel size of 1 × 1; and σ represents the sigmoid function. The lower sideband is used to extract the contextual feature information of remote sensing images, helping to obtain more details from the complex background. This part is composed of an avg-pooling layer with a stride of 2 and a kernel size of 2 × 2, a convolution layer with a kernel size of 1 × 1, and a bilinear upsampling layer. Let F̂ represent the features obtained through the above steps and F^n_D represent the input features of the lower sideband. Then, the feature F_BFEM_down obtained after passing through the lower sideband can be expressed as:

F_BFEM_down = F^n_D ⊕ F̂. (9)
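A minimal sketch of the bilateral structure follows: the upper band implements an ESA-style spatial gate (downsample, convolve, upsample, sigmoid), and the lower band implements the avg-pool + 1 × 1 conv + bilinear-upsampling context branch. The channel widths, layer counts, and fusion conv are illustrative assumptions, not the paper's exact layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BFEMSketch(nn.Module):
    """Sketch of a bilateral feature-enhancement module: an ESA-style
    gated upper band plus a pooled-context lower band, fused by a 1x1
    conv. Widths and layer counts are simplified."""
    def __init__(self, channels=48):
        super().__init__()
        mid = channels // 4
        # upper band (ESA-style spatial attention gate)
        self.up_in = nn.Conv2d(channels, mid, 1)
        self.up_down = nn.Conv2d(mid, mid, 3, stride=2, padding=1)
        self.up_conv = nn.Conv2d(mid, mid, 3, padding=1)
        self.up_out = nn.Conv2d(mid, channels, 1)
        # lower band (background/context branch)
        self.low_conv = nn.Conv2d(channels, channels, 1)
        # fusion of the two bands
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        # upper band: shrink, convolve, restore, then gate the input
        g = self.up_in(x)
        g = F.max_pool2d(self.up_down(g), kernel_size=7, stride=3)
        g = self.up_conv(g)
        g = F.interpolate(g, size=(h, w), mode='bilinear',
                          align_corners=False)
        upper = x * torch.sigmoid(self.up_out(g))
        # lower band: pooled context restored by bilinear upsampling
        ctx = F.avg_pool2d(x, kernel_size=2, stride=2)
        lower = F.interpolate(self.low_conv(ctx), size=(h, w),
                              mode='bilinear', align_corners=False)
        return self.fuse(torch.cat([upper, lower], dim=1))

if __name__ == "__main__":
    y = BFEMSketch()(torch.randn(1, 48, 24, 24))
    print(tuple(y.shape))
```

Both bands preserve the spatial size, so the module can be dropped in after any FDEB without changing the feature-map shape.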

Gaussian Context Transformer
The structure of the GCT is shown in Figure 5. Compared with some other attention mechanisms, it is not only lighter but can also achieve contextual feature excitation, leading to better performance. Therefore, we pass the features through the GCT to improve their expression ability before upsampling.

Datasets

The training dataset [35] consists of 900 natural images with a resolution of 2K, including various natural images of buildings, animals, plants, etc. Following SMSR [26], we chose the first 800 images as the training set and the last 100 images as the validation set. Following the example of FeNet [30], we randomly selected 240 images from the UC Merced dataset [36], which contains 21 scene classes, to make two test sets, RS-1 and RS-2. RS-1 contains 120 images from ten classes: agricultural, airplane, baseballdiamond, beach, buildings, chaparral, denseresidential, forest, freeway, and golfcourse (12 images per class). RS-2 contains 120 images from ten classes: intersection, mediumresidential, mobilehomepark, overpass, parkinglot, river, runway, sparseresidential, storagetanks, and tenniscourt (12 images per class). To further prove the generalization ability of the proposed model, we also tested it on four natural benchmark datasets: Set5 [37], Set14 [38], Urban100 [39], and BSD100 [40].

Degradation Method
We used bicubic interpolation in MATLAB R2018a to downsample the original high-resolution images by factors of ×2, ×3, and ×4, obtaining LR images for use as training and test data.

Training Details
We chose the L1 loss [41] as the training loss function, which calculates the mean absolute difference between the predicted value and the target value. Let ŷ_i represent the SR image and y_i represent the real HR image. Then, the loss function can be expressed as:

L = (1/N) Σ_(i=1)^N ||ŷ_i − y_i||_1, (10)

where N is the number of training samples. To get the most out of the training data, we used random rotation and flipping for data augmentation. The randomly cropped training patch size of the HR image was 192 × 192, and we normalized the pixel range of the input image to [0, 1]. ADAM [42] was used as the optimizer with β1 = 0.9 and β2 = 0.999; the initial learning rate was set to 5 × 10^−4 and decayed by half every 200 epochs, for a total of 500 epochs. All experiments were implemented using the PyTorch framework, and we used an NVIDIA Tesla V100 GPU for the entire training and testing process.
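The loss and optimizer setup described above can be sketched as follows, using the stated hyperparameters. The one-layer model here is a placeholder standing in for FDENet, not the actual network.

```python
import torch

def l1_loss(sr, hr):
    """L1 loss as described above: the mean absolute difference between
    the reconstructed SR image and the ground-truth HR image."""
    return torch.mean(torch.abs(sr - hr))

if __name__ == "__main__":
    # Placeholder model; the real training would use FDENet instead.
    model = torch.nn.Conv2d(3, 3, 3, padding=1)
    # ADAM with beta1 = 0.9, beta2 = 0.999 and initial lr = 5e-4,
    # halving the learning rate every 200 epochs (500 epochs total).
    optim = torch.optim.Adam(model.parameters(), lr=5e-4,
                             betas=(0.9, 0.999))
    sched = torch.optim.lr_scheduler.StepLR(optim, step_size=200,
                                            gamma=0.5)
    sr, hr = torch.rand(1, 3, 8, 8), torch.rand(1, 3, 8, 8)
    print(float(l1_loss(sr, hr)))
```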

Evaluation Index
We used the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) to evaluate the results [43]. Let x and y be the ground-truth HR image and the reconstructed SR image, respectively. Then, the PSNR value is:

PSNR(x, y) = 10 log10(255^2 / MSE(x, y)), (11)

MSE(x, y) = (1/(H × W)) Σ_(i=1)^H Σ_(j=1)^W (X(i, j) − Y(i, j))^2, (12)

where H and W represent the height and width of the given image; 255 is the maximum RGB value for each pixel; and X(i, j) and Y(i, j) represent the pixel values at position (i, j) of the real HR image and the generated SR image, respectively. The SSIM value is:

SSIM(x, y) = ((2 µ_x µ_y + C1)(2 σ_xy + C2)) / ((µ_x^2 + µ_y^2 + C1)(σ_x^2 + σ_y^2 + C2)), (13)

where µ_x (µ_y) and σ_x (σ_y) represent the mean and standard deviation of x (y), respectively; σ_xy is the covariance of x and y; and C1 and C2 are constants. We evaluate the PSNR and SSIM values on the Y channel of the transformed YCbCr space [12], where Y, Cb, and Cr represent the brightness, the difference between the blue component of the input signal and the brightness value of the RGB signal, and the difference between the red component of the input signal and the brightness value of the RGB signal, respectively.
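The PSNR metric and the Y-channel extraction can be implemented as below. The `rgb_to_y` helper uses the standard ITU-R BT.601 luminance coefficients commonly used in SR evaluation for inputs scaled to [0, 1]; the function names are our own.

```python
import numpy as np

def psnr(x, y):
    """PSNR between two images in the 0..255 range, per Eq. (11)-(12)."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')
    return 10.0 * np.log10(255.0 ** 2 / mse)

def rgb_to_y(img):
    """Y (luminance) channel of the ITU-R BT.601 YCbCr conversion for an
    (H, W, 3) RGB array with values in [0, 1]. SR metrics are commonly
    evaluated on this channel."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + 65.481 * r + 128.553 * g + 24.966 * b

if __name__ == "__main__":
    gt = np.full((8, 8), 100.0)
    print(round(psnr(gt, gt + 5.0), 2))
```

For a uniform error of 5 gray levels, the PSNR evaluates to about 34.15 dB, which gives a feel for the scale of the 0.02-0.09 dB gains reported later.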

Results on Remote Sensing Images
We quantitatively compared FDENet with results published in CVPR, ECCV, TGRS, and other well-known conferences and journals over the years. Table 1 clearly shows that our model performs excellently at magnification factors of ×2, ×3, and ×4. Taking the advanced FeNet [30] as an example, our PSNR values on the RS-1 and RS-2 datasets are 0.02 and 0.09 dB higher, respectively. On the whole, our model has fewer parameters and multi-adds than most models. Figures 6 and 7 show the visualization results of each model on the remote sensing image datasets. Taking Figure 6 as an example, compared with the advanced FeNet, the white car and red car in our reconstructed image have clearer outlines and more comprehensive details. The other comparison images likewise show that our model's results have richer edge details. Table 1. Quantitative evaluation results on the remote sensing datasets. "Params" represents the model parameter quantity; the best and second-best results are in red and blue, respectively. "-" indicates that no result was provided.

Results on Natural Images
To further prove the generalization ability of our model, we compared it with remote sensing image SR models on four natural benchmark datasets: Set5 [37], Set14 [38], Urban100 [39], and BSD100 [40]. Table 2 shows the quantitative comparison results. It can be seen from the table that our model still performs better on natural images than other remote sensing image SR models. Although its performance at ×2 magnification is slightly inferior to that of the advanced MADNet [31] and FeNet [30], it is clearly superior on all datasets at ×3 and ×4. Figure 8 shows the visualization results of each model at the ×3 magnification factor on the Urban100 [39] and BSD100 [40] datasets, from which we can see that after our model's reconstruction, the window shapes are better restored, the outlines are clearer, and more details are retained. Table 2. Quantitative results on four super-resolution benchmark datasets. "Params" and "Multi-Adds" represent the model's parameter quantity and model complexity, respectively. The best and second-best results are in red and blue, respectively. "-" indicates that no result was provided.


Comparison of SRB and SRAB
Our feature extractor's basic unit, the SRAB, introduces the pixel attention mechanism, which has been proven suitable for lightweight networks and can repair missing texture details during feature extraction [32]. The pixel attention mechanism uses only a convolution layer with a kernel size of 1 × 1 and a sigmoid function to obtain attention maps, which are then multiplied with the input features. Because this mechanism is applied in the feature-distillation part, where the number of feature-map channels gradually decreases during distillation, the number of parameters introduced is almost negligible. Table 3 shows that, under the same conditions, FDENet with the SRAB outperforms FDENet with the SRB on all four test sets, which fully proves the effectiveness of the SRAB.
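The pixel attention pattern just described is small enough to show in full. This sketch follows the description above (1 × 1 conv, sigmoid, element-wise multiply); the class name and channel count are our own illustrative choices.

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """Pixel attention as described: a 1x1 convolution followed by a
    sigmoid produces a per-pixel, per-channel attention map that gates
    the input features. For C channels the cost is only C*C weights
    plus C biases."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))

if __name__ == "__main__":
    pa = PixelAttention(16)
    y = pa(torch.randn(1, 16, 8, 8))
    n_params = sum(p.numel() for p in pa.parameters())
    print(tuple(y.shape), n_params)   # 16*16 weights + 16 biases = 272
```

Since the distilled branches carry progressively fewer channels, applying this gate inside the distillation steps adds only a few hundred parameters per block.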

Comparison of ESA and BFEM
The background information of remote sensing images is more important than that of natural images: it contains a variety of complex scenes, and the scales of features differ between scenes. Therefore, we propose the BFEM, which focuses on extracting the contextual information of remote sensing images. Compared with the ESA proposed in RFANet [34], our BFEM adds a lower sideband for extracting context information. To avoid introducing a large number of parameters, instead of using a large convolution kernel, we use an avg-pooling layer, a bilinear upsampling layer, and convolution layers with a kernel size of 1 × 1 to achieve this goal. Table 4 reports the results of FDENet at an upsampling factor of four with the feature-enhancement module implemented as either ESA [34] or the BFEM. Under the same conditions, our BFEM has only 37K more parameters than ESA while showing more powerful performance on all four test sets.

Analysis of SFM
The proposed SFM fuses the features reserved after each extraction and extracts features through the SRAB, which not only makes full use of the reserved features but also alleviates the problem of insufficient information flow during feature extraction. Table 5 shows the results of our ablation experiment on the SFM. Because the SFM adopts the strategy of fusing reserved features, it remains very lightweight. Its gains come not only from the effectiveness of the SFM itself, but also from the ability of the SRAB to extract features effectively. From the table, we can see that the SFM introduces only 9K parameters, while the PSNR values on the four datasets are 0.03, 0.01, 0.01, and 0.09 dB higher, respectively.

Analysis of Model Complexity
The number of parameters is an important indicator for evaluating lightweight models. From the results in Tables 1 and 2, although our parameter count exceeds those of SRCNN [9], LGCNet [17], and the advanced lightweight model FeNet [30], our performance is far ahead of theirs, which compensates for the extra parameters. In a comprehensive comparison, our parameter count is still lower than those of most models, and our performance is more competitive. In addition to parameter quantity, we also used multi-adds to evaluate the computational complexity of the network, setting the size of the query image (the HR image) to 1280 × 720. Compared with recent models such as IDN [27], LESRCNN [44], and MADNet [31], FDENet also has relatively few multi-adds.

Conclusions
In this article, we proposed a lightweight feature distillation and enhancement network for SR tasks on remote sensing images. Specifically, we proposed an SFM that can effectively alleviate the problem of insufficient information flow caused by channel separation during feature distillation. We use the designed lightweight SRAB as the main feature-extraction unit of the FDEB, which makes the network more inclined to extract high-frequency details during feature extraction without introducing a large number of parameters. After feature extraction, we enhance the features with the BFEM, which can extract the background information of remote sensing images while enhancing the features. Extensive experiments showed that our model is strongly competitive with advanced models in terms of both performance and parameter quantity. This provides an application foundation for lightweight remote sensing image super-resolution reconstruction in field surveying, individual reconnaissance, and other fields of application.