Article

Pixel-Wise Attention Residual Network for Super-Resolution of Optical Remote Sensing Images

1 College of Marine Science and Technology, China University of Geosciences, Wuhan 430074, China
2 Hubei Key Laboratory of Marine Geological Resources, China University of Geosciences, Wuhan 430074, China
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(12), 3139; https://doi.org/10.3390/rs15123139
Submission received: 24 April 2023 / Revised: 1 June 2023 / Accepted: 13 June 2023 / Published: 15 June 2023
(This article belongs to the Special Issue Convolutional Neural Network Applications in Remote Sensing II)

Abstract

Deep-learning-based image super-resolution opens a new direction for the remote sensing field to reconstruct further information and details from captured images. However, most current SR works try to improve performance by increasing the complexity of the model, which results in significant computational costs and memory consumption. In this paper, we propose a lightweight model named the pixel-wise attention residual network for optical remote sensing images, which can effectively solve the super-resolution task of multi-satellite images. The proposed method consists of three modules: the feature extraction module, the feature fusion module, and the feature mapping module. First, the feature extraction module is responsible for extracting deep features from the input spectral bands with different spatial resolutions. Second, the feature fusion module with the pixel-wise attention mechanism generates weight coefficients for each pixel on the feature map and fully fuses the deep feature information. Third, the feature mapping module aims to maintain the fidelity of the spectrum by adding the fused residual feature map directly to the up-sampled low-resolution images. Compared with existing deep-learning-based methods, the major advantage of our method is that, for the first time, the pixel-wise attention mechanism is incorporated into the task of super-resolution fusion of remote sensing images, which effectively improves the performance of the fusion network. The accuracy assessment results show that our method achieved superior performance in the root mean square error, signal-to-reconstruction ratio error, universal image quality index, and peak signal-to-noise ratio compared to competing approaches. The improvements in the signal-to-reconstruction ratio error and peak signal-to-noise ratio are significant, with respective increases of 0.15 and 0.629 dB for Sentinel-2 data, and 0.196 and 1 dB for Landsat data.

Graphical Abstract

1. Introduction

In recent decades, remote sensing images have been widely used, and satisfactory results have been achieved in target recognition [1], land cover classification [2], agricultural inspection [3], aerosol monitoring [4], and other fields [5,6,7,8]. The rise of remote sensing has prompted a growing need for images with more refined details [9]. However, it is impractical to simultaneously achieve high spatial, spectral, and temporal resolution; in practice, a single sensor must make a trade-off between them. As the spatial resolution increases, the instantaneous field of view (IFOV) decreases, and to maintain the signal-to-noise ratio of the imagery, the spectral bandwidth must be broadened, so a finer spectral resolution cannot be achieved [10]. In addition, although each sensor has a predetermined revisit period, the actual revisit period of usable images often falls short of expectations due to cloud cover, shadows, and other adverse atmospheric influences. The lack of spatial and temporal information thus limits the application potential of remote sensing images [11,12,13].
One feasible solution is to fully leverage the information correlation among multiple images to enhance low-resolution images, improving their spatial resolution and/or spectral resolution. This process incorporates richer feature information, allowing a more comprehensive and mutually interpretable representation of the same scene when combined with high-resolution images. With the development of the aerospace industry, more and more satellites have been launched for Earth observation, providing a large and sufficient data source for the enhancement of low-resolution images. Since the first Landsat satellite was launched in 1972, the Landsat program has continued to acquire surface imagery, providing millions of valuable images; it constitutes the longest record of surface observations to date and facilitates scientific investigations on a global scale [14,15]. Among them, Landsat 8, launched in 2013, and Landsat 9, launched in 2021, carry the Operational Land Imager (OLI), which includes nine bands (eight at a 30 m spatial resolution plus a 15 m panchromatic band) and achieves a revisit cycle of 16 days [16]. Landsat imagery with a 30 m spatial resolution can be used both as fine images to fuse with lower-spatial-resolution imagery and as coarse images to fuse with higher-resolution imagery. The pairing of MODIS and Landsat is typical of the former; based on the frequent revisit cycle of the MODIS satellites, the guidance of Landsat data is expected to generate daily Earth observation reflectance products [17,18,19]. In the latter case, the choice of the higher-resolution image is more extensive: it may be Google imagery [20], SPOT5 [21], or Sentinel-2 [22]. Google images have a resolution of 0.3 m, but they are updated slowly and historical images with definite dates cannot be downloaded. Although SPOT has a wealth of historical imagery, as a commercial remote sensing system, it is not freely available to the public. The Sentinel-2A satellite, launched in 2015, and the Sentinel-2B satellite, launched in 2017, form a dual-satellite system with a revisit cycle of five days [23]. Due to the similarity in band specifications, the overlapping coverage, the shared coordinate system, and open access, Sentinel-2 MSI images can be combined with Landsat OLI images to form HR–LR pairs for super-resolution enhancement [24,25].
Recently, the advancement of deep learning has brought fresh vitality to the super-resolution processing of remote sensing images. Learning-based methods take full advantage of the spectral and spatial correlation present in remote sensing images, enabling the establishment of nonlinear mappings between low-resolution (LR) and high-resolution (HR) images. Initially, research on remote sensing image super-resolution mainly revolved around the applicability of existing models to remote sensing images. Liebel et al. retrained the SRCNN model on Sentinel-2 images and demonstrated its ability to enhance single-band images from Sentinel-2 [26]. Tuna et al. employed the SRCNN and VDSR models in conjunction with the IHS transform on SPOT and Pleiades satellite images to enhance their resolution by factors of 2, 3, and 4 [27]. Since 2017, there has been a proliferation of specialized super-resolution models designed for remote sensing applications. Lei et al. introduced the local–global combined network (LGCNet) to acquire multi-level representations of remote sensing (RS) images by integrating outputs from various convolutional layers. Experiments on a publicly accessible remote sensing dataset (UC Merced) demonstrated that the proposed model exhibited superior accuracy and visual performance compared to both SRCNN and FSRCNN [28]. Xu et al. introduced the deep memory connected network (DMCN), which leverages memory connections to integrate image details with environmental information [29]. Furthermore, the incorporation of dense residual blocks serves as an effective approach to enhance the efficacy of convolutional neural network (CNN)-based architectures. Jiang et al. introduced a deep distillation recursive network (DDRN) to address the challenge of video satellite image super-resolution [30]. The DDRN comprises a series of ultra-dense residual blocks (UDB), a multi-scale purification unit (MSPU), and a reconstruction module; the MSPU specifically addresses the loss of high-frequency components during information propagation. Deeba et al. introduced the wide remote sensing residual network (WRSR) as an advanced technique for super-resolution in remote sensing; this approach aims to enhance accuracy and minimize losses through three key strategies: widening the network, reducing its depth, and implementing additional weight-normalization operations [31]. Huan et al. contend that the use of a single up-sampling operation results in the loss of LR image information; to address this issue, they propose incorporating complementary blocks of global and local features in the reconstruction structure [32]. Lu et al. proposed a multi-scale residual neural network (MRNN) based on the multi-scale features of objects in remote sensing images [33]. The feature extraction module of the network extracts information from objects at large, medium, and small scales. Subsequently, a fusion network is employed to effectively integrate the extracted features from different scales, thereby reconstructing the image in a manner that aligns with human visual perception.
Although CNN-based methods are widely used in various tasks and have achieved excellent performance, the homogeneous treatment of input data across channels limits the representational capacity of convolutional neural networks (CNNs). To address this, attention mechanisms have emerged as a viable solution, enhancing the representation of feature information and enabling the network to focus on salient features while suppressing irrelevant ones. The squeeze-and-excitation (SE) block, proposed by Hu et al., enhances the representation capability of the network by explicitly modeling the interdependencies between channels and adaptively recalibrating the channel feature responses [34]. Zhang et al. proposed the residual channel attention network (RCAN), which incorporates a residual-in-residual (RIR) structure, enabling the network to emphasize the learning of high-frequency information [35]. Woo et al. presented the convolutional block attention module (CBAM), which uses channel and spatial attention modules to sequentially infer an attention map that is then multiplied with the input feature map to achieve adaptive feature refinement [36]. Dai et al. proposed a second-order attention network (SAN) to enhance feature representation and feature correlation learning; its second-order channel attention (SOCA) module uses global covariance pooling to learn feature interdependencies and generate more discriminative representations [37]. Niu et al. argued that channel attention overlooks the correlation between different layers and presented the holistic attention network (HAN), which consists primarily of a layer attention module (LAM) and a channel–spatial attention module (CSAM). The LAM adaptively highlights hierarchical features by considering the correlations among layers, while the CSAM learns the confidence level at each channel position to selectively capture more informative features [38]. The majority of the aforementioned approaches perform satisfactorily; however, the intricate architecture of their attention modules results in elevated computational costs for remote sensing images. By combining channel attention and spatial attention in an optimized manner, pixel attention acquires feature information with fewer parameters and a reduced computational complexity [39], making it well suited to super-resolution tasks for remote sensing images. In this paper, we propose a lightweight pixel-wise attention residual network (PARNet) for remote sensing image enhancement. Compared with existing CNN-based sharpening methods, the main contributions of this paper can be summarized as follows:
  • We propose a lightweight SR network to improve the spatial resolution of optical remote sensing images.
  • Our method can raise the spatial resolution of the 30 m Landsat 8 bands to 10 m, and it is also applicable to other satellite image super-resolution enhancement tasks, such as Sentinel-2.
  • For the first time, the pixel attention mechanism is included in the super-resolution fusion task of remote sensing images.
  • Our method effectively identifies land-use and land-cover (LULC) change, atmospheric cloud disturbance, and surface smog, and makes reasonable judgments in the fusion process.
This paper is organized in the following way. As shown in Figure 1, the structure of the proposed PARNet network is presented in detail in Section 2. In Section 3, we evaluate the experimental results quantitatively and qualitatively in terms of accuracy metrics and visual presentation, verify the necessity of the pixel-wise attention mechanism and the reconstruction module, and test the anti-interference capability of our model on Landsat data. Section 4 gives a full discussion of the whole experiment and suggests the next focus. Finally, Section 5 concludes the paper.

2. Proposed Method

In this paper, we propose a method termed the pixel-wise attention residual network (PARNet). As shown in Figure 2, the proposed method consists of three modules: the feature extraction module, the feature fusion module, and the feature mapping module. The details are introduced as follows.

2.1. Feature Extracting

The feature extraction module first extracts the shallow features from the concatenated LR and HR bands using a convolution operation, followed by deeper features from a series of aligned Resblocks. All features are connected in the channel dimension and the number of channels is subsequently reduced by a convolution layer.
The structure of our Resblock is shown in Figure 3. Deeper and more complex feature information is extracted by two identical parallel branches and then fused by the output convolution layer. Each branch consists of a convolution layer, an activation layer, and a constant scaling layer. In addition, to transmit the spectral information, the input of the Resblock is added directly to the output through a skip connection. The operation of the m-th Resblock can be expressed as:
$$
\begin{cases}
x_{m1} = \lambda \varphi (w_{m1} \times x_{m-1} + b_{m1}) \\
x_{m2} = \lambda \varphi (w_{m2} \times x_{m-1} + b_{m2}) \\
x_{cat} = \mathrm{Cat}(x_{m1}, x_{m2}) \\
x_{m} = w_{m3} \times x_{cat} + b_{m3} + x_{m-1},
\end{cases}
$$
where x_{m1}, x_{m2}, and x_{cat} are intermediate results, λ is the residual scaling factor of 0.2, {w_m, b_m} are the weights and biases of the convolutions in the Resblock, x_{m−1} and x_m denote the input and output of the Resblock, respectively, and φ refers to the LeakyReLU activation function, which mitigates the dying ReLU problem.
The outputs of all Resblocks are concatenated along the channel dimension. To prevent the generation of an overly complex model, the number of channels is controlled by a 3 × 3 convolutional layer to produce a deeper feature map before feature fusion. These operations can be expressed as follows:
$$
\begin{cases}
x_{cat} = \mathrm{Cat}(x_{1}, x_{2}, \ldots, x_{m}) \\
x_{out} = w_{out} \times x_{cat} + b_{out},
\end{cases}
$$
where x_1, x_2, …, x_m are the outputs of the corresponding Resblocks, {w_out, b_out} are the weight matrix and bias of the convolution, and x_out is the final deep feature map of the whole feature extraction module.
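The paper's networks are implemented in Keras with TensorFlow (see Section 3.2), so the feature extraction module can be sketched roughly as below. The kernel sizes inside the Resblock, the filter count, the number of Resblocks, and the LeakyReLU slope are our assumptions for illustration; only the two-branch structure, the 0.2 residual scaling, the concatenation, and the 3 × 3 channel-reducing convolution follow the text.

```python
from tensorflow.keras import layers


def resblock(x_prev, n_filters=64, res_scale=0.2):
    """Two identical parallel branches (conv -> LeakyReLU -> constant scaling),
    concatenation, a fusing convolution, and a skip connection to the input."""
    def branch(t):
        t = layers.Conv2D(n_filters, 3, padding="same")(t)
        t = layers.LeakyReLU(alpha=0.2)(t)                 # slope 0.2 is an assumption
        return layers.Lambda(lambda v: v * res_scale)(t)   # residual scaling, lambda = 0.2

    x_m1, x_m2 = branch(x_prev), branch(x_prev)
    x_cat = layers.Concatenate()([x_m1, x_m2])
    x_out = layers.Conv2D(n_filters, 3, padding="same")(x_cat)
    return layers.Add()([x_out, x_prev])                   # skip connection: + x_{m-1}


def feature_extraction(x_in, n_blocks=4, n_filters=64):
    """Stack Resblocks, concatenate all block outputs along the channel axis,
    and reduce the channel count with a 3 x 3 convolution."""
    x = layers.Conv2D(n_filters, 3, padding="same")(x_in)  # shallow features
    outputs = []
    for _ in range(n_blocks):
        x = resblock(x, n_filters)
        outputs.append(x)
    x_cat = layers.Concatenate()(outputs)
    return layers.Conv2D(n_filters, 3, padding="same")(x_cat)
```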

2.2. Pixel-Wise Attention Scheme

Before presenting the feature fusion module, it is necessary to introduce the pixel-wise attention scheme. Inspired by channel attention [34] and spatial attention [36], the pixel-wise attention scheme was first proposed in the AIM 2020 challenge [39]. As shown in Figure 4, pixel-wise attention uses only a 1 × 1 convolution layer and a sigmoid function to generate a C × H × W attention map, which is then multiplied by the input features. Note that C is the number of channels, while H and W are the height and width of the features.
The PA layer can be described as
$$
x_{k} = f_{PA}(x_{k-1}) \times x_{k-1},
$$
where f_PA(·) denotes a 1 × 1 convolution layer followed by a sigmoid function, and x_{k−1} and x_k denote the input and output feature maps of the PA layer, respectively.
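A minimal Keras sketch of the PA layer, assuming the attention map keeps the same number of channels as the input (matching the C × H × W map described above):

```python
from tensorflow.keras import layers


def pixel_attention(x_in):
    """Pixel-wise attention (PA): a 1 x 1 convolution followed by a sigmoid
    produces a C x H x W attention map, which is multiplied element-wise
    with the input features."""
    attention = layers.Conv2D(int(x_in.shape[-1]), 1, activation="sigmoid")(x_in)
    return layers.Multiply()([attention, x_in])
```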

2.3. Feature Fusing

It should be noted that in previous work on SR, the reconstruction module mainly comprised up-sampling, a fully connected layer, and a convolution layer [40,41,42]. Few researchers have placed attention mechanisms at the feature fusion stage.
In this work, we propose the PA-based reconstruction (PA-REC) component. Experimental results show that introducing PA significantly improves the final performance at little parameter cost. As shown in Figure 5, a PA layer is adopted between two convolution layers, each of which is followed by a LeakyReLU activation. Subsequently, a convolution layer matches the number of channels of the spatial residual image to that of the up-sampled LR bands, followed by a constant scaling layer. The PA–REC can be formulated as follows:
$$
\begin{cases}
x_{f1} = \varphi (w_{f1} \times x_{1} + b_{f1}) \\
x_{f2} = \varphi \big( f_{PA}(x_{f1}) \times x_{f1} \big) \\
x_{f3} = \varphi (w_{f3} \times x_{f2} + b_{f3}) \\
x_{2} = \lambda (w_{f4} \times x_{f3} + b_{f4}),
\end{cases}
$$
where x_1 and x_2 denote the input and output of this component, {w_f, b_f} denote the weights and biases of the convolution layers of the PA-REC, and x_{f1}, x_{f2}, and x_{f3} are the intermediate results.
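Under the same assumptions about kernel sizes and filter counts as the earlier sketches, the PA-REC component can be expressed as follows; `n_out_channels` is the channel count of the up-sampled LR bands.

```python
from tensorflow.keras import layers


def pa_rec(x_in, n_out_channels, n_filters=64, res_scale=0.2):
    """PA-based reconstruction: conv + LeakyReLU, pixel attention, another
    conv + LeakyReLU, then a convolution that matches the channel count of
    the up-sampled LR bands, followed by constant scaling."""
    x_f1 = layers.LeakyReLU(alpha=0.2)(layers.Conv2D(n_filters, 3, padding="same")(x_in))
    attention = layers.Conv2D(n_filters, 1, activation="sigmoid")(x_f1)       # f_PA
    x_f2 = layers.LeakyReLU(alpha=0.2)(layers.Multiply()([attention, x_f1]))
    x_f3 = layers.LeakyReLU(alpha=0.2)(layers.Conv2D(n_filters, 3, padding="same")(x_f2))
    x_res = layers.Conv2D(n_out_channels, 3, padding="same")(x_f3)
    return layers.Lambda(lambda v: v * res_scale)(x_res)                      # constant scaling
```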

2.4. Feature Mapping

Given that the HR target has the same spectral content as the input LR, the fused residual feature map is added directly to the up-sampled low-resolution images to maintain the fidelity of the spectrum.
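A minimal sketch of this feature mapping step; the interpolation used to up-sample the LR bands is an assumption, since the text only states that the LR images are up-sampled before the addition.

```python
from tensorflow.keras import layers


def feature_mapping(residual, lr_bands, factor):
    """Feature mapping: up-sample the LR bands to the target grid and add the
    fused residual map directly, so the spectral content of the input is kept."""
    lr_up = layers.UpSampling2D(size=factor, interpolation="bilinear")(lr_bands)
    return layers.Add()([lr_up, residual])
```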

3. Experiments

3.1. Data

A brief introduction to Sentinel-2, Landsat 8, and Landsat 9 is given here. Sentinel-2 consists of two identical satellites, Sentinel-2A and Sentinel-2B, with a revisit period of ten days for a single satellite and five days for the two satellites combined. Sentinel-2 carries a multispectral imager (MSI) covering 13 spectral bands, with 4 bands at 10 m (B2–B4, B8), 6 bands at 20 m (B5–B7, B8a, B11–B12), and 3 bands at 60 m (B1, B9–B10). The Landsat 8 satellite carries an operational land imager (OLI) consisting of seven multispectral bands (B1–B7) with a resolution of 30 m, a panchromatic band at 15 m (B8), and a cirrus band at 30 m (B9). Landsat 9 builds upon the observation capabilities established by Landsat 8, raising the radiometric resolution of the OLI-2 sensor from the 12 bits of Landsat 8 to 14 bits. This heightened radiometric resolution enables the sensor to discern more subtle variations, particularly in dimmer regions such as bodies of water or dense forests. For convenience, we refer to Landsat 8 and 9 collectively as Landsat in the following sections. The Sentinel-2 and Landsat band parameters are summarized in Table 1 and Table 2.
Four level 1C products of Sentinel-2 and Landsat were acquired from the European Space Agency (ESA) Copernicus Open Access Center (https://scihub.copernicus.eu/, accessed on 31 December 2022) and the United States Geological Survey (USGS) Earth Explorer (https://earthexplorer.usgs.gov/, accessed on 1 January 2023), respectively. Four pairs of Landsat–Sentinel images were selected globally for training, as shown by the black boxes in Figure 6A, with Figure 6B–E showing the geographic locations in detail. Vignettes of the four image pairs and their respective acquisition times are shown in the top right panels. Of these, the first group had the longest collection interval between the Landsat and Sentinel-2 imagery, at 13 days apart, followed by the third group, at 1 day apart; the remaining two groups were collected on the same day. In addition, the first two pairs represent an agricultural region of Florida, USA, and the Andes in western Argentina, both of which have a simple composition of geomorphological types, while the other two cover the Yangtze River Delta and the Pearl River Delta of China, respectively, whose landforms are complex and contain at least three different cover types. All images were captured under low cloud cover conditions.

3.2. Experimental Details

N2× represents the 20 m→10 m network for Sentinel-2 images and N3× represents the 30 m→10 m network for Landsat images. The two networks were trained separately with different super-resolution factors: f = 2 for N2× and f = 3 for N3×.
For the N2× network, we separated the Sentinel-2 bands into three sets: A = {B2, B3, B4, B8} (GSD = 10 m), B = {B5, B6, B7, B8a, B11, B12} (GSD = 20 m), and C = {B1, B9} (GSD = 60 m). As B10 has poor radiometric quality, it was excluded. The set C was also excluded here, as we only focus on the N2× network.
For the N3× network, the Landsat images were first cropped to the same coverage as Sentinel-2, which is 7320 × 7320 pixels in the panchromatic band (GSD = 15 m) and 3660 × 3660 pixels in the multispectral bands (GSD = 30 m). The Landsat bands were also separated into two sets, S = {B1, B2, B3, B4, B5, B6, B7} (GSD = 30 m) and P = {B8} (GSD = 15 m). Due to hardware limitations, we super-resolved only a subset of the spectral bands in S, namely D = {B1, B2, B3, B4}.
N2× and N3× are similar in many ways except for the input. N2× takes the bands in B as the LR input and produces SR images with a 10 m GSD using the HR information from the 10 m bands, while N3× super-resolves the 30 m bands in D to 10 m using the information from the panchromatic band B8 and the 10 m bands of Sentinel-2.
Since 10 m GSD images corresponding to the 20 m bands do not exist in reality, following Wald's protocol [43], down-sampled images are used as the input to generate SR images at the original GSD, and the original images are then used as reference data to assess the accuracy of the trained model. As shown in Figure 7, a Gaussian filter with a standard deviation of σ = 1/s pixels was used to blur the original image. Then, we down-sampled the blurred image by averaging s × s windows, where s = 2 for N2× and s = 3 for N3×.
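A sketch of this degradation step with NumPy and SciPy, assuming a single-band 2-D array; the cropping to a multiple of s is an implementation detail added here for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def degrade(band, s):
    """Simulate the LR input per Wald's protocol: Gaussian blur with a standard
    deviation of 1/s pixels, then down-sampling by averaging s x s windows."""
    blurred = gaussian_filter(band.astype(np.float64), sigma=1.0 / s)
    h = (blurred.shape[0] // s) * s
    w = (blurred.shape[1] // s) * s
    return blurred[:h, :w].reshape(h // s, s, w // s, s).mean(axis=(1, 3))


# e.g. s = 2 for the N2x network and s = 3 for the N3x network
# lr_simulated = degrade(band_20m, s=2)   # band_20m is a hypothetical 2-D array
```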
As shown in Table 3, we segmented the entire down-sampled image into small patches without duplicate pixels. The patch size for N2× is 90 × 90 for the 10 m bands and 45 × 45 for the 20 m bands. For N3×, the patch sizes of the input 10 m bands of Sentinel-2, the panchromatic band, and the multispectral bands of Landsat are 60 × 60, 40 × 40, and 20 × 20, respectively. A total of 3721 patches were sampled per image; 80% are used as training data, 10% as validation data, and the remaining 10% as testing data.
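A minimal sketch of the patch extraction and the 80/10/10 split; the helper names are hypothetical and the patch ordering is an assumption.

```python
import numpy as np


def split_patches(image, size):
    """Cut an (H, W, C) array into non-overlapping size x size patches."""
    rows, cols = image.shape[0] // size, image.shape[1] // size
    return np.stack([image[r * size:(r + 1) * size, c * size:(c + 1) * size]
                     for r in range(rows) for c in range(cols)])


def split_dataset(patches):
    """80% training / 10% validation / 10% testing split of the sampled patches."""
    n = len(patches)
    return np.split(patches, [int(0.8 * n), int(0.9 * n)])
```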
All networks are implemented in the Keras framework with TensorFlow 2.4 as the back end. Training is performed on the GPU of a Windows workstation with an Intel Core i9 CPU (3.6 GHz), an NVIDIA GeForce RTX 2080, and 16 GB of RAM. Each model was trained for 200 epochs with a batch size of 10, the L1-norm as the loss function, and Nadam [44] with β1 = 0.9, β2 = 0.999, and ɛ = 10−8 as the optimizer. We use an initial learning rate of lr = 1 × 10−4, which is halved whenever the validation loss does not decrease for one epoch. To ensure the stability of the optimization, we normalize the raw 0–10,000 reflectance values to 0–1 prior to processing.
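The optimizer, loss, and learning-rate schedule described above can be expressed in Keras roughly as follows; `model`, `x_train`, `y_train`, `x_val`, and `y_val` are placeholders for the compiled PARNet model and the patch datasets, not code from the paper.

```python
import tensorflow as tf

# Nadam optimizer with the stated hyperparameters and the L1 loss
optimizer = tf.keras.optimizers.Nadam(learning_rate=1e-4,
                                      beta_1=0.9, beta_2=0.999, epsilon=1e-8)
model.compile(optimizer=optimizer, loss="mean_absolute_error")   # L1-norm loss

# Halve the learning rate whenever the validation loss stops decreasing for one epoch
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss",
                                                 factor=0.5, patience=1)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=200, batch_size=10, callbacks=[reduce_lr])
```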

3.3. Quantitative Evaluation Metrics

To assess the effectiveness of our method, we take the deep-learning-based methods SRCNN and DSen2 as the benchmark methods. All parameters of the baselines are set as suggested in the original publications. In this paper, we adopted five indicators for quantitative assessment, namely the root mean square error (RMSE), signal-to-reconstruction ratio error (SRE), universal image quality index (UIQ) [45], peak signal-to-noise ratio (PSNR), and the Brenner gradient function (Brenner). Among these, RMSE, SRE, UIQ, and PSNR are employed to evaluate the performance of the downscaling experiments, while the Brenner metric is used to evaluate the results at the original scale. In what follows, we use y to represent the reference image and x to represent the super-resolved image, and µ and σ stand for the mean and standard deviation, respectively.
The root mean square error (RMSE) measures the overall spectral differences between the reference image and the super-resolved image; the smaller, the better (n is the number of pixels in x).
$$
\mathrm{RMSE}(x, y) = \sqrt{\frac{1}{n} \sum (y - x)^{2}}
$$
The signal-to-reconstruction ratio error (SRE) is measured in decibels (dB), which measures the error relative to the average image intensity, making the error comparable between images of different brightness; the higher, the better.
$$
\mathrm{SRE}(x, y) = 10 \log_{10} \frac{\mu_{x}^{2}}{\lVert y - x \rVert^{2} / n}
$$
The universal image quality index (UIQI), which is a general objective image quality index, portrays the distortion of the reconstructed image with respect to the reference image in three aspects: correlation loss, brightness distortion, and contrast distortion. UIQI is unitless, with a maximum value of 1. In general, the larger the UIQI value, the closer the reconstructed image is to the reference image.
$$
\mathrm{UIQI}(x, y) = \frac{\sigma_{xy}}{\sigma_{x} \sigma_{y}} \cdot \frac{2 \mu_{x} \mu_{y}}{\mu_{x}^{2} + \mu_{y}^{2}} \cdot \frac{2 \sigma_{x} \sigma_{y}}{\sigma_{x}^{2} + \sigma_{y}^{2}}
$$
The peak signal-to-noise ratio (PSNR) measures the quality of the reconstructed image, and the value varies from 0 to infinity; the higher the PSNR, the better the quality of the reconstructed image, as more detailed information is recovered by the model based on the coarse image. Here, Max(y) takes the maximal value of y.
$$
\mathrm{PSNR}(x, y) = 20 \log_{10} \left( \frac{\mathrm{Max}(y)}{\mathrm{RMSE}(y, x)} \right)
$$
The Brenner gradient function serves as a widely used metric for assessing image sharpness. It is calculated as the sum of squared differences between grey values two pixels apart. A higher Brenner value indicates a sharper image.
$$
\mathrm{Brenner} = \sum_{y} \sum_{x} \big( f(x + 2, y) - f(x, y) \big)^{2}
$$
where f(x, y) represents the gray value of pixel (x, y) corresponding to image f.
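For reference, the five metrics can be implemented in NumPy as below. Note that the UIQI is shown in its global, single-window form (the referenced index [45] is normally computed over sliding windows and averaged), and the direction of the two-pixel shift in the Brenner gradient is an assumption.

```python
import numpy as np


def rmse(x, y):
    return np.sqrt(np.mean((y - x) ** 2))


def sre(x, y):
    """Signal-to-reconstruction ratio error in dB."""
    return 10 * np.log10(np.mean(x) ** 2 / np.mean((y - x) ** 2))


def uiqi(x, y):
    """Global (single-window) form of the universal image quality index."""
    mx, my, sx, sy = x.mean(), y.mean(), x.std(), y.std()
    sxy = np.mean((x - mx) * (y - my))
    return (sxy / (sx * sy)) * (2 * mx * my / (mx ** 2 + my ** 2)) \
        * (2 * sx * sy / (sx ** 2 + sy ** 2))


def psnr(x, y):
    return 20 * np.log10(y.max() / rmse(x, y))


def brenner(f):
    """Brenner gradient: sum of squared grey-value differences two pixels apart."""
    return np.sum((f[:, 2:] - f[:, :-2]) ** 2)
```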

3.4. Experimental Results

In this section, we compare our results with the bicubic up-sampling method, SRCNN, and DSen2, using five metrics: RMSE, SRE, UIQ, PSNR, and Brenner. SRCNN, a pioneering deep learning model in the field of super-resolution reconstruction, possesses a simple structure, with only three convolutional layers having kernel sizes of 9 × 9, 1 × 1, and 5 × 5, respectively [46]. To achieve networks capable of super-resolving arbitrary Sentinel-2 images without additional training, Lanaras et al. conducted extensive sampling on a global scale and proposed the DSen2 model. This model employs a series of residual blocks comprising convolutional layers, activation functions, and constant scaling to extract high-frequency information from the input, which is then added to the LR image as the output. Being an end-to-end learning network for image data, DSen2 can also be applied to other super-resolution tasks involving multispectral sensors of varying resolutions [47].

3.4.1. Evaluation at a Lower Scale

N2× Sentinel-2 SR. The sharpening results for each Sentinel-2 dataset on the test set are shown in Table 4, where the values are averages over the 20 m bands. It can be seen that the deep-learning-based methods are significantly better than the interpolation-based bicubic up-sampling. Compared with SRCNN, DSen2 further improves accuracy, especially in PSNR, showing a significant improvement of 7.8694 dB on the Andes dataset. Our method slightly outperforms DSen2, with an increase of ~0.15 dB in SRE and ~0.6294 dB in PSNR.
N3× Landsat SR. Table 5 shows the average sharpening results for Landsat. Our method shows a visible advantage over the others on Landsat super-resolution, with a 1.8 × 10−4 decrease in RMSE and increases of ~0.1963 dB in SRE and ~1.006 dB in PSNR.
Figure 8 and Figure 9 further show the density scatter plots of the reference bands against the fused bands generated by DSen2, SRCNN, and PARNet for bands 1–4. Validation data from the Pearl River Delta on 5 December 2021 show that the fusion results are generally close to the true values across all spectral bands, and our method is the closest to the true values. The advantage of our method is more apparent in the Yangtze River Delta on 26 February 2022, where our results remain symmetrically distributed on either side of the diagonal, while the scatter plots of DSen2 and SRCNN show significant deviations. This is because DSen2 and SRCNN do not sufficiently mine the deep features of the image. Of these, the architecture of SRCNN is the shallowest, consisting of only three convolutional layers, which cannot extract the deep features of the image; as a result, the slopes of its regression equations in the scatter plots are also the lowest. Note that the slopes of DSen2 are slightly larger than those of SRCNN, but they generally remain below 0.6, so the share of deep features that is lost should not be underestimated. For our method, the slope of the regression equation is close to one on both B1 and B4 and rises to 0.881 on B2, with only slight overfitting on B3. These results indicate that our method is far superior to DSen2 and SRCNN in its ability to explore deep features.
In addition to the advantages in quality evaluation metrics and density scatter plots, our method also performs satisfactorily in terms of the number of model parameters and model size. Table 6 compares the number of model parameters and the model size of SRCNN, DSen2, and PARNet, where the model size is given in MB. Among them, the SRCNN model has the simplest structure, with only three convolutional layers, so its parameter count and memory size are the smallest, at 52,004 and 644 KB, respectively. Our PARNet follows, with 836,356 parameters and 9.85 MB. The DSen2 model has the largest number of parameters and memory size, at 1,786,116 and 20.5 MB, respectively; the parameter count and memory size of our method are less than half of those of DSen2. Combined with the preceding quality evaluation, it can be concluded that our method is a lightweight and accurate model.

3.4.2. Evaluation at the Original Scale

In order to prove that our method is equally applicable to real-scale Sentinel-2 data, we performed an additional verification on the original dataset without down-sampling. As shown in Figure 10, the super-resolved image is clearly sharper and brings out additional detail compared to the respective initial input. Figure 11 shows a further check on the original Landsat dataset. Compared to the original 30 m scene, the super-resolved image carries over the extensive feature information from the Sentinel-2 imagery; previously blurry patches of color become clear object outlines. As can be seen from the results, the super-resolved images show higher similarity to the original RGB images.
To human perception, there is little difference among the super-resolution results obtained from the three deep-learning-based methods, whether using Sentinel-2 data or Landsat images. Therefore, we performed additional evaluations using the Brenner gradient, as shown in Table 7 and Table 8. It is evident that the Brenner values of the deep learning methods exhibit a clear advantage. Furthermore, the results generated by DSen2 are comparable to those of our PARNet model, and both surpass SRCNN. The PARNet model slightly outperforms DSen2 on the first three bands of Sentinel-2 and consistently outperforms DSen2 across all four bands of Landsat. Considering that our network width is half that of DSen2 and the model parameters are also less than half of DSen2's, we assert that our proposed model is more cost effective.

3.5. The Impact of the PA–REC Block

Here, we confirm the influence of the PA–REC block on network performance through an ablation study. First, we removed the PA–REC block from our PARNet network and denote this variant as N3×-1. Second, N3×-2 contains the PA–REC block without the PA layer. Finally, N3×-3 has the same structure as PARNet.
As shown in Table 9, after adding the PA–REC block, all evaluation metrics improve compared to the initial model N3×-1, particularly SRE and PSNR, which increase by 0.65365 dB and 1.95643 dB, respectively. In addition, even without the PA layer, the REC block consisting of convolution layers, activation functions, and constant scaling still achieves a high accuracy, and the addition of the PA layer yields the best performance.

3.6. Robustness of the Model

Due to variations in the sampling times of multiple sensors, the images captured by different sensors are not completely identical and exhibit dissimilarities. These dissimilarities manifest in various aspects, including changes in land use and land cover (LULC), atmospheric cloud cover, traces of human activity, etc. In the case of low-resolution ground truth (GT) images, the divergent high-resolution (HR) data can provide misleading guidance. Consequently, the ability to mitigate the interference caused by error information becomes a critical challenge for multi-sensor image super-resolution, and it reflects the robustness of the model. This section investigates the interference resistance of our model to error messages.
As shown in Figure 12, differences between images from different sampling times are common, with agricultural land changing from vegetation to bare soil in only a single day (a–c), not to mention longer time intervals. Compared with (d), the bare land in (f) increased as a result of crop harvesting. In addition to LULC changes, traces of human activity also contribute to these differences. On 18 January 2022, the Sentinel-2 MSI sensor captured heavy smoke from ground burning in (g), while the Landsat satellite, arriving 13 days later, missed this record. On the same day, atmospheric cloud cover prevented Sentinel-2 from capturing a clear image in (j), which the later Landsat pass avoided, obtaining a cloud-free image. Even though there are many differences between the images captured by the Sentinel-2 MSI and Landsat OLI, the fused images still demonstrate a tendency toward the original state while achieving richer spatial detail. These findings indicate that our network is capable of identifying changing scenes and making reasonable judgments about the true appearance of the changed region.

4. Discussion

To meet the demand for high spatial and temporal resolution in remote sensing images, we proposed a lightweight pixel-wise attention residual network (PARNet) for remote sensing super-resolution tasks. The network harmonizes the differences in spatial resolution between different sensors and adds spatial detail to low- and medium-spatial-resolution sensor images with the help of high-spatial-resolution sensor images, thereby achieving a level of spatial resolution comparable to that of the high-spatial-resolution images. The resulting remote sensing products with high spatial and temporal resolution can better serve applications such as change detection and time series analysis.
In comparison to existing models such as SRCNN and DSen2, our algorithm puts more effort into mining rich local characteristics. As shown in Figure 8 and Figure 9, the predicted values from our method are significantly closer to the reference values on both the Pearl River Delta and Yangtze River Delta datasets. Most importantly, we introduce the pixel-wise attention mechanism into the super-resolution of remote sensing images for the first time. Pixel-wise attention, as a combination of spatial attention and channel attention, provides a more precise regulation of the correlation between feature information, thereby improving the learning and reconstruction efficiency of the network. As Table 9 indicates, the fusion network obtained the best performance when the PA layer was delivered to the REC block. In addition, although we introduced the pixel-wise attention mechanism into the network framework, the parameter count of our network is kept at around 836 K, which is less than half of the 1786 K parameters of DSen2, and our network outperforms DSen2 in all of the quality assessment metrics referenced in this paper. These results reveal that our network achieves an improved balance between convergence performance and computational cost.
Due to the nonuniform acquisition times, the images collected by various sensors are not entirely identical. These variances include changes in LULC, atmospheric cloud cover, traces of human activity, etc. For low-resolution GT images, the discrepant HR data provide a kind of misleading guidance. Therefore, the ability to avoid the interference of erroneous information is a problem that multi-sensor image super-resolution must face, and it reflects the robustness of the model. We also explored the resistance of our model to such erroneous information. The results in Figure 12 indicate that our method can successfully identify changing scenes and make reasonable judgments about the true appearance of the changed region, which indicates the high robustness of our model. In future research, a greater emphasis will be placed on establishing super-resolution products of remote sensing images with high spatial and temporal coverage, and applying them to specific environmental monitoring tasks to give super-resolution technology a greater role.

5. Conclusions

In this paper, we proposed a lightweight pixel-wise attention residual network (PARNet) for super-resolving optical remote sensing images. The proposed method is generalizable and can be applied to different super-resolution scenarios of remote sensing images. Compared with existing deep-learning-based methods, the major advantage of our method is that we incorporate the pixel-wise attention mechanism in the feature fusion step, which effectively improves the performance of the fusion network. The 3D feature map generated by pixel attention adaptively adjusts the pixel-level features in all feature maps to enhance the more important feature weights, which significantly improves the performance of the network. Furthermore, our model allows for effective control over the number of parameters, thereby reducing the computational costs associated with remote sensing image super-resolution tasks. Additionally, our model exhibits strong robustness against the interference of erroneous HR information. Our model shows significant improvements in SRE and PSNR on the four test datasets; specifically, it achieves improvements of 0.15 dB and 0.629 dB for Sentinel-2 data, and 0.196 dB and 1 dB for Landsat data. These results indicate that the proposed method is competitive with existing methods.

Author Contributions

Y.C. designed the research framework, analyzed the results, and wrote the manuscript. J.C. and G.C. provided assistance in the preparation work and validation work. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (42274012).

Data Availability Statement

The proposed method and the data used in this paper are available at https://github.com/Jessica-1997/PARNet.git, accessed on 31 May 2023.

Acknowledgments

The authors would like to thank the editor and the anonymous reviewers for their critical and constructive comments and suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, J.; Chen, H.; Wang, Y. Multi-Source Remote Sensing Image Fusion for Ship Target Detection and Recognition. Remote Sens. 2021, 13, 4852.
  2. Chen, J.; Sun, B.; Wang, L.; Fang, B.; Chang, Y.; Li, Y.; Zhang, J.; Lyu, X.; Chen, G. Semi-Supervised Semantic Segmentation Framework with Pseudo Supervisions for Land-Use/Land-Cover Mapping in Coastal Areas. Int. J. Appl. Earth Obs. Geoinf. 2022, 112, 102881.
  3. Wang, C.; Wang, Q.; Wu, H.; Zhao, C.; Teng, G.; Li, J. Low-Altitude Remote Sensing Opium Poppy Image Detection Based on Modified Yolov3. Remote Sens. 2021, 13, 2130.
  4. Chen, H.; Cheng, T.; Gu, X.; Li, Z.; Wu, Y. Evaluation of Polarized Remote Sensing of Aerosol Optical Thickness Retrieval over China. Remote Sens. 2015, 7, 13711–13728.
  5. Kim, D.H.; Sexton, J.O.; Noojipady, P.; Huang, C.; Anand, A.; Channan, S.; Feng, M.; Townshend, J.R. Global, Landsat-Based Forest-Cover Change from 1990 to 2000. Remote Sens. Environ. 2014, 155, 178–193.
  6. Senf, C.; Pflugmacher, D.; Heurich, M.; Krueger, T. A Bayesian Hierarchical Model for Estimating Spatial and Temporal Variation in Vegetation Phenology from Landsat Time Series. Remote Sens. Environ. 2017, 194, 155–160.
  7. Qin, Q.; Wu, Z.; Zhang, T.; Sagan, V.; Zhang, Z.; Zhang, Y.; Zhang, C.; Ren, H.; Sun, Y.; Xu, W.; et al. Optical and Thermal Remote Sensing for Monitoring Agricultural Drought. Remote Sens. 2021, 13, 5092.
  8. Liu, D.; Ye, F.; Huang, X.; Zhang, J.; Zhao, Y.; Huang, S.; Zhang, J. A New Concept and Practice of Remote Sensing Information Application: Post Remote Sensing Application Technology and Application Case to Geology. In Proceedings of the Remote Sensing of the Environment: 15th National Symposium on Remote Sensing of China, Guiyang, China, 9 June 2006; SPIE: Bellingham, WA, USA; Volume 6200, p. 620002.
  9. Ozdogan, M.; Yang, Y.; Allez, G.; Cervantes, C. Remote Sensing of Irrigated Agriculture: Opportunities and Challenges. Remote Sens. 2010, 2, 2274–2304.
  10. Gao, D.; Liu, D.; Xie, X.; Wu, X.; Shi, G. High-Resolution Multispectral Imaging with Random Coded Exposure. J. Appl. Remote Sens. 2013, 7, 073695.
  11. Gao, G.; Gu, Y. Multitemporal Landsat Missing Data Recovery Based on Tempo-Spectral Angle Model. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3656–3668.
  12. Chivasa, W.; Mutanga, O.; Biradar, C. Application of Remote Sensing in Estimating Maize Grain Yield in Heterogeneous African Agricultural Landscapes: A Review. Int. J. Remote Sens. 2017, 38, 6816–6845.
  13. Gao, F.; Hilker, T.; Zhu, X.; Anderson, M.; Masek, J.; Wang, P.; Yang, Y. Fusing Landsat and MODIS Data for Vegetation Monitoring. IEEE Geosci. Remote Sens. Mag. 2015, 3, 47–60.
  14. Ju, J.; Roy, D.P. The Availability of Cloud-Free Landsat ETM+ Data over the Conterminous United States and Globally. Remote Sens. Environ. 2008, 112, 1196–1211.
  15. Wulder, M.A.; Loveland, T.R.; Roy, D.P.; Crawford, C.J.; Masek, J.G.; Woodcock, C.E.; Allen, R.G.; Anderson, M.C.; Belward, A.S.; Cohen, W.B.; et al. Current Status of Landsat Program, Science, and Applications. Remote Sens. Environ. 2019, 225, 127–147.
  16. Knight, E.J.; Kvaran, G. Landsat-8 Operational Land Imager Design, Characterization and Performance. Remote Sens. 2014, 6, 10286–10305.
  17. Zheng, Y.; Song, H.; Sun, L.; Wu, Z.; Jeon, B. Spatiotemporal Fusion of Satellite Images via Very Deep Convolutional Networks. Remote Sens. 2019, 11, 2701.
  18. Li, J.; Li, Y.; Cai, R.; He, L.; Chen, J.; Plaza, A. Enhanced Spatiotemporal Fusion via MODIS-Like Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5610517.
  19. Gao, F.; Masek, J.; Schwaller, M.; Hall, F. On the Blending of the Landsat and MODIS Surface Reflectance: Predicting Daily Landsat Surface Reflectance. IEEE Trans. Geosci. Remote Sens. 2006, 44, 2207–2218.
  20. Marques, A.; Rossa, P.; Horota, R.K.; Brum, D.; De Souza, E.M.; Aires, A.S.; Kupssinsku, L.; Veronez, M.R.; Gonzaga, L.; Cazarin, C.L. Improving Spatial Resolution of LANDSAT Spectral Bands from a Single RGB Image Using Artificial Neural Network. In Proceedings of the International Conference on Sensing Technology (ICST), Sydney, NSW, Australia, 2–4 December 2019.
  21. Song, H.; Huang, B.; Liu, Q.; Zhang, K. Improving the Spatial Resolution of Landsat TM/ETM+ through Fusion with SPOT5 Images via Learning-Based Super-Resolution. IEEE Trans. Geosci. Remote Sens. 2015, 53, 1195–1204.
  22. Shao, Z.; Cai, J.; Fu, P.; Hu, L.; Liu, T. Deep Learning-Based Fusion of Landsat-8 and Sentinel-2 Images for a Harmonized Surface Reflectance Product. Remote Sens. Environ. 2019, 235, 111425.
  23. Drusch, M.; Del Bello, U.; Carlier, S.; Colin, O.; Fernandez, V.; Gascon, F.; Hoersch, B.; Isola, C.; Laberinti, P.; Martimort, P.; et al. Sentinel-2: ESA's Optical High-Resolution Mission for GMES Operational Services. Remote Sens. Environ. 2012, 120, 25–36.
  24. Wang, Q.; Blackburn, G.A.; Onojeghuo, A.O.; Dash, J.; Zhou, L.; Zhang, Y.; Atkinson, P.M. Fusion of Landsat 8 OLI and Sentinel-2 MSI Data. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3885–3899.
  25. Shang, R.; Zhu, Z. Harmonizing Landsat 8 and Sentinel-2: A Time-Series-Based Reflectance Adjustment Approach. Remote Sens. Environ. 2019, 235, 111439.
  26. Liebel, L.; Körner, M. Single-Image Super Resolution for Multispectral Remote Sensing Data Using Convolutional Neural Networks. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci.-ISPRS Arch. 2016, 41, 883–890.
  27. Tuna, C.; Unal, G.; Sertel, E. Single-Frame Super Resolution of Remote-Sensing Images by Convolutional Neural Networks. Int. J. Remote Sens. 2018, 39, 2463–2479.
  28. Lei, S.; Shi, Z.; Zou, Z. Super-Resolution for Remote Sensing Images via Local-Global Combined Network. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1243–1247.
  29. Xu, W.; Xu, G.; Wang, Y.; Sun, X.; Lin, D.; Wu, Y. High Quality Remote Sensing Image Super-Resolution Using Deep Memory Connected Network. In Proceedings of the IGARSS 2018-2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Institute of Electronics, Chinese Academy of Sciences: Beijing, China, 2018.
  30. Jiang, K.; Wang, Z.; Yi, P.; Jiang, J.; Xiao, J.; Yao, Y. Deep Distillation Recursive Network for Remote Sensing Imagery Super-Resolution. Remote Sens. 2018, 10, 1700.
  31. Deeba, F.; Dharejo, F.A.; Zhou, Y.; Ghaffar, A.; Memon, M.H.; Kun, S. Single Image Super-Resolution with Application to Remote-Sensing Image. In Proceedings of the 2020 Global Conference on Wireless and Optical Technologies, GCWOT 2020, Malaga, Spain, 6–8 October 2020; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2020.
  32. Huan, H.; Li, P.; Zou, N.; Wang, C.; Xie, Y.; Xie, Y.; Xu, D. End-to-end Super-resolution for Remote-sensing Images Using an Improved Multi-scale Residual Network. Remote Sens. 2021, 13, 666.
  33. Lu, T.; Wang, J.; Zhang, Y.; Wang, Z.; Jiang, J. Satellite Image Super-Resolution via Multi-Scale Residual Deep Neural Network. Remote Sens. 2019, 11, 1588.
  34. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023.
  35. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image Super-Resolution Using Very Deep Residual Channel Attention Networks. In Proceedings of the Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Munich, Germany, 8–14 September 2018; Volume 11211 LNCS, pp. 294–310.
  36. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the 15th European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; Volume 11211 LNCS, pp. 3–19.
  37. Dai, T.; Cai, J.; Zhang, Y.; Xia, S.T.; Zhang, L. Second-Order Attention Network for Single Image Super-Resolution. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2019, 2019, 11057–11066.
  38. Niu, B.; Wen, W.; Ren, W.; Zhang, X.; Yang, L.; Wang, S.; Zhang, K.; Cao, X.; Shen, H. Single Image Super-Resolution via a Holistic Attention Network. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); LNCS; Springer Science and Business Media Deutschland GmbH: Berlin, Germany, 2020; Volume 12357 LNCS, pp. 191–207.
  39. Zhao, H.; Kong, X.; He, J.; Qiao, Y.; Dong, C. Efficient Image Super-Resolution Using Pixel Attention. In Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; Volume 12537 LNCS, pp. 56–72.
  40. Wu, J.; He, Z.; Hu, J. Sentinel-2 Sharpening via Parallel Residual Network. Remote Sens. 2020, 12, 279.
  41. Yue, X.; Chen, X.; Zhang, W.; Ma, H.; Wang, L.; Zhang, J.; Wang, M.; Jiang, B. Super-Resolution Network for Remote Sensing Images via Preclassification and Deep–Shallow Features Fusion. Remote Sens. 2022, 14, 925.
  42. Zhu, Y.; Geiß, C.; So, E. Image Super-Resolution with Dense-Sampling Residual Channel-Spatial Attention Networks for Multi-Temporal Remote Sensing Image Classification. Int. J. Appl. Earth Obs. Geoinf. 2021, 104, 102543.
  43. Wald, L.; Ranchin, T.; Mangolini, M. Fusion of Satellite Images of Different Spatial Resolutions: Assessing the Quality of Resulting Images. Photogramm. Eng. Remote Sens. 1997, 63, 691–699.
  44. Kingma, D.P.; Ba, J.L. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015—Conference Track Proceedings, San Diego, CA, USA, 7–9 May 2015.
  45. Wang, Z.; Bovik, A.C. A Universal Image Quality Index. IEEE Signal Process. Lett. 2002, 9, 81–84.
  46. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image Super-Resolution Using Deep Convolutional Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 295–307.
  47. Lanaras, C.; Bioucas-Dias, J.; Galliani, S.; Baltsavias, E.; Schindler, K. Super-Resolution of Sentinel-2 Images: Learning a Globally Applicable Deep Neural Network. ISPRS J. Photogramm. Remote Sens. 2018, 146, 305–319.
Figure 1. The flowchart of the study.
Figure 2. The network architecture of the proposed PARNet.
Figure 3. Expanded view of the Resblock.
Figure 4. PA: pixel-wise attention.
Figure 5. Expanded view of the PA-REC.
Figure 6. The geographic locations of the Landsat and Sentinel-2 images used for training and testing. (B–E) are the subplots of the four black boxes in (A), showing the specific geographic locations of the overlapping areas (red boxes) between Sentinel and Landsat. The thumbnails and acquisition time of each image are displayed in the upper right corner.
Figure 7. The down-sampling process for simulating the data.
Figure 8. Comparisons between the reference bands and the fused bands produced by DSen2, SRCNN, and PARNet for bands 1–4 on 5 December 2021. (a,d,g,j) The fusion reflectance yielded by DSen2. (b,e,h,k) The fusion reflectance yielded by SRCNN. (c,f,i,l) The fusion reflectance yielded by PARNet. The color scheme indicates the density of the points.
Figure 9. Comparisons between the reference bands and the fused bands produced by DSen2, SRCNN, and PARNet for bands 1–4 on 26 February 2022. (a,d,g,j) The fusion reflectance yielded by DSen2. (b,e,h,k) The fusion reflectance yielded by SRCNN. (c,f,i,l) The fusion reflectance yielded by PARNet. The color scheme indicates the density of the points.
Figure 10. Results of real Sentinel-2 data. The first two rows are the original 10 m scenes (B2, B3, and B4 as RGB) and the original 20 m scenes (B12, B8a, and B5 as RGB). The remaining 10 m scenes, from top to bottom, are obtained via bicubic interpolation, SRCNN network, DSen2 network, and PARNet network, respectively (B12, B8a, and B5 as RGB).
Figure 11. Results of real Landsat-8 data. The first two rows are the original 10 m scenes (B2, B3, and B4 as RGB) of Sentinel-2 and the original 30 m scenes (B12, B8a, and B5 as RGB) of Landsat-8. The remaining 10 m scenes, from top to bottom, are obtained via bicubic interpolation, SRCNN network, DSen2 network, and PARNet network, respectively (B2, B3, and B4 as RGB).
Figure 12. From left to right, the images represent the original Sentinel-2 with 10 m GSD (a,d,g,j), super-resolved Landsat-8 images with 10 m GSD (b,e,h,k), and original Landsat-8 images with 30 m GSD (c,f,i,l). The red rectangles highlight the difference between the true scene of Sentinel-2 and Landsat 8. The scene in (ac) was captured from the Yangtze River dataset, while the others are from the Florida dataset in Figure 6 with bands 4, 3, and 2 as RGB.
Table 1. Information for all bands of the Sentinel-2 MSI sensor.
Group | Band | Wavelength (nm) | Resolution (m)
Visible | B1 | 433–453 | 60
Visible | B2 | 458–523 | 10
Visible | B3 | 543–578 | 10
Red | B4 | 650–680 | 10
Red-Edge | B5 | 698–713 | 20
Red-Edge | B6 | 733–748 | 20
Red-Edge | B7 | 773–793 | 20
Near-Infrared | B8 | 785–900 | 10
Near-Infrared | B8a | 855–875 | 20
Water Vapour | B9 | 935–955 | 60
SWIR-Cirrus | B10 | 1360–1390 | 60
SWIR | B11 | 1565–1655 | 20
SWIR | B12 | 2100–2280 | 20
Table 2. Information for all bands of the Landsat OLI sensor.
Group | Band | Wavelength for Landsat-8 (nm) | Wavelength for Landsat-9 (nm) | Resolution (m)
Visible | B1 | 433–453 | 430–450 | 30
Visible | B2 | 450–515 | 450–510 | 30
Visible | B3 | 525–600 | 530–590 | 30
Red | B4 | 630–680 | 640–670 | 30
Near-Infrared | B5 | 845–885 | 850–880 | 30
SWIR | B6 | 1560–1660 | 1570–1650 | 30
SWIR | B7 | 1360–1390 | 1360–1380 | 30
Panchromatic | B8 | 500–680 | 500–680 | 15
Table 3. Training and testing split.
Network | GSD | Patches | Split
N2× | 10 m | (90, 90) | Training 80%, Validation 10%, Testing 10%
N2× | 20 m | (45, 45) | Training 80%, Validation 10%, Testing 10%
N3× | 10 m | (60, 60) | Training 80%, Validation 10%, Testing 10%
N3× | 15 m | (40, 40) | Training 80%, Validation 10%, Testing 10%
N3× | 30 m | (20, 20) | Training 80%, Validation 10%, Testing 10%
Table 4. Results of the sharpening 20 m bands for Sentinel-2, evaluated at a lower scale (input 40 m, output 20 m). The best results are shown in bold.
Dataset | Method | RMSE | SRE | UIQ | PSNR
Florida | Bicubic | 0.0074 | 26.0532 | 0.9518 | 69.7297
Florida | SRCNN | 0.0034 | 29.7953 | 0.9885 | 84.5335
Florida | DSen2 | 0.0027 | 30.8831 | 0.9926 | 89.2141
Florida | PARNet | 0.0026 | 31.1194 | 0.9934 | 90.1849
Andes | Bicubic | 0.0089 | 33.8328 | 0.9542 | 76.7907
Andes | SRCNN | 0.0046 | 36.8410 | 0.9874 | 89.2641
Andes | DSen2 | 0.0031 | 38.8696 | 0.9943 | 97.1335
Andes | PARNet | 0.0030 | 39.0517 | 0.9947 | 97.8848
Yangtze River Delta | Bicubic | 0.0070 | 25.1262 | 0.9086 | 67.5037
Yangtze River Delta | SRCNN | 0.0031 | 29.1560 | 0.9837 | 84.5205
Yangtze River Delta | DSen2 | 0.0024 | 30.2961 | 0.9901 | 89.0881
Yangtze River Delta | PARNet | 0.0024 | 30.4028 | 0.9907 | 89.5992
Pearl River Delta | Bicubic | 0.0081 | 21.5454 | 0.9283 | 64.2546
Pearl River Delta | SRCNN | 0.0031 | 26.1132 | 0.9890 | 83.7916
Pearl River Delta | DSen2 | 0.0027 | 26.8411 | 0.9922 | 86.6477
Pearl River Delta | PARNet | 0.0026 | 26.9198 | 0.9925 | 86.9322
Table 5. Results of the sharpening bands for Landsat, evaluated at a lower scale (input 90 m, output 30 m). The best results are shown in bold.
Dataset | Method | RMSE | SRE | UIQ | PSNR
Florida | Bicubic | 0.0112 | 17.6273 | 0.8738 | 61.2077
Florida | SRCNN | 0.0051 | 21.1488 | 0.9736 | 76.4707
Florida | DSen2 | 0.0047 | 21.5048 | 0.9774 | 78.0028
Florida | PARNet | 0.0045 | 21.6874 | 0.9793 | 78.8305
Andes | Bicubic | 0.0096 | 14.7899 | 0.8266 | 59.2376
Andes | SRCNN | 0.0038 | 19.0964 | 0.9743 | 77.1665
Andes | DSen2 | 0.0036 | 19.3724 | 0.9774 | 78.3122
Andes | PARNet | 0.0035 | 19.5349 | 0.9796 | 79.1877
Yangtze River Delta | Bicubic | 0.005 | 18.5645 | 0.7844 | 65.2543
Yangtze River Delta | SRCNN | 0.0024 | 21.7451 | 0.9502 | 78.4773
Yangtze River Delta | DSen2 | 0.0021 | 22.6235 | 0.9644 | 81.8125
Yangtze River Delta | PARNet | 0.0020 | 22.8246 | 0.9682 | 82.9704
Pearl River Delta | Bicubic | 0.0104 | 26.0121 | 0.8717 | 64.9336
Pearl River Delta | SRCNN | 0.0046 | 29.7617 | 0.9764 | 80.6697
Pearl River Delta | DSen2 | 0.0043 | 30.1678 | 0.9801 | 82.3914
Pearl River Delta | PARNet | 0.0040 | 30.4069 | 0.9821 | 83.5550
Table 6. Comparison of the number of model parameters and model size.
 | SRCNN | DSen2 | PARNet
Model Params | 52,004 | 1,786,116 | 836,356
Memory Size (MB) | 0.644 | 20.5 | 9.85
Table 7. Results of the sharpening bands for Sentinel-2, evaluated at the original scale (input 20 m, output 10 m). The best results are shown in bold.
Brenner | B5 | B6 | B7 | B8a
Bicubic | 0.4440 | 0.5883 | 0.8133 | 0.9799
SRCNN | 1.5489 | 2.0811 | 2.8447 | 3.3578
DSen2 | 1.5798 | 2.1246 | 2.8918 | 3.3844
PARNet | 1.5804 | 2.1267 | 2.8989 | 3.3801
Table 8. Results of the sharpening bands for Landsat, evaluated at the original scale (input 30 m, output 10 m). The best results are shown in bold.
Brenner | B1 | B2 | B3 | B4
Bicubic | 0.1198 | 0.1464 | 0.1907 | 0.3222
SRCNN | 0.5885 | 0.7074 | 0.9220 | 1.5432
DSen2 | 0.6219 | 0.7599 | 0.9693 | 1.5564
PARNet | 0.6344 | 0.7758 | 0.9708 | 1.5975
Table 9. The impact of the PA–REC block. The symbol ✓ indicates that the component is included and ✗ indicates that it is not. The values in the table are the averages over the four Landsat datasets. The best results are shown in bold.
Model | N3×-1 | N3×-2 | N3×-3
PA | ✗ | ✗ | ✓
REC | ✗ | ✓ | ✓
RMSE | 0.00385 | 0.00352 | 0.00351
SRE | 22.95978 | 23.60991 | 23.61343
UIQI | 0.97223 | 0.97720 | 0.97733
PSNR | 79.17943 | 81.08414 | 81.13586