Panchromatic Image Super-Resolution Via Self Attention-Augmented Wasserstein Generative Adversarial Network

Panchromatic (PAN) images contain abundant spatial information that is useful for earth observation, but they often suffer from low resolution (LR) due to sensor limitations and the large field of view. Current super-resolution (SR) methods based on traditional attention mechanisms have shown remarkable advantages but still struggle to reconstruct the edge details of SR images. To address this problem, an improved SR model based on a self-attention augmented Wasserstein generative adversarial network (SAA-WGAN) is designed to exploit the reference information among multiple features for detail enhancement. We use an encoder-decoder network followed by a fully convolutional network (FCN) as the backbone to extract multi-scale features and reconstruct the high-resolution (HR) results. To exploit the relevance between multi-layer feature maps, we first integrate a convolutional block attention module (CBAM) into each skip-connection of the encoder-decoder subnet, generating weighted maps that automatically enhance both channel-wise and spatial-wise feature representation. Besides, considering that the HR results and LR inputs are highly similar in structure, and that this similarity cannot be fully reflected by traditional attention mechanisms, we design a self augmented attention (SAA) module, in which the attention weights are produced dynamically via a similarity function between hidden features. This design allows the network to flexibly adjust the relative relevance among multi-layer features and keep long-range inter-feature information, which helps to preserve details. In addition, the pixel-wise loss is combined with perceptual and gradient losses to achieve comprehensive supervision. Experiments on benchmark datasets demonstrate that the proposed method outperforms other SR methods in terms of both objective evaluation and visual effect.


Introduction
Panchromatic (PAN) images have been widely used in various applications, such as weather forecasting, environmental monitoring, and earth observation. However, since PAN images are usually taken from space satellites with a large field of view, their spatial resolution is quite limited, and the details of ground objects therefore cannot be well distinguished. To resolve this problem, recent works have begun to focus on the super-resolution (SR) of PAN images. Owing to sensor limitations, PAN images captured by satellite sensors suffer from heavy degradation, so SR is urgently needed to improve their resolution and enrich image texture through image processing algorithms.
The performance of SR algorithms [1][2][3][4] has been greatly boosted by convolutional neural networks. The conventional supervised learning model tries to minimize the error between the ground truth and the SR result, but this design cannot fully exploit the semantic difference between the two, because the loss functions are usually basic error functions, such as Mean Square Error (MSE), Structural Similarity Index Measurement (SSIM), or the L1 norm [5]. The generative adversarial network (GAN) was introduced to overcome this shortcoming. Unlike ordinary generative networks, GAN-based methods use a discriminative network to minimize the semantic distance between generated images and ground truth images through the discriminative error, producing high-resolution (HR) results with more detail and naturalness. Although GAN-based SR models have made considerable progress, some limitations remain, such as training instability and the limited ability to represent spatial-wise and channel-wise features [6]. The training instability is usually caused by the nonlinearity of the discriminative supervision, which may lead to mode collapse. In addition, traditional convolution cannot account for the different contributions of different channels and different locations of the feature maps.
Some improved models have been proposed to address the issue of feature representation in SR networks. The residual channel attention network (RCAN) [7] was introduced to learn features across channels and enhance long-term information. Channel attention exploits the features across different channels, but this design cannot fully use the relevance among different locations of the feature. To further uncover the hidden relations within features, the convolutional block attention module (CBAM) [8] was developed by combining channel-wise and spatial attention mechanisms in the network. Y.T. Hu first introduced the CBAM block into an SR network [9], integrating channel-wise attention and spatial attention into the residual block (CSAR) to modulate the residual features. The CSAR blocks are stacked in a chain structure to dynamically modulate multi-level features in a global-and-local manner. In this way, the multi-level features can be adapted and fused into a hierarchical feature map through gated fusion. However, the relevant information between the channel feature and the spatial feature is not exploited in CSAR blocks.
To effectively address the above problem, attention-augmented convolution [10] is introduced in this paper to exploit the relevance among multiple features. Attention-augmented convolution improves on classic convolution by augmenting the features and assigning adaptive weights for feature combination, so that the fraction of attentional channels can be flexibly adjusted to keep the inter-feature information. This allows the network to capture long-range interactions without increasing the number of parameters; however, the self-attention mechanism has not yet been fully explored in SR.
In this work, a self augmented attention Wasserstein generative adversarial network (SAA-WGAN) is proposed for PAN image SR. We first integrate a convolutional block attention module (CBAM) into each skip-connection of the encoder-decoder subnet, instead of stacking CBAMs in a chain, to extract CBAM features at multiple scales. To obtain relevant information for hierarchical features, a self augmented attention (SAA) block using attention-augmented convolution is presented to extract hidden features and contextual information. In our SAA-WGAN, an encode-decode structure with CBAMs is used as one branch, and the SAA block is used as another parallel branch, providing more helpful features at multiple scales and layers for the reconstruction of the HR result. In addition, pixel-wise information and high-level semantic information are exploited through the combination of pixel loss and perceptual loss. As a result, our method obtains better visual quality and recovers more image details than other state-of-the-art SR methods.
In summary, the main contributions of this paper are as follows: (1) We propose a WGAN-based network (SAA-WGAN) for PAN image SR, which integrates the encode-decode structure with CBAM. (2) We introduce the self-attention module into the WGAN network, so that long-range features can be well preserved and transferred. (3) The generator loss is a combination of pixel loss, perceptual loss, and gradient loss, providing supervision in terms of both image quality and visual effect. (4) Extensive evaluations have been conducted to verify the above contributions.
The remainder of this paper is organized as follows. Section 2 reviews related work on generative adversarial networks and attention features. The proposed SAA-WGAN method for PAN image super-resolution is described in Section 3. The experimental results and analysis are reported in Section 4. The conclusion is given in Section 5.

Generative Adversarial Networks
Traditional GANs [11][12][13] narrow the gap between the generated sample and the real image by minimizing the Kullback-Leibler (KL) divergence between discriminative results; the structure of a GAN is shown in Figure 1. The discriminator network in a GAN learns to distinguish real samples from fake ones, which in turn drives the generator to produce very realistic SR results. However, the KL divergence is not linear over the input distribution space, which means that the supervision provided by the discriminator is non-uniform across input samples; the performance of traditional GANs is therefore quite limited.
In SR, the generator network is trained to capture the real data distribution so that its generated samples are as realistic as possible, which means minimizing $\mathbb{E}[\log D(I, x)] + \mathbb{E}[\log(1 - D(I, \hat{x}))]$. The discriminative network estimates the probability that a given sample comes from the real dataset, i.e., it maximizes the probability of distinguishing the SR sample from real data. The contest between the discriminator and the generator is therefore usually formulated as a zero-sum game with cross-entropy targets:
$$\min_G \max_D \; \mathbb{E}[\log D(I, x)] + \mathbb{E}[\log(1 - D(I, \hat{x}))],$$
where $x$ is the input, $\hat{x}$ is the SR image, $I$ is the ground truth, $D$ is the discriminator in the network, and $G$ is the generator in the network. Hence, the discriminator loss is
$$\mathrm{Loss}_D = -\mathbb{E}[\log D(I, x)] - \mathbb{E}[\log(1 - D(I, \hat{x}))].$$
In practice, a modified generator loss is used:
$$\mathrm{Loss}_G = -\mathbb{E}[\log D(I, \hat{x})].$$
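As a rough numerical illustration of the cross-entropy objectives in the standard GAN formulation, the sketch below evaluates a discriminator loss and the modified (non-saturating) generator loss on a few discriminator outputs. The function names and the sample probabilities are illustrative assumptions and are not taken from the proposed model.

```python
import numpy as np

def gan_discriminator_loss(d_real, d_fake, eps=1e-12):
    """Cross-entropy discriminator loss: -E[log D(real)] - E[log(1 - D(fake))]."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return float(-np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps)))

def gan_generator_loss(d_fake, eps=1e-12):
    """Modified (non-saturating) generator loss: -E[log D(fake)]."""
    d_fake = np.asarray(d_fake, dtype=float)
    return float(-np.mean(np.log(d_fake + eps)))

# Hypothetical discriminator probabilities for real and SR (fake) samples.
d_real = [0.9, 0.8, 0.95]
d_fake = [0.1, 0.3, 0.2]
loss_d = gan_discriminator_loss(d_real, d_fake)
loss_g = gan_generator_loss(d_fake)
print(f"Loss_D = {loss_d:.3f}, Loss_G = {loss_g:.3f}")
```

The logarithm makes both losses grow sharply as the discriminator outputs approach 0 or 1, which is one source of the uneven supervision discussed above.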

Attention Features
Attention has enjoyed widespread adoption in convolutional neural network (CNN) models, including SR networks, because of its ability to enhance feature representation. Y.L. Zhang proposed a channel attention (CA) mechanism to adaptively rescale channel-wise features by considering the interdependencies among channels. To consider channel attention and spatial attention jointly, Y.T. Hu introduced the convolutional block attention module (CBAM) [14] into a deep SR network [9], where a set of channel-wise and spatial attention residual (CSAR) blocks was constructed and stacked in a chain structure to dynamically modulate multi-level features in a global-and-local manner. More recently, further attention mechanisms have been applied to super-resolution [15,16]. Tao Dai presented a second-order attention network (SAN [15]) that employs repeated local-source residual attention groups (LSRAG) to learn increasingly abstract feature representations. In SAN, a novel trainable second-order channel attention (SOCA) module was developed to adaptively rescale the channel-wise features by using second-order feature statistics for more discriminative representations. Further, L.G. Wang created a parallax-attention mechanism (PASSRnet [16]) to integrate the information from a stereo image pair, handling different stereo images with large disparity variations.
Although these existing attention-based approaches have improved SR performance, reconstructing rich details for single-image SR (SISR) is still a challenge. In deep networks, the low-resolution (LR) inputs and extracted features contain different types of information across channels, locations, and layers, which contribute differently to the reconstruction. However, the common convolutional layer imposes locality and translation equivariance via a limited receptive field and weight sharing, respectively. The local nature of the convolutional kernel prevents it from capturing the global context of an image, which is necessary for recovering the details of SR images. Consequently, although the contributions across these different aspects are not equal, convolution treats them alike, so the multiple feature maps cannot be fully utilized.
Inspired by the above observations, we propose a method that captures the global context with attention-augmented convolution and extracts multi-scale features via an encode-decode network. The features from the attention-augmented convolution and the encode-decode network are shown in Figure 2. The attention-augmented features retain many details, such as corners and edges. They further assist HR image reconstruction in the spatial domain and can be concatenated with the multi-scale features.

Method
Our system reconstructs high-resolution images via a Wasserstein generative adversarial network with channel and spatial attention to obtain more representative features. In particular, channel and spatial attention is a flexible mechanism that captures channel and positional information in a self-adaptive manner, so that important information is weighted more heavily. In addition, a WGAN with a comprehensive loss function is applied so that the SR reconstruction results look realistic and contain more details.

Architecture
The architecture of the self augmented attention WGAN (SAA-WGAN) is illustrated in Figure 2. It consists of two parallel branches: an encode-decode network and a self-attention network. The encode-decode network is composed of two modules, i.e., the encode-decode module (EDM) and the fully convolutional network (FCN). The FCN involves five convolution blocks of eight kernels with a size of 3×3. The EDM is a four-scale encode-decode convolutional module, and a CBAM is embedded at each scale to enhance multi-scale feature representation. Meanwhile, self augmented attention (SAA) convolution is introduced to relate the spatial and channel feature subspaces for a more powerful convolution.
Wasserstein GAN. The Wasserstein GAN was proposed by Martin Arjovsky and others. Its discriminator is optimized by maximizing the difference between its outputs on real and fake samples, which estimates the Earth Mover (EM) distance between them. Thus, the "discriminator" is no longer a direct critic that tells the fake samples apart from the real ones. Instead, it is trained to learn a K-Lipschitz continuous function (satisfying $\|f\|_L \le K$) that helps compute the Wasserstein distance [17], which is linear over the entire sample space. The Wasserstein objective is informally defined as
$$\min_G \max_{D \in \mathcal{D}} \; -\left(-\mathbb{E}[D(I, x)] + \mathbb{E}[D(I, \hat{x})]\right),$$
where $I$ is the LR image, $x$ is the real image, $\hat{x}$ is the SR image, and $\mathcal{D}$ is the set of 1-Lipschitz functions. The discriminator loss is
$$\mathrm{Loss}_D = -\mathbb{E}[D(I, x)] + \mathbb{E}[D(I, \hat{x})].$$
In practice, a modified generator loss is expressed as
$$\mathrm{Loss}_G = -\mathbb{E}[D(I, \hat{x})].$$
The Wasserstein GAN removes the logarithm to obtain continuous gradient updates and uses a gradient penalty to enforce the constraint on the discriminator. It alleviates several problems of the GAN, such as the unstable gradient of the generator and the insufficient diversity of generated data. It is therefore used in our model to facilitate the reconstruction of more detail in the SR image.
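The Wasserstein losses can be sketched in the same way. The example below is a minimal illustration under loud assumptions: the critic is a toy linear function D(x) = w · x, so its input gradient is simply w and the gradient penalty can be written without automatic differentiation, and the penalty weight of 10 is a commonly used default rather than a value from this paper.

```python
import numpy as np

def wgan_critic_loss(d_real, d_fake):
    """Loss_D = -E[D(I, x)] + E[D(I, x_hat)]; note there is no logarithm."""
    return float(-np.mean(d_real) + np.mean(d_fake))

def wgan_generator_loss(d_fake):
    """Loss_G = -E[D(I, x_hat)]."""
    return float(-np.mean(d_fake))

def gradient_penalty_linear(w, lam=10.0):
    """Penalty lam * (||grad_x D|| - 1)^2 for a toy linear critic D(x) = w . x.

    For a linear critic the gradient with respect to the input equals w everywhere.
    The value lam=10.0 is an assumed default, not a setting reported in the paper.
    """
    return float(lam * (np.linalg.norm(w) - 1.0) ** 2)

rng = np.random.default_rng(0)
w = np.array([0.6, 0.8])                    # toy critic weights with unit norm
real = rng.normal(loc=1.0, size=(8, 2))     # stand-in "real" samples
fake = rng.normal(loc=-1.0, size=(8, 2))    # stand-in "SR" samples
d_real, d_fake = real @ w, fake @ w         # unbounded critic scores, not probabilities

loss_d = wgan_critic_loss(d_real, d_fake) + gradient_penalty_linear(w)
loss_g = wgan_generator_loss(d_fake)
print(f"Loss_D = {loss_d:.3f}, Loss_G = {loss_g:.3f}")
```

Because the critic scores enter the losses linearly, every sample contributes a gradient of the same magnitude, in contrast to the logarithmic losses of the standard GAN.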
Channel and spatial attention block (CBAM). Attention models help the network focus on the features that are most critical for performance. In our model, to fully exploit this information, we use a channel and spatial attention block for feature re-enhancement. The CBAM applies average-pooling and max-pooling to squeeze the spatial dimension of the input feature map and obtain channel attention. It also applies average-pooling and max-pooling operations along the channel axis for spatial attention, concatenates the resulting features, and generates a spatial attention map with a convolution layer. In this way, the weights of different channels and different positions can be flexibly adjusted according to the importance of the information. The input of each CBAM is the feature of the corresponding layer in the EDM, and the multi-scale features captured by CBAM are displayed in the first row of Figure 3.
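The CBAM computation described above can be sketched in NumPy as follows. This is a minimal illustration of the standard CBAM formulation (channel attention from average- and max-pooled descriptors through a shared two-layer MLP, then spatial attention from a 7×7 convolution over channel-pooled maps); the reduction ratio, the random weights, and all function names are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(feat, w0, w1):
    """feat: (H, W, C). Returns per-channel weights in (0, 1), shape (C,)."""
    avg = feat.mean(axis=(0, 1))              # spatially average-pooled descriptor
    mx = feat.max(axis=(0, 1))                # spatially max-pooled descriptor

    def mlp(v):
        return np.maximum(v @ w0, 0.0) @ w1   # shared MLP with ReLU in between

    return sigmoid(mlp(avg) + mlp(mx))

def spatial_attention(feat, kernel):
    """kernel: (k, k, 2). Returns per-position weights in (0, 1), shape (H, W)."""
    pooled = np.stack([feat.mean(axis=2), feat.max(axis=2)], axis=2)  # (H, W, 2)
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(pooled, ((pad, pad), (pad, pad), (0, 0)))         # zero padding
    h, w = pooled.shape[:2]
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + k, j:j + k, :] * kernel)
    return sigmoid(out)

def cbam(feat, w0, w1, kernel):
    refined = feat * channel_attention(feat, w0, w1)          # broadcast over H and W
    return refined * spatial_attention(refined, kernel)[:, :, None]

rng = np.random.default_rng(0)
H, W, C, r = 8, 8, 16, 4                      # r is an assumed reduction ratio
feat = rng.normal(size=(H, W, C))
w0 = rng.normal(scale=0.1, size=(C, C // r))
w1 = rng.normal(scale=0.1, size=(C // r, C))
kernel = rng.normal(scale=0.1, size=(7, 7, 2))
out = cbam(feat, w0, w1, kernel)
print(out.shape)
```

The explicit loops keep the convolution easy to read; a practical implementation would use the convolution primitive of a deep learning framework instead.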

Fusion of multi-scale feature
The CBAM structure is shown in Figure 4. The CBAM feature is computed from a 1D channel attention map $M_c$ and a 2D spatial attention map $M_s$. For an input feature $F$, the CBAM feature $F_{cbam}$ can be expressed as
$$F' = M_c(F) \otimes F, \qquad F_{cbam} = M_s(F') \otimes F',$$
$$M_c(F) = \sigma\left(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\right), \qquad F^c_{avg} = \mathrm{AvgPool}(F), \quad F^c_{max} = \mathrm{MaxPool}(F),$$
$$M_s(F) = \sigma\left(f^{7\times 7}([F^s_{avg}; F^s_{max}])\right),$$
where $\otimes$ denotes element-wise multiplication, $\mathrm{AvgPool}(\cdot)$ is average-pooling, and $\mathrm{MaxPool}(\cdot)$ is max-pooling. $F^c_{avg}$ and $F^c_{max}$ denote the average-pooled and max-pooled channel features, respectively. $F^s_{avg}$ and $F^s_{max}$ denote the average-pooled and max-pooled spatial features, respectively. $\sigma$ denotes the sigmoid function, and $f^{7\times 7}$ represents a convolution operation with a filter size of 7×7. $W_0$ and $W_1$ are the feature weights applied after pooling and after activation, respectively.
Self augmented attention (SAA) convolution. SAA convolution computes a weighted average of values from hidden units, where the weights are produced dynamically via a similarity function. It can capture long-range interactions among input signals and applies the dynamic weights obtained from the hidden units to the input. The SAA incorporates both local information and re-calibrated global information into the convolution. Specifically, two heads of the feature subspace participate in the attention mechanism to produce both spatial and channel-wise weighted maps, which are used to re-weight the corresponding locations of the input image; the result is finally concatenated with the point-wise convolution to achieve the enhanced convolution operation in SAA. Augmented convolutions are thus combined with the self-attention mechanism to obtain representative and abstract features, as displayed in the second row of Figure 3.
Self attention-augmented convolution is achieved by concatenating convolutional feature maps with self-attentional feature maps, which are capable of modeling longer-range dependencies (see Figure 5). First, we flatten the input matrix $X$ of shape $(H, W, d_{in})$ to $(HW, d_{in})$ and apply multi-head attention as in the transformer architecture [8]. The output of the self-attention module for a head $h$ can be formulated as
$$O_h = \mathrm{Softmax}\left(\frac{(XW_q)(XW_k)^T}{\sqrt{d^h_k}}\right)(XW_v),$$
where $W_q, W_k \in \mathbb{R}^{d_{in} \times d^h_k}$ and $W_v \in \mathbb{R}^{d_{in} \times d^h_v}$ are learned linear transformations that map the input $X$ to queries $Q = XW_q$, keys $K = XW_k$, and values $V = XW_v$. The attention map uses the query matrix $Q$ and the key matrix $K$ to weight the values $V$, and an $HW \times HW$ matrix is obtained via $(XW_q)(XW_k)^T$. The outputs of all heads (1, 2, 3, ..., h) are then concatenated as follows:
$$O_{atten}(X) = \mathrm{Concat}[O_1, O_2, \ldots, O_h]\, W_o,$$
where $W_o \in \mathbb{R}^{d_v \times d_v}$ is a linear transformation. $O_{atten}(X)$ is then reshaped into a tensor of shape $(H, W, d_v)$ to match the original spatial dimensions.
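The flatten, attend, concatenate, and reshape steps can be sketched in NumPy as below. This is a minimal illustration assuming standard scaled dot-product attention; it omits any positional encoding, uses random weights, and replaces the standard convolution branch with a random placeholder array, so it should not be read as the exact SAA block of the proposed network.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, wq, wk, wv, wo):
    """x: (H, W, d_in); wq, wk: (heads, d_in, dk_h); wv: (heads, d_in, dv_h); wo: (dv, dv)."""
    h, w, d_in = x.shape
    flat = x.reshape(h * w, d_in)                        # (HW, d_in)
    heads = []
    for q_w, k_w, v_w in zip(wq, wk, wv):
        q, k, v = flat @ q_w, flat @ k_w, flat @ v_w     # queries, keys, values
        logits = (q @ k.T) / np.sqrt(k.shape[1])         # (HW, HW) similarity matrix
        heads.append(softmax(logits) @ v)                # weighted average of values
    out = np.concatenate(heads, axis=1) @ wo             # (HW, dv)
    return out.reshape(h, w, -1)                         # back to (H, W, dv)

rng = np.random.default_rng(0)
H, W, d_in, n_heads, dk_h, dv_h = 4, 4, 8, 2, 4, 3       # illustrative sizes only
x = rng.normal(size=(H, W, d_in))
wq = rng.normal(size=(n_heads, d_in, dk_h))
wk = rng.normal(size=(n_heads, d_in, dk_h))
wv = rng.normal(size=(n_heads, d_in, dv_h))
wo = rng.normal(size=(n_heads * dv_h, n_heads * dv_h))

conv_feat = rng.normal(size=(H, W, 10))                  # placeholder for a standard convolution output
attn = multi_head_self_attention(x, wq, wk, wv, wo)
aug = np.concatenate([conv_feat, attn], axis=2)          # attention-augmented feature map
print(attn.shape, aug.shape)
```

Each row of the HW × HW attention matrix sums to one, so every output position is a convex combination of the values at all positions, which is how the long-range dependencies enter the feature map.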

[Figure labels: Input, Attention map, Values, Weighted average of values, Standard convolution, Concat, Output]
The comprehensive feature $F_{sr}$ is expressed in terms of the SAA feature $F_{self}$ and the FCN feature $F_{fcn}$, where the function $C(\cdot)$ is the sum of $F_{fcn}$ and $F_{self}$, and the function $F_c$ is the series of convolutions in the FCN module. $\sum F^i_{cbam}$ is the fusion of $F^i_{cbam}$ (i = 1, 2, 3, 4), $F^i_{cbam}$ is the CBAM feature, $F_{self}$ is the SAA feature, and $F_{fcn}$ is the FCN feature.
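Read literally, these symbol definitions suggest the following composition. It is a sketch inferred from the descriptions of $C(\cdot)$, $F_c$, and $\sum F^i_{cbam}$ alone, so the exact form should be treated as an assumption:

```latex
F_{fcn} = F_c\!\left(\sum_{i=1}^{4} F^{i}_{cbam}\right), \qquad
F_{sr} = C\left(F_{fcn}, F_{self}\right) = F_{fcn} + F_{self}
```

Under this reading, the four CBAM features are fused first, passed through the FCN convolutions, and then summed with the SAA feature to form the comprehensive feature.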

Loss function
Efficient loss functions and deep CNN networks have been exploited in other SR methods [18]. To achieve better performance, we use a pixel-wise loss (e.g., the L1 loss) to minimize the pixel-level error between the real image and the SR result, as is widely done in image reconstruction problems [19]. The pixel-wise loss achieves excellent Peak Signal to Noise Ratio (PSNR) but tends to introduce artifacts. To avoid these artifacts, we also introduce a perceptual loss into our model. The perceptual loss reduces the feature gap between the real image and the reconstructed image at certain layers of VGG19, which helps preserve semantic information and achieve better visual quality. In addition, a gradient loss is used to minimize the gradient difference between the real image and the reconstructed image in different directions. The combination of pixel-wise loss, gradient loss, and perceptual loss is applied to supervise the training process. The combined generator loss can be expressed as
$$\mathrm{Loss}^{final}_G(I, \hat{x}) = -\left[a \cdot loss_{pix}(I, \hat{x}) + b \cdot loss_{per}(I, \hat{x}) + c \cdot loss_{grad}(I, \hat{x})\right].$$
In addition, the discriminator loss is
$$\mathrm{Loss}^{final}_D(x, \hat{x}) = \left[a \cdot loss_{pix}(I, \hat{x}) + b \cdot loss_{per}(I, \hat{x}) + c \cdot loss_{grad}(I, \hat{x})\right] - \left[a \cdot loss_{pix}(I, x) + b \cdot loss_{per}(I, x) + c \cdot loss_{grad}(I, x)\right].$$
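The weighted combination inside the brackets can be sketched for single-band images as follows. The pixel term is a plain L1 error and the gradient term uses horizontal and vertical finite differences; the perceptual term normally relies on VGG19 features, which are replaced here by a hypothetical stand-in (2×2 average pooling) so that the example stays self-contained. The weights are the values of a, b, and c reported in the implementation details; everything else is an illustrative assumption.

```python
import numpy as np

def pixel_loss(ref, img):
    """Mean absolute (L1) error between two images."""
    return float(np.mean(np.abs(ref - img)))

def gradient_loss(ref, img):
    """L1 difference between horizontal and vertical finite-difference gradients."""
    dx = np.abs(np.diff(ref, axis=1) - np.diff(img, axis=1)).mean()
    dy = np.abs(np.diff(ref, axis=0) - np.diff(img, axis=0)).mean()
    return float(dx + dy)

def toy_features(img):
    """Hypothetical stand-in for VGG19 features: 2x2 average pooling."""
    h, w = img.shape
    cropped = img[:h - h % 2, :w - w % 2]
    return cropped.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def perceptual_loss(ref, img, feature_fn=toy_features):
    """Mean squared error in feature space."""
    return float(np.mean((feature_fn(ref) - feature_fn(img)) ** 2))

def weighted_loss(ref, img, a=0.22, b=0.43, c=0.35):
    """Bracketed term: a * loss_pix + b * loss_per + c * loss_grad."""
    return (a * pixel_loss(ref, img)
            + b * perceptual_loss(ref, img)
            + c * gradient_loss(ref, img))

rng = np.random.default_rng(0)
ref = rng.random((8, 8))          # stand-in reference patch
sr = ref + 0.5                    # stand-in reconstruction with a constant brightness offset
print(weighted_loss(ref, sr))
```

A constant offset leaves the gradient term at zero while the pixel and feature terms both respond, which illustrates why the three terms supervise complementary aspects of the reconstruction.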
We conducted three experiments using different combinations of these losses to validate the effectiveness of the comprehensive loss. As Table 1 shows, the loss applied to the generative network improves the performance as expected, from 32.23 dB to 33.29 dB. These comparisons demonstrate the effectiveness of the combined loss.

Experimental Evaluation
In this part, we conduct an experimental comparison with state-of-the-art deep learning methods, including SRCNN [20], VDSR [21], EDSR [22], LapSRN [23], RCAN [7], ESPCN [24], RDN [25], SRGAN [11], and CGAN [26]. The baselines are re-implemented based on the source code provided by the authors. We implement our models with the TensorFlow framework and train them on an NVIDIA Titan V GPU. In the following subsection, we provide the implementation details and parameter settings of our SR model.

Implementation Details
We use the DIV2K dataset, a high-quality (2K resolution) dataset with 800 images, for our training. The training samples are randomly cropped from the original images with a fixed size of 64×64.
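The random cropping step can be sketched as below. The function name, the use of NumPy's random generator, and the image size in the example are illustrative assumptions; only the 64×64 patch size comes from the text.

```python
import numpy as np

def random_crop(image, size=64, rng=None):
    """Return a randomly located size x size patch from a 2D or 3D image array."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    if h < size or w < size:
        raise ValueError("image is smaller than the crop size")
    top = int(rng.integers(0, h - size + 1))    # upper bound is exclusive
    left = int(rng.integers(0, w - size + 1))
    return image[top:top + size, left:left + size]

image = np.zeros((100, 120))                    # stand-in for a training image
patch = random_crop(image, rng=np.random.default_rng(0))
print(patch.shape)
```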
The generative model is trained using the loss function in Equation (11) with a = 0.22, b = 0.43, c = 0.35. The learning rate is initialized as $1 \times 10^{-4}$ and decayed by a factor of 1 every $1 \times 10^{5}$ mini-batch updates. For optimization, we used Adam with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = e^{-10}$, and step size $\alpha = 0.001$. We alternately updated the generator and discriminator networks until the model converged.

Comparisons and Results
To validate the effect of self-attention, we carried out a series of experiments in three parts. First, we test SAA-WGAN and an ablated model without SAA on the DOTA dataset; some results are shown in Figure 6. The details of the airplane, the shade of the tree, and the house in Figure 6 are significantly improved because of the ability of SAA to keep long-distance details, indicating that self-attention improves the network performance.
Then, comparisons with state-of-the-art algorithms are conducted on DOTA and GEO images, which demonstrate the superior performance of SAA-WGAN. The results of SAA-WGAN are displayed in Figures 7-9. We show visual comparisons of different benchmark algorithms at scale ×4 in Figure 7. As can be seen, all the compared methods suffer from blurring artifacts to varying degrees and fail to recover fine details. In contrast, our SAA-WGAN recovers them clearly and is more faithful to the ground truth. Because the resolution of the PAN images is too high, they are cropped to 64×64 patches.
To further illustrate the generality of SAA-WGAN on other datasets, we compare our method with 8 state-of-the-art methods (Bicubic, SRCNN, SCN, VDSR, LapSRN, EDSR, RDN, RCAN) on commonly used SR datasets, e.g., Set5, Set14, BSD100, and Urban100. PSNR/SSIM comparisons are provided in Table 2, which shows quantitative results for ×2, ×4, and ×8 SR. The best results are annotated with blue text in Table 2. Our method achieves the best performance on almost all the datasets and scaling factors.
We also find that the PSNR gain of our method grows as the scaling factor becomes larger (e.g., 8). When the scale factor is 2, our method exceeds RCAN in PSNR by 1.5 dB and 1.2 dB on BSD100 and Urban100, respectively. On the same two datasets with a scale factor of 4, the gains over RCAN are 2.2 dB and 1.4 dB, respectively. When the scale factor is 8, the gains over RCAN are 2.99 dB and 1.97 dB, respectively. This observation shows that a deeper network structure and a powerful attention mechanism can improve network performance.
Figure 10 shows the objective evaluation on the image patches of Figure 8. Among the compared algorithms, the performance curves of SRGAN and ESPCN are significantly higher than those of the other five algorithms. SRGAN uses a discriminative network that can extract semantic information to obtain more useful features, while ESPCN adopts a reconstruction strategy that concentrates multiple channel features into a fused feature map, exploiting the relationship across channels. However, the evaluation indicators of SRGAN and ESPCN do not exceed those of the proposed method. The PSNR of SAA-WGAN reaches 32 dB, and its SSIM curve fluctuates around 0.92. This is achieved by its ability to extract attention features from hidden units using SAA and CBAM, which makes it superior to the other compared algorithms in both subjective vision and objective evaluation.
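For reference, the PSNR values quoted in these comparisons follow the usual definition based on the mean squared error. The sketch below assumes 8-bit images with a peak value of 255, which is an assumption about the evaluation protocol rather than a detail stated in the text.

```python
import numpy as np

def psnr(ref, img, max_val=255.0):
    """Peak signal-to-noise ratio in dB; max_val=255 assumes 8-bit images."""
    ref = np.asarray(ref, dtype=float)
    img = np.asarray(img, dtype=float)
    mse = np.mean((ref - img) ** 2)
    if mse == 0:
        return float("inf")                     # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))

ref = np.zeros((4, 4))
noisy = np.full((4, 4), 25.5)                   # uniform error of one tenth of the peak value
print(psnr(ref, noisy))
```

Since PSNR is logarithmic in the mean squared error, a gain of 1.5 dB corresponds to reducing the error by a factor of about 1.41 (10 raised to the power 0.15).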

Conclusions
We propose a self attention-augmented network, SAA-WGAN, for PAN image SR. SAA-WGAN uses the EDM to extract multi-scale information and the FCN to reconstruct HR images. The CBAM and the SAA are embedded in SAA-WGAN to enhance multi-scale feature representation and exploit the relationships in both the spatial and channel subspaces. Further, the pixel loss, perceptual loss, and gradient loss are combined to supervise the training process. Extensive experiments on benchmark datasets and PAN images demonstrate the effectiveness of the proposed SAA-WGAN.