MSAC-Net: 3D Multi-Scale Attention Convolutional Network for Multi-Spectral Imagery Pansharpening

Pansharpening fuses the spectral information of a multi-spectral image with the spatial information of a panchromatic image, generating super-resolution multi-spectral images with high spatial resolution. In this paper, we propose a novel 3D multi-scale attention convolutional network (MSAC-Net) based on the typical U-Net framework for multi-spectral imagery pansharpening. MSAC-Net is built from 3D convolutions, and an attention mechanism replaces the skip connections between the contraction and expansion pathways. Multiple pansharpening layers in the expansion pathway calculate reconstruction results at several scales to preserve multi-scale spatial information. MSAC-Net's performance is verified on the IKONOS and QuickBird satellite datasets, showing that it achieves comparable or superior performance to state-of-the-art methods. Additionally, 2D and 3D convolution are compared, and the influences of the number of convolutions per convolution block, the weight of multi-scale information, and the network's depth on performance are analyzed.


Introduction
Multi-spectral (MS) and panchromatic (PAN) images are two remote sensing image types acquired by optical satellites. While they often represent similar scenes, their spectral and spatial resolutions differ. PAN images have high spatial resolution (HR) but low spectral resolution, whereas MS images have high spectral resolution and low spatial resolution (LR). Pansharpening fuses LR-MS and HR-PAN images, generating super-resolution MS images with high spatial resolution (HR-MS). This can provide higher-quality remote sensing images for downstream tasks such as target detection [1][2][3], distribution estimation [4] and change detection [5][6][7].
The existing pansharpening techniques can be divided into four categories: component substitution [8,9], multi-resolution analysis [10,11], model-based optimization [12,13] and deep learning [14]. Component substitution-based methods achieve pansharpening by replacing part of the MS images' spectral components with spatial information from PAN images. However, this technique causes spectral distortion [15]. Multi-resolution analysis methods [16,17] decompose the source images and then synthesize the HR-MS images through fusion and inverse transform. Although this technique maintains good spectral characteristics, it causes spatial distortion due to the decomposition. Model-based optimization methods are limited by their dependence on appropriate prior knowledge and hyper-parameters.
In recent years, deep-learning-based research has achieved great success in image processing [18][19][20][21][22][23][24]. Masi et al. [18] were the first to propose a CNN-based pansharpening method (PNN), whose structure is based on the super-resolution CNN [25]. Yuan et al. [19] designed a multi-scale and multi-depth CNN (MSDCNN), which introduces multi-scale information through two branches and a different number of learning blocks for deep feature learning. Yang et al. [26] proposed a residual network for pansharpening. However, most of the existing approaches focus on spatial feature extraction while paying less attention to the spectral information and spatial scale information of fusion images. Consequently, the fusion process is often characterized by spectral information loss or spatial feature redundancy.
The 3D convolution shows promise in volume data analysis [27][28][29]. For example, Mei et al. [30] used a 3D CNN to extract spectral features for remote sensing image super-resolution. Compared to traditional 2D CNN methods, 3D CNN methods emphasize extracting spectral features while preserving spatial features [28]. Therefore, the characteristics of 3D convolution favor generating images with both high spatial and high spectral resolution.
Inspired by the human visual system [31], the attention mechanism has been proven to have a positive effect on image understanding [32] and has been widely applied in image processing due to its focus on local information [33]. For example, Wang et al. [34] presented a model with several attention blocks combined with a residual structure, enabling the network to focus on specific areas while reducing the number of calculations. Mei et al. [32] utilized attention mechanisms to study spatial and spectral correlations between adjacent pixels. U-Net (Figure 1) [35], named after its U-shaped structure, has a strong capability to extract and represent features and to combine low- and high-scale semantic information for semantic segmentation of input images. Guo et al. [36] replaced the basic convolution block of the decoder in the U-Net architecture with the residual channel attention block [37], consequently improving the model's capability. Oktay et al. [38] proposed the attention gate (AG) module based on the U-Net architecture to enable automated focus on target structures of different shapes and sizes while suppressing irrelevant areas.
At present, most MS pansharpening methods focus on the injection of spatial details but ignore the multi-scale and spectral features of multi-spectral images. This work proposes a novel 3D multi-scale attention deep convolutional network (MSAC-Net) for MS imagery pansharpening. Following the U-Net framework, MSAC-Net consists of a contraction path that extracts high-scale features from LR-MS and HR-PAN (Figure 2, left) and an expansion path that fuses spatial and spectral information (Figure 2, right). The attention mechanism is introduced between the contraction and expansion paths instead of skip connections, helping the network focus on the spatial details of the feature maps. Furthermore, a deep supervision mechanism enables MSAC-Net to utilize multi-scale spatial information by adding a pansharpening layer at each scale of the expansion path. The results demonstrate that MSAC-Net achieves high performance in both the spatial and spectral dimensions. In summary, this work:

1. Designs a 3D CNN to probe the spectral correlation of adjacent band images, thus reducing the spectral distortion in MS pansharpening;
2. Uses a deep supervision mechanism that utilizes multi-scale spatial information to solve the missing spatial detail problem;
3. Applies the AG mechanism instead of the skip connection in the U-Net structure and presents experiments demonstrating its advantages in MS pansharpening.
The rest of the manuscript is organized as follows. Section 2 presents the related work, while Section 3 introduces the proposed MSAC-Net. Section 4 describes and analyzes the experimental results. Finally, Section 5 concludes the paper with a short overview of the contributions.

Figure 2. The proposed MSAC-Net architecture.

Related Work
Ronneberger et al. [35] first proposed the U-Net network for medical image segmentation. Because its structure combines low-level and high-level semantic information for semantic segmentation of the input image, U-Net has high feature representation and extraction ability. Therefore, this section introduces the various components of U-Net and improvements based on it.
According to the type of feature information extracted, U-Net can be divided into three parts.
1. The contraction pathway (i.e., the encoder). It is mainly used for shallow feature extraction and coding. To improve this component, Milletari et al. [39] introduced residual learning into the design of each convolution block on the basis of U-Net, allowing V-Net to learn image features more fully. Wang et al. [40] used a ResNet-50 pre-trained on the ImageNet dataset as the encoder to accelerate convergence and achieve better performance via transfer learning.
2. The expansion pathway (i.e., the decoder). The purpose of this component is to restore or enlarge the size of the feature maps; its design therefore directly affects the final result of feature recovery. Guo et al. [36] learned an additional dual regression mapping after the expansion pathway to estimate the down-sampling kernel and reconstruct LR images, forming a closed loop that provides additional supervision. Banerjee et al. [41] designed decoder-side dense skip pathways, making the features at the final scale contain the feature results of all scales. In addition, Khalel et al. [42] realized multi-task learning with a dual-encoder design. Ni et al. [43] extracted and fused the network's pre-output features and deep features again, thus realizing multi-task learning.
3. The skip connection. The skip connections transfer the shallow feature maps of the network to the deep layers, which is helpful for training deep networks. Zhou et al. [44] proposed U-Net++, in which each scale's features are transferred to a larger scale, forming a dense connection structure at the skip connections. Ni et al. [43] also added a separable gate unit at the skip connection to improve the accuracy of spatial feature extraction. Rundo et al. [45] introduced squeeze-and-excitation blocks instead of the skip connection, expecting increased representational power from modeling the channel-wise dependencies of convolutional features [46]. Wang et al. [47] designed a module that generates attentional feature maps by attending to the height (H), width (W) and channel (C) dimensions of the feature maps to replace the skip connection.
In the U-Net, feature maps are learned by "compression-expansion". In addition, some studies improve U-Net as a whole. Yang et al. [48] transformed U-Net into a spatial attention module and inserted it into the network as a branch to extract spatial features of images. Wei et al. [49] replaced each convolution block in the U-Net structure with a small U-Net structure to achieve multi-scale feature extraction. Xiao et al. [50] adopted a dual U-Net structure to inject features extracted from the external U-Net into the internal U-Net, achieving multi-stage detail injection.

Method
This section introduces the overall design of MSAC-Net, including its structure, the AG module and the deep supervision mechanism.
First, we introduce the CNN-based pansharpening model. Let the LR-MS image of size h × w × c be denoted M^LR ∈ R^(h×w×c); similarly, denote the HR-PAN of size H × W as P^HR ∈ R^(H×W) and the HR-MS as M^HR ∈ R^(H×W×c). Taking the LR-MS and HR-PAN as inputs, the pansharpening task of generating the HR-MS can be expressed as:

θ̂ = arg min_θ ‖M(M^LR, P^HR; θ) − M^HR‖,

where M(·) represents the mapping from the CNN's input to its output, θ denotes the parameters to be optimized and ‖·‖ is a loss function. The CNN can learn the involved knowledge from the input data, offering the possibility of MS pansharpening. Table 1 shows the design of all modules in MSAC-Net. The following subsections introduce the details of these elements within MSAC-Net.
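As a hedged illustration of this formulation, the mapping M(·; θ) and the loss can be sketched in PyTorch. The tiny network below is a stand-in for illustration only, not the paper's architecture; all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class TinyPansharpenNet(nn.Module):
    """Minimal stand-in for M(.; theta); NOT the MSAC-Net architecture."""
    def __init__(self, bands: int):
        super().__init__()
        # input: up-sampled MS cascaded with PAN -> bands + 1 channels
        self.body = nn.Sequential(
            nn.Conv2d(bands + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, bands, 3, padding=1),
        )

    def forward(self, ms_up, pan):
        # ms_up: (B, c, H, W) up-sampled LR-MS; pan: (B, 1, H, W) HR-PAN
        return self.body(torch.cat([ms_up, pan], dim=1))

bands, H, W = 4, 64, 64
net = TinyPansharpenNet(bands)
ms_up = torch.rand(1, bands, H, W)
pan = torch.rand(1, 1, H, W)
hr_ms = torch.rand(1, bands, H, W)                      # ground truth M^HR
loss = nn.functional.l1_loss(net(ms_up, pan), hr_ms)    # ||M(X; theta) - M^HR||
```

Training then minimizes this loss over θ with a gradient-based optimizer, as in any supervised CNN.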
As shown in Figure 2, MSAC-Net consists of three parts: data preprocessing, the contraction pathway (left sub-network) and the expansion pathway (right sub-network). The contraction pathway and the expansion pathway have the same effect as U-Net. Placed between the two paths, the AG mechanism replaces the skip connection, thus improving the local detail feature representation ability. Moreover, MSAC-Net uses multi-scale information pansharpening layers to resolve the MS pansharpening problems.
Traditional 2D CNN approaches commonly cascade P^HR with the band dimension of M^LR to obtain the input data X ∈ R^(H×W×(c+1)). In contrast, the proposed MSAC-Net uses a 3D data cube as input. As shown in Figure 2, data preprocessing converts the PAN image P^HR into a cube P̂^HR ∈ R^(H×W×c), where each band slice P̂^HR_i = P^HR for i ∈ {1, …, c}. Similarly, an interpolation algorithm up-samples M^LR to M̂^LR ∈ R^(H×W×c). Finally, the input X ∈ R^(2×H×W×c) is obtained as X = R(P̂^HR ⊕ M̂^LR), where ⊕ is the cascade operation, R(·) represents the resizing operation and the number "2" stems from the two features, MS and PAN.
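The preprocessing above can be sketched as follows (a hedged PyTorch sketch; the exact shapes and the use of bicubic interpolation for the up-sampling are assumptions):

```python
import torch
import torch.nn.functional as F

def make_3d_input(pan, ms_lr):
    """pan: (B, 1, H, W) HR-PAN; ms_lr: (B, c, h, w) LR-MS.
    Returns the 3D input cube X of shape (B, 2, c, H, W)."""
    B, c = ms_lr.shape[0], ms_lr.shape[1]
    H, W = pan.shape[-2:]
    # up-sample M^LR to the PAN resolution (bicubic is an assumption)
    ms_up = F.interpolate(ms_lr, size=(H, W), mode="bicubic",
                          align_corners=False)
    # replicate the PAN along the band dimension: P_i = P^HR for each band i
    pan_cube = pan.expand(B, c, H, W)
    # cascade the two features into a 3D cube: (B, 2, c, H, W)
    return torch.stack([pan_cube, ms_up], dim=1)

x = make_3d_input(torch.rand(1, 1, 64, 64), torch.rand(1, 4, 16, 16))
```

The leading "2" in the channel dimension corresponds to the two cascaded features (PAN and MS), matching the X ∈ R^(2×H×W×c) description.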
In the contraction pathway, the i-th low-scale feature F^L_i is obtained as:

F^L_i = Max(Block(F^L_{i−1})),

where Block(·) is composed of two groups of kernels, each with a rectified linear unit (ReLU) as the activation function (i.e., 3 × 3 × 3 kernel + ReLU), and Max(·) denotes the max-pooling layer with a 1 × 2 × 2 kernel. The 1 × 2 × 2 kernel is chosen to down-sample H and W by a factor of two while keeping the number of bands (c) unchanged.
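A minimal sketch of one contraction step under the kernel sizes stated above (the channel counts are assumptions):

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Block(.): two groups of 3x3x3 kernel + ReLU
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
    )

# Max(.): 1x2x2 max-pooling halves H and W, keeps the band dimension c
pool = nn.MaxPool3d(kernel_size=(1, 2, 2))

x = torch.rand(1, 2, 4, 64, 64)      # (B, features, c, H, W)
f = conv_block(2, 16)(x)             # -> (1, 16, 4, 64, 64)
f_low = pool(f)                      # -> (1, 16, 4, 32, 32): c unchanged
```

Note how the 1 × 2 × 2 pooling kernel leaves the band axis untouched, which is exactly why it is chosen over a 2 × 2 × 2 kernel.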
In the expansion pathway, the i-th high-scale feature F^H_i is obtained by combining the up-sampled deeper feature produced by Tran(·) with the attention-gated encoder feature produced by AG(·), where Tran(·) represents the transposed convolution with a factor of two and AG(·) stands for the AG module. A pansharpening layer is placed after the convolution block at each scale. Further details on the AG module and the pansharpening layer are given in the following sections.

The Attention Gate (AG) Module
First, F^H_i and F^L_i have their channel counts reduced from F_i to F_i/2 using a 3 × 3 × 3 kernel and are then cascaded. Next, the gate feature q^att_i is obtained via the ReLU function and a 1 × 1 × 1 kernel. Finally, q^att_i is activated using a sigmoid function and multiplied by F^L_i to obtain the feature x̂^up_i. In contrast to [37], this work uses gate features instead of global pooling because the extracted gate features are more consistent with the PAN features. The AG module's operations can be formalized as:

q^att_i = θ(σ1(F^H_i ⊕ F^L_i)), x̂^up_i = σ2(q^att_i) ⊗ F^L_i,

where σ1 is the ReLU activation function, σ2 denotes the sigmoid activation, θ denotes the parameters of the 1 × 1 × 1 convolution and ⊗ stands for pixel-by-pixel multiplication on each feature map. Finally, F^H_{i−1} is obtained from the convolution block via the transposed convolution.
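The AG operations described above can be sketched as follows (a hedged sketch; the class and layer names are assumptions, and the exact wiring may differ from the paper's implementation):

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """Sketch of the AG module: halve channels of F^H and F^L with 3x3x3
    convs, cascade, ReLU, 1x1x1 gate conv, sigmoid, then gate F^L."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.theta_h = nn.Conv3d(channels, half, 3, padding=1)
        self.theta_l = nn.Conv3d(channels, half, 3, padding=1)
        self.psi = nn.Conv3d(2 * half, 1, kernel_size=1)  # 1x1x1 gate conv

    def forward(self, f_h, f_l):
        # cascade the reduced features, then ReLU + 1x1x1 conv -> gate q_att
        q = torch.relu(torch.cat([self.theta_h(f_h), self.theta_l(f_l)], dim=1))
        alpha = torch.sigmoid(self.psi(q))   # attention coefficients in (0, 1)
        return f_l * alpha                   # pixel-wise gating of F^L

ag = AttentionGate(16)
out = ag(torch.rand(1, 16, 4, 32, 32), torch.rand(1, 16, 4, 32, 32))
```

The sigmoid gate keeps the output on the same grid as F^L, so the gated feature can be cascaded with the up-sampled decoder feature as usual.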
Utilizing the AG module instead of the skip connection enables MSAC-Net to pay more attention to each scale's local spatial details, consequently improving the spatial performance of pansharpening.

Figure 3. The attention gate (AG) module.

Pansharpening Layer and Multi-Scale Cost Function
MSAC-Net has S scales in the expansion pathway, i.e., S scale spaces. The pansharpening layer at each scale reconstructs the high-scale feature using a 1 × 1 × 1 convolution kernel, obtaining Ŷ^s_i:

Ŷ^s_i = R_i(F^H_i),

where R_i(·) is the i-th scale pansharpening layer. Accordingly, the target Y^s_i is generated via bicubic interpolation:

Y^s_i = D(Y^s_{i−1}),

where Y^s_1 represents the ground truth and D(·) is the bicubic interpolation with a factor of 2. The ℓ1-norm loss is used to constrain Y^s_i and Ŷ^s_i at the i-th scale:

ℓ^s_i = ‖Y^s_i − Ŷ^s_i‖_1,

where ℓ^s_i denotes the loss at the i-th scale. Finally, the proposed MSAC-Net's multi-scale cost function is calculated as (see Figure 2):

L = ℓ^s_1 + λ Σ_{i=2}^{S} ℓ^s_i,

where λ is the weight of the multi-scale information.
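The multi-scale cost above can be sketched as follows (a hedged sketch; the finest-scale-first list ordering and the exact weighting of the coarser scales are assumptions):

```python
import torch
import torch.nn.functional as F

def multiscale_l1(preds, gt, lam=1.0):
    """preds: list of (B, c, H_s, W_s) pansharpening-layer outputs,
    finest scale first; gt: (B, c, H, W) ground truth Y^s_1."""
    loss = F.l1_loss(preds[0], gt)           # finest-scale l1 term
    target = gt
    for pred in preds[1:]:
        # D(.): bicubic down-sampling with a factor of 2 builds Y^s_i
        target = F.interpolate(target, scale_factor=0.5, mode="bicubic",
                               align_corners=False)
        loss = loss + lam * F.l1_loss(pred, target)   # lambda-weighted term
    return loss

preds = [torch.rand(1, 4, 64, 64), torch.rand(1, 4, 32, 32)]
loss = multiscale_l1(preds, torch.rand(1, 4, 64, 64), lam=1.0)
```

Each coarser target is produced by repeatedly down-sampling the ground truth, so every scale of the decoder receives direct supervision.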

Datasets & Parameter Settings
To test the effectiveness of MSAC-Net, we used datasets collected by the IKONOS (http://carterraonline.spaceimaging.com/cgi-bin/Carterra/phtml/login.phtml, accessed on 20 December 2018) and QuickBird (http://www.digitalglobe.com/product-samples, accessed on 20 December 2018) satellites (Figure 4). Both datasets contain four standard bands (R: red, G: green, B: blue, N: near infrared). To ensure the availability of ground truth, Wald's protocol [51] was used to obtain baseline images for training and simulation tests.
The steps for obtaining the simulation data are as follows: (1) the original HR-PAN and LR-MS images were down-sampled by a factor of 4; (2) the down-sampled HR-PAN was used as the input PAN, and the down-sampled LR-MS as the input LR-MS; (3) the original LR-MS was used as the ground truth in the simulation experiment.
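The three steps above can be sketched as follows (bicubic down-sampling is an assumption; Wald's protocol is often implemented with a sensor-matched low-pass filter instead):

```python
import torch
import torch.nn.functional as F

def wald_pairs(pan, ms):
    """pan: (B, 1, H, W) original HR-PAN; ms: (B, c, H/4, W/4) original LR-MS.
    Returns (input_pan, input_ms, ground_truth) for the simulation setting."""
    # (1) down-sample both source images by a factor of 4
    pan_lr = F.interpolate(pan, scale_factor=0.25, mode="bicubic",
                           align_corners=False)
    ms_lr = F.interpolate(ms, scale_factor=0.25, mode="bicubic",
                          align_corners=False)
    # (2) the down-sampled images become the network inputs;
    # (3) the original LR-MS serves as the ground truth
    return pan_lr, ms_lr, ms

pan_in, ms_in, gt = wald_pairs(torch.rand(1, 1, 256, 256),
                               torch.rand(1, 4, 64, 64))
```

Because the ground truth is the original (un-degraded) MS image, reference metrics can be computed on the simulated pairs.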
For the real datasets, we only normalized the original images. Therefore, the fused images obtained from the real datasets have no ground truth.
The dataset information is shown in Table 2. The selected datasets contain rich texture information, such as rivers, roads and mountains. We randomly cropped 3000 PAN/MS data pairs from each dataset for a total of 6000 pairs, with PAN patches of size 256 × 256 × 1 and MS patches of size 64 × 64 × 4. Empirically, all data pairs were divided into training, validation and test sets in a ratio of 6:3:1. In the real-dataset experiments, 50 data pairs with PAN images of size 1024 × 1024 × 1 and MS images of size 256 × 256 × 4 were randomly acquired from each dataset.

The proposed method was implemented in Python 3.6 using the PyTorch 1.7 framework and trained and tested on an NVIDIA 1080 GPU. The stochastic gradient descent algorithm was used for training; its parameter settings are shown in Table 3. Among them, the learning rate was halved every 2000 iterations. During training, we saved the model that achieved the best performance on the validation set and used it for testing.
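The learning-rate schedule described above can be sketched as follows (the initial rate and momentum are assumptions, since Table 3 is not reproduced here):

```python
import torch

# Stand-in module; the real model would be MSAC-Net.
model = torch.nn.Conv2d(4, 4, 3, padding=1)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# halve the learning rate every 2000 iterations
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=2000, gamma=0.5)

for _ in range(4000):
    opt.step()        # loss.backward() omitted in this schedule-only sketch
    sched.step()

lr_now = opt.param_groups[0]["lr"]   # 0.1 -> 0.05 (iter 2000) -> 0.025 (iter 4000)
```

`StepLR` with `step_size=2000` and `gamma=0.5` implements exactly the "halved every 2000 iterations" decay when the scheduler is stepped once per iteration.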

Compared Methods
The proposed method is validated through a comparison with several pansharpening methods, including GS [52], Indusion [10], SR [13], PNN [18], PanNet [26], MSDCNN [19], MIPSM [20] and GTP-PNet [24]. GS is based on component substitution. Indusion is a multi-resolution analysis method. Furthermore, SR is based on sparse representation learning, whereas PNN is based on a three-layer CNN. PanNet is a residual network based on high-pass filtering. MSDCNN introduces residual learning and constructs a multi-scale and multi-depth CNN for feature extraction. MIPSM is a CNN fusion model that uses dual branches to extract features. Lastly, GTP-PNet is a residual learning network based on a gradient transformation prior. PNN, PanNet, MSDCNN, MIPSM and GTP-PNet are among the most advanced deep learning methods in the recent literature.

Performance Metrics
The performance of the proposed pansharpening method is analyzed through quantitative and visual assessments. We selected six reference evaluation indicators and one non-reference evaluation indicator. All symbols in the reference indicators are explained in Table 4, and the reference indicators are as follows:

(1) Correlation coefficient (CC) [53]: CC reflects the similarity of spectral features between the fused image X̂ and the ground truth X. CC ∈ [0, 1], with 1 being the best attainable value:

CC(X, X̂) = cov(X, X̂) / (σ_X σ_X̂).

(2) Peak signal-to-noise ratio (PSNR) [54]: PSNR is an objective measure of the information contained in an image; a larger value indicates less distortion between the two images:

PSNR(X, X̂) = 10 lg (L² / MSE(X, X̂)),

where L is the peak value of the image.

(3) Spectral angle mapper (SAM) [55]: SAM calculates the overall spectral distortion between the fused image and the ground truth, with 0 being the best attainable value. Per pixel, it is the angle between the spectral vectors:

SAM(X, X̂) = arccos(⟨X, X̂⟩ / (‖X‖ ‖X̂‖)),

averaged over all pixels.

(4) Root mean square error (RMSE) [56]: RMSE measures the deviation between the fused image and the ground truth, with 0 being the best attainable value:

RMSE(X, X̂) = √(MSE(X, X̂)).

(5) Erreur relative globale adimensionnelle de synthèse (ERGAS) [57]: ERGAS represents the global difference between the fused image and the ground truth, with 0 being the best attainable value:

ERGAS = 100 (h/l) √((1/c) Σ_{k=1}^{c} (RMSE(X_k, X̂_k) / μ_k)²),

where h/l is the ratio of the PAN to MS spatial resolutions and μ_k is the mean of the k-th band.

(6) Structural similarity index measurement (SSIM) [54]: SSIM measures the structural similarity between the fused image and the ground truth. SSIM ∈ [0, 1], with 1 being the best attainable value:

SSIM(X, X̂) = (2 μ_X μ_X̂ + C_1)(2 σ_XX̂ + C_2) / ((μ_X² + μ_X̂² + C_1)(σ_X² + σ_X̂² + C_2)).

(7) Quality without reference (QNR) [58]: As a non-reference indicator, QNR compares the brightness, contrast and local correlation of the fused image with the original image.
QNR ∈ [0, 1], with 1 being the best attainable value, is defined as:

QNR = (1 − D_λ)^a (1 − D_s)^b,

where usually a = b = 1, and the spatial distortion index D_s and the spectral distortion index D_λ are based on the universal image quality index Q [59]. Furthermore, D_s, D_λ ∈ [0, 1], with 0 being the best attainable value. Q is defined as:

Q(x, y) = 4 σ_xy x̄ ȳ / ((σ_x² + σ_y²)(x̄² + ȳ²)),

thus, D_λ and D_s are defined as:

D_λ = ((1/(c(c−1))) Σ_{i=1}^{c} Σ_{j≠i} |Q(M̂_i, M̂_j) − Q(M_i, M_j)|^p)^{1/p},
D_s = ((1/c) Σ_{i=1}^{c} |Q(M̂_i, P) − Q(M_i, P̃)|^q)^{1/q},

where M̂ is the fused image, M the original MS image, P the PAN image, P̃ its degraded low-resolution version and usually p = q = 1.
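The reference metrics with closed-form definitions above can be sketched in NumPy for normalized images (L = 1 is assumed, as is the mean-over-pixels reduction for SAM):

```python
import numpy as np

def rmse(x, y):
    # RMSE = sqrt(MSE)
    return float(np.sqrt(np.mean((x - y) ** 2)))

def psnr(x, y, peak=1.0):
    # PSNR = 10 * lg(L^2 / MSE); peak L = 1 for normalized images
    return float(10 * np.log10(peak ** 2 / np.mean((x - y) ** 2)))

def sam(x, y, eps=1e-12):
    """Mean spectral angle (radians) between per-pixel band vectors.
    x, y: (H, W, c) arrays; eps guards against zero-norm pixels."""
    dot = np.sum(x * y, axis=-1)
    norm = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1)
    return float(np.mean(np.arccos(np.clip(dot / (norm + eps), -1.0, 1.0))))

a = np.random.rand(8, 8, 4)          # mock ground truth
b = np.clip(a * 0.9, 0.0, 1.0)       # mock fused image with mild distortion
```

Identical images give RMSE = 0 and SAM ≈ 0, while PSNR grows without bound as the MSE shrinks, matching the "larger is better" reading above.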

The Influences of Multi-Scale Fusion and Attention Gate Mechanism
This section discusses the results of MSAC-Net and the other methods and analyzes the benefits of each MSAC-Net module. Considering that images exhibit scale invariance and that U-Net rescales the image over multiple scales during learning, MSAC-Net adopts a deep supervision mechanism to constrain the multi-scale fusion process: at every scale, the features are constrained so that the network learns the image's main features. The AG mechanism gradually suppresses the responses of unrelated regions, thus highlighting the feature regions [38]. Because MS images contain considerable noise [60], MSAC-Net suppresses the noisy areas through the AG mechanism to highlight the feature areas. In addition, MSAC-Net uses 3D convolution to preserve the spectral information of the MS data and injects the PAN images' spatial information into the MS images. These design choices yield better performance indicators while retaining the spectral correlation between bands.

The Influence of the AG Module
In Figure 6, the feature maps of the first scale with and without the AG module are visualized. More precisely, Figure 6b,e represent the feature maps without the AG module, and Figure 6c,f show the feature maps with the AG module. One can notice that the details in Figure 6c are significantly enhanced compared to those in Figure 6b; the detailed features of the "rivers" and "land" in Figure 6c are well extracted. Figure 6e has lower contrast and blurry edges, whereas Figure 6f has sharper contrast and sharper edges. Compared with F^L_i, F^H_i contains more local spatial information, and the texture information is enhanced, yielding more accurate results. Therefore, these results demonstrate the AG module's utility in advancing MSAC-Net's learning of detailed spatial features. As can be observed from Figure 6 and Table 5, adding the AG mechanism extracts the spatial details more fully, albeit with a slight loss of spectral information. In Table 5, the SAM of the "U-Net + scale" method is significantly reduced, and SSIM and PSNR are slightly improved when multi-scale information is introduced. Compared with Figure 7f, the spatial detail error in Figure 7h is reduced. These results show that the reconstruction of multi-scale information can exploit U-Net's hierarchical relationships to improve the network's expressive ability. Compared with Figure 7f-h, Figure 7i shows the smallest difference between the reference image and the method's output. Table 5 demonstrates MSAC-Net's superiority stemming from its use of the AG module and multi-scale information.

Comparison of 2D and 3D Convolutional Networks
Within this subsection, 2D and 3D convolution are used in MSAC-Net, and their indicators are compared to assess the advantages of employing 3D convolution in MS imagery pansharpening. The experimental results are shown in Figure 8 and Table 6.
The visual assessment indicates that Figure 8c has significant spatial distortion, whereas Figure 8d is more similar to the ground truth. For example, the road in Figure 8d is clearer than in Figure 8c. The reason may lie in the fact that 3D convolution considers the spectral characteristics of the MS image, thus enabling the feature maps to obtain more spectral information during the feature extraction process.
In Table 6, the MSAC-Net using 3D convolution shows a 2∼7% improvement regarding most indicators.

Comparison with the State-of-the-Art Methods
This section compares the proposed method to the eight state-of-the-art methods on the simulated and real datasets. Due to the lack of LR- and HR-MS image pairs, the spatial resolution of both PAN and MS images was reduced by a factor of four for training and testing, in accordance with Wald's protocol [51]. During the verification phase, the original MS and PAN images were used as input and compared visually over the whole image and over partial images.

Figure 9d-h suffer from significant spectral distortions. In contrast, the proposed method yields a result quite similar to the reference image in terms of visual perception, especially regarding the spatial details. In Figure 10, the resultant images of (d), (e), (j) and (k) are too sharp compared with the reference image, while (h) has a color deviation and (i) is blurred.

Tables 7 and 8 list the methods' results on the IKONOS and QuickBird datasets. The proposed method achieves competitive results on both. With respect to the best values obtained by the other methods (Table 7), MSAC-Net surpasses them only in ERGAS and SSIM on IKONOS; however, its other indicators in Table 7 remain competitive. On QuickBird, Table 8 shows that, with the exception of SAM, the proposed method's indicators are superior to those of the current state-of-the-art methods. These results establish the proposed method as satisfactory.

Experiments on the Real Dataset
The results obtained on the real IKONOS and QuickBird datasets are presented in Figures 11 and 12, where Figures 11j and 12j are the reference PAN images. As seen in Figure 11, the images generated by (a), (b), (e) and (g) are too sharp at the "river" edge, and Figure 11h shows an error at the "river" edge. Moreover, the synthetic chroma of Figure 11d deviates, and (f) is blurred. In contrast, MSAC-Net enhances the spectral information and refines the spatial details.
Tables 9 and 10 list the QNR, D_s and D_λ of the eight compared methods on the real IKONOS and QuickBird datasets. The proposed method outperforms the other methods on all indicators. The indicators on both datasets demonstrate that the proposed method enhances detail extraction and reduces spectral distortion.
Overall, the experiments show the proposed method's capability to retain spectral and spatial information from the original images in the real dataset.

The Effect of Convolution Times
The number of convolutions in the convolution block is discussed here. Intuitively, the more convolutions, the richer the network's feature expression should be. However, as shown in Figure 13a, as the number of convolutions increases beyond 2, the PSNR gradually decreases. This may be because an excessive number of convolutions makes the network pay too much attention to the form of the samples while ignoring the characteristics of the data, resulting in over-fitting. Therefore, two 3 × 3 × 3 kernels are used in training.

The Effect of Multi-Scale Information Weight λ
This subsection discusses the effect of the weight λ in Equation (10). Figure 13b shows that increasing λ from 0.01 to 1 gradually strengthens the constraints on scale information in MSAC-Net, and the resulting feature maps comprise sufficient details at each scale to enhance the results. When λ increases to 10, the network pays more attention to the scale information than to the fusion result, degrading MSAC-Net's final result. Therefore, to balance the scale information and the fusion result, λ is set to 1.

The Effect of Network Depth
A final parameter effect that needs to be investigated is the influence of the MSAC-Net's depth.
It is well known that increasing the network depth can improve feature extraction, but too deep a network leads to degradation. Therefore, based on experience, we examine MSAC-Net depths between three and five layers. Figure 13c shows that as the network depth increases, the average PSNR of MSAC-Net on the test set gradually improves. Therefore, according to the results in Figure 13c, the depth of MSAC-Net is set to five layers.
Following these experiments, the weight λ was set to 1, the number of convolutions to 2 and the depth to 5 in MSAC-Net.

Conclusions
This work introduces a novel 3D multi-scale attention deep convolutional network (MSAC-Net) for MS imagery pansharpening. The proposed MSAC-Net utilizes a 3D deep convolutional network suited to the characteristics of MS images. Moreover, the method integrates the attention mechanism and the deep supervision mechanism for preserving and extracting spectral and spatial information. The conducted experiments show that MSAC-Net with 3D convolution achieves better quantitative and visual pansharpening performance than the same network with 2D convolution. Exhaustive experiments investigated and analyzed the effects of the designed attention module, the multi-scale cost function and three critical MSAC-Net factors. The experimental results demonstrate that every designed module positively affects spatial and spectral information extraction, enabling MSAC-Net to achieve the best performance within the appropriate parameter range. Compared to the state-of-the-art pansharpening methods, the proposed MSAC-Net achieved comparable or even superior performance on the real IKONOS and QuickBird satellite datasets. Building on the results reported in this study, one can conclude that MSAC-Net is a promising multi-spectral imagery pansharpening method. In future work, we will explore how to make more efficient use of the attention mechanism to obtain a smaller CNN while ensuring the quality of the fused image.