A Spectral-Aware Convolutional Neural Network for Pansharpening

Pansharpening aims at fusing a low-resolution multiband optical (MBO) image, such as a multispectral or a hyperspectral image, with the associated high-resolution panchromatic (PAN) image to yield a high spatial resolution MBO image. Though having achieved superior performances to traditional methods, existing convolutional neural network (CNN)-based pansharpening approaches are still faced with two challenges: alleviating the phenomenon of spectral distortion and improving the interpretation abilities of pansharpening CNNs. In this work, we develop a novel spectral-aware pansharpening neural network (SA-PNN). On the one hand, SA-PNN employs a network structure composed of a detail branch and an approximation branch, which is consistent with the detail injection framework; on the other hand, SA-PNN strengthens processing along the spectral dimension by using a spectral-aware strategy, which involves spatial feature transforms (SFTs) coupling the approximation branch with the detail branch as well as 3D convolution operations in the approximation branch. Our method is evaluated with experiments on real-world multispectral and hyperspectral datasets, verifying its excellent pansharpening performance.


Introduction
Multiband optical (MBO) images, including multispectral (MS) and hyperspectral (HS) images, provide higher spectral resolution than red-green-blue (RGB) and panchromatic (PAN) images, which enlarges the differences among target objects and improves their identifiability. These characteristics can improve the effectiveness of various image tasks such as change detection [1], classification [2], object recognition [3], and scene interpretation [4]. In almost all of these tasks, MS or HS images with high spatial resolution are desired. However, physical constraints make it difficult for a single sensor to acquire high-spatial-resolution MBO (HR-MBO) images, i.e., images with both high spectral resolution and high spatial resolution. One way to address this problem is to generate HR-MBO images by fusing low-spatial-resolution MBO (LR-MBO) images with the associated high-resolution PAN images, a process usually called pansharpening.
Over the last decades, a variety of pansharpening methods have been proposed in the literature [5][6][7][8]. Commonly used methods can be categorized into two main groups: component substitution (CS) methods and multiresolution analysis (MRA) methods. CS methods seek to replace a component of the LR-MBO image with the PAN image, usually in a suitable transformed domain. This class includes intensity-hue-saturation (IHS) [9,10], principal component analysis (PCA) [11][12][13], Band-Dependent

• We create a two-branch structure in SA-PNN, comprising a detail branch and an approximation branch, which coincides with the interpretation of detail injection. The detail branch, which takes the concatenation of the LR-MBO image and the PAN image as input, is mainly responsible for resolving spatial details, while the approximation branch, which takes only the LR-MBO image as input, collaborates with the detail branch to inject the details and yield the final HR-MBO image. Our SA-PNN therefore follows the concept of detail injection, a routine that inspired many classical pansharpening methods, and hence gains clear interpretability.
• We use a spectral-aware strategy in SA-PNN to alleviate spectral distortion. The strategy mainly involves two aspects. On the one hand, spatial feature transforms (SFTs) are, for the first time, introduced into the pansharpening task to adjust the spectra of the processed data adaptively to the observed scene. On the other hand, 3D convolution operations are used in the approximation branch, which naturally accommodate processing along the spectral dimension. Since these spectral-aware strategies strengthen spectral processing and prediction, spectral distortion is expected to be reduced.
The rest of this paper is organized as follows. Section 2 briefly introduces the related backgrounds. Section 3 describes the proposed SA-PNN model in detail. The experimental results and analyses are presented in Section 4. Finally, the conclusion is presented in Section 5.

Backgrounds
The goal of pansharpening is to recover HR-MBO images from the observed low-spatial-resolution ones and the associated PAN images. It can usually be formulated as minimizing the expected squared-error loss

θ̂ = arg min_θ E‖H(X, P; θ) − Y‖²,

where H stands for the mapping from the observed MS/HS image X and the associated PAN image P to the high-resolution image to be predicted, Y is the ideal high-resolution image, and θ denotes the parameters of a parametric structure. CNNs, as specific instantiations of artificial neural networks, can be used to learn the mapping H(·; θ) in an end-to-end fashion. In CNNs, postulates such as a limited receptive field and spatially invariant weights (so-called weight sharing) are normally adopted. The response of a convolutional layer in a CNN is given by

Y_l = ϕ(W_l ∗ X_l + B_l),

where ∗ denotes the convolution operation; X_l and Y_l are the input and output of the l-th layer, respectively; W_l and B_l are the weight and bias matrices, respectively; and ϕ(·) represents the activation function, for which the rectified linear unit (ReLU) is commonly used due to its ability to mitigate gradient vanishing and its computational simplicity [42].
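As a concrete illustration of the layer response described above, the following is a minimal pure-Python sketch (not the authors' implementation; all names, kernel values, and biases are illustrative) of Y_l = ϕ(W_l ∗ X_l + B_l) with a ReLU activation on a single-channel 2D input:

```python
def relu(x):
    # ReLU activation: clips negative responses to zero
    return max(0.0, x)

def conv2d_layer(X, W, b):
    """'Valid' 2D correlation of X (H x W) with kernel W (k x k), plus bias, then ReLU."""
    k = len(W)
    H, Wd = len(X), len(X[0])
    out = []
    for i in range(H - k + 1):
        row = []
        for j in range(Wd - k + 1):
            s = sum(W[u][v] * X[i + u][j + v] for u in range(k) for v in range(k))
            row.append(relu(s + b))
        out.append(row)
    return out

X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
W = [[0.0, 0.0],
     [0.0, 1.0]]               # toy kernel: picks the bottom-right pixel of each 2x2 window
print(conv2d_layer(X, W, -6.0))  # [[0.0, 0.0], [2.0, 3.0]]: negative responses are clipped
```

Note the interaction between bias and activation: responses 5, 6, 8, 9 shifted by b = −6 become −1, 0, 2, 3, and ReLU zeroes the negative one.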
Owing to the 3D data arrangement of MS/HS images, two kinds of convolution operations can be involved in CNN-based pansharpening, i.e., 2D convolution and 3D convolution. When applied along the spatial dimensions, 2D convolution can extract spatial information from MS/HS images. However, 2D convolution is unable to effectively exploit the potential features among bands, i.e., the spectral information encoded in neighboring bands, because each output band is synthesized separately. In contrast, 3D convolution is realized by convolving a 3D kernel with a data cube, which is naturally suitable for simultaneously extracting both the spatial information between neighboring pixels and the spectral correlation in adjacent bands.
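The contrast can be sketched in pure Python (illustrative code, not the paper's implementation): a 3D kernel with spectral depth 3 mixes each band with its spectral neighbours, which band-wise 2D filtering cannot do.

```python
def conv3d_valid(cube, kernel):
    """'Valid' 3D correlation; cube has shape (bands, H, W), kernel (kb, kh, kw)."""
    B, H, W = len(cube), len(cube[0]), len(cube[0][0])
    kb, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for b in range(B - kb + 1):
        plane = []
        for i in range(H - kh + 1):
            row = []
            for j in range(W - kw + 1):
                s = sum(kernel[p][q][r] * cube[b + p][i + q][j + r]
                        for p in range(kb) for q in range(kh) for r in range(kw))
                row.append(s)
            plane.append(row)
        out.append(plane)
    return out

# Three 1x1 bands; a 3x1x1 averaging kernel pools across the spectral axis,
# producing a response that depends on *adjacent bands* jointly.
cube = [[[1.0]], [[4.0]], [[7.0]]]
spectral_avg = [[[1 / 3]], [[1 / 3]], [[1 / 3]]]
print(conv3d_valid(cube, spectral_avg))  # a single band holding the spectral average
```

A 2D kernel applied band by band would see only one of the three values at a time, so no output band could encode this inter-band correlation.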

Proposed Method
In this section, we develop a novel pansharpening neural network, called SA-PNN, which not only has clear interpretability but also is able to effectively alleviate spectral distortion.
The overall structure of our SA-PNN is shown in Figure 1. As can be observed, the network comprises two major branches. One is the detail branch, which extracts spatial details; the other is the approximation branch, which extracts approximations that are combined with the detail information from the detail branch to build the final HR-MBO image. The detail branch takes the stacked PAN image of size H_P × W_P and the LR-MBO image of size H_M × W_M × C as input and uses 2D convolutions in its convolutional layers; this design reflects the assumptions that the PAN image contains most of the useful spatial details and that 2D convolutions can effectively collect spatial information. In contrast, the approximation branch takes only the LR-MBO image as input and distills a spectral approximation, which is combined with the detail information yielded by the detail branch to produce the final HR-MBO image of size H_P × W_P × C. Since 2D convolutions cannot effectively build representations along the spectral dimension and thus may incur spectral distortion, 3D convolutions are used in the convolutional layers of the approximation branch to strengthen spectral processing. The detail branch and the approximation branch form the basis of the detail-injection-based structure of our SA-PNN. Apart from the 3D convolution operations in the convolutional layers, the SFT strategy is also introduced in the approximation branch of SA-PNN (as shown in Figure 1). SFT was originally designed to acquire a semantic categorical prior for color image enhancement [43]. We argue here that SFT can be used to adaptively adjust spectra according to the observed scene and to play a crucial role in the detail injection; to the best of our knowledge, this is the first time SFT has been introduced into the pansharpening task.
More specifically, in our SA-PNN, SFT provides an affine transformation of the convolutional feature maps based on a modulation parameter pair (α, β), where α and β are parameter maps conditioned on the observed scene. The transformation is carried out by a scaling and shifting operation:

SFT(F | α, β) = α ⊙ F + β, (3)

where F denotes the feature maps and ⊙ represents element-wise multiplication. As shown in Figure 2, for the l-th layer in the approximation branch, we first obtain a modulation parameter pair (α^l, β^l) from the feature maps of the l-th convolutional layer D^l (the l-th green cube in Figure 2). Then, we apply the element-wise affine transformation to the feature maps F^l (the l-th blue cubes in Figure 2) according to

F̂^l_k = α^l_k ⊙ F^l_k + β^l_k, (4)

where α^l_k and β^l_k denote the k-th modulation parameter pair at the l-th layer, F^l_k represents the k-th feature map of the approximation branch at the l-th layer, and F̂^l_k indicates the result after detail injection. It is noteworthy that both α^l_k and β^l_k are 2D data while F^l_k is volumetric data; thus, the operation in (4) actually applies the affine transformation to each band of F^l_k with the same parameter pair (α^l_k, β^l_k). Since they are determined by the detail branch, which carries high spatial details, α^l_k and β^l_k carry spatial information in response to the observed scene. Therefore, through the calculation in (4), the feature map data are modulated spatially conditioned on the observed scene. As shown in Figure 1, through a series of SFT operations, the data cubes streamed through the approximation branch are spatially adjusted layer by layer, and thus spectral processing is strengthened by spatial adaptiveness. With the detail injection fulfilled layer-wise, a convolutional layer with a 1 × 1 × 1 kernel is applied at the top of the network to yield the final HR-MBO image.
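The per-band modulation in (4) can be sketched in a few lines of pure Python (illustrative, not the authors' code): a single 2D parameter pair (alpha, beta) scales and shifts every band of a volumetric feature map, so the same spatial modulation is shared across the spectral axis.

```python
def sft(F, alpha, beta):
    """SFT modulation: F is a (bands, H, W) feature cube; alpha and beta are (H, W) maps.
    Returns alpha * F + beta applied to each band with the same parameter pair."""
    B, H, W = len(F), len(F[0]), len(F[0][0])
    return [[[alpha[i][j] * F[b][i][j] + beta[i][j] for j in range(W)]
             for i in range(H)]
            for b in range(B)]

F = [[[1.0, 2.0], [3.0, 4.0]],   # band 0
     [[5.0, 6.0], [7.0, 8.0]]]   # band 1
alpha = [[2.0, 0.5], [1.0, 1.0]]  # spatially varying scale
beta  = [[0.0, 1.0], [-1.0, 0.0]]  # spatially varying shift
print(sft(F, alpha, beta))
# band 0 -> [[2.0, 2.0], [2.0, 4.0]]; band 1 -> [[10.0, 4.0], [6.0, 8.0]]
```

Note how pixel (0, 0) is doubled in both bands while pixel (0, 1) is halved and shifted in both bands: the modulation is spatially adaptive but spectrally shared, exactly the behavior described for (4).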
As mentioned above, the parameter pair (α, β) carries spatial information related to the observed scene and modulates the feature map data in the approximation branch according to (3) and (4). Intuitively, the modulation should be bidirectional, meaning that α and β may be positive or negative. Therefore, the leaky ReLU (LReLU) [44], rather than the standard ReLU, is used in the convolution layers yielding α and β (as shown in Figure 2):

LReLU(x) = x if x ≥ 0, and a·x otherwise,

where a is a positive constant. In brief, the proposed SA-PNN adopts a two-branch network structure, where the detail branch is mainly responsible for distilling spatial details while the approximation branch extracts spectral-spatial information from the data and fulfills the layer-wise detail injection with SFTs; thus, SA-PNN has the clear interpretability of detail injection. Moreover, to strengthen spectral processing and alleviate spectral distortion in the final pansharpening result, a spectral-aware strategy is developed that jointly comprises SFTs and 3D convolutional layers; in particular, the SFTs allow the spectra of the processed data to be automatically adjusted with respect to the observed scene.
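A one-line sketch of the LReLU used for the layers producing α and β (a = 0.2 is an illustrative choice; the paper only requires a to be a positive constant): unlike ReLU, it lets negative values pass, scaled by a, so the modulation can be bidirectional.

```python
def lrelu(x, a=0.2):
    # Leaky ReLU: identity for non-negative inputs, small slope a for negative ones
    return x if x >= 0 else a * x

print([lrelu(x) for x in [-2.0, -0.5, 0.0, 1.5]])  # [-0.4, -0.1, 0.0, 1.5]
```

A standard ReLU would map both −2.0 and −0.5 to 0.0, making negative modulation values unreachable.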

Experiment Results
To evaluate the performance of our method, we conducted experiments on several real-world multispectral and hyperspectral image datasets and then made comparisons between SA-PNN and several representative pansharpening methods.

Experimental Setup
We carried out experiments on four datasets: three multispectral (MS) datasets acquired with the WorldView-2, IKONOS (the name comes from the Greek word eikōn) [45], and Quickbird sensors, and one hyperspectral (HS) dataset acquired with the Reflective Optics System Imaging Spectrometer (ROSIS). A brief description of these datasets is provided below. The resolution ratio R for the HS dataset is usually set to 5 by convention in the pansharpening task, which means that the low-spatial-resolution HS image is generated by degrading the original HS image with a scale factor of 5 in the subsequent experiments. The radiometric resolution is 13 bits.
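As a hedged sketch of how a low-resolution input can be simulated from the reference image at ratio R = 5, the following uses a simple R × R block average per band (the actual degradation filter used in the experiments is not specified here; MTF-matched filters are common alternatives):

```python
def degrade(band, R):
    """Downsample one 2D band (H x W, with H and W divisible by R) by R x R block averaging."""
    H, W = len(band), len(band[0])
    return [[sum(band[i * R + u][j * R + v] for u in range(R) for v in range(R)) / (R * R)
             for j in range(W // R)]
            for i in range(H // R)]

# Toy 10x10 band with value i*10 + j at pixel (i, j); degrading at R = 5 gives a 2x2 band.
band = [[float(i * 10 + j) for j in range(10)] for i in range(10)]
lr = degrade(band, 5)
print(lr)  # [[22.0, 27.0], [72.0, 77.0]]: each value is the mean of one 5x5 block
```

Each output pixel is the mean of a 5 × 5 block, mirroring how the reduced-resolution assessment protocol pairs a degraded input with the original image as ground truth.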

Implementation Details
The methods mentioned above were trained on a laptop with an NVIDIA GeForce GTX 1060 GPU (CUDA 9.0 and cuDNN v7), an Intel Core i7-8750H CPU, and 16 GB of RAM, using the TensorFlow framework, and tested in MATLAB R2017b. The training set of each MS dataset consists of 12,800 partially overlapping patches of size 33 × 33, cropped from a selected sub-scene. Analogously, the validation set consists of 3200 patches of size 33 × 33, cropped from a different sub-scene. In addition, a sub-image disjoint from the training/validation sets is selected for testing. For the HS dataset, the training/validation sets are built in the same way as for the MS datasets except that the patch size is only 11 × 11, owing to the small size of this dataset. The convolutional kernel sizes are empirically set to 7 × 7 in the detail branch and 7 × 7 × 3 in the approximation branch, and the number of layers in our SA-PNN is empirically set to 4. The batch size used in the training phase is 64. The initial learning rate is set to 0.00005 and is halved every 10^5 iterations. The mean squared error (MSE) is chosen as the loss function and optimized with the Adam algorithm. The training process terminates after 5 × 10^5 iterations. Table 1 shows the average training time of our SA-PNN on the four datasets in GPU mode. Note that the training time on the Pavia dataset is less than that on the other three datasets due to the smaller size of its training patches.
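The step-decay schedule described above (start at 5e-5, halve every 10^5 iterations, stop at 5 × 10^5) can be written as a one-line function; the function name is illustrative:

```python
def learning_rate(iteration, base=5e-5, step=100_000):
    # Step decay: halve the base rate once per completed block of `step` iterations
    return base * 0.5 ** (iteration // step)

print(learning_rate(0))        # 5e-05
print(learning_rate(100_000))  # 2.5e-05
print(learning_rate(499_999))  # 3.125e-06, the rate in effect when training stops
```

So over the full run the learning rate takes exactly five values, from 5e-5 down to 3.125e-6.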

Experiment 1-WorldView-2 Washington Dataset
This dataset covers an urban area in Washington D.C., and we chose a sub-scene with 256 × 256 pixels for testing. Table 2 shows the results of the reduced resolution quality assessment on the WorldView-2 dataset. The best results are shown in bold. As we can see, PRACS and CNMF have worse ERGAS than the other methods, indicating poor overall pansharpening quality. In contrast, ATWT and MTF-GLP-HPM achieve relatively good scores on all metrics. Other methods such as PCA and GSA restore the spatial details to some degree, as reflected in their better SCC than EXP. However, they yield worse SAM, a metric that quantifies spectral distortion. The CNN-based methods achieve better performance than the traditional methods. Among them, our proposed SA-PNN yields the best numerical assessment results, with an especially impressive performance gain in SAM. Although the numerical indicators clearly show the performance of our proposed method, we also rely on visual inspection to find noticeable distortions that elude quantitative analyses. As shown in Figure 3, the results of the traditional methods suffer from some spatial blur, such as in the area in the upper middle of the images in Figure 3. Among them, CNMF is characterized by severe spectral distortion, such as in the lake region in the bottom left of the images in Figure 3. The results of the CNN-based methods are more similar to the ground truth, and our SA-PNN exhibits excellent quality in both spatial detail and spectral fidelity. It is noteworthy that the result of PNN shows spectral distortion at the image edges, which is caused by convolution without padding during the training phase; this is barely noticeable in the numerical indicators.
To further compare performance, the spectral difference curves between the ground truth and the pansharpening results, generated by subtracting the ground truth from the pansharpening results, are plotted in Figure 4. Specifically, the values along the difference axes in Figure 4 indicate the difference in gray values between the ground truth and the pansharpening results. As we can see, although performance varies across individual spatial coordinates, the spectral difference curves of SA-PNN are, on average over multiple coordinates and bands, the closest to the reference, demonstrating that SA-PNN offers superior spectral fidelity.
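The curve construction described above amounts to a band-wise subtraction at a fixed spatial coordinate; a minimal sketch with illustrative data (not values from the actual experiments):

```python
def spectral_difference(fused, reference, i, j):
    """Both images are (bands, H, W) nested lists; returns the per-band
    difference (fused minus ground truth) at spatial coordinate (i, j)."""
    return [f[i][j] - r[i][j] for f, r in zip(fused, reference)]

# Toy 3-band, 1x1 images standing in for the ground truth and a pansharpened result.
reference = [[[100.0]], [[120.0]], [[140.0]]]
fused     = [[[ 98.0]], [[121.0]], [[139.0]]]
print(spectral_difference(fused, reference, 0, 0))  # [-2.0, 1.0, -1.0]
```

A curve hugging the zero line across all bands therefore indicates high spectral fidelity at that coordinate.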

Experiment 2-IKONOS Hobart Dataset
This dataset covers an urban area of Hobart in Australia. A sub-scene with 256 × 256 pixels is used for testing in our experiments. Table 3 shows the results of the reduced resolution quality assessment on the IKONOS dataset. The pansharpening qualities of the various methods vary with the characteristics of the sensor. PCA and CNMF yield worse pansharpening results than the other methods. Unlike on the WorldView-2 dataset, GSA performs much better among the traditional methods. This phenomenon suggests the limited robustness of the traditional methods. As on the previous dataset, the CNN-based methods achieve a significant improvement. Among them, PNN achieves better results than DRPNN and DiCNN1, while our SA-PNN again achieves the best performance. For visual inspection, Figure 5 shows the pansharpening results yielded by the different methods. Consistent with the conclusions drawn from Table 3, the results of the traditional methods are unsatisfactory owing to spectral distortion and diffuse blur, especially in the result of CNMF. Although the results of PNN, DRPNN, and DiCNN1 are superior in terms of spatial details, they inevitably suffer from slight spectral distortion, e.g., on the red roofs visible in Figure 5. The result of our SA-PNN is more similar to the ground truth than the others, with comparable spatial details and unnoticeable spectral distortion.

Experiment 3-Quickbird Sundarbans Dataset
This dataset covers a forest area of Sundarbans in India. Similar to the previous setting, a sub-scene with 256 × 256 pixels was used for testing.
The results of the reduced resolution quality assessment on the Quickbird dataset are shown in Table 4. As on the previous datasets, our SA-PNN remains largely preferable. Among the traditional methods, PCA and CNMF present worse pansharpening quality while MTF-GLP-HPM achieves the best result. PNN, DRPNN, and DiCNN1 achieve similar improvements, far exceeding the traditional methods. Visual inspection is shown in Figure 6. It is obvious that CNMF suffers from severe spectral distortion. Furthermore, the results of the traditional methods are spatially blurred, such as in the area in the upper right of the images in Figure 6. Analogously, the CNN-based methods achieve better visual results, and our SA-PNN again attains the best performance in terms of spatial detail reconstruction and spectral fidelity.

Experiment 4-Pavia University Dataset
This dataset covers an urban area in Pavia, northern Italy. Owing to the limited spatial size of the dataset, we chose a sub-scene with 150 × 75 pixels for testing. Considering the large data volume and computational efficiency, a dimensionality reduction operation is applied to this dataset when it is used with the CNN-based methods. Table 5 presents the results of the reduced resolution quality assessment on the Pavia dataset. As we can see, although many traditional methods such as GSA and Bayesian Sparse achieve better SCC than EXP, they are poor at guaranteeing spectral fidelity due to the greatly increased number of bands. SFIM and MTF-GLP-HPM achieve relatively good performance. Among the CNN-based methods, PNN obtains a high SCC while the other three metrics are poor, which implies that PNN is not effective at preserving spectral fidelity on data with a large number of bands. The results of DRPNN and DiCNN1 are competitive, with significant improvements over PNN. Our SA-PNN again achieves the best pansharpening quality. Visual inspection is given in Figure 7. The result of GFPCA is severely blurred, while Bayesian Sparse shows spectral distortion, such as in the area in the upper left of the images in Figure 7. Other traditional methods such as SFIM and CNMF achieve a trade-off between spatial details and spectral fidelity. Among all methods, our SA-PNN obtains the best visual result.
Considering that the simulated pseudo-color visual results shown in Figure 7 may not fully exhibit the spectral characteristics of data with a large number of bands, for the CNN-based methods we also show the spectral difference curves at randomly selected spatial coordinates: (8, 27), (8, 125), (64, 11), and (68, 131) in Figure 8. It is obvious that the spectral difference curves of PNN deviate heavily from the reference line, which corresponds to the spectral distortion present in the pansharpening result of PNN. The overall spectral difference curves of our SA-PNN are closest to the reference line, which again demonstrates the excellent spectral fidelity of our SA-PNN.

Conclusions
In this paper, we proposed a novel pansharpening neural network, SA-PNN, for MS/HS images. The network, which comprises two branches (the detail branch and the approximation branch), has the clear interpretability of detail injection. Furthermore, a spectral-aware strategy composed of SFT operations and 3D convolutions is used in SA-PNN to strengthen spectral processing. Thus, our network offers the potential to reduce spectral distortion in the final pansharpening result. Experimental results on real-world MS/HS images validated the remarkable performance of our SA-PNN.

Conflicts of Interest:
The authors declare no conflict of interest.