A Multi-Scale Wavelet 3D-CNN for Hyperspectral Image Super-Resolution

: Super-resolution (SR) is significant for hyperspectral image (HSI) applications. In single-frame HSI SR, how to reconstruct detailed image structures in high resolution (HR) HSI is challenging since there is no auxiliary image (e.g., HR multispectral image) providing structural information. Wavelet could capture image structures in di ﬀ erent orientations, and emphasis on predicting high-frequency wavelet sub-bands is helpful for recovering the detailed structures in HSI SR. In this study, we propose a multi-scale wavelet 3D convolutional neural network (MW-3D-CNN) for HSI SR, which predicts the wavelet coe ﬃ cients of HR HSI rather than directly reconstructing the HR HSI. To exploit the correlation in the spectral and spatial domains, the MW-3D-CNN is built with 3D convolutional layers. An embedding subnet and a predicting subnet constitute the MW-3D-CNN, the embedding subnet extracts deep spatial-spectral features from the low resolution (LR) HSI and represents the LR HSI as a set of feature cubes. The feature cubes are then fed to the predicting subnet. There are multiple output branches in the predicting subnet, each of which corresponds to one wavelet sub-band and predicts the wavelet coe ﬃ cients of HR HSI. The HR HSI can be obtained by applying inverse wavelet transform to the predicted wavelet coe ﬃ cients. In the training stage, we propose to train the MW-3D-CNN with L1 norm loss, which is more suitable than the conventional L2 norm loss for penalizing the errors in di ﬀ erent wavelet sub-bands. Experiments on both simulated and real spaceborne HSI demonstrate that the proposed algorithm is competitive with other state-of-the-art HSI SR methods.


Introduction
Hyperspectral image (HSI) is collected in contiguous bands over a certain electromagnetic spectrum, and the spectral and spatial information in HSI is helpful for identifying and discriminating different materials in the scene. HSI has been applied to many fields, including target detection [1], environment monitoring [2], and land-cover classification [3]. However, the spatial resolution of HSI is often limited due to the trade-off between the spatial and spectral resolutions. Some Earth Observation applications, such as urban mapping [4] and fine mineral mapping [5], require high resolution (HR) HSI. Therefore, enhancing the spatial resolution of HSI is of significance for the application of HSI.
There are several ways to enhance the spatial resolution of HSI. Some auxiliary images, e.g., panchromatic image and multispectral image (MSI), often have higher spatial resolution [6]. Hyperspectral pan-sharpening reconstructs HR HSI by fusing the low resolution (LR) HSI with a HR panchromatic image taken over the same area at the same time (or a similar period). Pan-sharpening could be implemented by

•
Unlike the previous deep learning models that reconstruct HR HSI directly [29][30][31][32], the proposed network predicts the wavelet coefficients of the latent HR HSI, which is beneficial for reconstructing detailed textures in HSI.

•
In the predicting subnet, different branches corresponding to different wavelet sub-bands are trained jointly in a unified network, and the inter sub-band correlation can be utilized.

•
The network is built based on 3D convolutional layers, which could exploit the correlation in both spectral and spatial domains of HSI.

•
Instead of the conventional L2 norm, we propose to train the network with the L1 norm loss, which is fit for both low-and high-frequency wavelet sub-bands.
The remainder of the paper is organized as follows. In Section 2, we introduce some related works. In Section 3, we present the proposed HSI SR method, including the architecture and training of the network. The experimental results are given in Section 4. In Section 5, we present some analyses and discussion on the experiment. In Section 6, we conclude with observations specific to the potential of our approach to single-frame HSI SR.

CNN Based Single Image SR
CNN could extract features from the local neighborhood of image by convolving with trainable kernels, which makes it easy to exploit spatial correlation in an image. CNN has become the most popular deep learning model in many image processing tasks, particularly in image SR [40][41][42][43][44][45][46].
In [38], Dong et al. proposed to learn the mapping between the LR and HR images using a CNN. The HR image can be inferred from its LR version with the trained network. Inspired by this idea, several CNN based single image SR methods have been proposed [41][42][43][44][45][46]. In [41], a very deep CNN for SR was proposed and trained with a residual learning strategy. Trainable parameters would drastically increase in very deep CNN, and a recursive CNN was proposed to address this issue by sharing the parameters of different layers in [42]. Most CNN SR methods employed the high-level features for reconstruction and neglected the low-and mid-level features. In [43,44], the authors proposed a residual dense network for SR, in which layers were densely connected to make full use of the hierarchical features. To address the challenge of super-resolving an image by large factors, the authors in [45] proposed progressive deep learning models to upscale the image gradually. Similarly, a Laplacian Pyramid SR CNN (LapSRN) was proposed in [46], which could progressively reconstruct the high-frequency details of different sub-bands of the latent HR image.

Application of Wavelet in SR
Wavelet describes image structures in different orientations. Employing wavelet in image SR, particularly the high-frequency wavelet sub-bands, is beneficial for preserving the detailed image structures. Many wavelet based SR methods have been proposed [47][48][49][50]. In [47], the LR image was decomposed into different wavelet sub-bands, the high-frequency sub-bands were interpolated and then combined with the LR image to generate HR image via inverse wavelet transformation. Similarly, the LR image was decomposed by two types of wavelets, and the high-frequency sub-bands of the two wavelets were then combined and followed by inverse wavelet transformation [48]. In [49,50], edge prior was utilized in the high-frequency sub-bands estimation to make the SR result sharper. Wavelet could also be used in CNN to better infer image details and enhance the sparsity of the network. For example, in [34,35], the mapping between the LR and HR images was learned by a CNN in wavelet domain for single image SR. However, these SR methods were designed for a single image, therefore applying these methods to HSI in band-by-band fashion would neglect the spectral correlation in HSI and lead to high spectral distortion.

Multi-Scale Wavelet 3D CNN For HSI SR
In this study, we transform the HSI SR problem into predicting the wavelet coefficients of HSI. In this section, we first introduce some basics on wavelet package analysis and 3D CNN, then we propose the MW-3D-CNN for HSI SR, including the architecture and the loss function.

Wavelet Package Analysis
Wavelet package transformation (WPT) could transform an image into a serial of wavelet coefficients sub-bands with the same size. An example of WPT with Haar wavelet function is given in Figure 1. The one-level decomposition is shown in Figure 1b. It can be found that the low-frequency sub-band (i.e., the top-left patch) describes the global topology. The detailed structures in vertical, horizontal, and diagonal orientations can be captured by different high-frequency sub-bands (i.e., the rest patches). By repeating the decomposition to each sub-band recursively, we can obtain higher-level WPT results, such as the two-level decomposition in Figure 1c. It is noted that the decomposition is applied to both the low-and high-frequency sub-bands, so the sub-bands of higher-level decomposition are of the same size. The original image can be reconstructed from these sub-bands via inverse WPT. Wavelet describes image structures in different orientations. Employing wavelet in image SR, particularly the high-frequency wavelet sub-bands, is beneficial for preserving the detailed image structures. Many wavelet based SR methods have been proposed [47][48][49][50]. In [47], the LR image was decomposed into different wavelet sub-bands, the high-frequency sub-bands were interpolated and then combined with the LR image to generate HR image via inverse wavelet transformation. Similarly, the LR image was decomposed by two types of wavelets, and the high-frequency subbands of the two wavelets were then combined and followed by inverse wavelet transformation [48]. In [49,50], edge prior was utilized in the high-frequency sub-bands estimation to make the SR result sharper. Wavelet could also be used in CNN to better infer image details and enhance the sparsity of the network. For example, in [34,35], the mapping between the LR and HR images was learned by a CNN in wavelet domain for single image SR. However, these SR methods were designed for a single image, therefore applying these methods to HSI in band-by-band fashion would neglect the spectral correlation in HSI and lead to high spectral distortion.

Multi-Scale Wavelet 3D CNN For HSI SR
In this study, we transform the HSI SR problem into predicting the wavelet coefficients of HSI. In this section, we first introduce some basics on wavelet package analysis and 3D CNN, then we propose the MW-3D-CNN for HSI SR, including the architecture and the loss function.

Wavelet Package Analysis
Wavelet package transformation (WPT) could transform an image into a serial of wavelet coefficients sub-bands with the same size. An example of WPT with Haar wavelet function is given in Figure 1. The one-level decomposition is shown in Figure 1b. It can be found that the low-frequency sub-band (i.e., the top-left patch) describes the global topology. The detailed structures in vertical, horizontal, and diagonal orientations can be captured by different high-frequency sub-bands (i.e., the rest patches). By repeating the decomposition to each sub-band recursively, we can obtain higherlevel WPT results, such as the two-level decomposition in Figure 1c. It is noted that the decomposition is applied to both the low-and high-frequency sub-bands, so the sub-bands of higher-level decomposition are of the same size. The original image can be reconstructed from these sub-bands via inverse WPT.

3D CNN
For HSI, both spatial and spectral domains should be exploited in feature extraction. By convolving with 3D kernels, 3D CNN could extract features from different domains of volumetric data. The activity of the k-th feature cube in the d-th layer following formulation in [51] can be written as

3D CNN
For HSI, both spatial and spectral domains should be exploited in feature extraction. By convolving with 3D kernels, 3D CNN could extract features from different domains of volumetric data. The activity of the k-th feature cube in the d-th layer following formulation in [51] can be written as where, c means the set of feature cubes in the (d-1)-th layer connected to the k-th feature cube in the d-th layer, d,k,c (u,v,w) is the value at position (u, v, w) of the 3D kernel associated with the k-th feature cube. The size of the 3D kernel is U × V × W. F d,k (x,y,z) is the value at position (x, y, z) of the k-th feature cube in the d-th layer. g(·) is a non-linear activation function such as Rectified Linear Unit (ReLU) and Sigmoid function, etc. By convolving with different kernels, several 3D feature cubes can be extracted in each layer of 3D CNN, as shown in Figure 2b. Pixels of spatial neighborhood and adjacent bands are involved in 3D convolution, and the spectral-spatial correlation in HSI can be jointly exploited in feature extraction [52,53].
Remote Sens. 2019, 11, x FOR PEER REVIEW  5 of 22 where, c means the set of feature cubes in the (d-1)-th layer connected to the k-th feature cube in the d-th layer, is the value at position ( , , ) u v w of the 3D kernel associated with the k-th feature cube. The size of the 3D kernel is is the value at position ( , , ) x y z of the k-th feature cube in the d-th layer. ( ) g ⋅ is a non-linear activation function such as Rectified Linear Unit (ReLU) and Sigmoid function, etc. By convolving with different kernels, several 3D feature cubes can be extracted in each layer of 3D CNN, as shown in Figure 2b. Pixels of spatial neighborhood and adjacent bands are involved in 3D convolution, and the spectral-spatial correlation in HSI can be jointly exploited in feature extraction [52,53].

Network Architecture of MW-3D-CNN
The correlation exists not only in the spatial and spectral domains, but also among the wavelet package sub-bands of HSI. Considering the inter wavelet package sub-bands correlation, an embedding subnet is designed to learn shared features for different wavelet package sub-bands. These shared features are then fed to a predicting subnet to infer the wavelet package coefficients. Both of the embedding and predicting subnets are built based on 3D convolutional layers, which could naturally exploit the spectral-spatial correlation in HSI. The overall architecture of MW-3D-CNN is shown in Figure 3.

Embedding Subnet
The embedding subnet projects the LR HSI into deep feature space and represents it as a set of feature cubes that are shared by different wavelet package sub-bands. 3D convolutional layers and non-linear activation layers are alternatively stacked in the embedding subnet. The embedding subnet extracts feature cubes from the LR HSI , where m, n, and L are the number of rows, columns, and spectral bands, respectively. Both spectral and spatial information of HSI can be encoded by 3D convolution during the feature extraction, after several 3D convolutional layers, the LR HSI X could be represented by a serial of spectral-spatial feature cubes, which are expressed as ( ) , where S is the number of feature cubes,  denotes the function of embedding subnet. It is noted that zero padding is adopted in each convolutional layer to make the feature cubes the same size with the LR HSI.

Predicting Subnet
The embedding subnet is followed by a predicting subnet, which infers wavelet package coefficients. There are multiple output branches in the predicting subnet, each of which corresponds to one wavelet package sub-band. The predicting subnet takes the feature cubes extracted by the embedding subnet as input, each branch of the predicting subnet is trained to infer the wavelet coefficients at each sub-band. Similar to the embedding subnet, each branch in the predicting subnet

Network Architecture of MW-3D-CNN
The correlation exists not only in the spatial and spectral domains, but also among the wavelet package sub-bands of HSI. Considering the inter wavelet package sub-bands correlation, an embedding subnet is designed to learn shared features for different wavelet package sub-bands. These shared features are then fed to a predicting subnet to infer the wavelet package coefficients. Both of the embedding and predicting subnets are built based on 3D convolutional layers, which could naturally exploit the spectral-spatial correlation in HSI. The overall architecture of MW-3D-CNN is shown in Figure 3.

Embedding Subnet
The embedding subnet projects the LR HSI into deep feature space and represents it as a set of feature cubes that are shared by different wavelet package sub-bands. 3D convolutional layers and non-linear activation layers are alternatively stacked in the embedding subnet. The embedding subnet extracts feature cubes from the LR HSI X ∈ R m×n×L , where m, n, and L are the number of rows, columns, and spectral bands, respectively. Both spectral and spatial information of HSI can be encoded by 3D convolution during the feature extraction, after several 3D convolutional layers, the LR HSI X could be represented by a serial of spectral-spatial feature cubes, which are expressed as ψ(X) ∈ R m×n×L×S , where S is the number of feature cubes, ψ : R m×n×L → R m×n×L×S denotes the function of embedding subnet. It is noted that zero padding is adopted in each convolutional layer to make the feature cubes the same size with the LR HSI.

Predicting Subnet
The embedding subnet is followed by a predicting subnet, which infers wavelet package coefficients. There are multiple output branches in the predicting subnet, each of which corresponds to one wavelet package sub-band. The predicting subnet takes the feature cubes extracted by the embedding subnet as input, each branch of the predicting subnet is trained to infer the wavelet coefficients at each sub-band. Similar to the embedding subnet, each branch in the predicting subnet is also stacked by 3D convolutional layers and non-linear activation layers with zero padding strategy adopted, and the predicted wavelet coefficients have the same spatial size with the LR HSI. The desired HR HSI is obtained by applying inverse WPT to the predicted wavelet coefficients, so the upscaling factor of SR depends on the number of WPT levels. Specifically, suppose the number of WPT levels is l, there would be N w = 4 l wavelet package sub-bands, and the number of output branches in the predicting subnet is also 4 l . Taking the shared feature cubes ψ(X) as input, the i-th branch ϕ i predicts the i-th wavelet package sub-band as ϕ i (ψ(X)) ∈ R m×n×L , where ϕ i : R m×n×L×S → R m×n×L , i = 1, 2, . . . , N w denotes the function of the i-th branch. The output of MW-3D-CNN can be denoted as a set of wavelet package coefficients: In the training stage, the MW-3D-CNN learns the mapping between the LR HSI and the wavelet package coefficients of the latent HR HSI. In the testing stage, given the LR HSI, the MW-3D-CNN would infer the wavelet package coefficients at each sub-band. Applying inverse WPT to the predicted wavelet package coefficients, the HR HSI can be obtained: where, φ denotes inverse WPT,Ŷ ∈ R (r×m)×(r×n)×L is the estimated HR HSI, r = 2 l is the upscaling factor of SR. Different wavelet sub-bands share the common deep layers in the embedding subnet due to the inter wavelet sub-bands correlation. The embedding subnet learns the shared feature cubes and the predicting subnet optimizes with respect to each wavelet package sub-band. The embedding subnet connects different branches into a unified predicting subnet and allows them to be jointly optimized. Specifically, the errors in each wavelet package sub-band can be jointly back-propagated to the embedding subnet to learn the shared features, and the embedding subnet will refine different branches in the predicting subnet. Compared with training each branch independently, such joint training could make different branches facilitate each other and implicitly capture the correlation among different wavelet sub-bands. is also stacked by 3D convolutional layers and non-linear activation layers with zero padding strategy adopted, and the predicted wavelet coefficients have the same spatial size with the LR HSI. The desired HR HSI is obtained by applying inverse WPT to the predicted wavelet coefficients, so the upscaling factor of SR depends on the number of WPT levels. Specifically, suppose the number of WPT levels is l, there would be 4 l w N = wavelet package sub-bands, and the number of output branches in the predicting subnet is also 4 l . Taking the shared feature cubes ( ) ψ X as input, the ith branch i ϕ predicts the i-th wavelet package sub-band as denotes the function of the i-th branch. The output of MW-3D-CNN can be denoted as a set of wavelet package coefficients: 1 2 { ( ( )), ( ( )),..., ( ( )),..., ( ( ))} 1, 2,..., In the training stage, the MW-3D-CNN learns the mapping between the LR HSI and the wavelet package coefficients of the latent HR HSI. In the testing stage, given the LR HSI, the MW-3D-CNN would infer the wavelet package coefficients at each sub-band. Applying inverse WPT to the predicted wavelet package coefficients, the HR HSI can be obtained: where, φ denotes inverse WPT, is the upscaling factor of SR. Different wavelet sub-bands share the common deep layers in the embedding subnet due to the inter wavelet sub-bands correlation. The embedding subnet learns the shared feature cubes and the predicting subnet optimizes with respect to each wavelet package sub-band. The embedding subnet connects different branches into a unified predicting subnet and allows them to be jointly optimized. Specifically, the errors in each wavelet package sub-band can be jointly back-propagated to the embedding subnet to learn the shared features, and the embedding subnet will refine different branches in the predicting subnet. Compared with training each branch independently, such joint training could make different branches facilitate each other and implicitly capture the correlation among different wavelet sub-bands. Our MW-3D-CNN focuses on predicting the wavelet package coefficients of HR HSI, compared with predicting the HR HSI directly, we consider three advantages. Firstly, wavelet coefficients

Sub-band
Inverse WPT LR HSI

Estimated Wavelet Coefficients
Input Figure 3. The architecture of the proposed multi-scale wavelet (MW)-3D-CNN, the number and the size of convolutional kernels are denoted at each layer, and the embedding subnet and predicting subnet have three and four layers respectively.
Our MW-3D-CNN focuses on predicting the wavelet package coefficients of HR HSI, compared with predicting the HR HSI directly, we consider three advantages. Firstly, wavelet coefficients describe the detailed textural information in HSI. Training the MW-3D-CNN to predict the wavelet coefficients is beneficial for recovering the detailed structures in HSI [33,36]. Secondly, a network with sparse activations is easier to be trained [34,35]. Wavelet coefficients have sparsity characteristics in the high-frequency sub-bands, and predicting wavelet coefficients promotes the sparsity of the MW-3D-CNN and makes the training easier and the trained network more robust. Finally, the MW-3D-CNN extracts features from the LR HSI directly. Compared with extracting features from the interpolated LR HSI, such as in [40,41], information in larger receptive field can be exploited.

Training of MW-3D-CNN
All the convolutional kernels and bias in the embedding and predicting subnets are trained in an end-to-end manner. L2 norm, which measures mean square error, is often used in loss function in the conventional CNN based image SR methods. However, the output of our network is the wavelet coefficients, which have larger values in the low-frequency sub-band and smaller values in the high-frequency sub-bands, as shown in the histograms in Figure 4. The L2 norm loss penalizes heavily on larger errors and is less sensitive to smaller errors [37]. On the contrary, the L1 norm loss penalizes equally on both larger and smaller errors, and it is more suitable than the L2 norm loss for wavelet coefficients prediction. In addition, compared with the L2 norm loss, the L1 norm loss is helpful for recovering sharper image structures with faster convergence [38]. Therefore, we propose to train the MW-3D-CNN with the L1 norm loss, the loss function is written as where, C i j andĈ i j = ϕ i (ψ(X j )) are the ground truth and the predicted wavelet package coefficients of the i-th sub-band respectively, j = 1, 2, . . . , N, N is the number of training samples, i = 1, 2, . . . , N w , N w = 4 l is the number of sub-bands. X j is the LR HSI of the j-th training sample. λ i is the weight balancing the trade-off between different wavelet sub-bands, which is set to 1 for simplicity in the experiment. The loss function is optimized using the adaptive moment estimation (ADAM) method with standard back propagation. The trainable convolutional kernels and bias are updated according to the following rule [54]: where, θ (t) denotes the trainable parameters (i.e., convolutional kernels and bias) at the t-th iteration, α is learning rate, ε is a constant to stabilize the updating, which is set to 10 −6 . m (t) and v (t) are bias-corrected first and second moment estimates respectively: where ∂θ is the gradient with respect to the trainable parameters θ. β 1 and β 2 are two exponential decay rates for the moment estimation. In our implementation, the learning rate α is initially set to 0.001 and decreased by half for every 50 training epochs. The exponential decay rates β 1 and β 2 are set to 0.9 and 0.999 respectively. The batch size is set to 64. The number of training epochs is 200.

Experimental Results
In this section, we compare the MW-3D-CNN with other state-of-the-art HSI SR methods on several simulated HSI datasets. In order to demonstrate the applicability of MW-3D-CNN, we also validate it on real spaceborne Hyperion HSI. Since there is no reference HSI for SR assessment in real data case, we use the no-reference HSI assessment method in [55] to evaluate the SR performance.

Experiment Setting
Three datasets were used in the experiment. The first one is the Reflective Optics System Imaging Spectrometer (ROSIS) dataset, which contains two images taken over Pavia University and Pavia Center with sizes of 610 × 340 and 1096 × 715, respectively. The spatial resolution is 1.3 m. After discarding the noisy bands, there are 100 bands remained in the spectral range 430~860 nm. The second dataset was collected by Headwall Hyperspec-VNIR-C imaging sensor over Chikusei, Japan, on July 29, 2014 [56]. The size is 2517 × 2335 with spatial resolution 2.5 m. There are 128 bands in the spectral range of 363~1018nm. The third dataset is 2018 IEEE GRSS Data Fusion Contest data (denoted as "grss_dfc_2018"), which was acquired by the National Center for Airborne Laser Mapping (NCALM) over Houston University, on February 16, 2017 [57]. The size of this data is 1202 × 4172. The spatial resolution is 1 m. It has 48 bands in the spectral range of 380~1050 nm.
The above data was treated as original HR HSI, the LR HSI was simulated via Gaussian downsampling, which is a process of simulating LR HSI via applying a Gaussian filter to HR HSI and then down-sampling it in both vertical and horizontal directions. The Gaussian down-sampling was implemented using the "Hyperspectral and Multispectral Data Fusion Toolbox" [16]. For downsampling by a factor of two, the Gaussian filter was of size 2 × 2 with zero mean and standard deviation 0.8493; for down-sampling by a factor of four, the Gaussian filter was of size 4 × 4 with zero mean and standard deviation 1.6986. All these parameters in down-sampling are suggested in [16,17].
We cropped three sub-images with rich textures from the original HSI as testing data, and the remainder was used as training data. About 100,000 LR-HR pairs were extracted as training samples to train the MW-3D-CNN. Each LR HSI sample was of size 16 × 16 × 16. For training the MW-3D-CNN by an upscaling factor of two, there were four branches in the predicting subnet, the output wavelet coefficients in each branch were of size 16 × 16 × 16, and the corresponding HR HSI sample was of size 32 × 32 × 16. For training the MW-3D-CNN by an upscaling factor of four, there were 16 branches in the predicting subnet, the output wavelet coefficients in each branch were also of size 16 × 16 × 16, and the corresponding HR HSI sample was of size 64 × 64 × 16. It is noted that there was no overlapping between the training and testing regions. The network parameters of MW-3D-CNN were set according to network parameters in Figure 3. Haar wavelet function was used in WPT.

Comparison with State-of-the-Art SR Methods
In this sub-section, we compare the proposed method with other state-of-the-art HSI SR methods. Spectral-spatial group sparse representation HSI SR method (denoted as SSG) [27], and two

Experimental Results
In this section, we compare the MW-3D-CNN with other state-of-the-art HSI SR methods on several simulated HSI datasets. In order to demonstrate the applicability of MW-3D-CNN, we also validate it on real spaceborne Hyperion HSI. Since there is no reference HSI for SR assessment in real data case, we use the no-reference HSI assessment method in [55] to evaluate the SR performance.

Experiment Setting
Three datasets were used in the experiment. The first one is the Reflective Optics System Imaging Spectrometer (ROSIS) dataset, which contains two images taken over Pavia University and Pavia Center with sizes of 610 × 340 and 1096 × 715, respectively. The spatial resolution is 1.3 m. After discarding the noisy bands, there are 100 bands remained in the spectral range 430~860 nm. The second dataset was collected by Headwall Hyperspec-VNIR-C imaging sensor over Chikusei, Japan, on July 29, 2014 [56]. The size is 2517 × 2335 with spatial resolution 2.5 m. There are 128 bands in the spectral range of 363~1018 nm. The third dataset is 2018 IEEE GRSS Data Fusion Contest data (denoted as "grss_dfc_2018"), which was acquired by the National Center for Airborne Laser Mapping (NCALM) over Houston University, on February 16, 2017 [57]. The size of this data is 1202 × 4172. The spatial resolution is 1 m. It has 48 bands in the spectral range of 380~1050 nm.
The above data was treated as original HR HSI, the LR HSI was simulated via Gaussian down-sampling, which is a process of simulating LR HSI via applying a Gaussian filter to HR HSI and then down-sampling it in both vertical and horizontal directions. The Gaussian down-sampling was implemented using the "Hyperspectral and Multispectral Data Fusion Toolbox" [16]. For down-sampling by a factor of two, the Gaussian filter was of size 2 × 2 with zero mean and standard deviation 0.8493; for down-sampling by a factor of four, the Gaussian filter was of size 4 × 4 with zero mean and standard deviation 1.6986. All these parameters in down-sampling are suggested in [16,17].
We cropped three sub-images with rich textures from the original HSI as testing data, and the remainder was used as training data. About 100,000 LR-HR pairs were extracted as training samples to train the MW-3D-CNN. Each LR HSI sample was of size 16 × 16 × 16. For training the MW-3D-CNN by an upscaling factor of two, there were four branches in the predicting subnet, the output wavelet coefficients in each branch were of size 16 × 16 × 16, and the corresponding HR HSI sample was of size 32 × 32 × 16. For training the MW-3D-CNN by an upscaling factor of four, there were 16 branches in the predicting subnet, the output wavelet coefficients in each branch were also of size 16 × 16 × 16, and the corresponding HR HSI sample was of size 64 × 64 × 16. It is noted that there was no overlapping between the training and testing regions. The network parameters of MW-3D-CNN were set according to network parameters in Figure 3. Haar wavelet function was used in WPT.

Comparison with State-of-the-Art SR Methods
In this sub-section, we compare the proposed method with other state-of-the-art HSI SR methods. Spectral-spatial group sparse representation HSI SR method (denoted as SSG) [27], and two CNN based SR algorithms, i.e., SRCNN [40] and 3D-CNN [32], were used for comparison. As an often used benchmark, bicubic interpolation was also compared. All the parameters of SSG, SRCNN, and 3D-CNN followed the default setting as described in [27,40], and [32]. The training samples and training epochs of SRCNN and 3D-CNN were the same with that of MW-3D-CNN, which guarantees the fairness of comparison.
The SR performance was assessed using peak-signal-noise-ratio (PSNR, dB), structural similarity index measurement (SSIM) [58], feature similarity index measurement (FSIM) [59], and spectral angle mean (SAM). We compute the PSNR, SSIM, and FSIM indices on each band, and then calculated the mean values over all the spectral bands.
The assessment indices of different SR methods are given in Tables 1 and 2. The scores of our method are better than the compared methods in most cases. The 3D-CNN in [32] could extract spectral-spatial features from HSI and jointly reconstruct different spectral bands, so it could lead to less spectral distortion than the SRCNN, as shown in Tables 1 and 2. Both 3D-CNN and MW-3D-CNN are in the framework of 3D CNN, and the MW-3D-CNN predicts the wavelet coefficients of the HR HSI, rather than directly predicting the HR HSI. Focusing on the wavelet coefficients makes the MW-3D-CNN more effective in preserving structures in HR HSI, so the results of MW-3D-CNN have higher PSNR values. In order to test the robustness of MW-3D-CNN over larger upscaling factor, we also implemented the SR by a factor of four and report the indices in Table 2. It can be found that the MW-3D-CNN can also achieve competitive results in most cases by an upscaling factor of four. In Figure 5, we plot the PSNR indices of different SR methods on each band. It is clear that the proposed method outperforms other methods on most spectral bands. CNN based SR algorithms, i.e., SRCNN [40] and 3D-CNN [32], were used for comparison. As an often used benchmark, bicubic interpolation was also compared. All the parameters of SSG, SRCNN, and 3D-CNN followed the default setting as described in [27,40], and [32]. The training samples and training epochs of SRCNN and 3D-CNN were the same with that of MW-3D-CNN, which guarantees the fairness of comparison. The SR performance was assessed using peak-signal-noise-ratio (PSNR, dB), structural similarity index measurement (SSIM) [58], feature similarity index measurement (FSIM) [59], and spectral angle mean (SAM). We compute the PSNR, SSIM, and FSIM indices on each band, and then calculated the mean values over all the spectral bands.
The assessment indices of different SR methods are given in Tables 1 and 2. The scores of our method are better than the compared methods in most cases. The 3D-CNN in [32] could extract spectral-spatial features from HSI and jointly reconstruct different spectral bands, so it could lead to less spectral distortion than the SRCNN, as shown in Tables 1 and 2. Both 3D-CNN and MW-3D-CNN are in the framework of 3D CNN, and the MW-3D-CNN predicts the wavelet coefficients of the HR HSI, rather than directly predicting the HR HSI. Focusing on the wavelet coefficients makes the MW-3D-CNN more effective in preserving structures in HR HSI, so the results of MW-3D-CNN have higher PSNR values. In order to test the robustness of MW-3D-CNN over larger upscaling factor, we also implemented the SR by a factor of four and report the indices in Table 2. It can be found that the MW-3D-CNN can also achieve competitive results in most cases by an upscaling factor of four. In Figure 5, we plot the PSNR indices of different SR methods on each band. It is clear that the proposed method outperforms other methods on most spectral bands.     9 and 11, we also give the residual maps of SR results, in which reconstruction error at each pixel can be reflected. In Figure 6, it is clear that the result of MW-3D-CNN is closer to the reference image, and the results of other compared methods are much brighter than the original HR image, which means that the spectral distortion is heavier. We also display some small areas by enlarging them to highlight the details of the SR results. In Figures 6 and 10, both SSG and SRCNN results suffer from artifacts with stripe-like patterns. By comparing the details in Figure 10, it can be found that our MW-3D-CNN SR results are sharper than the 3D-CNN results.
In the residual maps, it can be observed that all the SR results contain errors in the edges and details. Compared with other methods, our MW-3D-CNN method generates less errors. For example, in Figure 11, the error values in the MW-3D-CNN residual map are much sparser, which also demonstrates that predicting the wavelet coefficients is helpful for recovering the edges and detailed structures in the HR HSI.
We also present running time comparison of different SR methods in Tables 3 and 4. Most of the SR methods could infer HR HSI quickly. In the SSG method, dictionary learning and sparse coding is time consuming, so SSG takes the longest time to reconstruct HR HSI. The running time of MW-3D-CNN is comparable to 3D-CNN, as both of them could super-resolve HSI within 2 s. The running time comparison in Tables 3 and 4 indicates that our proposed method could achieve competitive performance in both SR accuracy and running time.   SSG result [27], (c) SRCNN result [40], (d) 3D-CNN result [32], and (e) the proposed MW-3D-CNN result. The residual maps are displayed by scaling to the minimum and maximum errors.  SSG result [27], (c) SRCNN result [40], (d) 3D-CNN result [32], and (e) the proposed MW-3D-CNN result. The residual maps are displayed by scaling to the minimum and maximum errors.

Application on Real Spaceborne HSI
In this sub-section, we also apply the MW-3D-CNN to real spaceborne HSI SR to demonstrate its applicability. Earth Observing-1 (EO-1)/Hyperion HSI was used as testing data. The spatial resolution of Hyperion HSI is 30 m. There are 242 spectral bands in the spectral range of 400~2500 nm. The Hyperion HSI suffers from noise, and after removing the noisy bands and water absorption bands, 83 bands remain. The Hyperion HSI in this experiment was taken over Lafayette, LA, USA on October 2015. We cropped a sub-image with size 341 × 365 from it as the study area.
As there is no HR HSI in real application, we used the Wald protocol to train the networks [24]. The original 30 m HSI was regarded as HR HSI, and LR HSI with resolution 60 m was simulated via down-sampling. The LR-HR HSI pairs were used to train the MW-3D-CNN that could super-resolve HSI by a factor of two. The trained MW-3D-CNN was then applied to the 30 m Hyperion HSI, and HR HSI with 15 m resolution could be obtained. The super-resolved Hyperion HSIs are shown in Figure 12. In Figures 13 and 14, we show, in zoom, the results of the compared methods. The resolution of Hyperion HSI is enhanced significantly through SR. Compared with other methods, the proposed MW-3D-CNN generates HSI with sharper edges and clearer structures, as indicated by the area highlighted in the dashed boxes.
Since there is no reference image for assessment, the traditional evaluation indices such as PSNR cannot be used here. We used the no-reference HSI quality assessment method in [55], which measures the deviation of reconstructed HSI from pristine HSI, to evaluate the super-resolved Hyperion HSIs. The original Hyperion images were first screened for noisy bands and water absorption bands. The remaining bands were used as training data, quality-sensitive features were extracted from the training data and a benchmark multivariate Gaussian model was learned for the no-reference HSI assessment. The no-reference HSI quality scores after SR are listed in Table 5. It shows that by an upscaling factor of two where the SR image is at 15 m resolution, the proposed MW-3D-CNN performs better than other methods with a lower score, which means that the SR result deviates less from the pristine HSI than other SR results.

Application on Real Spaceborne HSI
In this sub-section, we also apply the MW-3D-CNN to real spaceborne HSI SR to demonstrate its applicability. Earth Observing-1 (EO-1)/Hyperion HSI was used as testing data. The spatial resolution of Hyperion HSI is 30 m. There are 242 spectral bands in the spectral range of 400~2500 nm. The Hyperion HSI suffers from noise, and after removing the noisy bands and water absorption bands, 83 bands remain. The Hyperion HSI in this experiment was taken over Lafayette, LA, USA on October 2015. We cropped a sub-image with size 341 × 365 from it as the study area.
As there is no HR HSI in real application, we used the Wald protocol to train the networks [24]. The original 30 m HSI was regarded as HR HSI, and LR HSI with resolution 60 m was simulated via down-sampling. The LR-HR HSI pairs were used to train the MW-3D-CNN that could super-resolve HSI by a factor of two. The trained MW-3D-CNN was then applied to the 30 m Hyperion HSI, and HR HSI with 15 m resolution could be obtained. The super-resolved Hyperion HSIs are shown in Figure 12. In Figures 13 and 14, we show, in zoom, the results of the compared methods. The resolution of Hyperion HSI is enhanced significantly through SR. Compared with other methods, the proposed MW-3D-CNN generates HSI with sharper edges and clearer structures, as indicated by the area highlighted in the dashed boxes.
Since there is no reference image for assessment, the traditional evaluation indices such as PSNR cannot be used here. We used the no-reference HSI quality assessment method in [55], which measures the deviation of reconstructed HSI from pristine HSI, to evaluate the super-resolved Hyperion HSIs. The original Hyperion images were first screened for noisy bands and water absorption bands. The remaining bands were used as training data, quality-sensitive features were extracted from the training data and a benchmark multivariate Gaussian model was learned for the no-reference HSI assessment. The no-reference HSI quality scores after SR are listed in Table 5. It sho

Sensitivity Analysis on Network Parameters
It is theoretically hard to estimate the optimal network parameters of a deep learning architecture. We empirically tuned the network parameters and presented them in Figure 3. In this sub-section, we give the sensitivity analysis of MW-3D-CNN over the network parameters. We vary one network parameter and fix others, then observe the SR performance.
The sensitivity analysis over the size of 3D convolutional kernel is in Table 6. Proper large convolutional kernel size is necessary for collecting spatial and spectral information for HSI SR. It is clear that the best performance is achieved with convolutional kernel size 3 × 3 × 3. The performance decreases when the convolutional kernel size is set to 5 × 5 × 5. More spatial and spectral information can be exploited by larger convolutional kernel, but higher complexity of the network will be caused, and more parameters need to be trained. This may explain why the performance drops with the increase of kernel size.
The number of 3D convolutional kernels determines the number of feature cubes extracted by each layer. In our MW-3D-CNN, we set 32 convolutional kernels for each layer of the embedding subnet and 16 convolutional kernels for each layer of the predicting subnet, which leads to the best performance in most cases, as shown in Table 7. With the increase of convolutional kernel number, more feature cubes could be extracted, but the complexity of the network would be increased.
Usually, the deeper the network, the better the performance. With deeper architecture, the network would have larger capacity. In Table 8, it is shown that the best performance can be obtained in most cases when the number of convolutional layers in the embedding subnet and predicting subnet is set to three and four.

The Rationality Analysis of L1 Norm Loss
In order to verify the rationality of L1 norm loss, we trained the MW-3D-CNN using the L2 norm loss written as then compared it with the one trained using the L1 norm loss in Equation (4). The comparison is presented in Table 9. The L1 norm loss could mitigate the unbalance in penalizing low-and highfrequency wavelet package sub-bands caused by the L2 norm loss, so the MW-3D-CNN trained with the L1 norm loss performs better than the L2 norm loss on the testing data, as shown in Table 9.
In the training stage, the errors of the i-th wavelet package sub-band predicted by the MW-3D-CNN can be expressed as (C i j −Ĉ i j ), where j = 1, 2, . . . , N, N is the number of training sample. We present the histograms of the errors after 200 training epochs in Figure 15. It is clear that the errors of different wavelet package sub-bands have similar statistics, as most of the errors are close to zero and tend to follow Laplacian distributions. Compared with the L2 norm, the L1 norm is more suitable for penalizing the Laplacian-like errors, which demonstrates the rationality of the L1 norm loss as well.

The Rationality Analysis of L1 Norm Loss
In order to verify the rationality of L1 norm loss, we trained the MW-3D-CNN using the L2 norm loss written as then compared it with the one trained using the L1 norm loss in Equation (4). The comparison is presented in Table 9. The L1 norm loss could mitigate the unbalance in penalizing low-and highfrequency wavelet package sub-bands caused by the L2 norm loss, so the MW-3D-CNN trained with the L1 norm loss performs better than the L2 norm loss on the testing data, as shown in Table 9.
In the training stage, the errors of the i-th wavelet package sub-band predicted by the MW-3D-CNN can be expressed as ( ) , N is the number of training sample. We present the histograms of the errors after 200 training epochs in Figure 15. It is clear that the errors of different wavelet package sub-bands have similar statistics, as most of the errors are close to zero and tend to follow Laplacian distributions. Compared with the L2 norm, the L1 norm is more suitable for penalizing the Laplacian-like errors, which demonstrates the rationality of the L1 norm loss as well.  Figure 15. The histograms of errors in different wavelet sub-bands after 200 training epochs. The training data is extracted from Pavia University, the MW-3D-CNN is trained with the L1 norm loss, and the upscaling factor is two.

The Rationality Analysis of 3D Convolution
In this sub-section, in order to analyze the advantage of 3D convolution over 2D convolution for HSI SR, we replaced all the 3D convolutional layers in the MW-3D-CNN with 2D convolutional layers. In this case, it reduces to the architecture as the wavelet-SRNet method in [36]. Then we compared the MW-3D-CNN with the wavelet-SRNet. The loss function of wavelet-SRNet was originally designed with L2 norm in [36]. Here, we also trained the wavelet-SRNet with L1 norm as loss function, and the corresponding results are denoted as wavelet-SRNet-L2 and wavelet-SRNet-L1. The comparison between the MW-3D-CNN and the wavelet-SRNet is presented in Table 10.
In Table 10, it can be found that the MW-3D-CNN performs better than the wavelet-SRNet on the three datasets. The MW-3D-CNN is based on 3D convolutional layers, which could naturally exploit the spectral correlation and reduce the spectral distortion in HSI SR. We could also find that when the L1 norm is used as loss function for the wavelet-SRNet, the SR performance is slightly better than the L2 norm, which also demonstrates the effectiveness of L1 norm. Figure 15. The histograms of errors in different wavelet sub-bands after 200 training epochs. The training data is extracted from Pavia University, the MW-3D-CNN is trained with the L1 norm loss, and the upscaling factor is two.

The Rationality Analysis of 3D Convolution
In this sub-section, in order to analyze the advantage of 3D convolution over 2D convolution for HSI SR, we replaced all the 3D convolutional layers in the MW-3D-CNN with 2D convolutional layers. In this case, it reduces to the architecture as the wavelet-SRNet method in [36]. Then we compared the MW-3D-CNN with the wavelet-SRNet. The loss function of wavelet-SRNet was originally designed with L2 norm in [36]. Here, we also trained the wavelet-SRNet with L1 norm as loss function, and the corresponding results are denoted as wavelet-SRNet-L2 and wavelet-SRNet-L1. The comparison between the MW-3D-CNN and the wavelet-SRNet is presented in Table 10.
In Table 10, it can be found that the MW-3D-CNN performs better than the wavelet-SRNet on the three datasets. The MW-3D-CNN is based on 3D convolutional layers, which could naturally exploit the spectral correlation and reduce the spectral distortion in HSI SR. We could also find that when the L1 norm is used as loss function for the wavelet-SRNet, the SR performance is slightly better than the L2 norm, which also demonstrates the effectiveness of L1 norm.

Robustness over Wavelet Functions
In the experiment, we used Haar wavelet function in WPT. In this sub-section, we also perform the MW-3D-CNN with other two wavelet functions, Daubechies-2 and biorthogonal wavelet functions, to evaluate the robustness of MW-3D-CNN over the wavelet function. In Table 11, it can be found that the SR performance with different wavelet functions is close to each other. The SR performance changes slightly with different wavelet functions, which demonstrates the robustness of MW-3D-CNN over the wavelet functions. The MW-3D-CNN is implemented on Tensorflow [60], with a NVIDIA GTX 1080Ti graphic card. It takes about 7 h and 20 h to train the MW-3D-CNN with upscaling factor two and four respectively. In the testing stage, inferring a HR HSI only takes less than two seconds, it is fast because there is only feed forward operation involved.

Conclusions
In this study, a MW-3D-CNN for HSI SR was proposed. Instead of predicting the HR HSI directly, we predicted the wavelet package coefficients of the latent HR HSI, and then reconstructed the HR HSI via inverse WPT. The MW-3D-CNN is constituted by an embedding subnet and a predicting subnet, both of which are built on 3D convolutional layers. The embedding subnet projects the input LR HSI into feature space and represents it with a set of feature cubes. These feature cubes are then fed to the predicting subnet, which consists of several output branches. Each branch corresponds to a wavelet package sub-band and predicts the wavelet package coefficients of each sub-band. The HR HSI can be reconstructed via inverse WPT. The experiment results on both simulated and real spaceborne HSI demonstrate that the proposed MW-3D-CNN could achieve competitive performance. The MW-3D-CNN learns the knowledge from the external training data for HSI SR. HSI has its prior information in both spectral and spatial domains, such as the structural self-similarity [26] and low rank prior [61][62][63]. Exploiting these prior information helps regularize the ill-posed HSI SR problem. How to combine such internal prior with external learned knowledge in the deep learning will need to be examined in future work. Furthermore, integrating adversarial loss [64] in training the network is another direction to boost the SR performance.