SSML: Spectral-Spatial Mutual-Learning-Based Framework for Hyperspectral Pansharpening

This paper considers problems associated with the large size of hyperspectral pansharpening networks and the difficulty of learning their spatial-spectral features. We propose a deep mutual-learning-based framework (SSML) for spectral-spatial information mining and hyperspectral pansharpening. In this framework, a deep mutual-learning mechanism is introduced so that spatial and spectral features are learned from each other through information transmission, which achieves better fusion results without introducing too many parameters. The proposed SSML framework consists of two separate networks for learning the spectral and spatial features of hyperspectral images (HSIs) and panchromatic images (PANs). A hybrid loss function containing constrained spectral and spatial information is designed to enforce mutual learning between the two networks. In addition, a mutual-learning strategy is used to balance spectral and spatial feature learning, improving the performance of each SSML network compared with the original network. Extensive experimental results demonstrated the effectiveness of the mutual-learning mechanism and the proposed hybrid loss function for hyperspectral pansharpening. Furthermore, a typical deep-learning method was used to confirm the proposed framework's capacity for generalization. Ideal performance was observed in all cases. Moreover, multiple experiments analysing the parameters used showed that the proposed method achieved better fusion results without adding too many parameters. Thus, the proposed SSML represents a promising framework for hyperspectral pansharpening.


Introduction
HSIs usually contain information on tens to hundreds of continuous spectral bands in the target area. Therefore, HSIs have a high spectral resolution but lower spatial resolution due to hardware limitations. In contrast, PANs are usually single-band images in the visible range, having high spatial resolution but low spectral resolution. Pansharpening involves the reconstruction of low-resolution (LR) HSIs and high-resolution (HR) PANs to generate HR-HSIs, and has been widely used in image classification [1], target detection [2], and road recognition [3].
Traditional HSI pansharpening technologies can be broadly divided into four categories: component substitution-based methods [4,5], model-based methods [6,7], multiresolution analysis [8], and hybrid methods [9]. Each of these categories has certain limitations. Component substitution-based methods can cause certain types of spectral distortion; multi-resolution analysis-based methods require complex calculations; hybrid methods combine component substitution and multi-resolution analysis, thus providing good spectral retention but fewer spatial details; and, finally, model-based methods are limited by network parameter number and computational complexity.
In recent years, deep learning has been widely used in the field of image processing [10][11][12][13][14][15][16], while its application to pansharpening is still at an early stage of exploration [17]. Yang et al. [18] proposed a convolutional neural network (CNN) for pansharpening (PanNet), which operates via ResNet [19] in the high-pass filter domain. Zhu et al. [20] designed a spectral attention module (SeAM) to extract the spectral features of HSIs. Zhang et al. [21] designed a residual channel attention module (RCAM) to solve the spectral reconstruction problem. However, as is well known, CNNs can learn one feature more easily than multiple features, and with fewer parameters. Moreover, in the feature extraction process, simultaneous learning of multiple features is affected by the features' effects on each other. To reduce the influence of these effects, Zhang et al. [22] improved classification results by measuring the difference in the probabilistic behavior between the spectral features of two pixels. Xie et al. [23] used the mean square error (MSE) loss and spectral angle mapper (SAM) loss to constrain spatial and spectral feature losses, respectively. Qu et al. [15] proposed a residual hyper-dense network and a CNN with cascade residual hyper-dense blocks. The former network extends DenseNet to solve the problem of spatial-spectral fusion. The latter network allows direct connections between pairs of layers within the same stream and those across different streams, which means that it learns more complex combinations between the HS and PAN images.
The above studies show that the better the spatial and spectral feature learning, the better the fusion result for deep-learning-based hyperspectral pansharpening methods. However, it is well known that hyperspectral images contain a large amount of data because of many bands. Thus, it is a challenge for the hyperspectral pansharpening method to fully learn and utilize the spatial and spectral features without increasing computation excessively. Commonly, single feature learning is easier than multiple feature learning, while multiple collaborative learning is more effective than single feature learning. Inspired by mutual learning, in this paper, we explore a novel pansharpening method that learns the spatial and spectral characteristics separately and establishes the relationship between them to learn from each other to achieve desirable results.
In recent years, a deep mutual-learning strategy (DML) [24] has been proposed for image classification, in which multiple original networks learn from each other. This training strategy has great potential for multi-feature learning of a single task using few parameters. It therefore has research value in the field of HSI pansharpening. To the authors' knowledge, there has been no application of DML to HSI pansharpening. This paper proposes a deep mutual-learning framework integrating spectral-spatial information mining (SSML) for HSI pansharpening. In the SSML framework, two simple networks, a spectral network and a spatial network, are designed for mutual learning. The two networks learn different features independently: the spectral network captures only spectral features, while the spatial network focuses only on spatial details. Then, the DML strategy enables them to learn each other's features. In addition, a hybrid loss function is derived by constraining spectral and spatial information between the two networks. The main contributions of this paper are summarized below:
• This paper proposes an SSML framework which introduces a DML strategy into HSI pansharpening for the first time; four cross experiments are performed to verify the proposed SSML framework's effectiveness, and the network's generalization ability is confirmed using the latest research results in the field of HSI pansharpening.
• A hybrid loss function, which considers the HSI characteristics, is designed to enable each network in the SSML framework to learn a certain feature independently, thus improving its overall performance so that the SSML framework can successfully generate a high-quality HR-HSI.
The rest of the paper is organized as follows. Section 2 presents related work, while Section 3 introduces the proposed SSML. Section 4 describes and analyzes the experimental results. Finally, Section 5 concludes the paper with a short overview of its contributions to research.

Related Work
The DML strategy [24] was initially proposed for image classification, but, after several years of development, it has been applied in many fields [25][26][27]. The DML strategy uses a mutual-learning loss function, which allows multiple small networks to learn the same task together under different initial conditions, thereby improving the performance of each of the networks [24]. For classification problems, Kullback-Leibler (KL) divergence [28] has often been used as the mutual-learning loss function in the DML because it provides an asymmetric measure of the difference between the probability distributions predicted by two networks; it is defined by:
$$D_{KL}(p_i \,\|\, p_j) = \sum_{x} p_i(x) \log \frac{p_i(x)}{p_j(x)},$$
where $D_{KL}(p_i \,\|\, p_j)$ calculates the distance from $p_j$ to $p_i$, and $p_i$ and $p_j$ are the probability distributions predicted by the two networks. However, in the field of HSI pansharpening, it is usually necessary to evaluate the image quality rather than the probability distribution of pixels. HSIs have a high correlation between pixels in each band. Therefore, it is necessary to consider other loss functions as the mutual-learning loss function instead of the KL divergence. Traditionally, MSE and SAM [29] have been used to evaluate the spatial quality and spectral distortion of HSIs. Therefore, the effects of the MSE and SAM on the proposed SSML framework's performance are examined in this paper.
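To make the asymmetry of the KL divergence concrete, the following is a minimal Python sketch for two discrete distributions. The function name `kl_divergence` and the example probabilities are illustrative assumptions and are not taken from the paper.

```python
import math

def kl_divergence(p, q):
    """Return D_KL(p || q) for two discrete distributions of equal length.

    Terms with p(x) == 0 contribute nothing; q(x) is assumed to be
    strictly positive wherever p(x) > 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical class probabilities predicted by two networks.
p_i = [0.7, 0.2, 0.1]
p_j = [0.4, 0.4, 0.2]

forward = kl_divergence(p_i, p_j)   # distance from p_j to p_i
backward = kl_divergence(p_j, p_i)  # distance from p_i to p_j
```

The two directions give different values, which is why DML applies the term separately to each network.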

Method
This section describes the proposed SSML framework and introduces the hybrid loss function.
In general, the HSI pansharpening problem can be considered a process in which a network generates an HR-HSI $H_{HR}$ from an input LR-HSI $H_{LR}$ and an HR-PAN $P_{HR}$, with a loss function constraining the network learning, which can be expressed as:
$$H_{HR} = M(H_{LR}, P_{HR}; \theta), \qquad \hat{\theta} = \arg\min_{\theta} \ell(\theta),$$
where $M(\cdot)$ represents the mapping function between a CNN's input and output data, $\theta$ denotes the parameters to be optimized, and $\ell(\theta)$ is the loss function.

Image Preprocessing
As shown in Figure 1, the proposed framework first performs bicubic interpolation on an LR-HSI $H_{LR}$ to obtain $H_{up}$, which has the same size as the HR-PAN $P_{HR}$ [30]. Then, contrast-limited adaptive histogram equalization is applied to $P_{HR}$ to obtain $P_g$, which has richer edge details [31,32]. Finally, to enhance the spatial details of the HSI, $H_{ini}$ is obtained by injecting $P_g$ into $H_{up}$ through guided filtering, that is, $H_{ini} = G(P_g, H_{up})$.

SSML Framework
As previously mentioned, the proposed SSML framework includes two networks: a spectral network and a spatial network. They use specific structures to extract specific features; for instance, residual blocks for extracting spatial features and channel attention blocks for extracting spectral features. In addition, they constrain each other to learn the other features by minimizing the hybrid loss function. Without loss of generality, their structures are designed to be universal and simple, as shown in Figure 2. The spectral network uses a spectral attention structure to extract spectral information, while the spatial network adopts residual learning and a spatial attention structure to capture spatial information.
Two popular structures of the spectral network are illustrated in Figure 3a,b. The specific settings of the network are shown in Table 1. RCAM uses four convolutional layers; the convolution kernel size is 3 × 3 in the first two layers and 1 × 1 in the last two layers. The sigmoid function is applied to the feature map produced by the four convolutional layers, and the result is multiplied by the convolution result of the second layer. The product and the input are then combined by element-wise addition. The SeAM is divided into two branches after the first two convolutional layers, which are the same as in RCAM. The structure of the first branch is the same as that of the third and fourth layers of RCAM. The second branch replaces AvgPooling in the first branch with MaxPooling. The results of the two branches are combined by element-wise addition, and the subsequent steps are similar to RCAM.
Most spectral structures are designed using a pooling operation followed by an excitation step:
$$s = f(P(F)),$$
where $F$ is the input feature map, $f(\cdot)$ represents the excitation process, and $P(\cdot)$ indicates the pooling operation. Then, by multiplying $s_i$ by $F_i$, a new feature map $\hat{F}$ can be obtained as follows:
$$\hat{F}_i = s_i \cdot F_i,$$
where $s_i$ and $F_i$ represent the weight and feature map of the $i$th feature.
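The pooling-then-reweighting idea behind the spectral structures can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: $P(\cdot)$ is taken to be global average pooling and $f(\cdot)$ a bare sigmoid, whereas RCAM and SeAM also apply learned convolutions; the function name `channel_attention` is hypothetical.

```python
import math

def channel_attention(feature_maps):
    """Reweight each channel of `feature_maps` by a pooled attention weight.

    feature_maps: list of channels, each a 2-D list (height x width).
    Returns (weights, rescaled) where rescaled[i] = weights[i] * feature_maps[i].
    """
    weights = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]
        pooled = sum(values) / len(values)               # P(.): global average pooling
        weights.append(1.0 / (1.0 + math.exp(-pooled)))  # f(.): sigmoid only (assumption)
    rescaled = [
        [[w * v for v in row] for row in channel]
        for w, channel in zip(weights, feature_maps)
    ]
    return weights, rescaled

# Two toy 2 x 2 channels: one all zeros, one all twos.
features = [[[0.0, 0.0], [0.0, 0.0]], [[2.0, 2.0], [2.0, 2.0]]]
weights, rescaled = channel_attention(features)
```

Channels with a larger pooled response receive a weight closer to 1, so they are attenuated less than channels with a weak response.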
Two popular spatial network structures are presented in Figure 3c,d. The specific settings of the network are shown in Table 2. ResNet uses two convolutional layers of equal size with 3 × 3 convolution kernels, and the convolution result and the input are combined by element-wise addition. The first layer of MSRNet uses 1 × 1 convolution kernels. The convolution results are split into four feature maps of equal size, which are sent to four corresponding branches for convolution operations. The first branch uses a 1 × 1 convolutional layer. Branches 2, 3, and 4 each add a ReLU layer and a convolution compared with the previous branch. Finally, the results of the four branches are concatenated, and a 1 × 1 convolution is used in the last layer.
Assume $H_{HR}$ denotes an HR-HSI and $H_{LR}$ denotes an LR-HSI, and suppose there is a residual $res_{cnn}$ between $H_{HR}$ and $H_{LR}$, which is expressed as:
$$res_{cnn} = H_{HR} - H_{LR}.$$
A CNN can be used to learn $res_{cnn}$ between $H_{HR}$ and $H_{LR}$, and $H_{HR}$ can be obtained from $res_{cnn}$ and $H_{LR}$ as follows:
$$H_{HR} = H_{LR} + res_{cnn}.$$
The typical structure of the ResNet, which usually learns the residuals between the target and input data, is presented in Figure 3c. In contrast, Figure 3d shows a multi-scale ResNet (MSRNet), which learns feature maps with larger receptive fields by combining different convolution kernels.

Hybrid Loss Function
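Inspired by KL divergence, this paper defines a hybrid loss function for the SSML framework according to the characteristics of the two networks in the proposed framework, forcing them to learn from each other. The hybrid loss function is defined by:
$$L_{S_1} = L_M(y, \hat{y}_1) + \lambda_1 L_{spa}(\hat{y}_1, \hat{y}_2), \qquad L_{S_2} = L_M(y, \hat{y}_2) + \lambda_2 L_{spe}(\hat{y}_2, \hat{y}_1),$$
where $L_{S_1}$ and $L_{S_2}$ are the hybrid losses of the networks $S_1$ and $S_2$, $\hat{y}_1$ is the prediction of $S_1$, $\hat{y}_2$ is the prediction of $S_2$, $y$ is the ground truth, $\lambda_1$ and $\lambda_2$ are the weights of the hybrid loss function, $L_{spa}$ and $L_{spe}$ are additional loss functions that constrain spatial information and spectral information, respectively, and $L_M$ is the main loss function to constrain the whole network.
In the two networks in the SSML framework, the $L_1$-norm is used as the main loss function ($L_M$) due to its good convergence [33], and is defined by:
$$L_M(y, \hat{y}) = \frac{1}{N} \lVert y - \hat{y} \rVert_1,$$
where $N$ is the number of elements in $y$. For spectral feature learning in the $S_1$ network, $L_{spa}$ uses the MSE to constrain the spatial information loss between $y$ and $\hat{y}$ as follows:
$$L_{spa}(y, \hat{y}) = \frac{1}{N} \lVert y - \hat{y} \rVert_2^2.$$
Similarly, for spatial feature learning in the $S_2$ network, $L_{spe}$ uses the SAM to constrain the spectral information loss between $y$ and $\hat{y}$.
Finally, the SSML framework alternately updates the weights $\theta_{S_1}$ and $\theta_{S_2}$ using SGD as follows:
$$\theta_{S_1} \leftarrow \theta_{S_1} - \eta \frac{\partial L_{S_1}}{\partial \theta_{S_1}}, \qquad \theta_{S_2} \leftarrow \theta_{S_2} - \eta \frac{\partial L_{S_2}}{\partial \theta_{S_2}},$$
where $\eta$ is the learning rate.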
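The structure of the hybrid loss (a main $L_1$ term plus a weighted MSE or SAM term) can be illustrated with a small dependency-free Python sketch. It rests on assumptions that should be read as such: images are represented as lists of per-pixel spectra, losses are averaged, and the additional term compares a network's prediction with its peer's prediction in the spirit of DML (passing the ground truth as `y_peer` gives the alternative reading). All function names are hypothetical.

```python
import math

def l1_loss(a, b):
    """Mean absolute error over all pixels and bands."""
    diffs = [abs(x - y) for pa, pb in zip(a, b) for x, y in zip(pa, pb)]
    return sum(diffs) / len(diffs)

def mse_loss(a, b):
    """Mean squared error over all pixels and bands."""
    diffs = [(x - y) ** 2 for pa, pb in zip(a, b) for x, y in zip(pa, pb)]
    return sum(diffs) / len(diffs)

def sam_loss(a, b):
    """Mean spectral angle (radians) between corresponding pixel spectra."""
    angles = []
    for pa, pb in zip(a, b):
        dot = sum(x * y for x, y in zip(pa, pb))
        norm = math.sqrt(sum(x * x for x in pa)) * math.sqrt(sum(y * y for y in pb))
        if norm == 0.0:
            angles.append(0.0)  # angle undefined for a zero spectrum; treated as 0 here
            continue
        angles.append(math.acos(max(-1.0, min(1.0, dot / norm))))
    return sum(angles) / len(angles)

def hybrid_loss(y, y_own, y_peer, extra_loss, weight):
    """Main L1 loss against the ground truth plus a weighted additional term.

    Assumption: the additional term is evaluated against the peer prediction.
    """
    return l1_loss(y, y_own) + weight * extra_loss(y_own, y_peer)

# Toy data: two pixels with two bands each.
y = [[1.0, 2.0], [3.0, 4.0]]        # ground truth
y_hat1 = [[1.0, 2.0], [3.0, 4.0]]   # prediction of S1 (spectral network)
y_hat2 = [[2.0, 4.0], [6.0, 8.0]]   # prediction of S2 (spatial network)

loss_s1 = hybrid_loss(y, y_hat1, y_hat2, mse_loss, weight=50.0)  # L1 + lambda_1 * MSE
loss_s2 = hybrid_loss(y, y_hat2, y_hat1, sam_loss, weight=0.8)   # L1 + lambda_2 * SAM
```

In this toy case the prediction of $S_2$ is a scaled copy of the ground truth, so its SAM term is close to zero even though its $L_1$ error is not; this is the kind of complementary signal the two additional terms are meant to exchange. In alternating training, each network would take one SGD step on its own hybrid loss while the peer's prediction is held fixed.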

Datasets and Metrics
The proposed method was evaluated on two public datasets, CAVE [34] and Pavia Center [35]. In CAVE, the wavelength range was 400 nm-700 nm, the resolution was 512 × 512, and there were 31 bands for a total of 32 HSIs. In Pavia Center, the range was 430 nm-860 nm, the resolution was 1096 × 708, and 102 bands were used for one HSI. In training, 60% of the overall data was selected as a training set, and the remaining data were used as a test set. Before training, the Wald protocol [30] was adopted to obtain LR-HSIs through down-sampling. In the training set, the data size was 32 × 32 bands, and the batch size was 32. In testing, the original image size was the same as the input size. All networks were developed using the PyTorch framework in Python 3.7.3, and the experiments were performed on an NVIDIA GeForce RTX 2080 Ti GPU. In training, SGD's weight decay was $10^{-5}$, the momentum was 0.9, the initial learning rate was 0.1, the number of iterations was $2 \times 10^4$, and the learning rate was reduced by half every 1000 iterations.
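The step-wise learning-rate decay described above can be written as a one-line function. The sketch below assumes that the halving takes effect exactly at iterations 1000, 2000, and so on; the function name `learning_rate` and its parameter names are illustrative.

```python
def learning_rate(iteration, base_lr=0.1, step=1000, factor=0.5):
    """Learning rate at a given iteration: base_lr halved every `step` iterations."""
    return base_lr * factor ** (iteration // step)

# Learning rates at a few points of the 2 x 10^4-iteration schedule.
schedule = {it: learning_rate(it) for it in (0, 999, 1000, 5000, 19999)}
```

In PyTorch, the same schedule would correspond to `torch.optim.lr_scheduler.StepLR` with `step_size=1000` and `gamma=0.5`, assuming the scheduler is stepped once per iteration.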
The performance of the proposed method was analyzed both quantitatively and visually. The evaluation indicators used in the performance analysis included the SAM [29], peak signal-to-noise ratio (PSNR) [36], correlation coefficient (CC) [37], erreur relative globale adimensionnelle de synthèse (ERGAS) [38] and root mean squared error (RMSE) [39]. These metrics reflect the image similarity, image distortion, spectral similarity, spectral distortion, and the difference between the fused image and the reference image, respectively, which are described below.
Peak signal-to-noise ratio (PSNR): The PSNR is used to evaluate the spatial quality of the fused image band by band. The PSNR of the $k$th band is defined as
$$PSNR_k = 10 \log_{10} \left( \frac{\max(R_k)^2}{\frac{1}{HW} \lVert R_k - Z_k \rVert_2^2} \right),$$
where $H$ and $W$ represent the height and width of the reference image, respectively, $R_k$ and $Z_k$ represent the $k$th band of the reference image and the fused image, and $\lVert \cdot \rVert_2$ refers to the two-norm. The final PSNR is the average of the PSNRs of all bands. The higher the PSNR, the better the performance.
Correlation coefficient (CC): This is mainly used to score the similarity of the content between two images, which is defined as
$$CC = \frac{1}{L} \sum_{k=1}^{L} \frac{\sum_{i,j} \big(R_k(i,j) - u(R_k)\big)\big(Z_k(i,j) - u(Z_k)\big)}{\sqrt{\sum_{i,j} \big(R_k(i,j) - u(R_k)\big)^2 \sum_{i,j} \big(Z_k(i,j) - u(Z_k)\big)^2}},$$
where $R_k(i,j)$ and $Z_k(i,j)$ denote the element value at spatial location $(i,j)$ in band $k$ of the reference image and the fused image, $u(\cdot)$ denotes the mean value of a band, and $L$ represents the number of spectral bands. The CC in HSI fusion is thus calculated as the average over all bands. The larger the CC, the better the fused image.
Spectral angle mapper (SAM): The SAM is generally utilized to evaluate the degree of spectral information preservation at each pixel, which is defined as
$$SAM(i,j) = \arccos \left( \frac{\langle R(i,j), Z(i,j) \rangle}{\lVert R(i,j) \rVert_2 \, \lVert Z(i,j) \rVert_2} \right),$$
where $R(i,j)$ and $Z(i,j)$ denote the spectral vectors of the reference image and the fused image, respectively, at pixel position $(i,j)$, and $\langle R(i,j), Z(i,j) \rangle$ refers to the inner product of $R(i,j)$ and $Z(i,j)$; the overall SAM is the average of the SAMs of all pixels. The lower the SAM, the better the performance.
Erreur relative globale adimensionnelle de synthèse (ERGAS): The ERGAS is specially designed to assess the quality of high-resolution synthesized images, and measures the global statistical quality of the fused image. It is defined as
$$ERGAS = \frac{100}{r} \sqrt{\frac{1}{L} \sum_{k=1}^{L} \frac{\lVert R_k - Z_k \rVert_2^2}{HW \, u(R_k)^2}},$$
where $r$ refers to the spatial downsampling ratio from HR-HSI to LR-HSI, and $u(R_k)$ denotes the mean value of the $k$th band of the reference image. The smaller the ERGAS, the better the performance.
Root mean squared error (RMSE): The RMSE can be used to measure the difference between $R$ and $Z$, which is defined as
$$RMSE = \sqrt{\frac{1}{HWL} \sum_{k=1}^{L} \sum_{i=1}^{H} \sum_{j=1}^{W} \big(R_k(i,j) - Z_k(i,j)\big)^2},$$
where $L$ represents the number of spectral bands, and $R_k(i,j)$ and $Z_k(i,j)$ denote the element value at spatial location $(i,j)$ in band $k$ of the reference image and the fused image. The smaller the RMSE, the better the performance.
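The five indicators can be sketched in plain Python as follows. This is a minimal illustration rather than the evaluation code used in the experiments: images are assumed to be band-major nested lists (`cube[k][i][j]`), the PSNR peak is taken as the maximum of each reference band, the SAM is returned in radians, and degenerate inputs such as constant bands or all-zero spectra are not handled. All function names are hypothetical.

```python
import math

def _flat(band):
    return [v for row in band for v in row]

def _mean(values):
    return sum(values) / len(values)

def _band_mse(rk, zk):
    return _mean([(a - b) ** 2 for a, b in zip(_flat(rk), _flat(zk))])

def rmse(ref, fused):
    """Root mean squared error over all bands and pixels."""
    sq = [(a - b) ** 2 for rk, zk in zip(ref, fused)
          for a, b in zip(_flat(rk), _flat(zk))]
    return math.sqrt(_mean(sq))

def psnr(ref, fused):
    """Per-band PSNR (peak = max of the reference band), averaged over bands."""
    scores = []
    for rk, zk in zip(ref, fused):
        mse = _band_mse(rk, zk)
        peak = max(_flat(rk))
        scores.append(math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse))
    return _mean(scores)

def cc(ref, fused):
    """Per-band Pearson correlation coefficient, averaged over bands."""
    scores = []
    for rk, zk in zip(ref, fused):
        r, z = _flat(rk), _flat(zk)
        mr, mz = _mean(r), _mean(z)
        num = sum((a - mr) * (b - mz) for a, b in zip(r, z))
        den = math.sqrt(sum((a - mr) ** 2 for a in r) * sum((b - mz) ** 2 for b in z))
        scores.append(num / den)
    return _mean(scores)

def sam(ref, fused):
    """Mean spectral angle in radians over all pixels."""
    rows, cols = len(ref[0]), len(ref[0][0])
    angles = []
    for i in range(rows):
        for j in range(cols):
            r = [band[i][j] for band in ref]
            z = [band[i][j] for band in fused]
            dot = sum(a * b for a, b in zip(r, z))
            norm = math.sqrt(sum(a * a for a in r)) * math.sqrt(sum(b * b for b in z))
            angles.append(math.acos(max(-1.0, min(1.0, dot / norm))))
    return _mean(angles)

def ergas(ref, fused, ratio):
    """ERGAS with `ratio` the spatial downsampling ratio from HR-HSI to LR-HSI."""
    terms = [_band_mse(rk, zk) / _mean(_flat(rk)) ** 2 for rk, zk in zip(ref, fused)]
    return 100.0 / ratio * math.sqrt(_mean(terms))

# Toy example: two 2 x 2 bands; the fused image is the reference plus 1 everywhere.
ref = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 4.0], [6.0, 8.0]]]
fused = [[[2.0, 3.0], [4.0, 5.0]], [[3.0, 5.0], [7.0, 9.0]]]
```

On this toy input the constant offset leaves the CC at 1 while the RMSE, ERGAS, and SAM all register an error, which illustrates why several complementary indicators are reported together.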

DML Strategy Validation for Different Cases
The comparison results of the SSML framework for different deep networks are presented in Tables 3 and 4. Four cases were analyzed: the $S_1$ network used RCAM or SeAM, and the $S_2$ network used MSRNet or ResNet. Based on experience, the weights were set to $\lambda_1 = 50$ and $\lambda_2 = 0.8$. As shown in Tables 3 and 4, the performance of the $S_1$ and $S_2$ networks in the SSML exceeded that of the original networks in most cases. Without loss of generality, the loss value curve of the SSML, with SeAM as $S_1$ and ResNet as $S_2$, was analyzed on the Pavia Center dataset to determine the reasons for the advantage of the DML strategy. A comparison of the loss value curves of $S_1$ in the SSML and the original $S_1$ during 5000 training iterations on the Pavia Center dataset is presented in Figure 4a, and their difference curve is presented in Figure 4b. As shown in Figure 4b, the loss values of $S_1$ in the SSML were slightly higher than those of the original $S_1$ before 1000 iterations; however, after 1000 iterations, the loss values of $S_1$ in the SSML were lower than those of the original $S_1$. Thus, it can be concluded that the SSML converged slowly in the early training stage because of the alternate optimization. Nonetheless, as the number of training iterations increased, it reached a lower minimum loss value and converged faster. This indicates that introducing the DML strategy in the SSML can help to achieve better results in HSI pansharpening.

Effect of the Number of Training Samples
This experiment investigated the effect of the proportion of the training set on the fusion results. Usually, deep-learning-based hyperspectral image sharpening uses 60% of the data for training and 40% for testing. In the experiment, training/testing splits of 50%/50%, 60%/40%, and 70%/30% were compared. The number of iterations, learning rate, and other parameters were kept the same. Each group of experiments was repeated 10 times; the experimental results are shown in Table 5. It can be seen that, when 60% of the samples were used for training, the number of training samples was moderate and the fusion results were improved. Therefore, 60% and 40% of the data were selected as the training and testing sets for subsequent experiments.

Comparisons with Advanced Methods
The proposed SSML was compared with five state-of-the-art methods, including three traditional methods, namely, CNMF [6], Bayesian naive [7], and GFPCA [9], and two deep-learning-based methods, namely, PanNet [18] and DDLPS [40]. For the two deep-learning methods and our method, each group of experiments was repeated 10 times. The experiments were performed on the CAVE and Pavia Center datasets.

Results on CAVE Data Set
The results of different methods on the CAVE dataset are presented in Figures 5-7. Figure 5b shows a blurred result, Figure 5d is too sharp, and Figure 5e has a color difference. In the colormaps, Figure 5a includes a large area of spectral distortion on the surface of the balloon, and Figure 5b,c,e have significant spectral distortions at the edges. The results of the SSML framework with the (SeAM and ResNet) hybrid function and the other methods are presented in Figure 6. There is a certain spectral distortion in Figure 6h,i, which were generated by $S_1$ (SeAM) and $S_2$ (ResNet) in the SSML framework, but it is lower than that of the other methods. The results of the SSML framework with the (RCAM and ResNet) hybrid function and the other methods are presented in Figure 6. The results in Figure 6h,i had higher visual image quality than the other results. Tables 6 and 7 show the evaluation indicators for the proposed method and several state-of-the-art methods. As shown in Table 6, CNMF, Bayesian naive, and GFPCA, which are not deep-learning methods, gave stable results in a short time, but were not as effective as the deep-learning methods. The SSML framework with $S_1$ (RCAM) had slightly lower values of the ERGAS and RMSE than the original RCAM; in most cases, the SSML framework with $S_1$ (RCAM) and $S_2$ (MSRNet) achieved better results than the other methods for all evaluation indicators. Regarding time consumption, the proposed framework took much less time than DDLPS and slightly more than PanNet, but fusion performance was improved. Bold and underlined values indicate the best results for $S_1$ and $S_2$, respectively.

Results on Pavia Center Dataset
The results of different methods on the Pavia Center dataset are presented in Figure 8. The SSML framework used the (RCAM and MSRNet) hybrid function. The colormaps in Figure 8a,c,d indicate that the corresponding methods performed relatively poorly in dealing with the shadow part; in Figure 8e, certain details, such as the river surface, are missing. In Figure 8h,i, it can be seen that the proposed framework improved image details compared to the original network. This also demonstrates the effectiveness of the proposed hybrid loss function in the mutual learning strategy. Bold and underlined values indicate the best results for $S_1$ and $S_2$, respectively.
As presented in Table 7, the indicator results of the proposed SSML framework were better than those of the comparison methods. Compared with the original networks, the SSML achieved obvious improvements for all indicators, which demonstrated the effectiveness of the proposed hybrid loss function in the mutual learning strategy.

Hybrid Loss Function Analysis
In this section, the reason for using a hybrid loss function consisting of two different loss functions (e.g., Equations (12) and (13)) instead of a single mutual learning loss function is explained. We compare the proposed SSML framework with the typical DML model [24]. Table 8 shows the effect of different mutual learning loss functions on the model performance. The SSML framework used the combination of SeAM ($S_1$) and MSRNet ($S_2$) on the CAVE dataset. When $S_1$ and $S_2$ both used the ($L_1$ + SAM) loss function, there was a positive effect on $S_2$ but a negative effect on $S_1$. The reason was that $S_1$ paid more attention to spectral features and could learn no additional spatial features from $S_2$, while $S_2$ did the opposite. When $S_1$ and $S_2$ both used the ($L_1$ + MSE) loss function, $S_1$ used its own spectral feature learning advantage and obtained spatial information from $S_2$, which yielded good results in the PSNR and SAM. Thus, the experimental results demonstrated the feasibility of the proposed hybrid loss function.

Generalization Ability of SSML
To verify the generalization ability of the proposed SSML framework, we applied the SSML framework to the state-of-the-art residual hyper-dense network (RHDN) method [15]. The original fusion results of the RHDN method were used as H ini in the SSML framework, as shown in Figure 1. Then the spectral S 1 , spatial S 2 networks, and their hybrid loss functions based on mutual learning strategies, were used to transfer information of different features to improve the results.
In experiments performed, we used the Pavia Center dataset-added, which was divided into 160 × 160 image blocks for training the RHDN method. As shown in Figure 9, four cases were also analyzed: the S 1 network used RCAM or SeAM, and the S 2 network used MSRNet or ResNet. The fusion results of the RHDN network were guided by mutual learning. From five performance indexes, especially SAM, RMSE, and ERGAS, we can see that the SSML framework was able to effectively improve the fusion effect when selecting the appropriate spectral and spatial network structure. Furthermore, the SSML framework only took a short time to upgrade the fusion results. Thus, the proposed SSML framework demonstrated generalization ability for HSI pansharpening.

Effect of Deep Network Parameter Number on SSML Performance
In SSML, the two networks learn the same task from each other to achieve optimal results. Table 9 compares the number of parameters of $S_1$ and $S_2$ in the SSML framework with those of PanNet and DDLPS. Compared with PanNet, the number of parameters of the SSML networks was greatly reduced; in particular, the parameter number of the SeAM was only one fifth that of PanNet. Compared with DDLPS, the parameter number of the SeAM was reduced by 24.8%, MSRNet by 28%, ResNet by 31%, and RCAM by 62.2%. These results indicate that SSML has better feature extraction capability with fewer parameters for the same task.

Conclusions
This paper proposes an SSML framework integrating spectral-spatial information mining for HSI pansharpening. In contrast to existing CNN-based hyperspectral pansharpening frameworks, we designed separate spectral and spatial networks, based on the DML strategy, for learning the spectral and spatial features. Furthermore, a set of hybrid loss functions, based on a mutual learning strategy, is proposed for transferring information between different features, which allows features to be extracted through mutual learning without introducing excessive computation. In the experiments, several cases were examined to evaluate the effect of DML on the pansharpening result. The results demonstrated that introducing the DML strategy into the SSML framework helped to achieve improved results in HSI pansharpening. The performance of the SSML framework was compared with several state-of-the-art methods; the results of the comparisons demonstrated the effectiveness and advantages of the proposed SSML framework. The latest fusion results were used to verify the generalization ability of the SSML framework, with improved results observed. Discussion of the feasibility of the hybrid loss function and the number of deep network parameters suggested that the proposed SSML framework represents a promising framework for HSI pansharpening.
In future, HSI pansharpening under the SSML framework will be explored further to identify improved spectral-spatial features for HSIs. A further research direction will involve the application of the DML strategy to other image-processing fields.