Image fusion for spatial enhancement of hyperspectral image via pixel group based non-local sparse representation

: Restricted by technical and budget constraints, hyperspectral images (HSIs) are usually obtained with low spatial resolution. In order to improve the spatial resolution of a given hyperspectral image, a new spatial and spectral image fusion approach via pixel group based non-local sparse representation is proposed, which exploits the spectral sparsity and spectral non-local self-similarity of the hyperspectral image. The proposed approach fuses the hyperspectral image with a high-spatial-resolution multispectral image of the same scene to obtain a hyperspectral image with high spatial and spectral resolutions. The input hyperspectral image is used to train the spectral dictionary, while the sparse codes of the desired HSI are estimated by jointly encoding the similar pixels in each pixel group extracted from the high-spatial-resolution multispectral image. To improve the accuracy of the pixel group based non-local sparse representation, the similar pixels in a pixel group are selected by utilizing both the spectral and spatial information. The performance of the proposed approach is tested on two remote sensing image datasets. Experimental results suggest that the proposed method outperforms a number of sparse representation based fusion techniques, and can preserve the spectral information while recovering the spatial details under large magniﬁcation factors.


Introduction
Hyperspectral images (HSIs) usually contain dozens or even hundreds of spectral bands.They are useful for accurate terrain detection, military surveillance and medical diagnosis [1].However, owing to the technical and budget constraints, there is a tradeoff between spectral resolution and spatial resolution, which often implies low spatial resolution of HSIs.This fact may severely impede the practical use of HSIs and, therefore, various spatial resolution enhancement algorithms [2][3][4][5] have been proposed with spatial and spectral fusion approaches playing an important role.In contrast to hyperspectral sensors, multispectral sensors produce images with relatively higher spatial resolution but less spectral bands.Thus, the fusion of these two types of image data supports the integration of the spatial details of a high spatial resolution multispectral image (MSI) and the spectral information of a HSI, thereby producing a HSI with both high spatial and high spectral resolutions.
Multiple spatial and spectral fusion approaches have been developed for the spatial resolution enhancement.Traditionally, hyperspectral images or multispectral images are fused with a high spatial resolution panchromatic (PAN) image, which is commonly called pan-sharpening [5][6][7][8].Representative algorithms of pan-sharpening include the component substitution based methods [6,7] and the multi-resolution based methods [8].Component substitution based methods transform the multispectral or hyperspectral image into a certain domain, in which the first component is replaced by the PAN image.In multi-resolution based approaches, the wavelet transform is commonly used to decompose the source images into high and low frequency components.After that, the high frequency component extracted from the PAN image is merged into the multispectral or hyperspectral data.These approaches successfully improve the spatial resolution of the multispectral or hyperspectral image, but they may cause unavoidable spectral or spatial distortion.
Another category of methods implementing spatial-spectral image fusion is unmixing based approaches [9][10][11][12][13].In such schemes, low spatial resolution hyperspectral data are unmixed into the endmember spectra and the corresponding abundances, and then the abundance maps are fused with the high spatial resolution image of the same scene (such as Red-Green-Blue image or MSI).Based on the fact that neighboring pixels normally share fractions of the same underlying material, Bieniarz et al. [13] employed a jointly sparse model to perform the unmixing of these neighboring pixels.In order to enhance unmixing accuracy, a dictionary trained with MSIs or PAN images from unrelated scenes is used in [12].However, the performance of these approaches degrades seriously in highly mixed situations [14].More recently, matrix factorization [4,15] has emerged as a powerful tool in unmixing based approaches, which aims to factorize the image data into two matrices based on a linear spectral mixture model [16].A coupled nonnegative matrix factorization unmixing approach [4] is proposed where HSI and MSI are separately unmixed.By combining the endmember matrix of the HSI and the abundance matrix of the MSI, the fusion result is generated.
The sparsity property of an image is an effective representation of the image prior knowledge for various kinds of spatial-spectral image fusion tasks [17][18][19][20][21].In recent years, motivated by the observation that there are only a few materials contributing to each pixel in HSIs [16], sparsity has been introduced in to matrix factorization based algorithms.These approaches do not require prior knowledge of the spatial transform matrix but instead assume that the HSI and the high spatial resolution image have the same sparse coefficients in the spectral domain.The general framework of these approaches [14,[22][23][24] can be outlined as follows.Firstly, a spectral dictionary is trained by extracting distinct spectral vectors in the low spatial resolution HSI.Next, the high spatial resolution image is sparsely encoded with the corresponding spectral dictionary, named sparse representation.Finally, the coefficients generated in the sparse representation procedure are used to produce the desired high spatial resolution HSI.Such schemes can always obtain better visual results and lead to the state-of-the-art performance of the challenging spatial resolution enhancement problem.
In this work, a new spatial and spectral fusion algorithm is proposed using the pixel group based non-local sparse representation technique, which exploits the non-local self-similarity of spectral vectors in the HSI.Firstly, a spectral dictionary is trained by the low spatial resolution HSI.Secondly, each pixel of the high spatial resolution MSI is jointly encoded with its similar pixels.Finally, the iterative back-projection technique is employed to refine the resulting image.The contributions of this work can be outlined as follows: (1) Differing from the conventional approaches that employ non-local self-similarity in the spatial domain, this work introduces the non-local self-similarity of the spectral vectors in HSIs to the fusion based spatial resolution enhancement problem.(2) The selection of similar pixels is carried out by utilizing not only the spectral information, but also the spatial information.
(3) Rather than processing the fusion procedure pixel by pixel as some previous works reported, a pixel group based scheme is utilized in this work.
The rest of this paper is organized as follows.Section 2 reviews the framework of sparse representation and the spectral dictionary learning technique, and recalls the non-local self-similarity of HSIs.The proposed spatial and spectral fusion method is introduced in detail in Section 3. In Section 4, experimental results and discussions are given to verify the effectiveness of the proposed algorithm.Finally, Section 5 presents the conclusions of this paper.

Related Works
Relevant works are introduced in this section, including basic concepts of sparse representation, spectral dictionary learning, and the non-local self-similarity of HSIs.

Sparse Representation
Sparse representation has proven to be an extremely powerful tool for acquiring, representing, and compressing high-dimensional signals [25].Given a signal vector y ∈ R L , sparse representation aims to represent it as the linear combination of certain basis vectors extracted from a basis matrix D ∈ R L×k (also called a dictionary) and to seek the sparsest coefficient α ∈ R k .This process can be expressed as the following optimization problem: where ε ≥ 0 is a preset small constant, which denotes the decomposition error.The notation || • || 0 is the L 0 -norm counting the number of non-zero elements in the vector.
The above optimization formulated in Equation ( 1) is a non-deterministic polynomial-time hard (NP-hard) problem, which is very complex to solve.There are two categories of algorithms that have been developed to approximate the optimal solution of this problem.One strategy is to adapt a greedy pursuit algorithm, which selects one or more appropriate basis in the dictionary at each step to iteratively represent the vector to be decomposed.The representative algorithms include the orthogonal matching pursuit (OMP) [26] and many improved versions of OMP [27].
Another strategy is to use a convex optimization algorithm, which replaces the L 0 -norm with the L 1 -norm in Equation (1), represented by methods such as Basic Pursuit (BP) [28], Lasso [29] and the iterative thresholding algorithm [30].

Spectral Dictionary Learning
Research results have shown that a HSI can be sparsely represented in the spectral domain [16].Each single pixel y ∈ R L in a HSI is a column vector, termed the spectral vector.Due to the low spatial resolution of HSI, each pixel y consists of a small number of distinct materials.The mixed pixel can be approximately expressed as a linear combination of these materials according to the linear mixing model (LMM) [16], expressed as: where is the spectral dictionary with k columns, where each column d i (called an atom) is a L-sized column vector representing the reflectance vector of an underlying material.As the number of materials in each mixed pixel is small, the coefficient α can be seen as sparse.
The learning of the spectral dictionary is an important procedure which may affect the performance of sparse representation [31].The goal of spectral dictionary learning is to find a collection of atoms that best represents the sample spectral vectors.This is expressed as the follow optimization problem, with n training samples: arg min where is the set of training samples.The most commonly used algorithm for dictionary learning is the K-singular value decomposition (K-SVD) algorithm [32], in which the dictionary and the coefficient matrix are updated alternately.A Bayesian dictionary learning method is proposed in [24], where the dictionaries are learned with the Beta Process.The dictionary learning algorithm proposed in [33] falls into the class of online algorithms based on stochastic approximations, processing one sample at a time.Briefly, there are two main steps for each sample y i in the training set: (1) sparse decomposition to obtain the coefficient α i when D is fixed; and (2) dictionary updating using a second-order optimization technique when α i is fixed.More detailed descriptions can be found in [33].

Non-Local Self-Similarity of Hyperspectral Images
Due to the information redundancy of HSIs, there may exist many similar or repeating structures in an image (as shown in Figure 1).These similar patches can provide extra information useful for preserving the details and has been extensively utilized in various kinds of image processing tasks such as denoising [34], fusion [35] and super-resolution [36,37].The first algorithm using the non-local self-similarity property of an image is proposed for natural image denoising [34], in which each pixel of the noisy image is replaced by the weighted average of all pixels whose neighborhood is similar to the neighborhood of the current pixel.This methodology is also employed in a dictionary learning process for improved image fusion results [35].In the super-resolution approaches [36,37], the non-local similarity of HSIs is employed to regularize the reconstructed image by using this property as a regularization term.This has been proven helpful for improving the quality of the reconstructed image.
in [24], where the dictionaries are learned with the Beta Process.The dictionary learning algorithm proposed in [33] falls into the class of online algorithms based on stochastic approximations, processing one sample at a time.Briefly, there are two main steps for each sample i y in the training set: (1) sparse decomposition to obtain the coefficient i α when D is fixed; and (2) dictionary updating using a second-order optimization technique when i α is fixed.More detailed descriptions can be found in [33].

Non-Local Self-Similarity of Hyperspectral Images
Due to the information redundancy of HSIs, there may exist many similar or repeating structures in an image (as shown in Figure 1).These similar patches can provide extra information useful for preserving the details and has been extensively utilized in various kinds of image processing tasks such as denoising [34], fusion [35] and super-resolution [36,37].The first algorithm using the non-local self-similarity property of an image is proposed for natural image denoising [34], in which each pixel of the noisy image is replaced by the weighted average of all pixels whose neighborhood is similar to the neighborhood of the current pixel.This methodology is also employed in a dictionary learning process for improved image fusion results [35].In the super-resolution approaches [36,37], the non-local similarity of HSIs is employed to regularize the reconstructed image by using this property as a regularization term.This has been proven helpful for improving the quality of the reconstructed image.
Apart from the spatial self-similarity, non-local self-similarity of spectral vectors exists in HSIs.By exploiting the non-local self-similarity in the spectral domain, a new hyperspectral and multispectral image fusion approach is proposed in this paper.Rather than simply averaging the similar patches or pixels in hyperspectral images, this work jointly encodes the similar spectral vectors in the multispectral image, which can effectively avoid generating overly smooth results.

Problem Formulation
Given a HSI with low spatial resolution (hereafter termed LR-HSI) , and a high spatial resolution MSI (hereafter termed HR-MSI) of the same scene, this work aims to generate a HSI with high spatial resolution (hereafter, termed HR-HSI) . Here, m and M denote the respective image height, n and N the image width, and l and L the number of image bands.Note that there is the following relation: For the convenience of implementation, the m × n × L dimensional LR-HSI , where each column of h Y stands for one pixel in location ( , ) i j .Similarly, the HR-HSI and the HR-MSI , respectively.Apart from the spatial self-similarity, non-local self-similarity of spectral vectors exists in HSIs.By exploiting the non-local self-similarity in the spectral domain, a new hyperspectral and multispectral image fusion approach is proposed in this paper.Rather than simply averaging the similar patches or pixels in hyperspectral images, this work jointly encodes the similar spectral vectors in the multispectral image, which can effectively avoid generating overly smooth results.

Problem Formulation
Given a HSI with low spatial resolution (hereafter termed LR-HSI) Y h ∈ R m×n×L , and a high spatial resolution MSI (hereafter termed HR-MSI) Y m ∈ R M×N×l of the same scene, this work aims to generate a HSI with high spatial resolution (hereafter, termed HR-HSI) X h ∈ R M×N×L .Here, m and M denote the respective image height, n and N the image width, and l and L the number of image bands.Note that there is the following relation: m < M, n < N, l < L.
For the convenience of implementation, the m , where each column of Y h stands for one pixel in location (i, j).Similarly, the HR-HSI X h ∈ R M×N×L and the HR-MSI Y m ∈ R M×N×l are transformed to Each pixel in the HR-MSI may be regarded as the spectral degradation of the desired pixel in HR-HSI: where T ∈ R l×L is the spectral mapping matrix, which is determined by the relationship between the HSI sensor and the MSI sensor.Since in general l << L, the reconstruction of X h from Y m is impossible without other prior knowledge.According to the LMM formulated in Equation ( 2), each pixel of the desired HR-HSI X h can be decomposed as x h (i, j) = Dα ij .Considering the sparse constraint together with the linear mixing model, then the spatial resolution improvement problem can be solved by seeking the sparsest coefficient α ij that satisfies the degradation equation y m (i, j) = TDα ij : The spectral dictionary D is learned by applying an online dictionary learning algorithm [33] to a set of training samples, which are obtained by directly selecting column spectral vectors in Y h ∈ R L×mn .Once the spectral dictionary is known, the coefficient matrix can be computed by the proposed pixel group based non-local sparse representation technique, where a pixel group (PG) based strategy is adopted to implement the proposed method.

Pixel Group Based Non-Local Sparse Representation
The same scene of an HSI can contain many reoccurrences of the underlying materials that may exhibit similar spectral curves amongst materials of the same type (such as buildings, roads and lawns, etc.).The spectral reflectance of the central pixels in two similar cubic patches looks similar to each other (as shown in Figure 2).However, HR-HSI is not available during reconstruction.Thus, it is assumed that similar pixels in the HR-MSI are also similar in the HR-HSI at the same location.This is reasonable because the HR-MSI is obtained by the spectral down-sampling of HR-HSI.Therefore, the HR-MSI is employed to estimate the spectral self-similarity in HR-HSI.This spectral non-local image self-similarity is applied here to perform the pixel group based non-local sparse representation procedure, assuming that an ensemble of similar spectral vectors shares the same sparse pattern with different coefficients.The sparse codes of the desired HR-HSI are estimated by jointly encoding the similar pixels in each pixel group extracted from the HR-MSI.The pixel group based non-local sparse representation (hereafter termed PG-NLSR) procedure is illustrated in Figure 3.
Remote Sens. 2017, 9, 53 5 of 19 Each pixel in the HR-MSI may be regarded as the spectral degradation of the desired pixel in HR-HSI: where is the spectral mapping matrix, which is determined by the relationship between the HSI sensor and the MSI sensor.Since in general l L << , the reconstruction of h X from m Y is impossible without other prior knowledge.According to the LMM formulated in Equation ( 2), each pixel of the desired HR-HSI h X can be decomposed as ( , ) Considering the sparse constraint together with the linear mixing model, then the spatial resolution improvement problem can be solved by seeking the sparsest coefficient ij α that satisfies the degradation equation The spectral dictionary D is learned by applying an online dictionary learning algorithm [33] to a set of training samples, which are obtained by directly selecting column spectral vectors in Once the spectral dictionary is known, the coefficient matrix can be computed by the proposed pixel group based non-local sparse representation technique, where a pixel group (PG) based strategy is adopted to implement the proposed method.

Pixel Group based Non-Local Sparse Representation
The same scene of an HSI can contain many reoccurrences of the underlying materials that may exhibit similar spectral curves amongst materials of the same type (such as buildings, roads and lawns, etc.).The spectral reflectance of the central pixels in two similar cubic patches looks similar to each other (as shown in Figure 2).However, HR-HSI is not available during reconstruction.Thus, it is assumed that similar pixels in the HR-MSI are also similar in the HR-HSI at the same location.This is reasonable because the HR-MSI is obtained by the spectral down-sampling of HR-HSI.Therefore, the HR-MSI is employed to estimate the spectral self-similarity in HR-HSI.This spectral non-local image self-similarity is applied here to perform the pixel group based non-local sparse representation procedure, assuming that an ensemble of similar spectral vectors shares the same sparse pattern with different coefficients.The sparse codes of the desired HR-HSI are estimated by jointly encoding the similar pixels in each pixel group extracted from the HR-MSI.The pixel group based non-local sparse representation (hereafter termed PG-NLSR) procedure is illustrated in Figure 3.   ), but also the spectral information ( 2 w ).To do this, the similar weights between the current pixel ( , ) m y i j and pixel ( , ) m y s t in that searching window are first computed: Y y s t y s t y s t =  be the pixel group consisting of ( , ) m y i j and its similar pixels, in which the first column is the current pixel ( , ) m y i j .The SOMP algorithm is then employed to simultaneously encode the pixels in ( , )   i j m Y to obtain their non-local sparse coefficients, denoted by ( ˆi, j) A .Then the spatial resolution improvement problem formulated in (5) can be converted to the following pixel group based non-local sparse representation problem: where ε denotes the model error and , the desired HR-HSI is generated.More specifically, for each pixel y m (i, j) ∈ R l of the HR-MSI, there are two main steps of the PG-NLSR: (1) constructing the pixel group of similar pixels; and (2) computing the sparse coefficients of the pixel group in Equation ( 1) through the simultaneous orthogonal matching pursuit (SOMP) [38] algorithm.Similar pixels are sought within a cubic searching window centered at y m (i, j).The selection of similar pixels is carried out by considering not only the spatial information (w 1 ), but also the spectral information (w 2 ).To do this, the similar weights between the current pixel y m (i, j) and pixel y m (s, t) in that searching window are first computed: where ||p k ij − p k st || 2 2,a denotes the square of the Euclidean distance between the k-th band image patches p k ij and p k st , which are centered at y m (i, j) and y m (s, t), respectively.a > 0 denotes the standard deviation of the Gaussian kernel function and Z is the normalizing constant.The parameters h 1 , h 2 control the decay of the exponential function.SAM (Spectral Angle Mapper) denotes the spectral difference metric [39] between pixels y m (i, j) and y m (s, t).The first b biggest w(ij, st) are chosen and the corresponding y m (s, t) are selected as the similar pixels of y m (i, j).In addition, the similarity weight w(ij, st) is also used to add a weight to the inner product when running the SOMP algorithm in order to obtain the sparse coefficients. Let ] be the pixel group consisting of y m (i, j) and its similar pixels, in which the first column is the current pixel y m (i, j).The SOMP algorithm is then employed to simultaneously encode the pixels in Y (i,j) m to obtain their non-local sparse coefficients, denoted by Â(i,j) .Then the spatial resolution improvement problem formulated in (5) can be converted to the following pixel group based non-local sparse representation problem: where ε denotes the model error and Â(i,j) = [α s 1 t 1 , αs 2 t 2 • • • , αs b t b ] denotes the coefficient matrix.The notation || • || row,0 is the norm counting the number of non-zero rows in the matrix.By integrating the estimated pixel groups { X(i,j) • • • , N}, the desired HR-HSI is generated.

Algorithm
The proposed approach fuses the LR-HSI with a HR-MSI of the same scene.In the proposed algorithm, the LR-HSI is used to train the spectral dictionary, while the sparse coefficient matrix of each pixel group in HR-HSI is computed using the HR-MSI, guaranteeing that similar pixels are sparsely decomposed into the same subset of dictionary atoms.The proposed spatial and spectral fusion algorithm is illustrated in Figure 4. Firstly, the spectral dictionary is trained by performing an online dictionary learning algorithm to the given LR-HSI.Secondly, every pixel of the HR-MSI is extracted and a group of similar pixels of the current pixel is constructed.Thirdly, the pixels in the resulting group are jointly encoded using the SOMP algorithm.Finally, the spectral dictionary and the coefficients are combined to generate the required HR-HSI.

Algorithm
The proposed approach fuses the LR-HSI with a HR-MSI of the same scene.In the proposed algorithm, the LR-HSI is used to train the spectral dictionary, while the sparse coefficient matrix of each pixel group in HR-HSI is computed using the HR-MSI, guaranteeing that similar pixels are sparsely decomposed into the same subset of dictionary atoms.The proposed spatial and spectral fusion algorithm is illustrated in Figure 4. Firstly, the spectral dictionary is trained by performing an online dictionary learning algorithm to the given LR-HSI.Secondly, every pixel of the HR-MSI is extracted and a group of similar pixels of the current pixel is constructed.Thirdly, the pixels in the resulting group are jointly encoded using the SOMP algorithm.Finally, the spectral dictionary and the coefficients are combined to generate the required HR-HSI.

Experimental Results and Discussion
To verify the effectiveness of this proposed method, simulated experiments are carried out on two remote sensing datasets: (1) the AVRIS dataset; and (2) the ROSIS dataset.

Experimental Setup
In the experimental studies, the proposed algorithm is applied to four 224-band HSIs taken by AVIRIS [40] and two HSIs taken by ROSIS [41], where the LR-HSI is fused with a simulated HR-MSI.Some parameters used in the experimentation are set as follows: the number of atoms in spectral dictionary k is 326 (the first atom of the spectral dictionary is the spatially-constant "DC" component); the number of similar pixels chosen in the non-local sparse representation procedure is set to 4 b = ; the size of the cubic searching window is set to 5 5 l × × , while the size of the similar patch in Equation ( 7) is set to 3 3 × ; and the parameters 1 2 , μ μ in Equation ( 6) are empirically set as To further reduce the reconstruction error, the results are refined by the iterative back-projection technique.The performance of the proposed approach is compared with four fusion based schemes, namely the Matrix Factorization based approach (MF) [22], the Spatial and Spectral Fusion Model (SSFM) [23], the spatio-spectral sparse representation method, GSOMP [14], and the Bayesian Sparse Representation method (BSR) [24].Additionally, we also test a simple Principal Component Analysis (PCA) [42] to obtain the basis in spectral dictionary D and then use Equation ( 2) to directly solve for the sparse coefficients.The performance of the different algorithms for the two data sets were tested on a PC with an Intel Core i5-4570 CPU @ 3.20 GHz and 8 GB RAM, using MATLAB R2014a.

Experimental Results and Discussion
To verify the effectiveness of this proposed method, simulated experiments are carried out on two remote sensing datasets: (1) the AVRIS dataset; and (2) the ROSIS dataset.

Experimental Setup
In the experimental studies, the proposed algorithm is applied to four 224-band HSIs taken by AVIRIS [40] and two HSIs taken by ROSIS [41], where the LR-HSI is fused with a simulated HR-MSI.Some parameters used in the experimentation are set as follows: the number of atoms in spectral dictionary k is 326 (the first atom of the spectral dictionary is the spatially-constant "DC" component); the number of similar pixels chosen in the non-local sparse representation procedure is set to b = 4; the size of the cubic searching window is set to 5 × 5 × l, while the size of the similar patch in Equation ( 7) is set to 3 × 3; and the parameters µ 1 , µ 2 in Equation ( 6) are empirically set as µ 1 = 0.7, µ 2 = 0.3.To further reduce the reconstruction error, the results are refined by the iterative back-projection technique.The performance of the proposed approach is compared with four fusion based schemes, namely the Matrix Factorization based approach (MF) [22], the Spatial and Spectral Fusion Model (SSFM) [23], the spatio-spectral sparse representation method, GSOMP [14], and the Bayesian Sparse Representation method (BSR) [24].Additionally, we also test a simple Principal Component Analysis (PCA) [42] to obtain the basis in spectral dictionary D and then use Equation ( 2) to directly solve for the sparse coefficients.The performance of the different algorithms for the two data sets were tested on a PC with an Intel Core i5-4570 CPU @ 3.20 GHz and 8 GB RAM, using MATLAB R2014a.

Performance Evaluation
To quantitatively assess the performance of the proposed spatial and spectral fusion algorithm, five quality indices are considered.The first one is the Root Mean Square Error (RMSE), which measures the difference between the estimated Xh and the original HR-HSI X h (across all spectral bands) as follows: The Peak Signal-to-Noise Ratio (PSNR) index is then easily computed via RMSE: where MSE = RMSE 2 , and MAX represents the maximum value of Xh .To measure the structural spatial details of the estimated images, the third index, the Average-Structural SIMilarity (A-SSIM), is calculated by averaging the SSIM metric among all spectral bands: where µ, σ 2 and σ are the mean, variance and covariance of the corresponding image matrices, respectively; and X i h and Xi h denote the ith band of X h and Xh , respectively.To measure the spectral reconstruction performance, the Spectral Angel Mapper (SAM) [43] is computed.The SAM index is defined as the spectral angle between the estimated pixel xh (i, j) and the original pixel x h (i, j): and the final SAM is obtained by averaging the SAMs of all pixels in an image.The last index is the relative dimensionless global error in synthesis (ERGAS) [43], which is defined as: where d h /d l is the ratio between the pixel sizes of the HR-HSI and the LR-HSI.The best value of RMSE, SAM and ERGAS are zero; the best value of A-SSIM is 1; and the best value of PSNR is +∞.The RMSE, A-SSIM and PSNR indices show the degree of spatial similarity between the estimated image and the corresponding original HR-HSI, with the SAM index showing the degree of spectral similarity, while the ERGAS index reflects a global picture of the quality of the fused image.

Experiments on AVIRIS Dataset
In this section, we apply the proposed PG-NLSR algorithm to four 224-band remote sensing HSIs, which are taken by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor [44].The first image Cuprite was taken over Cuprite, NV, in 1997 with an original spatial resolution at 20 m.The second, Jasper-Ridge, was acquired over Jasper Ridge, CA, USA, in 1994, with a spatial resolution of 20 m.The third image Moffett-Field was acquired over Moffett Field, CA, USA, in 1994 by Jet Propulsion Laboratory at a 20 m resolution.The last image San Diego, acquired in 2002, covers a naval air station in San Diego, CA, USA, with a spatial resolution of 3.5 m.After removing the noisy and water vapor absorption bands, four sub-images (shown in Figure 5) of size 256 × 256 × 189 are selected and used as the original HR-HSIs.The LR-HSIs are simulated by first applying a 5 × 5 Gaussian kernel with standard deviation 2.5 to the original HR-HSIs and then downsampling along both horizontal and vertical directions with a scaling factor of 8 (i.e., the spatial resolution of the four simulated LR-HSIs is 160 m, 160 m, 160 m and 28 m, respectively).The Gaussian white noise is added to the LR-HSIs with a standard deviation 0.5.The HR-MSIs are generated by integrating the bands of the original HR-HSIs with uniform spectral response functions corresponding to Landsat TM bands 1-5 and 7 at 20 m or 3.5 m resolution, which cover the 450-520, 520-600, 630-690, 760-900, 1550-1750 and 2080-2350 nm regions, respectively [45].The estimated HR-HSIs of 460 nm, 540 nm, 620 nm and 1300 nm bands using the proposed approach are shown in Figures 6 and 7, where the fourth row shows the error image of these bands.Here, the error images are generated by computing the differences between the estimated HR-HSI and the original HR-HSI, on a pixel by pixel strategy.Quantitative evaluation measures for different enhancement approaches are compared in Table 1.The proposed method PG-NLSR outperformed all compared methods except BSR which for a couple of indices has better results in images Cuprite and San Diego.PG-NLSR has generated most of the best results which are indicated in bold.The LR-HSIs are simulated by first applying a 5 × 5 Gaussian kernel with standard deviation 2.5 to the original HR-HSIs and then downsampling along both horizontal and vertical directions with a scaling factor of 8 (i.e., the spatial resolution of the four simulated LR-HSIs is 160 m, 160 m, 160 m and 28 m, respectively).The Gaussian white noise is added to the LR-HSIs with a standard deviation 0.5.The HR-MSIs are generated by integrating the bands of the original HR-HSIs with uniform spectral response functions corresponding to Landsat TM bands 1-5 and 7 at 20 m or 3.5 m resolution, which cover the 450-520, 520-600, 630-690, 760-900, 1550-1750 and 2080-2350 nm regions, respectively [45].The estimated HR-HSIs of 460 nm, 540 nm, 620 nm and 1300 nm bands using the proposed approach are shown in Figures 6 and 7, where the fourth row shows the error image of these bands.Here, the error images are generated by computing the differences between the estimated HR-HSI and the original HR-HSI, on a pixel by pixel strategy.Quantitative evaluation measures for different enhancement approaches are compared in Table 1.The proposed method PG-NLSR outperformed all compared methods except BSR which for a couple of indices has better results in images Cuprite and San Diego.PG-NLSR has generated most of the best results which are indicated in bold.Qualitatively, the estimated HR-HSIs shown in Figures 6c and 7c are very close to the original HR-HSIs in Figures 6b and 7b.Error image (Figure 6d) for the Cuprite image show very minor differences between the estimated HR-HSI (Figure 6c) and the original image (Figure 6b).The estimation error is concentrated mainly around the areas where pixels change rapidly.However, higher errors can be found in Figure 7d in boundary of smaller objects and transitional land cover types.Figure 8 shows the visual results of the Moffett-field image.The spatial details of the fusion images among different algorithms are illustrated.The black box in Figure 8a shows the area-of-interest that were enlarged and compared in Figure 8c-h.Both buildings and roads can be clearly seen in all fused images.However, the details in the lower-left corner of Figure 8h are cleaner and the color of Figure 8h has a better resemblance to the original HR-HSI of Figure 8b.Quantitative comparisons shown in Table 1 indicate that the performance of the proposed scheme is better than that of the rest.This is because the proposed pixel group based non-local sparse representation technique makes full use of the similar spectral vectors within a certain searching window.Qualitatively, the estimated HR-HSIs shown in Figures 6c and 7c are very close to the original HR-HSIs in Figures 6b and 7b.Error image (Figure 6d) for the Cuprite image show very minor differences between the estimated HR-HSI (Figure 6c) and the original image (Figure 6b).The estimation error is concentrated mainly around the areas where pixels change rapidly.However, higher errors can be found in Figure 7d in boundary of smaller objects and transitional land cover types.Figure 8 shows the visual results of the Moffett-field image.The spatial details of the fusion images among different algorithms are illustrated.The black box in Figure 8a shows the area-of-interest that were enlarged and compared in Figure 8c-h.Both buildings and roads can be clearly seen in all fused images.However, the details in the lower-left corner of Figure 8h are cleaner and the color of Figure 8h has a better resemblance to the original HR-HSI of Figure 8b.Quantitative comparisons shown in Table 1 indicate that the performance of the proposed scheme is better than that of the rest.This is because the proposed pixel group based non-local sparse representation technique makes full use of the similar spectral vectors within a certain searching window.460 nm 540 nm 620 nm 1300 nm The number of similar pixels chosen in a pixel group (b) is an important parameter in the proposed PG-NLSR, which controls the balance between the fusion accuracy and the computational efficiency.The performance (RMSE) and running time (in seconds) of different values for parameter b on image Cuprite is shown in Figure 9.It can be seen from the curve that when more similar pixels are selected, the fusion results will be better, owing to that more extra information are provided and a more optimal sparse representation of a given pixel is found.However, the computational cost will inevitably increase rapidly as more similar pixels are selected.As shown in Figure 9a, the RMSE assessment decreases rapidly before b = 4, and the decrease becomes much slower after that point.By trading off the fusion accuracy and the computational efficiency, we select the four most similar pixels from the total of 25 pixels in the cubic searching window.The number of similar pixels chosen in a pixel group (b) is an important parameter in the proposed PG-NLSR, which controls the balance between the fusion accuracy and the computational efficiency.The performance (RMSE) and running time (in seconds) of different values for parameter b on image Cuprite is shown in Figure 9.It can be seen from the curve that when more similar pixels are selected, the fusion results will be better, owing to that more extra information are provided and a more optimal sparse representation of a given pixel is found.However, the computational cost will inevitably increase rapidly as more similar pixels are selected.As shown in Figure 9a, the RMSE assessment decreases rapidly before b = 4, and the decrease becomes much slower after that point.By trading off the fusion accuracy and the computational efficiency, we select the four most similar pixels from the total of 25 pixels in the cubic searching window.(c) PCA [42]; (d) MF [22]; (e) SSFM [23]; (f) GSOMP [14]; (g) BSR [24]; and (h) PG-NLSR.

Experiments on ROSIS Dataset
To further demonstrate the proposed method can be effective with concurrent sensors, we test the fusion with LR-HSI and Sentinel 2A-like HR-MSI.The experiment makes use of a 102-band HSI (Pavia Centre) and a 103-band HSI (Pavia University) as shown in Figure 10    The visual results of different approaches at scale factor = 8 are presented in Figure 12, while the quantitative evaluation measures are compared in Tables 2 and 3    The visual results of different approaches at scale factor = 8 are presented in Figure 12, while the quantitative evaluation measures are compared in Tables 2 and 3   The visual results of different approaches at scale factor = 8 are presented in Figure 12, while the quantitative evaluation measures are compared in Tables 2 and 3          As scale factor increases, the information lost in the down-sampling procedure rises rapidly, and the resolution enhancement task will be much more difficult.Qualitative and quantitative assessments have both demonstrated the effectiveness of the proposed PG-NLSR fusion method at different scale factors.Qualitative comparisons depicted in Figure 12 indicate that the estimated HR-HSI shown in Figure 12h seems no difference with the original high spatial resolution data in Figure 12b, while the quantitative indices in Tables 2 and 3 show that the proposed PG-NLSR technique outperforms the other fusion methods.As shown in Tables 2 and 3, The Bayesian Sparse Representation based method [24] performs slightly better at some indices, but the computing time is unavoidably much higher as a consequence (shown in Table 4).

Discussion
From the above experimental results, it can be seen that the proposed approach outperforms the other five fusion based methods, with acceptable running time.In particular, for the AVIRIS dataset, on average, the RMSE of the proposed method decreases 0.6 when compared with that of BSR, and in comparison with the GSOMP, SSFM, MF and the PCA algorithms, the improvement is even more significant reduction of 6.98, 2.9, 2.44 and 0.77 in RMSE, respectively).For the ROSIS dataset, the BSR-based method performs slightly better at some indices where the scale factor is large, but this is at the cost of a significant amount of time (to implement the Bayesian learning and coding procedures).
The superiority of the proposed approach is owing to the employment of pixel group based non-local sparse representation, where a group of similar pixels are encoded simultaneously.This strategy supports the encoding procedure to utilize the information provided by not only the current pixel itself but also those similar to it.Of course, how to choose similar pixels in a group is an important process that may affect the fusion outcome.In this work, the degree of similarity of two pixels is measured by a weighted average of the spatial similarity and the spectral similarity, with the weights empirically set to 0.7 for the former and 0.3 for the latter.However, this strategy does not take the structural property of HSI into consideration.Besides, the fixed weights may not be suitable for all images.Therefore, to achieve more accurate fusion results, it is interesting to investigate whether a more appropriate similarity measure can be devised, but this remains as further research.
It is also interesting to note that whilst the experimental results demonstrate the effectiveness of the proposed approach, the computational cost is generally quite high when the pixel group based non-local sparse representation technique is used.Methods that could help expedite the required computation would be very helpful.Currently, a spectral dictionary is trained for each input HIS, which is time consuming.Thus, learning an off-line dictionary that could be used for a large amount of images would be another interesting improvement of this research.

Conclusions
In this work, a spatial and spectral image fusion approach with non-local sparse representation has been presented.The proposed PG-NLSR approach fuses an LR-HSI with an HR-MSI of the same scene to improve the spatial resolution of the LR-HSI.It learns a spectral dictionary using the LR-HSI, and applies the pixel group based non-local sparse representation technique to obtain the sparse code of a desirable high spatial resolution image.In so doing, the proposed work exploits the non-local self-similarity of hyperspectral images.In addition, the present research allows for the selection of similar pixels to be carried out using not only the spectral information, but also the spatial information.The approach has been systematically compared with a number of existing fusion based techniques.
The experimental comparisons involve two remote sensing datasets, demonstrating the effectiveness of this work.

Figure 1 .
Figure 1.Similar or repeating patches (denoted by small white squares) in HSI composite images with bands 28, 19, and 10 in red green, and blue, respectively.

Figure 1 .
Figure 1.Similar or repeating patches (denoted by small white squares) in HSI composite images with bands 28, 19, and 10 in red green, and blue, respectively.

Figure 2 .
Figure 2. Similar pixels in a sample hyperspectral image: (a) A three-dimensional HSI; and (b) Reflected spectral curve at the corresponding pixels of each band in the HSI.

Figure 2 .
Figure 2. Similar pixels in a sample hyperspectral image: (a) A three-dimensional HSI; and (b) Reflected spectral curve at the corresponding pixels of each band in the HSI.

Figure 3 .
Figure 3. Pixel Group based Non-local Sparse Representation.More specifically, for each pixel ( , ) l m y i j R ∈ of the HR-MSI, there are two main steps of the PG-NLSR: (1) constructing the pixel group of similar pixels; and (2) computing the sparse coefficients of the pixel group in Equation (1) through the simultaneous orthogonal matching pursuit (SOMP) [38] algorithm.Similar pixels are sought within a cubic searching window centered at ( , ) m y i j .The selection of similar pixels is carried out by considering not only the spatial information ( 1 w), but also the spectral information ( 2 w ).To do this, the similar weights between the of the Euclidean distance between the k-th band image patches k ij p and k st p , which are centered at ( , ) m y i j and ( , ) m y s t , respectively.0 a > denotes the standard deviation of the Gaussian kernel function and Z is the normalizing constant.The parameters 1 2 , h h control the decay of the exponential function.SAM (Spectral Angle Mapper) denotes the spectral difference metric [39] between pixels ( , ) m y i j and ( , ) m y s t .The first b biggest ( , ) w ij st are chosen and the corresponding ( , )m y s t are selected as the similar pixels of ( , ) m y i j .In addition, the similarity weight ( , ) w ij st is also used to add a weight to the inner product when running the SOMP algorithm in order to obtain the sparse coefficients.Let denotes the coefficient matrix.The notation ,0 || || row ⋅ is the norm counting the number of non-zero rows in the matrix.By integrating the estimated pixel groups

Figure 3 .
Figure 3. Pixel Group based Non-local Sparse Representation.

Figure 4 .
Figure 4. Flowchart of the proposed HSI and MSI fusion approach.

Figure 4 .
Figure 4. Flowchart of the proposed HSI and MSI fusion approach.

Figure 9 .
Figure 9. Selection of the PG-NLSR parameter b (the number of similar pixels in a pixel group): (a) RMSE index; and (b) Running time in seconds.
with an original spatial resolution of 1.3 m.These two images were acquired in 2001 by the ROSIS (Reflective Optics System Imaging Spectrometer) optical sensor over the center area and the University of Pavia, Italy.The flight was operated by the Deutsches Zentrum für Luft-und Raumfahrt (DLR, the German Aerospace Agency) in the framework of the HySens project, managed and sponsored by the European Union.The noisy and water vapor absorption bands have been removed from the initially 115 bands.A region of 256 × 256 pixels are selected and used as the original HR-HSIs, which is then blurred by a 5 × 5 Gaussian kernel with standard deviation 2.5.Then the images were down-sampled with the scale factors (denoted by S) of 4, 8 and 16 to simulate the LR-HSIs in corresponding spatial resolutions of 5.2 m, 10.4 m and 20.8 m, respectively.The Gaussian white noise is added to the LR-HSIs with a standard deviation 0.5.The HR-MSI is generated by filtering the HR-HSI with Sentinel2A-like spectral responses (bands 1-8).The reflectance spectral responses of the simulated bands used for the fusion are depicted in Figure 11.

Figure 9 .
Figure 9. Selection of the PG-NLSR parameter b (the number of similar pixels in a pixel group): (a) RMSE index; and (b) Running time in seconds.

Figure 9 .
Figure 9. Selection of the PG-NLSR parameter b (the number of similar pixels in a pixel group): (a) RMSE index; and (b) Running time in seconds.
with an original spatial resolution of 1.3 m.These two images were acquired in 2001 by the ROSIS (Reflective Optics System Imaging Spectrometer) optical sensor over the center area and the University of Pavia, Italy.The flight was operated by the Deutsches Zentrum für Luft-und Raumfahrt (DLR, the German Aerospace Agency) in the framework of the HySens project, managed and sponsored by the European Union.The noisy and water vapor absorption bands have been removed from the initially 115 bands.A region of 256 × 256 pixels are selected and used as the original HR-HSIs, which is then blurred by a 5 × 5 Gaussian kernel with standard deviation 2.5.Then the images were down-sampled with the scale factors (denoted by S) of 4, 8 and 16 to simulate the LR-HSIs in corresponding spatial resolutions of 5.2 m, 10.4 m and 20.8 m, respectively.The Gaussian white noise is added to the LR-HSIs with a standard deviation 0.5.The HR-MSI is generated by filtering the HR-HSI with Sentinel2A-like spectral responses (bands 1-8).The reflectance spectral responses of the simulated bands used for the fusion are depicted in Figure 11.

Figure 10 .
Figure 10.Synthetic RGB images of the test HSIs, taken by ROSIS with bands 31, 21, and 11 as red, green, and blue, respectively: (a) Pavia Centre; and (b) Pavia University.
. The average running time (in seconds) of different algorithms is shown in Table 4.The spectral reflectance difference value of four single pixels ((a) (50, 50), (b) (100, 100), (c) (150, 150), and (d) (180, 200)) shown in Figures 13 and 14 compares the estimation error between the original HR-HSIs and the resulting images of different fusion algorithms.The results show that the proposed approach has the smallest difference when compared to the actual pixel value.
. The average running time (in seconds) of different algorithms is shown in Table 4.The spectral reflectance difference value of four single pixels ((a) (50, 50), (b) (100, 100), (c) (150, 150), and (d) (180, 200)) shown in Figures 13 and 14 compares the estimation error between the original HR-HSIs and the resulting images of different fusion algorithms.The results show that the proposed approach has the smallest difference when compared to the actual pixel value.
. The average running time (in seconds) of different algorithms is shown in Table 4.The spectral reflectance difference value of four single pixels ((a) (50, 50), (b) (100, 100), (c) (150, 150), and (d) (180, 200)) shown in Figures 13 and 14 compares the estimation error between the original HR-HSIs and the resulting images of different fusion algorithms.The results show that the proposed approach has the smallest difference when compared to the actual pixel value.

Table 1 .
Evaluation assessments of different fusion schemes for remote sensing HSIs.

Table 1 .
Evaluation assessments of different fusion schemes for remote sensing HSIs.

Table 2 .
Evaluation assessments for Pavia Centre HSI over different scale factors.

Table 2 .
Evaluation assessments for Pavia Centre HSI over different scale factors.

Table 2 .
Evaluation assessments for Pavia Centre HSI over different scale factors.

Table 3 .
Evaluation assessments for Pavia University HSI over different scale factors.

Table 3 .
Evaluation assessments for Pavia University HSI over different scale factors.

Table 4 .
Average Computational Time (in seconds) of Different Algorithms.