Spatially Enhanced Spectral Unmixing Through Data Fusion of Spectral and Visible Images from Different Sensors

Abstract: We propose an unmixing framework for enhancing endmember fraction maps using a combination of spectral and visible images. The new method, data fusion through spatial information-aided learning (DFuSIAL), is based on a learning process for the fusion of a multispectral image of low spatial resolution and a visible RGB image of high spatial resolution. Unlike commonly used methods, DFuSIAL allows for fusing data from different sensors. To achieve this objective, we apply a learning process using automatically extracted invariant points, which are assumed to have the same land cover type in both images. First, we estimate the fraction maps of a set of endmembers for the spectral image. Then, we train a spatial features-aided neural network (SFANN) to learn the relationship between the fractions, the visible bands, and rotation-invariant spatial features for learning (RISFLs) that we extract from the RGB image. Our experiments show that the proposed DFuSIAL method obtains fraction maps with significantly enhanced spatial resolution and an average mean absolute error between 2% and 4% with respect to the reference ground truth. Furthermore, the proposed method is shown to be preferable to other examined state-of-the-art methods, especially when data are obtained from different instruments and in cases with missing-data pixels.


Introduction
Imaging spectrometers collect a large number of samples of the reflected light at different wavelengths along the electromagnetic spectrum [1]. Accordingly, each pixel of the spectral image holds a spectral signature that describes the chemical and physical characteristics of the surface [2]. This amount of valuable information can be used in critical image-based geoscience applications [3]. Unfortunately, the spatial resolution (SR) of remotely sensed spectral images is low, thus limiting the accuracy of the outcomes of spectral data applications [4-7], e.g., classification, unmixing, target and object detection, mineralogy, and change detection.
On the other hand, the full information provided by spectral images enables the extraction of quantitative subpixel information. One crucial and useful technique for this purpose is spectral unmixing [8,9]. Using this technique, the abundance fraction of each distinct material, i.e., a so-called endmember (EM), is estimated for each pixel in a given spectral image. The achieved subpixel information is essential for the quantitative analysis of pixel content. Traditional unmixing methods rely only on the spectral data of the image, whereas spatially adaptive methods also incorporate the spatial information of the image to enhance the accuracy of the estimated fractions. In both cases, however, there is no information regarding the spatial distribution of the EMs within the pixel area, and so the SR of the extracted fraction maps is also limited and is as low as the SR of the spectral image.
Moreover, no existing unmixing framework allows the incorporation of information from external sources, e.g., visible images. Fusion of two data sources to enhance the SR of spectral images is, however, a conventional practice in the field of remote sensing. Indeed, many techniques are available for enhancing the SR of the entire spectral image; one useful technique for this purpose is pan sharpening (PS) [10].
The PS process relies on data fusion of a low SR (LSR) spectral image and a corresponding high SR (HSR) panchromatic image. A variety of PS methods have been developed [10,11]; a general classification yields three main types [12]: component substitution-based methods, e.g., [13]; multiresolution analysis-based methods, e.g., [14,15]; and optimization-based approaches, e.g., [16]. Despite the availability of multiple PS methods, some significant limitations still exist, such as spectral distortion, spatial distortion, and the need for prior information. To overcome these drawbacks, advanced neural network (NN)-based algorithms have been introduced, mainly over the past two decades. NNs offer ways to fuse different data sources with a minimal need for prior assumptions [17]. Some of these algorithms rely on fully connected feedforward NNs, e.g., [18,19], but the majority were developed based on convolutional NNs (CNNs) [12,20-25].
The use of the CNN-based approach has proven to improve the PS process significantly; one pioneering method in this regard is Pansharpening by Convolutional Neural Networks (PNN), presented in [25]. PNN utilizes a relatively shallow network with the incorporation of extra information through maps of nonlinear radiometric indices. Since [25] was proposed, a variety of works have addressed the application of PS based on CNNs. For instance, in [26], the PNN is used as a baseline, and a further improvement of the results is achieved by adding a fine-tuning layer within the network. Furthermore, other methods have been developed by addressing specific limitations of earlier practices. For example, among the most recent works, the CNN architecture proposed in [27] relies on a hierarchical pyramid structure to derive more robust features for better retrieval of HSR multispectral images. Whereas the CNN proposed in [28] is intended for detail preservation through a cross-scale learning strategy, the strategy applied in [29] is based on a progressive cascade deep residual network to reduce the loss of high-frequency details. Finally, while most methods utilize shallow networks to prevent the vanishing-gradient effect [30,31], the approach presented in [32] allows for using very deep CNNs based on dense blocks and residual learning. The potential of deeper networks to characterize and map challenging relationships between the fused data is assumed to be higher than that of shallow NNs.
Although the use of NNs leads to enhanced PS results compared with conventional methods, all existing PS algorithms require both spatial and temporal overlapping between the spectral and panchromatic images, a condition that is, in most cases, extremely challenging to fulfill. Thus, PS methods are limited to data sets acquired by the same instrument. The work presented in [33] reports the sensitivity of PS methods to temporal and instrumental changes between spectral and panchromatic data sets. Moreover, since PS retrieves a full HSR spectral image, the applied process is usually complicated and time-consuming. In addition, CNNs are typically designed and trained for specific types of images and, thus, are not readily applicable to new data sources.
In this work, we present a new methodology for enhancing the SR of abundance fraction maps by fusion of spectral and visible images from different sensors. Traditional unmixing methods rely only on the spectral image, and so the fraction maps obtained provide valuable quantitative subpixel information but with the same (usually low) SR of the image itself. Addressing this limitation, we propose a new framework that combines the spectral unmixing approach with machine learning for modeling the relationship between the EM fractions and the RGB values. In this regard, the proposed workflow allows for using different learning mechanisms for multivariate regression. Here we use an NN for this purpose due to its minimal need for prior assumptions. Furthermore, NNs were found in many cases to outperform standard machine learning procedures in solving remote sensing tasks, as shown, for example, in [34]. Unlike traditional methods, we fuse data sources without full spatial, spectral, and temporal overlapping between the two data types. We use spectral and spatial information that we extract from the involved data sources without any further information or assumptions. For this purpose, we design a new spatial features-aided neural network (SFANN). In particular, NNs provide an efficient tool for fusing different types of remote sensing data. Due to the different conditions, different sensors, and a probable difference in acquisition dates, the images from the two data sources must be coregistered geometrically and calibrated radiometrically. In practice, even after coregistration, the pixels in the data sources do not perfectly match due to errors in the process itself and real changes in the land cover of the surface. Feeding the NN with data from mismatching pixels will lead to an error in the results. To prevent possible data mismatches, we propose a new strategy that does not require coregistration between the images.
Instead, we use what we call invariant points (IPs), i.e., points that appear in both images and represent features that are robust to the different acquisition conditions. We examined different methods of IP extraction and found that the scale-invariant feature transform (SIFT) method [35] provides the best results for the data we use in our experiments. First, we use these IPs to radiometrically calibrate the RGB values in preprocessing. Then, we feed the SFANN with the calibrated RGB values and spatial features (SFs) of the IPs in the visible image as input data, and with their corresponding fraction values as target data. Finally, we derive the fraction values of the EMs from the LSR fraction maps obtained by the unmixing of the spectral image. The proposed method is, to the best of our knowledge, the first to be developed using an SFANN for the enhancement of the SR of a specific spectral product, the fraction maps in our case. This paper contributes in three main aspects:
• We propose a computationally light strategy for the fusion of data from different sensors without the need for full spatial, spectral, and temporal overlapping.

• We introduce a new rotation-invariant spatial feature for learning (RISFL) that allows for the incorporation of spatial information within the machine learning process without the need for CNNs.

• Using IPs, as proposed in the paper, we suggest a new strategy for data fusion that allows for the use of sources with missing-data pixels, e.g., images provided by Landsat 7 with data gaps due to the scan line corrector failure [36].
The proposed framework provides a beneficial tool that we predict will be very relevant for new remote sensing tasks that rely on data from different instruments. This relevance keeps increasing due to the continual development of different sensing platforms, including satellites and airborne and unmanned aerial vehicles. To demonstrate and evaluate the performance of the new technique, we use spectral images provided by the Landsat 8, Sentinel-2A, Venus, GeoEye-1, Ikonos, and WorldView-2 satellites as well as a Google Earth RGB image of high SR.

Spectral Unmixing
Spectral unmixing allows for extracting subpixel information by estimating the abundance fractions for a set of EMs. Let $l$ and $d$ denote the number of bands in a spectral image and the number of EMs to be used for the unmixing, respectively. Following the linear mixture model, the spectral signature of each pixel, $\mathbf{m} \in \mathbb{R}^{l}$, is given by

$$\mathbf{m} = \mathbf{E}\mathbf{f} + \mathbf{n}, \quad (1)$$

where $\mathbf{E} \in \mathbb{R}^{l \times d}$ is the matrix of EMs, $\mathbf{f} \in \mathbb{R}^{d}$ is the fraction vector, and $\mathbf{n} \in \mathbb{R}^{l}$ represents a zero-mean Gaussian noise. Given the pixel's spectral signature $\mathbf{m}$ and the matrix of EMs $\mathbf{E}$, an estimation of the fraction vector $\mathbf{f}$ is achieved by minimizing the following fidelity term:

$$\hat{\mathbf{f}} = \arg\min_{\mathbf{f}} \|\mathbf{m} - \mathbf{E}\mathbf{f}\|_2^2 \;\; \text{s.t.} \;\; \mathbf{f} \geq \mathbf{0}, \;\; \mathbf{1}^{T}\mathbf{f} = 1, \quad (2)$$

where $\mathbf{1} \in \mathbb{R}^{d}$ is a vector of ones. The optimization problem in Equation (2) represents a basic form of supervised unmixing without any regularization. In practice, different objective functions can also be used [38,39], and sparse [40] and spatial [41] regularizations are promoted within a modified objective function to enhance result accuracy. In addition to the influence of the selected optimization process, the accuracy of the used EMs also affects the accuracy of the obtained solution. Although a set of EMs can be selected from a spectral library, EMs that are extracted from the image itself [42] are usually preferred since they share acquisition conditions with the image pixels. The fraction maps obtained from the unmixing process provide valuable quantitative information. However, due to the use of the spectral image only, the SR of these maps is usually low. Our purpose is to enhance the SR by fusing the spectral image with a visible image of HSR.
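As a minimal numerical illustration of Equation (2), the sketch below estimates a pixel's fraction vector by projected gradient descent, with the non-negativity and sum-to-one constraints enforced by Euclidean projection onto the canonical simplex. This is an illustrative stand-in, not the SUnSAL solver used later in the paper, and the endmember matrix and mixed pixel are synthetic:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the canonical simplex {f >= 0, sum(f) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def unmix_pixel(m, E, n_iter=500):
    """Minimize ||m - E f||^2 subject to f >= 0 and 1^T f = 1 (Equation (2))
    by projected gradient descent."""
    d = E.shape[1]
    lr = 1.0 / np.linalg.norm(E.T @ E, 2)   # step size from the Lipschitz constant
    f = np.full(d, 1.0 / d)                 # start at the simplex center
    for _ in range(n_iter):
        grad = E.T @ (E @ f - m)
        f = project_simplex(f - lr * grad)
    return f

# toy case: two endmembers mixed with true fractions 0.7 and 0.3
E = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.2]])
m = E @ np.array([0.7, 0.3])
f = unmix_pixel(m, E)
```

For this noise-free toy mixture the estimate recovers the true fractions; real data would add noise and call for the regularized solvers cited above.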

Invariant Points (IPs) for Supervised Learning and Fusion of Data from Different Sensors
Traditional methods for image data fusion in remote sensing usually require full (spatial and temporal) overlapping of the data sources. This constraint of using fully overlapping images prevents the use of data from different instruments, causing us to miss the opportunity to use the many available data combinations for data fusion tasks. In some cases, these data types are even available without charge, e.g., images from Landsat, Sentinel, and Google Earth. Figure 1 illustrates the limitation of using images from different instruments for traditional data fusion: a conceptual illustration of the mismatch between data sources due to acquisition from different instruments. An IP is robust to the probable mismatches and has the same land-cover type in both images.
Addressing this limitation, we suggest using IPs for a fusion process that is robust to probable mismatches between the data sources. In practice, an IP represents a point that can be detected in both images and has the same land cover type in both images. In other words, IPs are invariant to both geometrical and temporal changes in the data due to the acquisition conditions. We therefore use IPs for robust learning of the pattern that connects the HSR RGB values to the fraction values derived from the LSR multispectral (MS) image. In addition, the proposed IP-based learning is robust to probable missing-data pixels, i.e., pixels with no information in the image, for example, due to the presence of clouds or sensor malfunction [43,44].

Automatic Extraction of IPs
The use of IPs is essential for connecting the fused data sources, first for calibrating the RGB image to the spectral image and then for feeding the SFANN with reliable input and target data. To obtain robust IPs, we apply a two-step strategy that combines: (1) an automatic method for extracting matching points, and (2) the random sample consensus (RANSAC) algorithm [45] to detect incorrect pairs of key points due to a probable mismatch in the previous step (see [46] for more details about all the steps in the IP extraction part). Several automatic methods are suitable for IP extraction. Among the three methods examined in our experiments, SIFT, speeded-up robust features (SURF) [47], and binary robust invariant scalable keypoints (BRISK) [48], the results obtained with SIFT were advantageous with regard to accuracy and the number of extracted IPs. Quantitatively, the number of IPs detected by SURF or BRISK is between 40% and 80% of the number extracted by SIFT. Nevertheless, this conclusion stands only for the datasets we used in our experiments, and one should always test different methods for different datasets.
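The RANSAC step of this two-step strategy can be sketched in a few lines of numpy. The matched coordinates below are synthetic stand-ins for SIFT key-point pairs; the code only illustrates how an affine model fitted on random minimal samples rejects mismatched pairs:

```python
import numpy as np

def ransac_affine(src, dst, n_iter=200, tol=2.0, seed=0):
    """Filter matched key points: repeatedly fit a 2D affine transform to a
    minimal sample and keep the pairs consistent with the best model (the IPs)."""
    rng = np.random.default_rng(seed)
    n = len(src)
    src_h = np.hstack([src, np.ones((n, 1))])       # homogeneous coordinates
    best_inliers = np.zeros(n, dtype=bool)
    for _ in range(n_iter):
        idx = rng.choice(n, size=3, replace=False)  # minimal sample for an affine fit
        A, *_ = np.linalg.lstsq(src_h[idx], dst[idx], rcond=None)
        err = np.linalg.norm(src_h @ A - dst, axis=1)
        inliers = err < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

# toy data: 20 correct matches under a known shift, plus 3 gross mismatches
rng = np.random.default_rng(1)
src = rng.uniform(0, 100, size=(23, 2))
dst = src + np.array([5.0, -3.0])
dst[20:] += rng.uniform(30, 60, size=(3, 2))        # simulated wrong matches
mask = ransac_affine(src, dst)
```

In the toy run, the three shifted pairs are rejected while the 20 consistent pairs survive as candidate IPs.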

Rotation-Invariant Spatial Features for Learning for the Incorporation of Spatial Information
In natural scenes, important spatial information is inherent in the 2D distribution of the image pixels. Combining this information within the unmixing process significantly improves results [49]. An efficient way of combining spatial information within a machine-learning process is by using CNNs. These, however, are usually complex, computationally heavy, and require a large number of training samples, which are not always available. For example, in the proposed framework, the number of extracted IPs is probably insufficient for training a CNN. We therefore prefer to use a simpler learning mechanism, e.g., a feedforward NN. Then, to incorporate the spatial information of the images into the learning process, we map the 2D spatial information into a 1D vector. Unlike descriptors that are designed to match or categorize images, the desired spatial feature (SF) is intended for learning. In addition, we expect it to provide similar results for similar objects that may appear in the image in different orientations. The variance introduced in the data by probable rotations of objects from the same class, and the resulting advantage of using rotation-invariant features to minimize the complexity of the learning process, are addressed in [46]. Thus, the proposed SF must have the following two properties: 1. It must be robust to rotation. 2. It must map the spatial distribution of the original colors surrounding the pixel.
For this purpose, we define a new rotation-invariant spatial feature for learning (RISFL) that we extract by calculating the spatial distribution of colors in the local neighborhood surrounding each pixel in an RGB image. In other words, the RISFL maps the directional distribution of the colors surrounding the pixel. To set a reference direction that is robust to probable rotations of the objects in the image, we use the gradient of the color intensity around the pixel. Specifically, given three sets of values, $V_r$, $V_g$, and $V_b$, that respectively represent the R, G, and B values of the pixels $P \in N$ within a specified neighborhood $N$ surrounding a pixel, the RISFL is computed as follows:
1. Compute the azimuth of the color-intensity gradient at the pixel and set it as the reference direction.
2. Compute the azimuth of each pixel in $N$ relative to this reference direction.
3. Partition $N$ into $n$ directional regions $S_1, \ldots, S_n$ according to the relative azimuths; for $n = 8$, the first region, $S_1$, will include all pixels in $N$ with azimuth values equal to or greater than 0° and smaller than 45°, and the last directional region, $S_8$, will include all pixels with azimuths equal to or greater than 315° and smaller than 360° (see Figure 2).
4. For each region, calculate the mean value of the R, G, and B over all pixels that fall within the region.
Figure 2 illustrates the main concepts of computing the proposed RISFL. The number of elements in the obtained RISFL is constant for a given number of directional regions ($n$), regardless of the neighborhood size ($N$). Thus, the method allows for experimenting with different sizes of $N$, for the same value of $n$, without modifying any of the other elements of the learning framework.
Although the use of more directional regions is assumed to enable the extraction of more detailed RISFLs, the use of large values of $n$ for small values of $N$ is not practical. We should, therefore, consider the balance between these two parameters when designing the learning framework. Figures 3 and 4 further illustrate the steps for computing the proposed RISFL for a pixel in an area with an edge and in a homogeneous area, respectively, at different degrees of rotation. In both cases, the robustness of the RISFL to rotation is evident. Moreover, Figure 4 shows that the error in calculating the gradient azimuth in homogeneous areas may be relatively large. For example, the azimuth of the gradient at the central pixel under a 0° rotation is 7.8°. Thus, the azimuth under a 45° rotation is supposed to be 52.8°, whereas the calculated azimuth is 44.8°, i.e., an error of 8°. Since the area is homogeneous, however, the influence of this error on the obtained RISFL is minor, as observed from the presented features under the different rotations (tiles g-i). Although at this time we use only the mean value for each region, the RISFL allows the use of any other statistic or metric regarding the colors, e.g., the variance, the range, and the covariance between the different colors.
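A minimal sketch of the RISFL computation under the steps above: the reference direction is a central-difference intensity gradient, neighbors are binned into $n$ directional regions relative to it, and the mean R, G, and B per region are concatenated. Gradient estimation and neighborhood handling are simplified relative to the paper:

```python
import numpy as np

def risfl(rgb, row, col, radius=5, n_sectors=8):
    """Sketch of the RISFL: mean R, G, B per directional region around a pixel,
    with region boundaries measured relative to the local intensity-gradient
    azimuth so that the feature is robust to rotations of the scene."""
    intensity = rgb.mean(axis=2)
    # reference direction: central-difference intensity gradient at the pixel
    gy = intensity[row + 1, col] - intensity[row - 1, col]
    gx = intensity[row, col + 1] - intensity[row, col - 1]
    ref = np.arctan2(gy, gx)
    sums = np.zeros((n_sectors, 3))
    counts = np.zeros(n_sectors)
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            if dr == 0 and dc == 0:
                continue
            az = (np.arctan2(dr, dc) - ref) % (2 * np.pi)  # azimuth relative to the gradient
            s = int(az / (2 * np.pi / n_sectors)) % n_sectors
            sums[s] += rgb[row + dr, col + dc]
            counts[s] += 1
    return (sums / counts[:, None]).ravel()  # 3 * n_sectors elements

# homogeneous toy image: every directional region sees the same mean color
rgb = np.full((21, 21, 3), 0.5)
feat = risfl(rgb, 10, 10)
```

The feature length is fixed at $3n$ for any neighborhood radius, mirroring the property noted above that $N$ can be varied without changing the rest of the learning framework.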

Empirical Line Calibration
We use the empirical line calibration [50,51] to convert the unit-less DN values from the RGB image into reflectance units as follows:

$$\rho_l = a_l \, DN_l + b_l,$$

where $\rho_l$ and $DN_l$ are the reflectance and DN values in the spectral band $l$, respectively, and $a_l$ and $b_l$ are the calibration coefficients for band $l$. We use a robust fitting procedure to estimate the calibration coefficients using the extracted IPs between the images, as presented in [51]. Then, we apply the estimated coefficients to all pixels in the RGB image to obtain their reflectance values. The calibration step contributes to the overall fusion process mainly by reducing the complexity of the pattern that the system needs to learn. Additionally, the robust fitting in this step helps to eliminate further outlier IPs, which may exist due to mismatches between key points in the IP extraction stage.
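The robust empirical-line fit for one band can be sketched as follows. The iterative residual-based outlier rejection is a generic robust-fitting stand-in for the procedure of [51], and the DN/reflectance pairs are synthetic:

```python
import numpy as np

def empirical_line_fit(dn, reflectance, n_rounds=3, k=2.5):
    """Robust empirical-line fit per band: reflectance ~ a * DN + b.
    Iteratively discards IPs whose residual exceeds k standard deviations,
    which also removes remaining mismatched key-point pairs."""
    keep = np.ones(len(dn), dtype=bool)
    for _ in range(n_rounds):
        a, b = np.polyfit(dn[keep], reflectance[keep], deg=1)
        resid = reflectance - (a * dn + b)
        sigma = resid[keep].std()
        if sigma == 0:
            break
        keep = np.abs(resid) < k * sigma
    return a, b, keep

# toy IPs: true line rho = 0.002 * DN + 0.01, with two mismatched pairs
rng = np.random.default_rng(0)
dn = rng.uniform(20, 200, 100)
rho = 0.002 * dn + 0.01 + rng.normal(0, 1e-4, 100)
rho[:2] += 0.3                    # simulated outlier IPs
a, b, keep = empirical_line_fit(dn, rho)
```

The two contaminated pairs are flagged and excluded, and the recovered gain and offset match the true line closely.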

SFANN for Fraction Estimation
NNs are very useful for mapping relationships between input and target observations due to their high capability to learn intricate and hidden patterns within data. Relying on the insights from previous research in this regard, we decided to use an SFANN, based on a fully connected feedforward NN (FCFFNN), to estimate the fraction values for each pixel in the HSR RGB image. The designed SFANN is intended to map the relationship between input data from the HSR image and the corresponding data from the spectral image. To fulfill the need for feeding the SFANN with a sufficient number of training samples, we use IPs to derive the desired data from the images to be fused, with no need for any further data from additional sources.
The advantages of using FCFFNNs for the addressed fusion task are mainly the simple implementation and intuitive understanding of the network components. In practice, the three main components to be considered when designing an FCFFNN are the number of hidden layers (the depth of the network), the number of neurons in each of these layers, and the transfer function.
In comparison, CNNs usually require considering a much higher number of components and parameters when designing a new network. On the other hand, the use of spatial information in CNNs is inherent through convolution operations. In this regard, our challenge is to take advantage of the relatively simple design of FCFFNNs while also integrating the spatial information within the learning process. In practice, we incorporate additional information into the FCFFNN by concatenating the corresponding data to the 1D array of each (current) data sample (see Figure 5). The main disadvantage of this approach is, however, that each new data element increases the number of connections (weights) that we must adjust during the training process. We therefore propose a RISFL with only a moderate number of elements that sufficiently describe the spatial information in the pixel's neighborhood.

Considering the architecture of an FCFFNN, the two main elements that need to be examined are the number of hidden layers and the number of neurons in each layer. On the one hand, a high number of hidden layers increases the ability of the network to learn a more complex function. On the other hand, it may lead to the vanishing gradient problem [30,31], whereby the neurons of the earlier layers learn very slowly, making the training process impractical. Similarly, more neurons in the hidden layers allow for the fitting of more complex functions but require a massive amount of training data, potentially leading to undesired overfitting. Thus, these two aspects, among others, should be considered when designing a new network. See [52] for a detailed discussion of the FCFFNN.
We design an SFANN that fuses spectral and visible data. The SFANN notably allows for learning the connecting pattern between the pixel's R, G, and B values and its RISFL and the corresponding fraction values from the LSR spectral image. Figure 5a presents a graphical description of the proposed SFANN.

Each input sample includes the RGB values and the RISFL of the pixel, while each target sample holds the corresponding fraction values. A nonlinear transfer function is applied in the hidden layers, while the pure linear (purelin) function is used for the output layer. To train the SFANN, we use the Levenberg-Marquardt backpropagation algorithm [53] to minimize the mean square error. For this purpose, we feed the network with a set of IPs that is divided into three subsets: for training, validation, and testing, with respective ratios of 70%, 15%, and 15%. Although we use RGB values and RISFLs here, we could easily incorporate different spatial features within the proposed scheme, e.g., local binary pattern features [54] and histograms of oriented gradients [55], as well as additional input data from other sources.
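The data split and training loop can be illustrated as follows. This sketch uses plain gradient descent on synthetic input-target pairs rather than the Levenberg-Marquardt algorithm the paper employs, and the 27-element inputs stand in for concatenated RGB values and RISFLs:

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic IP set: 27 inputs per sample stand in for RGB (3) + RISFL (24);
# targets are fractions of d = 6 endmembers that sum to one
X = rng.uniform(0.0, 1.0, (300, 27))
Y = X @ rng.uniform(0.0, 1.0, (27, 6))
Y /= Y.sum(axis=1, keepdims=True)

# 70% / 15% / 15% split into training, validation, and test subsets
idx = rng.permutation(len(X))
tr, va, te = idx[:210], idx[210:255], idx[255:]

# one tanh hidden layer and a purelin-like linear output layer; trained here
# with plain gradient descent on the MSE instead of Levenberg-Marquardt
W1 = rng.normal(0.0, 0.1, (27, 10)); b1 = np.zeros(10)
W2 = rng.normal(0.0, 0.1, (10, 6)); b2 = np.zeros(6)

def forward(X):
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2

_, P = forward(X[va]); loss_before = ((P - Y[va]) ** 2).mean()
lr = 0.05
for _ in range(500):
    H, P = forward(X[tr])
    G = 2.0 * (P - Y[tr]) / len(tr)           # dMSE/dP
    GH = (G @ W2.T) * (1.0 - H ** 2)          # backprop through tanh
    W2 -= lr * H.T @ G;      b2 -= lr * G.sum(axis=0)
    W1 -= lr * X[tr].T @ GH; b1 -= lr * GH.sum(axis=0)
_, P = forward(X[va]); loss_after = ((P - Y[va]) ** 2).mean()
```

The validation loss drops over training; in the actual framework the validation subset also serves for early stopping, and the held-out test subset for reporting.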

Methodological Framework
Given a spectral image with low SR and a visible image with HSR, we use these two types of data to extract fraction maps with enhanced SR. In practice, we use the described SFANN to fuse the two data sources by modeling the relationship between the RGB values, the RISFLs, and the fractions of each EM, as follows:
1. Extract IPs between the spectral and visible RGB images.
2. Calibrate the RGB values using robust empirical line (EL) calibration.
3. Extract RISFLs.
4. Automatically extract EMs from the spectral image.
5. Estimate the fraction map for each EM using the unmixing process.
6. Using the IPs, train the SFANN to model the relationship between the calibrated RGB values, the RISFLs, and the fractions of each EM.
7. Apply the trained SFANN to all pixels in the visible image and their RISFLs to create fraction maps with HSR.
8. Apply the abundance non-negativity and sum-to-one constraints to provide fully constrained fractions. For this part, we project the output fractions onto the canonical simplex [56], as presented in [38] and [57].
Here we use the vertex component analysis (VCA) [58] method to extract EMs and the sparse unmixing by variable splitting and augmented Lagrangian (SUnSAL) [59] method to estimate the fractions, though any other combination of methods may be efficiently implemented within the proposed framework as well. In addition to its significantly fast performance, VCA can estimate EMs even in the absence of pure pixels in the scene and is therefore preferable for cases of LSR, in which the presence of pure pixels is not highly probable. SUnSAL is computationally light and has the advantage of yielding a sparse solution. Figure 5b presents the framework of the proposed methodology.

Data and Experimental Evaluation
To evaluate the proposed methodology, we use seven datasets:
1. A spectral image with seven bands and an SR of 30 m (Figure 6a), provided by Landsat 8's Operational Land Imager (OLI), and a corresponding spectral image of the same area with an SR of 10 m provided by the Sentinel-2A satellite (Figure 6b).
2. Venus dataset-a: a spectral image with 12 bands and an SR of 10 m, along with its corresponding uncalibrated spectral image with an SR of 5 m, both provided by the Venus satellite (Figure 6c,d).
3. Venus dataset-b: similar to dataset #2 but for a different area that contains multiple homogeneous regions (Figure 6e,f).
4. A spectral image provided by the GeoEye-1 satellite with four bands and an SR of 1.84 m and a corresponding panchromatic image with an SR of 0.46 m.
5. A spectral image provided by the Ikonos satellite with four bands and an SR of 3.28 m and a corresponding panchromatic image with an SR of 0.82 m.
6. A spectral image provided by the WorldView-2 satellite with eight bands and an SR of 1.84 m and a corresponding panchromatic image with an SR of 0.46 m.
7. An RGB image with an SR of 1 m, available for free on Google Earth, and a corresponding spectral image of the same area with an SR of 10 m provided by the Sentinel-2A satellite.
We divide the seven datasets into three groups according to the availability and type of reference ground truth used for the evaluation, as follows:
• Group-1 includes datasets 1, 2, and 3. In each of these three cases, a real full-reference HSR spectral image is available and used for the evaluation of the obtained HSR fraction maps.
• Group-2 includes datasets 4, 5, and 6. Since a real full reference is not available for these images, we adopted the strategy presented in [25] and [26], following Wald's protocol [60], for generating a simulated full reference.
We reduced the SR of the original data by resampling the spectral and panchromatic images in these datasets by a factor of 0.25 to create new spectral images for GeoEye-1, Ikonos, and WorldView-2 with SRs of 7.36 m, 13.12 m, and 7.36 m, respectively. The original MS image is then used as the reference ground truth.
• Group-3 includes dataset 7. The RGB image, taken from Google Earth, was acquired on 08/08/2014 and is the most recent available HSR image of the selected area. A corresponding spectral image was derived from the Sentinel-2A image with an SR of 10 m from dataset 1. No reference ground truth is available in this group.
Figure 6 presents an RGB composite of the images from datasets 1, 2, and 3, and Figure 7 shows an RGB composite of the images from datasets 4, 5, and 6, whereas the images from dataset 7 are presented within the results of Experiment 3.

Comparing with the State-of-the-Art
An alternative to DFuSIAL for enhancing the SR of the unmixing fraction maps, through data fusion of visible HSR and spectral LSR images, is based on the PS technique. Using PS, we first retrieve an HSR spectral image and then apply the unmixing process to obtain HSR fraction maps. To analyze the performance of DFuSIAL versus the state-of-the-art, we use two recent PS methods: (1) the conventional filter-based approach presented in [61], which is intended for preserving spectral quality (PSQ); for the sake of convenience, we use the term PSQ-PS to refer to this method in the rest of the paper; (2) the CNN-based approach presented in [26], which utilizes a target-adaptive fine-tuning step to achieve a further improvement of the results in comparison with its accurate baseline method, the pansharpening neural network (PNN) [25]. We use the term CNN-PNN+ to refer to this method in the rest of the paper. Since CNN-PNN+ is designed for particular types of data, it is used for comparison only in experiments that involve datasets from Group-2.
Four different experiments are performed to evaluate the performance of the proposed method and to compare it with the state-of-the-art PSQ-PS and CNN-PNN+. In each experiment, we use three evaluation metrics: the mean absolute error (MAE), the root mean square error (RMSE), and the maximal absolute error (MaxAE). Let q be the number of points to be used for the evaluation; for each EM i, the MAE_i, its standard deviation, the RMSE_i, and the MaxAE_i are computed over these q points.

Experiment 1: Testing with real full-reference ground truth
In this experiment, we use the datasets from Group-1, i.e., sets 1, 2, and 3, to evaluate the performance of the tested methods in cases with images from different sensors, images from the same sensor, and images with missing-data pixels. We perform four evaluation tests as follows:

• Test 1 - data from different sensors, Landsat and Sentinel
In this test, we use the Landsat image with an LSR of 30 m and the visible bands of the Sentinel image to create a corresponding RGB image with an SR of 10 m. The SIFT method detected 2600 IPs. Figure 8 presents an RGB composite of the images used in this experiment with a plot of the extracted IPs.
• Test 2 - data from the same sensor, Venus dataset-a
We use the Venus image from the second dataset with an SR of 10 m as the LSR spectral image and the visible bands of the corresponding image provided by the same satellite as the HSR RGB image with an SR of 5 m (Figure 6). The SIFT method detected 3560 IPs.
• Test 3 - data from the same sensor, Venus dataset-b
As in Test 2, we apply the same procedure to the third dataset, which contains multiple homogeneous regions (Figure 6). The SIFT method detected 1790 IPs.
• Test 4 - data from different sensors with simulated missing-data pixels
Here we use the same dataset as in Test 1. To simulate a case with missing data, we create vertical missing-data stripes across the LSR image (see Figure 9). We set the width of each stripe to five pixels and the distance between the stripes to 50 pixels. The SIFT method detected 500 IPs. Figure 9 presents an RGB composite of the images used in this experiment with a plot of the extracted IPs. It shows that IPs are detected only beyond a certain distance from missing-data pixels. This property ensures that the SFANN is fed only with reliable data, which is essential for unbiased learning. We use the available HSR spectral image to create an HSR reference fraction map to serve as the ground truth in each experiment. In Tests 1 and 4, we first spectrally convolve the bands of the Sentinel image to the spectral bands of the Landsat 8 image. We use the results obtained by applying SUnSAL to the HSR image as the reference for the first type of evaluation, whereas, for the second type, we use the LSR fraction maps as the reference. In each of the four tests, we apply the proposed DFuSIAL to enhance the SR of the fraction maps obtained for the LSR image and generate fraction maps with HSR. The PSQ-PS method is applied for comparison. First, we apply the PS process to retrieve the HSR spectral image. For this purpose, we use the LSR spectral image with a panchromatic image that we derive from the original HSR spectral image. Then, we obtain HSR fraction maps by applying SUnSAL to the retrieved HSR spectral image.
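The missing-data stripes of Test 4 (width of five pixels, spacing of 50 pixels, as stated above) can be generated with a short sketch; the fill value and toy image content here are arbitrary choices for illustration:

```python
import numpy as np

def add_missing_stripes(img, stripe_width=5, spacing=50, fill=np.nan):
    """Create vertical missing-data stripes: width of five pixels, 50 pixels
    apart, mirroring the simulation settings of Test 4."""
    out = img.astype(float).copy()
    for c0 in range(0, img.shape[1], spacing):
        out[:, c0:c0 + stripe_width] = fill
    return out

img = np.ones((100, 120))        # arbitrary toy image
striped = add_missing_stripes(img)
```

In a 120-column image, stripes start at columns 0, 50, and 100, so 15 of 120 columns become missing-data pixels.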
To test the stability of the solution under different SFFAN architectures, we experiment with different numbers of hidden layers and neurons. In practice, we examine the performance of the DFuSIAL method using 15 different combinations of these two parameters. We apply different SFFANs with a varying number of hidden layers (3 to 5) and neurons per layer; three hidden layers with ten neurons each (i.e., c = 10) were found to be optimal: they performed faster and minimized the MAE for the datasets used in this work. Six EMs, automatically extracted from the spectral image using the VCA method, were used in each of the experimental Tests 1-4.

Experiment 2: Testing with simulated full reference ground truth and comparing with a CNN-based PS method
We use the datasets from Group-1, i.e., sets 4, 5, and 6, to perform three experimental tests, Tests 5, 6, and 7. In all three tests, the original spectral image is used to generate the reference ground truth. Since the GeoEye-1 and Ikonos images have only four bands, the number of EMs automatically extracted using the VCA method in Tests 5 and 6 is three, whereas in Test 7 we use six EMs. Both the PSQ-PS and CNN-PNN+ methods are applied for comparison in this experiment. Moreover, the number of detected IPs in these three tests was relatively low due to the small size of the images. Thus, we set the number of layers and neurons in the SFFAN to three and ten, respectively.

Experiment 3: Hierarchical resolution enhancement
• Test 8 - We use the images from the seventh dataset: the Sentinel spectral image, with an SR of 10 m; an RGB image from Google Earth with an SR of 1 m; and an RGB image with an SR of 3 m that was created by resampling the 1 m RGB image. Six EMs extracted from the Sentinel image by VCA are used, and their fractions are estimated using SUnSAL. Then, we use a hierarchical DFuSIAL process to create fraction maps with SRs of 3 m and 1 m, as follows:
Step 1: Training the SFFAN to estimate fractions for the RGB image, with an SR of 3 m, using the DFuSIAL algorithm.
Step 2: To further enhance the resolution of the fraction maps, we first radiometrically calibrate the Google Earth RGB image of the same area but with an SR of 1 m. Then, we apply the trained SFFAN to the calibrated image to obtain fraction maps with an SR of 1 m.
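The two hierarchical steps can be illustrated with a simplified stand-in for the SFFAN. Here an off-the-shelf MLP regressor replaces the actual network, and the RISFL extraction and radiometric calibration are reduced to synthetic placeholders; the key point is that the network trained at 3 m is reused unchanged on the calibrated 1 m image.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def flatten_features(rgb, spatial):
    """Stack RGB bands and spatial features into per-pixel feature rows."""
    feats = np.concatenate([rgb, spatial], axis=-1)
    return feats.reshape(-1, feats.shape[-1])

rng = np.random.default_rng(0)

# Step 1 (sketch): train on IP samples from the 3 m RGB image.
# 3 RGB bands + 3 spatial features per pixel; 6 endmember fractions as output.
X_train = rng.random((500, 6))
y_train = rng.random((500, 6))
net = MLPRegressor(hidden_layer_sizes=(10, 10, 10), max_iter=500,
                   random_state=0)
net.fit(X_train, y_train)

# Step 2 (sketch): apply the same trained network to the radiometrically
# calibrated 1 m RGB image to obtain 1 m fraction maps.
rgb_1m = rng.random((64, 64, 3))
spatial_1m = rng.random((64, 64, 3))
fractions_1m = net.predict(flatten_features(rgb_1m, spatial_1m))
fractions_1m = fractions_1m.reshape(64, 64, 6)
print(fractions_1m.shape)  # (64, 64, 6)
```

Because the mapping is learned per pixel, the same trained regressor applies to any image of the scene for which the same features can be computed, regardless of its resolution.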
For comparison, we use the PS fusion method to create corresponding fraction maps as follows:
Step 1: Applying the PSQ-PS method to create a new multispectral image with an SR of 3 m by fusing the original Sentinel 10 m image and a panchromatic image created from the RGB 3 m image.
Step 2: Creating a new multispectral image with an SR of 1 m by using the PSQ-PS method to fuse the multispectral 3 m image created in Step 1 and a panchromatic image created from the RGB 1 m image.
Step 3: Applying SUnSAL to the two multispectral images created in Steps 1 and 2 to generate fraction maps with SRs of 3 m and 1 m, respectively.
An RGB composite of the images and the corresponding fraction maps of three selected EMs are presented in Section 3.

Experiment 4: Testing the influence of rotation and displacement mismatch between data sources.
In this experiment, we test the robustness of the examined methods to probable rotation and displacement mismatches between the spectral and RGB images. For this purpose, we use the images from the seventh dataset, i.e., WorldView-2. A description of the experiment and an interpretation of the results are given in Section 3.
To minimize the influence of probable differences in radiometric values, both the HSR and LSR spectral images need to be in reflectance units. For this purpose, a robust radiometric empirical line calibration is applied in each experimental test to the HSR image with respect to the LSR spectral image, using the valid IPs. Table 1 summarizes information about the data sources used in each experimental test. In all seven tests, we feed the network with 80% of the detected IPs and use the other 20% for the Type-2 evaluation. Figure 10 shows an overview of the comparative testing used for the evaluation in Experiments 1 and 2.
Note (Table 1): s = neighborhood size; n = number of directional regions; * number of rows × columns × spectral bands in the image; ** the IPs are between the MS image and the resampled RGB image with SR = 3 m.
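A minimal sketch of the per-band empirical line calibration and the 80/20 IP split might look as follows. The linear gain/offset fit is the standard form of empirical line calibration; the IP values here are synthetic placeholders, not data from the paper.

```python
import numpy as np

def empirical_line_calibration(hsr_band, ip_hsr, ip_lsr):
    """Fit a gain/offset so that HSR values match LSR reflectance at the
    invariant points (IPs), then apply the fit to the whole band."""
    gain, offset = np.polyfit(ip_hsr, ip_lsr, deg=1)
    return gain * hsr_band + offset

rng = np.random.default_rng(1)
ip_hsr = rng.random(200)              # HSR values sampled at the IPs
ip_lsr = 0.8 * ip_hsr + 0.05          # LSR reflectance at the same IPs
band = rng.random((50, 50))
calibrated = empirical_line_calibration(band, ip_hsr, ip_lsr)

# 80/20 split of the IPs: training vs. Type-2 evaluation
idx = rng.permutation(len(ip_hsr))
n_train = int(0.8 * len(idx))
train_idx, eval_idx = idx[:n_train], idx[n_train:]
print(len(train_idx), len(eval_idx))  # 160 40
```

Repeating the fit per spectral band keeps the calibration simple and robust, since each band gets its own gain and offset.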

Quantitative Evaluation
To analyze the performance of the proposed method, DFuSIAL, and the other tested state-of-the-art methods, we observe both the quantitative and visual outcomes of the applied experiments. Table 2 and Table 3 present the MAE, STD, RMSE, and overall MaxAE values of the Type-1 and Type-2 evaluations, respectively, for the fractions obtained in Experiment 1. The results show that both PSQ-PS and DFuSIAL provide reliable results, with an average MAE between 2% and 4%. Indeed, the performance of the two methods is highly correlated, with the exception of Test 4. The results obtained by DFuSIAL in Test 1 are more accurate than those obtained by PS. This advantage is mainly due to the absence of full spatial and temporal overlap between the LSR and HSR data sources. The IP-based aided learning in DFuSIAL minimizes the influence of this lack of overlap.
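The four error statistics reported in the tables can be computed per fraction map as follows. This is a sketch; the exact aggregation over endmembers and tests used in the paper may differ.

```python
import numpy as np

def fraction_error_metrics(estimated, reference):
    """Absolute-error statistics between estimated and reference fraction
    maps (arrays of shape rows x cols x endmembers)."""
    err = np.abs(estimated - reference)
    return {
        "MAE": err.mean(),                                   # mean abs. error
        "STD": err.std(),                                    # its spread
        "RMSE": np.sqrt(((estimated - reference) ** 2).mean()),
        "MaxAE": err.max(),                                  # worst pixel
    }

# Toy example: a uniform 3% error over a 4x4 map with 2 endmembers
ref = np.full((4, 4, 2), 0.5)
est = ref + 0.03
m = fraction_error_metrics(est, ref)
print(round(m["MAE"], 3), round(m["MaxAE"], 3))  # 0.03 0.03
```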
On the other hand, the results obtained by the PSQ-PS method in Test 3 are even more accurate. As mentioned before, the images in this experiment mainly include homogeneous regions. Since the detection of IPs within these regions is challenging, the number of IPs detected is almost half the number detected in Test 1 for a dataset of the same size and from the same sensor (see Table 1 for information on Tests 2 and 3). The relative lack of IPs within homogeneous regions reduces the ability of the SFFAN to learn the relationship between the fractions and the RGB data in these areas.

Unlike the results of Tests 1-3, the results of Test 4 clearly show the advantage of DFuSIAL over PSQ-PS with respect to the accuracy of the fractions. The presence of missing-data pixels in the image has a tremendous negative influence on the PSQ-PS process, whereas IP-based learning prevents this influence. In practice, DFuSIAL is only slightly affected by missing-data pixels, due to a lower number of detectable IPs (see Figure 9) and, accordingly, fewer training data samples. With the exception of Test 3, the STD and MaxAE values are lower for fractions obtained by DFuSIAL. The increase in MaxAE for the PSQ-PS method in Test 4 is noticeable, whereas the corresponding value for DFuSIAL is similar to that obtained in Test 1. This is due mainly to the robustness of the proposed method to missing-data pixels. In addition to the previous insights, the results show that the MAE, STD, RMSE, and MaxAE values obtained using evaluation Types 1 and 2 are correlated, so the Type-2 evaluation may be used as an efficient indicator of process accuracy. This is beneficial in real cases, where reference ground-truth HSR fraction maps are not available for validation purposes. In the same context, Table 4 presents the MAE, STD, RMSE, and overall MaxAE values of the Type-1 and Type-2 evaluations, respectively, for the fractions obtained in Experiment 2, i.e., Tests 5-7. As mentioned before, in this experiment, both the PSQ-PS and CNN-PNN+ methods are used for comparison. The results in Table 4 clearly show the advantage of the proposed DFuSIAL over the two other methods in Tests 6 and 7, with respect to all the metrics used for the evaluation. On the other hand, the results obtained by CNN-PNN+ in Test 5 are the most accurate among the three tested methods. Moreover, PSQ-PS slightly outperforms DFuSIAL in this test.
In accordance with the results from Test 3 in Experiment 1, the results in Test 5 point to a slight increase in the MAE of the results obtained by DFuSIAL for images with a high presence of homogeneous regions.
Table 4. Type-1 evaluation of the fraction values obtained in Experiment 2 (Tests 5-7) by PSQ-PS, CNN-PNN+, and the proposed fusion method DFuSIAL, with respect to the fractions obtained in HSR by SUnSAL.

In Experiment 3 (Test 8), since HSR reference fraction maps are not available, we apply only the second type of evaluation (see Table 5). The enhanced-resolution fractions obtained by DFuSIAL have an average MAE of ~4.5%. The error for all evaluation metrics is slightly increased compared with the previous experiments, mainly due to changes in the land cover types between the two data sources, which were acquired more than three years apart. Nonetheless, objects with invariant land cover, e.g., roofs, are assumed to preserve their fraction values, as shown later in the visual presentation of the results. On the other hand, the PSQ-PS method is highly influenced not only by temporal changes but also by the lack of overlap between the images. Thus, the fractions obtained by the PSQ-PS method have an average MAE of ~8% and a significantly increased error for all of the evaluation metrics used.

Visual Evaluation
To facilitate a visual evaluation, we present the fraction maps and MAE maps obtained in each experiment. Specifically, we present a combination of the fraction maps of three of the EMs used in the experiment. For each scenario, we select three dominant EMs, denoted EM1, EM2, and EM3. We then generate a false color composite by using the fraction maps of the three EMs as the R, G, and B layers, respectively, and use a color pyramid to interpret the obtained false color maps. Vertices (1,0,0), (0,1,0), and (0,0,1) are red, green, and blue, respectively, and represent pure pixels of EM1, EM2, and EM3, respectively. The points on the triangular face created by these three vertices represent pixels with different mixtures of the three selected EMs. Vertex (0,0,0) is black and represents a pixel with a mixture of the other EMs and a fraction value of zero for each of the three selected EMs. Other points, located within the pyramid volume and on the triangular faces connected to vertex (0,0,0), represent different mixtures of different combinations of the six EMs. Figure 11 presents the planar unfolding of the color pyramid. Figures 12−15 present a zoom-in of selected areas of the fraction maps obtained in Experiment 1, i.e., Tests 1-4, respectively, by applying SUnSAL to the original LSR and HSR spectral images and by using the PSQ-PS method and the proposed DFuSIAL. Similarly, Figures 16−18 present the maps obtained by PSQ-PS, CNN-PNN+, and DFuSIAL in Experiment 2, i.e., Tests 5-7. In addition, in Experiment 1, we present the entire error map for each of the tested methods, while in Experiment 2, due to size limitations, we present only the part of the MAE maps that corresponds to the area selected for presentation in each test. We derive the error map by calculating the MAE of the fraction value obtained for each pixel with respect to the GT fractions obtained by applying SUnSAL to the available HSR spectral image.
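The false color composite and the per-pixel MAE map described above can be generated as follows (a minimal sketch with toy 2×2 fraction maps):

```python
import numpy as np

def false_color_composite(f1, f2, f3):
    """Map three endmember fraction maps to R, G, B layers; pure pixels of
    EM1/EM2/EM3 appear red/green/blue, mixtures as intermediate colors,
    and pixels dominated by other EMs appear dark."""
    return np.clip(np.stack([f1, f2, f3], axis=-1), 0.0, 1.0)

def mae_map(estimated, reference):
    """Per-pixel MAE over the endmember axis, for the error maps shown
    alongside the fraction composites."""
    return np.abs(estimated - reference).mean(axis=-1)

# Toy 2x2 maps: top-left is pure EM1, bottom-left pure EM3, others mixed
f1 = np.array([[1.0, 0.0], [0.0, 0.2]])
f2 = np.array([[0.0, 1.0], [0.0, 0.3]])
f3 = np.array([[0.0, 0.0], [1.0, 0.1]])
rgb = false_color_composite(f1, f2, f3)
print(rgb[0, 0])  # pure EM1 pixel -> red, [1. 0. 0.]
```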
An independent color bar is attached to the MAE maps in each figure to facilitate interpretation of the MAE values. The visual presentation of the results in Experiment 1 clearly shows the enhancement of fraction map resolution achieved by both methods. Figure 12 shows the advantage of DFuSIAL over the PSQ-PS method, which is in line with our quantitative results. This advantage is especially noticeable in areas in which the fractions undergo rapid spatial changes, e.g., the area containing roofs. Although both methods introduce specific spatial regularization into the estimated fractions, the results obtained by PSQ-PS are oversmoothed, and small spatial features are filtered out of the obtained fraction map. This spatial regularization is controlled in DFuSIAL mainly by changing the parameters used for RISFL extraction. The results presented for Tests 2 and 3 (see Figures 13 and 14, respectively) show a high correlation between the results obtained by the two methods. It is essential to mention that this correlation holds for both the spatial distribution and the intensity of the fractions, which means that the obtained fraction values are highly similar. The MAE maps in Figure 14 emphasize the advantage of the PSQ-PS method over DFuSIAL within homogeneous regions, although DFuSIAL performs better along the boundaries between these regions. The results presented in Figure 15 are essential and show the significant advantage of DFuSIAL in cases that contain missing-data pixels. While the presence of such pixels has a clear negative influence on the results obtained by the PSQ-PS method, the results obtained by DFuSIAL are not affected. As can be observed in Figure 15b, the PSQ-PS method retrieves only a minor amount of information from the missing-data stripes, yet the footprint of these stripes appears clearly in both the fraction and MAE maps.
On the other hand, the results obtained by DFuSIAL are similar to those obtained in Test 1, without missing-data pixels, indicating the high robustness of the proposed fusion method to this kind of influence. Similar to the insights from Experiment 1, the visual presentation of the results in Experiment 2 emphasizes the enhancement of the fraction maps achieved by the three examined methods. Moreover, in accordance with the quantitative results, the visual interpretation shows that, except for the case with a high presence of homogeneous areas, DFuSIAL has a significant advantage over the other two examined methods. The maps in Figures 16−18 clearly show that while the results obtained by CNN-PNN+ and DFuSIAL preserve the sharpness of the fraction maps to a high extent, the maps obtained by PSQ-PS are significantly oversmoothed. This probably occurs due to the use of a filter-based approach. Moreover, while the maps obtained by CNN-PNN+ look more similar to the reference fractions with regard to their spatial distribution, the fractions obtained by DFuSIAL are more accurate with regard to their real values. For example, the results in Figure 18 clearly show an overestimation of the vegetation EM within the fractions obtained by CNN-PNN+. Regarding Experiment 3 (Test 8), the results clearly show the ability of the DFuSIAL method to significantly enhance the SR of the fraction maps also through a hierarchical process. The obtained results, at both 3 m and 1 m resolutions, preserve the sparsity of the fractions. The trained SFFAN is very useful for further enhancing the resolution. In this case, the resampled image, with a 3 m resolution, is used as a mediator in the process to enhance the resolution from 10 m to 1 m. In practice, once an SFFAN is created and trained according to the DFuSIAL algorithm, the trained network can be applied to different images from the same area but with different resolutions.
On the other hand, due to the lack of full overlap between the multispectral and the RGB images, the geometrical distortion is noticeable in the results obtained by the PSQ-PS method. In this regard, the results presented in Figure 19 emphasize the robustness of DFuSIAL to temporal and geometrical partial non-overlap between the fused data sources. It is important to note that direct enhancement, from a 10 m to a 1 m resolution, yielded results quite similar to those obtained by the described hierarchical process. The hierarchical process is, however, more efficient in terms of computation time and complexity. In summary, both the quantitative and visual results in Experiments 1-3 show that DFuSIAL is preferable not only for cases with data from different sensors but also for data from the same sensor and when the SR of the RGB image is significantly larger than that of the spectral image used for the fusion. While other fusion methods strive to retrieve the entire HSR MS image, DFuSIAL focuses only on enhancing the SR of one product of the spectral data, i.e., the EM fractions. The pattern that DFuSIAL needs to learn is, therefore, more straightforward and relatively less complicated. Thus, it outperforms the other examined methods with regard to the accuracy of the spatially enhanced fraction maps. On the other hand, the performance of DFuSIAL is slightly affected by the presence of homogeneous areas within the fused image, since fewer IPs are detected within these areas.

Evaluation of the Performance Under Rotation and Displacement Mismatch Between Data Sources (Experiment 4).
The results of Tests 1, 4, and 8 demonstrate the advantage of using DFuSIAL to fuse data from different sources. This advantage is due mainly to the robustness of DFuSIAL to mismatches between the images. Although the mismatch between the Landsat and Sentinel images is relatively minor, it nevertheless affects the PSQ-PS method negatively. In many cases, especially when using modern remote sensing platforms, e.g., unmanned aerial vehicles, this mismatch can reach considerable proportions. Thus, to test the robustness of the PSQ-PS, CNN-PNN+, and DFuSIAL methods under significant mismatch conditions, we repeat Test 7 while creating a gradual mismatch between the data sources, i.e., the original WorldView-2 image and its corresponding resampled image with reduced SR. First, we create pairs of data with different degrees of rotational and displacement mismatch by rotating the LSR image by angles of 0°, 1°, 2°, 3°, 4°, and 5° and shifting the HSR image by 0, 1, 2, 3, 4, and 5 pixels. In each of the 36 combinations, we fuse the images using the three methods and observe the MAE value and its standard deviation, as explained previously (see Figure 20). Figure 20 clearly shows the superiority of the DFuSIAL method over the PSQ-PS and CNN-PNN+ methods with regard to the accuracy of the retrieved HSR fraction maps. DFuSIAL is robust to the two examined types of data-source mismatch. In contrast, the MAE and STD values for the PSQ-PS and CNN-PNN+ methods increase rapidly and monotonically with the increase in mismatch. The use of IPs for learning and the incorporation of spatial information through invariant features, i.e., RISFLs, make the proposed DFuSIAL method highly robust to mismatches between the data sources. Thus, it is more suitable for the fusion of data from different instruments.
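Generating the 36 mismatch combinations can be sketched as follows, using a generic image-rotation routine; the fusion and error-measurement steps themselves are omitted here.

```python
import numpy as np
from scipy.ndimage import rotate

def make_mismatched_pairs(lsr, hsr, angles=range(6), shifts=range(6)):
    """Yield the 6x6 = 36 rotation/displacement combinations used in
    Experiment 4: the LSR image rotated by 0-5 degrees and the HSR image
    shifted horizontally by 0-5 pixels."""
    for a in angles:
        lsr_rot = rotate(lsr, angle=a, reshape=False, mode="nearest")
        for s in shifts:
            hsr_shift = np.roll(hsr, shift=s, axis=1)
            yield a, s, lsr_rot, hsr_shift

# Toy images standing in for the WorldView-2 pair
lsr = np.zeros((20, 20))
hsr = np.zeros((40, 40))
pairs = list(make_mismatched_pairs(lsr, hsr))
print(len(pairs))  # 36 combinations
```

Each yielded pair would then be fed through the three fusion methods and scored, producing the MAE surfaces plotted in Figure 20.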

Sensitivity Test
The results obtained by DFuSIAL are sensitive mainly to two flexible parameters of the SFFAN: (a) the number of layers and (b) the number of neurons in each layer. To test this sensitivity, we repeat the fusion process with 25 different combinations of these two parameters. We set the number of layers to 1, 2, 3, 4, or 5, and the number of neurons to 10, 15, 20, 25, or 30. In each case, we apply the fusion process and examine the performance of the method by observing the MAE of the results and the run time of the network training. We repeat the test ten times, picking the median values to prevent the influence of unexpected events during the calculations (see Figure 21). MAE values vary from 0.054 to 0.057, with a moderate variance. The MAE surface shows that increasing the number of layers in the network decreases the magnitude of the error more than doubling the number of neurons in each layer does. In terms of computation run time, increasing the number of layers or the number of neurons increases the run time quadratically. Moreover, the number of extracted IPs limits the allowed number of layers and neurons in the network. Thus, these two parameters should be tested carefully when designing a new SFFAN. Table 6 shows the computation run time of the methods in each test in Experiment 1. The overall run time of PSQ-PS for fraction estimation includes two main parts: (a) retrieval of an HSR spectral image and (b) fraction estimation through unmixing. On average, the first part requires 60% of the computation run time, while the second part requires 40%. In DFuSIAL, the overall process includes four parts: (a) unmixing of the LSR spectral image, (b) extraction of IPs and empirical line calibration, (c) RISFL extraction, and (d) training and application of the SFFAN. On average, the respective percentages of the total computation run time are ~5%, 15%, 20%, and 60%.
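The sensitivity grid can be sketched as below. Here `evaluate_architecture` is a hypothetical stand-in that only mimics the reported trends (error decreasing slightly with depth, run time growing quadratically) rather than performing an actual training run.

```python
import numpy as np
from itertools import product

def evaluate_architecture(n_layers, n_neurons, rng):
    """Hypothetical stand-in for one training run: returns (MAE, runtime).
    In the real sensitivity test this would train the SFFAN and measure
    its error and training time."""
    mae = 0.057 - 0.0005 * n_layers - 0.0001 * (n_neurons // 5)
    runtime = (n_layers * n_neurons) ** 2 / 1000.0  # quadratic growth
    return mae + rng.normal(0, 1e-4), runtime

rng = np.random.default_rng(2)
results = {}
for n_layers, n_neurons in product([1, 2, 3, 4, 5], [10, 15, 20, 25, 30]):
    # Repeat each configuration 10 times and keep the median MAE, as in
    # the paper's sensitivity test.
    maes = [evaluate_architecture(n_layers, n_neurons, rng)[0]
            for _ in range(10)]
    results[(n_layers, n_neurons)] = np.median(maes)
print(len(results))  # 25 configurations
```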

Conclusions
Traditional methods for spectral unmixing rely only on spectral images. On the other hand, image fusion methods usually require datasets from the same sensor that overlap both geometrically and temporally. Addressing these two limitations, we developed a new methodology for enhancing the SR of fraction maps through data fusion of HSR visible images and LSR spectral images. Using automatically extracted IPs for learning, the proposed DFuSIAL is highly robust to differences in geometry and acquisition conditions and to changes due to different acquisition dates. It is, therefore, beneficial for the fusion of data from different remote sensing instruments. Furthermore, we present a useful way to incorporate spatial information within NN-based learning, through RISFLs, without the need for CNNs. The results of real-data experiments suggest that DFuSIAL gives both reliable and accurate results with respect to the HSR GT fraction maps. Comprehensive testing was applied using real datasets. A quantitative and visual evaluation of the results, with respect to two state-of-the-art conventional and CNN-based PS methods, shows that the proposed fusion method DFuSIAL is advantageous with regard to the accuracy of the obtained HSR fraction maps. Furthermore, the proposed method was shown to be highly robust to an extreme mismatch between the fused data sources. It also yields a tenfold resolution enhancement when applying the hierarchical process. It would be interesting in future work to examine longer chains of the hierarchical process for further improvement, as well as the use of additional metrics of the spatial features. Finally, the proposed framework is based on supervised learning through reliable IPs.
In practice, once the input and output data samples are defined, other multivariate regression procedures, e.g., support vector machines [62] and random forests [63] (for regression), can easily be applied for the learning part of DFuSIAL; these will also be considered in future work. Meanwhile, we prefer to use an FFNN for this purpose, as it requires only a minimal number of assumptions and heuristic parameters.
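As a sketch of how such regressors could be swapped in for the learning part, the example below uses synthetic stand-ins for the IP features and fractions; the feature and endmember counts are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.random((300, 6))   # per-IP features: RGB bands + spatial features
y = rng.random((300, 4))   # fractions of 4 endmembers at the same IPs

# Random forests handle multi-output regression natively
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# SVR is single-output, so wrap it to fit one regressor per endmember
svr = MultiOutputRegressor(SVR()).fit(X, y)

print(rf.predict(X[:2]).shape, svr.predict(X[:2]).shape)
```

Because the learning stage only sees (feature, fraction) pairs at the IPs, any regressor exposing a fit/predict interface can be substituted without changing the rest of the DFuSIAL pipeline.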