An Enhanced Single-Pair Learning-Based Reflectance Fusion Algorithm with Spatiotemporally Extended Training Samples

Abstract: Spatiotemporal fusion methods are considered a useful tool for generating multi-temporal reflectance data from limited high-resolution images and the necessary low-resolution images. In particular, the superiority of the sparse representation-based spatiotemporal reflectance fusion model (SPSTFM) in capturing phenology and type changes of land covers has been preliminarily demonstrated. However, the dictionary training process, which is a key step in the sparse learning-based fusion algorithm, and its effect on fusion quality remain unclear. In this paper, an enhanced spatiotemporal fusion scheme based on the single-pair SPSTFM algorithm is proposed by improving the dictionary learning process, and it is then evaluated on two actual datasets, one representing a rural area with phenology changes and the other an urban area with land cover type changes. The validated strategy for enhancing the dictionary learning process is divided into two modes that enlarge the training datasets with spatially and temporally extended samples. Compared to the original learning-based algorithm and other typical single-pair-based fusion models, experimental results from the proposed fusion method with the two extension modes show improved performance in modeling reflectance on the two datasets. Furthermore, the strategy with temporally extended training samples is more effective than the strategy with spatially extended training samples for the land cover area with phenology changes, whereas the opposite holds for the land cover area with type changes.


Introduction
Given the growing application requirements of a variety of refined and high-frequency thematic studies, such as land use and cover change [1], ecological environment monitoring [2], forest and pasture monitoring [3], oceanographic surveys [4], and disaster monitoring [5], possible solutions for the frequent acquisition of high-spatial-resolution remotely sensed data have been widely proposed. One significant attempt among current works presented a radical solution that involves the progressively increasing launch of various high-quality remote sensors, some of which adopt high

Methodology
Although the two main existing spatiotemporal fusion methods based on sparse learning theory differ in model construction, fusion pattern, and complexity, their primary theoretical basis and the way it contributes to the fusion process are nearly the same. Considering the universality and simplicity of the algorithm with a single image pair, an improved fusion scheme derived from the single-image-pair method is first proposed on the basis of an enhanced dictionary training strategy and then evaluated on two remotely sensed datasets.

Proposed Fusion Scheme with Enhanced Dictionary-Training Process
In the sparse-learning fusion method, remotely sensed images from the same sensor and channel are treated as different sparse "versions" of an invariable overcomplete dictionary $D$ on different acquisition dates. Here, the sparse "version" is generally called the sparse coefficient $\alpha$ and is considered an indicator of the seasonal state of an acquired image, whereas $D$ mainly encodes spatial and texture features. When no significant discrepancy in texture context occurs, the dictionary $D$ derived from the observed image pair can stand in for the one from the modeled image pair. The key step of the sparse-learning fusion algorithm is therefore to retrieve a high-precision overcomplete dictionary $D$ by training on the high- and low-resolution image pair at the observed date. If a stably performing dictionary training algorithm is applied (e.g., the coupled K-SVD algorithm), the accuracy of the sparse-learning fusion results depends significantly on the sufficiency of the input training samples, which is obviously not satisfied in the original single-pair-based fusion algorithm.
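To make the role of the shared sparse coefficients concrete, the following minimal Python sketch reuses the codes of the low-resolution image at the modeled date together with the high-resolution dictionary, which is the core prediction idea described above. The dictionaries, patch shapes, and data are illustrative assumptions, and scikit-learn's SparseCoder stands in for the paper's own sparse coding step.

```python
import numpy as np
from sklearn.decomposition import SparseCoder

# Given trained dictionaries D_l, D_h (atoms in rows), predict the
# high-resolution patches at t2 from the low-resolution patches at t2:
# the sparse code alpha plays the "seasonal" role, while the dictionary
# carries the "spatial/texture" role. All data here are stand-ins.
rng = np.random.default_rng(0)
D_l = rng.random((256, 49)); D_l /= np.linalg.norm(D_l, axis=1, keepdims=True)
D_h = rng.random((256, 49)); D_h /= np.linalg.norm(D_h, axis=1, keepdims=True)
L2_patches = rng.random((1000, 49))          # low-resolution patches at t2

coder = SparseCoder(dictionary=D_l, transform_algorithm='omp',
                    transform_n_nonzero_coefs=5)
alpha_2 = coder.transform(L2_patches)        # sparse "version" of L2
H2_patches = alpha_2 @ D_h                   # reuse the codes with D_h
```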
For the retrieval of a high-precision $D$, an improved sparse-learning fusion method with an enhanced dictionary training process is proposed in this study, and the overall processing flow is shown in Figure 1. In this method, spatiotemporally extended training samples are utilized to improve the sufficiency of the dictionary training operations in both fusion layers of the single-pair learning-based algorithm. Two modes are designed to increase the employed training samples: the spatially extended mode and the temporally extended mode. Specifically, the spatially extended mode increases only the image size (from $S_0$ to $S_1$ in Figure 1) of all the input training samples (including the low-resolution image and the high-low resolution image pair) at the observed date ($t_1$ in Figure 1). By contrast, the temporally extended mode increases the number of input training samples, which are all obtained from different acquisition dates ($t_3$, $t_4$ to $t_n$ in Figure 1) and have the same image size as the original input images.
Considering the case where a single image pair is employed as input, assume that $H_1$ and $L_1$ denote the high- and low-resolution images at $t_1$ (the observed date), $L_2$ denotes the low-resolution image at $t_2$ (the modeled date), $H_2$ denotes the high-resolution image at $t_2$ that is to be predicted, and the image size of $L_1$, $L_2$, $H_1$, and $H_2$ is $S_0 \times S_0$. The high-resolution dictionary $D_h$ and the low-resolution dictionary $D_l$ are now derived by minimizing the following improved objective functions:

$$\{D_h, \alpha_1\} = \arg\min_{D_h,\,\alpha_1} \left\| X_1^{new} - D_h \alpha_1 \right\|_F^2 \quad \text{s.t.} \quad \|\alpha_1\|_0 \le T_0, \tag{1}$$

$$\{D_l, \alpha_1\} = \arg\min_{D_l,\,\alpha_1} \left\| Y_1^{new} - D_l \alpha_1 \right\|_F^2 \quad \text{s.t.} \quad \|\alpha_1\|_0 \le T_0, \tag{2}$$

where $\alpha_1$ denotes the sparse coefficients shared by $D_l$ and $D_h$ at $t_1$, $T_0$ is the sparsity constraint, and $X_1^{new}$ and $Y_1^{new}$, respectively, denote the spatially or temporally extended training sample matrices that replace the original training sample matrices $X_1$ and $Y_1$ extracted from the difference image ($H_1 - L_1$) and the low-resolution image $L_1$.
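A minimal sketch of how Equations (1) and (2) can be approximated in practice is given below, assuming patch-based training matrices and using scikit-learn's generic dictionary learner on concatenated high/low features in place of the coupled K-SVD named above; all names, patch sizes, and parameters are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d
from sklearn.decomposition import DictionaryLearning

def train_coupled_dictionaries(H1, L1, patch=7, n_atoms=128):
    """Approximate Eqs. (1)-(2): learn D_h and D_l with shared codes."""
    # X1: patches of the difference image (H1 - L1); Y1: patches of L1.
    # The shared random_state keeps the sampled patch locations aligned.
    X1 = extract_patches_2d(H1 - L1, (patch, patch),
                            max_patches=2000, random_state=0)
    Y1 = extract_patches_2d(L1, (patch, patch),
                            max_patches=2000, random_state=0)
    X1 = X1.reshape(len(X1), -1)
    Y1 = Y1.reshape(len(Y1), -1)
    # Concatenating features couples the two problems, so both halves of
    # the joint dictionary share the same sparse coefficients alpha_1.
    Z = np.hstack([X1, Y1])
    learner = DictionaryLearning(n_components=n_atoms, max_iter=10,
                                 transform_algorithm='omp', random_state=0)
    D = learner.fit(Z).components_           # (n_atoms, 2 * patch**2)
    D_h, D_l = D[:, :patch**2], D[:, patch**2:]
    return D_h, D_l

# Illustrative data standing in for the observed Landsat/MODIS pair at t1.
rng = np.random.default_rng(0)
H1, L1 = rng.random((100, 100)), rng.random((100, 100))
D_h, D_l = train_coupled_dictionaries(H1, L1)
```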
In the dictionary training strategy with the spatially extended mode, the employed training sample matrices $X_1^{new}$ and $Y_1^{new}$ in Equations (1) and (2) are extracted from the enlarged difference image $H_1^{enl} - L_1^{enl}$ and the enlarged low-resolution image $L_1^{enl}$, both with a spatially extended image size of $S_1 \times S_1$ rather than the original size $S_0 \times S_0$. When no significant seasonal change occurs between $t_1$ and $t_2$ (type changes), the completeness of the dictionary $D$ is mainly limited by the spatial heterogeneity and diversity of surface features. This mode therefore addresses the completeness of spatial features by learning from a larger image area in which more samples of surface features can be found.
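As a rough illustration of the spatially extended mode, the sketch below clips a larger centered training window from the full scenes at $t_1$; the scene arrays, sizes, and the center_crop helper are hypothetical stand-ins for the actual preprocessing.

```python
import numpy as np

def center_crop(img, size):
    """Crop a size x size window around the image center."""
    r, c = img.shape[0] // 2, img.shape[1] // 2
    h = size // 2
    return img[r - h:r + h, c - h:c + h]

# Synthetic stand-ins for the full high/low-resolution scenes at t1,
# resampled to a common grid (illustrative assumption).
rng = np.random.default_rng(0)
H1_full = rng.random((1200, 1200))   # full Landsat-like scene
L1_full = rng.random((1200, 1200))   # MODIS-like scene on the same grid

# Original inputs: S0 x S0 centered sub-images; extended samples: S1 x S1.
S0, S1 = 500, 700
H1_enl, L1_enl = center_crop(H1_full, S1), center_crop(L1_full, S1)
# X1_new and Y1_new are then extracted from the enlarged difference image
# (H1_enl - L1_enl) and from L1_enl, exactly as in Equations (1)-(2).
```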
In the other situation, when the temporally extended mode is selected, the training sample matrices $X_1^{new}$ and $Y_1^{new}$ are augmented with samples extracted from the additional high- and low-resolution training image pairs $(H_3^{add}, L_3^{add}), (H_4^{add}, L_4^{add}), \ldots, (H_n^{add}, L_n^{add})$, which have the same image size as $H_1$ and $L_1$ and are observed at $t_3$, $t_4$, $\ldots$, and $t_n$. Under the assumption that seasonal change occurs between $t_1$ and $t_2$, the temporally extended mode can improve the description of phenology features extracted from training data observed at different dates. Since a single dictionary $D$ can hardly provide a complete and precise expression of the overall seasonal features, it is reasonable to define an approximately "overcomplete" phenology-based dictionary in this way. Moreover, to assess the effectiveness of the two modes described above, an evaluation strategy is presented in Section 3 (Results) to give a convincing selection proposal for cases where both spatially and temporally extended samples are available and execution efficiency is required.

Figure 1. The proposed fusion scheme with spatially or temporally extended samples in dictionary training.
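A corresponding sketch of the temporally extended mode is given below: patch matrices from several observed dates are simply stacked so that the learned dictionary sees more phenological variety. The pair list and the patches helper are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d

def patches(img, p=7):
    """Sample patch rows from an image (illustrative helper)."""
    P = extract_patches_2d(img, (p, p), max_patches=2000, random_state=0)
    return P.reshape(len(P), -1)

# Stand-ins for the observed pair at t1 plus additional pairs at t3..tn,
# all with the original S0 x S0 size (synthetic data for illustration).
rng = np.random.default_rng(1)
pairs = [(rng.random((500, 500)), rng.random((500, 500))) for _ in range(3)]

# Temporally extended mode: stack the training matrices over all dates.
X1_new = np.vstack([patches(H - L) for H, L in pairs])
Y1_new = np.vstack([patches(L) for _, L in pairs])
```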

Assessment Indices of the Proposed Fusion Scheme
To obtain an accurate description of the fusion results, four types of indices are provided, covering spectral errors per band, similarity of the overall structure, spectral distortion, and overall spectral errors; they are applied to the modeled reflectance and the actual reflectance for a comprehensive quality evaluation of the fusion results. Five quantitative indices, namely, average absolute difference (AAD), root-mean-square error (RMSE), structure similarity (SSIM) [35], spectral angle mapper (SAM) [36], and Erreur Relative Globale Adimensionnelle de Synthèse (ERGAS) [37], which correspond to the indicated assessment aspects, are gathered to validate the quality of the predicted images from different assessment views. SSIM, SAM, and ERGAS are obtained by computing the following equations:

$$\mathrm{SSIM}_i = \frac{(2\mu_{P_i}\mu_{R_i} + C_1)(2\sigma_{P_i R_i} + C_2)}{(\mu_{P_i}^2 + \mu_{R_i}^2 + C_1)(\sigma_{P_i}^2 + \sigma_{R_i}^2 + C_2)},$$

$$\mathrm{SAM} = \arccos\left(\frac{\sum_{i=1}^{B} \rho_{P_i}\,\rho_{R_i}}{\sqrt{\sum_{i=1}^{B}\rho_{P_i}^2}\,\sqrt{\sum_{i=1}^{B}\rho_{R_i}^2}}\right),$$

$$\mathrm{ERGAS} = 100\,\frac{p}{r}\sqrt{\frac{1}{B}\sum_{i=1}^{B}\left(\frac{\mathrm{RMSE}_i}{\mu_{R_i}}\right)^2},$$

where $\rho_{P_i}$ and $\rho_{R_i}$ are the reflectance in band $i \in [1, B]$ of the modeled image $P$ and the actual image $R$; $\mu_{P_i}$, $\mu_{R_i}$, $\sigma_{P_i}$, $\sigma_{R_i}$, and $\sigma_{P_i R_i}$ correspond to the mean values, standard deviations, and covariance in band $i$ of $P$ and $R$, respectively; $C_1 = (k_1 L)^2$ and $C_2 = (k_2 L)^2$; $k_1$ and $k_2$ are generally set to 0.01 and 0.03; $L$ is the grayscale range of the reflectance images; $\mathrm{RMSE}_i$ is the RMSE in band $i$ of $P$ and $R$; and $p$ and $r$ are the spatial resolutions of $P$ and $R$. Small values of AAD, RMSE, SAM, and ERGAS and a high value of SSIM between the modeled and the actual reflectance images indicate a favorable fusion result.

Scatter plots of the channel-specified reflectance of the modeled data against the actual data are provided to supplement the aforementioned quantitative indices with a visualized pattern for an intuitive quality assessment of the fusion results. Moreover, the total time consumption of the employed channels in the fusion strategies with spatiotemporally extended training samples is also considered here to give a general description of their efficiencies.
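For reference, a compact Python sketch of the five indices is given below, assuming band-first reflectance arrays scaled to [0, 1] (so L = 1) and computing a global (non-windowed) SSIM for brevity; the windowed SSIM of [35] would be applied per local patch in practice.

```python
import numpy as np

def aad(P, R):
    return np.mean(np.abs(P - R))

def rmse(P, R):
    return np.sqrt(np.mean((P - R) ** 2))

def ssim(P, R, L=1.0, k1=0.01, k2=0.03):
    """Global SSIM for one band (simplified, non-windowed form)."""
    C1, C2 = (k1 * L) ** 2, (k2 * L) ** 2
    mp, mr = P.mean(), R.mean()
    sp, sr = P.std(), R.std()
    spr = np.mean((P - mp) * (R - mr))          # covariance
    return ((2 * mp * mr + C1) * (2 * spr + C2)) / \
           ((mp**2 + mr**2 + C1) * (sp**2 + sr**2 + C2))

def sam(P, R):
    """Mean spectral angle in radians; P, R: (bands, rows, cols)."""
    num = np.sum(P * R, axis=0)
    den = np.linalg.norm(P, axis=0) * np.linalg.norm(R, axis=0) + 1e-12
    return np.mean(np.arccos(np.clip(num / den, -1.0, 1.0)))

def ergas(P, R, p_res, r_res):
    """P, R: (bands, rows, cols); p_res/r_res: spatial resolutions."""
    terms = [(rmse(P[i], R[i]) / (R[i].mean() + 1e-12)) ** 2
             for i in range(P.shape[0])]
    return 100.0 * (p_res / r_res) * np.sqrt(np.mean(terms))
```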

Results
Given that the proposed learning-based algorithm requires only one acquired image pair (the high- and low-resolution images), two reconstruction-based spatiotemporal fusion models, STARFM and the semi-physical reflectance fusion model, are employed for comparison with the original and the improved learning-based fusion algorithms. Specifically, the single-pair version of STARFM with default parameters and the improved algorithm based on the semi-physical fusion model (SPFM) [30] are adopted to perform the experiments.

Datasets
In this paper, two datasets, one rural and one urban, are employed to perform the fusion strategy that utilizes spatiotemporally extended training samples. The rural dataset uses the same experimental data as in [23], which has been characterized as a study area with phenology changes [32] and comprises Landsat ETM+ images with 30 m spatial resolution and the MODIS daily 500 m surface reflectance product (MOD09GHK) acquired on 24 May, 11 July, and 12 August 2001 (Figure 2).

Beijing, a typical urban area in China, is selected as the urban dataset to validate the fusion quality with extended spatiotemporal training samples, because the sparse learning fusion method is more sensitive to the texture and structural features of the fused images than other methods. For the urban dataset listed in Table 1, the reflectance products comprise 20 Landsat-8 OLI scenes (30 m spatial resolution) and the corresponding MODIS 8-day MOD09A1 (500 m spatial resolution) and MOD09Q1 (250 m spatial resolution) products acquired from 2013 to 2017, which are used to perform the fusion strategy described in Figure 1. The Landsat-8 Surface Reflectance product, whose performance is accepted to be close to or better than that of the Landsat TM/ETM+ reflectance products from the Landsat Ecosystem Disturbance Adaptive Processing System (LEDAPS) [38], is generated by the Landsat Surface Reflectance Code (LaSRC). The MODIS reflectance product, in turn, is obtained by combining the green channel of MOD09A1 with the red and NIR channels of MOD09Q1, which are downloaded directly from the Land Processes Distributed Active Archive Center (LPDAAC).

Notably, only centered sub-images with 500 × 500 Landsat pixels (covering an area of 15 km × 15 km) of the preceding two datasets are used in the fusion procedure, and the spatially or temporally extended training samples are utilized only in the dictionary training process. Figures 2 and 3 show that the training samples initially cover an area of 15 km × 15 km (500 × 500 Landsat pixels) for both datasets and reach 36 km × 36 km (1200 × 1200 Landsat pixels) for the rural dataset and 60 km × 60 km (2000 × 2000 Landsat pixels) for the urban dataset as their maximum sizes in the spatially extended fusion experiment. Between the original and the maximal sizes of the training samples in each dataset, a dozen training samples with different image sizes are clipped with a size step of 3 km × 3 km (100 × 100 Landsat pixels). As a result, the training samples are sized 500 × 500, 600 × 600, . . . , 1200 × 1200 Landsat pixels for the rural dataset, and 500 × 500, 600 × 600, . . . , 2000 × 2000 Landsat pixels for the urban dataset. In consideration of the heterogeneity and diversity of the spatial extension of surface features in different directions, all the employed training samples share the same central position as their original training images, which also serve as the observed reflectance inputs for Landsat and MODIS (500 × 500 Landsat pixels). For instance, the urban study area is positioned at the center of the Beijing City proper, which is composed of two main parts, namely, Dongcheng District and Xicheng District. In this way, the fusion strategy with spatially extended training samples tends to be less sensitive to the variety and texture features of the land cover in different study areas, and a fair and authentic comparison between the original fusion algorithm and its modified strategy with spatial extension can therefore be expected.
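The series of training-sample sizes can be written out directly; the short sketch below (with hypothetical names) reproduces the 3 km step arithmetic for both datasets at the 30 m Landsat pixel size.

```python
# Training-sample sizes in Landsat pixels (30 m), clipped around a fixed
# center in 100-pixel (3 km) steps from the original to the maximum size.
rural_sizes = list(range(500, 1201, 100))   # 500, 600, ..., 1200
urban_sizes = list(range(500, 2001, 100))   # 500, 600, ..., 2000

for s in rural_sizes:
    km = s * 30 / 1000
    print(f"rural training sample: {s} x {s} px = {km:g} km x {km:g} km")
```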
To ensure a consistent comparison across temporal directions, a bi-directional fusion scheme is adopted for the preceding two datasets. The dates 24 May and 11 July 2001 are determined as the bi-directional observed or modeled dates for the rural dataset, while 10 July and 12 September 2017 are selected for the urban dataset. The bi-directional fusion indicates that one of the bi-directional dates acts as the observed time, and the other serves as the modeled time (to be predicted). The original input reflectance images from each dataset at the observed dates are spatially replaced by, or temporally supplemented with, the extended training samples to form a new, enhanced training set.

Experiments with Spatially Extended Training Samples

In this experiment, the rural dataset is employed to implement the bi-directional fusion scheme for modeling reflectance images on 24 May (Table 2) or 11 July (Table 3) 2001 with different sizes of spatially extended training samples, which are used in the dictionary learning process. The quality assessment of the fusion results from the aforementioned bi-directional scheme is shown in Tables 2 and 3, with Figure 4 giving a graphical description of the total statistics. Several modeled images are selected from the predicted results and then validated by scatter plots against the actual reflectance (Figure 5).

Experiments with Temporally Extended Training Samples
Considering that only three pairs of temporal reflectance images are available in the rural dataset, the image pair acquired on 12 August 2001 is always taken as the additional training sample when modeling either 24 May or 11 July 2001. The resulting fused images from the bi-directional fusion scheme with temporally extended training samples and their reflectance scatter plots are shown in Figure 6, and the assessment indices are listed in Table 4.

Experiments with Spatially Extended Training Samples
Along with the temporally corresponding MODIS reflectance products MOD09A1 and MOD09Q1 (Table 1), only the Landsat-8 surface reflectance products acquired on 10 July and 12 September 2017 (Figure 3a,c) are used as the basic experimental data to apply the bi-directional fusion scheme with spatially extended training samples covering the Beijing urban area. The assessment indices for both temporal directions are summarized in Tables 5 and 6, with a graphical presentation in Figure 7. Several typical modeled results and their scatter plots against the actual reflectance from this spatially extended fusion, with training image sizes of 500 × 500 and 1500 × 1500 pixels, are displayed in Figure 8.

Experiments with Temporally Extended Training Data
With regard to the temporal extension of training samples, we first define an optimized selection mode for temporal training samples by analyzing the fusion quality of the Beijing urban dataset in 2017 and selecting eligible acquisition dates from 2013 to 2016 according to the Landsat-8 reflectance data, which are accumulated as additional training samples for the dictionary learning process. This bi-directional fusion strategy with temporally extended training samples is shown in Figure 9. Nearly all 12 reflectance images acquired from 31 January to 17 December 2017 (Table 1) were taken as additional training samples for the dictionary learning process, except for the two Landsat reflectance images acquired on 10 July and 12 September 2017. The assessment indices are listed in Tables 7 and 8.
Although only small discrepancies exist among the fusion results with additional training samples from different acquisition dates, slightly higher fusion accuracy can be expected when the additional reflectance data are acquired between, or close to, the observed date and the modeled date (23 May and 28 September 2017 in this experiment). Two reflectance images satisfying this criterion were then selected from each year from 2013 to 2016 and finally used to perform and validate the fusion with temporally accumulated training samples (Tables 9 and 10 and Figure 10).
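This selection rule can be made explicit with a small date filter; the 60-day slack in the sketch below is a hypothetical threshold introduced for illustration, not a value given in the paper.

```python
from datetime import date

def within_window(d, observed, modeled, slack_days=60):
    """Keep candidate dates between, or close to, the observed and
    modeled dates (slack_days is an illustrative assumption)."""
    lo, hi = sorted([observed, modeled])
    if lo <= d <= hi:
        return True
    return min(abs((d - lo).days), abs((d - hi).days)) <= slack_days

observed, modeled = date(2017, 7, 10), date(2017, 9, 12)
candidates = [date(2017, 1, 31), date(2017, 5, 23), date(2017, 9, 28)]
print([d.isoformat() for d in candidates
       if within_window(d, observed, modeled)])
# -> ['2017-05-23', '2017-09-28']
```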

Fusion Quality with Spatially Extended Training Samples
The assessment indices from the two datasets indicate good agreement in both temporal directions of the fusion strategy with spatially extended training samples, whose training image sizes vary from 500 × 500 up to 1200 × 1200 and 2000 × 2000 pixels (Tables 2, 3, 5 and 6 and Figures 5 and 8). On the one hand, the overall fusion quality increases with larger training image sizes, and only small improvements can be expected once the size of the training images reaches a threshold, which is approximately two and three times the original image size for the rural and the urban datasets, respectively. Moreover, with training sizes under this "size threshold", the proposed fusion algorithm generally performs better than the STARFM and SPFM models for both the rural dataset (phenology changes) and the urban dataset (type changes). On the other hand, the AAD, RMSE, and SSIM indices show errors that increase from the green band through the red band to the NIR band for all training image sizes (Figures 4 and 7). A reasonable explanation for the threshold size of the training samples is the reduced spatial similarity: the features of the image structures become less effective as the image size of the training samples increases.
The different levels of fusion errors over the bands follow the different standard deviations of each band. For the rural dataset with phenology changes, the standard deviations of the reflectance images, which quantify the spectral variation or dispersion of the images acquired on 24 May and 11 July 2001, are 0.0102, 0.013, and 0.0476 and 0.0108, 0.0165, and 0.0306, respectively. A more integrative description of the fusion results is provided by the SSIM, SAM, and ERGAS indices than by the AAD and RMSE indices; the ERGAS index in particular reveals significant differences where the AAD and RMSE values show only very small discrepancies in one or more channels. Similarly, the increasing fusion errors from the green and red bands to the NIR band follow standard deviations of 0.0334, 0.0413, and 0.0611 and 0.031, 0.0374, and 0.0565 for the Beijing urban data acquired on 10 July and 12 September 2017, respectively. The change in the threshold size of the training images from two times (rural area) to three times (urban area) is mainly ascribed to the differences in surface features and in the employed reflectance products (Landsat-7 ETM+ and MOD09GHK for the rural dataset, and Landsat-8 OLI and MOD09A1/Q1 for the urban dataset). At the threshold size of the urban training images (approximately 1500 × 1500 pixels), the modeled images in both temporal directions, especially the reflectance on 12 September 2017, show less noise disturbance than at other image sizes (Figure 8). Regarding the running time of the fusion with spatially extended training samples, the procedure becomes increasingly time consuming with larger training images, and the growth is not linear but closer to exponential.

Fusion Quality with Temporally Extended Training Samples
Unlike the original bi-directional fusion results, the temporally extended fusion strategy can promote fusion quality and performs better than the spatially extended fusion strategy when an equal number of training images is handled. The amount of training data in the temporally extended fusion scheme with the rural dataset (Table 4) corresponds to the training image size of 700 × 700 pixels used in the spatially extended fusion scheme (Tables 2 and 3). The temporal extension scheme is therefore more efficient than the spatial extension scheme in training the rural dataset with phenology changes. Moreover, the assessment indices from the urban dataset primarily show decreasing fusion errors when temporal training samples are added (Figure 11), although disagreement occurs when the two reflectance images acquired in 2016 participate in the training sample set. This phenomenon may be attributed to the large seasonal difference between the acquisition dates in 2016 and the observed-modeled period. In addition, for the urban dataset (the Beijing area), the more effective fusion strategy is to bring the spatially extended training samples, rather than the temporally extended training samples, into the training set (Figure 12).

The results from the rural dataset, especially for the NIR channel, are more sensitive to the added temporal training images than those from the urban dataset (Figures 6d,h and 10d,h). Regardless of the seasonal characteristics of the employed temporal training images, the discrepancy in spatial features, such as texture and structure, plays the leading role in the comparison of the fused results from the two datasets with different change types (primarily phenology change in the rural dataset and texture and structural change in the urban dataset). The time consumption of the fusion with temporally extended training samples tends to be similar to that of the strategy with spatial extension, owing to the proportional image sizes of the spatially and temporally extended training samples.

Conclusions
An enhanced fusion scheme based on the single-pair sparse learning fusion model is proposed in this paper by improving the dictionary training process, and its evaluation strategy is designed around the spatially and the temporally extended training samples. The results from the bi-directional fusion scheme show high agreement in the assessment indices of the fusion quality, which indicate a decrease in prediction errors and an increase in image similarity with the extension of spatial or temporal training samples. This fusion scheme remains effective until the spatial threshold size of the training images (approximately two to three times the original image size used here) is reached or one or more temporal training samples with dissimilar acquisition seasons are added. Compared to the STARFM and SPFM models, a better fusion quality can also be obtained by the proposed method with an enhanced "threshold" training size. Specifically, the fusion strategy with spatially extended training samples performs better than the fusion strategy with temporally extended training samples for the urban dataset, whereas the opposite inference can be drawn from the rural dataset. In consideration of the land cover characteristics of the two datasets, in which phenology changes occur in the rural dataset and type changes appear in the urban dataset, a reliable approach is to adopt an adaptive pattern of spatially or temporally extended training samples to promote fusion quality according to the data acquisition conditions and the land cover change type of a study area. The results of the temporally extended fusion scheme are significantly affected by additional training samples with different seasonal features; the proposed sparse learning-based fusion scheme is therefore more sensitive to temporal changes than to spatial changes in surface features. To promote the efficiency of sparse learning-based fusion methods, a spatial and temporal similarity measure should be designed for filtering and training spatiotemporal samples and then integrated into the fusion procedure once its availability has been validated on typical areas with various land cover changes.
The whole process of the original sparse-learning algorithm costs 3.7 min for an image with 500 × 500 pixels, which is more efficient than STARFM (about 4 min) but less efficient than SPFM (2.3 min); the proposed method with spatiotemporally extended training samples becomes far more time consuming when a better fusion result is required. This issue can, however, be effectively addressed by updated sparse coding techniques. For instance, online dictionary learning methods [39,40] can significantly reduce the time consumption of the entire training process to about 1 min for 500 × 500 pixels and are expected to be more effective with growing image size. In this way, the proposed method has a high potential for processing large scenes (spatially or temporally extended acquisitions), usually with multiple channels, and taking them into consideration for reflectance reconstruction.
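As an indication of how such an online learner could slot into the training step, the sketch below uses scikit-learn's MiniBatchDictionaryLearning, which implements a mini-batch algorithm of the kind cited in [39,40]; the patch data and parameters are illustrative, not the configuration used in this paper.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Online (mini-batch) dictionary learning: the dictionary is updated
# batch-by-batch, so training cost grows gently with the number of
# spatially or temporally extended training patches.
rng = np.random.default_rng(0)
train_patches = rng.random((50_000, 49))    # stand-in 7x7 training patches

learner = MiniBatchDictionaryLearning(n_components=256, batch_size=512,
                                      max_iter=10,
                                      transform_algorithm='omp',
                                      random_state=0)
D = learner.fit(train_patches).components_  # (256, 49) dictionary atoms
```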