A Hybrid Color Mapping Approach to Fusing MODIS and Landsat Images for Forward Prediction

Abstract: We present a simple and efficient approach to fusing MODIS and Landsat images. It is well known that MODIS images have high temporal resolution and low spatial resolution, whereas Landsat images are just the opposite. Similar to earlier approaches, our goal is to fuse MODIS and Landsat images to yield images with both high spatial and high temporal resolution. Our approach consists of two steps. First, a mapping is established between two MODIS images, where one is at an earlier time, t_1, and the other is at the time of prediction, t_p. Second, this mapping is applied to a known Landsat image at t_1 to generate a predicted Landsat image at t_p. Similar to the Spatial and Temporal Adaptive Reflectance Fusion Model (STARFM), the SpatioTemporal Image-Fusion Model (STI-FM), and the Flexible Spatiotemporal DAta Fusion (FSDAF) approaches, only one pair of MODIS and Landsat images is needed for prediction. Using seven performance metrics, experiments involving actual Landsat and MODIS images demonstrated that the proposed approach achieves comparable or better fusion performance than STARFM, STI-FM, and FSDAF.


Introduction
Fusing high spatial resolution/low temporal resolution Landsat images with low spatial resolution/high temporal resolution MODIS images has many applications, such as drought monitoring, fire damage assessment, and flood damage monitoring. In [1], a fusion approach known as the Spatial and Temporal Adaptive Reflectance Fusion Model (STARFM) was proposed and demonstrated. STARFM has been used to generate daily time-series vegetation indices and evapotranspiration for crop condition and drought monitoring [1,2]. Several alternative algorithms [3,4] were published to further improve the fusion performance. The Bayesian prediction approach [5-7], which was also proposed for fusing satellite images with complementary characteristics, can serve as an alternative fusion method for Landsat and MODIS. According to the survey paper [8], the STAARCH approach [3] can handle abrupt changes, but requires two pairs of MODIS and Landsat images. The ESTARFM algorithm [4] focuses on enhancing the performance for mixed pixels. Similar to STAARCH, ESTARFM also requires two pairs of images. As a result, both STAARCH and ESTARFM may not be suitable for forward prediction.
Recently, the Flexible Spatiotemporal DAta Fusion (FSDAF) algorithm was proposed; it uses only one pair of MODIS and Landsat images for forward prediction [9]. The method has six steps, including land cover classification, grouping of neighboring pixels, etc. The computational load is heavy, as one prediction may take more than one hour for an image of size 1200 × 1200. Moreover, since there are multiple steps in the prediction process, it is hard to determine which step contributes the most to the prediction performance. A relatively simple and more efficient algorithm, the SpatioTemporal Image-Fusion Model (STI-FM) [10], applies clustering to the images first and then performs a separate prediction for each cluster. In addition to the above algorithms, some alternative fusion ideas [11-16] were proposed and evaluated. Interested readers can find concise reviews of these alternative methods in [10]. In this paper, we present a simple approach to fusing MODIS and Landsat images for forward prediction. Our approach was motivated by our recent pansharpening work on synthesizing a high resolution hyperspectral image by fusing a high resolution color image with a low resolution hyperspectral image cube [17]. Our approach, called hybrid color mapping (HCM), has also been applied to enhance Mastcam images from the Curiosity rover [18] and images from the Mars orbital imagers THEMIS and TES [19]. The HCM is simple, intuitive, and computationally efficient compared to other algorithms in the literature. Most importantly, it delivers high performance, as numerous recent comparative studies [18-20] have demonstrated. Similar to STARFM, STI-FM, and FSDAF, only one pair of MODIS and Landsat images is needed for prediction.
Our paper is organized as follows. Section 2 introduces our approach. Section 3 presents the experimental results. Two types of real images have been used. One type is homogeneous and the other is the more challenging type of heterogeneous images. Discussions are included in Section 4. Finally, the conclusions and future directions are given in Section 5.

Materials and Methods
Our approach is simple, intuitive, and has two steps. First, a mapping is established between two MODIS images, where one is at an earlier time, t_1, and the other is at the time of prediction, t_p. To achieve better prediction results, it is necessary to divide the images into small patches, which can be either overlapping or non-overlapping. The mapping is then obtained for each patch. It should be noted that the mapping is between a pixel vector in one image and the corresponding pixel vector in the other image. Second, this mapping is applied to a known Landsat image at t_1 to generate a predicted Landsat image at t_p.

A Simple Motivating Example of Our Approach
We will use one band of MODIS and Landsat to illustrate our approach. Figure 1 shows two pairs of MODIS (M1 and M2) and Landsat (L1 and L2) images. The pairs (M1, L1) and (M2, L2) were collected on the same days. From Figure 1, we make two observations. First, the MODIS images can be treated as blurred versions of their Landsat counterparts. Second, the intensity relationship between the MODIS pixels is similar to that between the Landsat pixels. For example, the middle of L1 is darker than the rest of the scene, and this can easily be seen in M1 as well. Similarly, the middle part of L2 is slightly brighter than the rest, and one can see the same in M2. The above observations indicate that the heterogeneous landscape information is captured in the MODIS images. If we can capture the intensity mapping between the MODIS images at two different times, then we can use that mapping to predict the Landsat image at time t_p from the Landsat image at t_k. Although the above idea is simple and intuitive, the prediction results based on it were quite accurate in previous experiments [19,20].
To further substantiate the above observations, we include some statistics (Table 1) from four pairs of MODIS and Landsat images. Days 128 and 144 belong to homogeneous regions, and Days 214 and 246 are heterogeneous regions. It can easily be seen that the means and the standard deviations of the MODIS and Landsat images are indeed very close, corroborating the visual observations in Figure 1.

Figure 2 illustrates our proposed prediction approach. Based on the available MODIS images collected at t_k and t_p, we learn the pixel by pixel mapping between the two images. The learned matrix F is then applied in the prediction step. Using notations similar to related work [1], the prediction of the Landsat image at t_p can be achieved by

L(x, y, t_p) = F L(x, y, t_k),    (1)

where L(·, ·, ·) denotes a pixel vector (with Q being the number of bands) for this application, and F is a pixel to pixel mapping/transformation matrix with appropriate dimensions. F can be determined by using the following relationship,

M(x, y, t_p) = F M(x, y, t_k),    (2)

where M(·, ·, ·) denotes a MODIS pixel vector (Q bands). To account for the intensity differences between two images, a variant of (2) can be described as

M(x, y, t_p) = F_1 M(x, y, t_k) + F_2,    (3)

where F_2 is a vector of constants. With no loss of generality, let us focus on the determination of F in (2). Now, let us denote Y_tp as the set of all of the multispectral pixels M(x, y, t_p) ∈ R^Q in the image at t_p, and Y_tk as the set of all the multispectral pixels M(x, y, t_k) ∈ R^Q in the image at t_k. Since M(x, y, t_p) and M(x, y, t_k) are vectors, Y_tp and Y_tk can be expressed as

Y_tp = [M(1, 1, t_p), M(1, 2, t_p), ..., M(N_R, N_C, t_p)],    (4)
Y_tk = [M(1, 1, t_k), M(1, 2, t_k), ..., M(N_R, N_C, t_k)],    (5)

where N_R and N_C are the numbers of rows and columns, respectively. We call the mapping F in (2) the global version: all of the pixels in Y_tk and Y_tp are used in estimating F. To estimate F, we use the least squares approach, which minimizes the error

min_F Σ_{x,y} ||P(x, y)||²,    (6)

where P(x, y) = M(x, y, t_p) − F M(x, y, t_k).
Following the definition of the Frobenius norm [21], (6) is equivalent to

min_F ||Y_tp − F Y_tk||_F².    (7)

Solving for F in (7) involves the following. Since

||Y_tp − F Y_tk||_F² = Tr[(Y_tp − F Y_tk)(Y_tp − F Y_tk)^T],    (8)

differentiating (8) with respect to F gives

−2 (Y_tp − F Y_tk) Y_tk^T.    (9)

Setting the above to zero will yield

F = Y_tp Y_tk^T (Y_tk Y_tk^T)^(−1).    (10)

Unlike a normal image mapping, such as a 3 × 3 image transform matrix (a rotation or perspective transformation), which maps between spatially distributed patches, it should be noted that F is a pixel to pixel mapping between two multispectral pixels.
To avoid instability, we can add a regularization term to (7). That is,

min_F ||Y_tp − F Y_tk||_F² + λ ||F||_F²,    (11)

where λ is a regularization parameter, and the optimal F becomes

F = Y_tp Y_tk^T (Y_tk Y_tk^T + λI)^(−1),    (12)

with I an identity matrix with the same dimensions as Y_tk Y_tk^T.
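The regularized least-squares mapping described above can be sketched in a few lines of NumPy. This is our own illustrative implementation, not the authors' released code; the function name and the (Q, N) column-stacked array layout are assumptions.

```python
import numpy as np

def estimate_mapping(Y_tk, Y_tp, lam=0.001):
    """Estimate the pixel-to-pixel mapping matrix F such that
    Y_tp ~ F @ Y_tk, using the regularized least-squares solution
    F = Y_tp Y_tk^T (Y_tk Y_tk^T + lam * I)^(-1).

    Y_tk, Y_tp : (Q, N) arrays whose columns are Q-band pixel vectors.
    lam        : regularization parameter (0.001 in the paper's experiments).
    """
    Q = Y_tk.shape[0]
    # Regularized Gram matrix of the source pixels, shape (Q, Q).
    G = Y_tk @ Y_tk.T + lam * np.eye(Q)
    return Y_tp @ Y_tk.T @ np.linalg.inv(G)
```

Once F is estimated from the two MODIS images, the prediction step is simply `L_tp = F @ L_tk` applied to the column-stacked Landsat pixels.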

Remark 1. Difference between the HCM in this paper and the HCM in [17].
Although the mathematical derivations are the same, the proposed HCM in this paper is different from that in [17]. The key difference is the implementation. In [17], the mapping is between a downsampled high resolution image and a low resolution hyperspectral image, whereas here the mapping is between two low resolution MODIS images.

Remark 2. Addition of a bias term.
If the means of the two MODIS images are different, we can easily extend the above derivation to arrive at a new expression for F_1 and F_2 in (3). This can be done by rewriting (3) as

M(x, y, t_p) = [F_1 F_2] [M(x, y, t_k); 1];

that is, we simply augment the vector M with one more row containing 1. After that, the derivation is the same as before.

Remark 3. Local Mapping.
Based on our observations, in some cases the prediction results will be more accurate if we divide the images into patches. Each patch then has its own mapping matrix. Figure 3 illustrates the local prediction approach. The patches can be overlapping or non-overlapping. In this paper, overlapping patches were used for homogeneous areas and non-overlapping patches were used for heterogeneous areas.

Remark 4. Joint versus band-by-band mapping.
It should be noted that the derivations for the forward prediction are done jointly for all of the bands. Alternatively, the mapping can also be done band by band.
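The local, per-patch mapping of Remark 3 can be sketched as follows for the non-overlapping case. This is an illustrative implementation under our own assumptions (images stored as (Q, H, W) arrays on a common grid; function and parameter names are hypothetical).

```python
import numpy as np

def predict_local(M_tk, M_tp, L_tk, patch=80, lam=0.001):
    """Local HCM: estimate one mapping matrix F per non-overlapping patch
    of the MODIS pair (M_tk, M_tp), then apply that F to the co-located
    patch of the Landsat image L_tk to predict the Landsat image at t_p.

    M_tk, M_tp, L_tk : (Q, H, W) arrays already resampled to a common grid.
    patch            : patch size in pixels (80 for the homogeneous data set).
    """
    Q, H, W = M_tk.shape
    L_tp = np.zeros_like(L_tk)
    for r in range(0, H, patch):
        for c in range(0, W, patch):
            sl = (slice(None), slice(r, r + patch), slice(c, c + patch))
            Yk = M_tk[sl].reshape(Q, -1)     # MODIS pixels at t_k in this patch
            Yp = M_tp[sl].reshape(Q, -1)     # MODIS pixels at t_p in this patch
            G = Yk @ Yk.T + lam * np.eye(Q)
            F = Yp @ Yk.T @ np.linalg.inv(G) # per-patch mapping matrix
            L_tp[sl] = (F @ L_tk[sl].reshape(Q, -1)).reshape(M_tk[sl].shape)
    return L_tp
```

The overlapping variant used for homogeneous areas would additionally blend the predictions of overlapping windows, which is omitted here for brevity.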

Remark 5. Only one pair of MODIS and Landsat images.
Unlike STAARCH and ESTARFM, our approach does not require two pairs of MODIS and Landsat images for prediction. This makes it ideal for forward prediction, where only past measurements of Landsat images are available.

Remark 6. One mapping per cluster.
One can also perform image clustering/segmentation first and then use HCM for each individual cluster. We implemented this approach and the results are similar to that of the non-clustering approach. Discussions on this clustering approach will be summarized in Section 4.

Evaluation Metrics
We evaluate the performance of the algorithms using the following seven objective metrics. Moreover, computational time is also used in our comparative studies.

• AD (Absolute Difference). The AD of two vectorized images S (ground truth) and Ŝ (prediction) is defined as

AD = (1/Z) Σ_{i=1}^{Z} |S_i − Ŝ_i|,

where Z is the number of pixels in each image. The ideal value of AD is 0 if the prediction is perfect.
• RMSE (Root Mean Squared Error). The RMSE is defined as

RMSE = sqrt((1/Z) Σ_{i=1}^{Z} (S_i − Ŝ_i)²),

where Z is the number of pixels in each image. The ideal value of RMSE is 0 if the prediction is perfect.
• CC (Cross-Correlation). We used the codes from the Open Remote Sensing website (https://openremotesensing.net/). The ideal value of CC is 1 if the prediction is perfect.
• ERGAS (Erreur Relative Globale Adimensionnelle de Synthèse). We used the codes from [23]. The ERGAS is defined as

ERGAS = 100 d sqrt((1/Q) Σ_{q=1}^{Q} (RMSE_q / μ_q)²),

for some constant d depending on the resolution, where μ_q is the mean of band q of the ground truth image. The ideal value of ERGAS is 0 if a prediction algorithm flawlessly reconstructs the Landsat bands.
• SSIM (Structural Similarity). It is a metric that reflects the similarity between two images. An equation for SSIM can be found in [9]. The ideal value of SSIM is 1 for perfect prediction.
• SAM (Spectral Angle Mapper) [23]. The spectral angle mapper measures the angle between two pixel vectors. The ideal value of SAM is 0 for perfect reconstruction.
• Q2n. A definition of Q2n can be found in [24,25]. The ideal value of Q2n is 1. The codes can be downloaded from the Open Remote Sensing website.
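The simpler metrics above can be sketched directly in NumPy. This is an illustrative implementation under our own assumed conventions: AD is taken as the mean absolute difference, and the ERGAS constant d (resolution-dependent) defaults to an assumed value.

```python
import numpy as np

def ad(S, S_hat):
    """Absolute Difference: mean of |S - S_hat| over all pixels (assumed form)."""
    return np.mean(np.abs(S - S_hat))

def rmse(S, S_hat):
    """Root Mean Squared Error over all pixels."""
    return np.sqrt(np.mean((S - S_hat) ** 2))

def ergas(S, S_hat, d=1.0 / 16.0):
    """ERGAS for (Q, N) multiband images; d is the resolution ratio constant
    (1/16 assumed here for the Landsat/MODIS case)."""
    per_band = [rmse(S[q], S_hat[q]) / np.mean(S[q]) for q in range(S.shape[0])]
    return 100.0 * d * np.sqrt(np.mean(np.square(per_band)))

def sam(S, S_hat):
    """Spectral Angle Mapper: mean angle (radians) between the (Q,) pixel
    vectors of two (Q, N) images."""
    num = np.sum(S * S_hat, axis=0)
    den = np.linalg.norm(S, axis=0) * np.linalg.norm(S_hat, axis=0)
    return np.mean(np.arccos(np.clip(num / den, -1.0, 1.0)))
```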

Results
In this section, we present extensive experimental results. Since our main interest is in forward prediction, we only compare with STARFM, STI-FM, and FSDAF.

Data Set 1: Scene Contents Are Homogeneous
Data set 1 covers the Boreal Ecosystem-Atmosphere Study (BOREAS) southern study area (54.6°N, 105.8°W), which was used by Gao et al. [26] in their data fusion test. This is a relatively homogeneous area. The major land cover type is forest, with subsidiary fen and sparse vegetation. Land cover patches are large. The Landsat and MODIS data were reprocessed using the latest collections available when we started this study. Landsat surface reflectance images (L1T) were ordered from the U.S. Geological Survey (USGS). MODIS daily surface reflectance products (MOD09GA, Collection 6) [27] were corrected to nadir BRDF-adjusted reflectance (NBAR) using MODIS BRDF products (MCD43A1, Collection 5) [26]. Note that Collection 6 of the daily MODIS NBAR at 500 m resolution is now available and can be used directly. Co-registration was applied to all of the Landsat-MODIS image pairs using the maximum correlation approach [27]. Four Landsat-MODIS image pairs (days 128, 144, 192, and 224) in 2001 were reprocessed for the study. Each image has six bands. We can perform six forward prediction experiments.
The parameters of our prediction algorithm for the first data set are as follows. Band-by-band prediction was used, and no bias term was introduced. The images were divided into patches with a patch size of 80 Landsat pixels. The regularization parameter was chosen to be 0.001. The overlapping window size for patches is 40. Tables 2-4 summarize the forward prediction results of using Landsat image 128 to predict Landsat image 144, Landsat image 144 to predict Landsat image 192, and Landsat image 192 to predict Landsat image 224, respectively. The tables are arranged in a similar manner to that of [9]. The first block of each table compares the source Landsat image and the ground truth Landsat image to be predicted. We applied cloud masks when calculating the values in the tables. In Table 2, one can see that our prediction results in terms of the AD, RMSE, CC, ERGAS, SSIM, and Q2n metrics are better than those of STARFM, STI-FM, and FSDAF in almost all of the bands. In the false color image (formed by treating NIR, Red, and Green as Red, Green, and Blue, respectively) shown in Figure 4, if one looks at the zoomed-in area inside the red circles, one can see that our results recover more fine details than STARFM and are comparable to FSDAF and STI-FM. From Table 3, it can be seen that our results follow similar trends as in Table 2. This is reflected in Figure 5: inside the red circles, our results preserve the color information slightly better than STARFM and STI-FM, and are comparable to FSDAF. From Table 4, our results also give better values in almost all of the metrics as compared to those of STARFM, STI-FM, and FSDAF. In Figure 6, one can see that the STARFM image has an area in the center/bottom half (inside the white circle) that seems to have some spectral distortion, while our predicted image looks much more like the actual image.
Table 4 also shows that our prediction, especially the NIR band, is a lot closer to the ground truth image.

Data Set 2: Scene Contents Are Heterogeneous
Data set 2 is a rain-fed agricultural area in central Iowa (42.4°N, 93.4°W). The major crops were corn and soybean. They were planted at different times and had different crop phenology. This is a relatively heterogeneous area. The same data processing procedures used for Data set 1 were applied. Three Landsat-MODIS image pairs (days 144, 182, and 224) in 2002 were processed.
The MODIS daily NBAR images were resampled to match the Landsat 30 m resolution using bi-linear interpolation. Both data sets have an image size of 1200 by 1200. Each time, one image pair was used to predict the Landsat observation for another pair date, and the prediction was then compared to the actual Landsat reflectance. There are three pairs of MODIS and Landsat images in this data set. Each image has six bands. We can perform two forward prediction experiments.
The parameters of our prediction algorithm for the second data set are as follows. The learning and prediction were performed band by band. For heterogeneous contents, we observed that small patch sizes are needed in order to achieve good prediction performance. Patches with a size of two were used, with no overlap between patches. The regularization parameter was chosen to be 0.001. Table 5 summarizes the forward prediction results from Landsat image 182 to Landsat image 214. It can be seen that FSDAF has the best performance, followed by STI-FM, HCM, and STARFM. In terms of subjective comparisons, one can see from the false color image in Figure 7, created by using the near infrared (NIR), red, and green bands as the RGB bands, that our results are slightly closer to the ground truth image in terms of color. Table 6 summarizes the prediction results from Landsat image 214 to Landsat image 246. Although our results are comparable to those of STARFM and STI-FM, and slightly inferior to those of FSDAF, the numbers are close. This is also reflected in the images shown in Figure 8, where the prediction results using STARFM, STI-FM, FSDAF, and our approach are all very close visually.

Discussions
The STARFM algorithm is the pioneering work on the fusion of MODIS and Landsat images. The FSDAF incorporates land cover classification and some additional processing steps, and yields better performance than STARFM in heterogeneous areas. Compared with the FSDAF method, our HCM method is much simpler and more efficient. In fact, HCM is comparable to STI-FM in terms of computational complexity. One may think that, since the resolution ratio between Landsat and MODIS is 16 to 1, the proposed HCM based mapping may not be effective. It turns out that we have applied HCM to enhance images with even greater resolution differences: Worldview-2 (25 to 1) [20] and THEMIS and TES (30 to 1) [19]. The results in [19,20], as well as the results in this paper, clearly demonstrate that the HCM algorithm is a very competitive method for fusing different types of satellite images. More supporting arguments are presented in the following paragraphs.

Additional Simulation Studies Using Synthetic Data to Address 16:1 Resolution Concern for HCM
To further validate the performance of HCM, we carried out some additional studies. Similar to the FSDAF paper, we used a synthetic data set, shown in Figure 9 below. There are three areas besides the background, which has a magnitude of 0.5. A small Gaussian noise (0.001) was added to the image. The pixel magnitudes in the Landsat image at t_1 are as follows: 0.01 in the circle (radius: 56 pixels) and 0.3 in the rectangle and the line. The pixel magnitudes in the Landsat image at t_2 are: 0.05 in the circle (radius: 72 pixels) and 0.2 in the rectangle and the line. The circles in the two Landsat images emulate gradual changes (phenology), and the line and rectangle emulate land cover type changes. The MODIS images were generated by averaging 16 × 16 blocks in the Landsat images. The prediction results are summarized in Figure 10, and Table 7 summarizes four performance metrics. It can be seen that the FSDAF results have smaller prediction errors than those of HCM. The HCM results have some blurry peripheries. Because this synthetic data set is somewhat similar to a heterogeneous landscape, the FSDAF method performed well. However, this simple example also substantiates that our HCM does not have any "grave" issues; it just performs slightly worse than FSDAF in the heterogeneous cases. We did not include the STI-FM results here, as thorough comparisons were carried out earlier in Section 3.
To further alleviate the concern that our proposed HCM method may not work for large spatial resolution ratios, we would like to mention some fusion results where there is a 25:1 resolution difference between two images. The 25 to 1 resolution difference application is the enhancement of the eight SWIR bands (7.5 m resolution) using the pan band (0.31 m resolution) in Worldview-3 images. A paper has been published [20]; see Table 5 in [20].
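The synthetic test images described in this subsection could be generated along the following lines. The background value, circle radii, shape magnitudes, noise level, and 16 × 16 block averaging follow the text; the exact positions of the circle, rectangle, and line are our own illustrative assumptions.

```python
import numpy as np

def make_landsat(radius, circle_val, shape_val, size=1200, noise=0.001, seed=0):
    """Generate a synthetic Landsat-like image: background 0.5, a circle,
    a rectangle, and a line, plus small Gaussian noise. Shape positions
    are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    img = np.full((size, size), 0.5)
    yy, xx = np.mgrid[0:size, 0:size]
    img[(yy - 300) ** 2 + (xx - 300) ** 2 <= radius ** 2] = circle_val
    img[700:900, 200:600] = shape_val        # rectangle
    img[600:604, 700:1100] = shape_val       # line
    return img + rng.normal(0.0, noise, img.shape)

def to_modis(landsat, block=16):
    """Simulate a MODIS image by averaging non-overlapping 16 x 16 blocks."""
    h, w = landsat.shape
    return landsat.reshape(h // block, block, w // block, block).mean(axis=(1, 3))
```

For example, `to_modis(make_landsat(56, 0.01, 0.3))` yields the simulated MODIS image at t_1.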
It can be seen that our HCM results are slightly better than other state-of-the-art methods in the literature using three performance metrics, even if the resolution difference is 25:1.

Performance of HCM for Applications with 30:1 Resolution Difference
Here, we mention one more application of HCM to image fusion, where the spatial resolution difference between the two images is 30:1. The 30 to 1 resolution difference application is the fusion of THEMIS (100 m resolution) and TES (3 km resolution) images for NASA's Mars exploration. A paper was presented at the 2017 IGARSS [19]. The performance of HCM was also excellent for this application; see Figure 4 and Table 1 in [19] for details. Those results clearly demonstrate that HCM can still perform well in a 30:1 resolution difference application; in fact, Table 1 in [19] indicates that HCM has the best performance in this application.

Necessity and Importance of Having Diverse Methods for Image Fusion
We would like to argue that no single method can perform well under all conditions in the many remote sensing applications. For example, recent research by our team focused on the fusion of Planet and Worldview images. In this application, the resolution of Planet images is actually uncertain, because their effective resolution is much worse than the declared resolution of 3.125 m. There are both homogeneous and heterogeneous regions in the images. We carried out three case studies in that paper. It was observed that none of the three methods (STARFM, FSDAF, and HCM) outperformed the others in all cases. This shows that it is important to have more diversity in fusion methods. HCM is simple and efficient, and hence may be more suitable for near real-time applications. On the other hand, FSDAF and STARFM require more computation time and may be suitable for batch, off-line processing applications. The bottom line is that the remote sensing community needs some freedom in choosing the most appropriate fusion method to meet each individual's specific needs. In this respect, HCM offers a reasonable alternative.

Combine HCM with Image Clustering
We have implemented a general clustering approach, which does not require the cluster maps at different times to be the same. To perform the clustering, we use k-means on the MODIS images at t_k and t_p. The clusters can be dramatically different at the two times. Let us use Figure 11 below to illustrate our idea. At t_k, there are three clusters, and at t_p, there are four clusters. When we overlay the cluster maps, we will have six combinations of clusters. In general, if there are N clusters at t_k and M clusters at t_p, then there will be at most NM combined clusters in total. In each combined cluster, we estimate a matrix F. We applied this clustering approach to two cases. The results are shown below.
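The overlay of two cluster maps can be sketched as follows; the function name and integer label-map representation are our own assumptions. Each distinct (t_k label, t_p label) pair becomes one combined cluster, within which a separate mapping matrix F would be estimated.

```python
import numpy as np

def combine_cluster_maps(labels_tk, labels_tp):
    """Overlay two integer cluster-label maps (e.g., k-means labels of the
    MODIS images at t_k and t_p) into combined clusters. Pixels sharing both
    the t_k label and the t_p label fall into the same combined cluster, so
    at most N * M combined clusters result from N and M input clusters.
    """
    # Encode each (label_tk, label_tp) pair as a unique integer.
    pair = labels_tk.astype(np.int64) * (int(labels_tp.max()) + 1) + labels_tp
    # Relabel the occupied pairs with consecutive integers 0, 1, 2, ...
    _, combined = np.unique(pair, return_inverse=True)
    return combined.reshape(labels_tk.shape)
```

A per-cluster F could then be estimated by gathering the MODIS pixels of each combined cluster and applying the same regularized least-squares solution used for patches.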

Example in Homogeneous Area
We used Day 128 to predict Day 144. We tried three cluster combinations: 50 × 50, 20 × 20, and 5 × 5, where the first number indicates the number of clusters at time t_k and the second number is the number of clusters at t_p. From Table 8, it can be seen that the local HCM approach is slightly better than the cluster based approach.

Example in Heterogeneous Area
We used Day 214 to predict Day 246. We tried three cluster combinations: 50 × 50, 20 × 20, and 5 × 5. Again, from Table 9, it can be seen that the two approaches have very close performance, with the local based method having a very slight edge over the cluster based approach. Comparing the cluster-based HCM with the non-cluster/local based HCM, we see that the results are mixed. We do see some cases where the cluster-based approach is slightly better, but in other cases the non-cluster based HCM performs better. A simple explanation is that our non-cluster based HCM is local in nature. Since we normally set the patch size to be small for heterogeneous areas, the pixels within those small patches generally do not have much variation.
We would also like to mention that there is another clustering based method (STI-FM) [10]. The idea is to use the ratio of MODIS pixels at two different times to perform clustering. We implemented that algorithm and included the results in this paper. See Tables 2-6 in Section 3.
In short, the use of a clustering based approach could be a good future research direction, especially in the case of heterogeneous images.

General Comments and Observations
For homogeneous regions, our proposed HCM algorithm performed the best in terms of objective evaluations using seven performance metrics. See Tables 2-4 for details. Those seven performance metrics are widely used in the literature to compare image fusion and pansharpening algorithms.
In terms of subjective visualization for homogeneous regions, one can see that our results are comparable to or better than the others. Some details in Figures 4-6 clearly show closer agreement with the ground truth when using the proposed method.
For heterogeneous regions, the proposed HCM method is slightly inferior to FSDAF, which explicitly incorporates land cover classification. However, in some bands, such as SW1 and SW2, our results are comparable to or better than those of FSDAF in terms of objective metrics. In terms of subjective comparisons, our prediction performance is comparable to that of the other methods.
In terms of computational complexity, STARFM completes the prediction of one image in less than 3 min, while FSDAF takes about 1.5 h for both data sets. There is a fast variant of FSDAF in that package; however, its prediction performance is not as good as that of the slow version. STI-FM completes a forward prediction in about 2 s for both data sets. The computational time for HCM is comparable to that of STI-FM, but varies depending on the data set, since different parameters were used for each. For the homogeneous data set, the computational time for a single prediction is roughly 10 s, while for the heterogeneous data set it is about 8 min.
Since none of the prediction methods can work well under all situations, it might be better to adopt a hybrid approach for Landsat and MODIS image fusion. That is, we propose to use HCM for homogeneous areas and FSDAF or STI-FM for heterogeneous areas.
One future research direction is to incorporate the high temporal resolution fused images into a remote sensing data product, such as fire damage assessment. Another potential direction is to apply a deep learning approach to learn the mapping between the MODIS images and then use that mapping for Landsat image prediction. A third direction is to utilize the high temporal and high spatial resolution images for anomaly detection, border monitoring, and target detection.

Conclusions
In this paper, we presented a simple, high-performance forward prediction approach to generating Landsat images with high temporal resolution. The idea is based on learning a mapping between MODIS images. Once the mapping is learned, it is applied to a Landsat image collected at an earlier time in order to predict a future Landsat image. Compared to other fusion approaches, such as STAARCH and ESTARFM, our approach does not require two pairs of MODIS and Landsat images, and hence it is more appropriate for forward prediction. Experiments using actual MODIS and Landsat images demonstrated that the proposed approach achieves performance comparable to that of STARFM, STI-FM, and FSDAF; the performance metrics were compared to at least three decimal places, and the outcomes were similar across methods. In addition, the proposed HCM and STI-FM were found to be computationally much simpler than the other two methods.