Spatiotemporal Image Fusion in Remote Sensing

In this paper, we discuss spatiotemporal data fusion methods in remote sensing. These methods fuse temporally sparse fine-resolution images with temporally dense coarse-resolution images. This review reveals that existing spatiotemporal data fusion methods are mainly dedicated to blending optical images. There is a limited number of studies focusing on fusing microwave data, or on fusing microwave and optical images in order to address the problem of gaps in the optical data caused by the presence of clouds. Therefore, future efforts are required to develop spatiotemporal data fusion methods flexible enough to accomplish different data fusion tasks under different environmental conditions and using different sensors data as input. The review shows that additional investigations are required to account for temporal changes occurring during the observation period when predicting spectral reflectance values at a fine scale in space and time. More sophisticated machine learning methods such as convolutional neural network (CNN) represent a promising solution for spatiotemporal fusion, especially due to their capability to fuse images with different spectral values.


Introduction
Image fusion is a well-established research field [1][2][3][4][5] with important developments during the last years.The reason for these developments is the increasing demand for satellite images with higher spatial, temporal and/or spectral resolution.In the past, image fusion methods were dedicated to enhancing spatial resolution [6] and to combining multimodal input images.Recently, the focus of these methods has changed to fusing fine spatial resolution images with high temporal frequency images [7].
The last decades have witnessed the emergence of various satellite-borne sensors.The NASA's Moderate resolution Imaging Spectroradiometer (MODIS) sensor onboard the Terra (operating since 1999) and Aqua satellites (operating since 2002), for example, collects data covering an area of 2330 km at different spatial (250 m, 500 m, 1 km) and temporal resolution.The Enhanced Thematic Mapper (ETM+) (launched on 5 October 1993) or Operational Land Imager (OLI) (launched in 2013) sensors onboard Landsat satellites collect 15 m panchromatic and 30 m multi-spectral bands in a 185 km wide swath, with a revisit time of 16 days [8].MultiSpectral Instrument (MSI) onboard Sentinel-2 measures the Earth's reflected radiance with a high revisit time, i.e., 5 days since the launch of Sentinel-2B on 7 March 2017, and a high spatial resolution (four bands at 10 m, six bands at 20 m and three bands at 60 m spatial resolution).The Sentinel-3 Ocean and Land Color Imager (OLCI) sensor (launched in 2015) delivers currently images at a spatial resolution of 300 m and a temporal resolution of 2.8 days, which will be increased once Sentinel-3B is launched.The micro-satellites launched by Planet acquire images daily and at a spatial resolution of 3.125 m.Due to these technical advances, the remote sensing community now has access to both dense time-series data and high spatial, spectral resolution images.Yet, there are studies which require access to fine temporal and spatial resolution images collected in the past.These data are vital for a wide range of applications, including urban expansion and deforestation monitoring, crop mapping and yield estimation, inundation and wetland mapping [9] or quantifying the magnitude and climate change type [10].Therefore, advanced methods to fuse high-temporal frequency and fine spatial-resolution sensors are required [11,12].
Several papers have reviewed available methods developed to increase the spatial and temporal resolution of remote sensing [13,14].Pohl et al. [1] summarized remote sensing image fusion solutions with a focus on pansharpening used to enhance spatial resolution.Zhan et al. [15] focused in their review paper on the methods dedicated to disaggregating land surface temperature, while Zhu et al. [16] discussed in details the taxonomy, concepts, and principles of spatiotemporal reflectance fusion methods.The overall goal of our paper is to review existing methods dedicated to enhancing the spatiotemporal resolution of satellite images with a focus on the ability of the described methods to account for gradual changes, such as changes in vegetation phenology occurring during the period of fused images.We consider this review timely, given the importance of high temporal and spatial resolution images for different applications dedicated to surface dynamics mapping at seasonal and annual temporal scales, and the diversity of sensors which acquire images at different spatiotemporal resolutions.Note that we did not review the applications where the reviewed spatiotemporal image fusion methods were applied.Interested readers can find more information on this topic in the recent review paper by Zhu et al. [16].Does the remote sensing community still need spatiotemporal data fusion methods nowadays, when the industry is constantly developing sensors capable of delivering sub-meters images on a daily basis?We argue that these methods represent a cost-effective solution to generate high-resolution images in space and time, thereby offering the possibility to leverage existing image archives and to efficiently use them for environmental, ecological and agricultural mapping and monitoring applications.
Different terms are used in the literature to refer to the methods used to increase the resolution of an image in spatial, temporal or spectral domains, namely, image fusion [6], spatial sharpening, downscaling [13], super-resolution or disaggregation.Image fusion is defined as the "combination of two or more different images to form a new image by using a certain algorithm" [6].Downscaling is used to increase the resolution of satellite images in the spatial domain [13].Downscaling, disaggregation and spatial sharpening terms are used interchangeably in the literature.We use the term spatiotemporal image fusion throughout this paper to refer to the methods for blending fine spatial resolution images with high temporal frequency images.

Fusion Methods to Increase Spatiotemporal Resolution of Satellite Images
Data fusion is a well-established research field [1][2][3][4].Image fusion methods are primarily used for improving the level of interpretability of the input data [17].Additionally, they can be utilized to address the problem of missing data caused by cloud or shadow contamination in satellite images time series [18][19][20].According to Schmitt and Zhu [21] "the main objective of data fusion is either to estimate the state of a target or object from multiple sensors, if it is not possible to carry out the estimate from one sensor or data type alone, or to improve the estimate of this target state by the exploitation of redundant and complementary information".The authors included in their study 12 definitions of data fusion from different application domains including computer science, information theory, tracking or surveillance [21].
Long revisit time is not suitable for seasonal vegetation phenology monitoring or rapid surface changes.Therefore, we need high-resolution images in both time and space.Commercial satellites offer images at fine spatial scale and high temporal resolution.Among these sensors, we can mention the micro-satellites launched by Planet Labs [34] which acquire daily images of the Earth with a spatial resolution of about 3 m, or RapidEye sensors which acquire images with 5 m spatial resolution every day.Yet, these images are too costly for many applications in areas as diverse as agriculture mapping, timely monitoring of natural hazards or mining and illegal deforestation activities monitoring [34,35].During last years, there have been different free sensors acquiring images at an increased spatial (e.g., Sentinel-2) and temporal resolution (Sentinel-3).These images could serve as a solution to the above-mentioned challenge.In addition, spatiotemporal image fusion methods, called also spatiotemporal downscaling methods [36], represent an efficient solution to generate images at a high temporal resolution [37] for more detailed land cover mapping and monitoring applications [38] and to improve the resolution of historical satellite images.
Since its launch in the early 1970s, and especially after allowing the public access to the enormous data archives [39], Landsat data products have been used in different climate, biodiversity, water or agriculture mapping studies.These data products have a great potential to accurately map land cover classes and to monitor land surface parameters.However, these data have a revisit time of only 16 days and therefore, their potential use for monitoring gradual changes such as changes in vegetation phenology or soils moisture, just to give a few examples, is rather reduced, especially in cloudy areas (e.g., tropical areas), where only a few cloud-free images per year are available.NASA's MODIS sensor, on the other hand, acquires data twice a day which makes them more suitable for various surface dynamics mapping.Therefore, to increase the temporal resolution of fine spatial resolution images, many spatiotemporal image fusion methods have been developed during the last years.These methods use spatial information from the fine spatial resolution images and temporal information from coarse resolution satellites images to generate high spatial-temporal images.Spatiotemporal image fusion methods apply several steps to generate high spatiotemporal images: (1) both coarse and fine-resolution satellite images Digital Numbers (DN) have to be atmospherically corrected; (2) the pair-images have to be geometrically corrected and (3), in the end, one of the existing spatiotemporal fusion methods is applied to generate images which are at an increased spatial and temporal resolution (Figure 1).When the application required, calculation of indices before spatiotemporal fusion is performed [40].
Remote Sens. 2018, 10, x FOR PEER REVIEW 3 of 21 low-frequency images with low resolution, high-frequency images [32] and (6) fusing microwave (passive) and microwave (active) sensors [33].Long revisit time is not suitable for seasonal vegetation phenology monitoring or rapid surface changes.Therefore, we need high-resolution images in both time and space.Commercial satellites offer images at fine spatial scale and high temporal resolution.Among these sensors, we can mention the micro-satellites launched by Planet Labs [34] which acquire daily images of the Earth with a spatial resolution of about 3 m, or RapidEye sensors which acquire images with 5 m spatial resolution every day.Yet, these images are too costly for many applications in areas as diverse as agriculture mapping, timely monitoring of natural hazards or mining and illegal deforestation activities monitoring [34,35].During last years, there have been different free sensors acquiring images at an increased spatial (e.g.Sentinel-2) and temporal resolution (Sentinel-3).These images could serve as a solution to the above-mentioned challenge.In addition, spatiotemporal image fusion methods, called also spatiotemporal downscaling methods [36], represent an efficient solution to generate images at a high temporal resolution [37] for more detailed land cover mapping and monitoring applications [38] and to improve the resolution of historical satellite images.
Since its launch in the early 1970s, and especially after allowing the public access to the enormous data archives [39], Landsat data products have been used in different climate, biodiversity, water or agriculture mapping studies.These data products have a great potential to accurately map land cover classes and to monitor land surface parameters.However, these data have a revisit time of only 16 days and therefore, their potential use for monitoring gradual changes such as changes in vegetation phenology or soils moisture, just to give a few examples, is rather reduced, especially in cloudy areas (e.g.tropical areas), where only a few cloud-free images per year are available.NASA's MODIS sensor, on the other hand, acquires data twice a day which makes them more suitable for various surface dynamics mapping.Therefore, to increase the temporal resolution of fine spatial resolution images, many spatiotemporal image fusion methods have been developed during the last years.These methods use spatial information from the fine spatial resolution images and temporal information from coarse resolution satellites images to generate high spatial-temporal images.Spatiotemporal image fusion methods apply several steps to generate high spatiotemporal images: (1) both coarse and fine-resolution satellite images Digital Numbers (DN) have to be atmospherically corrected; (2) the pair-images have to be geometrically corrected and (3), in the end, one of the existing spatiotemporal fusion methods is applied to generate images which are at an increased spatial and temporal resolution (Figure 1).When the application required, calculation of indices before spatiotemporal fusion is performed [40].According to Chen, et al. [41], spatiotemporal image fusion methods can be classified into three categories: (1) reconstruction-based; (2) unmixing based and (3) learning-based methods.The advantages and disadvantages of these methods are presented in [42].An overview of some of the According to Chen et al. [41], spatiotemporal image fusion methods can be classified into three categories: (1) reconstruction-based; (2) unmixing based and (3) learning-based methods.The advantages and disadvantages of these methods are presented in [42].An overview of some of the existing spatiotemporal image fusion methods are presented in Table 1.These methods have been successfully applied in different application domains including forest and crop monitoring, daily-field scale evapotranspiration etc. [10].
Table 1.Overview of some of the spatiotemporal image fusion methods developed for blending high spatial resolution images with high temporal resolution images.The dark-grey highlighted methods are those based on feature fusion level, whereas the remaining ones are based on pixel fusion level.

Spatiotemporal Fusion Model Categories
Gao et al. [ We evaluated how many times the above-listed methods were evaluated using the Web of Science database (Figure 2).This evaluation revealed that STARFM method, followed by ESTARFM and STAARCH are among the most popular spatiotemporal image fusion methods.
We evaluated how many times the above-listed methods were evaluated using the Web of Science database (Figure 2).This evaluation revealed that STARFM method, followed by ESTARFM and STAARCH are among the most popular spatiotemporal image fusion methods.1. Reported citations are recorded in the Web of Science database.

Reconstruction-Based Spatiotemporal Image Fusion Methods
Reconstruction-based spatiotemporal methods are also called filter-based methods [55] or weighted-function-based [56] and are used to generate synthetic spectral reflectance by means of the weighted sum of the neighboring similar pixels of the input image source.A widely-used reconstruction-based image fusion methods is the STARFM [32].This method generates synthetic high-resolution images (e.g. 30 m resolution) on a daily basis by employing a neighborhood weighting process.It assumes the existence of co-temporal pairs of fine spatial resolution and coarse spatial resolution images and, therefore, the quality of the fused time series is dependent on the number of observations from the high temporal resolution images set [57] and on the availability of cloud-free pair images of the matching dates [46].When no-matching dates images are found, the method starts searching for the closest image in the temporal domain to predict the value in the fine resolution output image.STARFM involves four main steps.First, coarse resolution images are coregistered and resampled to the resolution of the high spatial resolution images, e.g. 30 m for Landsat 8.In the next step, a moving window (w) is applied to identify similar pixels in the fine resolution images.In the third step, a weight is assigned to the homogeneous pixels based on the following criteria: (1) the spectral difference between the surface reflectance of the images pair; (2) the temporal differences in the dates of the coarse resolution, high frequency images (date of the pair-images and the prediction date); (3) the Euclidean distance between the neighbor and the central pixel.In the last step, the surface reflectance of the central pixel is calculated based on the following equation: where ( ,  ) represents the central pixel of the moving window (w) to be predicted for the fine resolution images at time  , P is the total number of pixels in FR and CR images, N is the total  1.
Reported citations are recorded in the Web of Science database.

Reconstruction-Based Spatiotemporal Image Fusion Methods
Reconstruction-based spatiotemporal methods are also called filter-based methods [55] or weighted-function-based [56] and are used to generate synthetic spectral reflectance by means of the weighted sum of the neighboring similar pixels of the input image source.A widely-used reconstruction-based image fusion methods is the STARFM [32].This method generates synthetic high-resolution images (e.g., 30 m resolution) on a daily basis by employing a neighborhood weighting process.It assumes the existence of co-temporal pairs of fine spatial resolution and coarse spatial resolution images and, therefore, the quality of the fused time series is dependent on the number of observations from the high temporal resolution images set [57] and on the availability of cloud-free pair images of the matching dates [46].When no-matching dates images are found, the method starts searching for the closest image in the temporal domain to predict the value in the fine resolution output image.STARFM involves four main steps.First, coarse resolution images are co-registered and resampled to the resolution of the high spatial resolution images, e.g., 30 m for Landsat 8.In the next step, a moving window (w) is applied to identify similar pixels in the fine resolution images.In the third step, a weight is assigned to the homogeneous pixels based on the following criteria: (1) the spectral difference between the surface reflectance of the images pair; (2) the temporal differences in the dates of the coarse resolution, high frequency images (date of the pair-images and the prediction date); (3) the Euclidean distance between the neighbor and the central pixel.In the last step, the surface reflectance of the central pixel is calculated based on the following equation: where FR w 2 , w 2 t pr represents the central pixel of the moving window (w) to be predicted for the fine resolution images at time t pr , P is the total number of pixels in FR and CR images, N is the total number of pixels within the defined moving window, CR x i , y i , t pr represents the pixels values of the coarse resolution data on the prediction date, FR(x i , y i , t 0 ) and CR(x i , y i , t 0 ) represent the pixel values of the base pair input images and W ik represents the weight assigned to the similar neighboring pixels.
Zhu et al. [44] proposed the extension of the STARFM method [32] and developed an enhanced spatial and temporal adaptive reflectance fusion method which proved to successfully predict fine resolution reflectance, especially in complex and heterogeneous landscape.The method requires two or more pairs of fine-coarse resolution images collected on the same day and a series of time series coarse resolution data for prediction dates.Emelyanova et al. [58] found out that this new method performed better than the method proposed by Gao et al. [32] in areas with spatial variance dominance, but is less successful in areas where the temporal variance was dominant.
These methods were successfully tested for fusing Landsat-MODIS images [36].A more generic spatiotemporal image fusion method, in terms of the used input images, was developed by Luo et al. [46].The method firstly applies a data gap-filling procedure, followed by an interpolation model for capturing the spatial information available in the image with the highest spatial resolution.The relationship between the fine resolution and coarse pixels is modeled using the following formula: where x, y represents the aligned fine resolution (FR) and coarse resolution (CR) pixels, T j is the acquisition date for the pair-images and ε(x, y, t i ) represents the differences between the pixels of the two images (i.e., errors) caused by e.g., viewing angle geometry.The prediction of the fine resolution pixels values (FR) at t pr is performed as follows: where ∆ x, y, t p represents the difference between the spectral reflectance values of the two input pair images at location x, y and date t pr .Two types of temporal changes need to be considered when developing spatiotemporal fusion methods, namely the seasonal change of vegetation, i.e., vegetation phenology, and the land cover change, i.e., deforestation.While the first temporal change is successfully considered by the method proposed by Gao et al. [32] and by Luo et al. [46], the second temporal change required further developments.
Hilker et al. [43] proposed a spatial temporal adaptive method for detecting reflectance changes associated with land cover change and disturbance.This image fusion method uses high spatial resolution tasseled cap transformation and high frequency of tasseled cap transformation to identify changes in reflectance.The developed method is simple and intuitive and requires at least two fine resolution images representing the beginning and the end of the user-defined time interval.These two images are used to identify the changes in the data.The changes are calculated using the so-called Disturbance Index (DI) which relies on three tasseled cap indices, namely brightness, greenness, and wetness.
This index identifies the changes (or disturbances) in the coarse resolution pixel values between two consecutive dates.The presence of a disturbance event in the study area influences which pair images are used for predicting the reflectance value of the fine resolution images.For example, if the disturbance occurs after the prediction date, then the first fine-course resolution pair-images is used.If the pixel to be predicted lies after the occurrence of the disturbance, then the last pair-images is used.While this method is one of the initial image fusion efforts which consider landscape changes [7], there are studies which reported that it is suited only for forest disturbances scenario as the method selects the optimal fine resolution image date for prediction based on forest disturbance date detection from coarse resolution images.Therefore, this method is of limited value if when other types of land cover changes occur in the investigated areas.
Zhao et al. [47] developed a method called robust adaptive spatial and temporal fusion method, which is capable to consider not only gradual changes, i.e., shape change (e.g., crop rotation), but also abrupt changes such as urban sprawl.To do this, the authors extended the spatiotemporal image fusion method of Gao et al. [32] by replacing the Inverse Distance Weighted (IDW) with a local linear regression model to calculate the weights assigned to the pixels similar to the central pixel.To do this, the method (which uses only one prior pair of fine-coarse resolution images) follows three steps.First, it searches for similar neighboring pixels in the searching moving window as: where c represents the central pixel, s represents the similar pixels, ||P (c,t0) − P (s,t0) ||2a 2 is a Gaussian kernel and Z(c) is a normalizing constant: Second, it calculates the similar neighbor's weights based on the spectral differences between the pixels in the fine resolution and coarse resolution images, the temporal differences between the input and prediction dates of the fine resolution images and the spectral difference between the central pixels and neighboring pixels.
The third step consists of predicting the spectral reflectance of the fine resolution pixels (FR) as: where p (s,t k ) and p (s,t 0 ) represent the coarse resolution images pixels that spatially correspond to high resolution pixels FR (s,t 0 ) .As presented above, these methods are very efficient for spatiotemporal image fusion in relatively homogeneous landscape and when the input images have the same input spectral values.Nevertheless, there is a need for spatiotemporal fusion methods capable to use different images as input, i.e., images with different spectral values (such as learning-based methods) and methods capable to be applied in a heterogeneous landscape where pixels from the coarser satellite images are mixed (such as unmixing-based methods).

Learning-Based Spatiotemporal Image Fusion Methods
Learning-based methods use machine learning to predict finer temporal resolution images from coarse spatial resolution images [56].Compared to reconstruction-based and unmixing-based methods which allows spatiotemporal fusion of images with unified spectral values, learning-based methods allows fusion between images with different spectral values.One of the first studies dedicated to using sparse representation method into data fusion is presented by Huang et al. [7].Sparse representation methods learn the differences between fine spatial resolution images and high temporal coverage images [59] by making use of a dictionary created from the image patches generated from the two image types.Similar to reconstruction-based methods, important research efforts were dedicated to developing a learning-based method that considers the phenology of vegetation and other disturbances caused by land cover changes that might occur before the prediction date.In this context, Huang et al. [7] developed an image fusion method which accounts for both vegetation phenology and land cover changes occurring over the observation period of the fused images.The original implementation of the method relies on fine-coarse resolution image pairs before and after prediction date, and one coarse resolution image at prediction date.Later, the authors extended their method to consider one fine-coarse resolution image-pairs instead of prior and posterior pairs of images.The new spatiotemporal image fusion method is called SP-One [49].
Chen et al. [41] proposed a hierarchical spatiotemporal adaptive fusion method capable of predicting temporal changes such as seasonal phenology and land-cover changes using only one image pair (one prior or posterior image pair).The authors compared the results obtained by this method with those obtained by the spatiotemporal image fusion methods proposed by [12,32,49] and concluded that their method performed better especially in capturing land cover changes in the predicted fine resolution reflectance images.Kwan et al. [60] proposed a spatiotemporal image fusion method which relies on learning the mapping between MODIS images (or between overlapping or non-overlapping patches from the two images obtained by k-means classifier) and is applied to an earlier acquired Landsat image to predict an image at a later time.The method can be easily adapted to new multisource images.However, it performed worse than other spatiotemporal image fusion methods when applied in a heterogeneous landscape, where the pixels from the coarser satellite images are not pure.Therefore, a new category of spatiotemporal image fusion methods can be used to address this challenge, namely those based on spectral unmixing models [61].

Unmixing-Based Spatiotemporal Image Fusion Methods
The need to fuse images from very heterogeneous environments has been systematically addressed by the unmixing-based spatiotemporal image fusion methods.Spectral unmixing methods rely on the linear spectral mixture to extract endmembers and abundances, i.e., proportion, at the sub-pixel level [36].The number of endmembers and abundances is obtained from a high-resolution data set, and the spectral signature of the endmembers is unmixed from the coarse resolution images.Linear mixed methods assume that "the reflectance of each coarse spatial resolution pixel is a linear combination of the responses of each land cover class contributing to the mixture" [52].Thus, spectral reflectance of coarse resolution pixels CR(i, t i ) consisting of n land cover classes is weighted by classes abundance as follows [52]: where fc(i,c) is the abundance of land cover class c in coarse pixel i; r f (c,t i ) is the mean reflectance of pixels from the fine resolution images belonging to land cover class c at time t i ; and ε(i,t i ) is the residual.Unmixing-based methods usually start with the classification of the image with high spatial resolution using unsupervised methods such as k-means (or fuzzy k-means), followed by the spectral unmixing of the image with high temporal frequency by making use of the classification information obtained during the first step [36,61].Alternatively, up-to-date land cover/land use maps can be used to identify the endmembers as shown by Zurita-Milla et al. [53].A comparison between unmixing-based image fusion and reconstruction-based methods is provided by Gevaert et al. [36].The authors also implemented a Bayesian unmixing-based fusion method to downscale coarse resolution images to the spatial resolution of fine resolution images using one base fine-coarse resolution image pair.This method outperformed those proposed by [32] when fewer input fine resolution images area used.
Given the landscape heterogeneity and complexity caused by different geomorphological conditions or anthropogenic activities, land cover change is expected during the period of blended images [62].To address this problem, [50] proposed a spatiotemporal image fusion method based on unmixing and using two or more image pairs.The authors assumed that the spectral information of pixels belonging to the same class have the same temporal variation.This method has been successfully used for high-resolution (i.e., 30 m) leaf area index estimation [63], for generating daily synthetic Landsat imagery [52] or for land surface temperature [64].Since it neglects the differences among different sensors and thus the window size is fixed, Wu et al. [52] introduced a modified spatial and temporal data fusion method by including adaptive window size and moving steps selection for disaggregating coarse pixels.The method requires fine resolution images acquired at the beginning and end of the observation period, coarse resolution reflectance data acquired on the same date as fine resolution data and land cover data to predict daily fine resolution images.
Huang et al. [51] described an image fusion method capable to account for both phenological and land-cover changes.The method requires two pairs of fine-coarse resolution images at the beginning and end of the observation period and a coarse resolution image for the prediction date.Proposed data fusion is sensitive to the scale parameter used to delineate homogeneous change regions from fine resolution images through segmentation.It proved, however, to perform better than reconstruction-based [32] and other spectral unmixing based methods [44], mainly because it relies on neighboring spatial information for blending the images, whereas the other evaluated methods consider linear land cover change during the observation period.Additional unmixing-based methods dedicated to fusing Landsat and Medium Resolution Imaging Spectrometer (MERIS) are described by [53,65].Zurita-Milla et al. [53] described a linear mixing method to downscale MERIS data from 300 m to 25 m resolution using the Dutch land use database to derive the fractional composition of input pixels.
Besides the above presented methods, we also have the so called hybrid methods that rely on different technologies to perform the fusion.Zhu et al. [12], for example, developed a spatiotemporal image fusion method capable to predicting pixel values at fine resolution in challenging situation such as heterogeneous areas and where land cover change occurs during the period between the input and prediction dates.First, this method assumes that all pixels of coarse resolution images are mixed pixels.The reflectance of these pixels can be described using a linear mixture model: where C m and C n represent the reflectance of the mixing pixel at date t m and t n , f i is the fraction of the i-th land cover, F im and F in represent the reflectance of the i-th land cover at date, and t m , t n , a and b represent the coefficients of the linear regression model developed for relative calibration between fine resolution and coarse resolution images.Second, the changes of coarse-resolution reflectance from t m and t n is calculated as follows: Similar to Zhu et al. [12], Zhang et al.
[54] developed a method for fusing coarse-spatial, fine-temporal and fine-spatial, coarse-temporal images that assumes that surface reflectance values of coarse resolution pixels are mixed.The method relies on predicting the fraction map of fine resolution images from the available coarse resolution fraction maps by making use of images acquired before and after the prediction dates.The fraction maps can be obtained using any available spectral unmixing model such as a linear spectral mixture model or multiple endmember spectral mixture analysis model.

Synthesis: Challenges and Opportunities
The spatiotemporal image fusion methods reviewed in this paper generate images at a higher temporal resolution which can be further used for land surface monitoring applications, especially in cloudy areas (e.g., tropical areas), where only a few cloud-free images per year are available.Leckie [66] reported, for example, that there is a 10% probability of acquiring cloud-free images such as Landsat for a certain time interval.The problem of data gaps caused by the presence of clouds can be addressed by combining microwave images with optical images similar to the approach proposed by Mizuochi et al. [27], who proposed fusing Advanced Microwave Scanning Radiometer (AMSR) series with MODIS images followed by MODIS-Landsat fusion.
Developed spatiotemporal image fusion methods have been successfully used to fuse optical images collected by different remote sensing platforms.Wu et al. [67] reconstructed daily 30 m remote sensing data from Huanjing satellite constellation, Gaofen satellite, Landsat, and MODIS data using the spatiotemporal methods developed by Wu et al. [50].Generated remote sensing data are efficient for extracting vegetation phenology and for mapping crops with an overall accuracy higher than those obtained from multi-temporal Landsat NDVI data.Quan et al. [68] proposed a method to fuse Landsat, MODIS and geostationary satellites to 100 m resolution and one-hour interval.This method was compared with those developed by Gao et al. [32], Zhu et al. [44], and Wu et al. [69] and proved to perform better over heterogeneous landscapes and changing land cover types.Kwan et al. [70] evaluated how existing spatiotemporal image fusion methods perform for fusing Planet and WorldView images scenarios, and emphasized the importance of reducing the magnitude differences in the reflectance values between the two input sensors products and of aligning them to avoid misregistration errors.Wang et al. [71] fused MSI and OLI sensors for a study dedicated to land cover/land use mapping and proved that land cover change accuracy increases when Landsat-8 panchromatic band is used in the image fusion task.This is one of the few studies which fuses the two sensors products.

Other Advanced Methods for Spatiotemporal Image Fusion
Deep learning has gained the attention of the remote sensing community in the last years for various image understanding and image classification problems [72].It has for example been successfully used for pan-sharpening [73][74][75][76], feature and decision-level fusion [77,78] and spatial-spectral image fusion [79][80][81].An overview of deep learning for data fusion is provided by Liu et al. [82] and Audebert et al. [30].Despite its proven efficiency in different image fusion scenarios, this method has rarely been used for spatiotemporal image fusion.Song et al. [83] proposed a CNN model for fusing Landsat and MODIS data by considering both spatial heterogeneity of the landscape and temporal changes occurring during the observation period.Our remote sensing community could further benefit from advances in deep learning occurring in computer vision where efficient convolutional networks for learning spatiotemporal features from video dataset have been successfully developed [84].
Bayesian methods have been successfully applied for fusing images in the spatial and spectral domain [85].Despite their ability to handle uncertainties in input images, a rather reduced number of Bayesian methods for spatiotemporal fusion of satellite images have been developed [86,87].
Another example of spatiotemporal image fusion methods includes those based on physical models.Roy et al. [88], for example, proposed a semi-physical fusion method that uses MODIS Bi-directional Reflectance Distribution Function (BRDF)/Albedo to predict BRDF at the spatial resolution of the ETM+ images.

Increasing the Resolution of Various Satellite-Derived Data Products
Important research efforts are dedicated to increasing the resolution of satellite-derived products such as NDVI, land surface temperature, evapotranspiration, and precipitation.In order to obtain a better resolution of these data products, several studies used the remote sensing data whose spatiotemporal resolution has been improved by means of one of the methods presented in this paper.
A large number of these studies are dedicated to land surface temperature data, mainly because these data are important in a wide range of environmental modeling applications from local to global scales [89,90].There are basically two main methods used to increase the coarse resolution of land surface temperature, namely the methods which rely on physical models and those using statistical models [15] such as the linear regression models [91,92], co-kriging model [93] or random forest (RF) regression [89,94].Yang et al. [94] used a random forest model to downscale MODIS based land surface temperature in arid regions from 1 km to 500 m.Yang et al. [95] proposed a disaggregation method for subpixel temperature using the remote sensing endmember index-based method.ASTER visible and near-infrared bands and shortwave bands at 30 m spatial resolution were used in combination with the 990 m resolution MODIS land surface temperature data.Merlin et al. [96] disaggregated MODIS surface temperature at 100 m by considering the temperature difference between photosynthetically and non-photosynthetically active vegetation.For this study, the authors used Formosat-2 data due to their high spatial resolution, i.e., 8 m, and high temporal resolution, i.e., one image per day.Wu et al. [64] applied the spatiotemporal data fusion methods developed by [32,44,50] to generate high temporal and high spatial resolution land surface temperature product by combining ASTER and MODIS data products.The authors concluded that the quality of the generated land surface temperature products increased by using high spatiotemporal resolution satellite images.
A region-based and pixel-based disaggregation method is proposed by Alidoost et al. [97] to improve the resolution of evapotranspiration data from MODIS images from 1 km to 250 m and further to 30 m resolution.Liu et al. [98] proposed the extension of the traditional cokriging method from the spatial domain to spatiotemporal domain capable to account for spatiotemporal structures of the input images.The method was tested for fusing MODIS NDVI images available at 250 m with ETM+ 30 m NDVI images.Hwang et al. [99] fused multi-temporal MODIS and Landsat data together with topographic information for a better estimation of biophysical parameters over complex terrain.Data were validated by making use of a ground-based continuous fraction of absorbed photosynthetically active radiation and leaf area index measurements.
Several studies were dedicated to increasing the spatial resolution of soil moisture through disaggregation methods [100,101].Jia et al. [102] disaggregated tropical Rainfall Measuring Mission dataset using digital elevation model data and SPOT satellite images The results were validated by using in-situ data from different local stations present in the study area, i.e., Qaidam basin.Duan and Bastiaanssen [103] disaggregated the same data product on the basis on a rather limited number of rain gauge data sets.The authors generated an improved monthly pixel-based precipitation with a special resolution of 1 km.

Methods to Increase Spatiotemporal-Spectral Resolution of Images
Most image fusion methods described in this paper are dedicated to increasing the spatiotemporal resolution of input images.Only a few methods enable the fusion of spatiotemporal and spectral information [55,104,105].Huang et al. [104] described a Bayesian method to generate synthetic satellite images with high spectral, temporal and spectral resolution.Meng et al. [105] developed a unified framework for spatiotemporal-spectral fusion based on maximum posteriori theory.The framework was successfully tested on QuickBird, ETM+, MODIS, Hyperspectral Digital Imagery Collection Experiment (HYDICE) and SPOT-5 images [55].

Quality Assessment of Spatiotemporal Blended Images
Validation of the generated image fusion products is performed either visually, i.e., using a qualitative assessment [106] or by employing a quantitative metric [1,107].Among these metrics we can refer to spectral angle mapper [108], peak signal-to-noise ratio [109], structural similarity index used to assess the spatial distortion of the fused image [12], image quality index proposed by [110] and its vector extension [111], absolute difference [60,112], root mean squared error (RMSE) [112], cross-correlation [112] and Erreur Relative Globale Adimensionnelle de Synthese [113].The performance of the developed data fusion methods is evaluated by either using real data or synthetic data [32,51,60].There are studies which evaluated the quality of the fused images by using evaluation metrics which do not require reference data [114,115].The most used quantitative evaluation metrics available in the literature are presented in Table 2.
Currently, there is no agreement on which spatiotemporal image fusion method performs best for blending fine spatial resolution images with high temporal coverage images [42].Wu et al. [64] compared the methods developed by [32,44,50] for generating high temporal and spatial resolution LST product from ASTER and MODIS LST data in different landscape areas and concluded that all methods perform satisfactorily especially in desert areas.Furthermore, the spatiotemporal data method developed by Wu et al. [50] is capable to deal with noises in the data much better than the other two evaluated methods.Chen et al. [42] compared several image fusion methods and concluded that those relying on reconstruction-based concepts and theories are more stable than the learning-based methods.

Data Fusion Performance Metrics Authors
Spectral angle mapper (SAM) Yuhas et al. [108] Peak Signal-to-noise-ratio (SNR) Sheikh et al. [116] Structural Similarity Index (SSIM) Wang et al. [117] Image quality index Wang et al. [110] Extended image quality index Alparone et al. [111] Quality w/no reference index Alparone et al. [114] Enhancement measure evaluation (EME) Agaian et al. [115] Entropy Tsai et al. [118] Erreur Relative Globale Adimensionnelle de Synthese (ERGAS) Wald [113] Kwan et al. [60] argued that none of the available image fusion methods perform well under all conditions of remote sensing applications.To test this hypothesis, we need a ready to use and open access library of the available image fusion methods [70] that would allow us to compare them across different landscapes [42] and agroecological regions, taking into account their sensitivity to noises in the data or to spatial and temporal variances [58].Additionally, a hybrid image fusion framework can serve as a viable solution to combine methods that work well for heterogeneous landscape with those which perform well under homogeneous landscape conditions.
Uncertainty analysis of the predicted/fused spatiotemporal images has been neglected by many fusion methods presented in this paper.Recently, Wang and Huang [119] proposed a spatiotemporal fusion method based on the geostatistical ordinary kriging method that allows the estimation of the prediction uncertainty.The method proposed by Zhong and Zhou [120] enables not only the estimation of the uncertainty of the predicted image, but it accounts also for the uncertainties of the input images used for the fusion purpose.

Spatiotemporal Image Fusion Methods for Sentinel Images
The new program of the European Space Agency (ESA), namely the Sentinel missions, gained the attention of the remote sensing community due to the increasing spatial, spectral and temporal resolution [121].In the perspective of combined use of Sentinel-2 and Landsat 8, there are several differences to be considered.For example, in a previous study [122] we found out relatively high discrepancies between Normalized Difference Vegetation Indeed (NDVI) computed from Sentinel-2 and Landsat-8.We concluded in that work that, in order to take advantage of the 10 m resolution of Sentinel-2 and use these data along with Landsat-8 data, it would be desirable to adjust the reflectance values of the two sensors.A useful method for this purpose could be the one presented by Flood [123] or co-registration techniques such as phase correlation [124].Besides differences in spectral values, previous studies also reported a misalignment of several pixels between Landsat 8 and Sentinel-2 [125].Therefore, images coming from the two sensors need to be co-registered before any concurrent use.Further information on Sentinel-2 MSI and Landsat-8 OLI sensors characteristics is provided by Zhang et al. [126].
Wang et al. [48] described an advanced method to fuse MSI onboard Sentinel-2 platform and OLCI onboard Sentinel-3 platform which relies on regression model fitting to relate spectral reflectance from two acquisition times and spatial filtering to remove the artifacts in the regression model fitting prediction.Their proposed method accounts successfully for the temporal land cover changes when creating nearly daily Sentinel-2 images.Furthermore, it relies on a single Sentinel-3/Sentinel 2 images pair.Besides these studies dedicated to Sentinel missions, important research has been dedicated to developing more generic methods which can be used to fuse different sensors products.An example of such a method is those developed by Luo et al. [46].
It is expected that a large number of spatiotemporal image fusion methods will be developed in the future in order to leverage the available satellite data archive and to increase the spatial resolution of the latest satellite images such as those acquired by microsatellites (e.g., planet images).
3.6.Important Data Pre-processing Issues to be Considered when Fusing Spatiotemporal Images When developing new spatiotemporal image fusion methods, the following issues need to be considered [46,88]: • Spectral responses of input images have to be unified: Reconstruction and unmixing spatiotemporal image fusion methods assume that input images have similar spectral information.Therefore, their application is limited, given that the sensors might have different wavelength.When blending information from different remote sensing data sources, we have to spectrally normalize the input sensors to common wavebands [70].According to Pinty et al. [127], the absence of similar wavelength has a low impact on the fusion results when physically-based reflectance methods are used for blending surface reflectance of the input images.Machine learning based spatiotemporal image fusion methods, on the other hand, are less sensitive to similarity between spectral responses of the input images.

•
Co-registration of multi-source input images: Multi-source images alignment is a very important issue to be considered when fusing them.For example, reported misalignments between Landsat and Sentinel-2 by several pixels need to be carefully addressed when fusing the two input images [125].Further investigation in the development of automatic solutions for images alignments is highly required [70].

•
Atmospheric corrections: Radiometric consistency of the multi-source images to be fused might vary because of the presence of clouds and haze, or because of the differences in the illumination and acquisition angles [88].Therefore, input images have to be radiometrically corrected before fusing them [70] using one of the available existing radiometric corrections techniques such as MODerate spectral resolution TRANsmittance code (MODTRAN) [128].These techniques can be grouped into two categories, namely absolute and relative techniques.Absolute techniques require information on the sensor spectral profile for sensor calibration and corrections of images for atmospheric effects [129].Relative radiometric techniques involve either the selection of landscape elements whose reflectance remain constant over time [130,131] or normalization using regression [132,133].

Future Directions
Image fusion methods need to be flexible enough to accomplish different fusion tasks under different environmental conditions [55,104,105].While most of the presented spatiotemporal image fusion methods are capable enough to capture reflectance changes which are caused by vegetation dynamics through time, i.e., vegetation phenology, not all of them account for sudden land cover changes (e.g., deforestation, flooding events) that might occur during investigated observational time [47].More efforts are, therefore, required to capture accurate temporal changes when fusing high spatial resolution and high frequent temporal coverage from different remotely sensed data [43,51].By considering temporal changes, spatiotemporal fusion methods can be successfully used for mapping and monitoring applications in areas with rapid land cover changes [41,55].
Given the increasing number of new sensors which acquire images at fine spatial and spectral resolution, we consider that spatiotemporal image fusion methods should be extended to fuse multi-sensor images, i.e., more than two sensor types data over a specific observation period.Indeed, this is a challenging task given the diversity of images to be fused in terms of spectral, spatial, temporal and radiometric resolution.In this context, Dynamic Time Warping could offer a flexible framework for performing spatiotemporal fusion of different images accurately.This is due to the ability of this method to predict the surface reflectance pixel values by accounting for the phenological changes of vegetation and by the presence of clouds which contaminate the pixels values.
Another important issue to be considered in the future is the computational time.As an increasing amount of data to be fused is available, the fusion methods have to be capable to scale up to regional and global scales [46].They can be included into operational applications for various image fusion tasks by making use of advanced cloud computing technologies such as Google Earth Engine [134].This will allow almost near-real time fusion of images, a task that is of paramount importance for disaster management activities.

Conclusions
Three categories of spatiotemporal image fusion methods have been discussed in this paper, namely reconstruction-based, learning-based and unmixing based methods.The most popular method for blending fine-spatial resolution with high-temporal resolution images is the reconstruction-based STARFM.Recently, new methods have been developed to fuse images acquired by other sensor products, e.g., Sentinel-2, Sentinel-3 or images acquired by microsatellites.Also, many reviewed methods account for land cover changes occurring during the observation period.The problem of cloud obscuration can be solved by fusing microwave data with optical images.
Future efforts are required for generating spatiotemporal image fusion solutions which are: (1) generic enough to consider various sensors characteristics; (2) computationally efficient to be able to scale up to regional and global level; (3) robust to temporal and spatial variations specific to landscape of a different heterogeneity and complexity caused by different physio-geographical conditions, soil or land management practices, (4) flexible enough to consider phenological dynamics of vegetation or land cover changes caused by both external factors, such as natural hazards and anthropogenic activities, i.e., urban sprawl, and (5) testing and implementation of more sophisticated machine learning methods such as deep learning.It is unlikely that there will be a single optimal spatiotemporal method capable to address data blending needs of various remote sensing applications [70].Therefore, an efficient and operational framework for benchmarking existing spatiotemporal image fusion methods similar to those developed by IEEE GRSS Data Fusion Contest is of paramount importance to help remote sensing community assess the performance of the spatiotemporal image fusion methods and the quality of the resulting output images.

Figure 1 .
Figure 1.Theoretical workflow of spatiotemporal image fusion.DN stands for Digital Number.

Figure 1 .
Figure 1.Theoretical workflow of spatiotemporal image fusion.DN stands for Digital Number.

Figure 2 .
Figure 2. Sum of citations per year for each spatiotemporal image fusion method included in Table 1.Reported citations are recorded in the Web of Science database.

Figure 2 .
Figure 2. Sum of citations per year for each spatiotemporal image fusion method included in Table 1.Reported citations are recorded in the Web of Science database.

Table 2 .
Most common metrics used to evaluate the accuracy of the spatiotemporal fused product.